{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"[Home Page](../START_HERE.ipynb)\n",
"\n",
"[Previous Notebook](03-Cudf_Exercise.ipynb)\n",
" \n",
" \n",
" \n",
" \n",
"[1](01-Intro_to_cuDF.ipynb)\n",
"[2](02-Intro_to_cuDF_UDFs.ipynb)\n",
"[3](03-Cudf_Exercise.ipynb)\n",
"[4]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Applying CuDF: The Solution\n",
"\n",
"Welcome to fourth cuDF tutorial notebook! This is a practical example that utilizes cuDF and cuPy, geared primarily for new users. The purpose of this tutorial is to introduce new users to a data science processing pipeline using RAPIDS on real life datasets. We will be working on a data science problem: US Accidents Prediction. This is a countrywide car accident dataset, which covers 49 states of the USA. The accident data are collected from February 2016 to June 2020, using two APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road-networks. Currently, there are about 3.5 million accident records in this dataset. \n",
"\n",
"\n",
"## What should I do?\n",
"\n",
"Given below is a complete data science preprocessing pipeline for the dataset using Pandas and Numpy libraries. Using the methods and techniques from the previous notebooks, you have to convert this pipeline to a a RAPIDS implementation, using CuDF and CuPy. Don't forget to time your code cells and compare the performance with this original code, to understand why we are using RAPIDS. If you get stuck in the middle, feel free to refer to this sample solution. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Here is the list of exercises in the lab where you need to modify code:\n",
"- Exercise 1
Loading the dataset from a csv file and store in a CuDF dataframe.\n",
"- Exercise 2
Creating kernel functions to run the given function optimally on a GPU.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first step is downloading the dataset and putting it in the data directory, for using in this tutorial.\n",
"Download the dataset here, and place it in (host/data) folder. Now we will import the necessary libraries."
]
},
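{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to fetch the dataset from the command line, the optional sketch below uses the Kaggle CLI; it assumes you have the `kaggle` package installed and an API token configured. Otherwise, download the file manually from the link above and place it in the data folder yourself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: fetch the dataset with the Kaggle CLI.\n",
"# Assumes the kaggle package is installed and ~/.kaggle/kaggle.json is configured.\n",
"# !kaggle datasets download -d sobhanmoosavi/us-accidents -p ../../data\n",
"# !unzip -o ../../data/us-accidents.zip -d ../../data"
]
},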
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import cudf\n",
"import numpy as np\n",
"import cupy as cp\n",
"import math\n",
"np.random.seed(12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"First we need to load the dataset from the csv into CuDF dataframes, for the preprocessing steps. If you need help, refer to the Getting Data In and Out module from this [notebook](01-Intro_to_cuDF.ipynb/)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 547 ms, sys: 225 ms, total: 772 ms\n",
"Wall time: 773 ms\n",
" ID Source TMC Severity Start_Time \\\n",
"0 A-1 MapQuest 201.0 3 2016-02-08 05:46:00 \n",
"1 A-2 MapQuest 201.0 2 2016-02-08 06:07:59 \n",
"2 A-3 MapQuest 201.0 2 2016-02-08 06:49:27 \n",
"3 A-4 MapQuest 201.0 3 2016-02-08 07:23:34 \n",
"4 A-5 MapQuest 201.0 2 2016-02-08 07:39:07 \n",
"... ... ... ... ... ... \n",
"2916679 A-2916810 Bing 1 2020-04-15 15:50:00 \n",
"2916680 A-2916811 Bing 1 2020-04-15 15:05:44 \n",
"2916681 A-2916812 Bing 2 2020-04-15 16:06:28 \n",
"2916682 A-2916813 Bing 1 2020-04-15 15:27:40 \n",
"2916683 A-2916814 Bing 2 2020-04-15 15:25:20 \n",
"\n",
" End_Time Start_Lat Start_Lng End_Lat End_Lng \\\n",
"0 2016-02-08 11:00:00 39.865147 -84.058723 \n",
"1 2016-02-08 06:37:59 39.928059 -82.831184 \n",
"2 2016-02-08 07:19:27 39.063148 -84.032608 \n",
"3 2016-02-08 07:53:34 39.747753 -84.205582 \n",
"4 2016-02-08 08:09:07 39.627781 -84.188354 \n",
"... ... ... ... ... ... \n",
"2916679 2020-04-15 16:35:00 40.565540 -112.295990 40.56554 -112.29599 \n",
"2916680 2020-04-15 15:50:44 40.653760 -111.910300 40.65376 -111.9103 \n",
"2916681 2020-04-15 16:36:28 39.740420 -105.017800 39.74042 -105.0178 \n",
"2916682 2020-04-15 16:12:40 40.183780 -111.646710 40.18378 -111.64671 \n",
"2916683 2020-04-15 16:17:20 40.183780 -111.646710 40.17617 -111.64677 \n",
"\n",
" ... Roundabout Station Stop Traffic_Calming Traffic_Signal \\\n",
"0 ... False False False False False \n",
"1 ... False False False False False \n",
"2 ... False False False False True \n",
"3 ... False False False False False \n",
"4 ... False False False False True \n",
"... ... ... ... ... ... ... \n",
"2916679 ... False False False False True \n",
"2916680 ... False False False False True \n",
"2916681 ... False False False False False \n",
"2916682 ... False False False False False \n",
"2916683 ... \n",
"\n",
" Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight \\\n",
"0 False Night Night Night \n",
"1 False Night Night Night \n",
"2 False Night Night Day \n",
"3 False Night Day Day \n",
"4 False Day Day Day \n",
"... ... ... ... ... \n",
"2916679 False Day Day Day \n",
"2916680 False Day Day Day \n",
"2916681 False Day Day Day \n",
"2916682 False Day Day Day \n",
"2916683 \n",
"\n",
" Astronomical_Twilight \n",
"0 Night \n",
"1 Day \n",
"2 Day \n",
"3 Day \n",
"4 Day \n",
"... ... \n",
"2916679 Day \n",
"2916680 Day \n",
"2916681 Day \n",
"2916682 Day \n",
"2916683 \n",
"\n",
"[2916684 rows x 49 columns]\n"
]
}
],
"source": [
"%time df = cudf.read_csv('../../data/data.csv')\n",
"print(df)"
]
},
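{
"cell_type": "markdown",
"metadata": {},
"source": [
"To appreciate the speedup, you can time the equivalent CPU load with pandas. This is an optional comparison sketch; exact timings depend on your hardware, but reading a file of this size on the CPU is typically many times slower than the `cudf.read_csv` call above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional comparison sketch: the same load on the CPU with pandas.\n",
"import pandas as pd\n",
"%time pdf = pd.read_csv('../../data/data.csv')\n",
"del pdf  # free the host memory once the timing has been observed"
]
},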
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"First we will analyse the data and observe patterns that can help us process the data better for feeding to the machine learning algorithms in the future. By using the describe, we will generate the descriptive statistics for all the columns. Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TMC | \n",
" Severity | \n",
" Start_Lat | \n",
" Start_Lng | \n",
" End_Lat | \n",
" End_Lng | \n",
" Distance(mi) | \n",
" Number | \n",
" Temperature(F) | \n",
" Wind_Chill(F) | \n",
" Humidity(%) | \n",
" Pressure(in) | \n",
" Visibility(mi) | \n",
" Wind_Speed(mph) | \n",
" Precipitation(in) | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 2.478818e+06 | \n",
" 2.916684e+06 | \n",
" 2.916684e+06 | \n",
" 2.916684e+06 | \n",
" 437866.000000 | \n",
" 437866.000000 | \n",
" 2.916684e+06 | \n",
" 1.104392e+06 | \n",
" 2.866833e+06 | \n",
" 1.268002e+06 | \n",
" 2.863532e+06 | \n",
" 2.874585e+06 | \n",
" 2.857484e+06 | \n",
" 2.527224e+06 | \n",
" 1.152141e+06 | \n",
"
\n",
" \n",
" mean | \n",
" 2.080226e+02 | \n",
" 2.338357e+00 | \n",
" 3.626797e+01 | \n",
" -9.433088e+01 | \n",
" 37.117276 | \n",
" -97.085631 | \n",
" 2.368010e-01 | \n",
" 5.173121e+03 | \n",
" 6.263615e+01 | \n",
" 5.381072e+01 | \n",
" 6.550242e+01 | \n",
" 2.978332e+01 | \n",
" 9.128722e+00 | \n",
" 8.310882e+00 | \n",
" 1.776500e-02 | \n",
"
\n",
" \n",
" std | \n",
" 2.076627e+01 | \n",
" 5.228540e-01 | \n",
" 4.830769e+00 | \n",
" 1.678646e+01 | \n",
" 4.746686 | \n",
" 18.208733 | \n",
" 1.523115e+00 | \n",
" 1.015564e+04 | \n",
" 1.844134e+01 | \n",
" 2.401773e+01 | \n",
" 2.253846e+01 | \n",
" 7.520620e-01 | \n",
" 2.817181e+00 | \n",
" 5.222119e+00 | \n",
" 2.057930e-01 | \n",
"
\n",
" \n",
" min | \n",
" 2.000000e+02 | \n",
" 1.000000e+00 | \n",
" 2.455527e+01 | \n",
" -1.246238e+02 | \n",
" 24.587610 | \n",
" -124.497410 | \n",
" 0.000000e+00 | \n",
" 1.000000e+00 | \n",
" -8.900000e+01 | \n",
" -8.900000e+01 | \n",
" 1.000000e+00 | \n",
" 0.000000e+00 | \n",
" 0.000000e+00 | \n",
" 0.000000e+00 | \n",
" 0.000000e+00 | \n",
"
\n",
" \n",
" 25% | \n",
" 2.010000e+02 | \n",
" 2.000000e+00 | \n",
" 3.345090e+01 | \n",
" -1.121182e+02 | \n",
" 33.928530 | \n",
" -117.937170 | \n",
" 0.000000e+00 | \n",
" 8.000000e+02 | \n",
" 5.100000e+01 | \n",
" 3.520000e+01 | \n",
" 4.900000e+01 | \n",
" 2.976000e+01 | \n",
" 1.000000e+01 | \n",
" 5.000000e+00 | \n",
" 0.000000e+00 | \n",
"
\n",
" \n",
" 50% | \n",
" 2.010000e+02 | \n",
" 2.000000e+00 | \n",
" 3.559464e+01 | \n",
" -8.803175e+01 | \n",
" 37.560840 | \n",
" -91.412735 | \n",
" 0.000000e+00 | \n",
" 2.599000e+03 | \n",
" 6.440000e+01 | \n",
" 5.700000e+01 | \n",
" 6.800000e+01 | \n",
" 2.996000e+01 | \n",
" 1.000000e+01 | \n",
" 8.000000e+00 | \n",
" 0.000000e+00 | \n",
"
\n",
" \n",
" 75% | \n",
" 2.010000e+02 | \n",
" 3.000000e+00 | \n",
" 4.007000e+01 | \n",
" -8.083621e+01 | \n",
" 40.717975 | \n",
" -80.811870 | \n",
" 0.000000e+00 | \n",
" 6.501000e+03 | \n",
" 7.600000e+01 | \n",
" 7.300000e+01 | \n",
" 8.400000e+01 | \n",
" 3.010000e+01 | \n",
" 1.000000e+01 | \n",
" 1.150000e+01 | \n",
" 0.000000e+00 | \n",
"
\n",
" \n",
" max | \n",
" 4.060000e+02 | \n",
" 4.000000e+00 | \n",
" 4.900220e+01 | \n",
" -6.711317e+01 | \n",
" 49.075000 | \n",
" -67.109242 | \n",
" 3.336300e+02 | \n",
" 9.904150e+05 | \n",
" 1.670000e+02 | \n",
" 1.150000e+02 | \n",
" 1.000000e+02 | \n",
" 5.774000e+01 | \n",
" 1.400000e+02 | \n",
" 9.840000e+02 | \n",
" 2.500000e+01 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TMC Severity Start_Lat Start_Lng End_Lat \\\n",
"count 2.478818e+06 2.916684e+06 2.916684e+06 2.916684e+06 437866.000000 \n",
"mean 2.080226e+02 2.338357e+00 3.626797e+01 -9.433088e+01 37.117276 \n",
"std 2.076627e+01 5.228540e-01 4.830769e+00 1.678646e+01 4.746686 \n",
"min 2.000000e+02 1.000000e+00 2.455527e+01 -1.246238e+02 24.587610 \n",
"25% 2.010000e+02 2.000000e+00 3.345090e+01 -1.121182e+02 33.928530 \n",
"50% 2.010000e+02 2.000000e+00 3.559464e+01 -8.803175e+01 37.560840 \n",
"75% 2.010000e+02 3.000000e+00 4.007000e+01 -8.083621e+01 40.717975 \n",
"max 4.060000e+02 4.000000e+00 4.900220e+01 -6.711317e+01 49.075000 \n",
"\n",
" End_Lng Distance(mi) Number Temperature(F) \\\n",
"count 437866.000000 2.916684e+06 1.104392e+06 2.866833e+06 \n",
"mean -97.085631 2.368010e-01 5.173121e+03 6.263615e+01 \n",
"std 18.208733 1.523115e+00 1.015564e+04 1.844134e+01 \n",
"min -124.497410 0.000000e+00 1.000000e+00 -8.900000e+01 \n",
"25% -117.937170 0.000000e+00 8.000000e+02 5.100000e+01 \n",
"50% -91.412735 0.000000e+00 2.599000e+03 6.440000e+01 \n",
"75% -80.811870 0.000000e+00 6.501000e+03 7.600000e+01 \n",
"max -67.109242 3.336300e+02 9.904150e+05 1.670000e+02 \n",
"\n",
" Wind_Chill(F) Humidity(%) Pressure(in) Visibility(mi) \\\n",
"count 1.268002e+06 2.863532e+06 2.874585e+06 2.857484e+06 \n",
"mean 5.381072e+01 6.550242e+01 2.978332e+01 9.128722e+00 \n",
"std 2.401773e+01 2.253846e+01 7.520620e-01 2.817181e+00 \n",
"min -8.900000e+01 1.000000e+00 0.000000e+00 0.000000e+00 \n",
"25% 3.520000e+01 4.900000e+01 2.976000e+01 1.000000e+01 \n",
"50% 5.700000e+01 6.800000e+01 2.996000e+01 1.000000e+01 \n",
"75% 7.300000e+01 8.400000e+01 3.010000e+01 1.000000e+01 \n",
"max 1.150000e+02 1.000000e+02 5.774000e+01 1.400000e+02 \n",
"\n",
" Wind_Speed(mph) Precipitation(in) \n",
"count 2.527224e+06 1.152141e+06 \n",
"mean 8.310882e+00 1.776500e-02 \n",
"std 5.222119e+00 2.057930e-01 \n",
"min 0.000000e+00 0.000000e+00 \n",
"25% 5.000000e+00 0.000000e+00 \n",
"50% 8.000000e+00 0.000000e+00 \n",
"75% 1.150000e+01 0.000000e+00 \n",
"max 9.840000e+02 2.500000e+01 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will check the size of the dataset that is to be processed using the len function."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2916684"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You will notice that the dataset has many rows and takes quite a lot of time to read from the file. As we go ahead with the preprocessing, computations will require more time to execute, and that's where the RAPIDS comes to the rescue!\n",
"\n",
"Now we use the info function to check the datatype of all the columns in the dataset."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 2916684 entries, 0 to 2916683\n",
"Data columns (total 49 columns):\n",
" # Column Dtype\n",
"--- ------ -----\n",
" 0 ID object\n",
" 1 Source object\n",
" 2 TMC float64\n",
" 3 Severity int64\n",
" 4 Start_Time object\n",
" 5 End_Time object\n",
" 6 Start_Lat float64\n",
" 7 Start_Lng float64\n",
" 8 End_Lat float64\n",
" 9 End_Lng float64\n",
" 10 Distance(mi) float64\n",
" 11 Description object\n",
" 12 Number float64\n",
" 13 Street object\n",
" 14 Side object\n",
" 15 City object\n",
" 16 County object\n",
" 17 State object\n",
" 18 Zipcode object\n",
" 19 Country object\n",
" 20 Timezone object\n",
" 21 Airport_Code object\n",
" 22 Weather_Timestamp object\n",
" 23 Temperature(F) float64\n",
" 24 Wind_Chill(F) float64\n",
" 25 Humidity(%) float64\n",
" 26 Pressure(in) float64\n",
" 27 Visibility(mi) float64\n",
" 28 Wind_Direction object\n",
" 29 Wind_Speed(mph) float64\n",
" 30 Precipitation(in) float64\n",
" 31 Weather_Condition object\n",
" 32 Amenity bool\n",
" 33 Bump bool\n",
" 34 Crossing bool\n",
" 35 Give_Way bool\n",
" 36 Junction object\n",
" 37 No_Exit bool\n",
" 38 Railway bool\n",
" 39 Roundabout bool\n",
" 40 Station bool\n",
" 41 Stop bool\n",
" 42 Traffic_Calming bool\n",
" 43 Traffic_Signal bool\n",
" 44 Turning_Loop bool\n",
" 45 Sunrise_Sunset object\n",
" 46 Civil_Twilight object\n",
" 47 Nautical_Twilight object\n",
" 48 Astronomical_Twilight object\n",
"dtypes: bool(12), float64(14), int64(1), object(22)\n",
"memory usage: 1.2+ GB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will also check the number of missing values in the dataset, so that we can drop or fill in the missing values."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ID 0\n",
"Source 0\n",
"TMC 437866\n",
"Severity 0\n",
"Start_Time 0\n",
"End_Time 0\n",
"Start_Lat 0\n",
"Start_Lng 0\n",
"End_Lat 2478818\n",
"End_Lng 2478818\n",
"Distance(mi) 0\n",
"Description 1\n",
"Number 1812292\n",
"Street 0\n",
"Side 0\n",
"City 92\n",
"County 0\n",
"State 0\n",
"Zipcode 544\n",
"Country 0\n",
"Timezone 2492\n",
"Airport_Code 4882\n",
"Weather_Timestamp 32683\n",
"Temperature(F) 49851\n",
"Wind_Chill(F) 1648682\n",
"Humidity(%) 53152\n",
"Pressure(in) 42099\n",
"Visibility(mi) 59200\n",
"Wind_Direction 43860\n",
"Wind_Speed(mph) 389460\n",
"Precipitation(in) 1764543\n",
"Weather_Condition 59147\n",
"Amenity 0\n",
"Bump 0\n",
"Crossing 0\n",
"Give_Way 0\n",
"Junction 0\n",
"No_Exit 1\n",
"Railway 1\n",
"Roundabout 1\n",
"Station 1\n",
"Stop 1\n",
"Traffic_Calming 1\n",
"Traffic_Signal 1\n",
"Turning_Loop 1\n",
"Sunrise_Sunset 96\n",
"Civil_Twilight 96\n",
"Nautical_Twilight 96\n",
"Astronomical_Twilight 96\n",
"dtype: uint64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.isna().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many columns with null values, and we will fill them with random values or the mean from the column. We will drop some text columns, as we are not doing any natural language processing right now, but feel free to explore them on your own. We will also drop the columns with too many Nans as filling them will throw our accuracy."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"df = df.drop(columns = ['ID','Start_Time','End_Time','Street','Side','Description','Number','City','Country','Zipcode','Timezone','Airport_Code','Weather_Timestamp','Wind_Chill(F)','Wind_Direction','Wind_Speed(mph)','Precipitation(in)'])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"#Here we are filling the TMC with mean.\n",
"df['TMC'] = df['TMC'].fillna(df['TMC'].mean())\n",
"df['End_Lat'] = df['End_Lat'].fillna(df['End_Lat'].mean())\n",
"df['End_Lng'] = df['End_Lng'].fillna(df['End_Lng'].mean())\n",
"df['Temperature(F)'] = df['Temperature(F)'].fillna(df['Temperature(F)'].mean())\n",
"df['Humidity(%)'] = df['Humidity(%)'].fillna(df['Humidity(%)'].mean())\n",
"df['Pressure(in)'] = df['Pressure(in)'].fillna(df['Pressure(in)'].mean())\n",
"df['Visibility(mi)'] = df['Visibility(mi)'].fillna(df['Visibility(mi)'].mean())\n",
"df['Humidity(%)'] = df['Humidity(%)'].fillna(df['Humidity(%)'].mean())\n",
"df['Pressure(in)'] = df['Pressure(in)'].fillna(df['Pressure(in)'].mean())\n",
"df['Visibility(mi)'] = df['Visibility(mi)'].fillna(df['Visibility(mi)'].mean())\n",
"\n",
"\n",
"df['Weather_Condition'] = df['Weather_Condition'].fillna('Fair')\n",
"df['Sunrise_Sunset'] = df['Sunrise_Sunset'].fillna('Day')\n",
"df['Civil_Twilight'] = df['Civil_Twilight'].fillna('Day')\n",
"df['Nautical_Twilight'] = df['Nautical_Twilight'].fillna('Day')\n",
"df['Astronomical_Twilight'] = df['Astronomical_Twilight'].fillna('Day')\n",
"df['Weather_Condition'] = df['Weather_Condition'].fillna('Fair')\n"
]
},
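{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above spells out every `fillna` call. As a sketch of the same logic, the numeric fills can be written more compactly as a loop over the column names; this is equivalent, not a required change."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Equivalent, more compact form of the numeric mean-fills above (sketch).\n",
"num_cols = ['TMC', 'End_Lat', 'End_Lng', 'Temperature(F)',\n",
"            'Humidity(%)', 'Pressure(in)', 'Visibility(mi)']\n",
"for col in num_cols:\n",
"    df[col] = df[col].fillna(df[col].mean())"
]
},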
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Source 0\n",
"TMC 0\n",
"Severity 0\n",
"Start_Lat 0\n",
"Start_Lng 0\n",
"End_Lat 0\n",
"End_Lng 0\n",
"Distance(mi) 0\n",
"County 0\n",
"State 0\n",
"Temperature(F) 0\n",
"Humidity(%) 0\n",
"Pressure(in) 0\n",
"Visibility(mi) 0\n",
"Weather_Condition 0\n",
"Amenity 0\n",
"Bump 0\n",
"Crossing 0\n",
"Give_Way 0\n",
"Junction 0\n",
"No_Exit 0\n",
"Railway 0\n",
"Roundabout 0\n",
"Station 0\n",
"Stop 0\n",
"Traffic_Calming 0\n",
"Traffic_Signal 0\n",
"Turning_Loop 0\n",
"Sunrise_Sunset 0\n",
"Civil_Twilight 0\n",
"Nautical_Twilight 0\n",
"Astronomical_Twilight 0\n",
"dtype: uint64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.drop(df.tail(1).index,inplace=True) \n",
"df.isna().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now all the columns contain no Nan values and we can go ahead with the preprocessing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
" \n",
"As you have observed in the dataset we have the start and end coordinates,so let us apply Haversine distance formula to get the accident coverage distance. Take note of how these functions use the row-wise operations, something that we have learnt before. If you need help while creating the user defined functions refer to this [notebook](02-Intro_to_cuDF_UDFs.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"from math import cos, sin, asin, sqrt, pi, atan2\n",
"from numba import cuda"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def haversine_distance_kernel(Start_Lat, Start_Lng, End_Lat, End_Lng, out):\n",
" \n",
" for i, (x_1, y_1, x_2, y_2) in enumerate(zip(Start_Lat, Start_Lng, End_Lat, End_Lng)):\n",
" \n",
"\n",
" x_1 = pi/180 * x_1\n",
" y_1 = pi/180 * y_1\n",
" x_2 = pi/180 * x_2\n",
" y_2 = pi/180 * y_2\n",
" \n",
" dlon = y_2 - y_1\n",
" dlat = x_2 - x_1\n",
" a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2\n",
" \n",
" c = 2 * asin(sqrt(a)) \n",
" r = 6371 # Radius of earth in kilometers\n",
" \n",
" out[i] = c * r"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 617 ms, sys: 36.2 ms, total: 654 ms\n",
"Wall time: 655 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"df = df.apply_rows(haversine_distance_kernel,\n",
" incols=['Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng'],\n",
" outcols=dict(out=np.float64),\n",
" kwargs=dict())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wow! The code segment that previously took 7 minutes to compute, now gets executed in less than a second! "
]
},
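{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the kernel output, we can recompute the distance for the first row on the CPU with plain Python `math` calls and compare it with the value the GPU wrote into `out`. This is a sketch; small floating-point differences are expected."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Recompute row 0 on the CPU and compare with the GPU result (sketch).\n",
"x1 = math.radians(float(df['Start_Lat'].iloc[0]))\n",
"y1 = math.radians(float(df['Start_Lng'].iloc[0]))\n",
"x2 = math.radians(float(df['End_Lat'].iloc[0]))\n",
"y2 = math.radians(float(df['End_Lng'].iloc[0]))\n",
"a = math.sin((x2 - x1) / 2)**2 + math.cos(x1) * math.cos(x2) * math.sin((y2 - y1) / 2)**2\n",
"cpu_km = 2 * 6371 * math.asin(math.sqrt(a))\n",
"print(cpu_km, float(df['out'].iloc[0]))"
]
},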
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"def haversine_distance_kernel(Start_Lat, Start_Lng, End_Lat, End_Lng, out):\n",
" \n",
" for i, (x_1, y_1, x_2, y_2) in enumerate(zip(Start_Lat, Start_Lng, End_Lat, End_Lng)):\n",
" \n",
"\n",
" x_1 = pi/180 * x_1\n",
" y_1 = pi/180 * y_1\n",
" x_2 = pi/180 * x_2\n",
" y_2 = pi/180 * y_2\n",
" \n",
" dlon = y_2 - y_1\n",
" dlat = x_2 - x_1\n",
" a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2\n",
" \n",
" c = 2 * asin(sqrt(a)) \n",
" r = 6371 # Radius of earth in kilometers\n",
" \n",
" out[i] = c * r"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 455 ms, sys: 17 ms, total: 472 ms\n",
"Wall time: 473 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"outdf = df.apply_chunks(haversine_distance_kernel,\n",
" incols=['Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng'],\n",
" outcols=dict(out=np.float64),\n",
" kwargs=dict(),\n",
" chunks=8,\n",
" tpb=8)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This kernel also took less than a second. The difference is merely the control we have over the execution."
]
},
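{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we imported CuPy at the start, it is worth noting that cuDF columns can be handed to CuPy without leaving the GPU. The sketch below recomputes the haversine distance in a fully vectorized way and can be used to cross-check the kernel output; it assumes `Series.values` returns a CuPy array, which holds for numeric columns without nulls."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Vectorized haversine with CuPy (sketch), cross-checking the kernel output.\n",
"lat1 = cp.radians(df['Start_Lat'].values)\n",
"lng1 = cp.radians(df['Start_Lng'].values)\n",
"lat2 = cp.radians(df['End_Lat'].values)\n",
"lng2 = cp.radians(df['End_Lng'].values)\n",
"a = cp.sin((lat2 - lat1) / 2)**2 + cp.cos(lat1) * cp.cos(lat2) * cp.sin((lng2 - lng1) / 2)**2\n",
"dist_km = 2 * 6371 * cp.arcsin(cp.sqrt(a))\n",
"print(dist_km[:5])  # should match df['out'].head() up to floating-point error"
]
},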
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Save the dataframe in a csv for future use, and make sure you refer to our sample solution and compared your code's performance with it."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Source | \n",
" TMC | \n",
" Severity | \n",
" Start_Lat | \n",
" Start_Lng | \n",
" End_Lat | \n",
" End_Lng | \n",
" Distance(mi) | \n",
" County | \n",
" State | \n",
" ... | \n",
" Station | \n",
" Stop | \n",
" Traffic_Calming | \n",
" Traffic_Signal | \n",
" Turning_Loop | \n",
" Sunrise_Sunset | \n",
" Civil_Twilight | \n",
" Nautical_Twilight | \n",
" Astronomical_Twilight | \n",
" out | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" MapQuest | \n",
" 201.0 | \n",
" 3 | \n",
" 39.865147 | \n",
" -84.058723 | \n",
" 37.117276 | \n",
" -97.085631 | \n",
" 0.01 | \n",
" Montgomery | \n",
" OH | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" Night | \n",
" Night | \n",
" Night | \n",
" Night | \n",
" 1172.997367 | \n",
"
\n",
" \n",
" 1 | \n",
" MapQuest | \n",
" 201.0 | \n",
" 2 | \n",
" 39.928059 | \n",
" -82.831184 | \n",
" 37.117276 | \n",
" -97.085631 | \n",
" 0.01 | \n",
" Franklin | \n",
" OH | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" Night | \n",
" Night | \n",
" Night | \n",
" Day | \n",
" 1277.283911 | \n",
"
\n",
" \n",
" 2 | \n",
" MapQuest | \n",
" 201.0 | \n",
" 2 | \n",
" 39.063148 | \n",
" -84.032608 | \n",
" 37.117276 | \n",
" -97.085631 | \n",
" 0.01 | \n",
" Clermont | \n",
" OH | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
" Night | \n",
" Night | \n",
" Day | \n",
" Day | \n",
" 1161.565094 | \n",
"
\n",
" \n",
" 3 | \n",
" MapQuest | \n",
" 201.0 | \n",
" 3 | \n",
" 39.747753 | \n",
" -84.205582 | \n",
" 37.117276 | \n",
" -97.085631 | \n",
" 0.01 | \n",
" Montgomery | \n",
" OH | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" Night | \n",
" Day | \n",
" Day | \n",
" Day | \n",
" 1158.238470 | \n",
"
\n",
" \n",
" 4 | \n",
" MapQuest | \n",
" 201.0 | \n",
" 2 | \n",
" 39.627781 | \n",
" -84.188354 | \n",
" 37.117276 | \n",
" -97.085631 | \n",
" 0.01 | \n",
" Montgomery | \n",
" OH | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
" Day | \n",
" Day | \n",
" Day | \n",
" Day | \n",
" 1157.325865 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 33 columns
\n",
"
"
],
"text/plain": [
" Source TMC Severity Start_Lat Start_Lng End_Lat End_Lng \\\n",
"0 MapQuest 201.0 3 39.865147 -84.058723 37.117276 -97.085631 \n",
"1 MapQuest 201.0 2 39.928059 -82.831184 37.117276 -97.085631 \n",
"2 MapQuest 201.0 2 39.063148 -84.032608 37.117276 -97.085631 \n",
"3 MapQuest 201.0 3 39.747753 -84.205582 37.117276 -97.085631 \n",
"4 MapQuest 201.0 2 39.627781 -84.188354 37.117276 -97.085631 \n",
"\n",
" Distance(mi) County State ... Station Stop Traffic_Calming \\\n",
"0 0.01 Montgomery OH ... False False False \n",
"1 0.01 Franklin OH ... False False False \n",
"2 0.01 Clermont OH ... False False False \n",
"3 0.01 Montgomery OH ... False False False \n",
"4 0.01 Montgomery OH ... False False False \n",
"\n",
" Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight \\\n",
"0 False False Night Night \n",
"1 False False Night Night \n",
"2 True False Night Night \n",
"3 False False Night Day \n",
"4 True False Day Day \n",
"\n",
" Nautical_Twilight Astronomical_Twilight out \n",
"0 Night Night 1172.997367 \n",
"1 Night Day 1277.283911 \n",
"2 Day Day 1161.565094 \n",
"3 Day Day 1158.238470 \n",
"4 Day Day 1157.325865 \n",
"\n",
"[5 rows x 33 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"df = df.dropna()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv(\"../../data/data_proc.csv\")"
]
},
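{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a final sanity check, you can read the processed file back and confirm its size; a short sketch (note that `to_csv` also writes the index as an extra column):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Read the processed CSV back and confirm the row and column counts (sketch).\n",
"df_check = cudf.read_csv('../../data/data_proc.csv')\n",
"print(len(df_check), len(df_check.columns))"
]
},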
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusion \n",
"\n",
"Thus we have successfully used CuDF and CuPy to process the accidents dataset, and converted the data to a form more suitable to apply machine learning algorithms. In the extra labs for future labs in CuML we will be using this processed dataset. You must have observed the parallels between the RAPIDS pipeline and traditional pipeline while writing your code. Try to experiment with the processing and making your code as efficient as possible."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# References\n",
"\n",
"- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, 2019.\n",
"\n",
"- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. \"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.\" In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.\n",
"\n",
"- If you need to refer to the dataset, you can download it [here](https://www.kaggle.com/sobhanmoosavi/us-accidents)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"- This dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licensing\n",
" \n",
"This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Previous Notebook](03-Cudf_Exercise.ipynb)\n",
" \n",
" \n",
" \n",
" \n",
"[1](01-Intro_to_cuDF.ipynb)\n",
"[2](02-Intro_to_cuDF_UDFs.ipynb)\n",
"[3](03-Cudf_Exercise.ipynb)\n",
"[4]\n",
" \n",
" \n",
" \n",
" \n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"[Home Page](../START_HERE.ipynb)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}