[Home Page](../START_HERE.ipynb)

[Previous Notebook](03-Cudf_Exercise.ipynb)
[1](01-Intro_to_cuDF.ipynb)
[2](02-Intro_to_cuDF_UDFs.ipynb)
[3](03-Cudf_Exercise.ipynb)
[4]


# Applying CuDF: The Solution

Welcome to the fourth cuDF tutorial notebook! This is a practical example that uses cuDF and CuPy and is geared primarily toward new users. The purpose of this tutorial is to introduce a data science processing pipeline using RAPIDS on a real-life dataset. We will be working on a data science problem: US Accidents Prediction. This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to June 2020 using two APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. Currently, there are about 3.5 million accident records in this dataset.


## What should I do?

Given below is a complete data science preprocessing pipeline for the dataset using the pandas and NumPy libraries. Using the methods and techniques from the previous notebooks, you have to convert this pipeline to a RAPIDS implementation using cuDF and CuPy. Don't forget to time your code cells and compare the performance with the original code to understand why we are using RAPIDS. If you get stuck, feel free to refer to this sample solution.
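Outside a notebook, where the `%time` and `%%time` magics used in the cells below are not available, the same comparison can be made with the standard library. The sketch below is one possible helper; the name `timed` and the stand-in workload are illustrative and not part of the original pipeline.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Measure the wall-clock time of the enclosed block (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the number for later comparison
        print(f"{label}: {elapsed:.3f} s")


# Stand-in workload; replace it with a pandas or cuDF step to compare the two.
timings = {}
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))
```

Collecting the timings in a dictionary makes it easy to print the CPU and GPU numbers side by side once both versions of a step have run.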

## Here is the list of exercises in the lab where you need to modify code:
- <a href='#ex1'>Exercise 1</a><br> Load the dataset from a CSV file and store it in a cuDF dataframe.
- <a href='#ex2'>Exercise 2</a><br> Create kernel functions to run the given function efficiently on a GPU.


The first step is to download the dataset (the download link is listed under References) and place it in the (host/data) folder so that this tutorial can read it. Now we will import the necessary libraries.

In [4]:
import os
import cudf
import numpy as np
import cupy as cp
import math
np.random.seed(12)

<a id='ex1'></a>

First, we need to load the dataset from the CSV file into a cuDF dataframe for the preprocessing steps. If you need help, refer to the Getting Data In and Out module from this [notebook](01-Intro_to_cuDF.ipynb).

In [5]:
%time df = cudf.read_csv('../../data/data.csv')
print(df)

CPU times: user 547 ms, sys: 225 ms, total: 772 ms
Wall time: 773 ms
                ID    Source    TMC  Severity           Start_Time  \
0              A-1  MapQuest  201.0         3  2016-02-08 05:46:00   
1              A-2  MapQuest  201.0         2  2016-02-08 06:07:59   
2              A-3  MapQuest  201.0         2  2016-02-08 06:49:27   
3              A-4  MapQuest  201.0         3  2016-02-08 07:23:34   
4              A-5  MapQuest  201.0         2  2016-02-08 07:39:07   
...            ...       ...    ...       ...                  ...   
2916679  A-2916810      Bing   <NA>         1  2020-04-15 15:50:00   
2916680  A-2916811      Bing   <NA>         1  2020-04-15 15:05:44   
2916681  A-2916812      Bing   <NA>         2  2020-04-15 16:06:28   
2916682  A-2916813      Bing   <NA>         1  2020-04-15 15:27:40   
2916683  A-2916814      Bing   <NA>         2  2020-04-15 15:25:20   

                    End_Time  Start_Lat   Start_Lng   End_Lat     End_Lng  \
0        2016


First, we will analyse the data and look for patterns that can help us prepare it for the machine learning algorithms later on. Using the `describe` method, we generate descriptive statistics for the numeric columns. Descriptive statistics summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.

In [6]:
df.describe()

Unnamed: 0,TMC,Severity,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Number,Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Speed(mph),Precipitation(in)
count,2478818.0,2916684.0,2916684.0,2916684.0,437866.0,437866.0,2916684.0,1104392.0,2866833.0,1268002.0,2863532.0,2874585.0,2857484.0,2527224.0,1152141.0
mean,208.0226,2.338357,36.26797,-94.33088,37.117276,-97.085631,0.236801,5173.121,62.63615,53.81072,65.50242,29.78332,9.128722,8.310882,0.017765
std,20.76627,0.522854,4.830769,16.78646,4.746686,18.208733,1.523115,10155.64,18.44134,24.01773,22.53846,0.752062,2.817181,5.222119,0.205793
min,200.0,1.0,24.55527,-124.6238,24.58761,-124.49741,0.0,1.0,-89.0,-89.0,1.0,0.0,0.0,0.0,0.0
25%,201.0,2.0,33.4509,-112.1182,33.92853,-117.93717,0.0,800.0,51.0,35.2,49.0,29.76,10.0,5.0,0.0
50%,201.0,2.0,35.59464,-88.03175,37.56084,-91.412735,0.0,2599.0,64.4,57.0,68.0,29.96,10.0,8.0,0.0
75%,201.0,3.0,40.07,-80.83621,40.717975,-80.81187,0.0,6501.0,76.0,73.0,84.0,30.1,10.0,11.5,0.0
max,406.0,4.0,49.0022,-67.11317,49.075,-67.109242,333.63,990415.0,167.0,115.0,100.0,57.74,140.0,984.0,25.0


We will check the size of the dataset that is to be processed using the len function.

In [7]:
len(df)

2916684

You will notice that the dataset has many rows and takes quite a lot of time to read from the file. As we go ahead with the preprocessing, the computations will require more time to execute, and that's where RAPIDS comes to the rescue!

Now we use the `info` method to check the data type of every column in the dataset.

In [8]:
df.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 2916684 entries, 0 to 2916683
Data columns (total 49 columns):
 #   Column                 Dtype
---  ------                 -----
 0   ID                     object
 1   Source                 object
 2   TMC                    float64
 3   Severity               int64
 4   Start_Time             object
 5   End_Time               object
 6   Start_Lat              float64
 7   Start_Lng              float64
 8   End_Lat                float64
 9   End_Lng                float64
 10  Distance(mi)           float64
 11  Description            object
 12  Number                 float64
 13  Street                 object
 14  Side                   object
 15  City                   object
 16  County                 object
 17  State                  object
 18  Zipcode                object
 19  Country                object
 20  Timezone               object
 21  Airport_Code           object
 22  Weather_Timestamp      object
 23  T

We will also check the number of missing values in the dataset, so that we can drop or fill in the missing values.

In [9]:
df.isna().sum()

ID                             0
Source                         0
TMC                       437866
Severity                       0
Start_Time                     0
End_Time                       0
Start_Lat                      0
Start_Lng                      0
End_Lat                  2478818
End_Lng                  2478818
Distance(mi)                   0
Description                    1
Number                   1812292
Street                         0
Side                           0
City                          92
County                         0
State                          0
Zipcode                      544
Country                        0
Timezone                    2492
Airport_Code                4882
Weather_Timestamp          32683
Temperature(F)             49851
Wind_Chill(F)            1648682
Humidity(%)                53152
Pressure(in)               42099
Visibility(mi)             59200
Wind_Direction             43860
Wind_Speed(mph)           389460
Precipitat
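The raw counts are easier to judge as a share of the 2,916,684 rows. The short sketch below copies a few of the counts from the output above and turns them into fractions; the selection of columns is only illustrative.

```python
# Missing-value counts copied from the df.isna().sum() output above.
TOTAL_ROWS = 2916684
missing = {
    "TMC": 437866,
    "End_Lat": 2478818,
    "End_Lng": 2478818,
    "Number": 1812292,
    "Wind_Chill(F)": 1648682,
    "Wind_Speed(mph)": 389460,
    "Visibility(mi)": 59200,
    "Temperature(F)": 49851,
}

# Fraction of rows with a missing value, per column.
fraction = {col: count / TOTAL_ROWS for col, count in missing.items()}

# Print the columns from most to least affected.
for col, frac in sorted(fraction.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<16} {frac:6.1%}")
```

On the live dataframe, the same fractions could be obtained with `df.isna().sum() / len(df)`.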

There are many columns with null values, and we will fill them with either the column mean or a fixed default value. We will drop some text columns, as we are not doing any natural language processing right now, but feel free to explore them on your own. We will also drop the columns with too many NaNs, as filling them would hurt our accuracy.

In [10]:
df = df.drop(columns = ['ID','Start_Time','End_Time','Street','Side','Description','Number','City','Country','Zipcode','Timezone','Airport_Code','Weather_Timestamp','Wind_Chill(F)','Wind_Direction','Wind_Speed(mph)','Precipitation(in)'])

In [11]:
# Fill missing values in the numeric columns with the column mean.
df['TMC'] = df['TMC'].fillna(df['TMC'].mean())
df['End_Lat'] = df['End_Lat'].fillna(df['End_Lat'].mean())
df['End_Lng'] = df['End_Lng'].fillna(df['End_Lng'].mean())
df['Temperature(F)'] = df['Temperature(F)'].fillna(df['Temperature(F)'].mean())
df['Humidity(%)'] = df['Humidity(%)'].fillna(df['Humidity(%)'].mean())
df['Pressure(in)'] = df['Pressure(in)'].fillna(df['Pressure(in)'].mean())
df['Visibility(mi)'] = df['Visibility(mi)'].fillna(df['Visibility(mi)'].mean())

# Fill missing values in the categorical columns with a fixed default label.
df['Weather_Condition'] = df['Weather_Condition'].fillna('Fair')
df['Sunrise_Sunset'] = df['Sunrise_Sunset'].fillna('Day')
df['Civil_Twilight'] = df['Civil_Twilight'].fillna('Day')
df['Nautical_Twilight'] = df['Nautical_Twilight'].fillna('Day')
df['Astronomical_Twilight'] = df['Astronomical_Twilight'].fillna('Day')
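The fill pattern in the cell above can also be written as two short loops. The sketch below uses a tiny made-up dataframe rather than the accidents data, and it falls back to pandas when cuDF is not installed so that it can run without a GPU; the names `toy`, `mean_fill_cols` and `default_fill` are illustrative.

```python
try:
    import cudf as xdf      # GPU dataframe library used in this notebook
except ImportError:
    import pandas as xdf    # CPU fallback so the sketch runs without a GPU

# Tiny stand-in for the accidents dataframe (values are made up).
toy = xdf.DataFrame({
    "Temperature(F)": [50.0, None, 70.0],
    "Humidity(%)": [None, 40.0, 60.0],
    "Sunrise_Sunset": ["Night", None, "Day"],
})

mean_fill_cols = ["Temperature(F)", "Humidity(%)"]
default_fill = {"Sunrise_Sunset": "Day"}

# Numeric columns: replace nulls with the mean of the non-null values.
for col in mean_fill_cols:
    toy[col] = toy[col].fillna(toy[col].mean())

# Categorical columns: replace nulls with a fixed label.
for col, label in default_fill.items():
    toy[col] = toy[col].fillna(label)

remaining_nulls = int(toy.isna().sum().sum())
print(remaining_nulls)
```

Because the same calls work in both libraries, this is also a small illustration of how closely the RAPIDS pipeline mirrors the pandas one.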


In [12]:
df.drop(df.tail(1).index,inplace=True) 
df.isna().sum()

Source                   0
TMC                      0
Severity                 0
Start_Lat                0
Start_Lng                0
End_Lat                  0
End_Lng                  0
Distance(mi)             0
County                   0
State                    0
Temperature(F)           0
Humidity(%)              0
Pressure(in)             0
Visibility(mi)           0
Weather_Condition        0
Amenity                  0
Bump                     0
Crossing                 0
Give_Way                 0
Junction                 0
No_Exit                  0
Railway                  0
Roundabout               0
Station                  0
Stop                     0
Traffic_Calming          0
Traffic_Signal           0
Turning_Loop             0
Sunrise_Sunset           0
Civil_Twilight           0
Nautical_Twilight        0
Astronomical_Twilight    0
dtype: uint64

Now none of the columns contain NaN values, and we can go ahead with the preprocessing.

<a id='ex2'></a>
       
As you have observed, the dataset contains the start and end coordinates of each accident, so let us apply the Haversine distance formula to get the accident coverage distance. Take note of how these functions use row-wise operations, something that we have learnt before. If you need help while creating the user-defined functions, refer to this [notebook](02-Intro_to_cuDF_UDFs.ipynb).

In [13]:
from math import cos, sin, asin, sqrt, pi, atan2
from numba import cuda

In [14]:
def haversine_distance_kernel(Start_Lat, Start_Lng, End_Lat, End_Lng, out):
    # x_* are latitudes and y_* are longitudes, both given in degrees.
    for i, (x_1, y_1, x_2, y_2) in enumerate(zip(Start_Lat, Start_Lng, End_Lat, End_Lng)):
        # Convert degrees to radians.
        x_1 = pi/180 * x_1
        y_1 = pi/180 * y_1
        x_2 = pi/180 * x_2
        y_2 = pi/180 * y_2

        dlon = y_2 - y_1
        dlat = x_2 - x_1
        # Haversine formula for the central angle between the two points.
        a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2

        c = 2 * asin(sqrt(a))
        r = 6371  # Radius of earth in kilometers

        out[i] = c * r
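A plain-Python version of the same formula is a convenient way to sanity-check the GPU result on a single row. The sketch below is a CPU reference only; the function name `haversine_km` is an assumption made for illustration, and the coordinates are those displayed for row 0 by `df.head()` later in this notebook.

```python
from math import asin, cos, pi, sin, sqrt

EARTH_RADIUS_KM = 6371  # same radius as used in the kernel


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two points given in degrees."""
    # Convert all four angles from degrees to radians.
    lat1, lng1, lat2, lng2 = (pi / 180 * v for v in (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlon = lng2 - lng1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * EARTH_RADIUS_KM


# Start and (mean-filled) end coordinates of row 0 as displayed in this notebook.
d = haversine_km(39.865147, -84.058723, 37.117276, -97.085631)
print(round(d, 1))
```

The result should land close to the `out` value of roughly 1173 km that the notebook reports for that row, up to the rounding of the displayed coordinates.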

In [15]:
%%time

df = df.apply_rows(haversine_distance_kernel,
                   incols=['Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng'],
                   outcols=dict(out=np.float64),
                   kwargs=dict())

CPU times: user 617 ms, sys: 36.2 ms, total: 654 ms
Wall time: 655 ms


Wow! The code segment that previously took 7 minutes to compute now executes in less than a second!

In [16]:
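# Same Haversine kernel, redefined here for use with apply_chunks.
def haversine_distance_kernel(Start_Lat, Start_Lng, End_Lat, End_Lng, out):
    # x_* are latitudes and y_* are longitudes, both given in degrees.
    for i, (x_1, y_1, x_2, y_2) in enumerate(zip(Start_Lat, Start_Lng, End_Lat, End_Lng)):
        # Convert degrees to radians.
        x_1 = pi/180 * x_1
        y_1 = pi/180 * y_1
        x_2 = pi/180 * x_2
        y_2 = pi/180 * y_2

        dlon = y_2 - y_1
        dlat = x_2 - x_1
        # Haversine formula for the central angle between the two points.
        a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2

        c = 2 * asin(sqrt(a))
        r = 6371  # Radius of earth in kilometers

        out[i] = c * r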

In [17]:
%%time

outdf = df.apply_chunks(haversine_distance_kernel,
                        incols=['Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng'],
                        outcols=dict(out=np.float64),
                        kwargs=dict(),
                        chunks=8,
                        tpb=8)

CPU times: user 455 ms, sys: 17 ms, total: 472 ms
Wall time: 473 ms


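This kernel also ran in less than a second. The difference between the two approaches is the control we have over the execution: with `apply_chunks` we choose how the rows are split into chunks (`chunks`) and how many threads per block (`tpb`) work on them.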
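To make that control more concrete, the pure-Python sketch below simulates one way rows can be divided among threads: it assumes that an integer `chunks` value is the chunk size and that each of the `tpb` threads in a block strides through its chunk, which is our reading of the cuDF documentation for `apply_chunks`. It is a CPU illustration under those assumptions, not the library's implementation, and `rows_per_thread` is a hypothetical helper.

```python
def rows_per_thread(n_rows, chunk_size, tpb):
    """Map each (chunk, thread) pair to the global row indices it would handle."""
    assignment = {}
    for chunk_id, start in enumerate(range(0, n_rows, chunk_size)):
        stop = min(start + chunk_size, n_rows)  # the last chunk may be shorter
        for thread in range(tpb):
            # Thread-stride loop: thread t takes rows t, t + tpb, t + 2*tpb, ...
            # counted from the start of its chunk.
            assignment[(chunk_id, thread)] = list(range(start + thread, stop, tpb))
    return assignment


plan = rows_per_thread(n_rows=20, chunk_size=8, tpb=4)
for (chunk_id, thread), rows in plan.items():
    print(f"chunk {chunk_id}, thread {thread}: rows {rows}")
```

Under this scheme, the settings used above (`chunks=8`, `tpb=8`) would give each thread at most one row per chunk.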

Save the dataframe to a CSV file for future use, and make sure you compare your code's performance with this sample solution.

In [18]:
df.head()

Unnamed: 0,Source,TMC,Severity,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),County,State,...,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight,out
0,MapQuest,201.0,3,39.865147,-84.058723,37.117276,-97.085631,0.01,Montgomery,OH,...,False,False,False,False,False,Night,Night,Night,Night,1172.997367
1,MapQuest,201.0,2,39.928059,-82.831184,37.117276,-97.085631,0.01,Franklin,OH,...,False,False,False,False,False,Night,Night,Night,Day,1277.283911
2,MapQuest,201.0,2,39.063148,-84.032608,37.117276,-97.085631,0.01,Clermont,OH,...,False,False,False,True,False,Night,Night,Day,Day,1161.565094
3,MapQuest,201.0,3,39.747753,-84.205582,37.117276,-97.085631,0.01,Montgomery,OH,...,False,False,False,False,False,Night,Day,Day,Day,1158.23847
4,MapQuest,201.0,2,39.627781,-84.188354,37.117276,-97.085631,0.01,Montgomery,OH,...,False,False,False,True,False,Day,Day,Day,Day,1157.325865


In [19]:
df = df.dropna()

In [20]:
df.to_csv("../../data/data_proc.csv")

# Conclusion 

We have now used cuDF and CuPy to process the accidents dataset and converted the data to a form better suited to machine learning algorithms. The upcoming cuML labs will use this processed dataset. You must have observed the parallels between the RAPIDS pipeline and the traditional pipeline while writing your code. Try experimenting with the processing steps to make your code as efficient as possible.

# References

- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset." 2019.

- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

- If you need to refer to the dataset, you can download it [here](https://www.kaggle.com/sobhanmoosavi/us-accidents).

<center><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a></center><br />

- This dataset is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

## Licensing
  
This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).
