{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"[Home Page](../START_HERE.ipynb)\n",
"\n",
"[Previous Notebook](03-Cudf_Exercise.ipynb)\n",
" \n",
" \n",
" \n",
" \n",
"[1](01-Intro_to_cuDF.ipynb)\n",
"[2](02-Intro_to_cuDF_UDFs.ipynb)\n",
"[3](03-Cudf_Exercise.ipynb)\n",
"[4]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Applying cuDF: Workbook\n",
"\n",
"Welcome to the fourth cuDF tutorial notebook! This is a practical example that uses cuDF and CuPy, geared primarily toward new users. The purpose of this tutorial is to introduce a data science processing pipeline using RAPIDS on a real-life dataset. We will work on a data science problem: US Accidents Prediction. This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to June 2020 using two APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset currently contains about 3.5 million accident records.\n",
"\n",
"\n",
"## What should I do?\n",
"\n",
"Given below is a complete data science preprocessing pipeline for the dataset using the Pandas and NumPy libraries. Using the methods and techniques from the previous notebooks, convert this pipeline to a RAPIDS implementation using cuDF and CuPy. Don't forget to time your code cells and compare the performance with the original code to see why we are using RAPIDS. If you get stuck, feel free to refer to this sample solution."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Here is the list of exercises in the lab where you need to modify code:\n",
"- Exercise 1: Load the dataset from a CSV file and store it in a cuDF DataFrame.\n",
"- Exercise 2: Create kernel functions to run the given function optimally on a GPU.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first step is to download the dataset and place it in the data directory for use in this tutorial.\n",
"Download the dataset here and place it in the (host/data) folder. Now we will import the necessary libraries."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import cudf\n",
"import numpy as np\n",
"import cupy as cp\n",
"import math\n",
"np.random.seed(12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"First we need to load the dataset from the CSV file into a cuDF DataFrame for the preprocessing steps. If you need help, refer to the Getting Data In and Out module from this [notebook](01-Intro_to_cuDF.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Modify the code in this cell\n",
"\n",
"# Use cudf to read csv\n",
"%time df = \n",
"print(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"First we will analyse the data and observe patterns that can help us process the data better for feeding to the machine learning algorithms in the future. By using the describe, we will generate the descriptive statistics for all the columns. Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"| | TMC | Severity | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | Number | Temperature(F) | Wind_Chill(F) | Humidity(%) | Pressure(in) | Visibility(mi) | Wind_Speed(mph) | Precipitation(in) |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| count | 2.478818e+06 | 3.513617e+06 | 3.513617e+06 | 3.513617e+06 | 1.034799e+06 | 1.034799e+06 | 3.513617e+06 | 1.250753e+06 | 3.447885e+06 | 1.645368e+06 | 3.443930e+06 | 3.457735e+06 | 3.437761e+06 | 3.059008e+06 | 1.487743e+06 |\n",
"| mean | 2.080226e+02 | 2.339929e+00 | 3.654194e+01 | -9.579151e+01 | 3.755758e+01 | -1.004560e+02 | 2.816170e-01 | 5.975383e+03 | 6.193512e+01 | 5.355730e+01 | 6.511427e+01 | 2.974463e+01 | 9.122644e+00 | 8.219025e+00 | 1.598300e-02 |\n",
"| std | 2.076627e+01 | 5.521930e-01 | 4.883520e+00 | 1.736877e+01 | 4.861215e+00 | 1.852879e+01 | 1.550134e+00 | 1.496624e+04 | 1.862106e+01 | 2.377334e+01 | 2.275558e+01 | 8.319760e-01 | 2.885879e+00 | 5.262847e+00 | 1.928260e-01 |\n",
"| min | 2.000000e+02 | 1.000000e+00 | 2.455527e+01 | -1.246238e+02 | 2.457011e+01 | -1.244978e+02 | 0.000000e+00 | 0.000000e+00 | -8.900000e+01 | -8.900000e+01 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |\n",
"| 25% | 2.010000e+02 | 2.000000e+00 | 3.363784e+01 | -1.174418e+02 | 3.399477e+01 | -1.183440e+02 | 0.000000e+00 | 8.640000e+02 | 5.000000e+01 | 3.570000e+01 | 4.800000e+01 | 2.973000e+01 | 1.000000e+01 | 5.000000e+00 | 0.000000e+00 |\n",
"| 50% | 2.010000e+02 | 2.000000e+00 | 3.591687e+01 | -9.102601e+01 | 3.779736e+01 | -9.703438e+01 | 0.000000e+00 | 2.798000e+03 | 6.400000e+01 | 5.700000e+01 | 6.700000e+01 | 2.995000e+01 | 1.000000e+01 | 7.000000e+00 | 0.000000e+00 |\n",
"| 75% | 2.010000e+02 | 3.000000e+00 | 4.032217e+01 | -8.093299e+01 | 4.105139e+01 | -8.210168e+01 | 1.000000e-02 | 7.098000e+03 | 7.590000e+01 | 7.200000e+01 | 8.400000e+01 | 3.009000e+01 | 1.000000e+01 | 1.150000e+01 | 0.000000e+00 |\n",
"| max | 4.060000e+02 | 4.000000e+00 | 4.900220e+01 | -6.711317e+01 | 4.907500e+01 | -6.710924e+01 | 3.336300e+02 | 9.999997e+06 | 1.706000e+02 | 1.150000e+02 | 1.000000e+02 | 5.774000e+01 | 1.400000e+02 | 9.840000e+02 | 2.500000e+01 |\n",
"\n",
"| | Source | TMC | Severity | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | Temperature(F) | Humidity(%) | ... | WC_Thunderstorm | WC_Thunderstorms and Rain | WC_Thunderstorms and Snow | WC_Tornado | WC_Volcanic Ash | WC_Widespread Dust | WC_Widespread Dust / Windy | WC_Wintry Mix | WC_Wintry Mix / Windy | out |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | MapQuest | 201.0 | 3 | 39.865147 | -84.058723 | 37.557578 | -100.455981 | 0.01 | 36.9 | 91.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1443.524390 |\n",
"| 1 | MapQuest | 201.0 | 2 | 39.928059 | -82.831184 | 37.557578 | -100.455981 | 0.01 | 37.9 | 100.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1548.467903 |\n",
"| 2 | MapQuest | 201.0 | 2 | 39.063148 | -84.032608 | 37.557578 | -100.455981 | 0.01 | 36.0 | 100.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1440.697621 |\n",
"| 3 | MapQuest | 201.0 | 3 | 39.747753 | -84.205582 | 37.557578 | -100.455981 | 0.01 | 35.1 | 96.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1429.927497 |\n",
"| 4 | MapQuest | 201.0 | 2 | 39.627781 | -84.188354 | 37.557578 | -100.455981 | 0.01 | 36.0 | 89.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1430.383177 |\n",
"\n",
"5 rows × 1930 columns\n", "