{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "     \n", "     \n", "     \n", "     \n", "     \n", "   \n", "[Home Page](../START_HERE.ipynb)\n", "\n", "[Previous Notebook](03-Cudf_Exercise.ipynb)\n", "     \n", "     \n", "     \n", "     \n", "[1](01-Intro_to_cuDF.ipynb)\n", "[2](02-Intro_to_cuDF_UDFs.ipynb)\n", "[3](03-Cudf_Exercise.ipynb)\n", "[4]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Applying CuDF: Workbook\n", "\n", "Welcome to the fourth cuDF tutorial notebook! This is a practical example that utilizes cuDF and cuPy, geared primarily for new users. The purpose of this tutorial is to introduce new users to a data science processing pipeline using RAPIDS on a real-life dataset. We will be working on a data science problem: US Accidents Prediction. This is a countrywide car accident dataset, which covers 49 states of the USA. The accident data were collected from February 2016 to June 2020, using two APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. Currently, there are about 3.5 million accident records in this dataset. \n", "\n", "\n", "## What should I do?\n", "\n", "Given below is a complete data science preprocessing pipeline for the dataset using the Pandas and NumPy libraries. Using the methods and techniques from the previous notebooks, you have to convert this pipeline to a RAPIDS implementation using cuDF and cuPy. Don't forget to time your code cells and compare the performance with this original code to understand why we are using RAPIDS. If you get stuck in the middle, feel free to refer to this sample solution. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Here is the list of exercises in the lab where you need to modify code:\n", "- Exercise 1
Loading the dataset from a CSV file and storing it in a cuDF DataFrame.\n", "- Exercise 2
Creating kernel functions to run the given function optimally on a GPU.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step is downloading the dataset and putting it in the data directory for use in this tutorial.\n", "Download the dataset here, and place it in the (host/data) folder. Now we will import the necessary libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import cudf\n", "import numpy as np\n", "import cupy as cp\n", "import math\n", "np.random.seed(12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "First, we need to load the dataset from the CSV file into a cuDF DataFrame for the preprocessing steps. If you need help, refer to the Getting Data In and Out module from this [notebook](01-Intro_to_cuDF.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Modify the code in this cell\n", "\n", "# Use cudf to read csv\n", "%time df = \n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Next, we will analyse the data and observe patterns that can help us process the data better before feeding it to machine learning algorithms later. Using the describe function, we will generate descriptive statistics for all the columns. Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will check the size of the dataset to be processed using the len function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will notice that the dataset has 3,513,616 rows and takes quite a lot of time to read from the file. As we go ahead with the preprocessing, computations will require more time to execute, and that's where RAPIDS comes to the rescue!\n", "\n", "Now we use the info function to check the datatypes of all the columns in the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will also check the number of missing values in the dataset, so that we can drop or fill in the missing values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many columns with null values, and we will fill them with either the column mean or a suitable default value. We will drop some text columns, as we are not doing any natural language processing right now, but feel free to explore them on your own. We will also drop the columns with too many NaNs, as filling them would throw off our accuracy.\n"
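, "\n", "Below is a compact alternative to the column-by-column `fillna` calls in the cells that follow: a minimal sketch, assuming the listed numeric columns are the ones you want to impute.\n", "\n", "```python\n", "# Sketch: fill the missing values of each numeric column with that column's mean\n", "numeric_cols = ['TMC', 'End_Lat', 'End_Lng', 'Temperature(F)',\n", "                'Humidity(%)', 'Pressure(in)', 'Visibility(mi)']\n", "for col in numeric_cols:\n", "    df[col] = df[col].fillna(df[col].mean())\n", "```\n", "\n", "Either form produces the same result; the loop simply avoids repeating near-identical lines.\n"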
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.drop(columns=['ID','Start_Time','End_Time','Street','Side','Description','Number','City','Country','Zipcode','Timezone','Airport_Code','Weather_Timestamp','Wind_Chill(F)','Wind_Direction','Wind_Speed(mph)','Precipitation(in)'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fill the missing values of the numeric columns with the column mean.\n", "df['TMC'] = df['TMC'].fillna(df['TMC'].mean())\n", "df['End_Lat'] = df['End_Lat'].fillna(df['End_Lat'].mean())\n", "df['End_Lng'] = df['End_Lng'].fillna(df['End_Lng'].mean())\n", "df['Temperature(F)'] = df['Temperature(F)'].fillna(df['Temperature(F)'].mean())\n", "df['Humidity(%)'] = df['Humidity(%)'].fillna(df['Humidity(%)'].mean())\n", "df['Pressure(in)'] = df['Pressure(in)'].fillna(df['Pressure(in)'].mean())\n", "df['Visibility(mi)'] = df['Visibility(mi)'].fillna(df['Visibility(mi)'].mean())\n", "\n", "# Fill the missing values of the categorical columns with a default category.\n", "df['Weather_Condition'] = df['Weather_Condition'].fillna('Fair')\n", "df['Sunrise_Sunset'] = df['Sunrise_Sunset'].fillna('Day')\n", "df['Civil_Twilight'] = df['Civil_Twilight'].fillna('Day')\n", "df['Nautical_Twilight'] = df['Nautical_Twilight'].fillna('Day')\n", "df['Astronomical_Twilight'] = df['Astronomical_Twilight'].fillna('Day')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now none of the columns contain NaN values, and we can go ahead with the preprocessing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", "As you have observed, the dataset has the start and end coordinates of each accident, so let us apply the Haversine distance formula to get the accident coverage distance. Take note of how these functions use row-wise operations, something that we have learned before. If you need help while creating the user-defined functions, refer to this [notebook](02-Intro_to_cuDF_UDFs.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from math import cos, sin, asin, sqrt, pi, atan2\n", "from numba import cuda" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Modify the code in this cell\n", "\n", "def haversine_distance_kernel(Start_Lat, Start_Lng, End_Lat, End_Lng, out):\n", "    for i, (x_1, y_1, x_2, y_2) in enumerate(zip(Start_Lat, Start_Lng, End_Lat, End_Lng)):\n", "        # Perform the computations here and store the final value in out[i]\n", "        out[i] = " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "#Modify the code in this cell\n", "\n", "#Add the arguments to the apply_rows function for the haversine distance kernel\n", "df = df.apply_rows()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wow! The code segment that previously took 7 minutes to compute now executes in less than a second!
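\n", "\n", "Before moving on, you can sanity-check your solution against the minimal sketch below. It shows one possible `apply_rows` call; the argument values are assumptions about how you completed the exercise, not the official sample solution.\n", "\n", "```python\n", "# Hypothetical sketch: assumes your kernel writes the haversine distance into `out`\n", "df = df.apply_rows(haversine_distance_kernel,\n", "                   incols=['Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng'],\n", "                   outcols=dict(out=np.float64),\n", "                   kwargs=dict())\n", "```\n", "\n", "Next, repeat the exercise with `apply_chunks`, which additionally lets you control how the rows are split into chunks and mapped onto CUDA threads; refer to the UDF [notebook](02-Intro_to_cuDF_UDFs.ipynb) if you need a refresher.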
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Modify the code in this cell\n", "\n", "def haversine_distance_kernel(Start_Lat, Start_Lng, End_Lat, End_Lng, out):\n", " \n", " for i, (x_1, y_1, x_2, y_2) in enumerate(zip(Start_Lat, Start_Lng, End_Lat, End_Lng)):\n", " \n", " #Perform the computations here and store the final value in out[i]\n", " \n", " out[i] = " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Modify the code in this cell\n", "\n", "%%time\n", "#Add the arguments to the apply_chunks function for the haversine distance kernel\n", "outdf = df.apply_chunks()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the dataframe in a csv for future use, and make sure you refer to our sample solution and compared your code's performance with it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.dropna()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.to_csv(\"../../data/data_proc.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion \n", "\n", "Thus we have successfully used CuDF and CuPy to process the accidents dataset, and converted the data to a form more suitable to apply machine learning algorithms. In the extra labs for future labs in CuML we will be using this processed dataset. You must have observed the parallels between the RAPIDS pipeline and traditional pipeline while writing your code. Try to experiment with the processing and making your code as efficient as possible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# References\n", "\n", "- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, 2019.\n", "\n", "- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. \"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.\" In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.\n", "\n", "- If you need to refer to the dataset, you can download it [here](https://www.kaggle.com/sobhanmoosavi/us-accidents)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "- This dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Licensing\n", " \n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Previous Notebook](03-Cudf_Exercise.ipynb)\n", "     \n", "     \n", "     \n", "     \n", "[1](01-Intro_to_cuDF.ipynb)\n", "[2](02-Intro_to_cuDF_UDFs.ipynb)\n", "[3](03-Cudf_Exercise.ipynb)\n", "[4]\n", "     \n", "     \n", "     \n", "     \n", "\n", "\n", "     \n", "     \n", "     \n", "     \n", "     \n", "   \n", "[Home Page](../START_HERE.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 4 }