{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "     \n", "     \n", "     \n", "     \n", "     \n", "   \n", "[Home Page](../START_HERE.ipynb)\n", "\n", "[Previous Notebook](03-Cudf_Exercise.ipynb)\n", "     \n", "     \n", "     \n", "     \n", "[1](01-Intro_to_cuDF.ipynb)\n", "[2](02-Intro_to_cuDF_UDFs.ipynb)\n", "[3](03-Cudf_Exercise.ipynb)\n", "[4]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Applying CuDF: The Solution\n", "\n", "Welcome to fourth cuDF tutorial notebook! This is a practical example that utilizes cuDF and cuPy, geared primarily for new users. The purpose of this tutorial is to introduce new users to a data science processing pipeline using RAPIDS on real life datasets. We will be working on a data science problem: US Accidents Prediction. This is a countrywide car accident dataset, which covers 49 states of the USA. The accident data are collected from February 2016 to June 2020, using two APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road-networks. Currently, there are about 3.5 million accident records in this dataset. \n", "\n", "\n", "## What should I do?\n", "\n", "Given below is a complete data science preprocessing pipeline for the dataset using Pandas and Numpy libraries. Using the methods and techniques from the previous notebooks, you have to convert this pipeline to a a RAPIDS implementation, using CuDF and CuPy. Don't forget to time your code cells and compare the performance with this original code, to understand why we are using RAPIDS. If you get stuck in the middle, feel free to refer to this sample solution. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Here is the list of exercises in the lab where you need to modify code:\n", "- Exercise 1
Loading the dataset from a csv file and storing it in a cuDF dataframe.\n", "- Exercise 2<br>
Creating kernel functions to run the given function optimally on the GPU.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step is to download the dataset and put it in the data directory for use in this tutorial.\n", "Download the dataset [here](https://www.kaggle.com/sobhanmoosavi/us-accidents), and place it in the (host/data) folder. Now we will import the necessary libraries." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import os\n", "import cudf\n", "import numpy as np\n", "import cupy as cp\n", "import math\n", "np.random.seed(12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we need to load the dataset from the csv file into a cuDF dataframe for the preprocessing steps. If you need help, refer to the Getting Data In and Out module from this [notebook](01-Intro_to_cuDF.ipynb)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 547 ms, sys: 225 ms, total: 772 ms\n", "Wall time: 773 ms\n", "              ID    Source    TMC  Severity           Start_Time \\\n", "0            A-1  MapQuest  201.0         3  2016-02-08 05:46:00 \n", "1            A-2  MapQuest  201.0         2  2016-02-08 06:07:59 \n", "2            A-3  MapQuest  201.0         2  2016-02-08 06:49:27 \n", "3            A-4  MapQuest  201.0         3  2016-02-08 07:23:34 \n", "4            A-5  MapQuest  201.0         2  2016-02-08 07:39:07 \n", "...          ...       ...    ...       ...                  ... \n", "2916679  A-2916810     Bing            1  2020-04-15 15:50:00 \n", "2916680  A-2916811     Bing            1  2020-04-15 15:05:44 \n", "2916681  A-2916812     Bing            2  2020-04-15 16:06:28 \n", "2916682  A-2916813     Bing            1  2020-04-15 15:27:40 \n", "2916683  A-2916814     Bing            2  2020-04-15 15:25:20 \n", "\n", "                    End_Time  Start_Lat   Start_Lng   End_Lat    End_Lng \\\n", "0        2016-02-08 11:00:00  39.865147  -84.058723                      \n", "1        2016-02-08 06:37:59  39.928059  -82.831184                      \n", "2        2016-02-08 07:19:27  39.063148  -84.032608                      \n", "3        2016-02-08 07:53:34  39.747753  -84.205582                      \n", "4        2016-02-08 08:09:07  39.627781  -84.188354                      \n", "...                      ...        ...         ...       ...        ... \n", "2916679  2020-04-15 16:35:00  40.565540 -112.295990  40.56554 -112.29599 \n", "2916680  2020-04-15 15:50:44  40.653760 -111.910300  40.65376  -111.9103 \n", "2916681  2020-04-15 16:36:28  39.740420 -105.017800  39.74042  -105.0178 \n", "2916682  2020-04-15 16:12:40  40.183780 -111.646710  40.18378 -111.64671 \n", "2916683  2020-04-15 16:17:20  40.183780 -111.646710  40.17617 -111.64677 \n", "\n", "         ...  Roundabout  Station   Stop  Traffic_Calming  Traffic_Signal \\\n", "0        ...       False    False  False            False           False \n", "1        ...       False    False  False            False           False \n", "2        ...       False    False  False            False            True \n", "3        ...       False    False  False            False           False \n", "4        ...       False    False  False            False            True \n", "...      ...         ...      ...    ...              ...             ... \n", "2916679  ...       False    False  False            False            True \n", "2916680  ...       False    False  False            False            True \n", "2916681  ...       False    False  False            False           False \n", "2916682  ...       False    False  False            False           False \n", "2916683  ...                                                              \n", "\n", "         Turning_Loop  Sunrise_Sunset  Civil_Twilight  Nautical_Twilight \\\n", "0               False           Night           Night              Night \n", "1               False           Night           Night              Night \n", "2               False           Night           Night                Day \n", "3               False           Night             Day                Day \n", "4               False             Day             Day                Day \n", "...               ...             ...             ...                ... \n", "2916679         False             Day             Day                Day \n", "2916680         False             Day             Day                Day \n", "2916681         False             Day             Day                Day \n", "2916682         False             Day             Day                Day \n", "2916683                                                                  \n", "\n", "         Astronomical_Twilight \n", "0                        Night \n", "1                          Day \n", "2                          Day \n", "3                          Day \n", "4                          Day \n", "...                        ... 
\n", "2916679 Day \n", "2916680 Day \n", "2916681 Day \n", "2916682 Day \n", "2916683 \n", "\n", "[2916684 rows x 49 columns]\n" ] } ], "source": [ "%time df = cudf.read_csv('../../data/data.csv')\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "First we will analyse the data and observe patterns that can help us process the data better for feeding to the machine learning algorithms in the future. By using the describe, we will generate the descriptive statistics for all the columns. Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TMCSeverityStart_LatStart_LngEnd_LatEnd_LngDistance(mi)NumberTemperature(F)Wind_Chill(F)Humidity(%)Pressure(in)Visibility(mi)Wind_Speed(mph)Precipitation(in)
count2.478818e+062.916684e+062.916684e+062.916684e+06437866.000000437866.0000002.916684e+061.104392e+062.866833e+061.268002e+062.863532e+062.874585e+062.857484e+062.527224e+061.152141e+06
mean2.080226e+022.338357e+003.626797e+01-9.433088e+0137.117276-97.0856312.368010e-015.173121e+036.263615e+015.381072e+016.550242e+012.978332e+019.128722e+008.310882e+001.776500e-02
std2.076627e+015.228540e-014.830769e+001.678646e+014.74668618.2087331.523115e+001.015564e+041.844134e+012.401773e+012.253846e+017.520620e-012.817181e+005.222119e+002.057930e-01
min2.000000e+021.000000e+002.455527e+01-1.246238e+0224.587610-124.4974100.000000e+001.000000e+00-8.900000e+01-8.900000e+011.000000e+000.000000e+000.000000e+000.000000e+000.000000e+00
25%2.010000e+022.000000e+003.345090e+01-1.121182e+0233.928530-117.9371700.000000e+008.000000e+025.100000e+013.520000e+014.900000e+012.976000e+011.000000e+015.000000e+000.000000e+00
50%2.010000e+022.000000e+003.559464e+01-8.803175e+0137.560840-91.4127350.000000e+002.599000e+036.440000e+015.700000e+016.800000e+012.996000e+011.000000e+018.000000e+000.000000e+00
75%2.010000e+023.000000e+004.007000e+01-8.083621e+0140.717975-80.8118700.000000e+006.501000e+037.600000e+017.300000e+018.400000e+013.010000e+011.000000e+011.150000e+010.000000e+00
max4.060000e+024.000000e+004.900220e+01-6.711317e+0149.075000-67.1092423.336300e+029.904150e+051.670000e+021.150000e+021.000000e+025.774000e+011.400000e+029.840000e+022.500000e+01
\n", "
" ], "text/plain": [ " TMC Severity Start_Lat Start_Lng End_Lat \\\n", "count 2.478818e+06 2.916684e+06 2.916684e+06 2.916684e+06 437866.000000 \n", "mean 2.080226e+02 2.338357e+00 3.626797e+01 -9.433088e+01 37.117276 \n", "std 2.076627e+01 5.228540e-01 4.830769e+00 1.678646e+01 4.746686 \n", "min 2.000000e+02 1.000000e+00 2.455527e+01 -1.246238e+02 24.587610 \n", "25% 2.010000e+02 2.000000e+00 3.345090e+01 -1.121182e+02 33.928530 \n", "50% 2.010000e+02 2.000000e+00 3.559464e+01 -8.803175e+01 37.560840 \n", "75% 2.010000e+02 3.000000e+00 4.007000e+01 -8.083621e+01 40.717975 \n", "max 4.060000e+02 4.000000e+00 4.900220e+01 -6.711317e+01 49.075000 \n", "\n", " End_Lng Distance(mi) Number Temperature(F) \\\n", "count 437866.000000 2.916684e+06 1.104392e+06 2.866833e+06 \n", "mean -97.085631 2.368010e-01 5.173121e+03 6.263615e+01 \n", "std 18.208733 1.523115e+00 1.015564e+04 1.844134e+01 \n", "min -124.497410 0.000000e+00 1.000000e+00 -8.900000e+01 \n", "25% -117.937170 0.000000e+00 8.000000e+02 5.100000e+01 \n", "50% -91.412735 0.000000e+00 2.599000e+03 6.440000e+01 \n", "75% -80.811870 0.000000e+00 6.501000e+03 7.600000e+01 \n", "max -67.109242 3.336300e+02 9.904150e+05 1.670000e+02 \n", "\n", " Wind_Chill(F) Humidity(%) Pressure(in) Visibility(mi) \\\n", "count 1.268002e+06 2.863532e+06 2.874585e+06 2.857484e+06 \n", "mean 5.381072e+01 6.550242e+01 2.978332e+01 9.128722e+00 \n", "std 2.401773e+01 2.253846e+01 7.520620e-01 2.817181e+00 \n", "min -8.900000e+01 1.000000e+00 0.000000e+00 0.000000e+00 \n", "25% 3.520000e+01 4.900000e+01 2.976000e+01 1.000000e+01 \n", "50% 5.700000e+01 6.800000e+01 2.996000e+01 1.000000e+01 \n", "75% 7.300000e+01 8.400000e+01 3.010000e+01 1.000000e+01 \n", "max 1.150000e+02 1.000000e+02 5.774000e+01 1.400000e+02 \n", "\n", " Wind_Speed(mph) Precipitation(in) \n", "count 2.527224e+06 1.152141e+06 \n", "mean 8.310882e+00 1.776500e-02 \n", "std 5.222119e+00 2.057930e-01 \n", "min 0.000000e+00 0.000000e+00 \n", "25% 5.000000e+00 0.000000e+00 \n", "50% 8.000000e+00 0.000000e+00 \n", "75% 1.150000e+01 0.000000e+00 \n", "max 9.840000e+02 2.500000e+01 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will check the size of the dataset that is to be processed using the len function." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2916684" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will notice that the dataset has many rows and takes quite a lot of time to read from the file. As we go ahead with the preprocessing, computations will require more time to execute, and that's where the RAPIDS comes to the rescue!\n", "\n", "Now we use the info function to check the datatype of all the columns in the dataset." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 2916684 entries, 0 to 2916683\n", "Data columns (total 49 columns):\n", " # Column Dtype\n", "--- ------ -----\n", " 0 ID object\n", " 1 Source object\n", " 2 TMC float64\n", " 3 Severity int64\n", " 4 Start_Time object\n", " 5 End_Time object\n", " 6 Start_Lat float64\n", " 7 Start_Lng float64\n", " 8 End_Lat float64\n", " 9 End_Lng float64\n", " 10 Distance(mi) float64\n", " 11 Description object\n", " 12 Number float64\n", " 13 Street object\n", " 14 Side object\n", " 15 City object\n", " 16 County object\n", " 17 State object\n", " 18 Zipcode object\n", " 19 Country object\n", " 20 Timezone object\n", " 21 Airport_Code object\n", " 22 Weather_Timestamp object\n", " 23 Temperature(F) float64\n", " 24 Wind_Chill(F) float64\n", " 25 Humidity(%) float64\n", " 26 Pressure(in) float64\n", " 27 Visibility(mi) float64\n", " 28 Wind_Direction object\n", " 29 Wind_Speed(mph) float64\n", " 30 Precipitation(in) float64\n", " 31 Weather_Condition object\n", " 32 Amenity bool\n", " 33 Bump bool\n", " 34 Crossing bool\n", " 35 Give_Way bool\n", " 36 Junction object\n", " 37 No_Exit bool\n", " 38 Railway bool\n", " 39 Roundabout bool\n", " 40 Station bool\n", " 41 Stop bool\n", " 42 Traffic_Calming bool\n", " 43 Traffic_Signal bool\n", " 44 Turning_Loop bool\n", " 45 Sunrise_Sunset object\n", " 46 Civil_Twilight object\n", " 47 Nautical_Twilight object\n", " 48 Astronomical_Twilight object\n", "dtypes: bool(12), float64(14), int64(1), object(22)\n", "memory usage: 1.2+ GB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will also check the number of missing values in the dataset, so that we can drop or fill in the missing values." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ID 0\n", "Source 0\n", "TMC 437866\n", "Severity 0\n", "Start_Time 0\n", "End_Time 0\n", "Start_Lat 0\n", "Start_Lng 0\n", "End_Lat 2478818\n", "End_Lng 2478818\n", "Distance(mi) 0\n", "Description 1\n", "Number 1812292\n", "Street 0\n", "Side 0\n", "City 92\n", "County 0\n", "State 0\n", "Zipcode 544\n", "Country 0\n", "Timezone 2492\n", "Airport_Code 4882\n", "Weather_Timestamp 32683\n", "Temperature(F) 49851\n", "Wind_Chill(F) 1648682\n", "Humidity(%) 53152\n", "Pressure(in) 42099\n", "Visibility(mi) 59200\n", "Wind_Direction 43860\n", "Wind_Speed(mph) 389460\n", "Precipitation(in) 1764543\n", "Weather_Condition 59147\n", "Amenity 0\n", "Bump 0\n", "Crossing 0\n", "Give_Way 0\n", "Junction 0\n", "No_Exit 1\n", "Railway 1\n", "Roundabout 1\n", "Station 1\n", "Stop 1\n", "Traffic_Calming 1\n", "Traffic_Signal 1\n", "Turning_Loop 1\n", "Sunrise_Sunset 96\n", "Civil_Twilight 96\n", "Nautical_Twilight 96\n", "Astronomical_Twilight 96\n", "dtype: uint64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many columns with null values, and we will fill them with random values or the mean from the column. We will drop some text columns, as we are not doing any natural language processing right now, but feel free to explore them on your own. We will also drop the columns with too many Nans as filling them will throw our accuracy." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "df = df.drop(columns = ['ID','Start_Time','End_Time','Street','Side','Description','Number','City','Country','Zipcode','Timezone','Airport_Code','Weather_Timestamp','Wind_Chill(F)','Wind_Direction','Wind_Speed(mph)','Precipitation(in)'])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "#Here we are filling the TMC with mean.\n", "df['TMC'] = df['TMC'].fillna(df['TMC'].mean())\n", "df['End_Lat'] = df['End_Lat'].fillna(df['End_Lat'].mean())\n", "df['End_Lng'] = df['End_Lng'].fillna(df['End_Lng'].mean())\n", "df['Temperature(F)'] = df['Temperature(F)'].fillna(df['Temperature(F)'].mean())\n", "df['Humidity(%)'] = df['Humidity(%)'].fillna(df['Humidity(%)'].mean())\n", "df['Pressure(in)'] = df['Pressure(in)'].fillna(df['Pressure(in)'].mean())\n", "df['Visibility(mi)'] = df['Visibility(mi)'].fillna(df['Visibility(mi)'].mean())\n", "df['Humidity(%)'] = df['Humidity(%)'].fillna(df['Humidity(%)'].mean())\n", "df['Pressure(in)'] = df['Pressure(in)'].fillna(df['Pressure(in)'].mean())\n", "df['Visibility(mi)'] = df['Visibility(mi)'].fillna(df['Visibility(mi)'].mean())\n", "\n", "\n", "df['Weather_Condition'] = df['Weather_Condition'].fillna('Fair')\n", "df['Sunrise_Sunset'] = df['Sunrise_Sunset'].fillna('Day')\n", "df['Civil_Twilight'] = df['Civil_Twilight'].fillna('Day')\n", "df['Nautical_Twilight'] = df['Nautical_Twilight'].fillna('Day')\n", "df['Astronomical_Twilight'] = df['Astronomical_Twilight'].fillna('Day')\n", "df['Weather_Condition'] = df['Weather_Condition'].fillna('Fair')\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Source 0\n", "TMC 0\n", "Severity 0\n", "Start_Lat 0\n", "Start_Lng 0\n", "End_Lat 0\n", "End_Lng 0\n", "Distance(mi) 0\n", "County 0\n", "State 0\n", "Temperature(F) 0\n", "Humidity(%) 0\n", "Pressure(in) 0\n", "Visibility(mi) 0\n", "Weather_Condition 0\n", "Amenity 0\n", "Bump 0\n", "Crossing 0\n", "Give_Way 0\n", "Junction 0\n", "No_Exit 0\n", "Railway 0\n", "Roundabout 0\n", "Station 0\n", "Stop 0\n", "Traffic_Calming 0\n", "Traffic_Signal 0\n", "Turning_Loop 0\n", "Sunrise_Sunset 0\n", "Civil_Twilight 0\n", "Nautical_Twilight 0\n", "Astronomical_Twilight 0\n", "dtype: uint64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(df.tail(1).index,inplace=True) \n", "df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now all the columns contain no Nan values and we can go ahead with the preprocessing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " \n", "As you have observed in the dataset we have the start and end coordinates,so let us apply Haversine distance formula to get the accident coverage distance. Take note of how these functions use the row-wise operations, something that we have learnt before. If you need help while creating the user defined functions refer to this [notebook](02-Intro_to_cuDF_UDFs.ipynb)." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "from math import cos, sin, asin, sqrt, pi, atan2\n", "from numba import cuda" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def haversine_distance_kernel(Start_Lat, Start_Lng, End_Lat, End_Lng, out):\n", " \n", " for i, (x_1, y_1, x_2, y_2) in enumerate(zip(Start_Lat, Start_Lng, End_Lat, End_Lng)):\n", " \n", "\n", " x_1 = pi/180 * x_1\n", " y_1 = pi/180 * y_1\n", " x_2 = pi/180 * x_2\n", " y_2 = pi/180 * y_2\n", " \n", " dlon = y_2 - y_1\n", " dlat = x_2 - x_1\n", " a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2\n", " \n", " c = 2 * asin(sqrt(a)) \n", " r = 6371 # Radius of earth in kilometers\n", " \n", " out[i] = c * r" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 617 ms, sys: 36.2 ms, total: 654 ms\n", "Wall time: 655 ms\n" ] } ], "source": [ "%%time\n", "\n", "df = df.apply_rows(haversine_distance_kernel,\n", " incols=['Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng'],\n", " outcols=dict(out=np.float64),\n", " kwargs=dict())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wow! The code segment that previously took 7 minutes to compute, now gets executed in less than a second! " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def haversine_distance_kernel(Start_Lat, Start_Lng, End_Lat, End_Lng, out):\n", " \n", " for i, (x_1, y_1, x_2, y_2) in enumerate(zip(Start_Lat, Start_Lng, End_Lat, End_Lng)):\n", " \n", "\n", " x_1 = pi/180 * x_1\n", " y_1 = pi/180 * y_1\n", " x_2 = pi/180 * x_2\n", " y_2 = pi/180 * y_2\n", " \n", " dlon = y_2 - y_1\n", " dlat = x_2 - x_1\n", " a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2\n", " \n", " c = 2 * asin(sqrt(a)) \n", " r = 6371 # Radius of earth in kilometers\n", " \n", " out[i] = c * r" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 455 ms, sys: 17 ms, total: 472 ms\n", "Wall time: 473 ms\n" ] } ], "source": [ "%%time\n", "\n", "outdf = df.apply_chunks(haversine_distance_kernel,\n", " incols=['Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng'],\n", " outcols=dict(out=np.float64),\n", " kwargs=dict(),\n", " chunks=8,\n", " tpb=8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This kernel also took less than a second. The difference is merely the control we have over the execution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the dataframe in a csv for future use, and make sure you refer to our sample solution and compared your code's performance with it." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SourceTMCSeverityStart_LatStart_LngEnd_LatEnd_LngDistance(mi)CountyState...StationStopTraffic_CalmingTraffic_SignalTurning_LoopSunrise_SunsetCivil_TwilightNautical_TwilightAstronomical_Twilightout
0MapQuest201.0339.865147-84.05872337.117276-97.0856310.01MontgomeryOH...FalseFalseFalseFalseFalseNightNightNightNight1172.997367
1MapQuest201.0239.928059-82.83118437.117276-97.0856310.01FranklinOH...FalseFalseFalseFalseFalseNightNightNightDay1277.283911
2MapQuest201.0239.063148-84.03260837.117276-97.0856310.01ClermontOH...FalseFalseFalseTrueFalseNightNightDayDay1161.565094
3MapQuest201.0339.747753-84.20558237.117276-97.0856310.01MontgomeryOH...FalseFalseFalseFalseFalseNightDayDayDay1158.238470
4MapQuest201.0239.627781-84.18835437.117276-97.0856310.01MontgomeryOH...FalseFalseFalseTrueFalseDayDayDayDay1157.325865
\n", "

5 rows × 33 columns

\n", "
" ], "text/plain": [ " Source TMC Severity Start_Lat Start_Lng End_Lat End_Lng \\\n", "0 MapQuest 201.0 3 39.865147 -84.058723 37.117276 -97.085631 \n", "1 MapQuest 201.0 2 39.928059 -82.831184 37.117276 -97.085631 \n", "2 MapQuest 201.0 2 39.063148 -84.032608 37.117276 -97.085631 \n", "3 MapQuest 201.0 3 39.747753 -84.205582 37.117276 -97.085631 \n", "4 MapQuest 201.0 2 39.627781 -84.188354 37.117276 -97.085631 \n", "\n", " Distance(mi) County State ... Station Stop Traffic_Calming \\\n", "0 0.01 Montgomery OH ... False False False \n", "1 0.01 Franklin OH ... False False False \n", "2 0.01 Clermont OH ... False False False \n", "3 0.01 Montgomery OH ... False False False \n", "4 0.01 Montgomery OH ... False False False \n", "\n", " Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight \\\n", "0 False False Night Night \n", "1 False False Night Night \n", "2 True False Night Night \n", "3 False False Night Day \n", "4 True False Day Day \n", "\n", " Nautical_Twilight Astronomical_Twilight out \n", "0 Night Night 1172.997367 \n", "1 Night Day 1277.283911 \n", "2 Day Day 1161.565094 \n", "3 Day Day 1158.238470 \n", "4 Day Day 1157.325865 \n", "\n", "[5 rows x 33 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "df = df.dropna()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "df.to_csv(\"../../data/data_proc.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion \n", "\n", "Thus we have successfully used CuDF and CuPy to process the accidents dataset, and converted the data to a form more suitable to apply machine learning algorithms. In the extra labs for future labs in CuML we will be using this processed dataset. You must have observed the parallels between the RAPIDS pipeline and traditional pipeline while writing your code. Try to experiment with the processing and making your code as efficient as possible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# References\n", "\n", "- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, 2019.\n", "\n", "- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. \"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.\" In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.\n", "\n", "- If you need to refer to the dataset, you can download it [here](https://www.kaggle.com/sobhanmoosavi/us-accidents)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\"Creative

\n", "\n", "- This dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Licensing\n", " \n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Previous Notebook](03-Cudf_Exercise.ipynb)\n", "     \n", "     \n", "     \n", "     \n", "[1](01-Intro_to_cuDF.ipynb)\n", "[2](02-Intro_to_cuDF_UDFs.ipynb)\n", "[3](03-Cudf_Exercise.ipynb)\n", "[4]\n", "     \n", "     \n", "     \n", "     \n", "\n", "\n", "     \n", "     \n", "     \n", "     \n", "     \n", "   \n", "[Home Page](../START_HERE.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 4 }