{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&ensp;\n",
    "[Home Page](../START_HERE.ipynb)\n",
    "\n",
    "[Previous Notebook](03_CuML_Exercise.ipynb)\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "[1](01-LinearRegression-Hyperparam.ipynb)\n",
    "[2](03_CuML_Exercise.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CuML Exercise - Solution\n",
    "Scikit-Learn is an incredibly powerful toolkit that allows data scientists to quickly build models from their data, and it one of the most common and useful tools in the Python data science ecosystem. cuML is the RAPIDS library that implements similar machine learning algorithms that use CUDA to run on GPUs, with an API that mirrors the Scikit-learn one as much as possible.\n",
    "\n",
    "In this notebook we present a small exercise for new users to experiment with CuML and apply their knowledge on a real world machine learning dataset. We will be working on the Car Accidents dataset that we started preprocessing in the CuDF tutorial. This is a countrywide car accident dataset, which covers 49 states of the USA. The accident data are collected from February 2016 to June 2020, using two APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road-networks. Currently, there are about 3.5 million accident records in this dataset. If you skipped that tutorial, you can download the processed dataset here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Challenge\n",
    "\n",
    "We begin by perfoming some data manipulation using Scikit learn preprocessing and removing any class imbalance. The actual exercise begins <a href= '#exercise'> here</a>, where we have provided the implementation of 4 different Scikit-learn models and you have to convert them to CuML and evaluate the performance difference."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first step is downloading the dataset and putting it in the data directory, for using in this tutorial. Download the dataset here, and place it in (host/data) folder. Now we will import the necessary libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NumPy Version: 1.19.2\n",
      "Scikit-Learn Version: 0.23.1\n"
     ]
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np; print('NumPy Version:', np.__version__)\n",
    "%matplotlib inline\n",
    "import sys\n",
    "import sklearn; print('Scikit-Learn Version:', sklearn.__version__)\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "from sklearn import preprocessing \n",
    "import pandas as pd\n",
    "from sklearn.utils import resample\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.feature_selection import SelectFromModel\n",
    "from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc\n",
    "from sklearn.preprocessing import OrdinalEncoder, StandardScaler\n",
    "import cudf\n",
    "import cupy\n",
    "\n",
    "# import for visualization\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# import for model building\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from cuml.linear_model import MBSGDRegressor as cumlSGD\n",
    "from sklearn.linear_model import SGDRegressor as skSGD\n",
    "from sklearn.datasets import make_regression\n",
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "from cuml.ensemble import RandomForestClassifier as curfc\n",
    "from sklearn.ensemble import RandomForestClassifier as skrfc\n",
    "\n",
    "from cuml import make_regression\n",
    "from cuml.linear_model import LinearRegression as cuLinearRegression\n",
    "from cuml.metrics.regression import r2_score\n",
    "from sklearn.linear_model import LinearRegression as skLinearRegression\n",
    "\n",
    "from cuml.neighbors import KNeighborsClassifier as KNeighborsC\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from cuml.linear_model import MBSGDClassifier as cumlMBSGDClassifier\n",
    "from sklearn.linear_model import SGDClassifier\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from cuml import Ridge\n",
    "from cuml.linear_model import Ridge\n",
    "from sklearn.linear_model import Ridge\n",
    "from cuml import LogisticRegression\n",
    "from sklearn.linear_model import LogisticRegression as skLogistic\n",
    "from cuml.linear_model import ElasticNet\n",
    "from sklearn import linear_model\n",
    "\n",
    "from cuml.linear_model import Lasso\n",
    "from cuml.solvers import SGD as cumlSGD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's read the dataframe from the csv which was processed in the previous tutorial and stored in the data folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 43 ms, sys: 10.1 ms, total: 53.1 ms\n",
      "Wall time: 52.5 ms\n",
      "       Unnamed: 0  Source    TMC  Severity  Start_Lat   Start_Lng    End_Lat  \\\n",
      "0               0       1  201.0         3  39.865147  -84.058723  37.557578   \n",
      "1               1       1  201.0         2  39.928059  -82.831184  37.557578   \n",
      "2               2       1  201.0         2  39.063148  -84.032608  37.557578   \n",
      "3               3       1  201.0         3  39.747753  -84.205582  37.557578   \n",
      "4               4       1  201.0         2  39.627781  -84.188354  37.557578   \n",
      "...           ...     ...    ...       ...        ...         ...        ...   \n",
      "17317       17317       1  201.0         3  37.396164 -121.907578  37.557578   \n",
      "17318       17318       1  201.0         3  37.825649 -122.304092  37.557578   \n",
      "17319       17319       1  201.0         2  36.979454 -121.909035  37.557578   \n",
      "17320       17320       1  201.0         2  37.314030 -121.827065  37.557578   \n",
      "17321       17321       1  201.0         3  37.758404 -122.212173  37.557578   \n",
      "\n",
      "          End_Lng  Distance(mi)       County  ... Station  Stop  \\\n",
      "0     -100.455981          0.01   Montgomery  ...     0.0   0.0   \n",
      "1     -100.455981          0.01     Franklin  ...     0.0   0.0   \n",
      "2     -100.455981          0.01     Clermont  ...     0.0   0.0   \n",
      "3     -100.455981          0.01   Montgomery  ...     0.0   0.0   \n",
      "4     -100.455981          0.01   Montgomery  ...     0.0   0.0   \n",
      "...           ...           ...          ...  ...     ...   ...   \n",
      "17317 -100.455981          0.01  Santa Clara  ...     0.0   0.0   \n",
      "17318 -100.455981          0.01      Alameda  ...     0.0   0.0   \n",
      "17319 -100.455981          0.00   Santa Cruz  ...     0.0   0.0   \n",
      "17320 -100.455981          0.01  Santa Clara  ...     0.0   0.0   \n",
      "17321 -100.455981          0.01      Alameda  ...     NaN   NaN   \n",
      "\n",
      "       Traffic_Calming  Traffic_Signal  Turning_Loop Sunrise_Sunset  \\\n",
      "0                  0.0             0.0           0.0            0.0   \n",
      "1                  0.0             0.0           0.0            0.0   \n",
      "2                  0.0             0.0           0.0            0.0   \n",
      "3                  0.0             0.0           0.0            0.0   \n",
      "4                  0.0             0.0           0.0            1.0   \n",
      "...                ...             ...           ...            ...   \n",
      "17317              0.0             0.0           0.0            1.0   \n",
      "17318              0.0             0.0           0.0            1.0   \n",
      "17319              0.0             0.0           0.0            1.0   \n",
      "17320              0.0             0.0           0.0            1.0   \n",
      "17321              NaN             NaN           NaN            NaN   \n",
      "\n",
      "       Civil_Twilight  Nautical_Twilight  Astronomical_Twilight  cov_distance  \n",
      "0                 0.0                0.0                    0.0   1443.524390  \n",
      "1                 0.0                0.0                    1.0   1548.467903  \n",
      "2                 0.0                1.0                    1.0   1440.697621  \n",
      "3                 1.0                1.0                    1.0   1429.927497  \n",
      "4                 1.0                1.0                    1.0   1430.383177  \n",
      "...               ...                ...                    ...           ...  \n",
      "17317             1.0                1.0                    1.0   1888.935551  \n",
      "17318             1.0                1.0                    1.0   1918.251042  \n",
      "17319             1.0                1.0                    1.0   1895.341155  \n",
      "17320             1.0                1.0                    1.0   1883.025767  \n",
      "17321             NaN                NaN                    NaN           NaN  \n",
      "\n",
      "[17322 rows x 34 columns]\n"
     ]
    }
   ],
   "source": [
    "%time df = pd.read_csv('../../data/data_proc.csv')\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Drop the unnecessary columns which got added while reading the file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.drop(columns = [\"Unnamed: 0\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Observe the dataset by printing the first 5 rows using the head function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Source</th>\n",
       "      <th>TMC</th>\n",
       "      <th>Severity</th>\n",
       "      <th>Start_Lat</th>\n",
       "      <th>Start_Lng</th>\n",
       "      <th>End_Lat</th>\n",
       "      <th>End_Lng</th>\n",
       "      <th>Distance(mi)</th>\n",
       "      <th>County</th>\n",
       "      <th>State</th>\n",
       "      <th>...</th>\n",
       "      <th>Station</th>\n",
       "      <th>Stop</th>\n",
       "      <th>Traffic_Calming</th>\n",
       "      <th>Traffic_Signal</th>\n",
       "      <th>Turning_Loop</th>\n",
       "      <th>Sunrise_Sunset</th>\n",
       "      <th>Civil_Twilight</th>\n",
       "      <th>Nautical_Twilight</th>\n",
       "      <th>Astronomical_Twilight</th>\n",
       "      <th>cov_distance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>201.0</td>\n",
       "      <td>3</td>\n",
       "      <td>39.865147</td>\n",
       "      <td>-84.058723</td>\n",
       "      <td>37.557578</td>\n",
       "      <td>-100.455981</td>\n",
       "      <td>0.01</td>\n",
       "      <td>Montgomery</td>\n",
       "      <td>OH</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1443.524390</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>201.0</td>\n",
       "      <td>2</td>\n",
       "      <td>39.928059</td>\n",
       "      <td>-82.831184</td>\n",
       "      <td>37.557578</td>\n",
       "      <td>-100.455981</td>\n",
       "      <td>0.01</td>\n",
       "      <td>Franklin</td>\n",
       "      <td>OH</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1548.467903</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>201.0</td>\n",
       "      <td>2</td>\n",
       "      <td>39.063148</td>\n",
       "      <td>-84.032608</td>\n",
       "      <td>37.557578</td>\n",
       "      <td>-100.455981</td>\n",
       "      <td>0.01</td>\n",
       "      <td>Clermont</td>\n",
       "      <td>OH</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1440.697621</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>201.0</td>\n",
       "      <td>3</td>\n",
       "      <td>39.747753</td>\n",
       "      <td>-84.205582</td>\n",
       "      <td>37.557578</td>\n",
       "      <td>-100.455981</td>\n",
       "      <td>0.01</td>\n",
       "      <td>Montgomery</td>\n",
       "      <td>OH</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1429.927497</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>201.0</td>\n",
       "      <td>2</td>\n",
       "      <td>39.627781</td>\n",
       "      <td>-84.188354</td>\n",
       "      <td>37.557578</td>\n",
       "      <td>-100.455981</td>\n",
       "      <td>0.01</td>\n",
       "      <td>Montgomery</td>\n",
       "      <td>OH</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1430.383177</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 33 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   Source    TMC  Severity  Start_Lat  Start_Lng    End_Lat     End_Lng  \\\n",
       "0       1  201.0         3  39.865147 -84.058723  37.557578 -100.455981   \n",
       "1       1  201.0         2  39.928059 -82.831184  37.557578 -100.455981   \n",
       "2       1  201.0         2  39.063148 -84.032608  37.557578 -100.455981   \n",
       "3       1  201.0         3  39.747753 -84.205582  37.557578 -100.455981   \n",
       "4       1  201.0         2  39.627781 -84.188354  37.557578 -100.455981   \n",
       "\n",
       "   Distance(mi)      County State  ...  Station  Stop  Traffic_Calming  \\\n",
       "0          0.01  Montgomery    OH  ...      0.0   0.0              0.0   \n",
       "1          0.01    Franklin    OH  ...      0.0   0.0              0.0   \n",
       "2          0.01    Clermont    OH  ...      0.0   0.0              0.0   \n",
       "3          0.01  Montgomery    OH  ...      0.0   0.0              0.0   \n",
       "4          0.01  Montgomery    OH  ...      0.0   0.0              0.0   \n",
       "\n",
       "   Traffic_Signal Turning_Loop  Sunrise_Sunset  Civil_Twilight  \\\n",
       "0             0.0          0.0             0.0             0.0   \n",
       "1             0.0          0.0             0.0             0.0   \n",
       "2             0.0          0.0             0.0             0.0   \n",
       "3             0.0          0.0             0.0             1.0   \n",
       "4             0.0          0.0             1.0             1.0   \n",
       "\n",
       "   Nautical_Twilight  Astronomical_Twilight  cov_distance  \n",
       "0                0.0                    0.0   1443.524390  \n",
       "1                0.0                    1.0   1548.467903  \n",
       "2                1.0                    1.0   1440.697621  \n",
       "3                1.0                    1.0   1429.927497  \n",
       "4                1.0                    1.0   1430.383177  \n",
       "\n",
       "[5 rows x 33 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Drop any null values that may be present."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are continuing a bit of the preprocessing that is easier using Scikit-learn and can use Label encoding to convert the labels to numbers without increasing the dimensions of our dataset. Label encoder converts the string categorical values to numbers. Eg. [Chicago, New York, Mumbai] would get encoded to [0, 1, 2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 14.6 ms, sys: 515 µs, total: 15.1 ms\n",
      "Wall time: 14.9 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "#link to label encoder\n",
    "label_encoder = preprocessing.LabelEncoder() \n",
    "df['County']= label_encoder.fit_transform(df['County']) \n",
    "df['State']= label_encoder.fit_transform(df['State'])\n",
    "df['Weather_Condition']= label_encoder.fit_transform(df['Weather_Condition'])\n",
    "\n",
    "df['Source'] = label_encoder.fit_transform(df['Source'])\n",
    "\n",
    "df['Sunrise_Sunset'] = label_encoder.fit_transform(df['Sunrise_Sunset'])\n",
    "df['Civil_Twilight'] = label_encoder.fit_transform(df['Civil_Twilight'])\n",
    "df['Nautical_Twilight'] = label_encoder.fit_transform(df['Nautical_Twilight'])\n",
    "df['Astronomical_Twilight'] = label_encoder.fit_transform(df['Astronomical_Twilight'])\n",
    "\n",
    "df['Amenity'] = label_encoder.fit_transform(df['Amenity'])\n",
    "df['Bump'] =label_encoder.fit_transform(df['Bump'])\n",
    "df['Crossing'] = label_encoder.fit_transform(df['Crossing'])\n",
    "df['Give_Way'] = label_encoder.fit_transform(df['Give_Way'])\n",
    "df['Junction'] =label_encoder.fit_transform(df['Junction'])\n",
    "df['No_Exit'] = label_encoder.fit_transform(df['No_Exit'])\n",
    "df['Railway'] = label_encoder.fit_transform(df['Railway'])\n",
    "df['Roundabout'] = label_encoder.fit_transform(df['Roundabout'])\n",
    "\n",
    "df['Station'] = label_encoder.fit_transform(df['Station'])\n",
    "df['Stop'] = label_encoder.fit_transform(df['Stop'])\n",
    "df['Traffic_Calming'] = label_encoder.fit_transform(df['Traffic_Calming'])\n",
    "df['Traffic_Signal'] = label_encoder.fit_transform(df['Traffic_Signal'])\n",
    "df['Turning_Loop'] =label_encoder.fit_transform(df['Turning_Loop'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's continue with exploring the dataset. We can check how the values are distributed in different categories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Severity'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The distribution across all the severities is imbalanced and Machine Learning algorithms tend to produce unsatisfactory classifiers when faced with imbalanced datasets.So we will convert this dataset to the necessary form by performing class balancing using up sampling. Up-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal.\n",
    "\n",
    "- First, we'll separate observations from each class into different DataFrames.\n",
    "- Next, we'll resample the minority class with replacement, setting the number of samples to match that of the majority class.\n",
    "- Finally, we'll combine the up-sampled minority class DataFrame with the original majority class DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 30.1 ms, sys: 4.45 ms, total: 34.6 ms\n",
      "Wall time: 34.1 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Severity\n",
       "1    10584\n",
       "2    10584\n",
       "3    10584\n",
       "4    10584\n",
       "Name: Severity, dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "# Class Balancing | Using Up Sampling\n",
    "\n",
    "# Separate majority and minority classes\n",
    "df_s1 = df[df['Severity']==1]\n",
    "df_s2 = df[df['Severity']==2]\n",
    "df_s3 = df[df['Severity']==3]\n",
    "df_s4 = df[df['Severity']==4]\n",
    "\n",
    "count = max(df_s1.count()[0], df_s2.count()[0], df_s3.count()[0], df_s4.count()[0])\n",
    "\n",
    "# Upsample minority class\n",
    "df_s1 = resample(df_s1, replace=df_s1.count()[0]<count, n_samples=count, random_state=42)\n",
    "df_s2 = resample(df_s2, replace=df_s2.count()[0]<count, n_samples=count, random_state=42)\n",
    "df_s3 = resample(df_s3, replace=df_s3.count()[0]<count, n_samples=count, random_state=42)\n",
    "df_s4 = resample(df_s4, replace=df_s4.count()[0]<count, n_samples=count, random_state=42)\n",
    " \n",
    "# Combine majority class with upsampled minority class\n",
    "df = pd.concat([df_s1, df_s2, df_s3, df_s4])\n",
    " \n",
    "# Display new class counts\n",
    "df.groupby(by='Severity')['Severity'].count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will separate our target data column from the other columns and encode categorical features present in the dataframe as an integer array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 7.11 ms, sys: 5.16 ms, total: 12.3 ms\n",
      "Wall time: 11.5 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# Set the target for the prediction\n",
    "target='Severity' \n",
    "cols = df.select_dtypes(include='object').columns\n",
    "\n",
    "# set X and y\n",
    "y = df[target]\n",
    "X = df.drop(target, axis=1)\n",
    "\n",
    "# Create the encoder.\n",
    "encoder = OrdinalEncoder()\n",
    "X[cols] = encoder.fit_transform(X[cols])\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will use the train test split function of scikit learn to create the train and test datasets. The train-test split is a technique for evaluating the performance of a machine learning algorithm. The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.\n",
    "\n",
    "- Train Dataset: Used to fit the machine learning model.\n",
    "- Test Dataset: Used to evaluate the fit machine learning model.\n",
    "\n",
    "\n",
    "The objective is to estimate the performance of the machine learning model on new data: data not used to train the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y,  test_size = 0.3, random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the data is in the required format and ready to be fed to our model. Now we will convert the dataframe to a CuDF dataframe. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 855 ms, sys: 445 ms, total: 1.3 s\n",
      "Wall time: 1.31 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "#Convert the data to CuDF dataframes here\n",
    "X_cudf_train = cudf.DataFrame.from_pandas(X_train)\n",
    "X_cudf_test = cudf.DataFrame.from_pandas(X_test)\n",
    "\n",
    "y_cudf_train = cudf.Series(y_train.values)\n",
    "y_cudf_test = cudf.Series(y_test.values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id= 'exercise'></a>\n",
    "\n",
    "#### Your exercise begins here. Provided below are 4 ML models in Scikit-learn, which you have to convert to CuML and evaluate the performance difference.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Logistic Regression\n",
    "\n",
    "Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable.\n",
    "\n",
    "## Scikit-learn\n",
    "\n",
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 20 s, sys: 50.9 s, total: 1min 10s\n",
      "Wall time: 1.9 s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/conda/envs/rapids/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "LogisticRegression()"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "clf = skLogistic()\n",
    "clf.fit(X_train, y_train)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.501299110306275\n",
      "CPU times: user 81.7 ms, sys: 193 ms, total: 275 ms\n",
      "Wall time: 7.28 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "print(clf.score(X_test, y_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='ex4'> Implement the code above in CuML</a><br>\n",
    "\n",
    "## CuML\n",
    "\n",
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[E] [23:59:24.749199] L-BFGS line search failed\n",
      "CPU times: user 686 ms, sys: 2.11 s, total: 2.8 s\n",
      "Wall time: 74 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "LogisticRegression(penalty='l2', tol=0.0001, C=1.0, fit_intercept=True, max_iter=1000, linesearch_max_iter=50, verbose=4, l1_ratio=None, solver='qn', handle=<cuml.raft.common.handle.Handle object at 0x7fd97c0ee1f0>, output_type='cudf')"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "reg = LogisticRegression()\n",
    "reg.fit(X_cudf_train,y_cudf_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.24864183366298676\n",
      "CPU times: user 171 ms, sys: 523 ms, total: 695 ms\n",
      "Wall time: 18.4 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "print(reg.score(X_cudf_test, y_cudf_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Nearest Neighbours Classifier\n",
    "\n",
    "NearestNeighbors implements unsupervised nearest neighbors learning. It acts as a uniform interface to three different nearest neighbors algorithms: BallTree, KDTree, and a brute-force algorithm based on routines in sklearn.metrics.pairwise. The choice of neighbors search algorithm is controlled through the keyword 'algorithm', which must be one of ['auto', 'ball_tree', 'kd_tree', 'brute']. When the default value 'auto' is passed, the algorithm attempts to determine the best approach from the training data.\n",
    "\n",
    "## Scikit-learn\n",
    "\n",
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 522 ms, sys: 4.43 ms, total: 527 ms\n",
      "Wall time: 526 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "KNeighborsClassifier(n_neighbors=3)"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "neigh = KNeighborsClassifier(n_neighbors=3)\n",
    "neigh.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.8876466419966932\n",
      "CPU times: user 1.15 s, sys: 5.22 ms, total: 1.15 s\n",
      "Wall time: 1.15 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "print(neigh.score(X_test, y_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='ex7'> Implement the code above in CuML</a><br>\n",
    "\n",
    "## CuML\n",
    "\n",
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 14.5 ms, sys: 2.39 ms, total: 16.8 ms\n",
      "Wall time: 16 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "KNeighborsClassifier(weights='uniform')"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "knn = KNeighborsC(n_neighbors=10)\n",
    "knn.fit(X_cudf_train, y_cudf_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.8689079880714417\n",
      "CPU times: user 22.1 ms, sys: 126 ms, total: 148 ms\n",
      "Wall time: 148 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "print(knn.score(X_cudf_test, y_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## ElasticNet Classifier\n",
    "\n",
    "Elastic Net first emerged as a result of critique on lasso, whose variable selection can be too dependent on data and thus unstable. The solution is to combine the penalties of ridge regression and lasso to get the best of both worlds. Ridge Regression, which penalizes sum of squared coefficients (L2 penalty). Lasso Regression, which penalizes the sum of absolute values of the coefficients (L1 penalty).\n",
    "\n",
    "### Scikit-learn model\n",
    "\n",
    "#### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 163 ms, sys: 62.1 ms, total: 225 ms\n",
      "Wall time: 226 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "ElasticNet(alpha=1.0, l1_ratio=0.5, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, selection='cyclic', handle=<cuml.raft.common.handle.Handle object at 0x7fd97c163210>, output_type='numpy', verbose=4)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "regr = ElasticNet()\n",
    "regr.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.22519596677633613\n",
      "CPU times: user 5.97 ms, sys: 2.98 ms, total: 8.96 ms\n",
      "Wall time: 8.11 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "X_test = X_test.astype(np.float64)\n",
    "y_test = y_test.astype(np.float64)\n",
    "print(regr.score(X_test,y_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='ex2'> Implement the code above in CuML</a><br>\n",
    "\n",
    "### CuML model\n",
    "\n",
    "#### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 126 ms, sys: 3.94 ms, total: 130 ms\n",
      "Wall time: 129 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "ElasticNet(alpha=1.0, l1_ratio=0.5, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, selection='cyclic', handle=<cuml.raft.common.handle.Handle object at 0x7fd97c152b70>, output_type='cudf', verbose=4)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "enet = ElasticNet()\n",
    "\n",
    "enet.fit(X_cudf_train, y_cudf_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.22519596677633613\n",
      "CPU times: user 6.12 ms, sys: 2.09 ms, total: 8.21 ms\n",
      "Wall time: 7.49 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "X_cudf_test = X_cudf_test.astype(np.float64)\n",
    "y_cudf_test = y_cudf_test.astype(np.float64)\n",
    "print(enet.score(X_cudf_test, y_cudf_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CONCLUSION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compare the performance of our solution!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Algorithm     | Implementation | Accuracy      | Time | Algorithm     | Implementation | Accuracy      | Time |\n",
    "| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |\n",
    "| Logistic Regression     | Scikit-learn        | 0.5     | 1.9 s       | Logistic Regression   | CuML       | 0.248     | 74 ms       |\n",
    "| Nearest Neighbours Classifier     | Scikit-learn       | 0.88      | 526 ms       | Nearest Neighbours Classifier     | CuML        | 0.86     | 6 ms     |\n",
    "| ElasticNet    | Scikit-learn       | 0.225      | 226 ms      | ElasticNet      | CuML      | 0.225      | 129 ms      |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Thus we can observe that for most cases, the CuML implementation is reducing the computation time by 10 or even upto 1000 times. It is interesting to note that in some cases where the Scikit-learn model failed to converge, CuML was able to converge within record time and provide another accuracy. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are some reasons why Nearest Neighbours could be working so well for our dataset:\n",
    "- Flexible to feature/distance choices\n",
    "- Naturally handles multi-class cases\n",
    "- Can do well in practice with enough representative data\n",
    "\n",
    "Wow! This was an interesting exercise. We hope you enjoyed applying your machine learning skills and appreciated the GPU boost provided by RAPIDS. CuML supports many ML models which can provide interesting results on this dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# References\n",
    "\n",
    "- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, 2019.\n",
    "\n",
    "- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. \"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.\" In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.\n",
    "\n",
    "- If you need to refer to the dataset, you can download it [here](https://www.kaggle.com/sobhanmoosavi/us-accidents)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\"><img alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png\" /></a></center><br />This dataset is licensed under a <a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licensing\n",
    "  \n",
    "This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Previous Notebook](03_CuML_Exercise.ipynb)\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "[1](01-LinearRegression-Hyperparam.ipynb)\n",
    "[2](03_CuML_Exercise.ipynb)\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "\n",
    "\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&emsp;&emsp;&emsp;\n",
    "&emsp;&emsp;&ensp;\n",
    "[Home Page](../START_HERE.ipynb)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}