{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ " \n", " \n", " \n", " \n", " \n", " \n", "[Home Page](../START_HERE.ipynb)\n", "\n", "[Previous Notebook](03_CuML_Exercise.ipynb)\n", " \n", " \n", " \n", " \n", "[1](01-LinearRegression-Hyperparam.ipynb)\n", "[2](03_CuML_Exercise.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# CuML Exercise - Solution\n", "Scikit-Learn is an incredibly powerful toolkit that allows data scientists to quickly build models from their data, and it one of the most common and useful tools in the Python data science ecosystem. cuML is the RAPIDS library that implements similar machine learning algorithms that use CUDA to run on GPUs, with an API that mirrors the Scikit-learn one as much as possible.\n", "\n", "In this notebook we present a small exercise for new users to experiment with CuML and apply their knowledge on a real world machine learning dataset. We will be working on the Car Accidents dataset that we started preprocessing in the CuDF tutorial. This is a countrywide car accident dataset, which covers 49 states of the USA. The accident data are collected from February 2016 to June 2020, using two APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road-networks. Currently, there are about 3.5 million accident records in this dataset. If you skipped that tutorial, you can download the processed dataset here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Challenge\n", "\n", "We begin by perfoming some data manipulation using Scikit learn preprocessing and removing any class imbalance. The actual exercise begins here, where we have provided the implementation of 4 different Scikit-learn models and you have to convert them to CuML and evaluate the performance difference." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step is downloading the dataset and putting it in the data directory, for using in this tutorial. Download the dataset here, and place it in (host/data) folder. Now we will import the necessary libraries." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NumPy Version: 1.19.2\n", "Scikit-Learn Version: 0.23.1\n" ] } ], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np; print('NumPy Version:', np.__version__)\n", "%matplotlib inline\n", "import sys\n", "import sklearn; print('Scikit-Learn Version:', sklearn.__version__)\n", "from sklearn.linear_model import LinearRegression\n", "\n", "from sklearn import preprocessing \n", "import pandas as pd\n", "from sklearn.utils import resample\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.feature_selection import SelectFromModel\n", "from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc\n", "from sklearn.preprocessing import OrdinalEncoder, StandardScaler\n", "import cudf\n", "import cupy\n", "\n", "# import for visualization\n", "import matplotlib.pyplot as plt\n", "\n", "# import for model building\n", "from sklearn.svm import SVC\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from cuml.linear_model import MBSGDRegressor as cumlSGD\n", "from sklearn.linear_model import SGDRegressor as skSGD\n", "from sklearn.datasets import make_regression\n", "from sklearn.metrics import mean_squared_error\n", "\n", "from cuml.ensemble import RandomForestClassifier as curfc\n", "from sklearn.ensemble import RandomForestClassifier as skrfc\n", "\n", "from cuml import make_regression\n", "from cuml.linear_model import LinearRegression as cuLinearRegression\n", "from cuml.metrics.regression import r2_score\n", "from sklearn.linear_model import LinearRegression as skLinearRegression\n", "\n", "from cuml.neighbors import KNeighborsClassifier as KNeighborsC\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from cuml.linear_model import MBSGDClassifier as cumlMBSGDClassifier\n", "from sklearn.linear_model import SGDClassifier\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.pipeline import make_pipeline\n", "from cuml import Ridge\n", "from cuml.linear_model import Ridge\n", "from sklearn.linear_model import Ridge\n", "from cuml import LogisticRegression\n", "from sklearn.linear_model import LogisticRegression as skLogistic\n", "from cuml.linear_model import ElasticNet\n", "from sklearn import linear_model\n", "\n", "from cuml.linear_model import Lasso\n", "from cuml.solvers import SGD as cumlSGD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's read the dataframe from the csv which was processed in the previous tutorial and stored in the data folder." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 43 ms, sys: 10.1 ms, total: 53.1 ms\n", "Wall time: 52.5 ms\n", " Unnamed: 0 Source TMC Severity Start_Lat Start_Lng End_Lat \\\n", "0 0 1 201.0 3 39.865147 -84.058723 37.557578 \n", "1 1 1 201.0 2 39.928059 -82.831184 37.557578 \n", "2 2 1 201.0 2 39.063148 -84.032608 37.557578 \n", "3 3 1 201.0 3 39.747753 -84.205582 37.557578 \n", "4 4 1 201.0 2 39.627781 -84.188354 37.557578 \n", "... ... ... ... ... ... ... ... 
\n", "17317 17317 1 201.0 3 37.396164 -121.907578 37.557578 \n", "17318 17318 1 201.0 3 37.825649 -122.304092 37.557578 \n", "17319 17319 1 201.0 2 36.979454 -121.909035 37.557578 \n", "17320 17320 1 201.0 2 37.314030 -121.827065 37.557578 \n", "17321 17321 1 201.0 3 37.758404 -122.212173 37.557578 \n", "\n", " End_Lng Distance(mi) County ... Station Stop \\\n", "0 -100.455981 0.01 Montgomery ... 0.0 0.0 \n", "1 -100.455981 0.01 Franklin ... 0.0 0.0 \n", "2 -100.455981 0.01 Clermont ... 0.0 0.0 \n", "3 -100.455981 0.01 Montgomery ... 0.0 0.0 \n", "4 -100.455981 0.01 Montgomery ... 0.0 0.0 \n", "... ... ... ... ... ... ... \n", "17317 -100.455981 0.01 Santa Clara ... 0.0 0.0 \n", "17318 -100.455981 0.01 Alameda ... 0.0 0.0 \n", "17319 -100.455981 0.00 Santa Cruz ... 0.0 0.0 \n", "17320 -100.455981 0.01 Santa Clara ... 0.0 0.0 \n", "17321 -100.455981 0.01 Alameda ... NaN NaN \n", "\n", " Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset \\\n", "0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 1.0 \n", "... ... ... ... ... \n", "17317 0.0 0.0 0.0 1.0 \n", "17318 0.0 0.0 0.0 1.0 \n", "17319 0.0 0.0 0.0 1.0 \n", "17320 0.0 0.0 0.0 1.0 \n", "17321 NaN NaN NaN NaN \n", "\n", " Civil_Twilight Nautical_Twilight Astronomical_Twilight cov_distance \n", "0 0.0 0.0 0.0 1443.524390 \n", "1 0.0 0.0 1.0 1548.467903 \n", "2 0.0 1.0 1.0 1440.697621 \n", "3 1.0 1.0 1.0 1429.927497 \n", "4 1.0 1.0 1.0 1430.383177 \n", "... ... ... ... ... \n", "17317 1.0 1.0 1.0 1888.935551 \n", "17318 1.0 1.0 1.0 1918.251042 \n", "17319 1.0 1.0 1.0 1895.341155 \n", "17320 1.0 1.0 1.0 1883.025767 \n", "17321 NaN NaN NaN NaN \n", "\n", "[17322 rows x 34 columns]\n" ] } ], "source": [ "%time df = pd.read_csv('../../data/data_proc.csv')\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Drop the unnecessary columns which got added while reading the file." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = df.drop(columns = [\"Unnamed: 0\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe the dataset by printing the first 5 rows using the head function." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | Source | \n", "TMC | \n", "Severity | \n", "Start_Lat | \n", "Start_Lng | \n", "End_Lat | \n", "End_Lng | \n", "Distance(mi) | \n", "County | \n", "State | \n", "... | \n", "Station | \n", "Stop | \n", "Traffic_Calming | \n", "Traffic_Signal | \n", "Turning_Loop | \n", "Sunrise_Sunset | \n", "Civil_Twilight | \n", "Nautical_Twilight | \n", "Astronomical_Twilight | \n", "cov_distance | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "1 | \n", "201.0 | \n", "3 | \n", "39.865147 | \n", "-84.058723 | \n", "37.557578 | \n", "-100.455981 | \n", "0.01 | \n", "Montgomery | \n", "OH | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1443.524390 | \n", "
1 | \n", "1 | \n", "201.0 | \n", "2 | \n", "39.928059 | \n", "-82.831184 | \n", "37.557578 | \n", "-100.455981 | \n", "0.01 | \n", "Franklin | \n", "OH | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "1548.467903 | \n", "
2 | \n", "1 | \n", "201.0 | \n", "2 | \n", "39.063148 | \n", "-84.032608 | \n", "37.557578 | \n", "-100.455981 | \n", "0.01 | \n", "Clermont | \n", "OH | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "1.0 | \n", "1440.697621 | \n", "
3 | \n", "1 | \n", "201.0 | \n", "3 | \n", "39.747753 | \n", "-84.205582 | \n", "37.557578 | \n", "-100.455981 | \n", "0.01 | \n", "Montgomery | \n", "OH | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1429.927497 | \n", "
4 | \n", "1 | \n", "201.0 | \n", "2 | \n", "39.627781 | \n", "-84.188354 | \n", "37.557578 | \n", "-100.455981 | \n", "0.01 | \n", "Montgomery | \n", "OH | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1430.383177 | \n", "
5 rows × 33 columns
\n", "