{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[Home Page](../../START_HERE.ipynb)\n", "\n", "[Previous Notebook](Challenge.ipynb)\n", "\n", "[1](Challenge.ipynb)\n", "[2]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bike Rental Prediction Challenge - Solution\n", "\n", "## 1. Introduction\n", "\n", "This notebook walks through an end-to-end GPU machine learning workflow in which cuDF processes the data and cuML trains machine learning models on it.\n", "\n", "After completing this exercise, you will be able to use cuDF to load data from disk, combine tables, scale features, apply one-hot encoding, and even write your own GPU kernels to efficiently transform feature columns. Additionally, you will learn how to pass this data to cuML and how to train ML models on it. The trained model is saved and later used for prediction.\n", "\n", "You do not need to be familiar with cuDF or cuML. Since our aim is to go from ETL to ML training, a detailed introduction is out of scope for this notebook. We recommend [Introduction to cuDF](../../CuDF/01-Intro_to_cuDF.ipynb) for additional information.\n", "\n", "### 1.1 Problem statement\n", "\n", "We are trying to predict daily demand for short-term bike rentals made in 2011 and 2012. We will combine three data sources: bike rental information, historical weather data, and dates of public holidays. In Section 2 of this notebook we will use cuDF to combine these data into a single dataset that can be used as an input for machine learning algorithms. In Section 3 we train models using cuML to predict bike rentals.\n", "\n", "### 1.2 Why RAPIDS?\n", "\n", "Using the GPU accelerated libraries from RAPIDS greatly reduces the execution time of a data science workflow. 
This leads to faster iteration on data preparation and model selection, and to a more efficient workflow overall.\n", "\n", "### 1.3 References\n", "\n", "This notebook is inspired by the [blog article](https://medium.com/rapids-ai/essential-machine-learning-with-linear-models-in-rapids-part-1-of-a-series-992fab0240da) by Paul Mahler and its accompanying [notebook](https://github.com/rapidsai-community/notebooks-contrib/blob/master/blog_notebooks/regression/regression_blog_notebook.ipynb). The dataset is prepared following the steps given by *Hadi Fanaee-T and Joao Gama* in their [paper](https://doi.org/10.1007/s13748-013-0040-3) *Event labeling combining ensemble detectors and background knowledge*. The exploratory data analysis notebooks by [Vivek Srinivasan](https://www.kaggle.com/viveksrinivasan/eda-ensemble-model-top-10-percentile) and [Mitesh Yadav](https://www.kaggle.com/miteshyadav/comprehensive-eda-with-xgboost-top-10-percentile) provided useful input for this exercise.\n", "\n", "The first part of this notebook contains sections from [Introduction to cuDF](https://github.com/rapidsai-community/notebooks-contrib/blob/master/getting_started_notebooks/intro_tutorials/02_Introduction_to_cuDF.ipynb) by Paul Hendricks, which gives a concise introduction to cuDF and also discusses a few points not covered in this notebook.\n", "\n", "Dataset sources:\n", "- The bike sharing dataset is provided by [Capital Bike Share](https://www.capitalbikeshare.com/system-data).\n", "- The weather data is retrieved from the [Bike Sharing Dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset) hosted by the UCI Machine Learning repository.\n", "- The original source of the weather data is https://www.freemeteo.com.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Prepare dataset with cuDF\n", "Let's start by loading the necessary libraries." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import cudf\n", "import pandas as pd\n", "import numpy as np\n", "\n", "from datetime import datetime, timedelta\n", "import os\n", "import sys\n", "sys.path.insert(1, os.path.realpath(os.path.pardir))\n", "import importlib\n", "import utils\n", "importlib.reload(utils)\n", "from utils import fetch_bike_dataset, fetch_weather_dataset, read_bike_data_pandas\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Prepare weather data\n", "First, we will download the weather data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip to data/Bike-Sharing-Dataset.zip\n", "Weather file saved at data/weather2011-2012.csv\n" ] } ], "source": [ "filename = fetch_weather_dataset()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A cuDF DataFrame is a tabular data structure that resides on the GPU. We interface with cuDF DataFrames in the same way we interface with Pandas DataFrames that reside on the CPU, with a few deviations. Let's load the data from the CSV file into a cuDF DataFrame." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "weather = cudf.read_csv(filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.1 Inspecting a cuDF DataFrame\n", "\n", "There are several ways to inspect a cuDF DataFrame. The first method is to enter the cuDF DataFrame directly into the REPL. This shows us an overview of the DataFrame, including its contents and metadata such as the number of rows and columns."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
HourTemperatureRelative TemperatureRel. humidityWindWeather
02011-01-01T00:00:00Z3.283.0014810.0000Clear or Partly cloudy
12011-01-01T01:00:00Z2.341.9982800.0000Clear or Partly cloudy
22011-01-01T02:00:00Z2.341.9982800.0000Clear or Partly cloudy
32011-01-01T03:00:00Z3.283.0014750.0000Clear or Partly cloudy
42011-01-01T04:00:00Z3.283.0014750.0000Clear or Partly cloudy
.....................
173742012-12-31T19:00:00Z4.221.00166011.0014Mist or Cloudy
173752012-12-31T20:00:00Z4.221.00166011.0014Mist or Cloudy
173762012-12-31T21:00:00Z4.221.00166011.0014Clear or Partly cloudy
173772012-12-31T22:00:00Z4.221.9982568.9981Clear or Partly cloudy
173782012-12-31T23:00:00Z4.221.9982658.9981Clear or Partly cloudy
\n", "

17379 rows × 6 columns

\n", "
" ], "text/plain": [ " Hour Temperature Relative Temperature Rel. humidity \\\n", "0 2011-01-01T00:00:00Z 3.28 3.0014 81 \n", "1 2011-01-01T01:00:00Z 2.34 1.9982 80 \n", "2 2011-01-01T02:00:00Z 2.34 1.9982 80 \n", "3 2011-01-01T03:00:00Z 3.28 3.0014 75 \n", "4 2011-01-01T04:00:00Z 3.28 3.0014 75 \n", "... ... ... ... ... \n", "17374 2012-12-31T19:00:00Z 4.22 1.0016 60 \n", "17375 2012-12-31T20:00:00Z 4.22 1.0016 60 \n", "17376 2012-12-31T21:00:00Z 4.22 1.0016 60 \n", "17377 2012-12-31T22:00:00Z 4.22 1.9982 56 \n", "17378 2012-12-31T23:00:00Z 4.22 1.9982 65 \n", "\n", " Wind Weather \n", "0 0.0000 Clear or Partly cloudy \n", "1 0.0000 Clear or Partly cloudy \n", "2 0.0000 Clear or Partly cloudy \n", "3 0.0000 Clear or Partly cloudy \n", "4 0.0000 Clear or Partly cloudy \n", "... ... ... \n", "17374 11.0014 Mist or Cloudy \n", "17375 11.0014 Mist or Cloudy \n", "17376 11.0014 Clear or Partly cloudy \n", "17377 8.9981 Clear or Partly cloudy \n", "17378 8.9981 Clear or Partly cloudy \n", "\n", "[17379 rows x 6 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A second way to inspect a cuDF DataFrame is to pass the object to the Python `print` function: `print(weather)`. This shows the rows and columns of the DataFrame with simple formatting.\n", "\n", "For very large DataFrames, we often want to see only the first few rows. We can use the `head` method of a cuDF DataFrame to view the first N rows." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
HourTemperatureRelative TemperatureRel. humidityWindWeather
02011-01-01T00:00:00Z3.283.0014810.0Clear or Partly cloudy
12011-01-01T01:00:00Z2.341.9982800.0Clear or Partly cloudy
22011-01-01T02:00:00Z2.341.9982800.0Clear or Partly cloudy
\n", "
" ], "text/plain": [ " Hour Temperature Relative Temperature Rel. humidity \\\n", "0 2011-01-01T00:00:00Z 3.28 3.0014 81 \n", "1 2011-01-01T01:00:00Z 2.34 1.9982 80 \n", "2 2011-01-01T02:00:00Z 2.34 1.9982 80 \n", "\n", " Wind Weather \n", "0 0.0 Clear or Partly cloudy \n", "1 0.0 Clear or Partly cloudy \n", "2 0.0 Clear or Partly cloudy " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.2 Columns\n", "\n", "cuDF DataFrames store metadata such as information about columns or data types. We can access the columns of a cuDF DataFrame using the `.columns` attribute." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['Hour', 'Temperature', 'Relative Temperature', 'Rel. humidity', 'Wind',\n", " 'Weather'],\n", " dtype='object')\n" ] } ], "source": [ "print(weather.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can modify the columns of a cuDF DataFrame by modifying the `columns` attribute. We can do this by setting that attribute equal to a list of strings representing the new columns. Let's shorten the two longest column names!" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
HourTemperatureRTempHumidityWindWeather
02011-01-01T00:00:00Z3.283.0014810.0Clear or Partly cloudy
12011-01-01T01:00:00Z2.341.9982800.0Clear or Partly cloudy
22011-01-01T02:00:00Z2.341.9982800.0Clear or Partly cloudy
32011-01-01T03:00:00Z3.283.0014750.0Clear or Partly cloudy
42011-01-01T04:00:00Z3.283.0014750.0Clear or Partly cloudy
\n", "
" ], "text/plain": [ " Hour Temperature RTemp Humidity Wind \\\n", "0 2011-01-01T00:00:00Z 3.28 3.0014 81 0.0 \n", "1 2011-01-01T01:00:00Z 2.34 1.9982 80 0.0 \n", "2 2011-01-01T02:00:00Z 2.34 1.9982 80 0.0 \n", "3 2011-01-01T03:00:00Z 3.28 3.0014 75 0.0 \n", "4 2011-01-01T04:00:00Z 3.28 3.0014 75 0.0 \n", "\n", " Weather \n", "0 Clear or Partly cloudy \n", "1 Clear or Partly cloudy \n", "2 Clear or Partly cloudy \n", "3 Clear or Partly cloudy \n", "4 Clear or Partly cloudy " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### TODO rename the relative temperature column to RTemp, and the relative humidity to Humidity\n", "#weather.columns = ['Hour', 'Temperature', 'Relative Temperature', 'Rel. Humidity', 'Wind', 'Weather']\n", "weather.columns = ['Hour', 'Temperature', 'RTemp', 'Humidity', 'Wind', 'Weather']\n", "weather.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.3 Series\n", "\n", "cuDF DataFrames are composed of rows and columns. Each column is represented by an object of type `Series`. For example, if we select a single column from a cuDF DataFrame, we get back a `cudf.Series` object." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "0 81\n", "1 80\n", "2 80\n", "3 75\n", "4 75\n", " ..\n", "17374 60\n", "17375 60\n", "17376 60\n", "17377 56\n", "17378 65\n", "Name: Humidity, Length: 17379, dtype: int64\n" ] } ], "source": [ "humidity = weather['Humidity']\n", "print(type(humidity))\n", "print(humidity)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also see a column on the left-hand side with the values 0, 1, 2, 3, and so on. These values represent the index of the Series.\n", "Both DataFrame and Series objects have an `index` attribute, which is useful for joining tables and for selecting data."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.4 Data Types\n", "\n", "We can also inspect the data types of the columns of a cuDF DataFrame using the `dtypes` attribute." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hour object\n", "Temperature float64\n", "RTemp float64\n", "Humidity int64\n", "Wind float64\n", "Weather object\n", "dtype: object\n" ] } ], "source": [ "print(weather.dtypes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can change the data type of a column by converting it with `astype` and assigning the resulting cuDF Series back to the column." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hour object\n", "Temperature float64\n", "RTemp float64\n", "Humidity float64\n", "Wind float64\n", "Weather object\n", "dtype: object\n" ] } ], "source": [ "weather['Humidity'] = weather['Humidity'].astype(np.float64)\n", "print(weather.dtypes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The 'Weather' column provides a description of the weather conditions. We should mark it as a categorical column." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Clear or Partly cloudy\n", "1 Clear or Partly cloudy\n", "2 Clear or Partly cloudy\n", "3 Clear or Partly cloudy\n", "4 Clear or Partly cloudy\n", " ... 
\n", "17374 Mist or Cloudy\n", "17375 Mist or Cloudy\n", "17376 Clear or Partly cloudy\n", "17377 Clear or Partly cloudy\n", "17378 Clear or Partly cloudy\n", "Name: Weather, Length: 17379, dtype: category\n", "Categories (4, object): ['Clear or Partly cloudy', 'Heavy Rain, Snow + Fog, Ice', 'Light Rain or Snow, Thunderstorm', 'Mist or Cloudy']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather['Weather'] = weather['Weather'].astype('category')\n", "weather['Weather']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After this step, the numerical category codes can be accessed using the `.cat.codes` attribute of the column. We will not need the category labels, so we simply replace the 'Weather' column with the category codes." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "weather['Weather'] = weather['Weather'].cat.codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data type of the 'Hour' column is `object`, which means that it holds strings. Let's convert it to a datetime type! Use the [cudf.to_datetime](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.to_datetime) function for this rather than the `astype` method." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Hour datetime64[ns]\n", "Temperature float64\n", "RTemp float64\n", "Humidity float64\n", "Wind float64\n", "Weather uint8\n", "dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### TODO convert the 'Hour' column from string to datetime\n", "weather['Hour'] = cudf.to_datetime(weather['Hour'])\n", "weather.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.5 Prepare features\n", "##### Operations with cudf Series\n", "We can perform mathematical operations on the Series data type. 
We will scale the Humidity and Temperature variables so that they lie in the [0, 1] range (some ML algorithms work better if the input data is scaled this way)." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "weather['Humidity'] = weather['Humidity'] / 100.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will scale the temperature using min-max scaling: `T_scaled = (T - Tmin) / (Tmax - Tmin)`. First we select the min and max values." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-7.06 39.0\n" ] } ], "source": [ "T = weather['Temperature']\n", "\n", "# Select the minimum temperature\n", "Tmin = T.min()\n", "\n", "### TODO select the maximum temperature (1 line of code)\n", "Tmax = T.max()\n", "\n", "print(Tmin, Tmax)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could simply use the Tmin and Tmax values and apply the above formula directly to the Series. Instead, we will use this step to introduce user defined functions.\n", "\n", "##### User defined functions (UDF)\n", "We can write custom functions to operate on the data. When cuDF executes a UDF, the function is just-in-time (JIT) compiled into a CUDA kernel (either explicitly or implicitly) and run on the GPU. Let's write a function that scales the temperature!"
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def scale_temp(T):\n", "    # Note that the values of Tmin and Tmax are captured at compile time and remain constant afterwards\n", "    T = (T - Tmin) / (Tmax - Tmin)\n", "    return T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `applymap` method calls `scale_temp` on every element of the Series." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "weather['Temperature'] = weather['Temperature'].applymap(scale_temp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do the same min-max scaling for the wind data." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0 56.996900000000004\n" ] } ], "source": [ "### TODO calculate the minimum and maximum values of the 'Wind' column (2 lines of code)\n", "Wmin = weather['Wind'].min()\n", "Wmax = weather['Wind'].max()\n", "\n", "print(Wmin, Wmax)\n", "\n", "### TODO define a scale_wind function and apply it on the Wind column (~ 2-3 lines of code)\n", "def scale_wind(w):\n", "    return (w - Wmin) / (Wmax - Wmin)\n", "\n", "### TODO apply the scale_wind function on the 'Wind' column\n", "weather['Wind'] = weather['Wind'].applymap(scale_wind)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's inspect the table. The Temperature, Wind and Humidity columns should now have values in the [0, 1] range." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TemperatureRTempHumidityWindWeather
count17379.00000017379.00000017379.00000017379.00000017379.000000
mean0.48672215.4011570.6269470.2234600.947868
std0.19648611.3421140.1930130.1438111.334769
min0.000000-16.0000000.0000000.0000000.000000
25%0.3265315.9978000.4800000.1228400.000000
50%0.48979615.9968000.6300000.2280470.000000
75%0.65306124.9992000.7800000.2982253.000000
max1.00000050.0000001.0000001.0000003.000000
\n", "
" ], "text/plain": [ " Temperature RTemp Humidity Wind Weather\n", "count 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000\n", "mean 0.486722 15.401157 0.626947 0.223460 0.947868\n", "std 0.196486 11.342114 0.193013 0.143811 1.334769\n", "min 0.000000 -16.000000 0.000000 0.000000 0.000000\n", "25% 0.326531 5.997800 0.480000 0.122840 0.000000\n", "50% 0.489796 15.996800 0.630000 0.228047 0.000000\n", "75% 0.653061 24.999200 0.780000 0.298225 3.000000\n", "max 1.000000 50.000000 1.000000 1.000000 3.000000" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Dropping Columns\n", "\n", "The relative temperature column is correlated with the temperature, so it will not give much extra information to the ML model. We want to remove this column from our `DataFrame`. We can do so using the `drop` method. Note that passing `inplace=True` removes the column in-place, meaning that the `DataFrame` we act on will be modified." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
HourTemperatureHumidityWindWeather
02011-01-01 00:00:000.2244900.810.0000000
12011-01-01 01:00:000.2040820.800.0000000
22011-01-01 02:00:000.2040820.800.0000000
32011-01-01 03:00:000.2244900.750.0000000
42011-01-01 04:00:000.2244900.750.0000000
..................
173742012-12-31 19:00:000.2448980.600.1930183
173752012-12-31 20:00:000.2448980.600.1930183
173762012-12-31 21:00:000.2448980.600.1930180
173772012-12-31 22:00:000.2448980.560.1578700
173782012-12-31 23:00:000.2448980.650.1578700
\n", "

17379 rows × 5 columns

\n", "
" ], "text/plain": [ " Hour Temperature Humidity Wind Weather\n", "0 2011-01-01 00:00:00 0.224490 0.81 0.000000 0\n", "1 2011-01-01 01:00:00 0.204082 0.80 0.000000 0\n", "2 2011-01-01 02:00:00 0.204082 0.80 0.000000 0\n", "3 2011-01-01 03:00:00 0.224490 0.75 0.000000 0\n", "4 2011-01-01 04:00:00 0.224490 0.75 0.000000 0\n", "... ... ... ... ... ...\n", "17374 2012-12-31 19:00:00 0.244898 0.60 0.193018 3\n", "17375 2012-12-31 20:00:00 0.244898 0.60 0.193018 3\n", "17376 2012-12-31 21:00:00 0.244898 0.60 0.193018 0\n", "17377 2012-12-31 22:00:00 0.244898 0.56 0.157870 0\n", "17378 2012-12-31 23:00:00 0.244898 0.65 0.157870 0\n", "\n", "[17379 rows x 5 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather.drop(['RTemp'],axis=1,inplace=True)\n", "weather" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to remove a column without modifying the original DataFrame, we can call the `drop` method without `inplace=True`. It then returns a new DataFrame without that column (or columns)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Index\n", "\n", "Like `Series` objects, each `DataFrame` has an index attribute." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=17379)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the index values to subset the `DataFrame`. Let's use this to plot the first 48 values. Before plotting, we have to transfer the data from GPU memory to system memory. We use the `to_array` method to return a copy of the data as a numpy array."
] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'Temperature [C]')" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "selection = weather[weather.index<48]\n", "plt.plot(selection['Hour'].to_array(), selection['Temperature'].to_array())\n", "plt.xlabel('Hour')\n", "plt.ylabel('Temperature [C]')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also change the index. Our dataset has one entry for each hour, so we could set the 'Hour' column as the index by calling\n", "```\n", "weather = weather.set_index('Hour')\n", "```\n", "\n", "We do not perform this change right now, but we will use it later." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "#weather = weather.set_index('Hour')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Prepare bike sharing data\n", "We start by downloading the data." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "24744.0kB [00:01, 13625.36kB/s] \n", "42448.0kB [00:17, 2456.30kB/s] \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Files extracted: ['data/2011-capitalbikeshare-tripdata.csv', 'data/2012Q1-capitalbikeshare-tripdata.csv', 'data/2012Q2-capitalbikeshare-tripdata.csv', 'data/2012Q3-capitalbikeshare-tripdata.csv', 'data/2012Q4-capitalbikeshare-tripdata.csv']\n" ] } ], "source": [ "files = fetch_bike_dataset([2011,2012])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's read the first file to get an idea of the dataset." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DurationStart dateEnd dateStart station numberStart stationEnd station numberEnd stationBike numberMember type
035482011-01-01 00:01:292011-01-01 01:00:37316205th & F St NW316205th & F St NWW00247Member
13462011-01-01 00:02:462011-01-01 00:08:323110514th & Harvard St NW3110114th & V St NWW00675Casual
25622011-01-01 00:06:132011-01-01 00:15:3631400Georgia & New Hampshire Ave NW31104Adams Mill & Columbia Rd NWW00357Member
34342011-01-01 00:09:212011-01-01 00:16:363111110th & U St NW31503Florida Ave & R St NWW00970Member
42332011-01-01 00:28:262011-01-01 00:32:1931104Adams Mill & Columbia Rd NW31106Calvert & Biltmore St NWW00346Casual
..............................
12267623002011-12-31 23:41:192011-12-31 23:46:203120115th & P St NW3121417th & Corcoran St NWW01459Member
12267633872011-12-31 23:46:432011-12-31 23:53:1031223Convention Center / 7th & M St NW3120115th & P St NWW01262Member
12267642612011-12-31 23:47:272011-12-31 23:51:4931107Lamont & Mt Pleasant NW31602Park Rd & Holmead Pl NWW00998Member
122676520602011-12-31 23:55:122012-01-01 00:29:333120521st & I St NW31222New York Ave & 15th St NWW00042Member
12267664682011-12-31 23:55:562012-01-01 00:03:453122118th & M St NW3111110th & U St NWW01319Member
\n", "

1226767 rows × 9 columns

\n", "
" ], "text/plain": [ " Duration Start date End date \\\n", "0 3548 2011-01-01 00:01:29 2011-01-01 01:00:37 \n", "1 346 2011-01-01 00:02:46 2011-01-01 00:08:32 \n", "2 562 2011-01-01 00:06:13 2011-01-01 00:15:36 \n", "3 434 2011-01-01 00:09:21 2011-01-01 00:16:36 \n", "4 233 2011-01-01 00:28:26 2011-01-01 00:32:19 \n", "... ... ... ... \n", "1226762 300 2011-12-31 23:41:19 2011-12-31 23:46:20 \n", "1226763 387 2011-12-31 23:46:43 2011-12-31 23:53:10 \n", "1226764 261 2011-12-31 23:47:27 2011-12-31 23:51:49 \n", "1226765 2060 2011-12-31 23:55:12 2012-01-01 00:29:33 \n", "1226766 468 2011-12-31 23:55:56 2012-01-01 00:03:45 \n", "\n", " Start station number Start station \\\n", "0 31620 5th & F St NW \n", "1 31105 14th & Harvard St NW \n", "2 31400 Georgia & New Hampshire Ave NW \n", "3 31111 10th & U St NW \n", "4 31104 Adams Mill & Columbia Rd NW \n", "... ... ... \n", "1226762 31201 15th & P St NW \n", "1226763 31223 Convention Center / 7th & M St NW \n", "1226764 31107 Lamont & Mt Pleasant NW \n", "1226765 31205 21st & I St NW \n", "1226766 31221 18th & M St NW \n", "\n", " End station number End station Bike number \\\n", "0 31620 5th & F St NW W00247 \n", "1 31101 14th & V St NW W00675 \n", "2 31104 Adams Mill & Columbia Rd NW W00357 \n", "3 31503 Florida Ave & R St NW W00970 \n", "4 31106 Calvert & Biltmore St NW W00346 \n", "... ... ... ... \n", "1226762 31214 17th & Corcoran St NW W01459 \n", "1226763 31201 15th & P St NW W01262 \n", "1226764 31602 Park Rd & Holmead Pl NW W00998 \n", "1226765 31222 New York Ave & 15th St NW W00042 \n", "1226766 31111 10th & U St NW W01319 \n", "\n", " Member type \n", "0 Member \n", "1 Casual \n", "2 Member \n", "3 Member \n", "4 Casual \n", "... ... 
\n", "1226762 Member \n", "1226763 Member \n", "1226764 Member \n", "1226765 Member \n", "1226766 Member \n", "\n", "[1226767 rows x 9 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cudf.read_csv(files[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are only interested in the events when a bicycle was rented. Let us read only the 'Start date' column (column index 1) from all files, by specifying the `usecols` argument to [read_csv](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.io.csv.read_csv). We can use the `parse_dates` argument to parse the date string into a datetime variable, or the [to_datetime](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.to_datetime) function that we have used for the weather dataset. After all the tables are read we will concatenate them.\n", "\n", "Note: one has to specify a list of columns [ column1, column2 ] for the `usecols` argument." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def read_bike_data(files):\n", "    # Reads a list of files and concatenates them\n", "    tables = []\n", "    for filename in files:\n", "        ### TODO read column 1 ('Start date') from the CSV file, and convert it to datetime format\n", "        ### (1-2 lines of code)\n", "        tmp_df = cudf.read_csv(filename, parse_dates=[1], usecols=[1])\n", "\n", "        ### END TODO\n", "        tables.append(tmp_df)\n", "\n", "    merged_df = cudf.concat(tables, ignore_index=True)\n", "\n", "    # Sanity checks\n", "    if list(merged_df.columns) != ['Start date']:\n", "        raise ValueError('Incorrect set of columns read')\n", "    if merged_df['Start date'].dtype != 'datetime64[ns]':\n", "        raise TypeError('Start date should be converted to datetime type')\n", "\n", "    return merged_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will also measure the execution time of reading and processing the data (you can execute the cell multiple times to get a better measurement)."
 ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 91.6 ms, sys: 330 ms, total: 422 ms\n", "Wall time: 421 ms\n" ] } ], "source": [ "%%time\n", "bikes_raw = read_bike_data(files)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For comparison, we can repeat the same operation on the CPU. We have prepared a helper function for that." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.31 s, sys: 192 ms, total: 4.51 s\n", "Wall time: 4.5 s\n" ] } ], "source": [ "%%time\n", "bikes_raw_pd = read_bike_data_pandas(files)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to count the number of rental events in every hour. We will define a new feature where we remove the minutes and seconds part of the time stamp. Since pandas provides a convenient `floor` function for this, we will convert the column to a pandas Series, transform it with the floor operation, and then put it back on the GPU." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "bikes_raw['Hour'] = bikes_raw['Start date'].to_pandas().dt.floor('h')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will aggregate the number of bicycle rental events for each hour. We use the [groupby](https://docs.rapids.ai/api/cudf/nightly/api.html#groupby) function." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " cnt\n", "Hour \n", "2011-01-01 00:00:00 16\n", "2011-01-01 01:00:00 38\n", "2011-01-01 02:00:00 31\n", "2011-01-01 03:00:00 12\n", "2011-01-01 04:00:00 1" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bikes = bikes_raw.groupby('Hour').agg('count')\n", "bikes.columns = ['cnt']\n", "bikes.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's add a column to the new dataset: the date without the time of the day. We can derive that similarly to the 'Hour' feature above. After the groupby operation, the 'Hour' became the index of the dataset, we will apply the `floor` operation on the index. " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "bikes['date'] = bikes.index.to_pandas().floor('D')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It will be usefull to define a set of additional features: hour of the day, day of month, month and year https://docs.rapids.ai/api/cudf/nightly/api.html#datetimeindex" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "bikes['hr'] = bikes.index.hour\n", "\n", "### TODO add year and month features (~ 2 lines of code)\n", "bikes['year'] = bikes.index.year\n", "bikes['month'] = bikes.index.month" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove the offset from year" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "bikes['year'] = bikes['year'] - 2011" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Visualize data\n", "It is a good practice to visulize the data. We will have to use the to_array() method to convert the cudF Series objects to numpy arrays that can be plotted." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.plot(bikes.index.to_array(), bikes['cnt'].to_array())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is hard to see much apart from the global trend. Let's have a look how the 'cnt' variable looks like as a function the 'month' and 'hr' features. We will use [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html) from the Seaborn package." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, axes = plt.subplots(nrows=1,ncols=2)\n", "fig.set_size_inches(12, 5)\n", "sns.boxplot(data=bikes.to_pandas(), y=\"cnt\",x=\"month\",orient=\"v\",ax=axes[0])\n", "sns.boxplot(data=bikes.to_pandas(), y=\"cnt\",x=\"hr\",orient=\"v\",ax=axes[1])\n", "axes[0].set(xlabel='Months', ylabel='Count',title=\"Box Plot On Count Across months\")\n", "axes[1].set(xlabel='Hour Of The Day', ylabel='Count',title=\"Box Plot On Count Across Hour Of The Day\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2.1 Combine weather data with bike rental data" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " cnt date hr year month Hour Temperature \\\n", "17376 90 2012-12-31 21 1 12 2012-12-31 21:00:00 0.244898 \n", "17377 61 2012-12-31 22 1 12 2012-12-31 22:00:00 0.244898 \n", "17378 50 2012-12-31 23 1 12 2012-12-31 23:00:00 0.244898 \n", "5184 146 2011-08-08 22 0 8 2011-08-08 22:00:00 0.755102 \n", "5185 72 2011-08-08 23 0 8 2011-08-08 23:00:00 0.734694 \n", "... ... ... .. ... ... ... ... \n", "2331 13 2011-04-12 1 0 4 2011-04-12 01:00:00 0.612245 \n", "2332 14 2011-04-12 2 0 4 2011-04-12 02:00:00 0.591837 \n", "2333 1 2011-04-12 3 0 4 2011-04-12 03:00:00 0.571429 \n", "2334 6 2011-04-12 4 0 4 2011-04-12 04:00:00 0.551020 \n", "2335 16 2011-04-12 5 0 4 2011-04-12 05:00:00 0.530612 \n", "\n", " Humidity Wind Weather \n", "17376 0.60 0.193018 0 \n", "17377 0.56 0.157870 0 \n", "17378 0.65 0.157870 0 \n", "5184 0.57 0.122840 0 \n", "5185 0.66 0.122840 0 \n", "... ... ... ... \n", "2331 0.50 0.105325 3 \n", "2332 0.53 0.157870 3 \n", "2333 0.56 0.157870 3 \n", "2334 0.64 0.157870 3 \n", "2335 0.68 0.157870 3 \n", "\n", "[17378 rows x 10 columns]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf_bw = bikes.merge(weather, left_index=True, right_on='Hour', how='inner')\n", "\n", "# inspect the merged table\n", "gdf_bw" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the data is not sorted after the merge use the [sort_values](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.core.dataframe.DataFrame.sort_values) method to" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " cnt date hr year month Hour Temperature \\\n", "0 16 2011-01-01 0 0 1 2011-01-01 00:00:00 0.224490 \n", "1 38 2011-01-01 1 0 1 2011-01-01 01:00:00 0.204082 \n", "2 31 2011-01-01 2 0 1 2011-01-01 02:00:00 0.204082 \n", "3 12 2011-01-01 3 0 1 2011-01-01 03:00:00 0.224490 \n", "4 1 2011-01-01 4 0 1 2011-01-01 04:00:00 0.224490 \n", "... ... ... .. ... ... ... ... \n", "17374 118 2012-12-31 19 1 12 2012-12-31 19:00:00 0.244898 \n", "17375 89 2012-12-31 20 1 12 2012-12-31 20:00:00 0.244898 \n", "17376 90 2012-12-31 21 1 12 2012-12-31 21:00:00 0.244898 \n", "17377 61 2012-12-31 22 1 12 2012-12-31 22:00:00 0.244898 \n", "17378 50 2012-12-31 23 1 12 2012-12-31 23:00:00 0.244898 \n", "\n", " Humidity Wind Weather \n", "0 0.81 0.000000 0 \n", "1 0.80 0.000000 0 \n", "2 0.80 0.000000 0 \n", "3 0.75 0.000000 0 \n", "4 0.75 0.000000 0 \n", "... ... ... ... \n", "17374 0.60 0.193018 3 \n", "17375 0.60 0.193018 3 \n", "17376 0.60 0.193018 0 \n", "17377 0.56 0.157870 0 \n", "17378 0.65 0.157870 0 \n", "\n", "[17378 rows x 10 columns]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### TODO sort the table according to the index (1 line of code)\n", "gdf_bw = gdf_bw.sort_values(by='Hour')\n", "\n", "# Inspect the sorted table\n", "gdf_bw" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Add working day feature\n", "\n", "Apart from the weather, in important factor that influences people's daily activities is whether it is a working day or not. In this section we will create a working day feature. First we add the weekday as a new feature column. 
\n", "We can use the [weekday](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.core.series.DatetimeProperties.weekday) attribute of the [datetime](https://docs.rapids.ai/api/cudf/nightly/api.html#datetimeindex) properties of the 'date' column." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "gdf_bw['Weekday'] = gdf_bw['date'].dt.weekday" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, create a table with all the holidays in Washington DC in 2011-2012." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " date Description\n", "0 2011-01-17 Martin Luther King Jr. Day\n", "1 2011-02-21 Washington's Birthday\n", "2 2011-04-15 Emancipation Day\n", "3 2011-05-30 Memorial Day\n", "4 2011-07-04 Independence Day\n", "5 2011-09-05 Labor Day\n", "6 2011-11-11 Veterans Day\n", "7 2011-11-24 Thanksgiving\n", "8 2011-12-26 Christmas Day\n", "9 2012-01-02 New Year's Day\n", "10 2012-01-16 Martin Luther King Jr. Day\n", "11 2012-02-20 Washington's Birthday\n", "12 2012-04-16 Emancipation Day\n", "13 2012-05-28 Memorial Day\n", "14 2012-07-04 Independence Day\n", "15 2012-09-03 Labor Day\n", "16 2012-11-12 Veterans Day\n", "17 2012-11-22 Thanksgiving\n", "18 2012-12-25 Christmas Day" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "holidays = cudf.DataFrame({'date': ['2011-01-17', '2011-02-21', '2011-04-15', '2011-05-30', '2011-07-04', '2011-09-05', '2011-11-11', '2011-11-24', '2011-12-26', '2012-01-02', '2012-01-16', '2012-02-20', '2012-04-16', '2012-05-28', '2012-07-04', '2012-09-03', '2012-11-12', '2012-11-22', '2012-12-25'],\n", "'Description': [\"Martin Luther King Jr. Day\", \"Washington's Birthday\", \"Emancipation Day\", \"Memorial Day\", \"Independence Day\", \"Labor Day\", \"Veterans Day\", \"Thanksgiving\", \"Christmas Day\", \n", "\"New Year's Day\", \"Martin Luther King Jr. Day\", \"Washington's Birthday\", \"Emancipation Day\", \"Memorial Day\", \"Independence Day\", \"Labor Day\", \"Veterans Day\", \"Thanksgiving\", \"Christmas Day\"]})\n", "\n", "# Print the dataframe\n", "holidays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We convert the date from string to datetime type, and drop the description column. Additionally we add a new column marked 'Holiday'. This will be useful to mark the holidays after we merge the tables." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " date Holiday\n", "0 2011-01-17 1\n", "1 2011-02-21 1\n", "2 2011-04-15 1\n", "3 2011-05-30 1\n", "4 2011-07-04 1\n", "5 2011-09-05 1\n", "6 2011-11-11 1\n", "7 2011-11-24 1\n", "8 2011-12-26 1\n", "9 2012-01-02 1\n", "10 2012-01-16 1\n", "11 2012-02-20 1\n", "12 2012-04-16 1\n", "13 2012-05-28 1\n", "14 2012-07-04 1\n", "15 2012-09-03 1\n", "16 2012-11-12 1\n", "17 2012-11-22 1\n", "18 2012-12-25 1" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "holidays['date'] = cudf.to_datetime(holidays['date'])\n", "holidays.drop(['Description'],axis=1,inplace=True)\n", "holidays['Holiday'] = 1\n", "holidays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are ready to merge the tables using the commond `date` column. We want keep every element from the gdf_bw table (our *left* table), so we use a left join. Hint: use [merge](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.core.dataframe.DataFrame.merge) with the `on` and `how` attributes" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " cnt date hr year month Hour Temperature \\\n", "0 297 2011-12-03 16 0 12 2011-12-03 16:00:00 0.367347 \n", "1 229 2011-12-03 17 0 12 2011-12-03 17:00:00 0.346939 \n", "2 220 2011-12-03 18 0 12 2011-12-03 18:00:00 0.326531 \n", "3 174 2011-12-03 19 0 12 2011-12-03 19:00:00 0.285714 \n", "4 124 2011-12-03 20 0 12 2011-12-03 20:00:00 0.285714 \n", "... ... ... .. ... ... ... ... \n", "17373 15 2011-12-25 19 0 12 2011-12-25 19:00:00 0.306122 \n", "17374 25 2011-12-25 20 0 12 2011-12-25 20:00:00 0.306122 \n", "17375 18 2011-12-25 21 0 12 2011-12-25 21:00:00 0.285714 \n", "17376 17 2011-12-25 22 0 12 2011-12-25 22:00:00 0.265306 \n", "17377 16 2011-12-25 23 0 12 2011-12-25 23:00:00 0.244898 \n", "\n", " Humidity Wind Weather Weekday Holiday \n", "0 0.46 0.000000 0 5 \n", "1 0.62 0.000000 0 5 \n", "2 0.53 0.157870 0 5 \n", "3 0.61 0.105325 0 5 \n", "4 0.61 0.105325 0 5 \n", "... ... ... ... ... ... \n", "17373 0.56 0.157870 0 6 \n", "17374 0.49 0.105325 0 6 \n", "17375 0.56 0.157870 0 6 \n", "17376 0.61 0.193018 0 6 \n", "17377 0.65 0.157870 0 6 \n", "\n", "[17378 rows x 12 columns]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### TODO merge tables and on the column 'date', use a left merge \n", "gdf = gdf_bw.merge(holidays, on='date', how='left')\n", "\n", "# inspect the result\n", "gdf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We reset the index to 'Hour' and sort the table accordingly. Notice that most of the rows in the 'Holiday' column are filled with ``, only the dates that appeared in the holiday table are filled with 1. We shall fill the empty fields with zero." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " cnt date hr year month Temperature Humidity \\\n", "Hour \n", "2011-01-01 00:00:00 16 2011-01-01 0 0 1 0.224490 0.81 \n", "2011-01-01 01:00:00 38 2011-01-01 1 0 1 0.204082 0.80 \n", "2011-01-01 02:00:00 31 2011-01-01 2 0 1 0.204082 0.80 \n", "2011-01-01 03:00:00 12 2011-01-01 3 0 1 0.224490 0.75 \n", "2011-01-01 04:00:00 1 2011-01-01 4 0 1 0.224490 0.75 \n", "... ... ... .. ... ... ... ... \n", "2012-12-31 19:00:00 118 2012-12-31 19 1 12 0.244898 0.60 \n", "2012-12-31 20:00:00 89 2012-12-31 20 1 12 0.244898 0.60 \n", "2012-12-31 21:00:00 90 2012-12-31 21 1 12 0.244898 0.60 \n", "2012-12-31 22:00:00 61 2012-12-31 22 1 12 0.244898 0.56 \n", "2012-12-31 23:00:00 50 2012-12-31 23 1 12 0.244898 0.65 \n", "\n", " Wind Weather Weekday Holiday \n", "Hour \n", "2011-01-01 00:00:00 0.000000 0 5 0 \n", "2011-01-01 01:00:00 0.000000 0 5 0 \n", "2011-01-01 02:00:00 0.000000 0 5 0 \n", "2011-01-01 03:00:00 0.000000 0 5 0 \n", "2011-01-01 04:00:00 0.000000 0 5 0 \n", "... ... ... ... ... \n", "2012-12-31 19:00:00 0.193018 3 0 0 \n", "2012-12-31 20:00:00 0.193018 3 0 0 \n", "2012-12-31 21:00:00 0.193018 0 0 0 \n", "2012-12-31 22:00:00 0.157870 0 0 0 \n", "2012-12-31 23:00:00 0.157870 0 0 0 \n", "\n", "[17378 rows x 11 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf = gdf.set_index('Hour')\n", "gdf = gdf.sort_index()\n", "\n", "### TODO fill empty holiday values with zero\n", "gdf['Holiday'] = gdf['Holiday'].fillna(0)\n", "gdf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we create a workingday feature. Assuming that the first five days of the week are working days, one could do that simply with the following operation:\n", "```\n", "gdf['Workingday'] = (gdf['Weekday'] < 5) & (gdf['Holiday']!=1)\n", "```\n", "But we could do it with user defined functions too. Previously we have only used UDF to process elements of a series. 
Now we will process rows of a dataframe and\n", "combine the 'Weekday' and 'Holiday' columns to calculate the new feature 'Workingday'.\n", "\n", "More on user defined functions in our [blog](https://medium.com/rapids-ai/user-defined-functions-in-rapids-cudf-2d7c3fc2728d) and in the [documentation](https://docs.rapids.ai/api/cudf/nightly/guide-to-udfs.html)." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "def workday_kernel(Weekday, Holiday, Workingday):\n", "    for i, (w, h) in enumerate(zip(Weekday, Holiday)):\n", "        # variable w will take values from the Weekday column\n", "        # variable h will take values from the Holiday column\n", "        Workingday[i] = w < 5 and h != 1" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "gdf = gdf.apply_rows(workday_kernel, incols=['Weekday', 'Holiday'], outcols=dict(Workingday=np.float64), kwargs=dict())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After this step we no longer need the 'Holiday' and 'date' columns, so we can drop them." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "gdf = gdf.drop(['Holiday', 'date'],axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 One-hot encoding\n", "\n", "We now have all the data in a single table, but we still want to change the encoding of some columns. 
We're going to create one-hot encoded variables, also known as dummy variables, for each of the time variables as well as the weather situation.\n", "\n", "\n", "A summary from https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/:\n", "\n", "\"The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.\n", "For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.\n", "\n", "In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).\n", "\n", "In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.\n", "\"\n", "\n", "We start by one-hot encoding the 'Weather' column using the [one_hot_encoding](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.core.dataframe.DataFrame.one_hot_encoding) method of the cuDF DataFrame. This is very similar to the [get_dummies](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.core.reshape.get_dummies) function (which might be more familiar to Pandas users), but one_hot_encoding works on a single input column and returns the dataframe with the new dummy columns appended. " ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " cnt hr year month Temperature Humidity Wind \\\n", "Hour \n", "2011-01-01 00:00:00 16 0 0 1 0.224490 0.81 0.0 \n", "2011-01-01 01:00:00 38 1 0 1 0.204082 0.80 0.0 \n", "2011-01-01 02:00:00 31 2 0 1 0.204082 0.80 0.0 \n", "\n", " Weather Weekday Workingday Weather_dummy_0 \\\n", "Hour \n", "2011-01-01 00:00:00 0 5 0.0 1.0 \n", "2011-01-01 01:00:00 0 5 0.0 1.0 \n", "2011-01-01 02:00:00 0 5 0.0 1.0 \n", "\n", " Weather_dummy_1 Weather_dummy_2 Weather_dummy_3 \n", "Hour \n", "2011-01-01 00:00:00 0.0 0.0 0.0 \n", "2011-01-01 01:00:00 0.0 0.0 0.0 \n", "2011-01-01 02:00:00 0.0 0.0 0.0 " ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "codes = gdf['Weather'].unique()\n", "gdf = gdf.one_hot_encoding('Weather', 'Weather_dummy', codes)\n", "# Inspect the results\n", "gdf.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're going to drop the original variable as well as one of the new dummy variables so we don't create colinearity (more about this problem [here](https://towardsdatascience.com/one-hot-encoding-multicollinearity-and-the-dummy-variable-trap-b5840be3c41a))." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "gdf = gdf.drop(['Weather', 'Weather_dummy_1'],axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a copy of the dataset. It will make it easier to start over in case something would go wrong during the next excercise. 
" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "gdf_backup = gdf.copy()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "dummies_list = ['month', 'hr', 'Weekday']\n", "\n", "gdf = gdf_backup.copy()\n", "\n", "for item in dummies_list:\n", " ### Todo implement one-hot encoding for item\n", " codes = gdf[item].unique()\n", " gdf = gdf.one_hot_encoding(item, item + '_dummy', codes)\n", " gdf = gdf.drop('{}_dummy_1'.format(item),axis=1)\n", " gdf = gdf.drop(item,axis=1) # drop the original item" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cntyearTemperatureHumidityWindWorkingdayWeather_dummy_0Weather_dummy_2Weather_dummy_3month_dummy_2...hr_dummy_20hr_dummy_21hr_dummy_22hr_dummy_23Weekday_dummy_0Weekday_dummy_2Weekday_dummy_3Weekday_dummy_4Weekday_dummy_5Weekday_dummy_6
Hour
2011-01-01 00:00:001600.2244900.810.0000000.01.00.00.00.0...0.00.00.00.00.00.00.00.01.00.0
2011-01-01 01:00:003800.2040820.800.0000000.01.00.00.00.0...0.00.00.00.00.00.00.00.01.00.0
2011-01-01 02:00:003100.2040820.800.0000000.01.00.00.00.0...0.00.00.00.00.00.00.00.01.00.0
2011-01-01 03:00:001200.2244900.750.0000000.01.00.00.00.0...0.00.00.00.00.00.00.00.01.00.0
2011-01-01 04:00:00100.2244900.750.0000000.01.00.00.00.0...0.00.00.00.00.00.00.00.01.00.0
..................................................................
2012-12-31 19:00:0011810.2448980.600.1930181.00.00.01.00.0...0.00.00.00.01.00.00.00.00.00.0
2012-12-31 20:00:008910.2448980.600.1930181.00.00.01.00.0...1.00.00.00.01.00.00.00.00.00.0
2012-12-31 21:00:009010.2448980.600.1930181.01.00.00.00.0...0.01.00.00.01.00.00.00.00.00.0
2012-12-31 22:00:006110.2448980.560.1578701.01.00.00.00.0...0.00.01.00.01.00.00.00.00.00.0
2012-12-31 23:00:005010.2448980.650.1578701.01.00.00.00.0...0.00.00.01.01.00.00.00.00.00.0
\n", "

17378 rows × 49 columns

\n", "
" ], "text/plain": [ " cnt year Temperature Humidity Wind Workingday \\\n", "Hour \n", "2011-01-01 00:00:00 16 0 0.224490 0.81 0.000000 0.0 \n", "2011-01-01 01:00:00 38 0 0.204082 0.80 0.000000 0.0 \n", "2011-01-01 02:00:00 31 0 0.204082 0.80 0.000000 0.0 \n", "2011-01-01 03:00:00 12 0 0.224490 0.75 0.000000 0.0 \n", "2011-01-01 04:00:00 1 0 0.224490 0.75 0.000000 0.0 \n", "... ... ... ... ... ... ... \n", "2012-12-31 19:00:00 118 1 0.244898 0.60 0.193018 1.0 \n", "2012-12-31 20:00:00 89 1 0.244898 0.60 0.193018 1.0 \n", "2012-12-31 21:00:00 90 1 0.244898 0.60 0.193018 1.0 \n", "2012-12-31 22:00:00 61 1 0.244898 0.56 0.157870 1.0 \n", "2012-12-31 23:00:00 50 1 0.244898 0.65 0.157870 1.0 \n", "\n", " Weather_dummy_0 Weather_dummy_2 Weather_dummy_3 \\\n", "Hour \n", "2011-01-01 00:00:00 1.0 0.0 0.0 \n", "2011-01-01 01:00:00 1.0 0.0 0.0 \n", "2011-01-01 02:00:00 1.0 0.0 0.0 \n", "2011-01-01 03:00:00 1.0 0.0 0.0 \n", "2011-01-01 04:00:00 1.0 0.0 0.0 \n", "... ... ... ... \n", "2012-12-31 19:00:00 0.0 0.0 1.0 \n", "2012-12-31 20:00:00 0.0 0.0 1.0 \n", "2012-12-31 21:00:00 1.0 0.0 0.0 \n", "2012-12-31 22:00:00 1.0 0.0 0.0 \n", "2012-12-31 23:00:00 1.0 0.0 0.0 \n", "\n", " month_dummy_2 ... hr_dummy_20 hr_dummy_21 \\\n", "Hour ... \n", "2011-01-01 00:00:00 0.0 ... 0.0 0.0 \n", "2011-01-01 01:00:00 0.0 ... 0.0 0.0 \n", "2011-01-01 02:00:00 0.0 ... 0.0 0.0 \n", "2011-01-01 03:00:00 0.0 ... 0.0 0.0 \n", "2011-01-01 04:00:00 0.0 ... 0.0 0.0 \n", "... ... ... ... ... \n", "2012-12-31 19:00:00 0.0 ... 0.0 0.0 \n", "2012-12-31 20:00:00 0.0 ... 1.0 0.0 \n", "2012-12-31 21:00:00 0.0 ... 0.0 1.0 \n", "2012-12-31 22:00:00 0.0 ... 0.0 0.0 \n", "2012-12-31 23:00:00 0.0 ... 0.0 0.0 \n", "\n", " hr_dummy_22 hr_dummy_23 Weekday_dummy_0 \\\n", "Hour \n", "2011-01-01 00:00:00 0.0 0.0 0.0 \n", "2011-01-01 01:00:00 0.0 0.0 0.0 \n", "2011-01-01 02:00:00 0.0 0.0 0.0 \n", "2011-01-01 03:00:00 0.0 0.0 0.0 \n", "2011-01-01 04:00:00 0.0 0.0 0.0 \n", "... ... ... ... 
\n", "2012-12-31 19:00:00 0.0 0.0 1.0 \n", "2012-12-31 20:00:00 0.0 0.0 1.0 \n", "2012-12-31 21:00:00 0.0 0.0 1.0 \n", "2012-12-31 22:00:00 1.0 0.0 1.0 \n", "2012-12-31 23:00:00 0.0 1.0 1.0 \n", "\n", " Weekday_dummy_2 Weekday_dummy_3 Weekday_dummy_4 \\\n", "Hour \n", "2011-01-01 00:00:00 0.0 0.0 0.0 \n", "2011-01-01 01:00:00 0.0 0.0 0.0 \n", "2011-01-01 02:00:00 0.0 0.0 0.0 \n", "2011-01-01 03:00:00 0.0 0.0 0.0 \n", "2011-01-01 04:00:00 0.0 0.0 0.0 \n", "... ... ... ... \n", "2012-12-31 19:00:00 0.0 0.0 0.0 \n", "2012-12-31 20:00:00 0.0 0.0 0.0 \n", "2012-12-31 21:00:00 0.0 0.0 0.0 \n", "2012-12-31 22:00:00 0.0 0.0 0.0 \n", "2012-12-31 23:00:00 0.0 0.0 0.0 \n", "\n", " Weekday_dummy_5 Weekday_dummy_6 \n", "Hour \n", "2011-01-01 00:00:00 1.0 0.0 \n", "2011-01-01 01:00:00 1.0 0.0 \n", "2011-01-01 02:00:00 1.0 0.0 \n", "2011-01-01 03:00:00 1.0 0.0 \n", "2011-01-01 04:00:00 1.0 0.0 \n", "... ... ... \n", "2012-12-31 19:00:00 0.0 0.0 \n", "2012-12-31 20:00:00 0.0 0.0 \n", "2012-12-31 21:00:00 0.0 0.0 \n", "2012-12-31 22:00:00 0.0 0.0 \n", "2012-12-31 23:00:00 0.0 0.0 \n", "\n", "[17378 rows x 49 columns]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.5 Save the prepared dataset" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "gdf.to_csv('../../../data/bike_sharing.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Predict bike rentals with cuML\n", "\n", "cuML is a GPU accelerated machine learning library. 
cuML's Python API mirrors the [Scikit-Learn](https://scikit-learn.org/stable/) API.\n", "\n", "cuML currently requires all data to be of the same type, so the loop below converts every column to float64." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "import cuml" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "for col in gdf.columns:\n", "    gdf[col] = gdf[col].astype('float64')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Prepare training and test data\n", "It is customary to denote the input feature matrix with X, and the target that we want to predict with y. We separate the target column 'cnt' from the rest of the table." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "y = gdf['cnt']\n", "X = gdf.drop('cnt',axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's split the data randomly into a train and a test set." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = cuml.preprocessing.model_selection.train_test_split(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Linear regression\n", "Our first model is a simple linear regression, which tries to predict the output as a linear combination of the input features:\n", "\n", "$$\\hat{y} = w_0 + w_1 x_1 + \\dots + w_p x_p$$\n", "\n", "Here $x_1, \\dots, x_p$ are the input features and the weights $w_0, \\dots, w_p$ are learned from the training data." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LinearRegression(algorithm='eig', fit_intercept=True, normalize=False, handle=, verbose=4, output_type='cudf')" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg = cuml.LinearRegression()\n", "reg.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now make predictions with the trained model." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": 
[], "source": [ "y_hat = reg.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can visualize how well the model works. Let's plot the data for the first days of June 2011:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "import datetime as dt\n", "import matplotlib.pyplot as plt\n", "\n", "def plot_timerange(model, start, end):\n", "    # Select the rows whose timestamp falls inside [start, end]\n", "    start_date = dt.datetime.strptime(start, '%Y-%m-%d')\n", "    end_date = dt.datetime.strptime(end, '%Y-%m-%d')\n", "    idx = (X.index >= start_date) & (X.index <= end_date)\n", "    X_may = X[idx]\n", "    y_may = y[idx]\n", "\n", "    ### TODO predict the values for X_may (1 line of code)\n", "    # Use the model passed in as an argument, not the global reg\n", "    y_pred = model.predict(X_may)\n", "\n", "    # Solid line: observed counts, dashed line: predictions\n", "    plt.plot(X_may.index.to_array(), y_may.to_array())\n", "    plt.plot(X_may.index.to_array(), y_pred.to_array(), '--')" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_timerange(reg, '2011-06-01', '2011-06-05')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often we are interested in a single score. The default score method for regression problems is the [r2_score](https://en.wikipedia.org/wiki/Coefficient_of_determination)." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train score 0.6813958127305259\n", "test score 0.6820591407880712\n" ] } ], "source": [ "train_score = reg.score(X_train, y_train)\n", "### TODO calculate test score (the score on X_test, ~ 1 line of code)\n", "test_score = reg.score(X_test, y_test)\n", "\n", "print('train score', train_score)\n", "print('test score', test_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Save and load the trained model\n", "We can pickle any cuML model" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "import pickle\n", "pickle_file = 'my_model.pickle'\n", "\n", "with open(pickle_file, 'wb') as pf:\n", " pickle.dump(reg, pf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the saved model" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded model score 0.6820591407880712\n", "Original model score 0.6820591407880712\n" ] } ], "source": [ "with open(pickle_file, 'rb') as pf:\n", " loaded_model = pickle.load(pf)\n", "\n", "print('Loaded model score', loaded_model.score(X_test, y_test))\n", "print('Original model score', reg.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.4 Ridge regression with hyperparameter tuning\n", "Ridge regression is a linear regression model with an added L2 regularization term. 
Regularization is often used in practice to avoid overfitting. The strength of the regularization is set by the alpha hyperparameter. \n", "We're going to do a small hyperparameter search for alpha, checking 99 values between 0.01 and 0.99. This is fast to do with RAPIDS. Note that we store the score of each Ridge model in the same dictionary as the earlier OLS score, so that it is easy to see at the end which model is best. " ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "output = {'score_OLS': test_score}\n", "\n", "for alpha in np.arange(0.01, 1, 0.01):  # alpha has to be positive\n", "    ridge = cuml.Ridge(alpha=alpha, fit_intercept=True)\n", "    ### TODO fit the model and calculate the test score (2 lines of code)\n", "    ridge.fit(X_train, y_train)\n", "    score = ridge.score(X_test, y_test)\n", "    ### END EXERCISE ###\n", "    output['score_RIDGE_{}'.format(alpha)] = score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see that a regularized model scores better than the rest, including OLS with all the variables. " ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Max score: score_RIDGE_0.33\n" ] } ], "source": [ "print('Max score: {}'.format(max(output, key=output.get)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.5 Additional cuML models (Optional)\n", "#### 3.5.1 Support vector regression\n", "\n", "Support vector regression is a more complex model, with an execution time that scales at least as O(n_rows^2). RAPIDS cuML includes a fast SVM solver that makes it feasible to run SVM on larger datasets." 
] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 276 ms, sys: 160 ms, total: 437 ms\n", "Wall time: 436 ms\n" ] }, { "data": { "text/plain": [ "0.8799074097909845" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "reg = cuml.svm.SVR(kernel='rbf', gamma=0.1, C=100, epsilon=0.1)\n", "## Todo\n", "reg.fit(X_train, y_train)\n", "reg.score(X_train, y_train)\n", "reg.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use scikit-learn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to perform a hyperparameter search. GridSearchCV requires the input data to be a host array. Fortunately cuML is flexible with the [input format](https://medium.com/rapids-ai/input-and-output-configurability-in-rapids-cuml-e719d72c135b), and we can pass NumPy arrays directly to it (at the cost of additional host-to-device copies, because under the hood cuML copies the data to the GPU). If the data size is reasonably small, we can pay the price of the additional data movement and combine the convenience of GridSearchCV with the speed of cuML algorithms." 
] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "param_grid = [ {'C': [0.01, 0.1, 1, 10, 100], 'gamma': [10, 1, 0.1, 0.01, 0.001], 'kernel': ['rbf']} ]" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "X_train_np = X_train.as_matrix()\n", "y_train_np = y_train.to_array()\n", "X_test_np = X_test.as_matrix() \n", "y_test_np = y_test.to_array()" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best parameters set found on development set:\n", "\n", "{'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}\n", "\n", "Grid scores on development set:\n", "\n", "-0.067 (+/-0.011) for {'C': 0.01, 'gamma': 10, 'kernel': 'rbf'}\n", "-0.066 (+/-0.011) for {'C': 0.01, 'gamma': 1, 'kernel': 'rbf'}\n", "-0.061 (+/-0.011) for {'C': 0.01, 'gamma': 0.1, 'kernel': 'rbf'}\n", "-0.066 (+/-0.011) for {'C': 0.01, 'gamma': 0.01, 'kernel': 'rbf'}\n", "-0.067 (+/-0.011) for {'C': 0.01, 'gamma': 0.001, 'kernel': 'rbf'}\n", "-0.067 (+/-0.011) for {'C': 0.1, 'gamma': 10, 'kernel': 'rbf'}\n", "-0.060 (+/-0.011) for {'C': 0.1, 'gamma': 1, 'kernel': 'rbf'}\n", "-0.012 (+/-0.011) for {'C': 0.1, 'gamma': 0.1, 'kernel': 'rbf'}\n", "-0.057 (+/-0.011) for {'C': 0.1, 'gamma': 0.01, 'kernel': 'rbf'}\n", "-0.066 (+/-0.011) for {'C': 0.1, 'gamma': 0.001, 'kernel': 'rbf'}\n", "-0.063 (+/-0.011) for {'C': 1, 'gamma': 10, 'kernel': 'rbf'}\n", "0.001 (+/-0.011) for {'C': 1, 'gamma': 1, 'kernel': 'rbf'}\n", "0.273 (+/-0.014) for {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}\n", "0.023 (+/-0.009) for {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}\n", "-0.056 (+/-0.011) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}\n", "-0.025 (+/-0.012) for {'C': 10, 'gamma': 10, 'kernel': 'rbf'}\n", "0.377 (+/-0.017) for {'C': 10, 'gamma': 1, 'kernel': 'rbf'}\n", "0.669 (+/-0.013) for {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}\n", "0.366 (+/-0.017) for {'C': 10, 'gamma': 0.01, 
'kernel': 'rbf'}\n", "0.028 (+/-0.009) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}\n", "0.248 (+/-0.022) for {'C': 100, 'gamma': 10, 'kernel': 'rbf'}\n", "0.831 (+/-0.024) for {'C': 100, 'gamma': 1, 'kernel': 'rbf'}\n", "0.876 (+/-0.011) for {'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}\n", "0.633 (+/-0.013) for {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}\n", "0.373 (+/-0.017) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}\n", "\n" ] } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "reg = GridSearchCV(cuml.svm.SVR(), param_grid, scoring='r2' )\n", "\n", "reg.fit(X_train_np, y_train_np)\n", "\n", "print(\"Best parameters set found on development set:\")\n", "print()\n", "print(reg.best_params_)\n", "print()\n", "print(\"Grid scores on development set:\")\n", "print()\n", "means = reg.cv_results_['mean_test_score']\n", "stds = reg.cv_results_['std_test_score']\n", "for mean, std, params in zip(means, stds, reg.cv_results_['params']):\n", " print(\"%0.3f (+/-%0.03f) for %r\" % (mean, std * 2, params))\n", "print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.5.2 KNN Regression\n", "k-Nearest Neighbors regression is a machine learning technique that predicts an unknown observation by using the k most similar known observations in the training dataset." 
] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 60.3 ms, sys: 261 ms, total: 321 ms\n", "Wall time: 320 ms\n" ] }, { "data": { "text/plain": [ "0.7120309195645623" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "### TODO tune the n_neighbors hyperparameter to achieve better performance\n", "knn = cuml.neighbors.KNeighborsRegressor(n_neighbors=8)\n", "knn.fit(X_train, y_train, convert_dtype=True)\n", "pred = knn.predict(X_test)\n", "knn.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compare the execution time of training KNN with cuML and with scikit-learn" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "import sklearn\n", "import sklearn.neighbors" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.83 s, sys: 0 ns, total: 5.83 s\n", "Wall time: 5.83 s\n" ] }, { "data": { "text/plain": [ "0.7119398662494515" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "knn = sklearn.neighbors.KNeighborsRegressor(n_neighbors=8)\n", "knn.fit(X_train_np, y_train_np,)\n", "pred = knn.predict(X_test_np)\n", "knn.score(X_test_np, y_test_np)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.6 XGBoost (Optional)\n", "RAPIDS integrates seamlessly with the XGBoost library. 
Here is how to use it for our example" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "import xgboost as xgb" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "xgr=xgb.XGBRegressor(max_depth=8,min_child_weight=6,gamma=0.4)\n", "dtrain = xgb.DMatrix(X_train, label=y_train)\n", "dtest = xgb.DMatrix(X_test, label=y_test)" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'tree_method': 'gpu_hist', 'objective': 'reg:squarederror'}\n" ] } ], "source": [ "# instantiate params\n", "params = {}\n", "\n", "# general params\n", "general_params = {}\n", "params.update(general_params)\n", "\n", "# booster params\n", "booster_params = {'tree_method': 'gpu_hist'}\n", "params.update(booster_params)\n", "\n", "# learning task params\n", "learning_task_params = {'objective': 'reg:squarederror'}\n", "params.update(learning_task_params)\n", "print(params)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "evallist = [(dtest, 'test'), (dtrain, 'train')]\n", "num_round = 100" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0]\ttest-rmse:199.18784\ttrain-rmse:200.30351\n", "[1]\ttest-rmse:162.09444\ttrain-rmse:162.71335\n", "[2]\ttest-rmse:137.25905\ttrain-rmse:136.83050\n", "[3]\ttest-rmse:120.35908\ttrain-rmse:119.77563\n", "[4]\ttest-rmse:107.99054\ttrain-rmse:107.15930\n", "[5]\ttest-rmse:100.17688\ttrain-rmse:99.31792\n", "[6]\ttest-rmse:94.51825\ttrain-rmse:93.23972\n", "[7]\ttest-rmse:90.02152\ttrain-rmse:88.41567\n", "[8]\ttest-rmse:87.24374\ttrain-rmse:85.44088\n", "[9]\ttest-rmse:84.38741\ttrain-rmse:82.57428\n", "[10]\ttest-rmse:81.15561\ttrain-rmse:78.72398\n", "[11]\ttest-rmse:78.11271\ttrain-rmse:74.94260\n", "[12]\ttest-rmse:76.49943\ttrain-rmse:72.95665\n", 
"[13]\ttest-rmse:74.66074\ttrain-rmse:70.91374\n", "[14]\ttest-rmse:72.05053\ttrain-rmse:67.98150\n", "[15]\ttest-rmse:70.95134\ttrain-rmse:66.81727\n", "[16]\ttest-rmse:70.03428\ttrain-rmse:65.69633\n", "[17]\ttest-rmse:67.92854\ttrain-rmse:63.02364\n", "[18]\ttest-rmse:67.05556\ttrain-rmse:61.93159\n", "[19]\ttest-rmse:65.70933\ttrain-rmse:60.53908\n", "[20]\ttest-rmse:65.15248\ttrain-rmse:59.73316\n", "[21]\ttest-rmse:64.53308\ttrain-rmse:58.98966\n", "[22]\ttest-rmse:63.51178\ttrain-rmse:57.82574\n", "[23]\ttest-rmse:62.71511\ttrain-rmse:56.73420\n", "[24]\ttest-rmse:62.33528\ttrain-rmse:56.03647\n", "[25]\ttest-rmse:61.82107\ttrain-rmse:55.35518\n", "[26]\ttest-rmse:61.35674\ttrain-rmse:54.68117\n", "[27]\ttest-rmse:61.01687\ttrain-rmse:54.21539\n", "[28]\ttest-rmse:60.57789\ttrain-rmse:53.62215\n", "[29]\ttest-rmse:59.97852\ttrain-rmse:52.79226\n", "[30]\ttest-rmse:59.75838\ttrain-rmse:52.38144\n", "[31]\ttest-rmse:59.33269\ttrain-rmse:51.76035\n", "[32]\ttest-rmse:59.00826\ttrain-rmse:51.20699\n", "[33]\ttest-rmse:58.75043\ttrain-rmse:50.73730\n", "[34]\ttest-rmse:58.22901\ttrain-rmse:49.93842\n", "[35]\ttest-rmse:57.82102\ttrain-rmse:49.37877\n", "[36]\ttest-rmse:57.50741\ttrain-rmse:48.89431\n", "[37]\ttest-rmse:57.35290\ttrain-rmse:48.60536\n", "[38]\ttest-rmse:57.03769\ttrain-rmse:48.08765\n", "[39]\ttest-rmse:56.89780\ttrain-rmse:47.75967\n", "[40]\ttest-rmse:56.47898\ttrain-rmse:47.12102\n", "[41]\ttest-rmse:56.33428\ttrain-rmse:46.83911\n", "[42]\ttest-rmse:56.12777\ttrain-rmse:46.53151\n", "[43]\ttest-rmse:55.93485\ttrain-rmse:46.22023\n", "[44]\ttest-rmse:55.77229\ttrain-rmse:45.91986\n", "[45]\ttest-rmse:55.61733\ttrain-rmse:45.64424\n", "[46]\ttest-rmse:55.41594\ttrain-rmse:45.24457\n", "[47]\ttest-rmse:55.28618\ttrain-rmse:45.01399\n", "[48]\ttest-rmse:55.17868\ttrain-rmse:44.72273\n", "[49]\ttest-rmse:55.03380\ttrain-rmse:44.47870\n", "[50]\ttest-rmse:54.97081\ttrain-rmse:44.23529\n", "[51]\ttest-rmse:54.90307\ttrain-rmse:44.03345\n", 
"[52]\ttest-rmse:54.76747\ttrain-rmse:43.59862\n", "[53]\ttest-rmse:54.72055\ttrain-rmse:43.33952\n", "[54]\ttest-rmse:54.64332\ttrain-rmse:43.20725\n", "[55]\ttest-rmse:54.60225\ttrain-rmse:43.05660\n", "[56]\ttest-rmse:54.43866\ttrain-rmse:42.85313\n", "[57]\ttest-rmse:54.35900\ttrain-rmse:42.68765\n", "[58]\ttest-rmse:54.27787\ttrain-rmse:42.53922\n", "[59]\ttest-rmse:54.06460\ttrain-rmse:42.21499\n", "[60]\ttest-rmse:54.02811\ttrain-rmse:42.03825\n", "[61]\ttest-rmse:53.99411\ttrain-rmse:41.78006\n", "[62]\ttest-rmse:53.94808\ttrain-rmse:41.66055\n", "[63]\ttest-rmse:53.85906\ttrain-rmse:41.51670\n", "[64]\ttest-rmse:53.75808\ttrain-rmse:41.27258\n", "[65]\ttest-rmse:53.68151\ttrain-rmse:41.04708\n", "[66]\ttest-rmse:53.60903\ttrain-rmse:40.91206\n", "[67]\ttest-rmse:53.47453\ttrain-rmse:40.72940\n", "[68]\ttest-rmse:53.45846\ttrain-rmse:40.52702\n", "[69]\ttest-rmse:53.41235\ttrain-rmse:40.39108\n", "[70]\ttest-rmse:53.30926\ttrain-rmse:40.24170\n", "[71]\ttest-rmse:53.24992\ttrain-rmse:40.05484\n", "[72]\ttest-rmse:53.20187\ttrain-rmse:39.83406\n", "[73]\ttest-rmse:53.15774\ttrain-rmse:39.66462\n", "[74]\ttest-rmse:53.13012\ttrain-rmse:39.50941\n", "[75]\ttest-rmse:53.05473\ttrain-rmse:39.39987\n", "[76]\ttest-rmse:53.01971\ttrain-rmse:39.26834\n", "[77]\ttest-rmse:52.90548\ttrain-rmse:39.09363\n", "[78]\ttest-rmse:52.85016\ttrain-rmse:38.97043\n", "[79]\ttest-rmse:52.78326\ttrain-rmse:38.87738\n", "[80]\ttest-rmse:52.68287\ttrain-rmse:38.64835\n", "[81]\ttest-rmse:52.63625\ttrain-rmse:38.45353\n", "[82]\ttest-rmse:52.60709\ttrain-rmse:38.39142\n", "[83]\ttest-rmse:52.54555\ttrain-rmse:38.27810\n", "[84]\ttest-rmse:52.52675\ttrain-rmse:38.13488\n", "[85]\ttest-rmse:52.47684\ttrain-rmse:38.01691\n", "[86]\ttest-rmse:52.44044\ttrain-rmse:37.93409\n", "[87]\ttest-rmse:52.38221\ttrain-rmse:37.81918\n", "[88]\ttest-rmse:52.37294\ttrain-rmse:37.68341\n", "[89]\ttest-rmse:52.34564\ttrain-rmse:37.59253\n", "[90]\ttest-rmse:52.32901\ttrain-rmse:37.52166\n", 
"[91]\ttest-rmse:52.24760\ttrain-rmse:37.38437\n", "[92]\ttest-rmse:52.21513\ttrain-rmse:37.26507\n", "[93]\ttest-rmse:52.19221\ttrain-rmse:37.18996\n", "[94]\ttest-rmse:52.19040\ttrain-rmse:36.98294\n", "[95]\ttest-rmse:52.17594\ttrain-rmse:36.85601\n", "[96]\ttest-rmse:52.16236\ttrain-rmse:36.73325\n", "[97]\ttest-rmse:52.15352\ttrain-rmse:36.65076\n", "[98]\ttest-rmse:52.08389\ttrain-rmse:36.48926\n", "[99]\ttest-rmse:52.05710\ttrain-rmse:36.43145\n" ] } ], "source": [ "bst = xgb.train(params, dtrain, num_round, evallist)\n" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "y_pred = bst.predict(dtest)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [], "source": [ "import cupy as cp\n", "from cuml.metrics.regression import r2_score\n", "y_pred_cp = cp.asarray(y_pred)\n", "y_test_cp = cp.asarray(y_test).astype(np.float32)" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9151233434677124" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r2_score(y_test_cp, y_pred_cp)" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_timerange(reg, '2011-06-01', '2011-06-05')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Licensing\n", " \n", "This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Previous Notebook](Challenge.ipynb)\n", "     \n", "     \n", "     \n", "     \n", "[1](Challenge.ipynb)\n", "[2]\n", "     \n", "     \n", "     \n", "     \n", "\n", "\n", "     \n", "     \n", "     \n", "     \n", "     \n", "   \n", "[Home Page](../../START_HERE.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 4 }