{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "     \n", "     \n", "     \n", "     \n", "     \n", "   \n", "[Home Page](../../START_HERE.ipynb)\n", "\n", "[Previous Notebook](Challenge.ipynb)\n", "     \n", "     \n", "     \n", "     \n", "[1](Challenge.ipynb)\n", "[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bike Rental Prediction Challenge- Workbook\n", "\n", "## 1. Introduction\n", "\n", "This notebook walks through an end-to-end GPU machine learning workflow where cuDF is used for processing the data and cuML is used to train machine learning models on it. \n", "\n", "After completing this excercise, you will be able to use cuDF to load data from disk, combine tables, scale features, use one-hote encoding and even write your own GPU kernels to efficiently transform feature columns. Additionaly you will learn how to pass this data to cuML, and how to train ML models on it. The trained model is saved and it will be used for prediction.\n", "\n", "It is not required that the user is familiar with cuDF or cuML. Since our aim is to go from ETL to ML training, a detailed introduction is out of scope for this notebook. We recommend [Introduction to cuDF](../../CuDF/01-Intro_to_cuDF.ipynb) for additional information.\n", "\n", "\n", "### 1.2. Problem statement\n", "\n", "We are trying to predict daily demand for short-term bike rentals made in 2011 and 2012. We will combine three data sources: bike rental information, historical weather data, and dates of public holidays. In Section 2 of this notebook we will use cuDF to combine these data into a single dataset that can be used as an input for machine learning algorithms. In Section 3 we train models using cuML to predict bike rentals.\n", "\n", "### 1.3 Why RAPIDS?\n", "\n", "Using the GPU accelerated libraries from RAPIDS greatly reduces the execution time of a data science workflow. This leads to faster iteration with data preparation and model selection, and overall a more efficient workflow. \n", "\n", "\n", "### 1.2.1 References\n", "\n", "This notebook is inspired by the [blog article](https://medium.com/rapids-ai/essential-machine-learning-with-linear-models-in-rapids-part-1-of-a-series-992fab0240da) from Paul Mahler and its accompanying [notebook](https://github.com/rapidsai-community/notebooks-contrib/blob/master/blog_notebooks/regression/regression_blog_notebook.ipynb). The dataset is prepared along the steps given by *Hadi Fanaee-T and Joao Gama* in their [paper](https://doi.org/10.1007/s13748-013-0040-3) *Event labeling combining ensemble detectors and background knowledge*. The exploratory data analysis notebooks by [Vivek Srinivasan](https://www.kaggle.com/viveksrinivasan/eda-ensemble-model-top-10-percentile) and [Mitesh Yadav](https://www.kaggle.com/miteshyadav/comprehensive-eda-with-xgboost-top-10-percentile) provided useful input for this excercise. \n", "\n", "First part of this notebook contains sections from [Introduction to cuDF](https://github.com/rapidsai-community/notebooks-contrib/blob/master/getting_started_notebooks/intro_tutorials/02_Introduction_to_cuDF.ipynb) by Paul Hendricks, which gives a concise introduction to cuDF, also discussing a few points not mentioned in this notebook. 
\n", "\n", "Dataset sources:\n", "- The bike sharing dataset is provided by [Capital Bike Share](https://www.capitalbikeshare.com/system-data).\n", "- The weather data is retrieved from the [Bike Sharing Dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset) hosted by the UCI Machine Learning repository.\n", "- The original source of the weather data is https://www.freemeteo.com.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Prepare dataset with cuDF\n", "Let's start by loading the necessary libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cudf\n", "import pandas as pd\n", "import numpy as np\n", "\n", "from datetime import datetime, timedelta\n", "import os\n", "import sys\n", "sys.path.insert(1, os.path.realpath(os.path.pardir))\n", "import importlib\n", "import utils\n", "importlib.reload(utils)\n", "from utils import fetch_bike_dataset, fetch_weather_dataset, read_bike_data_pandas\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Prepare weather data\n", "First, we will download the weather data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filename = fetch_weather_dataset()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "cuDF DataFrames are a tabular structure of data that reside on the GPU. We interface with these cuDF DataFrames in the same way we interface with Pandas DataFrames that reside on the CPU - with a few deviations. Load data from CSV file into a cuDF DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weather = cudf.read_csv(filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.1 Inspecting a cuDF DataFrame\n", "\n", "There are several ways to inspect a cuDF DataFrame. The first method is to enter the cuDF DataFrame directly into the REPL. This shows us an overview about the DatFrame including its type and metadata such as the number of rows or columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weather" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A second way to inspect a cuDF DataFrame is to wrap the object in a Python print function `print(weather)` function. This results in showing the rows and columns of the dataframe with simple formating.\n", "\n", "For very large dataframes, we often want to see the first couple rows. We can use the `head` method of a cuDF DataFrame to view the first N rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weather.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.2 Columns\n", "\n", "cuDF DataFrames store metadata such as information about columns or data types. We can access the columns of a cuDF DataFrame using the `.columns` attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(weather.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can modify the columns of a cuDF DataFrame by modifying the `columns` attribute. We can do this by setting that attribute equal to a list of strings representing the new columns. Let's shorten the two longest column names!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### TODO rename the relative temperature column to RTemp, and the relative humidity to Humidity\n", "#weather.columns = ['Hour', 'Temperature', 'Relative Temperature', 'Rel. Humidity', 'Wind', 'Weather']\n", "weather.columns = ['Hour', 'Temperature', 'RTemp', 'Humidity', 'Wind', 'Weather']\n", "weather.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.3 Series\n", "\n", "cuDF DataFrames are composed of rows and columns. Each column is represented using an object of type `Series`. For example, if we subset a cuDF DataFrame using just one column we will be returned an object of type `cudf.dataframe.series.Series`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "humidity = weather['Humidity']\n", "print(type(humidity))\n", "print(humidity)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also see a column of values on the left hand side with values 0, 1, 2, 3. These values represent the index of the Series.\n", "The DataFrame and Series objects have both an index attribute that will be useful for joining tables and also for selecting data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.4 Data Types\n", "\n", "We can also inspect the data types of the columns of a cuDF DataFrame using the `dtypes` attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(weather.dtypes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can modify the data types of the columns of a cuDF DataFrame by passing in a cuDF Series with a modified data type." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weather['Humidity'] = weather['Humidity'].astype(np.float64)\n", "print(weather.dtypes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The 'Weather' column provides a description of the weather condidions. We should mark it as a categorical column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weather['Weather'] = weather['Weather'].astype('category')\n", "weather['Weather']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After this step the numerical category codes can be accessed using the `.cat.codes` attribute of the column. We actually will not need the category labels, we just replace the 'Weather' column with the category codes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weather['Weather'] = weather['Weather'].cat.codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data type of the 'Hour' column is `object` which means a string. Let's convert this to a numeric value! This cannot be done with the `astype` method, you should use the [cudf.to_datetime](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.to_datetime) function!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### TODO convert the 'Hour' column from string to datetime\n", "weather['Hour'] = cudf.to_datetime(weather['Hour'])\n", "weather.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.2 Prepare features\n", "##### Operations with cudf Series\n", "We can perform mathematical operations on the Series data type. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "##### User defined functions (UDFs)\n", "We can also write custom functions to operate on the data. When cuDF executes a UDF, it gets just-in-time (JIT) compiled into a CUDA kernel (either explicitly or implicitly) and is run on the GPU. Let's write a function that scales the temperature!" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def scale_temp(T):\n", "    # Note that the Tmin and Tmax variables are captured at compilation time and remain constant afterwards\n", "    T = (T - Tmin) / (Tmax - Tmin)\n", "    return T" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The `applymap` function will call `scale_temp` on each element of the series." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weather['Temperature'] = weather['Temperature'].applymap(scale_temp)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's do the same min-max scaling for the wind data." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### TODO calculate the minimum and maximum values of the 'Wind' column (2 lines of code)\n", "Wmin = weather['Wind'].min()\n", "Wmax = weather['Wind'].max()\n", "\n", "print(Wmin, Wmax)\n", "\n", "### TODO define a scale_wind function and apply it on the Wind column (~ 2-3 lines of code)\n", "def scale_wind(w):\n", "    return (w - Wmin) / (Wmax - Wmin)\n", "\n", "### TODO apply the scale_wind function on the 'Wind' column\n", "weather['Wind'] = weather['Wind'].applymap(scale_wind)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's inspect the table; the Temperature, Wind and Humidity columns should now have values in the [0, 1] range." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weather.describe()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "##### Dropping Columns\n", "\n", "The relative temperature column is correlated with the temperature, so it will not give the ML model much extra information. We want to remove this column from our `DataFrame`. We can do so using the `drop` method with `inplace=True`. Note that this removes the column in place - meaning that the `DataFrame` we act on is modified." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weather.drop(['RTemp'], axis=1, inplace=True)\n", "weather" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "If we want to remove a column without modifying the original DataFrame, we can call `drop` without `inplace=True`. This returns a new DataFrame without that column (or columns), as sketched below." ] },
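{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A small sketch (the column choice is arbitrary): `weather` itself is untouched\n", "weather_no_wind = weather.drop(['Wind'], axis=1)\n", "print(weather_no_wind.columns)\n", "print(weather.columns)" ] },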
{ "cell_type": "markdown", "metadata": {}, "source": [ "##### Index\n", "\n", "Like `Series` objects, each `DataFrame` has an index attribute." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weather.index" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can use the index values to subset the `DataFrame`. Let's use this to plot the first 48 values. Before plotting, we have to transfer the data from GPU memory to system memory. We use the `to_array` method to return a copy of the data as a numpy array." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "selection = weather[weather.index < 48]\n", "plt.plot(selection['Hour'].to_array(), selection['Temperature'].to_array())\n", "plt.xlabel('Hour')\n", "plt.ylabel('Temperature [C]')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can also change the index. Our dataset has one entry for each hour, so one could set the 'Hour' column as the index by calling\n", "```\n", "weather = weather.set_index('Hour')\n", "```\n", "\n", "We do not perform this change right now, but we will use it later." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#weather = weather.set_index('Hour')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Prepare bike sharing data\n", "We start by downloading the data." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "files = fetch_bike_dataset([2011, 2012])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's read the first file to get an idea of the dataset." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cudf.read_csv(files[0])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We are only interested in the events when a bicycle was rented. Let us read only the 'Start date' column (column index 1) from all files, by specifying the `usecols` argument of [read_csv](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.io.csv.read_csv). We can use the `parse_dates` argument to parse the date string into a datetime variable, or the [to_datetime](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.to_datetime) function that we have used for the weather dataset. After all the tables are read, we will concatenate them.\n", "\n", "Note: one has to specify a list of columns [column1, column2] for the `usecols` argument." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def read_bike_data(files):\n", "    # Reads a list of files and concatenates them\n", "    tables = []\n", "    for filename in files:\n", "        ### TODO read column 1 ('Start date') from the CSV file, and convert it to datetime format\n", "        ### (1-2 lines of code)\n", "        tmp_df = cudf.read_csv(filename, parse_dates=[1], usecols=[1])\n", "\n", "        ### END TODO\n", "        tables.append(tmp_df)\n", "\n", "    merged_df = cudf.concat(tables, ignore_index=True)\n", "\n", "    # Sanity checks\n", "    if list(merged_df.columns) != ['Start date']:\n", "        raise ValueError(\"Error: incorrect set of columns read\")\n", "    if merged_df['Start date'].dtype != 'datetime64[ns]':\n", "        raise TypeError(\"Start date should be converted to datetime type\")\n", "\n", "    return merged_df" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We will also measure the execution time of reading and processing the data (you can execute the cell multiple times to get a more reliable measurement)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "bikes_raw = read_bike_data(files)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "For comparison, we can repeat the same operation on the CPU. We have prepared a helper function for that." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "bikes_raw_pd = read_bike_data_pandas(files)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We want to count the number of rental events in every hour. We will define a new feature where we remove the minutes and seconds part of the time stamp. Since pandas has a convenient `floor` function for this, we will convert the column to a pandas Series, transform it with the floor operation, and then put it back on the GPU." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bikes_raw['Hour'] = bikes_raw['Start date'].to_pandas().dt.floor('h')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We will aggregate the number of bicycle rental events for each hour. We use the [groupby](https://docs.rapids.ai/api/cudf/nightly/api.html#groupby) function. As a quick sanity check afterwards, the hourly counts should sum back to the total number of events." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bikes = bikes_raw.groupby('Hour').agg('count')\n", "bikes.columns = ['cnt']\n", "bikes.head(5)" ] },
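{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sanity check (added sketch): the aggregated counts should sum to the\n", "# total number of rental events we read from the CSV files\n", "print(bikes['cnt'].sum(), len(bikes_raw))" ] },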
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bikes['date'] = bikes.index.to_pandas().floor('D')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It will be usefull to define a set of additional features: hour of the day, day of month, month and year https://docs.rapids.ai/api/cudf/nightly/api.html#datetimeindex" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bikes['hr'] = bikes.index.hour\n", "\n", "### TODO add year and month features (~ 2 lines of code)\n", "bikes['year'] = bikes.index.year\n", "bikes['month'] = bikes.index.month" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove the offset from year" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bikes['year'] = bikes['year'] - 2011" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Visualize data\n", "It is a good practice to visulize the data. We will have to use the to_array() method to convert the cudF Series objects to numpy arrays that can be plotted." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot(bikes.index.to_array(), bikes['cnt'].to_array())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is hard to see much apart from the global trend. Let's have a look how the 'cnt' variable looks like as a function the 'month' and 'hr' features. We will use [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html) from the Seaborn package." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(nrows=1,ncols=2)\n", "fig.set_size_inches(12, 5)\n", "sns.boxplot(data=bikes.to_pandas(), y=\"cnt\",x=\"month\",orient=\"v\",ax=axes[0])\n", "sns.boxplot(data=bikes.to_pandas(), y=\"cnt\",x=\"hr\",orient=\"v\",ax=axes[1])\n", "axes[0].set(xlabel='Months', ylabel='Count',title=\"Box Plot On Count Across months\")\n", "axes[1].set(xlabel='Hour Of The Day', ylabel='Count',title=\"Box Plot On Count Across Hour Of The Day\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2.1 Combine weather data with bike rental data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf_bw = bikes.merge(weather, left_index=True, right_on='Hour', how='inner')\n", "\n", "# inspect the merged table\n", "gdf_bw" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the data is not sorted after the merge use the [sort_values](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.core.dataframe.DataFrame.sort_values) method to" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### TODO sort the table according to the index (1 line of code)\n", "gdf_bw = gdf_bw.sort_values(by='Hour')\n", "\n", "# Inspect the sorted table\n", "gdf_bw" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Add working day feature\n", "\n", "Apart from the weather, in important factor that influences people's daily activities is whether it is a working day or not. In this section we will create a working day feature. First we add the weekday as a new feature column. 
\n", "We can use the [weekday](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.core.series.DatetimeProperties.weekday) attribute of the [datetime](https://docs.rapids.ai/api/cudf/nightly/api.html#datetimeindex)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf_bw['Weekday'] = gdf_bw['date'].dt.weekday" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next create a table with all the holidays in Washington DC in 2011-2011" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "holidays = cudf.DataFrame({'date': ['2011-01-17', '2011-02-21', '2011-04-15', '2011-05-30', '2011-07-04', '2011-09-05', '2011-11-11', '2011-11-24', '2011-12-26', '2012-01-02', '2012-01-16', '2012-02-20', '2012-04-16', '2012-05-28', '2012-07-04', '2012-09-03', '2012-11-12', '2012-11-22', '2012-12-25'],\n", "'Description': [\"Martin Luther King Jr. Day\", \"Washington's Birthday\", \"Emancipation Day\", \"Memorial Day\", \"Independence Day\", \"Labor Day\", \"Veterans Day\", \"Thanksgiving\", \"Christmas Day\", \n", "\"New Year's Day\", \"Martin Luther King Jr. Day\", \"Washington's Birthday\", \"Emancipation Day\", \"Memorial Day\", \"Independence Day\", \"Labor Day\", \"Veterans Day\", \"Thanksgiving\", \"Christmas Day\"]})\n", "\n", "# Print the dataframe\n", "holidays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We convert the date from string to datetime type, and drop the description column. Additionally we add a new column marked 'Holiday'. This will be useful to mark the holidays after we merge the tables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "holidays['date'] = cudf.to_datetime(holidays['date'])\n", "holidays.drop(['Description'],axis=1,inplace=True)\n", "holidays['Holiday'] = 1\n", "holidays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are ready to merge the tables using the commond `date` column. We want keep every element from the gdf_bw table (our *left* table), so we use a left join. Hint: use [merge](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.core.dataframe.DataFrame.merge) with the `on` and `how` attributes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### TODO merge tables and on the column 'date', use a left merge \n", "gdf = gdf_bw.merge(holidays, on='date', how='left')\n", "\n", "# inspect the result\n", "gdf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We reset the index to 'Hour' and sort the table accordingly. Notice that most of the rows in the 'Holiday' column are filled with ``, only the dates that appeared in the holiday table are filled with 1. We shall fill the empty fields with zero." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf = gdf.set_index('Hour')\n", "gdf = gdf.sort_index()\n", "\n", "### TODO fill empty holiday values with zero\n", "gdf['Holiday'] = gdf['Holiday'].fillna(0)\n", "gdf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we create a workingday feature. Assuming that the first five days of the week are working days, one could do that simply with the following operation:\n", "```\n", "gdf['Workingday'] = (gdf['Weekday'] < 5) & (gdf['Holiday']!=1)\n", "```\n", "But we could do it with user defined functions too. Previously we have only used UDF to process elements of a series. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "After this step we will not need the 'Holiday' and 'date' columns, so we can drop them." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf = gdf.drop(['Holiday', 'date'], axis=1)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 One-hot encoding\n", "\n", "We now have all the data in a single table, but we still want to change its encoding. We're going to create one-hot encoded variables, also known as dummy variables, for each of the time variables as well as the weather situation.\n", "\n", "\n", "A summary from https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/:\n", "\n", "\"The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.\n", "For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.\n", "\n", "In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).\n", "\n", "In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.\"\n", "\n", "We start by one-hot encoding the 'Weather' column using the [one_hot_encoding](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.core.dataframe.DataFrame.one_hot_encoding) method of the cuDF DataFrame. This is very similar to the [get_dummies](https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.core.reshape.get_dummies) function (which might be more familiar to pandas users), but one_hot_encoding works on a single input column at a time." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "codes = gdf['Weather'].unique()\n", "gdf = gdf.one_hot_encoding('Weather', 'Weather_dummy', codes)\n", "# Inspect the results\n", "gdf.head(3)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We're going to drop the original variable as well as one of the new dummy variables so we don't create collinearity (more about this problem [here](https://towardsdatascience.com/one-hot-encoding-multicollinearity-and-the-dummy-variable-trap-b5840be3c41a))." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf = gdf.drop(['Weather', 'Weather_dummy_1'], axis=1)" ] },
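{ "cell_type": "markdown", "metadata": {}, "source": [ "We can quickly list the dummy columns that remain (a small added check): one level, Weather_dummy_1, was dropped above to avoid the dummy variable trap." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print([col for col in gdf.columns if 'Weather_dummy' in col])" ] },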
{ "cell_type": "markdown", "metadata": {}, "source": [ "We create a copy of the dataset. It will make it easier to start over in case something goes wrong during the next exercise." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf_backup = gdf.copy()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dummies_list = ['month', 'hr', 'Weekday']\n", "\n", "gdf = gdf_backup.copy()\n", "\n", "for item in dummies_list:\n", "    ### TODO implement one-hot encoding for item\n", "    codes = gdf[item].unique()\n", "    gdf = gdf.one_hot_encoding(item, item + '_dummy', codes)\n", "    gdf = gdf.drop('{}_dummy_1'.format(item), axis=1)  # drop one dummy level\n", "    gdf = gdf.drop(item, axis=1)  # drop the original item" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2.5 Save the prepared dataset" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf.to_csv('../../../data/bike_sharing.csv')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Predict bike rentals with cuML\n", "\n", "cuML is a GPU accelerated machine learning library. cuML's Python API mirrors the [Scikit-Learn](https://scikit-learn.org/stable/) API.\n", "\n", "cuML currently requires all data to be of the same type, so the loop below converts all values into floats." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cuml" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for col in gdf.columns:\n", "    gdf[col] = gdf[col].astype('float64')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Prepare training and test data\n", "It is customary to denote the input feature matrix with X, and the target that we want to predict with y. We separate the target column 'cnt' from the rest of the table." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = gdf['cnt']\n", "X = gdf.drop('cnt', axis=1)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's split the data randomly into a train and a test set." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = cuml.preprocessing.model_selection.train_test_split(X, y)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Linear regression\n", "Our first model is a simple linear regression, which tries to predict the output as a linear combination of the input features:\n", "\n", "$$\\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \\ldots + w_n x_n$$\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reg = cuml.LinearRegression()\n", "reg.fit(X_train, y_train)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can make predictions with the trained model." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_hat = reg.predict(X_test)" ] },
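{ "cell_type": "markdown", "metadata": {}, "source": [ "Before turning to the built-in score method, a quick added sketch: we can compute the root-mean-squared error of these predictions by hand, using host copies of the data." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# RMSE of the test-set predictions; to_array moves the data to host memory\n", "err = y_hat.to_array() - y_test.to_array()\n", "print('RMSE:', np.sqrt((err ** 2).mean()))" ] },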
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can visualize how well the model works. Let's plot a few days of data:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import datetime as dt\n", "\n", "def plot_timerange(model, start, end):\n", "    start_date = dt.datetime.strptime(start, '%Y-%m-%d')\n", "    end_date = dt.datetime.strptime(end, '%Y-%m-%d')\n", "    idx = (X.index >= start_date) & (X.index <= end_date)\n", "    X_sel = X[idx]\n", "    y_sel = y[idx]\n", "\n", "    ### TODO predict the values for X_sel (1 line of code)\n", "    y_pred = model.predict(X_sel)\n", "\n", "    plt.plot(X_sel.index.to_array(), y_sel.to_array())\n", "    plt.plot(X_sel.index.to_array(), y_pred.to_array(), '--')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_timerange(reg, '2011-06-01', '2011-06-05')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Often we are interested in a single score. The default score method for regression problems is the [r2_score](https://en.wikipedia.org/wiki/Coefficient_of_determination)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_score = reg.score(X_train, y_train)\n", "### TODO calculate the test score (the score on X_test, ~ 1 line of code)\n", "test_score = reg.score(X_test, y_test)\n", "\n", "print('train score', train_score)\n", "print('test score', test_score)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Save and load the trained model\n", "We can pickle any cuML model." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pickle\n", "pickle_file = 'my_model.pickle'\n", "\n", "with open(pickle_file, 'wb') as pf:\n", "    pickle.dump(reg, pf)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Load the saved model." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(pickle_file, 'rb') as pf:\n", "    loaded_model = pickle.load(pf)\n", "\n", "print('Loaded model score', loaded_model.score(X_test, y_test))\n", "print('Original model score', reg.score(X_test, y_test))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3.4 Ridge regression with hyperparameter tuning\n", "Ridge regression is a linear regression model with an added L2 regularization term. Regularization is often used in practice to avoid overfitting. The strength of the regularization is set by the alpha hyperparameter.\n", "We're going to do a small hyperparameter search for alpha, checking 100 different values. This is fast to do with RAPIDS. Also notice that we are appending the results of each Ridge model onto the dictionary containing our earlier results, so we can more easily see which model is the best at the end." ] },
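{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference (this standard formulation is added here for context), Ridge regression finds the weights $w$ that minimize\n", "\n", "$$\\|Xw - y\\|_2^2 + \\alpha \\|w\\|_2^2,$$\n", "\n", "so larger values of alpha shrink the weights more strongly." ] },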
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Max score: {}'.format(max(output, key=output.get)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.5 Additional cuML models (Optional)\n", "#### 3.5.1 Support vector regression\n", "\n", "Support vector regression is a more complex model, with an execution time that scales with at least O(n_rows^2). RAPIDS cuML includes a fast SVM solver that makes it feasable to run SVM on larger datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "reg = cuml.svm.SVR(kernel='rbf', gamma=0.1, C=100, epsilon=0.1)\n", "## Todo\n", "reg.fit(X_train, y_train)\n", "reg.score(X_train, y_train)\n", "reg.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use sklearns [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to perform hyperparameter search. Sklearn's GridSearchCV requires input that the input data is a host array. Fortunately cuML is flexible with the [input format](https://medium.com/rapids-ai/input-and-output-configurability-in-rapids-cuml-e719d72c135b), and we can pass numpy array directly to it (at a cost of additional host to device copies, because under the hood cuML copies the data to the GPU). If the data size is reasonably small, then we can pay the price of additional data movement and combine the convenience of GridSearchCV with the speed of cuML algorithms." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "param_grid = [ {'C': [0.01, 0.1, 1, 10, 100], 'gamma': [10, 1, 0.1, 0.01, 0.001], 'kernel': ['rbf']} ]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train_np = X_train.as_matrix()\n", "y_train_np = y_train.to_array()\n", "X_test_np = X_test.as_matrix() \n", "y_test_np = y_test.to_array()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "reg = GridSearchCV(cuml.svm.SVR(), param_grid, scoring='r2' )\n", "\n", "reg.fit(X_train_np, y_train_np)\n", "\n", "print(\"Best parameters set found on development set:\")\n", "print()\n", "print(reg.best_params_)\n", "print()\n", "print(\"Grid scores on development set:\")\n", "print()\n", "means = reg.cv_results_['mean_test_score']\n", "stds = reg.cv_results_['std_test_score']\n", "for mean, std, params in zip(means, stds, reg.cv_results_['params']):\n", " print(\"%0.3f (+/-%0.03f) for %r\" % (mean, std * 2, params))\n", "print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.5.2 KNN Regression\n", "k-Nearest Neighbors regression is a machine learning technique that predicts an unknown observation by using the k most similar known observations in the training dataset." 
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.5.2 KNN Regression\n", "k-Nearest Neighbors regression is a machine learning technique that predicts an unknown observation by using the k most similar known observations in the training dataset." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "### TODO tune the n_neighbors hyperparameter to achieve better performance\n", "knn = cuml.neighbors.KNeighborsRegressor(n_neighbors=8)\n", "knn.fit(X_train, y_train, convert_dtype=True)\n", "pred = knn.predict(X_test)\n", "knn.score(X_test, y_test)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can compare the execution time of training KNN with cuML and with scikit-learn." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn\n", "import sklearn.neighbors" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "knn = sklearn.neighbors.KNeighborsRegressor(n_neighbors=8)\n", "knn.fit(X_train_np, y_train_np)\n", "pred = knn.predict(X_test_np)\n", "knn.score(X_test_np, y_test_np)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3.6 XGBoost (Optional)\n", "RAPIDS integrates seamlessly with the XGBoost library. Here is how to use it for our example." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import xgboost as xgb" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dtrain = xgb.DMatrix(X_train, label=y_train)\n", "dtest = xgb.DMatrix(X_test, label=y_test)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# instantiate params\n", "params = {}\n", "\n", "# general params\n", "general_params = {}\n", "params.update(general_params)\n", "\n", "# booster params: GPU tree method plus the tree hyperparameters\n", "booster_params = {'tree_method': 'gpu_hist', 'max_depth': 8, 'min_child_weight': 6, 'gamma': 0.4}\n", "params.update(booster_params)\n", "\n", "# learning task params\n", "learning_task_params = {'objective': 'reg:squarederror'}\n", "params.update(learning_task_params)\n", "print(params)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evallist = [(dtest, 'test'), (dtrain, 'train')]\n", "num_round = 100" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bst = xgb.train(params, dtrain, num_round, evallist)\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred = bst.predict(dtest)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cupy as cp\n", "from cuml.metrics.regression import r2_score\n", "y_pred_cp = cp.asarray(y_pred)\n", "y_test_cp = cp.asarray(y_test).astype(np.float32)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r2_score(y_test_cp, y_pred_cp)" ] },
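{ "cell_type": "markdown", "metadata": {}, "source": [ "We can also visualize the XGBoost predictions for a few days (an added sketch; the Booster's `predict` method expects a `DMatrix`, so we convert the selected rows first):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot actual counts and XGBoost predictions for an arbitrary period\n", "idx = (X.index >= dt.datetime(2011, 6, 1)) & (X.index <= dt.datetime(2011, 6, 5))\n", "X_sel = X[idx]\n", "y_sel = y[idx]\n", "y_xgb = bst.predict(xgb.DMatrix(X_sel))\n", "plt.plot(X_sel.index.to_array(), y_sel.to_array())\n", "plt.plot(X_sel.index.to_array(), y_xgb, '--')" ] },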
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For comparison, plot the grid-searched SVR model over the same period\n", "plot_timerange(reg, '2011-06-01', '2011-06-05')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Licensing\n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "[Previous Notebook](Challenge.ipynb)\n", "     \n", "     \n", "     \n", "     \n", "[1](Challenge.ipynb)\n", "[2]\n", "     \n", "     \n", "     \n", "     \n", "\n", "\n", "     \n", "     \n", "     \n", "     \n", "     \n", "   \n", "[Home Page](../../START_HERE.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 4 }