{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "     \n", "     \n", "     \n", "     \n", "     \n", "   \n", "[Home Page](../START_HERE.ipynb)\n", "\n", "     \n", "     \n", "     \n", "     \n", "     \n", "   \n", "[1]\n", "[2](03_CuML_Exercise.ipynb)\n", "     \n", "     \n", "     \n", "     \n", "[Next Notebook](03_CuML_Exercise.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to cuML and how it relates to Scikit-learn\n", "\n", "Scikit-Learn is an incredibly powerful toolkit that allows data scientists to quickly build models from their data, and it one of the most common and useful tools in the Python data science ecosystem. cuML is the RAPIDS library that implements similar machine learning algorithms that use CUDA to run on GPUs, with an API that mirrors the Scikit-learn one as much as possible. \n", "\n", "Below we will go through and example of how to create a Linear Regression model, and how easy it is to pick up from Scikit-learn based workflows. Afterwards we will explore some more advanced functionality, like hyperparameter optimization and ecosystem interoperability that showcase the usefulness of cuML in different contexts. The tutorial contains modules with embedded exercises to help understanding the concepts.\n", "\n", "For more information about CuML, refer to the documentation here: https://docs.rapids.ai/api/cuml/stable/api.html#regression-and-classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Here is the list of exercises and modules in the lab:\n", "- Linear Regression
This module covers the Scikit-learn implementation of the Linear Regression algorithm and the corresponding cuML version.\n", "- Ridge Regression and Hyperparameters
This module covers the Scikit-learn implementation of the Ridge Regression algorithm and the corresponding cuML version. We will also learn how to perform hyperparameter optimization to boost the accuracy of our model.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## 1. Simple Linear Regression\n", "The basic Linear Regression is a simple machine learning model where the relationship between a variable `y`, which we will call the response, and a set of variables `X`, which we will call the predictors, is explained by trying to model `y` as a linear combination of the variables in `X`.\n", "\n", "Let's start by creating a sample dataset: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Using NumPy for data generation\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np; print('NumPy Version:', np.__version__)\n", "%matplotlib inline\n", "import sys" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create the relationship: y = 2.0 * x + 1.0\n", "n_rows = 10000\n", "w = 2.0\n", "x = np.random.normal(loc=0, scale=1, size=(n_rows,))\n", "b = 1.0\n", "y = w * x + b\n", "\n", "# add a bit of random noise\n", "noise = np.random.normal(loc=0, scale=2, size=(n_rows,))\n", "y_noisy = y + noise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now visualize our data using `matplotlib`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(x, y_noisy, label='empirical data points')\n", "plt.plot(x, y, color='black', label='true relationship')\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `LinearRegression` classes implemented in both cuML and Scikit-Learn are based on ordinary least squares (OLS), which essentially minimizes the squared distance between the observations (blue dots) and the relationship (black line) estimated by the class. \n", "\n", "Fitting the model is therefore an optimization process, and cuML offers 3 algorithms for it: Singular Value Decomposition `SVD`, Eigendecomposition `Eig` and Coordinate Descent `CD`. `SVD` is more stable, `Eig` (which is the default) is typically much faster, and `CD` can be faster when the data is large enough.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Scikit-Learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll begin with the `LinearRegression` class from Scikit-Learn to instantiate a model and fit it to our data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn; print('Scikit-Learn Version:', sklearn.__version__)\n", "from sklearn.linear_model import LinearRegression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "linear_regression = LinearRegression()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have instantiated the model, we can fit it to our data and then predict new observations. Typically, for regression models, the Scikit-learn API offers two fundamental methods:\n", "\n", "1. `fit`: Fit the model with X and y. This method performs the training of the model. \n", "2. `predict`: Predicts the y values for X."
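, "\n", "As a quick sketch of this pattern (using a generic estimator `model` and placeholder arrays `X`, `y` and `X_new`, which are not objects defined in this notebook), the workflow looks like:\n", "\n", "```python\n", "model = LinearRegression()     # instantiate the estimator\n", "model.fit(X, y)                # estimate the model parameters from the training data\n", "y_pred = model.predict(X_new)  # predict responses for new observations\n", "```"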
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "linear_regression.fit(np.expand_dims(x, 1), y_noisy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To visualize how the model looks like, lets use NumPy to create a uniform number of points: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create new data and perform inference\n", "inputs = np.linspace(start=-5, stop=5, num=1000000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use our `predict` function: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "outputs = linear_regression.predict(np.expand_dims(inputs, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now visualize our empirical data points, the true relationship of the data, and the relationship estimated by the model. Looks pretty close!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(x, y_noisy, label='empirical data points')\n", "plt.plot(x, y, color='black', label='true relationship')\n", "plt.plot(inputs, outputs, color='red', label='predicted relationship (cpu)')\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### cuML\n", "\n", "The mathematical operations underlying many machine learning algorithms are often matrix multiplications, just like the ordinary least squares approach that was described above. These types of operations are highly parallelizable and can be greatly accelerated using a GPU. \n", "\n", "The objective of cuML is to make it easy to build machine learning models in an accelerated fashion using an interface nearly identical to Scikit-Learn. Now we'll explore how this looks in practice.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cudf; print('cuDF Version:', cudf.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Create a cuDF Dataframe with `x` and `y`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the following cell to create a dataframe called `df`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = cudf.DataFrame({'x': x, 'y': y_noisy})\n", "print(df.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we'll load the GPU accelerated `LinearRegression` class from cuML, instantiate it, and fit it to our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cuml; print('cuML Version:', cuml.__version__)\n", "from cuml.linear_model import LinearRegression as LinearRegressionGPU" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Linear Regression function accepts the following parameters:\n", "1. algorithm:`eig`, `cd` or `svd` (default = `eig`). Eig uses a eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but is guaranteed to be stable.\n", "2. fit_intercept:boolean (default = True). If True, LinearRegression tries to correct for the global mean of y. If False, the model expects that you have centered the data.\n", "3. normalize:boolean (default = False). If True, the predictors in X will be normalized by dividing by it’s L2 norm. 
If False, no scaling will be done.\n", "\n", "We will use the different columns of our dataframe to train our model: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cupy as cp\n", "# instantiate and fit the model; change the column names if you used different ones\n", "linear_regression_gpu = LinearRegressionGPU()\n", "linear_regression_gpu.fit(cp.expand_dims(cp.array(df['x']), 1), y_noisy)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use this model to predict values for new data points, a step often called \"inference\" or \"scoring\". All model fitting and predicting steps are GPU accelerated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Mini Exercise: Use np.linspace to create a set of points suitable for visualizing our model, as we did with Scikit-learn\n", "\n", "
<details><summary>Solution</summary>\n", "<pre>
\n",
    "\n",
    "inputs = np.linspace(start=-5, stop=5, num=1000000)\n",
    "new_data_df = cudf.DataFrame({'inputs': inputs})\n",
    "gpu_outputs = linear_regression_gpu.predict(new_data_df[['inputs']])\n",
    "\n",
    "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create new data and perform inference. inputs is the same data from the scikit-learn example above \n", "# transformed to a cuDF Dataframe\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, we can overlay our predicted relationship using our GPU accelerated Linear Regression model (green line) over our empirical data points (light blue circles), the true relationship (blue line), and the predicted relationship from a model built on the CPU (red line). We see that our GPU accelerated model's estimate of the true relationship (green line) is identical to the CPU based model's estimate of the true relationship (red line)!\n", "\n", "
<details><summary>Solution</summary>\n", "<pre>
\n",
    "\n",
    "plt.scatter(x, y_noisy, label='empirical data points')\n",
    "plt.plot(x, y, color='black', label='true relationship')\n",
    "plt.plot(inputs, outputs, color='red', label='predicted relationship (cpu)')\n",
    "plt.plot(inputs, gpu_outputs.get(), color='green', label='predicted relationship (gpu)')\n",
    "plt.legend()\n",
    "\n",
    "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Adapt the code for the graph we used before to graph the new model\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Its nice to see that the red and green lines are identical, showing cuML and Scikit-learn ended up with identical models. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## 2. Ridge Regression and Hyperparameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ridge extends the `LinearRegression` (in both Scikit-learn and cuML) by providing L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. Essentially it can reduce the variance of the predictors, and improves the conditioning of the problem, which can improve the performance of the models when data is not suitable for a simple ordinary least squares linear regression.\n", "\n", "For this section we will use a built in dataset of Scikit-learn: the diabetes dataset. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import datasets\n", "diabetes = datasets.load_diabetes()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Description of the Diabetes dataset\n", "\n", "```\n", "Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of `n=442` diabetes patients, as well as the response of interest, aquantitative measure of disease progression one year after baseline.\n", "\n", "**Data Set Characteristics:** :\n", "- Number of Instances: 442 \n", "- Number of Attributes: First 10 columns are numeric predictive values \n", "- Target: Column 11 is a quantitative measure of disease progression one year after baseline \n", "- Attribute Information: Age, Sex, Body mass index, Average blood pressure, S1, S2, S3, S4, S5, S6\n", "\n", "Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).Source \n", "- URL:https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html\n", "\n", "For more information see:Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) \"Least Angle Regression,\" Annals of Statistics (with discussion), 407-499.(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)'\n", "```\n", "\n", "A common practice is to train models on part of the data and use the rest for testing the performance of the model, so lets divide our dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now lets import our Ridge Regression classes from cuML and Scikit-learn:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from cuml import Ridge as cuRidge\n", "from sklearn.linear_model import Ridge as skRidge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Ridge Regression classes of both cuML and Scikit-learn have quite a few parameters that can be set. 
For this exercise, the relevant cuML ones are: \n", "\n", "```\n", "alpha : float or double\n", "    Regularization strength - must be a positive float. Larger values\n", "    specify stronger regularization. Array input will be supported later.\n", "solver : 'eig' or 'svd' or 'cd' (default = 'eig')\n", "    Eig uses an eigendecomposition of the covariance matrix, and is much\n", "    faster.\n", "    SVD is slower, but guaranteed to be stable.\n", "    CD or Coordinate Descent is very fast and is suitable for large\n", "    problems.\n", "fit_intercept : boolean (default = True)\n", "    If True, Ridge tries to correct for the global mean of y.\n", "    If False, the model expects that you have centered the data.\n", "normalize : boolean (default = False)\n", "    If True, the predictors in X will be normalized by dividing by its L2\n", "    norm.\n", "    If False, no scaling will be done.\n", "```\n", "\n", "Most of the parameters are the same for Scikit-learn, except for solver, where the options are: \n", "\n", "```\n", "solver : 'auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'\n", "```\n", "\n", "It is important to note that even though both libraries are performing a Ridge Regression, different solvers are being used underneath (except for `svd`). This can lead to slightly different results (and final performance), where both results are technically correct even though they differ. \n", "\n", "For the exercise feel free to use `solver='auto'` for Scikit-learn, otherwise we recommend using `cholesky`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Exercise: Create the Ridge Regression objects for both cuML and Scikit-learn with matching parameters and fit `X_train` and `y_train`\n", "\n", "
<details><summary>Solution for Scikit-learn</summary>\n", "<pre>
\n",
    "\n",
    "alpha = np.array([1.0])\n",
    "fit_intercept = True\n",
    "normalize = False\n",
    "\n",
    "ridge = skRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')\n",
    "ridge.fit(X_train, y_train)\n",
    "\n",
    "\n",
    "
\n", "
\n", "\n", "
Solution for cuML\n", "
\n",
    "\n",
    "alpha = np.array([1.0])\n",
    "fit_intercept = True\n",
    "normalize = False\n",
    "\n",
    "cu_ridge = cuRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='eig')\n",
    "cu_ridge.fit(X_train, y_train)\n",
    "\n",
    "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Exercise: Predict the values for `X_test` and evaluate its performance\n", "\n", "*Hint: The Ridge classes have a `score` method to evaluate performance*\n", "\n", "*Note: cuML as of version 0.8 has a limitation to only being able to accept cuDF objects for `score`, so we've provided a helpful conversion*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from numba import cuda\n", "record_data = {'fea%d'%i: X_test[:,i] for i in range(X_test.shape[1])}\n", "test_df = cudf.DataFrame(record_data)\n", "\n", "y_df = cudf.Series(y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
<details><summary>Solution for Scikit-learn</summary>\n", "<pre>
\n",
    "\n",
    "print('Scikit-learn accuracy: ' + str(ridge.score(X_test, y_test)))\n",
    "
\n", "
\n", "\n", "
Solution for cuML\n", "
\n",
    "\n",
    "print('cuML accuracy: ' + str(cu_ridge.score(test_df, y_df)))\n",
    "\n",
    "
\n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Improving the accuracy of our models\n", "\n", "One of the most useful components beyond basic models that Scikit-learn offers is hyperparameter optimization for its models. Hyperparameter optimization means looking for the values of parameters that maximize how well our model can predict observations in `X_test`. \n", "\n", "Fortunately cuML is compatible with Scikit-learn hyperparameter optimization!!! Note: It also is compatible with other libraries, such as dask-ml that perform hyperparameter optimization with more advanced strategies/levels of parallelization.\n", "\n", "For this lets use the `GridSearchCV` class of Scikit-learn, which like the name suggest performs a search in a grid of values of the parameters that we specify:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# lets import the class\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "# Here we tell GridSearchCV what values of parameters to explore\n", "params = {'alpha': np.logspace(-3, -1, 10)}\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the parameter we are going to optimise:\n", "- Alpha - Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Alpha corresponds to 1 / (2C) in other linear models such as LogisticRegression or sklearn.svm.LinearSVC. If an array is passed, penalties are assumed to be specific to the targets. Hence they must correspond in number.\n", "- fit_interceptbool, default=True\n", "Whether to fit the intercept for this model. If set to false, no intercept will be used in calculations (i.e. X and y are expected to be centered).\n", "- normalizebool, default=False\n", "This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use sklearn.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.\n", "- solver{‘auto’, ‘svd’, ‘cholesky’, ‘lsqr’, ‘sparse_cg’, ‘sag’, ‘saga’}, default=’auto’\n", "Solver to use in the computational routines:\n", "‘auto’ chooses the solver automatically based on the type of data. \n", "‘svd’ uses a Singular Value Decomposition of X to compute the Ridge coefficients. More stable for singular matrices than ‘cholesky’. ‘cholesky’ uses the standard scipy.linalg.solve function to obtain a closed-form solution. ‘sparse_cg’ uses the conjugate gradient solver as found in scipy.sparse.linalg.cg. As an iterative algorithm, this solver is more appropriate than ‘cholesky’ for large-scale data (possibility to set tol and max_iter). ‘lsqr’ uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr. It is the fastest and uses an iterative procedure. ‘sag’ uses a Stochastic Average Gradient descent, and ‘saga’ uses its improved, unbiased version named SAGA. Both methods also use an iterative procedure, and are often faster than other solvers when both n_samples and n_features are large. Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approximately the same scale. 
You can preprocess the data with a scaler from sklearn.preprocessing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid = GridSearchCV(ridge, params, scoring='r2')\n", "grid.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid.best_params_, grid.best_score_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge = skRidge(alpha=grid.best_params_['alpha'], \n", " fit_intercept=fit_intercept, \n", " normalize=normalize, \n", " solver='cholesky')\n", "ridge.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Exercise: Perform a hyperparameter optimization for the cuML Ridge Regression\n", "\n", "*Hint: You do not need to import any other `GridSearchCV` objects, you can use the Scikit-learn one with cuML models!*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
<details><summary>Solution</summary>\n", "<pre>
\n",
    "\n",
    "cu_grid = GridSearchCV(cu_ridge, params, scoring='r2')\n",
    "cu_grid.fit(X_train, y_train)\n",
    "cu_grid.best_params_, cu_grid.best_score_\n",
    "\n",
    "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion\n", "\n", "We have learnt how to create machine learning models in CuML, from feeding data to them and fitting the models to testing their performance on data. We have also understood how to perform hyperparameter optimization to boost the model accuracy. As we have based this tutorial only on CuDF objects, you may visit the bonus lab [here](Bonus_Lab-LogisticRegression.ipynb) to learn the implementation of Logistic Regression algorithm using CuPy objects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Licensing\n", " \n", "This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "     \n", "     \n", "     \n", "     \n", "     \n", "    \n", "[1]\n", "[2](03_CuML_Exercise.ipynb)\n", "     \n", "     \n", "     \n", "     \n", "[Next Notebook](03_CuML_Exercise.ipynb)\n", "\n", "\n", "     \n", "     \n", "     \n", "     \n", "     \n", "   \n", "[Home Page](../START_HERE.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 4 }