{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Home Page](../START_HERE.ipynb)\n",
    "\n",
    "[1]\n",
    "[2](03_CuML_Exercise.ipynb)\n",
    "\n",
    "[Next Notebook](03_CuML_Exercise.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to cuML and how it relates to Scikit-learn\n",
    "\n",
    "Scikit-learn is a powerful toolkit that lets data scientists quickly build models from their data, and it is one of the most widely used tools in the Python data science ecosystem. cuML is the RAPIDS library that implements similar machine learning algorithms using CUDA to run on GPUs, with an API that mirrors Scikit-learn's as closely as possible.\n",
    "\n",
    "Below we will go through an example of how to create a Linear Regression model and see how easy it is to move over from Scikit-learn based workflows. Afterwards we will explore some more advanced functionality, such as hyperparameter optimization and ecosystem interoperability, that showcases the usefulness of cuML in different contexts. The tutorial contains modules with embedded exercises to help you understand the concepts.\n",
    "\n",
    "For more information about cuML, refer to the documentation here: https://docs.rapids.ai/api/cuml/stable/api.html#regression-and-classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Here is the list of exercises and modules in the lab:\n",
    "- Linear Regression: This module covers the Scikit-learn implementation of the Linear Regression algorithm and the corresponding cuML version.\n",
    "- Ridge Regression and Hyperparameters: This module covers the Scikit-learn implementation of the Ridge Regression algorithm and the corresponding cuML version. We will also learn how to perform hyperparameter optimization to improve the accuracy of our model.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "## 1. Simple Linear Regression\n",
    "Basic Linear Regression is a simple machine learning model in which a variable `y`, which we will call the response, is modeled as a linear combination of a set of variables `X`, which we will call the predictors.\n",
    "\n",
    "Let's start by creating a sample dataset: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using NumPy for data generation and matplotlib for plotting\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np; print('NumPy Version:', np.__version__)\n",
    "%matplotlib inline\n",
    "import sys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create the relationship: y = 2.0 * x + 1.0\n",
    "n_rows = 10000\n",
    "w = 2.0\n",
    "x = np.random.normal(loc=0, scale=1, size=(n_rows,))\n",
    "b = 1.0\n",
    "y = w * x + b\n",
    "\n",
    "# add a bit of random noise\n",
    "noise = np.random.normal(loc=0, scale=2, size=(n_rows,))\n",
    "y_noisy = y + noise"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now visualize our data using `matplotlib`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.scatter(x, y_noisy, label='empirical data points')\n",
    "plt.plot(x, y, color='black', label='true relationship')\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `LinearRegression` classes implemented in cuML and Scikit-learn are both based on ordinary least squares (OLS), which minimizes the sum of squared distances between the observations (blue dots) and the linear relationship estimated by the class. \n",
    "\n",
    "Because fitting is an optimization process, cuML offers 3 algorithms to fit the linear model: Singular Value Decomposition (`SVD`), Eigendecomposition (`Eig`) and Coordinate Descent (`CD`). `SVD` is more stable, `Eig` (which is the default) is typically much faster, and `CD` can be faster when the data is large enough.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "### Scikit-Learn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll begin with the `LinearRegression` class from Scikit-Learn to instantiate a model and fit it to our data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sklearn; print('Scikit-Learn Version:', sklearn.__version__)\n",
    "from sklearn.linear_model import LinearRegression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "linear_regression = LinearRegression()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have instantiated the class, we can train the model and then use it to predict new observations. Typically, for regression models, the Scikit-learn API offers two fundamental methods:\n",
    "\n",
    "1. `fit`: Fits the model with X and y. This method performs the training of the model. \n",
    "2. `predict`: Predicts y for X."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "linear_regression.fit(np.expand_dims(x, 1), y_noisy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To visualize what the model looks like, let's use NumPy to create a set of evenly spaced points: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create new data and perform inference\n",
    "inputs = np.linspace(start=-5, stop=5, num=1000000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can use our `predict` function: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "outputs = linear_regression.predict(np.expand_dims(inputs, 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now visualize our empirical data points, the true relationship of the data, and the relationship estimated by the model. Looks pretty close!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.scatter(x, y_noisy, label='empirical data points')\n",
    "plt.plot(x, y, color='black', label='true relationship')\n",
    "plt.plot(inputs, outputs, color='red', label='predicted relationship (cpu)')\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "### cuML\n",
    "\n",
    "The mathematical operations underlying many machine learning algorithms are often matrix multiplications, just like the ordinary least squares approach that was described above. These types of operations are highly parallelizable and can be greatly accelerated using a GPU. \n",
    "\n",
    "The objective of cuML is to make it easy to build machine learning models in an accelerated fashion using an interface nearly identical to Scikit-Learn. Now we'll explore how this looks in practice.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cudf; print('cuDF Version:', cudf.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "### Create a cuDF DataFrame with `x` and `y`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the following cell to create a dataframe called `df`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = cudf.DataFrame({'x': x, 'y': y_noisy})\n",
    "print(df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we'll load the GPU accelerated `LinearRegression` class from cuML, instantiate it, and fit it to our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cuml; print('cuML Version:', cuml.__version__)\n",
    "from cuml.linear_model import LinearRegression as LinearRegressionGPU"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `LinearRegression` class accepts the following parameters:\n",
    "1. `algorithm`: `eig`, `cd` or `svd` (default = `eig`). `eig` uses an eigendecomposition of the covariance matrix and is much faster. `svd` is slower, but is guaranteed to be stable.\n",
    "2. `fit_intercept`: boolean (default = True). If True, LinearRegression tries to correct for the global mean of y. If False, the model expects that you have centered the data.\n",
    "3. `normalize`: boolean (default = False). If True, the predictors in X will be normalized by dividing by their L2 norm. If False, no scaling will be done.\n",
    "\n",
    "We will use the columns of our dataframe to train our model: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cupy as cp\n",
    "# instantiate and fit model, change the column names if you used a different name\n",
    "linear_regression_gpu = LinearRegressionGPU()\n",
    "linear_regression_gpu.fit(cp.expand_dims(cp.array(df['x']),1), y_noisy)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use this model to predict values for new data points, a step often called \"inference\" or \"scoring\". All model fitting and predicting steps are GPU accelerated."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "### Mini Exercise: Use `np.linspace` to create a set of points suitable for visualizing our model, as we did with Scikit-learn\n",
    "\n",
    "Solution:\n",
    "\n",
    "    inputs = np.linspace(start=-5, stop=5, num=1000000)\n",
    "    new_data_df = cudf.DataFrame({'inputs': inputs})\n",
    "    gpu_outputs = linear_regression_gpu.predict(new_data_df[['inputs']])\n",
    "\n",
    "Solution:\n",
    "\n",
    "    plt.scatter(x, y_noisy, label='empirical data points')\n",
    "    plt.plot(x, y, color='black', label='true relationship')\n",
    "    plt.plot(inputs, outputs, color='red', label='predicted relationship (cpu)')\n",
    "    plt.plot(inputs, gpu_outputs.get(), color='green', label='predicted relationship (gpu)')\n",
    "    plt.legend()\n",
    "\n",
    "Solution for Scikit-learn:\n",
    "\n",
    "    alpha = np.array([1.0])\n",
    "    fit_intercept = True\n",
    "    normalize = False\n",
    "\n",
    "    ridge = skRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')\n",
    "    ridge.fit(X_train, y_train)\n",
    "\n",
    "Solution for cuML:\n",
    "\n",
    "    alpha = np.array([1.0])\n",
    "    fit_intercept = True\n",
    "    normalize = False\n",
    "\n",
    "    cu_ridge = cuRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='eig')\n",
    "    cu_ridge.fit(X_train, y_train)\n",
    "\n",
    "Solution for Scikit-learn:\n",
    "\n",
    "    print('Scikit-learn accuracy: ' + str(ridge.score(X_test, y_test)))\n",
    "\n",
    "Solution for cuML:\n",
    "\n",
    "    print('cuML accuracy: ' + str(cu_ridge.score(test_df, y_df)))\n",
    "\n",
    "Solution:\n",
    "\n",
    "    cu_grid = GridSearchCV(cu_ridge, params, scoring='r2')\n",
    "    cu_grid.fit(X_train, y_train)\n",
    "    cu_grid.best_params_, cu_grid.best_score_\n",
    "\n",
    "\n",
    "