[Home Page](../START_HERE.ipynb)

[Previous Notebook](01-LinearRegression-Hyperparam.ipynb)
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
[1](01-LinearRegression-Hyperparam.ipynb)
[2]
[3](03_CuML_Exercise.ipynb)
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
[Next Notebook](03_CuML_Exercise.ipynb)

# Mini Batch SGD classifier and regressor
Mini-batch SGD (MBSGD) models are linear models fitted by minimizing a regularized empirical loss with mini-batch stochastic gradient descent (SGD). In this notebook we compare cuML's MBSGD classifier and regressor with their scikit-learn counterparts, in terms of both speed and accuracy.
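
To make the fitting procedure concrete, here is a minimal pure-Python sketch of mini-batch SGD for a linear regressor with squared loss and an optional L2 penalty. It is an illustration under simplifying assumptions (constant learning rate, tiny noise-free toy data), not the implementation used by cuML or scikit-learn; the function name `minibatch_sgd` and its arguments are hypothetical.

```python
import random


def minibatch_sgd(X, y, eta0=0.1, epochs=500, batch_size=2, alpha=0.0, seed=0):
    """Fit y ~ X.w + b by mini-batch SGD on squared loss with an L2 penalty."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    order = list(range(n_samples))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new order every epoch
        for start in range(0, n_samples, batch_size):
            batch = order[start:start + batch_size]
            grad_w = [0.0] * n_features
            grad_b = 0.0
            for i in batch:
                # prediction error of the current model on sample i
                err = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
                for j in range(n_features):
                    grad_w[j] += err * X[i][j]
                grad_b += err
            m = len(batch)
            for j in range(n_features):
                # batch-averaged loss gradient plus the L2 penalty gradient
                w[j] -= eta0 * (grad_w[j] / m + alpha * w[j])
            b -= eta0 * grad_b / m
    return w, b


# Toy data generated from y = 2*x0 - 3*x1 + 1 without noise
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y = [2.0 * x0 - 3.0 * x1 + 1.0 for x0, x1 in X]
w, b = minibatch_sgd(X, y)
print(w, b)
```

Each update averages the gradient over `batch_size` samples; with `batch_size=1` the loop reduces to plain SGD, where every update uses a single sample.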

The models accept array-like inputs, either on the host (NumPy arrays) or on the device (Numba device arrays or any object that implements cuda_array_interface), as well as cuDF DataFrames.

For information about cuDF, refer to the cuDF documentation: https://rapidsai.github.io/projects/cudf/en/latest/

## Here is the list of exercises and modules in the lab:

- <a href='#ex1'>Define Parameters</a><br> First we define the data and model parameters, which are used later to generate the data and to configure the models fitted on it.
- <a href='#ex2'>Generate Data</a><br> We generate the data on the host (CPU) and then make it available to the GPU using cuDF DataFrames.
- <a href='#ex3'>Scikit-learn model</a><br> Here we create the SGD models in scikit-learn, which makes the conversion to cuML straightforward later.
- <a href='#ex4'>CuML model</a><br> Now we convert the scikit-learn implementation to cuML.
- <a href='#ex5'>Evaluate Results</a><br> Evaluate the performance of both models with respect to speed and accuracy.


In [None]:
import cudf as gd
import cuml
import numpy as np
import pandas as pd
import sklearn

from sklearn import linear_model
# the generators are exported by sklearn.datasets; the samples_generator submodule was removed in newer releases
from sklearn.datasets import make_classification, make_regression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

<a id='ex1'></a>

## Define parameters

### Data parameters

In [None]:
num_samples = 2**13
num_features = 300
n_informative = 270
random_state = 0
train_size = 0.8
datatype = np.float32

### Model parameters

- learning_rate str, default=’optimal’
The learning rate schedule:

    ‘constant’: eta = eta0

    ‘optimal’: eta = 1.0 / (alpha * (t + t0)) where t0 is chosen by a heuristic proposed by Leon Bottou.

    ‘invscaling’: eta = eta0 / pow(t, power_t)

    ‘adaptive’: eta = eta0, as long as the training loss keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol (or fail to increase the validation score by tol if early_stopping is True), the current learning rate is divided by 5.
    
- penalty {‘l2’, ‘l1’, ‘elasticnet’}, default=’l2’
The penalty (aka regularization term) to be used. Defaults to ‘l2’ which is the standard regularizer for linear SVM models. ‘l1’ and ‘elasticnet’ might bring sparsity to the model (feature selection) not achievable with ‘l2’.

- eta0 double, default=0.0
The initial learning rate for the ‘constant’, ‘invscaling’ or ‘adaptive’ schedules. The default value is 0.0 as eta0 is not used by the default schedule ‘optimal’.

- max_iter int, default=1000
The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit method.

- fit_intercept bool, default=True
Whether the intercept should be estimated or not. If False, the data is assumed to be already centered.

- tol float, default=1e-3
The stopping criterion. If it is not None, training will stop when (loss > best_loss - tol) for n_iter_no_change consecutive epochs. 
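
The learning rate schedules listed above can be sketched as plain Python functions. This is a hedged illustration of the formulas, not library code: the names `learning_rate_at` and `AdaptiveRate` are hypothetical, the default values are illustrative, and `t0` is passed in directly rather than computed by the heuristic mentioned above.

```python
def learning_rate_at(schedule, t, eta0=0.005, alpha=0.0001, power_t=0.5, t0=1.0):
    """Return the step size at update count t (t >= 1) for a named schedule."""
    if schedule == "constant":
        return eta0
    if schedule == "optimal":
        # assumption: t0 is supplied by the caller instead of a heuristic
        return 1.0 / (alpha * (t + t0))
    if schedule == "invscaling":
        return eta0 / pow(t, power_t)
    raise ValueError(f"unknown schedule: {schedule!r}")


class AdaptiveRate:
    """'adaptive' schedule: divide eta by 5 after n_iter_no_change stalled epochs."""

    def __init__(self, eta0, tol=1e-3, n_iter_no_change=5):
        self.eta = eta0
        self.tol = tol
        self.n_iter_no_change = n_iter_no_change
        self.best_loss = float("inf")
        self.stalled = 0

    def update(self, loss):
        """Record one epoch's training loss and return the rate for the next epoch."""
        if loss > self.best_loss - self.tol:
            self.stalled += 1  # the epoch did not improve the loss by at least tol
        else:
            self.stalled = 0
        self.best_loss = min(self.best_loss, loss)
        if self.stalled >= self.n_iter_no_change:
            self.eta /= 5.0
            self.stalled = 0
        return self.eta


schedule = AdaptiveRate(eta0=1.0, tol=0.0, n_iter_no_change=2)
rates = [schedule.update(loss) for loss in (1.0, 1.5, 1.5)]
print(rates)
```

The same `loss > best_loss - tol` comparison is the one used by the `tol` stopping criterion, except that there the run ends instead of the rate being reduced.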

In [None]:
learning_rate = 'constant'
penalty = 'elasticnet'
eta0 = 0.005
max_iter = 100
fit_intercept = True
tol = 0.0
batch_size = 2
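
Because the configuration above selects the `'elasticnet'` penalty, a small sketch of that regularization term may help. It assumes the commonly used form that mixes the L1 norm and half the squared L2 norm through a mixing weight; the function name `elasticnet_penalty` is hypothetical, and the defaults for `alpha` and `l1_ratio` are illustrative values that are not set anywhere in this notebook.

```python
def elasticnet_penalty(w, alpha=0.0001, l1_ratio=0.15):
    """alpha * (l1_ratio * ||w||_1 + (1 - l1_ratio) / 2 * ||w||_2^2)."""
    l1 = sum(abs(wj) for wj in w)
    l2_squared = sum(wj * wj for wj in w)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2_squared)


weights = [3.0, -4.0]
mixed = elasticnet_penalty(weights, alpha=1.0, l1_ratio=0.5)
pure_l1 = elasticnet_penalty(weights, alpha=1.0, l1_ratio=1.0)
pure_l2 = elasticnet_penalty(weights, alpha=1.0, l1_ratio=0.0)
print(mixed, pure_l1, pure_l2)
```

Setting `l1_ratio` to 1 or 0 recovers the pure `'l1'` and `'l2'` penalties, which is why elastic net can trade sparsity against the smoother shrinkage of L2.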

<a id='ex2'></a>

## Generate data

### Host

In [None]:
%%time
X_class, y_class = make_classification(n_samples=num_samples, n_features=num_features,
                                       n_informative=n_informative, random_state=random_state)
# change the datatype of the input data
X_class = X_class.astype(datatype)
y_class = y_class.astype(datatype)

# convert numpy arrays to pandas dataframe
X_class = pd.DataFrame(X_class)
y_class = pd.DataFrame(y_class)

X_class_train, X_class_test, y_class_train, y_class_test = train_test_split(X_class, y_class,
                                                                            train_size=train_size,
                                                                            random_state=random_state)
X_reg, y_reg = make_regression(n_samples=num_samples, n_features=num_features,
                               n_informative=n_informative, random_state=random_state)

# change the datatype of the input data
X_reg = X_reg.astype(datatype)
y_reg = y_reg.astype(datatype)

# convert numpy arrays to pandas dataframe
X_reg = pd.DataFrame(X_reg)
y_reg = pd.DataFrame(y_reg)

X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg,
                                                                    train_size=train_size,
                                                                    random_state=random_state)

### GPU

In [None]:
%%time
# classification dataset
X_class_cudf = gd.DataFrame.from_pandas(X_class_train)
X_class_cudf_test = gd.DataFrame.from_pandas(X_class_test)

y_class_cudf = gd.Series(y_class_train.values[:,0])

# regression dataset
X_reg_cudf = gd.DataFrame.from_pandas(X_reg_train)
X_reg_cudf_test = gd.DataFrame.from_pandas(X_reg_test)

y_reg_cudf = gd.Series(y_reg_train.values[:,0])

<a id='ex3'></a>

## Scikit-learn Model

### Classification:

#### Fit

In [None]:
%%time
skl_sgd_classifier = sklearn.linear_model.SGDClassifier(learning_rate=learning_rate,
                                                        eta0=eta0,
                                                        max_iter=max_iter,
                                                        fit_intercept=fit_intercept,
                                                        tol=tol,
                                                        penalty=penalty,
                                                        random_state=random_state)

# pass y as a 1-D array; a single-column DataFrame triggers a DataConversionWarning
skl_sgd_classifier.fit(X_class_train, y_class_train.values.ravel())

#### Predict

In [None]:
%%time
skl_class_pred = skl_sgd_classifier.predict(X_class_test)
# scikit-learn metrics take (y_true, y_pred) in that order
skl_class_acc = accuracy_score(y_class_test, skl_class_pred)

### Regression:

#### Fit

In [None]:
%%time
skl_sgd_regressor = sklearn.linear_model.SGDRegressor(learning_rate=learning_rate,
                                                      eta0=eta0,
                                                      max_iter=max_iter,
                                                      fit_intercept=fit_intercept,
                                                      tol=tol,
                                                      penalty=penalty,
                                                      random_state=random_state)

# pass y as a 1-D array; a single-column DataFrame triggers a DataConversionWarning
skl_sgd_regressor.fit(X_reg_train, y_reg_train.values.ravel())

#### Predict

In [None]:
%%time
skl_reg_pred = skl_sgd_regressor.predict(X_reg_test)
# r2_score is not symmetric: ground truth first, predictions second
skl_reg_r2 = r2_score(y_reg_test, skl_reg_pred)

<a id='ex4'></a>

## cuML Model

### Classification:

#### Fit

In [None]:
%%time
cu_mbsgd_classifier = cuml.linear_model.MBSGDClassifier(learning_rate=learning_rate,
                                                        eta0=eta0,
                                                        epochs=max_iter,
                                                        fit_intercept=fit_intercept,
                                                        batch_size=batch_size,
                                                        tol=tol,
                                                        penalty=penalty)

cu_mbsgd_classifier.fit(X_class_cudf, y_class_cudf)

#### Predict

In [None]:
%%time
cu_class_pred = cu_mbsgd_classifier.predict(X_class_cudf_test).to_array()
# scikit-learn metrics take (y_true, y_pred) in that order
cu_class_acc = accuracy_score(y_class_test, cu_class_pred)

### Regression:

#### Fit

In [None]:
%%time
cu_mbsgd_regressor = cuml.linear_model.MBSGDRegressor(learning_rate=learning_rate,
                                                      eta0=eta0,
                                                      epochs=max_iter,
                                                      fit_intercept=fit_intercept,
                                                      batch_size=batch_size,
                                                      tol=tol,
                                                      penalty=penalty)

cu_mbsgd_regressor.fit(X_reg_cudf, y_reg_cudf)

#### Predict

In [None]:
%%time
cu_reg_pred = cu_mbsgd_regressor.predict(X_reg_cudf_test).to_array()
# r2_score is not symmetric: ground truth first, predictions second
cu_reg_r2 = r2_score(y_reg_test, cu_reg_pred)

<a id='ex5'></a>

## Evaluate Results

### Classification

In [None]:
print("Sklearn's accuracy score for classification : %s" % skl_class_acc)
print("cuML's accuracy score for classification : %s" % cu_class_acc)

### Regression

In [None]:
print("Sklearn's R^2 score for regression : %s" % skl_reg_r2)
print("cuML's R^2 score for regression : %s" % cu_reg_r2)
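
For reference, the R^2 score compares the residual sum of squares with the variance of the ground truth, so it depends on which argument is treated as the truth. The pure-Python sketch below follows the standard definition; the helper name `r2` and the toy values are hypothetical and only serve to show the effect of the argument order.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 3.0, 3.0]

correct_order = r2(y_true, y_pred)  # ground truth first
swapped_order = r2(y_pred, y_true)  # arguments reversed
print(correct_order, swapped_order)
```

On these toy values the two calls give different scores, which is why the ground truth should be passed as the first argument when calling `r2_score`.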

## Licensing
  
This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0).

# Conclusion

Now we know how to create both simple and complex machine learning models and how to work with different data types using cuML and cuDF. To explore more models, refer to the documentation at https://docs.rapids.ai/api/cuml/stable/api.html#regression-and-classification or try out our bonus lab [here](Bonus_Lab-LogisticRegression.ipynb). If you feel confident with cuML now, head over to the next lab, which tests your skills with an exercise.
