The goal of this lab is to explore linear and logistic regression, implement them yourself, and learn to use their scikit-learn implementations.
Let us start by loading some of the usual libraries.
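For instance, something along the following lines (the exact set of libraries is up to you; these are simply the ones used in the examples below):

```python
import numpy as np                # numerical arrays and linear algebra
import pandas as pd               # data loading and manipulation
import matplotlib.pyplot as plt   # plotting
```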
# 1. Linear regression
We will now implement a linear regression, first using the closed-form solution, and then using gradient descent.
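As a reminder, the ordinary least squares solution can be written in closed form as $\hat{\beta} = (X^\top X)^{-1} X^\top y$. Below is a minimal sketch of both approaches; the function names, learning rate and iteration count are illustrative choices, not part of the lab's required interface.

```python
import numpy as np

def fit_linreg_closed_form(X, y):
    """Ordinary least squares via the normal equations.

    X is an (n_samples, n_features) design matrix (add a column of ones
    if you want an intercept); y is the (n_samples,) target vector.
    """
    # Solve (X^T X) beta = X^T y; np.linalg.solve is numerically safer
    # than explicitly inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_linreg_gradient_descent(X, y, lr=1e-3, n_iter=1000):
    """Minimise the mean squared error by batch gradient descent."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        residuals = X @ beta - y
        gradient = 2.0 / n_samples * X.T @ residuals  # gradient of the MSE
        beta -= lr * gradient
    return beta
```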
## 1.1 Linear regression data
Our first data set concerns the quality ratings of a white _vinho verde_ wine. Each wine is described by a number of physico-chemical descriptors such as acidity, sulfur dioxide content, density, or pH.
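Assuming the ratings are distributed as a semicolon-separated CSV file in the usual UCI format (the file path below is a guess; adapt it to the file provided with the lab):

```python
import pandas as pd

# Hypothetical path; the UCI white wine quality file uses ';' as separator.
data = pd.read_csv('data/winequality-white.csv', sep=';')
X = data.drop(columns='quality').values  # physico-chemical descriptors
y = data['quality'].values               # quality ratings
print(X.shape, y.shape)
```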
## 1.2 Cross-validation
Let us create a cross-validation utility function (similar to what we have done in Lab 3, but for regression).
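One possible sketch of such a utility, returning the cross-validated prediction for every sample (the name and signature are only a suggestion):

```python
import numpy as np
from sklearn import model_selection

def cross_validate_regression(X, y, regressor, n_folds=10):
    """Return cross-validated predictions for every sample in X.

    The regressor is re-trained from scratch on each training fold,
    then used to predict the samples of the matching test fold.
    """
    pred = np.zeros(y.shape)
    kf = model_selection.KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        regressor.fit(X[train_idx], y[train_idx])
        pred[test_idx] = regressor.predict(X[test_idx])
    return pred
```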
## 1.3 Linear regression with scikit-learn
__Question__ Cross-validate scikit-learn's [linear_model.LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) on your data.
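For instance, reusing the utility sketched above (summarising the result with the mean squared error is one reasonable choice):

```python
from sklearn import linear_model, metrics

lr = linear_model.LinearRegression()
pred = cross_validate_regression(X, y, lr, n_folds=10)
print("Cross-validated MSE: %.3f" % metrics.mean_squared_error(y, pred))
```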
# 2. Logistic regression
We will now implement a logistic regression. Unlike linear regression, logistic regression has no closed-form solution, so we will train it with gradient descent.
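A minimal sketch of logistic regression trained by gradient descent on the cross-entropy loss; the labels are assumed to be encoded as 0/1, and the function names, learning rate and iteration count are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping real values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_gradient_descent(X, y, lr=0.1, n_iter=1000):
    """Minimise the cross-entropy loss by batch gradient descent.

    y is assumed to contain 0/1 labels.
    """
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                 # predicted probabilities
        gradient = X.T @ (p - y) / n_samples  # gradient of the mean loss
        beta -= lr * gradient
    return beta
```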
## 2.1 Logistic regression data
Our second data set comes from the world of bioinformatics. In this data set, each observation is a tumor, described by the expression of 3,000 genes. The expression of a gene is a measure of how much of that gene is present in the biological sample. Because this affects how much of the protein encoded by this gene is produced, and because proteins dictate what cells can do, gene expression gives us valuable information about the tumor. In particular, the expression of the same gene in the same individual differs between tissues (although the DNA is the same): this is why blood cells look different from skin cells. In our data set, there are two types of tumors: breast tumors and ovary tumors. Let us see whether gene expression can be used to separate them!
__Question:__ How many samples do we have? How many belong to each class? How many features do we have?
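Assuming the design matrix X and the label vector y have already been loaded as NumPy arrays, the answer can be read off directly:

```python
import numpy as np

print("Number of samples:  %d" % X.shape[0])
print("Number of features: %d" % X.shape[1])
labels, counts = np.unique(y, return_counts=True)
for label, count in zip(labels, counts):
    print("Class %s: %d samples" % (label, count))
```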
## 2.2 Cross-validation
Let us create a cross-validation utility function (similar to what we have done in Lab 3).
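A possible adaptation of the previous utility to classification, this time returning the cross-validated probability of the positive class for every sample (again, the name and signature are only a suggestion, and the labels are assumed to be 0/1):

```python
import numpy as np
from sklearn import model_selection

def cross_validate(X, y, classifier, n_folds=10):
    """Return the cross-validated probability of the positive class."""
    proba = np.zeros(y.shape)
    kf = model_selection.StratifiedKFold(n_splits=n_folds, shuffle=True,
                                         random_state=0)
    for train_idx, test_idx in kf.split(X, y):
        classifier.fit(X[train_idx], y[train_idx])
        # predict_proba returns one column per class; keep the positive one.
        proba[test_idx] = classifier.predict_proba(X[test_idx])[:, 1]
    return proba
```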
## 2.3 Logistic regression with scikit-learn
__Question__ Cross-validate scikit-learn's [linear_model.LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) on your data.
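For instance (keeping the default regularisation parameter, and summarising the result with the area under the ROC curve):

```python
from sklearn import linear_model, metrics

clf = linear_model.LogisticRegression()
proba = cross_validate(X, y, clf, n_folds=10)
print("Cross-validated AUROC: %.3f" % metrics.roc_auc_score(y, proba))
```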
**Question:** Plot the ROC curve. Use plt.semilogx to set a logarithmic scale on the x-axis. This "spreads out" the curve a little, making it easier to read.
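One way to produce this plot, assuming `proba` holds the cross-validated positive-class probabilities from the previous question:

```python
import matplotlib.pyplot as plt
from sklearn import metrics

fpr, tpr, thresholds = metrics.roc_curve(y, proba)
plt.semilogx(fpr, tpr)  # logarithmic scale on the false positive rate axis
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC curve (cross-validated logistic regression)")
plt.show()
```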
### Data scaling
See [preprocessing.StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
**Question** Scale the data, and compute the cross-validated predictions of the logistic regression on the scaled data.
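A direct way to do this; note that here the scaler is fitted on the whole data set at once, which is precisely the shortcut discussed below:

```python
from sklearn import linear_model, preprocessing

scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(X)  # zero mean, unit variance per feature

clf = linear_model.LogisticRegression()
proba_scaled = cross_validate(X_scaled, y, clf, n_folds=10)
```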
**Question** Plot the two ROC curves (one for the logistic regression on the original data, one for the logistic regression on the scaled data) on the same plot.
In a cross-validation setting, we ignore the samples from the test fold when training the classifier. This also means that scaling should be done on the training data only.
In scikit-learn, a scaler centers and scales each feature independently, computing the relevant statistics on the samples *in the training set* only.
The mean and standard deviation are stored so that the same transformation can later be applied to the test data.
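In code, the pattern looks roughly like this, where X_train and X_test stand for the training and test folds of a given split:

```python
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
scaler.fit(X_train)                        # statistics from the training fold only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuse the training mean and std
```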
**Question** Rewrite the cross_validate method to include a scaling step.
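A possible sketch, following the structure of the earlier cross_validate function (the name matches the one used in the next question):

```python
import numpy as np
from sklearn import model_selection, preprocessing

def cross_validate_with_scaling(X, y, classifier, n_folds=10):
    """Cross-validation in which each training fold is standardised,
    and the same transformation is then applied to the matching test fold."""
    proba = np.zeros(y.shape)
    kf = model_selection.StratifiedKFold(n_splits=n_folds, shuffle=True,
                                         random_state=0)
    for train_idx, test_idx in kf.split(X, y):
        scaler = preprocessing.StandardScaler()
        X_train = scaler.fit_transform(X[train_idx])  # fit on the training fold only
        X_test = scaler.transform(X[test_idx])        # apply to the test fold
        classifier.fit(X_train, y[train_idx])
        proba[test_idx] = classifier.predict_proba(X_test)[:, 1]
    return proba
```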
**Question** Now use the cross_validate_with_scaling method to cross-validate the logistic regression on our data.
**Question** Again, compare the AUROC and ROC curves with those obtained previously. What do you conclude?