# Lab 03: Linear and logistic regressions
The goal of this lab is to explore linear and logistic regression, implement them yourself and learn to use their respective scikit-learn implementation.
Let us start by loading some of the usual librairies
import pandas as pd
%pylab inline
We will now implement a linear regression, first using the closed form solution, and second with our gradient descent.
Our first data set regards the quality ratings of a white vinho verde. Each wine is described by a number of physico-chemical descriptors such as acidity, sulfur dioxide content, density or pH.
# load the regression task data
wine_data = pd.read_csv('data/winequality-white.csv', sep=";")
wine_data.head(5)
# Load the data into X and y data arrays
X_regr = wine_data.drop(['quality'], axis=1).values
y_regr = wine_data['quality'].values
# Standardize the data
from sklearn import preprocessing
sc = preprocessing.StandardScaler()
sc.fit(X_regr)
X_regr = sc.transform(X_regr)
Let us create a cross-validation utility function (similar to what we have done in Lab 3, but for regression).
# set up folds for cross_validation
from sklearn import model_selection
folds_regr = model_selection.KFold(n_splits=10, shuffle=True)
def cross_validate_regr(design_matrix, labels, regressor, cv_folds):
""" Perform a cross-validation and returns the predictions.
Parameters:
-----------
design_matrix: (n_samples, n_features) np.array
Design matrix for the experiment.
labels: (n_samples, ) np.array
Vector of labels.
regressor: Regressor instance; must have the following methods:
- fit(X, y) to train the regressor on the data X, y
- predict(X) to apply the trained regressor to the data X and return estimates
cv_folds: sklearn cross-validation object
Cross-validation iterator.
Returns:
-------
pred: (n_samples, ) np.array
Vectors of predictions (same order as labels).
"""
pred = np.zeros(labels.shape)
for tr, te in cv_folds:
regressor.fit(design_matrix[tr,:], labels[tr])
pred[te] = (regressor.predict(design_matrix[te,:]))
return pred
Question Cross-validate scikit-learn's linear_model.LinearRegression on your data.
from sklearn import linear_model
# Initialize a LinearRegression model
regr = # TODO
# Cross-validate it
pred = # TODO
from sklearn import metrics
print("Mean squared error: %.3f" % metrics.mean_squared_error(y_regr, pred))
We will now implement a linear regression, first using the closed form solution, and second with our gradient descent.
Our second data set comes from the world of bioinformatics. In this data set, each observation is a tumor, and it is described by the expression of 3,000 genes. The expression of a gene is a measure of how much of that gene is present in the biological sample. Because this affects how much of the protein this gene codes for is produced, and because proteins dictacte what cells can do, gene expression gives us valuable information about the tumor. In particular, the expression of the same gene in the same individual is different in different tissues (although the DNA is the same): this is why blood cells look different from skin cells. In our data set, there are two types of tumors: breast tumors and ovary tumors. Let us see if gene expression can be used to separate them!
# Load the classification task data
breast_data = pd.read_csv('data/small_Breast_Ovary.csv')
# Drop the 'Tissue' column to create the design matrix
X_clf = np.array(breast_data.drop(['Tissue', 'ID_REF'], axis=1).values)
# Use the 'Tissue' column to create the labels (0=Breast, 1=Ovary)
y_clf = np.array(breast_data['Tissue'].values)
y_clf[np.where(y_clf == 'Breast')] = 0
y_clf[np.where(y_clf == 'Ovary')] = 1
y_clf = y_clf.astype(np.int)
#sc = preprocessing.StandardScaler()
#sc.fit(X_clf)
#X_clf = sc.transform(X_clf)
Question: How many samples do we have? How many belong to each class? How many features do we have?
# TODO
Let us create a cross-validation utility function (similar to what we have done in Lab 3).
# Set up folds for cross_validation
from sklearn import model_selection
folds_clf = model_selection.StratifiedKFold(n_splits=10, shuffle=True)
def cross_validate_clf(design_matrix, labels, classifier, cv_folds):
""" Perform a cross-validation and returns the predictions.
Parameters:
-----------
design_matrix: (n_samples, n_features) np.array
Design matrix for the experiment.
labels: (n_samples, ) np.array
Vector of labels.
classifier: sklearn classifier object
Classifier instance; must have the following methods:
- fit(X, y) to train the classifier on the data X, y
- predict_proba(X) to apply the trained classifier to the data X and return probability estimates
cv_folds: sklearn cross-validation object
Cross-validation iterator.
Return:
-------
pred: (n_samples, ) np.array
Vectors of predictions (same order as labels).
"""
pred = np.zeros(labels.shape)
for tr, te in cv_folds:
classifier.fit(design_matrix[tr,:], labels[tr])
pred[te] = classifier.predict_proba(design_matrix[te,:])[:,1]
return pred
Question Cross-validate scikit-learn's linear_model.LogisticRegression on your data.
from sklearn import linear_model
# Initialize a LogisticRegression model.
# Use C=1e7 to ensure there is no regularization (we'll talk about regularization next time!)
clf = # TODO
# Cross-validate it
ypred_logreg = # TODO
print("Accuracy: %.3f" % metrics.accuracy_score(ypred_logreg > 0.5, 1, 0))
Question : Plot the ROC curve. Use plt.semilogx to use a logarithmic scale on the x-axis. This "spreads out" the curve a little, making it easier to read.
fpr_logreg, tpr_logreg, thresholds = # TODO
auc_logreg = # TODO
plt.semilogx(fpr_logreg, tpr_logreg, '-', color='orange',
label='AUC = %0.3f' % auc_logreg)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve: Logistic regression', fontsize=16)
plt.legend(loc="lower right")
Question Scale the data, and compute the cross-validated predictions of the logistic regression on the scaled data.
from sklearn import preprocessing
# Scale the data with preprocessing.StandardScaler
# Initialize a scaler
scaler = # TODO
# Scale your design matrix
X_clf_scaled = # TODO
# Initialize a LogisticRegression model.
# Use C=1e7 to ensure there is no regularization (we'll talk about regularization next time!)
clf = # TODO
# Cross-validate it for the scaled data
ypred_logreg_scaled = # TODO
print("Accuracy: %.3f" % metrics.accuracy_score(ypred_logreg_scaled > 0.5, 1, 0))
Question Plot the two ROC curves (one for the logistic regression on the original data, one for the logistic regression on the scaled data) on the same plot.
fpr_logreg_scaled, tpr_logreg_scaled, thresholds = # TODO
auc_logreg_scaled = # TODO
plt.semilogx(fpr_logreg_scaled, tpr_logreg_scaled, '-', color='blue',
label='scaled data; AUC = %0.3f' % auc_logreg_scaled)
plt.semilogx(fpr_logreg, tpr_logreg, '-', color='orange',
label='original data; AUC = %0.3f' % auc_logreg)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve: Logistic regression', fontsize=16)
plt.legend(loc="lower right")
In a cross-validation setting, we ignore the samples from the test fold when training the classifier. This also means that scaling should be done on the training data only.
In scikit-learn, we can use a scaler to make centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. The mean and standard deviation will be stored to be used on the test data.
Question Rewrite the cross_validate method to include a scaling step.
def cross_validate_clf_with_scaling(design_matrix, labels, classifier, cv_folds):
""" Perform a cross-validation and returns the predictions.
Parameters:
-----------
design_matrix: (n_samples, n_features) np.array
Design matrix for the experiment.
labels: (n_samples, ) np.array
Vector of labels.
classifier: sklearn classifier object
Classifier instance; must have the following methods:
- fit(X, y) to train the classifier on the data X, y
- predict_proba(X) to apply the trained classifier to the data X and return probability estimates
cv_folds: sklearn cross-validation object
Cross-validation iterator.
Return:
-------
pred: (n_samples, ) np.array
Vectors of predictions (same order as labels).
"""
pred = np.zeros(labels.shape)
for tr, te in cv_folds:
# TODO
return pred
Question Now use the cross_validate_with_scaling method to cross-validate the logistic regression on our data.
clf = linear_model.LogisticRegression(C=1e6)
ypred_logreg_scaled_ = cross_validate_clf_with_scaling(X_clf, y_clf, clf, folds_clf)
print(metrics.accuracy_score(y_clf,np.where(ypred_logreg_scaled_ > 0.5, 1, 0)))
Question Again, compare the AUROC and ROC curves with those obtained previously. What do you conclude?
fpr_logreg_scaled_, tpr_logreg_scaled_, thresholds = # TODO
auc_logreg_scaled_ = # TODO
plt.semilogx(fpr_logreg_scaled, tpr_logreg_scaled, '-',
color='blue', label='scaled data overfit; AUC = %0.3f' % auc_logreg_scaled)
plt.semilogx(fpr_logreg, tpr_logreg, '-', color='orange',
label='original data; AUC = %0.3f' % auc_logreg)
plt.semilogx(fpr_logreg_scaled_, tpr_logreg_scaled_, '-', color='black',
label='scaled data no overfit; AUC = %0.3f' % auc_logreg_scaled_)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve: Logistic regression', fontsize=16)
plt.legend(loc="lower right")