
Lab 05: Regularized linear regression

The goal of this lab is to explore and understand l1 and l2 regularization of linear models.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

1. Classification data

We will use the same data as in Lab 4: the samples are tumors, each described by the expression (= the abundance) of 3,000 genes. The goal is to separate the endometrium tumors from the uterine ones.

# load the endometrium vs. uterus tumor data
endometrium_data = pd.read_csv('data/small_Endometrium_Uterus.csv', sep=",")  # load data
endometrium_data.head(n=5)  # adjust n to view more data
# Create the design matrix and target vector
X_clf = endometrium_data.drop(['ID_REF', 'Tissue'], axis=1).values
y_clf = pd.get_dummies(endometrium_data['Tissue']).values[:,1]

Question: Split the data into a training set containing 70% of the samples and a test set containing the remaining 30%, using model_selection.train_test_split.

from sklearn import model_selection
Xtr, Xte, ytr, yte = model_selection.train_test_split(X_clf, y_clf, 
                                                      #test_size=#TODO, 
                                                      random_state=27)
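
A possible completion of the cell above, assuming the 70/30 split requested in the question (same variables as already defined above):

Xtr, Xte, ytr, yte = model_selection.train_test_split(X_clf, y_clf, 
                                                      test_size=0.3,  # keep 30% of the samples for the test set
                                                      random_state=27)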

Let us also compute a scaled version of the data. The scaler is fit on the training set only, and the resulting scaling parameters (mean, standard deviation) are then applied to the test set. Question: Why?

from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
Xtr_scaled = scaler.fit_transform(Xtr)
Xte_scaled = scaler.transform(Xte)

2. Logistic regression, not regularized

Let us train a logistic regression without regularization on our training set, and evaluate it on the test set. This is similar to Lab 4 and will serve as a comparison point.

# Logistic regression (no regularization, no scaling)
from sklearn import linear_model
clf_logreg = linear_model.LogisticRegression(C=1e6) # large C = no regularization

# Train the model
clf_logreg.fit(Xtr, ytr)

# Predict on the test set
# Predicted probabilities of belonging to the positive class
pos_idx = list(clf_logreg.classes_).index(1)
ypred_logreg = clf_logreg.predict_proba(Xte)[:, pos_idx]

# Predicted binary labels
ypred_logreg_b = np.where(ypred_logreg > 0.5, 1, 0)

from sklearn import metrics
print("No regularization: accuracy = %.3f" % metrics.accuracy_score(yte, ypred_logreg_b))
print("AUC = %.3f" % (metrics.roc_auc_score(yte, ypred_logreg)))

Question: Repeat the experiment on the scaled data. What do you observe in terms of performance?

# Logistic regression (no regularization, scaling)
clf_logreg_s = linear_model.LogisticRegression(C=1e6)

# Train the model
# TODO

# Predict on the test set
# Predicted probabilities of belonging to the positive class
pos_idx = list(clf_logreg_s.classes_).index(1)
ypred_logreg_s = # TODO
# Predicted binary labels
ypred_logreg_s_b = # TODO

print("Scaled, no regularization: accuracy = %.3f" % metrics.accuracy_score(yte, ypred_logreg_s_b))
print("AUC = %.3f" % (metrics.roc_auc_score(yte, ypred_logreg_s)))

3. L2-regularized logistic regression

Question: What is the role of L2 regularization?

Let us use an l2-regularized logistic regression with the parameter C set to 0.01.

Question: What is the role of C? How does it relate to the lambda regularization parameter we have seen in class?
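
For reference, scikit-learn's l2-penalized logistic regression minimizes an objective of the form (intercept omitted for simplicity)

$$\min_{w} \; \frac{1}{2}\|w\|_2^2 \; + \; C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i\, w^\top x_i\right)\right), \qquad y_i \in \{-1, +1\},$$

so C multiplies the data-fitting term rather than the penalty and plays the role of $1/\lambda$: a small C means strong regularization.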

Question: Train the l2-regularized logistic regression initialized below on the scaled training data, and evaluate it on the scaled test set (as above). How does the performance evolve?

cvalue = 0.01
clf_logreg_l2_s = linear_model.LogisticRegression(C=cvalue, penalty='l2')

# Train the model
# TODO

# index of positive class
pos_idx = list(clf_logreg_l2_s.classes_).index(1)
# predict probability of being positive
ypred_logreg_l2_s = # TODO
# predict binary labels
ypred_logreg_l2_s_b = np.where(ypred_logreg_l2_s > 0.5, 1, 0)

print("Scaled, l2 regularization (C=%.2e): accuracy = %.3f" % (cvalue, 
                                                               metrics.accuracy_score(yte, ypred_logreg_l2_s_b)))
print("AUC = %.3f" % (metrics.roc_auc_score(yte, ypred_logreg_l2_s)))

3.1 Effect of L2-regularization on the logistic regression coefficients

We will now look at how the regression coefficients have evolved between the non-regularized and the regularized versions of the logistic regression.

Question: Fill in the blanks below to plot the regression coefficients of both the trained clf_logreg_l2_s and clf_logreg_s models. Use the documentation to figure out how to access these coefficients. What do you observe?

# Effect of l2-regularization on the weights
num_features = X_clf.shape[1]
plt.scatter(range(num_features), # TODO, 
            color='blue', marker='x', label='Logistic regression')
plt.scatter(range(num_features), #TODO, 
            color='orange', marker='+', label='L2-regularized logistic regression')

plt.xlabel('Genes', fontsize=16)
plt.ylabel('Weights', fontsize=16)
plt.title('Logistic regression weights', fontsize=16)
plt.legend(fontsize=14, loc=(1.05, 0))
plt.xlim([0, num_features])
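
For a fitted LogisticRegression, the weights are stored in the coef_ attribute, an array of shape (1, n_features) for a binary problem; the two TODOs above can therefore be filled in with, for instance:

plt.scatter(range(num_features), clf_logreg_s.coef_[0], 
            color='blue', marker='x', label='Logistic regression')
plt.scatter(range(num_features), clf_logreg_l2_s.coef_[0], 
            color='orange', marker='+', label='L2-regularized logistic regression')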

3.2 Optimization of the regularization parameter

We will now use 3-fold cross-validation on the training set to optimize the value of C. Scikit-learn makes it easy to use cross-validation to choose a good value for C among a grid of candidates: check the GridSearchCV class.

# Create a range of values to test for the parameter C
cvalues_list = np.logspace(-5, 1, 20)
print(cvalues_list)

Question: Fill in the blanks below to find the optimal value of the parameter C.

Use the .best_estimator_ attribute of a GridSearchCV.

# Optimize cvalue
classifier = linear_model.LogisticRegression(penalty='l2')
param_grid = {'C': #TODO
             }
clf_logreg_l2_s_opt = model_selection.GridSearchCV(#TODO, 
                                                   #TODO, 
                                                   cv=3)     

# Train the model
# TODO

# index of the positive class
pos_idx = # TODO
# predict probability of being positive
ypred_logreg_l2_s_opt = # TODO
# predict binary label
ypred_logreg_l2_s_opt_b = # TODO

# optimal value of C
cvalue_opt = # TODO
print("Scaled, l2 regularization (C=%.2e): accuracy = %.3f" % (cvalue_opt, 
                                                               metrics.accuracy_score(yte, ypred_logreg_l2_s_opt_b)))
print("AUC = %.3f" % (metrics.roc_auc_score(yte, ypred_logreg_l2_s_opt)))

Question: Fill in the code below to compare the ROC curves of the non-regularized and l2-regularized logistic regressions.

fpr_logreg_s, tpr_logreg_s, t = # TODO
auc_logreg_s = metrics.auc(fpr_logreg_s, tpr_logreg_s)
plt.plot(fpr_logreg_s, tpr_logreg_s, color='blue', 
         label='No regularization: AUC = %0.3f' % auc_logreg_s)

fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, t = # TODO
auc_logreg_l2_s_opt = metrics.auc(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt)
plt.plot(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, color='orange', 
         label='L2 regularization: AUC = %0.3f' % auc_logreg_l2_s_opt)

plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve: Logistic regression', fontsize=16)
plt.legend(fontsize=14)
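
The two missing calls are metrics.roc_curve applied to the true test labels and the predicted probabilities, for instance:

fpr_logreg_s, tpr_logreg_s, t = metrics.roc_curve(yte, ypred_logreg_s, pos_label=1)
fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, t = metrics.roc_curve(yte, ypred_logreg_l2_s_opt, pos_label=1)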

Question: Is the optimal C larger or smaller than the one we tried before? Does this mean more or less regularization? Do you expect larger or smaller regression coefficients?

Question: Fill in the blanks to compare the regression weights of the different methods.

# Effect of l2-regularization on the weights
num_features = X_clf.shape[1]
plt.scatter(range(num_features), # TODO, 
            color='blue', marker='x', label='Logistic regression')
plt.scatter(range(num_features), # TODO, 
            color='orange', marker='+', label='L2-regularized logistic regression')
plt.scatter(range(num_features), # TODO, 
            color='magenta', marker='.', label='L2-regularized logistic regression (opt)')

plt.xlabel('Genes', fontsize=16)
plt.ylabel('Weights', fontsize=16)
plt.title('Logistic regression weights', fontsize=16)
plt.legend(fontsize=14, loc=(1.05, 0))
plt.xlim([0, num_features])
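
The only new ingredient compared to the previous weight plot is the optimized model: its coefficients are reached through the best_estimator_ attribute of the fitted grid search, for instance:

plt.scatter(range(num_features), clf_logreg_l2_s_opt.best_estimator_.coef_[0], 
            color='magenta', marker='.', label='L2-regularized logistic regression (opt)')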

4. L1-regularized logistic regression

Question: What is the role of l1 regularization in logistic regression?

Question: Instead of an l2-regularized logistic regression with C=0.01, now train and evaluate an l1-regularized logistic regression with C=10.0.

cvalue = # TODO
clf_logreg_l1_s = # TODO
clf_logreg_l1_s.fit(Xtr_scaled, ytr)

# index of the positive class
pos_idx = list(clf_logreg_l1_s.classes_).index(1)
# predict the probability of belonging to the positive class
ypred_logreg_l1_s = clf_logreg_l1_s.predict_proba(Xte_scaled)[:, pos_idx]
# predict binary labels
ypred_logreg_l1_s_b = np.where(ypred_logreg_l1_s > 0.5, 1, 0)

print("Scaled, l1 regularization (C=%.2e): accuracy = %.3f" % (cvalue, 
                                                               metrics.accuracy_score(yte, ypred_logreg_l1_s_b)))
print("AUC = %.3f" % (metrics.roc_auc_score(yte, ypred_logreg_l1_s)))

Question: How did the performance evolve?

4.1 Effect of regularization on the regression coefficients

Question: Plot the weights that were given to each feature in your data.

num_features = X_clf.shape[1]
plt.scatter(range(num_features), #TODO, 
            color='blue', marker='x', label='Logistic regression')
plt.scatter(range(num_features), #TODO, 
            color='orange', marker='+', label='L1-regularized logistic regression')

plt.xlabel('Genes', fontsize=16)
plt.ylabel('Weights', fontsize=16)
plt.title('Logistic regression weights', fontsize=16)
plt.legend(fontsize=14, loc=(1.05, 0))
plt.xlim([0, num_features])

Question: What do you observe? How does this differ from l2-regularization?

Question: How many weights are different from zero? How many features are not used by the l1-regularized model?

# Number of selected features
print("The non-regularized logistic regression uses %d features" % len(np.where(clf_logreg_s.coef_ !=0)[1]))
print("The L2-regularized logistic regression uses %d features" % #TODO
     )
print("The L1-regularized logistic regression uses %d features" % #TODO
     )

print("Number of features discarded by the L1-regularization: %d" % #TODO

4.2 Optimization of the regularization parameter

Question: Fill in the blanks to optimize the value of C for l1-regularized logistic regression.

# Optimize cvalue
cvalues_list = np.logspace(-2, 3, 20)
print(cvalues_list)

classifier = # TODO
param_grid = # TODO
clf_logreg_l1_s_opt = # TODO   

# Train the model
# TODO

# index of the positive class
pos_idx = # TODO
# predict the probability of belonging to the positive class
ypred_logreg_l1_s_opt = # TODO
# predict binary labels
ypred_logreg_l1_s_b = # TODO

# optimal value of C
cvalue_opt = # TODO

print("Scaled, l1 regularization (C=%.2e): accuracy = %.3f" % (cvalue_opt, 
                                                               metrics.accuracy_score(yte, ypred_logreg_l1_s_b)))
print("AUC = %.3f" % (metrics.roc_auc_score(yte, ypred_logreg_l1_s_opt)))
# ROC curve
fpr_logreg_s, tpr_logreg_s, t = # TODO
auc_logreg_s = metrics.auc(fpr_logreg_s, tpr_logreg_s)
plt.plot(fpr_logreg_s, tpr_logreg_s, color='blue', 
         label='No regularization: AUC = %0.3f' % auc_logreg_s)

fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, t = # TODO
auc_logreg_l2_s_opt = metrics.auc(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt)
plt.plot(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, color='orange', 
         label='L2 regularization: AUC = %0.3f' % auc_logreg_l2_s_opt)

fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt, t = metrics.roc_curve(yte, ypred_logreg_l1_s_opt, pos_label=1)
auc_logreg_l1_s_opt = metrics.auc(fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt)
plt.plot(fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt, color='magenta', 
         label='L1 regularization: AUC = %0.3f' % auc_logreg_l1_s_opt)

plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve: Logistic regression', fontsize=16)
plt.legend(fontsize=14)
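
A possible completion of the grid search above, mirroring the l2 case (the solver argument is again needed on recent scikit-learn versions, and the two ROC TODOs are the same metrics.roc_curve calls as in the previous section):

classifier = linear_model.LogisticRegression(penalty='l1', solver='liblinear')
param_grid = {'C': cvalues_list}
clf_logreg_l1_s_opt = model_selection.GridSearchCV(classifier, param_grid, cv=3)

# Train the model
clf_logreg_l1_s_opt.fit(Xtr_scaled, ytr)

# index of the positive class
pos_idx = list(clf_logreg_l1_s_opt.best_estimator_.classes_).index(1)
# predict the probability of belonging to the positive class
ypred_logreg_l1_s_opt = clf_logreg_l1_s_opt.predict_proba(Xte_scaled)[:, pos_idx]
# predict binary labels
ypred_logreg_l1_s_b = np.where(ypred_logreg_l1_s_opt > 0.5, 1, 0)

# optimal value of C
cvalue_opt = clf_logreg_l1_s_opt.best_params_['C']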

Question: How many genes does the l1-regularized approach select? How does this affect the performance, compared to C=10.0 or the l2-regularized approach?

print("The L1-regularized logistic regression uses %d features" % # TODO
     )
print("The optimized L1-regularized logistic regression uses %d features" % \
      # TODO
     )

Question: Compare the features selected with C=10.0 and with your optimal C value. What do you observe?

num_features = X_clf.shape[1]
plt.scatter(range(num_features), #TODO, 
            color='orange', marker='+', label='L1-regularized logistic regression')
plt.scatter(range(num_features), #TODO, 
            color='magenta', marker='x', label='L1-regularized logistic regression (optimal C)')

plt.xlabel('Genes', fontsize=16)
plt.ylabel('Weights', fontsize=16)
plt.title('Logistic regression weights', fontsize=16)
plt.legend(fontsize=14, loc=(1.05, 0))
plt.xlim([0, num_features])