The goal of this lab is to explore and understand l1 and l2 regularization of linear models.
import pandas as pd
%pylab inline
We will use the same data as in Lab 4: the samples are tumors, each described by the expression (= the abundance) of 3,000 genes. The goal is to separate the endometrium tumors from the uterine ones.
# load the endometrium vs. uterus tumor data
endometrium_data = pd.read_csv('data/small_Endometrium_Uterus.csv', sep=",") # load data
endometrium_data.head(n=5) # adjust n to view more data
# Create the design matrix and target vector
X_clf = endometrium_data.drop(['ID_REF', 'Tissue'], axis=1).values
y_clf = pd.get_dummies(endometrium_data['Tissue']).values[:,1]
Question: Split the data in a train set containing 70% of the data and a test set containing the remaining 30%. We use model_selection.train_test_split
from sklearn import model_selection
Xtr, Xte, ytr, yte = model_selection.train_test_split(X_clf, y_clf,
#test_size=#TODO,
random_state=27)
Let us also compute a scaled version of the data. The data is scaled on the train set, and the scaling parameters (mean, standard deviation) are applied to the test set. Question: Why?
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
Xtr_scaled = scaler.fit_transform(Xtr)
Xte_scaled = scaler.transform(Xte)
Let us train a logisitic regression without regularization on our train set, and evaluate it on the test set. This is similar to Lab 4 and will serve as a comparison point.
# Logistic regression (no regularization, no scaling)
from sklearn import linear_model
clf_logreg = linear_model.LogisticRegression(C=1e6) # large C = no regularization
# Train the model
clf_logreg.fit(Xtr, ytr)
# Predict on the test set
# Predicted probabilities of belonging to the positive class
pos_idx = list(clf_logreg.classes_).index(1)
ypred_logreg = clf_logreg.predict_proba(Xte)[:, pos_idx]
# Predicted binary labels
ypred_logreg_b = np.where(ypred_logreg > 0.5, 1, 0)
from sklearn import metrics
print("No regularization: accuracy = %.3f" % metrics.accuracy_score(yte, ypred_logreg_b))
print("AUC = %.3f" % (metrics.roc_auc_score(yte, ypred_logreg)))
Question: Repeat the experiment on the scaled data. What do you observe in terms of performance?
# Logistic regression (no regularization, scaling)
clf_logreg_s = linear_model.LogisticRegression(C=1e6)
# Train the model
# TODO
# Predict on the test set
# Predicted probabilities of belonging to the positive class
pos_idx = list(clf_logreg_s.classes_).index(1)
ypred_logreg_s = # TODO
# Predicted binary labels
ypred_logreg_s_b = # TODO
print("Scaled, no regularization: accuracy = %.3f" % metrics.accuracy_score(yte, ypred_logreg_s_b))
print("AUC = %.3f" % (metrics.roc_auc_score(yte, ypred_logreg_s)))
Question: What is the role of L2 regularization?
Let us use an l2-regularized logistic regression with the parameter C
set to 0.01.
Question: What is the role of C
? How does it relate to the lambda
regularization parameter we have seen in class?
Question: Train the l2-regularized logistic regression initialized below on the scaled training data, and evaluate it on the sclaed test set (as above). How does the performance evolve?
cvalue = 0.01
clf_logreg_l2_s = linear_model.LogisticRegression(C=cvalue, penalty='l2')
# Train the model
# TODO
# index of positive class
pos_idx = list(clf_logreg_l2_s.classes_).index(1)
# predict probability of being positive
ypred_logreg_l2_s = # TODO
# predict binary labels
ypred_logreg_l2_s_b = np.where(ypred_logreg_l2_s > 0.5, 1, 0)
print("Scaled, l2 regularization (C=%.2e): accuracy = %.3f" % (cvalue,
metrics.accuracy_score(yte, ypred_logreg_l2_s_b)))
print("AUC = %.3f" % (metrics.roc_auc_score(yte, ypred_logreg_l2_s)))
We will now look at how the regression coefficients have evolved between the non-regularized and the regularized versions of the logistic regression.
Question: Fill in the blanks below to plot the regression coefficients of both the trained clf_logreg_l2_s
and clf_logreg_s
models. Use the documentation to figure out how to access these coefficients. What do you observe?
# Effect of l2-regularization on the weights
num_features = X_clf.shape[1]
plt.scatter(range(num_features), # TODO,
color='blue', marker='x', label='Logistic regression')
plt.scatter(range(num_features), #TODO,
color='orange', marker='+', label='L2-regularized logistic regression')
plt.xlabel('Genes', fontsize=16)
plt.ylabel('Weights', fontsize=16)
plt.title('Logistic regression weights', fontsize=16)
plt.legend(fontsize=14, loc=(1.05, 0))
plt.xlim([0, num_features])
We will now use a 3-fold cross-validation on the training set to optimize the value of C. Scikit-learn makes it really easy to use a cross-validation to choose a good value for $\alpha$ among a grid of several choices. Check the GridSearchCV class.
# Create a range of values to test for the parameter C
cvalues_list = np.logspace(-5, 1, 20)
print(cvalues_list)
Question: Fill in the blanks below to find the optimal value of the parameter C.
Use the .best_estimator_
attribute of a GridSearchCV
.
# Optimize cvalue
classifier = linear_model.LogisticRegression(penalty='l2')
param_grid = {'C': #TODO
}
clf_logreg_l2_s_opt = model_selection.GridSearchCV(#TODO,
#TODO,
cv=3)
# Train the model
# TODO
# index of the positive class
pos_idx = # TODO
# predict probability of being positive
ypred_logreg_l2_s_opt = # TODO
# predict binary label
ypred_logreg_l2_s_opt_b = # TODO
# optimal value of C
cvalue_opt = # TODO
print("Scaled, l2 regularization (C=%.2e): accuracy = %.3f" % (cvalue_opt,
metrics.accuracy_score(yte, ypred_logreg_l2_s_opt_b)))
print("AUC = %.3f" % (metrics.roc_auc_score(yte, ypred_logreg_l2_s_opt)))
Question: Fill in the code below to compare the ROC curves of the non-regularized and l2-regularized logistic regressions.
fpr_logreg_s, tpr_logreg_s, t = # TODO
auc_logreg_s = metrics.auc(fpr_logreg_s, tpr_logreg_s)
plt.plot(fpr_logreg_s, tpr_logreg_s, color='blue',
label='No regularization: AUC = %0.3f' % auc_logreg_s)
fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, t = # TODO
auc_logreg_l2_s_opt = metrics.auc(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt)
plt.plot(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, color='orange',
label='L2 regularization: AUC = %0.3f' % auc_logreg_l2_s_opt)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve: Logistic regression', fontsize=16)
plt.legend(fontsize=14)
Question: Is the optimal C larger or smaller than the one we tried before? Does this mean more or less regularization? Do you expect larger or smaller regularization coefficients?
Question: Fill in the blanks to compare the regularization weights of the different methods.
# Effect of l2-regularization on the weights
num_features = X_clf.shape[1]
plt.scatter(range(num_features), # TODO,
color='blue', marker='x', label='Logistic regression')
plt.scatter(range(num_features), # TODO,
color='orange', marker='+', label='L2-regularized logistic regression')
plt.scatter(range(num_features), # TODO,
color='magenta', marker='.', label='L2-regularized logistic regression (opt)')
plt.xlabel('Genes', fontsize=16)
plt.ylabel('Weights', fontsize=16)
plt.title('Logistic regression weights', fontsize=16)
plt.legend(fontsize=14, loc=(1.05, 0))
plt.xlim([0, num_features])
Question: What is the role of the l1-regularized logistic regression?
Question: Instead of a l2-regularized logistic regression with C=0.01
, now train and evaluate a l1-regularized logistic regression with C=10.0
.
cvalue = # TODO
clf_logreg_l1_s = # TODO
clf_logreg_l1_s.fit(Xtr_scaled, ytr)
# index of the positive class
pos_idx = list(clf_logreg_l1_s.classes_).index(1)
# predict the probability of belonging to the positive class
ypred_logreg_l1_s = clf_logreg_l1_s.predict_proba(Xte_scaled)[:, pos_idx]
# predict binary labels
ypred_logreg_l1_s_b = np.where(ypred_logreg_l1_s > 0.5, 1, 0)
print("Scaled, l1 regularization (C=%.2e): accuracy = %.3f" % (cvalue,
metrics.accuracy_score(yte, ypred_logreg_l1_s_b)))
print("AUC = %.3f" % (metrics.roc_auc_score(yte, ypred_logreg_l1_s)))
Question: How did the performance evolve?
Question: Plot the weights that were given to each feature in your data.
num_features = X_clf.shape[1]
plt.scatter(range(num_features), #TODO,
color='blue', marker='x', label='Logistic regression')
plt.scatter(range(num_features), #TODO,
color='orange', marker='+', label='L1-regularized logistic regression')
plt.xlabel('Genes', fontsize=16)
plt.ylabel('Weights', fontsize=16)
plt.title('Logistic regression weights', fontsize=16)
plt.legend(fontsize=14, loc=(1.05, 0))
plt.xlim([0, num_features])
Question: What do you observe? How does this differ from l2-regularization?
Question: How many weights are different from zero? How many features are not used by the l1-regularized model?
# Number of selected features
print("The non-regularized logistic regression uses %d features" % len(np.where(clf_logreg_s.coef_ !=0)[1]))
print("The L2-regularized logistic regression uses %d features" % #TODO
)
print("The L1-regularized logistic regression uses %d features" % #TODO
)
print("Number of features discarded by the L1-regularization: %d" % #TODO
Question: Fill in the blanks to optimize the value of C
for l1-regularized logistic regression.
# Optimize cvalue
cvalues_list = np.logspace(-2, 3, 20)
print(cvalues_list)
classifier = # TODO
param_grid = # TODO
clf_logreg_l1_s_opt = # TODO
# Train the model
# TODO
# index of the positive class
pos_idx = # TODO
# predict the probability of belonging to the positive class
ypred_logreg_l1_s_opt = # TODO
# predict binary labels
ypred_logreg_l1_s_b = # TODO
# optimal value of C
cvalue_opt = # TODO
print("Scaled, l1 regularization (C=%.2e): accuracy = %.3f" % (cvalue_opt,
metrics.accuracy_score(yte, ypred_logreg_l1_s_b)))
print("AUC = %.3f" % (metrics.roc_auc_score(yte, ypred_logreg_l1_s_opt)))
# ROC curve
fpr_logreg_s, tpr_logreg_s, t = # TODO
auc_logreg_s = metrics.auc(fpr_logreg_s, tpr_logreg_s)
plt.plot(fpr_logreg_s, tpr_logreg_s, color='blue',
label='No regularization: AUC = %0.3f' % auc_logreg_s)
fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, t = # TODO
auc_logreg_l2_s_opt = metrics.auc(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt)
plt.plot(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, color='orange',
label='L2 regularization: AUC = %0.3f' % auc_logreg_l2_s_opt)
fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt, t = metrics.roc_curve(yte, ypred_logreg_l1_s_opt, pos_label=1)
auc_logreg_l1_s_opt = metrics.auc(fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt)
plt.plot(fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt, color='magenta',
label='L1 regularization: AUC = %0.3f' % auc_logreg_l1_s_opt)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve: Logistic regression', fontsize=16)
plt.legend(fontsize=14)
Question: How many genes does the l1-regularized approach select? How does this affect the performance, compared to C=10.0
or the l2-regularized approach?
print("The L1-regularized logistic regression uses %d features" % # TODO
)
print("The optimized L1-regularized logistic regression uses %d features" % \
# TODO
)
Question: Compare the features selected with C=0.01
and your optimal C
value. What do you observe?
num_features = X_clf.shape[1]
plt.scatter(range(num_features), #TODO,
color='orange', marker='+', label='L1-regularized logistic regression')
plt.scatter(range(num_features), #TODO,
color='magenta', marker='x', label='L1-regularized logistic regression (optimal C)')
plt.xlabel('Genes', fontsize=16)
plt.ylabel('Weights', fontsize=16)
plt.title('Logistic regression weights', fontsize=16)
plt.legend(fontsize=14, loc=(1.05, 0))
plt.xlim([0, num_features])