{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 05: Regularized linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this lab is to explore and understand l1 and l2 regularization of linear models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "%pylab inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Classification data\n", "\n", "We will use the same data as in Lab 4: the samples are tumors, each described by the expression (= the abundance) of 3,000 genes. The goal is to separate the endometrium tumors from the uterine ones." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load the endometrium vs. uterus tumor data\n", "endometrium_data = pd.read_csv('data/small_Endometrium_Uterus.csv', sep=\",\") # load data\n", "endometrium_data.head(n=5) # adjust n to view more data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the design matrix and target vector\n", "X_clf = endometrium_data.drop(['ID_REF', 'Tissue'], axis=1).values\n", "y_clf = pd.get_dummies(endometrium_data['Tissue']).values[:,1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Split the data in a train set containing 70% of the data and a test set containing the remaining 30%. We use [model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import model_selection\n", "Xtr, Xte, ytr, yte = model_selection.train_test_split(X_clf, y_clf, \n", " #test_size=#TODO, \n", " random_state=27)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us also compute a scaled version of the data. The data is scaled on the _train_ set, and the scaling parameters (mean, standard deviation) are applied to the test set. __Question:__ Why?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import preprocessing\n", "scaler = preprocessing.StandardScaler()\n", "Xtr_scaled = scaler.fit_transform(Xtr)\n", "Xte_scaled = scaler.transform(Xte)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Logistic regression, not regularized \n", "\n", "Let us train a logisitic regression _without regularization_ on our train set, and evaluate it on the test set. This is similar to Lab 4 and will serve as a comparison point." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Logistic regression (no regularization, no scaling)\n", "from sklearn import linear_model\n", "clf_logreg = linear_model.LogisticRegression(C=1e6) # large C = no regularization\n", "\n", "# Train the model\n", "clf_logreg.fit(Xtr, ytr)\n", "\n", "# Predict on the test set\n", "# Predicted probabilities of belonging to the positive class\n", "pos_idx = list(clf_logreg.classes_).index(1)\n", "ypred_logreg = clf_logreg.predict_proba(Xte)[:, pos_idx]\n", "\n", "# Predicted binary labels\n", "ypred_logreg_b = np.where(ypred_logreg > 0.5, 1, 0)\n", "\n", "from sklearn import metrics\n", "print(\"No regularization: accuracy = %.3f\" % metrics.accuracy_score(yte, ypred_logreg_b))\n", "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Repeat the experiment on the scaled data. What do you observe in terms of performance?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Logistic regression (no regularization, scaling)\n", "clf_logreg_s = linear_model.LogisticRegression(C=1e6)\n", "\n", "# Train the model\n", "# TODO\n", "\n", "# Predict on the test set\n", "# Predicted probabilities of belonging to the positive class\n", "pos_idx = list(clf_logreg_s.classes_).index(1)\n", "ypred_logreg_s = # TODO\n", "# Predicted binary labels\n", "ypred_logreg_s_b = # TODO\n", "\n", "print(\"Scaled, no regularization: accuracy = %.3f\" % metrics.accuracy_score(yte, ypred_logreg_s_b))\n", "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_s)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. L2-regularized logistic regression \n", "\n", "__Question:__ What is the role of L2 regularization?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us use an l2-regularized logistic regression with the parameter `C` set to 0.01. \n", "\n", "__Question:__ What is the role of `C`? How does it relate to the `lambda` regularization parameter we have seen in class?\n", "\n", "__Question:__ Train the l2-regularized logistic regression initialized below on the scaled training data, and evaluate it on the sclaed test set (as above). How does the performance evolve?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cvalue = 0.01\n", "clf_logreg_l2_s = linear_model.LogisticRegression(C=cvalue, penalty='l2')\n", "\n", "# Train the model\n", "# TODO\n", "\n", "# index of positive class\n", "pos_idx = list(clf_logreg_l2_s.classes_).index(1)\n", "# predict probability of being positive\n", "ypred_logreg_l2_s = # TODO\n", "# predict binary labels\n", "ypred_logreg_l2_s_b = np.where(ypred_logreg_l2_s > 0.5, 1, 0)\n", "\n", "print(\"Scaled, l2 regularization (C=%.2e): accuracy = %.3f\" % (cvalue, \n", " metrics.accuracy_score(yte, ypred_logreg_l2_s_b)))\n", "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_l2_s)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Effect of L2-regularization on the logistic regression coefficients\n", "\n", "We will now look at how the regression coefficients have evolved between the non-regularized and the regularized versions of the logistic regression.\n", "\n", "__Question:__ Fill in the blanks below to plot the regression coefficients of both the trained `clf_logreg_l2_s` and `clf_logreg_s` models. Use the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to figure out how to access these coefficients. What do you observe?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Effect of l2-regularization on the weights\n", "num_features = X_clf.shape[1]\n", "plt.scatter(range(num_features), # TODO, \n", " color='blue', marker='x', label='Logistic regression')\n", "plt.scatter(range(num_features), #TODO, \n", " color='orange', marker='+', label='L2-regularized logistic regression')\n", "\n", "plt.xlabel('Genes', fontsize=16)\n", "plt.ylabel('Weights', fontsize=16)\n", "plt.title('Logistic regression weights', fontsize=16)\n", "plt.legend(fontsize=14, loc=(1.05, 0))\n", "plt.xlim([0, num_features])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.2 Optimization of the regularization parameter\n", "\n", "We will now use a 3-fold cross-validation on the training set to optimize the value of C. Scikit-learn makes it really easy to use a cross-validation to choose a good value for $\\alpha$ among a grid of several choices. Check the [GridSearchCV class](http://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html#sklearn.grid_search.GridSearchCV)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a range of values to test for the parameter C\n", "cvalues_list = np.logspace(-5, 1, 20)\n", "print(cvalues_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Fill in the blanks below to find the optimal value of the parameter C.\n", "\n", "Use the `.best_estimator_` attribute of a `GridSearchCV`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optimize cvalue\n", "classifier = linear_model.LogisticRegression(penalty='l2')\n", "param_grid = {'C': #TODO\n", " }\n", "clf_logreg_l2_s_opt = model_selection.GridSearchCV(#TODO, \n", " #TODO, \n", " cv=3) \n", "\n", "# Train the model\n", "# TODO\n", "\n", "# index of the positive class\n", "pos_idx = # TODO\n", "# predict probability of being positive\n", "ypred_logreg_l2_s_opt = # TODO\n", "# predict binary label\n", "ypred_logreg_l2_s_opt_b = # TODO\n", "\n", "# optimal value of C\n", "cvalue_opt = # TODO\n", "print(\"Scaled, l2 regularization (C=%.2e): accuracy = %.3f\" % (cvalue_opt, \n", " metrics.accuracy_score(yte, ypred_logreg_l2_s_opt_b)))\n", "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_l2_s_opt)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Fill in the code below to compare the ROC curves of the non-regularized and l2-regularized logistic regressions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fpr_logreg_s, tpr_logreg_s, t = # TODO\n", "auc_logreg_s = metrics.auc(fpr_logreg_s, tpr_logreg_s)\n", "plt.plot(fpr_logreg_s, tpr_logreg_s, color='blue', \n", " label='No regularization: AUC = %0.3f' % auc_logreg_s)\n", "\n", "fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, t = # TODO\n", "auc_logreg_l2_s_opt = metrics.auc(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt)\n", "plt.plot(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, color='orange', \n", " label='L2 regularization: AUC = %0.3f' % auc_logreg_l2_s_opt)\n", "\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curve: Logistic regression', fontsize=16)\n", "plt.legend(fontsize=14)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Is the optimal C larger or smaller than the one we tried before? Does this mean more or less regularization? Do you expect larger or smaller regularization coefficients?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Fill in the blanks to compare the regularization weights of the different methods." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Effect of l2-regularization on the weights\n", "num_features = X_clf.shape[1]\n", "plt.scatter(range(num_features), # TODO, \n", " color='blue', marker='x', label='Logistic regression')\n", "plt.scatter(range(num_features), # TODO, \n", " color='orange', marker='+', label='L2-regularized logistic regression')\n", "plt.scatter(range(num_features), # TODO, \n", " color='magenta', marker='.', label='L2-regularized logistic regression (opt)')\n", "\n", "plt.xlabel('Genes', fontsize=16)\n", "plt.ylabel('Weights', fontsize=16)\n", "plt.title('Logistic regression weights', fontsize=16)\n", "plt.legend(fontsize=14, loc=(1.05, 0))\n", "plt.xlim([0, num_features])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. L1-regularized logistic regression\n", "\n", "__Question:__ What is the role of the l1-regularized logistic regression?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Instead of a l2-regularized logistic regression with `C=0.01`, now train and evaluate a __l1__-regularized logistic regression with __`C=10.0`__." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cvalue = # TODO\n", "clf_logreg_l1_s = # TODO\n", "clf_logreg_l1_s.fit(Xtr_scaled, ytr)\n", "\n", "# index of the positive class\n", "pos_idx = list(clf_logreg_l1_s.classes_).index(1)\n", "# predict the probability of belonging to the positive class\n", "ypred_logreg_l1_s = clf_logreg_l1_s.predict_proba(Xte_scaled)[:, pos_idx]\n", "# predict binary labels\n", "ypred_logreg_l1_s_b = np.where(ypred_logreg_l1_s > 0.5, 1, 0)\n", "\n", "print(\"Scaled, l1 regularization (C=%.2e): accuracy = %.3f\" % (cvalue, \n", " metrics.accuracy_score(yte, ypred_logreg_l1_s_b)))\n", "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_l1_s)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question__: How did the performance evolve?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.1 Effect of regularization on the regression coefficients\n", "\n", "__Question:__ Plot the weights that were given to each feature in your data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_features = X_clf.shape[1]\n", "plt.scatter(range(num_features), #TODO, \n", " color='blue', marker='x', label='Logistic regression')\n", "plt.scatter(range(num_features), #TODO, \n", " color='orange', marker='+', label='L1-regularized logistic regression')\n", "\n", "plt.xlabel('Genes', fontsize=16)\n", "plt.ylabel('Weights', fontsize=16)\n", "plt.title('Logistic regression weights', fontsize=16)\n", "plt.legend(fontsize=14, loc=(1.05, 0))\n", "plt.xlim([0, num_features])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ What do you observe? How does this differ from l2-regularization?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ How many weights are different from zero? How many features are _not_ used by the l1-regularized model? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Number of selected features\n", "print(\"The non-regularized logistic regression uses %d features\" % len(np.where(clf_logreg_s.coef_ !=0)[1]))\n", "print(\"The L2-regularized logistic regression uses %d features\" % #TODO\n", " )\n", "print(\"The L1-regularized logistic regression uses %d features\" % #TODO\n", " )\n", "\n", "print(\"Number of features discarded by the L1-regularization: %d\" % #TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.2 Optimization of the regularization parameter\n", "\n", "__Question:__ Fill in the blanks to optimize the value of `C` for l1-regularized logistic regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optimize cvalue\n", "cvalues_list = np.logspace(-2, 3, 20)\n", "print(cvalues_list)\n", "\n", "classifier = # TODO\n", "param_grid = # TODO\n", "clf_logreg_l1_s_opt = # TODO \n", "\n", "# Train the model\n", "# TODO\n", "\n", "# index of the positive class\n", "pos_idx = # TODO\n", "# predict the probability of belonging to the positive class\n", "ypred_logreg_l1_s_opt = # TODO\n", "# predict binary labels\n", "ypred_logreg_l1_s_b = # TODO\n", "\n", "# optimal value of C\n", "cvalue_opt = # TODO\n", "\n", "print(\"Scaled, l1 regularization (C=%.2e): accuracy = %.3f\" % (cvalue_opt, \n", " metrics.accuracy_score(yte, ypred_logreg_l1_s_b)))\n", "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_l1_s_opt)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ROC curve\n", "fpr_logreg_s, tpr_logreg_s, t = # TODO\n", "auc_logreg_s = metrics.auc(fpr_logreg_s, tpr_logreg_s)\n", "plt.plot(fpr_logreg_s, tpr_logreg_s, color='blue', \n", " label='No regularization: AUC = %0.3f' % auc_logreg_s)\n", "\n", "fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, t = # TODO\n", "auc_logreg_l2_s_opt = metrics.auc(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt)\n", "plt.plot(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, color='orange', \n", " label='L2 regularization: AUC = %0.3f' % auc_logreg_l2_s_opt)\n", "\n", "fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt, t = metrics.roc_curve(yte, ypred_logreg_l1_s_opt, pos_label=1)\n", "auc_logreg_l1_s_opt = metrics.auc(fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt)\n", "plt.plot(fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt, color='magenta', \n", " label='L1 regularization: AUC = %0.3f' % auc_logreg_l1_s_opt)\n", "\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curve: Logistic regression', fontsize=16)\n", "plt.legend(fontsize=14)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ How many genes does the l1-regularized approach select? How does this affect the performance, compared to `C=10.0` or the l2-regularized approach?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "print(\"The L1-regularized logistic regression uses %d features\" % # TODO\n", " )\n", "print(\"The optimized L1-regularized logistic regression uses %d features\" % \\\n", " # TODO\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Compare the features selected with `C=0.01` and your optimal `C` value. What do you observe?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_features = X_clf.shape[1]\n", "plt.scatter(range(num_features), #TODO, \n", " color='orange', marker='+', label='L1-regularized logistic regression')\n", "plt.scatter(range(num_features), #TODO, \n", " color='magenta', marker='x', label='L1-regularized logistic regression (optimal C)')\n", "\n", "plt.xlabel('Genes', fontsize=16)\n", "plt.ylabel('Weights', fontsize=16)\n", "plt.title('Logistic regression weights', fontsize=16)\n", "plt.legend(fontsize=14, loc=(1.05, 0))\n", "plt.xlim([0, num_features])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 2 }