Adding lab 5

Nelle Varoquaux 3 years ago
commit b320e9391b
1 changed file with 603 additions and 0 deletions

+ 603 - 0
05-Regularization.ipynb

@@ -0,0 +1,603 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Lab 05: Regularized linear regression"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The goal of this lab is to explore and understand l1 and l2 regularization of linear models."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "%pylab inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 1. Classification data\n",
+    "\n",
+    "We will use the same data as in Lab 4: the samples are tumors, each described by the expression (= the abundance) of 3,000 genes. The goal is to separate the endometrium tumors from the uterine ones."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# load the endometrium vs. uterus tumor data\n",
+    "endometrium_data = pd.read_csv('data/small_Endometrium_Uterus.csv', sep=\",\")  # load data\n",
+    "endometrium_data.head(n=5)  # adjust n to view more data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create the design matrix and target vector\n",
+    "X_clf = endometrium_data.drop(['ID_REF', 'Tissue'], axis=1).values\n",
+    "y_clf = pd.get_dummies(endometrium_data['Tissue']).values[:,1]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__Question:__ Split the data in a train set containing 70% of the data and a test set containing the remaining 30%. We use [model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split_)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn import model_selection\n",
+    "Xtr, Xte, ytr, yte = model_selection.train_test_split(X_clf, y_clf, \n",
+    "                                                      #test_size=#TODO, \n",
+    "                                                      random_state=27)"
+   ]
+  },
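+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One possible completion (a sketch, not the only answer): pass `test_size=0.3` so that 30% of the samples are held out; `random_state=27` is kept so the split is reproducible."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Possible solution: hold out 30% of the samples as a test set\n",
+    "Xtr, Xte, ytr, yte = model_selection.train_test_split(X_clf, y_clf,\n",
+    "                                                      test_size=0.3,\n",
+    "                                                      random_state=27)\n",
+    "print(Xtr.shape, Xte.shape)"
+   ]
+  },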
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let us also compute a scaled version of the data. The data is scaled on the _train_ set, and the scaling parameters (mean, standard deviation) are applied to the test set. __Question:__ Why?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn import preprocessing\n",
+    "scaler = preprocessing.StandardScaler()\n",
+    "Xtr_scaled = scaler.fit_transform(Xtr)\n",
+    "Xte_scaled = scaler.transform(Xte)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 2. Logistic regression, not regularized \n",
+    "\n",
+    "Let us train a logisitic regression _without regularization_ on our train set, and evaluate it on the test set. This is similar to Lab 4 and will serve as a comparison point."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Logistic regression (no regularization, no scaling)\n",
+    "from sklearn import linear_model\n",
+    "clf_logreg = linear_model.LogisticRegression(C=1e6) # large C = no regularization\n",
+    "\n",
+    "# Train the model\n",
+    "clf_logreg.fit(Xtr, ytr)\n",
+    "\n",
+    "# Predict on the test set\n",
+    "# Predicted probabilities of belonging to the positive class\n",
+    "pos_idx = list(clf_logreg.classes_).index(1)\n",
+    "ypred_logreg = clf_logreg.predict_proba(Xte)[:, pos_idx]\n",
+    "\n",
+    "# Predicted binary labels\n",
+    "ypred_logreg_b = np.where(ypred_logreg > 0.5, 1, 0)\n",
+    "\n",
+    "from sklearn import metrics\n",
+    "print(\"No regularization: accuracy = %.3f\" % metrics.accuracy_score(yte, ypred_logreg_b))\n",
+    "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__Question:__ Repeat the experiment on the scaled data. What do you observe in terms of performance?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Logistic regression (no regularization, scaling)\n",
+    "clf_logreg_s = linear_model.LogisticRegression(C=1e6)\n",
+    "\n",
+    "# Train the model\n",
+    "# TODO\n",
+    "\n",
+    "# Predict on the test set\n",
+    "# Predicted probabilities of belonging to the positive class\n",
+    "pos_idx = list(clf_logreg_s.classes_).index(1)\n",
+    "ypred_logreg_s = # TODO\n",
+    "# Predicted binary labels\n",
+    "ypred_logreg_s_b = # TODO\n",
+    "\n",
+    "print(\"Scaled, no regularization: accuracy = %.3f\" % metrics.accuracy_score(yte, ypred_logreg_s_b))\n",
+    "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_s)))"
+   ]
+  },
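+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A sketch of one way to fill in the blanks above: fit on `Xtr_scaled`, predict probabilities on `Xte_scaled`, and threshold at 0.5 as in the unscaled case."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Possible solution: same pipeline as above, on the scaled data\n",
+    "clf_logreg_s.fit(Xtr_scaled, ytr)\n",
+    "pos_idx = list(clf_logreg_s.classes_).index(1)\n",
+    "ypred_logreg_s = clf_logreg_s.predict_proba(Xte_scaled)[:, pos_idx]\n",
+    "ypred_logreg_s_b = np.where(ypred_logreg_s > 0.5, 1, 0)\n",
+    "\n",
+    "print(\"Scaled, no regularization: accuracy = %.3f\" % metrics.accuracy_score(yte, ypred_logreg_s_b))\n",
+    "print(\"AUC = %.3f\" % metrics.roc_auc_score(yte, ypred_logreg_s))"
+   ]
+  },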
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 3. L2-regularized logistic regression \n",
+    "\n",
+    "__Question:__ What is the role of L2 regularization?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let us use an l2-regularized logistic regression with the parameter `C` set to 0.01. \n",
+    "\n",
+    "__Question:__ What is the role of `C`? How does it relate to the `lambda` regularization parameter we have seen in class?\n",
+    "\n",
+    "__Question:__ Train the l2-regularized logistic regression initialized below on the scaled training data, and evaluate it on the sclaed test set (as above). How does the performance evolve?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cvalue = 0.01\n",
+    "clf_logreg_l2_s = linear_model.LogisticRegression(C=cvalue, penalty='l2')\n",
+    "\n",
+    "# Train the model\n",
+    "# TODO\n",
+    "\n",
+    "# index of positive class\n",
+    "pos_idx = list(clf_logreg_l2_s.classes_).index(1)\n",
+    "# predict probability of being positive\n",
+    "ypred_logreg_l2_s = # TODO\n",
+    "# predict binary labels\n",
+    "ypred_logreg_l2_s_b = np.where(ypred_logreg_l2_s > 0.5, 1, 0)\n",
+    "\n",
+    "print(\"Scaled, l2 regularization (C=%.2e): accuracy = %.3f\" % (cvalue, \n",
+    "                                                               metrics.accuracy_score(yte, ypred_logreg_l2_s_b)))\n",
+    "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_l2_s)))"
+   ]
+  },
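+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A possible completion of the two blanks (a sketch): fit `clf_logreg_l2_s` on the scaled training data and take the predicted probability of the positive class on the scaled test set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Possible solution: fit the l2-regularized model on the scaled data\n",
+    "clf_logreg_l2_s.fit(Xtr_scaled, ytr)\n",
+    "pos_idx = list(clf_logreg_l2_s.classes_).index(1)\n",
+    "ypred_logreg_l2_s = clf_logreg_l2_s.predict_proba(Xte_scaled)[:, pos_idx]"
+   ]
+  },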
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 3.1 Effect of L2-regularization on the logistic regression coefficients\n",
+    "\n",
+    "We will now look at how the regression coefficients have evolved between the non-regularized and the regularized versions of the logistic regression.\n",
+    "\n",
+    "__Question:__ Fill in the blanks below to plot the regression coefficients of both the trained `clf_logreg_l2_s` and `clf_logreg_s` models. Use the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to figure out how to access these coefficients. What do you observe?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Effect of l2-regularization on the weights\n",
+    "num_features = X_clf.shape[1]\n",
+    "plt.scatter(range(num_features), # TODO, \n",
+    "            color='blue', marker='x', label='Logistic regression')\n",
+    "plt.scatter(range(num_features), #TODO, \n",
+    "            color='orange', marker='+', label='L2-regularized logistic regression')\n",
+    "\n",
+    "plt.xlabel('Genes', fontsize=16)\n",
+    "plt.ylabel('Weights', fontsize=16)\n",
+    "plt.title('Logistic regression weights', fontsize=16)\n",
+    "plt.legend(fontsize=14, loc=(1.05, 0))\n",
+    "plt.xlim([0, num_features])"
+   ]
+  },
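+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A hint for the blanks (a sketch): for a binary problem, `coef_` has shape `(1, n_features)`, so `coef_[0]` gives one weight per gene."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Possible solution: the fitted weight vectors of both models\n",
+    "num_features = X_clf.shape[1]\n",
+    "plt.scatter(range(num_features), clf_logreg_s.coef_[0],\n",
+    "            color='blue', marker='x', label='Logistic regression')\n",
+    "plt.scatter(range(num_features), clf_logreg_l2_s.coef_[0],\n",
+    "            color='orange', marker='+', label='L2-regularized logistic regression')\n",
+    "plt.xlabel('Genes', fontsize=16)\n",
+    "plt.ylabel('Weights', fontsize=16)\n",
+    "plt.legend(fontsize=14, loc=(1.05, 0))\n",
+    "plt.xlim([0, num_features])"
+   ]
+  },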
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3.2 Optimization of the regularization parameter\n",
+    "\n",
+    "We will now use a 3-fold cross-validation on the training set to optimize the value of C. Scikit-learn makes it really easy to use a cross-validation to choose a good value for $\\alpha$ among a grid of several choices. Check the [GridSearchCV class](http://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html#sklearn.grid_search.GridSearchCV)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a range of values to test for the parameter C\n",
+    "cvalues_list = np.logspace(-5, 1, 20)\n",
+    "print(cvalues_list)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__Question:__ Fill in the blanks below to find the optimal value of the parameter C.\n",
+    "\n",
+    "Use the `.best_estimator_` attribute of a `GridSearchCV`. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optimize cvalue\n",
+    "classifier = linear_model.LogisticRegression(penalty='l2')\n",
+    "param_grid = {'C': #TODO\n",
+    "             }\n",
+    "clf_logreg_l2_s_opt = model_selection.GridSearchCV(#TODO, \n",
+    "                                                   #TODO, \n",
+    "                                                   cv=3)     \n",
+    "\n",
+    "# Train the model\n",
+    "# TODO\n",
+    "\n",
+    "# index of the positive class\n",
+    "pos_idx = # TODO\n",
+    "# predict probability of being positive\n",
+    "ypred_logreg_l2_s_opt = # TODO\n",
+    "# predict binary label\n",
+    "ypred_logreg_l2_s_opt_b = # TODO\n",
+    "\n",
+    "# optimal value of C\n",
+    "cvalue_opt = # TODO\n",
+    "print(\"Scaled, l2 regularization (C=%.2e): accuracy = %.3f\" % (cvalue_opt, \n",
+    "                                                               metrics.accuracy_score(yte, ypred_logreg_l2_s_opt_b)))\n",
+    "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_l2_s_opt)))"
+   ]
+  },
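+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One way to complete the grid search (a sketch): `GridSearchCV` refits the best model on the whole training set by default, so `best_estimator_` can be used directly for prediction, and the selected `C` is available in `best_params_`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Possible solution: 3-fold cross-validated grid search over C\n",
+    "classifier = linear_model.LogisticRegression(penalty='l2')\n",
+    "param_grid = {'C': cvalues_list}\n",
+    "clf_logreg_l2_s_opt = model_selection.GridSearchCV(classifier, param_grid, cv=3)\n",
+    "clf_logreg_l2_s_opt.fit(Xtr_scaled, ytr)\n",
+    "\n",
+    "best_model = clf_logreg_l2_s_opt.best_estimator_\n",
+    "pos_idx = list(best_model.classes_).index(1)\n",
+    "ypred_logreg_l2_s_opt = best_model.predict_proba(Xte_scaled)[:, pos_idx]\n",
+    "ypred_logreg_l2_s_opt_b = np.where(ypred_logreg_l2_s_opt > 0.5, 1, 0)\n",
+    "\n",
+    "cvalue_opt = clf_logreg_l2_s_opt.best_params_['C']\n",
+    "print(\"Best C = %.2e\" % cvalue_opt)"
+   ]
+  },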
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__Question:__ Fill in the code below to compare the ROC curves of the non-regularized and l2-regularized logistic regressions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fpr_logreg_s, tpr_logreg_s, t = # TODO\n",
+    "auc_logreg_s = metrics.auc(fpr_logreg_s, tpr_logreg_s)\n",
+    "plt.plot(fpr_logreg_s, tpr_logreg_s, color='blue', \n",
+    "         label='No regularization: AUC = %0.3f' % auc_logreg_s)\n",
+    "\n",
+    "fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, t = # TODO\n",
+    "auc_logreg_l2_s_opt = metrics.auc(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt)\n",
+    "plt.plot(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, color='orange', \n",
+    "         label='L2 regularization: AUC = %0.3f' % auc_logreg_l2_s_opt)\n",
+    "\n",
+    "plt.xlabel('False Positive Rate', fontsize=16)\n",
+    "plt.ylabel('True Positive Rate', fontsize=16)\n",
+    "plt.title('ROC curve: Logistic regression', fontsize=16)\n",
+    "plt.legend(fontsize=14)"
+   ]
+  },
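+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The two blanks are calls to `metrics.roc_curve` on the test labels and the predicted probabilities, for instance:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Possible solution for the two blanks; the plotting code above stays as is\n",
+    "fpr_logreg_s, tpr_logreg_s, t = metrics.roc_curve(yte, ypred_logreg_s, pos_label=1)\n",
+    "fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, t = metrics.roc_curve(yte, ypred_logreg_l2_s_opt, pos_label=1)"
+   ]
+  },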
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__Question:__ Is the optimal C larger or smaller than the one we tried before? Does this mean more or less regularization? Do you expect larger or smaller regularization coefficients?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__Question:__ Fill in the blanks to compare the regularization weights of the different methods."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Effect of l2-regularization on the weights\n",
+    "num_features = X_clf.shape[1]\n",
+    "plt.scatter(range(num_features), # TODO, \n",
+    "            color='blue', marker='x', label='Logistic regression')\n",
+    "plt.scatter(range(num_features), # TODO, \n",
+    "            color='orange', marker='+', label='L2-regularized logistic regression')\n",
+    "plt.scatter(range(num_features), # TODO, \n",
+    "            color='magenta', marker='.', label='L2-regularized logistic regression (opt)')\n",
+    "\n",
+    "plt.xlabel('Genes', fontsize=16)\n",
+    "plt.ylabel('Weights', fontsize=16)\n",
+    "plt.title('Logistic regression weights', fontsize=16)\n",
+    "plt.legend(fontsize=14, loc=(1.05, 0))\n",
+    "plt.xlim([0, num_features])"
+   ]
+  },
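+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Only the third series is new compared to the plot of Section 3.1: the weights of the cross-validated model are reached through the refit estimator (a sketch):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Possible solution for the third series (the first two are as in Section 3.1)\n",
+    "num_features = X_clf.shape[1]\n",
+    "plt.scatter(range(num_features), clf_logreg_l2_s_opt.best_estimator_.coef_[0],\n",
+    "            color='magenta', marker='.', label='L2-regularized logistic regression (opt)')"
+   ]
+  },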
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 4. L1-regularized logistic regression\n",
+    "\n",
+    "__Question:__ What is the role of the l1-regularized logistic regression?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__Question:__ Instead of a l2-regularized logistic regression with `C=0.01`, now train and evaluate a __l1__-regularized logistic regression with __`C=10.0`__."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cvalue = # TODO\n",
+    "clf_logreg_l1_s = # TODO\n",
+    "clf_logreg_l1_s.fit(Xtr_scaled, ytr)\n",
+    "\n",
+    "# index of the positive class\n",
+    "pos_idx = list(clf_logreg_l1_s.classes_).index(1)\n",
+    "# predict the probability of belonging to the positive class\n",
+    "ypred_logreg_l1_s = clf_logreg_l1_s.predict_proba(Xte_scaled)[:, pos_idx]\n",
+    "# predict binary labels\n",
+    "ypred_logreg_l1_s_b = np.where(ypred_logreg_l1_s > 0.5, 1, 0)\n",
+    "\n",
+    "print(\"Scaled, l1 regularization (C=%.2e): accuracy = %.3f\" % (cvalue, \n",
+    "                                                               metrics.accuracy_score(yte, ypred_logreg_l1_s_b)))\n",
+    "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_l1_s)))"
+   ]
+  },
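+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A sketch of the two blanks. Note that recent versions of scikit-learn require a solver compatible with the l1 penalty, such as `liblinear` or `saga` (this depends on your installed version; older versions accept `penalty='l1'` with the default solver)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Possible solution: l1-regularized logistic regression with C=10.0\n",
+    "cvalue = 10.0\n",
+    "# solver='liblinear' supports the l1 penalty on recent scikit-learn versions\n",
+    "clf_logreg_l1_s = linear_model.LogisticRegression(C=cvalue, penalty='l1', solver='liblinear')"
+   ]
+  },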
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__Question__: How did the performance evolve?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4.1 Effect of regularization on the regression coefficients\n",
+    "\n",
+    "__Question:__ Plot the weights that were given to each feature in your data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "num_features = X_clf.shape[1]\n",
+    "plt.scatter(range(num_features), #TODO, \n",
+    "            color='blue', marker='x', label='Logistic regression')\n",
+    "plt.scatter(range(num_features), #TODO, \n",
+    "            color='orange', marker='+', label='L1-regularized logistic regression')\n",
+    "\n",
+    "plt.xlabel('Genes', fontsize=16)\n",
+    "plt.ylabel('Weights', fontsize=16)\n",
+    "plt.title('Logistic regression weights', fontsize=16)\n",
+    "plt.legend(fontsize=14, loc=(1.05, 0))\n",
+    "plt.xlim([0, num_features])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__Question:__ What do you observe? How does this differ from l2-regularization?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__Question:__ How many weights are different from zero? How many features are _not_ used by the l1-regularized model? "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Number of selected features\n",
+    "print(\"The non-regularized logistic regression uses %d features\" % len(np.where(clf_logreg_s.coef_ !=0)[1]))\n",
+    "print(\"The L2-regularized logistic regression uses %d features\" % #TODO\n",
+    "     )\n",
+    "print(\"The L1-regularized logistic regression uses %d features\" % #TODO\n",
+    "     )\n",
+    "\n",
+    "print(\"Number of features discarded by the L1-regularization: %d\" % #TODO"
+   ]
+  },
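+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One way to count the features (a sketch): since `coef_ != 0` is a boolean array, `np.sum` counts the non-zero weights directly."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Possible solution: count the non-zero weights of each model\n",
+    "n_l2 = np.sum(clf_logreg_l2_s.coef_ != 0)\n",
+    "n_l1 = np.sum(clf_logreg_l1_s.coef_ != 0)\n",
+    "num_features = X_clf.shape[1]\n",
+    "print(\"The L2-regularized logistic regression uses %d features\" % n_l2)\n",
+    "print(\"The L1-regularized logistic regression uses %d features\" % n_l1)\n",
+    "print(\"Number of features discarded by the L1-regularization: %d\" % (num_features - n_l1))"
+   ]
+  },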
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4.2 Optimization of the regularization parameter\n",
+    "\n",
+    "__Question:__ Fill in the blanks to optimize the value of `C` for l1-regularized logistic regression."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optimize cvalue\n",
+    "cvalues_list = np.logspace(-2, 3, 20)\n",
+    "print(cvalues_list)\n",
+    "\n",
+    "classifier = # TODO\n",
+    "param_grid = # TODO\n",
+    "clf_logreg_l1_s_opt = # TODO   \n",
+    "\n",
+    "# Train the model\n",
+    "# TODO\n",
+    "\n",
+    "# index of the positive class\n",
+    "pos_idx = # TODO\n",
+    "# predict the probability of belonging to the positive class\n",
+    "ypred_logreg_l1_s_opt = # TODO\n",
+    "# predict binary labels\n",
+    "ypred_logreg_l1_s_b = # TODO\n",
+    "\n",
+    "# optimal value of C\n",
+    "cvalue_opt = # TODO\n",
+    "\n",
+    "print(\"Scaled, l1 regularization (C=%.2e): accuracy = %.3f\" % (cvalue_opt, \n",
+    "                                                               metrics.accuracy_score(yte, ypred_logreg_l1_s_b)))\n",
+    "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_l1_s_opt)))"
+   ]
+  },
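+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A sketch of the l1 grid search; it mirrors the l2 one of Section 3.2, only the estimator changes (the `solver='liblinear'` argument is again an assumption about recent scikit-learn versions)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Possible solution: grid search over C for the l1 penalty\n",
+    "classifier = linear_model.LogisticRegression(penalty='l1', solver='liblinear')\n",
+    "param_grid = {'C': cvalues_list}\n",
+    "clf_logreg_l1_s_opt = model_selection.GridSearchCV(classifier, param_grid, cv=3)\n",
+    "clf_logreg_l1_s_opt.fit(Xtr_scaled, ytr)\n",
+    "\n",
+    "# prediction then mirrors the l2 case, using .best_estimator_\n",
+    "cvalue_opt = clf_logreg_l1_s_opt.best_params_['C']"
+   ]
+  },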
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# ROC curve\n",
+    "fpr_logreg_s, tpr_logreg_s, t = # TODO\n",
+    "auc_logreg_s = metrics.auc(fpr_logreg_s, tpr_logreg_s)\n",
+    "plt.plot(fpr_logreg_s, tpr_logreg_s, color='blue', \n",
+    "         label='No regularization: AUC = %0.3f' % auc_logreg_s)\n",
+    "\n",
+    "fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, t = # TODO\n",
+    "auc_logreg_l2_s_opt = metrics.auc(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt)\n",
+    "plt.plot(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, color='orange', \n",
+    "         label='L2 regularization: AUC = %0.3f' % auc_logreg_l2_s_opt)\n",
+    "\n",
+    "fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt, t = metrics.roc_curve(yte, ypred_logreg_l1_s_opt, pos_label=1)\n",
+    "auc_logreg_l1_s_opt = metrics.auc(fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt)\n",
+    "plt.plot(fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt, color='magenta', \n",
+    "         label='L1 regularization: AUC = %0.3f' % auc_logreg_l1_s_opt)\n",
+    "\n",
+    "plt.xlabel('False Positive Rate', fontsize=16)\n",
+    "plt.ylabel('True Positive Rate', fontsize=16)\n",
+    "plt.title('ROC curve: Logistic regression', fontsize=16)\n",
+    "plt.legend(fontsize=14)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__Question:__ How many genes does the l1-regularized approach select? How does this affect the performance, compared to `C=10.0` or the l2-regularized approach?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "print(\"The L1-regularized logistic regression uses %d features\" % # TODO\n",
+    "     )\n",
+    "print(\"The optimized L1-regularized logistic regression uses %d features\" % \\\n",
+    "      # TODO\n",
+    "     )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__Question:__ Compare the features selected with `C=0.01` and your optimal `C` value. What do you observe?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "num_features = X_clf.shape[1]\n",
+    "plt.scatter(range(num_features), #TODO, \n",
+    "            color='orange', marker='+', label='L1-regularized logistic regression')\n",
+    "plt.scatter(range(num_features), #TODO, \n",
+    "            color='magenta', marker='x', label='L1-regularized logistic regression (optimal C)')\n",
+    "\n",
+    "plt.xlabel('Genes', fontsize=16)\n",
+    "plt.ylabel('Weights', fontsize=16)\n",
+    "plt.title('Logistic regression weights', fontsize=16)\n",
+    "plt.legend(fontsize=14, loc=(1.05, 0))\n",
+    "plt.xlim([0, num_features])"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}