|
@@ -0,0 +1,603 @@
|
|
|
+{
|
|
|
+ "cells": [
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "# Lab 05: Regularized linear regression"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "The goal of this lab is to explore and understand l1 and l2 regularization of linear models."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "import pandas as pd\n",
|
|
|
+ "%pylab inline"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "# 1. Classification data\n",
|
|
|
+ "\n",
|
|
|
+ "We will use the same data as in Lab 4: the samples are tumors, each described by the expression (= the abundance) of 3,000 genes. The goal is to separate the endometrium tumors from the uterine ones."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# load the endometrium vs. uterus tumor data\n",
|
|
|
+ "endometrium_data = pd.read_csv('data/small_Endometrium_Uterus.csv', sep=\",\") # load data\n",
|
|
|
+ "endometrium_data.head(n=5) # adjust n to view more data"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# Create the design matrix and target vector\n",
|
|
|
+ "X_clf = endometrium_data.drop(['ID_REF', 'Tissue'], axis=1).values\n",
|
|
|
+ "y_clf = pd.get_dummies(endometrium_data['Tissue']).values[:,1]"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "__Question:__ Split the data in a train set containing 70% of the data and a test set containing the remaining 30%. We use [model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split_)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "from sklearn import model_selection\n",
|
|
|
+ "Xtr, Xte, ytr, yte = model_selection.train_test_split(X_clf, y_clf, \n",
|
|
|
+ " #test_size=#TODO, \n",
|
|
|
+ " random_state=27)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Let us also compute a scaled version of the data. The data is scaled on the _train_ set, and the scaling parameters (mean, standard deviation) are applied to the test set. __Question:__ Why?"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "from sklearn import preprocessing\n",
|
|
|
+ "scaler = preprocessing.StandardScaler()\n",
|
|
|
+ "Xtr_scaled = scaler.fit_transform(Xtr)\n",
|
|
|
+ "Xte_scaled = scaler.transform(Xte)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "# 2. Logistic regression, not regularized \n",
|
|
|
+ "\n",
|
|
|
+ "Let us train a logisitic regression _without regularization_ on our train set, and evaluate it on the test set. This is similar to Lab 4 and will serve as a comparison point."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# Logistic regression (no regularization, no scaling)\n",
|
|
|
+ "from sklearn import linear_model\n",
|
|
|
+ "clf_logreg = linear_model.LogisticRegression(C=1e6) # large C = no regularization\n",
|
|
|
+ "\n",
|
|
|
+ "# Train the model\n",
|
|
|
+ "clf_logreg.fit(Xtr, ytr)\n",
|
|
|
+ "\n",
|
|
|
+ "# Predict on the test set\n",
|
|
|
+ "# Predicted probabilities of belonging to the positive class\n",
|
|
|
+ "pos_idx = list(clf_logreg.classes_).index(1)\n",
|
|
|
+ "ypred_logreg = clf_logreg.predict_proba(Xte)[:, pos_idx]\n",
|
|
|
+ "\n",
|
|
|
+ "# Predicted binary labels\n",
|
|
|
+ "ypred_logreg_b = np.where(ypred_logreg > 0.5, 1, 0)\n",
|
|
|
+ "\n",
|
|
|
+ "from sklearn import metrics\n",
|
|
|
+ "print(\"No regularization: accuracy = %.3f\" % metrics.accuracy_score(yte, ypred_logreg_b))\n",
|
|
|
+ "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg)))"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "__Question:__ Repeat the experiment on the scaled data. What do you observe in terms of performance?"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# Logistic regression (no regularization, scaling)\n",
|
|
|
+ "clf_logreg_s = linear_model.LogisticRegression(C=1e6)\n",
|
|
|
+ "\n",
|
|
|
+ "# Train the model\n",
|
|
|
+ "# TODO\n",
|
|
|
+ "\n",
|
|
|
+ "# Predict on the test set\n",
|
|
|
+ "# Predicted probabilities of belonging to the positive class\n",
|
|
|
+ "pos_idx = list(clf_logreg_s.classes_).index(1)\n",
|
|
|
+ "ypred_logreg_s = # TODO\n",
|
|
|
+ "# Predicted binary labels\n",
|
|
|
+ "ypred_logreg_s_b = # TODO\n",
|
|
|
+ "\n",
|
|
|
+ "print(\"Scaled, no regularization: accuracy = %.3f\" % metrics.accuracy_score(yte, ypred_logreg_s_b))\n",
|
|
|
+ "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_s)))"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "# 3. L2-regularized logistic regression \n",
|
|
|
+ "\n",
|
|
|
+ "__Question:__ What is the role of L2 regularization?"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Let us use an l2-regularized logistic regression with the parameter `C` set to 0.01. \n",
|
|
|
+ "\n",
|
|
|
+ "__Question:__ What is the role of `C`? How does it relate to the `lambda` regularization parameter we have seen in class?\n",
|
|
|
+ "\n",
|
|
|
+ "__Question:__ Train the l2-regularized logistic regression initialized below on the scaled training data, and evaluate it on the sclaed test set (as above). How does the performance evolve?"
|
|
|
+ ]
|
|
|
+ },
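+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For reference (and up to notational details), the objective minimized by scikit-learn's l2-penalized logistic regression, as described in its documentation, is\n",
+ "\n",
+ "$$\\min_{w, b} \\; \\frac{1}{2} \\|w\\|_2^2 + C \\sum_{i=1}^{n} \\log\\left(1 + \\exp\\left(-y_i (x_i^\\top w + b)\\right)\\right),$$\n",
+ "\n",
+ "where $y_i \\in \\{-1, +1\\}$. Note that `C` multiplies the data-fitting term rather than the penalty; compare this with the $\\lambda$-penalized formulation seen in class."
+ ]
+ },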
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "cvalue = 0.01\n",
|
|
|
+ "clf_logreg_l2_s = linear_model.LogisticRegression(C=cvalue, penalty='l2')\n",
|
|
|
+ "\n",
|
|
|
+ "# Train the model\n",
|
|
|
+ "# TODO\n",
|
|
|
+ "\n",
|
|
|
+ "# index of positive class\n",
|
|
|
+ "pos_idx = list(clf_logreg_l2_s.classes_).index(1)\n",
|
|
|
+ "# predict probability of being positive\n",
|
|
|
+ "ypred_logreg_l2_s = # TODO\n",
|
|
|
+ "# predict binary labels\n",
|
|
|
+ "ypred_logreg_l2_s_b = np.where(ypred_logreg_l2_s > 0.5, 1, 0)\n",
|
|
|
+ "\n",
|
|
|
+ "print(\"Scaled, l2 regularization (C=%.2e): accuracy = %.3f\" % (cvalue, \n",
|
|
|
+ " metrics.accuracy_score(yte, ypred_logreg_l2_s_b)))\n",
|
|
|
+ "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_l2_s)))"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "### 3.1 Effect of L2-regularization on the logistic regression coefficients\n",
|
|
|
+ "\n",
|
|
|
+ "We will now look at how the regression coefficients have evolved between the non-regularized and the regularized versions of the logistic regression.\n",
|
|
|
+ "\n",
|
|
|
+ "__Question:__ Fill in the blanks below to plot the regression coefficients of both the trained `clf_logreg_l2_s` and `clf_logreg_s` models. Use the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to figure out how to access these coefficients. What do you observe?"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# Effect of l2-regularization on the weights\n",
|
|
|
+ "num_features = X_clf.shape[1]\n",
|
|
|
+ "plt.scatter(range(num_features), # TODO, \n",
|
|
|
+ " color='blue', marker='x', label='Logistic regression')\n",
|
|
|
+ "plt.scatter(range(num_features), #TODO, \n",
|
|
|
+ " color='orange', marker='+', label='L2-regularized logistic regression')\n",
|
|
|
+ "\n",
|
|
|
+ "plt.xlabel('Genes', fontsize=16)\n",
|
|
|
+ "plt.ylabel('Weights', fontsize=16)\n",
|
|
|
+ "plt.title('Logistic regression weights', fontsize=16)\n",
|
|
|
+ "plt.legend(fontsize=14, loc=(1.05, 0))\n",
|
|
|
+ "plt.xlim([0, num_features])"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## 3.2 Optimization of the regularization parameter\n",
|
|
|
+ "\n",
|
|
|
+ "We will now use a 3-fold cross-validation on the training set to optimize the value of C. Scikit-learn makes it really easy to use a cross-validation to choose a good value for $\\alpha$ among a grid of several choices. Check the [GridSearchCV class](http://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html#sklearn.grid_search.GridSearchCV)."
|
|
|
+ ]
|
|
|
+ },
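+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Below is a minimal, generic usage sketch of `GridSearchCV`, assuming the scaled training data `Xtr_scaled`, `ytr` defined above; the grid values and variable names are placeholders for illustration, and the actual blanks are filled in the next cells.\n",
+ "\n",
+ "```python\n",
+ "from sklearn import linear_model, model_selection\n",
+ "\n",
+ "# a grid of candidate values for the hyperparameter C (placeholder values)\n",
+ "param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}\n",
+ "\n",
+ "# wrap the estimator in a grid search with 3-fold cross-validation\n",
+ "grid = model_selection.GridSearchCV(linear_model.LogisticRegression(penalty='l2'),\n",
+ "                                    param_grid, cv=3)\n",
+ "\n",
+ "# fitting runs the cross-validation over the grid, then refits on the whole training set\n",
+ "grid.fit(Xtr_scaled, ytr)\n",
+ "\n",
+ "print(grid.best_params_)            # best value of C found on the grid\n",
+ "best_model = grid.best_estimator_   # estimator refit with that value of C\n",
+ "```"
+ ]
+ },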
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# Create a range of values to test for the parameter C\n",
|
|
|
+ "cvalues_list = np.logspace(-5, 1, 20)\n",
|
|
|
+ "print(cvalues_list)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "__Question:__ Fill in the blanks below to find the optimal value of the parameter C.\n",
|
|
|
+ "\n",
|
|
|
+ "Use the `.best_estimator_` attribute of a `GridSearchCV`. "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# Optimize cvalue\n",
|
|
|
+ "classifier = linear_model.LogisticRegression(penalty='l2')\n",
|
|
|
+ "param_grid = {'C': #TODO\n",
|
|
|
+ " }\n",
|
|
|
+ "clf_logreg_l2_s_opt = model_selection.GridSearchCV(#TODO, \n",
|
|
|
+ " #TODO, \n",
|
|
|
+ " cv=3) \n",
|
|
|
+ "\n",
|
|
|
+ "# Train the model\n",
|
|
|
+ "# TODO\n",
|
|
|
+ "\n",
|
|
|
+ "# index of the positive class\n",
|
|
|
+ "pos_idx = # TODO\n",
|
|
|
+ "# predict probability of being positive\n",
|
|
|
+ "ypred_logreg_l2_s_opt = # TODO\n",
|
|
|
+ "# predict binary label\n",
|
|
|
+ "ypred_logreg_l2_s_opt_b = # TODO\n",
|
|
|
+ "\n",
|
|
|
+ "# optimal value of C\n",
|
|
|
+ "cvalue_opt = # TODO\n",
|
|
|
+ "print(\"Scaled, l2 regularization (C=%.2e): accuracy = %.3f\" % (cvalue_opt, \n",
|
|
|
+ " metrics.accuracy_score(yte, ypred_logreg_l2_s_opt_b)))\n",
|
|
|
+ "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_l2_s_opt)))"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "__Question:__ Fill in the code below to compare the ROC curves of the non-regularized and l2-regularized logistic regressions."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "fpr_logreg_s, tpr_logreg_s, t = # TODO\n",
|
|
|
+ "auc_logreg_s = metrics.auc(fpr_logreg_s, tpr_logreg_s)\n",
|
|
|
+ "plt.plot(fpr_logreg_s, tpr_logreg_s, color='blue', \n",
|
|
|
+ " label='No regularization: AUC = %0.3f' % auc_logreg_s)\n",
|
|
|
+ "\n",
|
|
|
+ "fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, t = # TODO\n",
|
|
|
+ "auc_logreg_l2_s_opt = metrics.auc(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt)\n",
|
|
|
+ "plt.plot(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, color='orange', \n",
|
|
|
+ " label='L2 regularization: AUC = %0.3f' % auc_logreg_l2_s_opt)\n",
|
|
|
+ "\n",
|
|
|
+ "plt.xlabel('False Positive Rate', fontsize=16)\n",
|
|
|
+ "plt.ylabel('True Positive Rate', fontsize=16)\n",
|
|
|
+ "plt.title('ROC curve: Logistic regression', fontsize=16)\n",
|
|
|
+ "plt.legend(fontsize=14)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "__Question:__ Is the optimal C larger or smaller than the one we tried before? Does this mean more or less regularization? Do you expect larger or smaller regularization coefficients?"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "__Question:__ Fill in the blanks to compare the regularization weights of the different methods."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# Effect of l2-regularization on the weights\n",
|
|
|
+ "num_features = X_clf.shape[1]\n",
|
|
|
+ "plt.scatter(range(num_features), # TODO, \n",
|
|
|
+ " color='blue', marker='x', label='Logistic regression')\n",
|
|
|
+ "plt.scatter(range(num_features), # TODO, \n",
|
|
|
+ " color='orange', marker='+', label='L2-regularized logistic regression')\n",
|
|
|
+ "plt.scatter(range(num_features), # TODO, \n",
|
|
|
+ " color='magenta', marker='.', label='L2-regularized logistic regression (opt)')\n",
|
|
|
+ "\n",
|
|
|
+ "plt.xlabel('Genes', fontsize=16)\n",
|
|
|
+ "plt.ylabel('Weights', fontsize=16)\n",
|
|
|
+ "plt.title('Logistic regression weights', fontsize=16)\n",
|
|
|
+ "plt.legend(fontsize=14, loc=(1.05, 0))\n",
|
|
|
+ "plt.xlim([0, num_features])"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "# 4. L1-regularized logistic regression\n",
|
|
|
+ "\n",
|
|
|
+ "__Question:__ What is the role of the l1-regularized logistic regression?"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "__Question:__ Instead of a l2-regularized logistic regression with `C=0.01`, now train and evaluate a __l1__-regularized logistic regression with __`C=10.0`__."
|
|
|
+ ]
|
|
|
+ },
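+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note that, depending on your scikit-learn version, the default solver of `LogisticRegression` may not support the l1 penalty; in that case you need to pick one that does. A minimal sketch (the variable name is a placeholder):\n",
+ "\n",
+ "```python\n",
+ "# 'liblinear' and 'saga' both support penalty='l1'\n",
+ "clf_l1 = linear_model.LogisticRegression(C=10.0, penalty='l1', solver='liblinear')\n",
+ "```"
+ ]
+ },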
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "cvalue = # TODO\n",
|
|
|
+ "clf_logreg_l1_s = # TODO\n",
|
|
|
+ "clf_logreg_l1_s.fit(Xtr_scaled, ytr)\n",
|
|
|
+ "\n",
|
|
|
+ "# index of the positive class\n",
|
|
|
+ "pos_idx = list(clf_logreg_l1_s.classes_).index(1)\n",
|
|
|
+ "# predict the probability of belonging to the positive class\n",
|
|
|
+ "ypred_logreg_l1_s = clf_logreg_l1_s.predict_proba(Xte_scaled)[:, pos_idx]\n",
|
|
|
+ "# predict binary labels\n",
|
|
|
+ "ypred_logreg_l1_s_b = np.where(ypred_logreg_l1_s > 0.5, 1, 0)\n",
|
|
|
+ "\n",
|
|
|
+ "print(\"Scaled, l1 regularization (C=%.2e): accuracy = %.3f\" % (cvalue, \n",
|
|
|
+ " metrics.accuracy_score(yte, ypred_logreg_l1_s_b)))\n",
|
|
|
+ "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_l1_s)))"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "__Question__: How did the performance evolve?"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## 4.1 Effect of regularization on the regression coefficients\n",
|
|
|
+ "\n",
|
|
|
+ "__Question:__ Plot the weights that were given to each feature in your data."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "num_features = X_clf.shape[1]\n",
|
|
|
+ "plt.scatter(range(num_features), #TODO, \n",
|
|
|
+ " color='blue', marker='x', label='Logistic regression')\n",
|
|
|
+ "plt.scatter(range(num_features), #TODO, \n",
|
|
|
+ " color='orange', marker='+', label='L1-regularized logistic regression')\n",
|
|
|
+ "\n",
|
|
|
+ "plt.xlabel('Genes', fontsize=16)\n",
|
|
|
+ "plt.ylabel('Weights', fontsize=16)\n",
|
|
|
+ "plt.title('Logistic regression weights', fontsize=16)\n",
|
|
|
+ "plt.legend(fontsize=14, loc=(1.05, 0))\n",
|
|
|
+ "plt.xlim([0, num_features])"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "__Question:__ What do you observe? How does this differ from l2-regularization?"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "__Question:__ How many weights are different from zero? How many features are _not_ used by the l1-regularized model? "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# Number of selected features\n",
|
|
|
+ "print(\"The non-regularized logistic regression uses %d features\" % len(np.where(clf_logreg_s.coef_ !=0)[1]))\n",
|
|
|
+ "print(\"The L2-regularized logistic regression uses %d features\" % #TODO\n",
|
|
|
+ " )\n",
|
|
|
+ "print(\"The L1-regularized logistic regression uses %d features\" % #TODO\n",
|
|
|
+ " )\n",
|
|
|
+ "\n",
|
|
|
+ "print(\"Number of features discarded by the L1-regularization: %d\" % #TODO"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## 4.2 Optimization of the regularization parameter\n",
|
|
|
+ "\n",
|
|
|
+ "__Question:__ Fill in the blanks to optimize the value of `C` for l1-regularized logistic regression."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# Optimize cvalue\n",
|
|
|
+ "cvalues_list = np.logspace(-2, 3, 20)\n",
|
|
|
+ "print(cvalues_list)\n",
|
|
|
+ "\n",
|
|
|
+ "classifier = # TODO\n",
|
|
|
+ "param_grid = # TODO\n",
|
|
|
+ "clf_logreg_l1_s_opt = # TODO \n",
|
|
|
+ "\n",
|
|
|
+ "# Train the model\n",
|
|
|
+ "# TODO\n",
|
|
|
+ "\n",
|
|
|
+ "# index of the positive class\n",
|
|
|
+ "pos_idx = # TODO\n",
|
|
|
+ "# predict the probability of belonging to the positive class\n",
|
|
|
+ "ypred_logreg_l1_s_opt = # TODO\n",
|
|
|
+ "# predict binary labels\n",
|
|
|
+ "ypred_logreg_l1_s_b = # TODO\n",
|
|
|
+ "\n",
|
|
|
+ "# optimal value of C\n",
|
|
|
+ "cvalue_opt = # TODO\n",
|
|
|
+ "\n",
|
|
|
+ "print(\"Scaled, l1 regularization (C=%.2e): accuracy = %.3f\" % (cvalue_opt, \n",
|
|
|
+ " metrics.accuracy_score(yte, ypred_logreg_l1_s_b)))\n",
|
|
|
+ "print(\"AUC = %.3f\" % (metrics.roc_auc_score(yte, ypred_logreg_l1_s_opt)))"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# ROC curve\n",
|
|
|
+ "fpr_logreg_s, tpr_logreg_s, t = # TODO\n",
|
|
|
+ "auc_logreg_s = metrics.auc(fpr_logreg_s, tpr_logreg_s)\n",
|
|
|
+ "plt.plot(fpr_logreg_s, tpr_logreg_s, color='blue', \n",
|
|
|
+ " label='No regularization: AUC = %0.3f' % auc_logreg_s)\n",
|
|
|
+ "\n",
|
|
|
+ "fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, t = # TODO\n",
|
|
|
+ "auc_logreg_l2_s_opt = metrics.auc(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt)\n",
|
|
|
+ "plt.plot(fpr_logreg_l2_s_opt, tpr_logreg_l2_s_opt, color='orange', \n",
|
|
|
+ " label='L2 regularization: AUC = %0.3f' % auc_logreg_l2_s_opt)\n",
|
|
|
+ "\n",
|
|
|
+ "fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt, t = metrics.roc_curve(yte, ypred_logreg_l1_s_opt, pos_label=1)\n",
|
|
|
+ "auc_logreg_l1_s_opt = metrics.auc(fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt)\n",
|
|
|
+ "plt.plot(fpr_logreg_l1_s_opt, tpr_logreg_l1_s_opt, color='magenta', \n",
|
|
|
+ " label='L1 regularization: AUC = %0.3f' % auc_logreg_l1_s_opt)\n",
|
|
|
+ "\n",
|
|
|
+ "plt.xlabel('False Positive Rate', fontsize=16)\n",
|
|
|
+ "plt.ylabel('True Positive Rate', fontsize=16)\n",
|
|
|
+ "plt.title('ROC curve: Logistic regression', fontsize=16)\n",
|
|
|
+ "plt.legend(fontsize=14)"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "__Question:__ How many genes does the l1-regularized approach select? How does this affect the performance, compared to `C=10.0` or the l2-regularized approach?"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {
|
|
|
+ "scrolled": true
|
|
|
+ },
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "print(\"The L1-regularized logistic regression uses %d features\" % # TODO\n",
|
|
|
+ " )\n",
|
|
|
+ "print(\"The optimized L1-regularized logistic regression uses %d features\" % \\\n",
|
|
|
+ " # TODO\n",
|
|
|
+ " )"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "__Question:__ Compare the features selected with `C=0.01` and your optimal `C` value. What do you observe?"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "num_features = X_clf.shape[1]\n",
|
|
|
+ "plt.scatter(range(num_features), #TODO, \n",
|
|
|
+ " color='orange', marker='+', label='L1-regularized logistic regression')\n",
|
|
|
+ "plt.scatter(range(num_features), #TODO, \n",
|
|
|
+ " color='magenta', marker='x', label='L1-regularized logistic regression (optimal C)')\n",
|
|
|
+ "\n",
|
|
|
+ "plt.xlabel('Genes', fontsize=16)\n",
|
|
|
+ "plt.ylabel('Weights', fontsize=16)\n",
|
|
|
+ "plt.title('Logistic regression weights', fontsize=16)\n",
|
|
|
+ "plt.legend(fontsize=14, loc=(1.05, 0))\n",
|
|
|
+ "plt.xlim([0, num_features])"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "metadata": {
|
|
|
+ "kernelspec": {
|
|
|
+ "display_name": "Python 3",
|
|
|
+ "language": "python",
|
|
|
+ "name": "python3"
|
|
|
+ },
|
|
|
+ "language_info": {
|
|
|
+ "codemirror_mode": {
|
|
|
+ "name": "ipython",
|
|
|
+ "version": 3
|
|
|
+ },
|
|
|
+ "file_extension": ".py",
|
|
|
+ "mimetype": "text/x-python",
|
|
|
+ "name": "python",
|
|
|
+ "nbconvert_exporter": "python",
|
|
|
+ "pygments_lexer": "ipython3",
|
|
|
+ "version": "3.8.3"
|
|
|
+ }
|
|
|
+ },
|
|
|
+ "nbformat": 4,
|
|
|
+ "nbformat_minor": 2
|
|
|
+}
|