{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 6: Trees and forests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this lab is to explore and understand tree-based models on classification problems.\n", "\n", "We will focus successively on decision trees, bagging trees and random forests. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import required libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import required libraries\n", "import time\n", "import math\n", "import pandas as pd\n", "%pylab inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification data\n", "We will use the same data as in Lab 4: the samples are tumors, each described by the expression (= the abundance) of 3,000 genes. The goal is to separate the endometrium tumors from the uterine ones." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load the endometrium vs. uterus tumor data\n", "endometrium_data = pd.read_csv('data/small_Endometrium_Uterus.csv', sep=\",\") # load data\n", "endometrium_data.head(n=5) # adjust n to view more data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the design matrix and target vector\n", "X = endometrium_data.drop(['ID_REF', 'Tissue'], axis=1).values\n", "y = pd.get_dummies(endometrium_data['Tissue']).values[:,1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## make folds\n", "from sklearn import model_selection\n", "skf = model_selection.StratifiedKFold(n_splits=5)\n", "skf.get_n_splits(X, y)\n", "folds = [(tr,te) for (tr,te) in skf.split(X, y)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross-validation procedures" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cross_validate_clf(design_matrix, labels, classifier, cv_folds):\n", " \"\"\" Perform a cross-validation and returns the predictions.\n", " \n", " Parameters:\n", " -----------\n", " design_matrix: (n_samples, n_features) np.array\n", " Design matrix for the experiment.\n", " labels: (n_samples, ) np.array\n", " Vector of labels.\n", " classifier: sklearn classifier object\n", " Classifier instance; must have the following methods:\n", " - fit(X, y) to train the classifier on the data X, y\n", " - predict_proba(X) to apply the trained classifier to the data X and return probability estimates \n", " cv_folds: sklearn cross-validation object\n", " Cross-validation iterator.\n", " \n", " Return:\n", " -------\n", " pred: (n_samples, ) np.array\n", " Vectors of predictions (same order as labels).\n", " \"\"\"\n", " pred = np.zeros(labels.shape)\n", " for tr, te in cv_folds:\n", " classifier.fit(design_matrix[tr,:], labels[tr])\n", " pos_idx = list(classifier.classes_).index(1)\n", " pred[te] = (classifier.predict_proba(design_matrix[te,:]))[:, pos_idx]\n", " return pred" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cross_validate_clf_optimize(design_matrix, labels, classifier, cv_folds):\n", " \"\"\" Perform a cross-validation and returns the predictions.\n", " \n", " Parameters:\n", " -----------\n", " design_matrix: (n_samples, n_features) np.array\n", " Design matrix for the experiment.\n", " labels: (n_samples, ) np.array\n", " Vector of labels.\n", " classifier: sklearn classifier object\n", " Classifier instance; must have the following methods:\n", " - fit(X, y) to train the classifier on the data X, y\n", " - predict_proba(X) to apply the trained classifier to the data X and return probability estimates \n", " cv_folds: sklearn cross-validation object\n", " Cross-validation iterator.\n", " \n", " Return:\n", " -------\n", " pred: (n_samples, ) np.array\n", " Vectors of predictions (same order as labels).\n", " \"\"\"\n", " pred = np.zeros(labels.shape)\n", " for tr, te in cv_folds:\n", " classifier.fit(design_matrix[tr,:], labels[tr])\n", " print(classifier.best_params_)\n", " pos_idx = list(classifier.best_estimator_.classes_).index(1)\n", " pred[te] = (classifier.predict_proba(design_matrix[te,:]))[:, pos_idx]\n", " return pred" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Decision Trees\n", "A decision tree predicts the value of a target variable by learning simple decision rules inferred from the data features.\n", "\n", "In scikit-learn, they are implemented in [tree.DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for classification and [tree.DecisionTreeRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) for regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import tree\n", "from sklearn.tree import DecisionTreeClassifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.1 Toy dataset\n", "In order to better understand how a decision tree processes the feature space, we will first work on a simulated dataset. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(5, 5))\n", "\n", "x1 = np.random.multivariate_normal([2,2], [[0.1,0],[0,0.1]], 50)\n", "x2 = np.random.multivariate_normal([-2,-2], [[0.1,0],[0,0.1]], 50)\n", "x3 = np.random.multivariate_normal([-3,3], [[0.1,0.1],[0,0.1]], 50)\n", "X1 = np.concatenate((x1,x2,x3), axis=0)\n", "\n", "y1 = np.random.multivariate_normal([-2,2], [[0.1,0],[0,0.1]], 50)\n", "y2 = np.random.multivariate_normal([2,-2], [[0.1,0],[0,0.1]], 50)\n", "y3 = np.random.multivariate_normal([-3,-3], [[0.01,0],[0,0.01]], 50)\n", "X2 = np.concatenate((y1,y2,y3), axis=0)\n", "\n", "plt.plot(X1[:,0],X1[:,1], 'x', color='blue', label='class 1')\n", "plt.plot(X2[:,0], X2[:,1], 'x', color='orange', label='class 2')\n", "\n", "\n", "plt.legend(loc=(0.4, 0.8), fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** What do you expect the decision boudaries to look like ? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Answer:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Fill-in the following code to train a decision tree on this toy data and visualize it. \n", "\n", "Change the splitter to random, meaning that the algorithm will consider the feature along which to split _randomly_ (rather than picking the optimal one), and then select the best among several _random_ splitting point. Run the algorithm several times. What do you observer?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Training data\n", "X_demo = np.concatenate((X1, X2), axis=0)\n", "y_demo = np.concatenate((np.zeros(X1.shape[0]), np.ones(X2.shape[0])))\n", "\n", "# Train a DecisionTreeClassifier on the training data\n", "clf = # TODO\n", "\n", "# Create a mesh, i.e. a fine grid of values between the minimum and maximum\n", "# values of x1 and x2 in the training data\n", "plot_step = 0.02\n", "x_min, x_max = X_demo[:, 0].min() - 1, X_demo[:, 0].max() + 1\n", "y_min, y_max = X_demo[:, 1].min() - 1, X_demo[:, 1].max() + 1\n", "xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),\n", " np.arange(y_min, y_max, plot_step))\n", "\n", "# Label each point of the mesh with the trained DecisionTreeClassifier\n", "Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n", "Z = Z.reshape(xx.shape)\n", "\n", "# Plot the contours corresponding to these labels \n", "# (i.e. the decision boundary of the DecisionTreeClassifier)\n", "cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)\n", "\n", "# Plot the training data \n", "plt.plot(X1[:,0], X1[:,1], 'x', label='class 1')\n", "plt.plot(X2[:,0], X2[:,1], 'x', label='class 2')\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2 Tumor classification data\n", "\n", "Let us now go back to our tumor classification problem.\n", "\n", "**Question:** Cross-validate 5 different decision trees (with default parameters) and print out their accuracy. Why do you get different values? Check the documentation for help." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sklearn import tree\n", "from sklearn import metrics\n", "\n", "ypred_dt = [] # will hold the 5 arrays of predictions (1 per tree)\n", "for tree_index in range(5):\n", " # Initialize a DecisionTreeClassifier\n", " clf = DecisionTreeClassifier()\n", " \n", " # Cross-validate this DecisionTreeClassifier on the toy data\n", " pred_proba = cross_validate_clf(X, y, clf, folds)\n", " \n", " # Append the prediction to ypred_dt \n", " ypred_dt.append(pred_proba)\n", " \n", " # Print the accuracy of DecisionTreeClassifier\n", " print(\"%.3f\" % metrics.accuracy_score(y, np.where(pred_proba > 0.5, 1, 0)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Answer:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** Compute the mean and standard deviation of the area under the ROC curve of these 5 trees. Plot the ROC curves of these 5 trees.\n", "\n", "Use the [metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) module of scikit-learn." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fpr_dt = [] # will hold the 5 arrays of false positive rates (1 per tree)\n", "tpr_dt = [] # will hold the 5 arrays of true positive rates (1 per tree)\n", "auc_dt = [] # will hold the 5 areas under the ROC curve (1 per tree)\n", "\n", "for tree_index in range(5):\n", " # Compute the ROC curve of the current tree\n", " fpr_dt_tmp, tpr_dt_tmp, thresholds = metrics.roc_curve(#TODO\n", " \n", " # Compute the area under the ROC curve of the current tree\n", " auc_dt_tmp = metrics.auc(fpr_dt_tmp, tpr_dt_tmp)\n", " fpr_dt.append(fpr_dt_tmp)\n", " tpr_dt.append(tpr_dt_tmp)\n", " auc_dt.append(auc_dt_tmp)\n", "\n", "# Plot the first 4 ROC curves\n", "for tree_index in range(4):\n", " plt.plot(# TODO\n", " \n", "# Plot the last ROC curve, with a label that gives the mean/std AUC\n", "plt.plot(fpr_dt[-1], tpr_dt[-1], '-', \n", " label='DT (AUC = %0.2f +/- %0.2f)' % (np.mean(auc_dt), np.std(auc_dt)))\n", "\n", "# Plot the ROC curve\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curves', fontsize=16)\n", "plt.legend(loc=\"lower right\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** What parameters of DecisionTreeClassifier can you play with to define trees differently than with the default parameters? Cross-validate these using a grid search with [model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Plot the optimal decision tree on the previous plot. Did you manage to improve performance?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import model_selection\n", "\n", "# Define the grid of parameters to test\n", "param_grid = # TODO\n", "\n", "# Initialize a GridSearchCV object that will be used to cross-validate\n", "# a DecisionTreeClassifier with these parameters.\n", "# What scoring function do you want to use?\n", "clf = model_selection.GridSearchCV( # TODO\n", "\n", "# Cross-validate the GridSearchCV object \n", "ypred_dt_opt = cross_validate_clf_optimize(X, y, clf, folds)\n", "\n", "# Compute the ROC curve for the optimized DecisionTreeClassifier\n", "fpr_dt_opt, tpr_dt_opt, thresholds = metrics.roc_curve(y, ypred_dt_opt, pos_label=1)\n", "auc_dt_opt = metrics.auc(fpr_dt_opt, tpr_dt_opt)\n", "\n", "# Plot the ROC curves of the 5 decision trees from earlier\n", "fig = plt.figure(figsize=(5, 5))\n", "\n", "for tree_index in range(4):\n", " plt.plot(fpr_dt[tree_index], tpr_dt[tree_index], '-', color='blue') \n", "plt.plot(fpr_dt[-1], tpr_dt[-1], '-', color='blue', \n", " label='DT (AUC = %0.2f (+/- %0.2f))' % (np.mean(auc_dt), np.std(auc_dt)))\n", "\n", "# Plot the ROC curve of the optimized DecisionTreeClassifier\n", "plt.plot(fpr_dt_opt, tpr_dt_opt, color='orange', label='DT optimized (AUC=%0.2f)' % auc_dt_opt)\n", "\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curves', fontsize=16)\n", "plt.legend(loc=\"lower right\", fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.2 Bagging trees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will resort to ensemble methods to try to improve the performance of single decision trees. Let us start with _bagging trees_: The different trees are to be built using a _bootstrap sample_ of the data, that is to say, a sample built by randomly drawing n points _with replacement_ from the original data, where n is the number of points in the training set.\n", "\n", "Bagging is efficient when used with low bias and high variance weak learners. Indeed, by averaging such estimators, we lower the variance by obtaining a smoother estimator, which is still centered around the true density (low bias). \n", "\n", "Bagging decision trees hence makes sense, as decision trees have:\n", "* low bias: intuitively, the conditions that are checked become multiplicative so the tree is continuously narrowing down on the data (the tree becomes highly tuned to the data present in the training set).\n", "* high variance: decision trees are very sensitive to where it splits and how it splits. Therefore, even small changes in input variable values might result in very different tree structure.\n", "\n", "\n", "**Note**: Bagging trees and random forests start being really powerful when using large number of trees (several hundreds). This is computationally more intensive, especially when the number of features is large, as in this lab. For the sake of computational time, we suggeste using small numbers of trees, but you might want to repeat this lab for larger number of trees at home." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question** Cross-validate a bagging ensemble of 5 decision trees on the data. Plot the resulting ROC curve, compared to the 5 decision trees you trained earlier.\n", "\n", "Use [ensemble.BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import ensemble\n", "\n", "# Initialize a bag of trees\n", "clf = # TODO\n", "\n", "# Cross-validate the bagging trees on the tumor data\n", "ypred_bt = cross_validate_clf(X, y, clf, folds)\n", "\n", "# Compute the ROC curve of the bagging trees\n", "fpr_bt, tpr_bt, thresholds = metrics.roc_curve(y, ypred_bt, pos_label=1)\n", "auc_bt = metrics.auc(fpr_bt, tpr_bt)\n", "\n", "# Plot the ROC curve of the 5 decision trees from earlier\n", "fig = plt.figure(figsize=(5, 5))\n", "\n", "for tree_index in range(4):\n", " plt.plot(fpr_dt[tree_index], tpr_dt[tree_index], '-', color='blue') \n", "plt.plot(fpr_dt[-1], tpr_dt[-1], '-', color='blue', \n", " label='DT (AUC = %0.2f (+/- %0.2f))' % (np.mean(auc_dt), np.std(auc_dt)))\n", "\n", "# Plot the ROC curve of the bagging trees\n", "plt.plot(fpr_bt, tpr_bt, color='orange', label='BT (AUC=%0.2f)' % auc_bt)\n", "\n", "\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curves', fontsize=16)\n", "plt.legend(loc=\"lower right\", fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ How do the bagging trees perform compared to individual trees?\n", " \n", "__Answer:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question** Use cross_validate_optimize to optimize the number of decision trees to use in the bagging method. How many trees did you find to be an optimal choice?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# Number of trees to use\n", "list_n_trees = [5, 10, 20, 50, 80]\n", "\n", "# Start a ROC curve plot\n", "fig = plt.figure(figsize=(5, 5))\n", " \n", "for idx, n_trees in enumerate(list_n_trees):\n", " # Initialize a bag of trees with n_trees trees\n", " clf = # TODO\n", " \n", " # Cross-validate the bagging trees on the tumor data\n", " ypred_bt_tmp = cross_validate_clf(X, y, clf, folds)\n", " \n", " # Compute the ROC curve \n", " fpr_bt_tmp, tpr_bt_tmp, thresholds = metrics.roc_curve(y, ypred_bt_tmp, pos_label=1)\n", " auc_bt_tmp = metrics.auc(fpr_bt_tmp, tpr_bt_tmp)\n", "\n", " # Plot the ROC curve\n", " plt.plot(fpr_bt_tmp, tpr_bt_tmp, '-', \n", " label='BT %0.f trees (AUC = %0.2f)' % (n_trees, auc_bt_opt))\n", "\n", "# Plot the ROC curve of the optimal decision tree\n", "plt.plot(fpr_dt_opt, tpr_dt_opt, label='DT optimized (AUC=%0.2f)' % auc_dt_opt)\n", "\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curves', fontsize=16)\n", "plt.legend(loc=\"lower right\", fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Random Forest" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In practice, simply bagging is typically not enough. In order to get a good reduction in variance, we require that the models being aggregated be uncorrelated, so that they make “different errors”. Bagging will usually get you highly correlated models that will make the same errors, and will therefore not reduce the variance of the combined predictor.\n", "\n", "**Question** What is the difference between bagging trees and random forests? How does it intuitively fix the problem of correlations between trees ? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Answer:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question** Cross-validate a random forest of 5 decision trees on the data. Plot the resulting ROC curve, compared to the bagging tree made of 5 decision trees.\n", "\n", "Use [ensemble.RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize a random forest with 5 trees\n", "clf = # TODO\n", "\n", "# Cross-validate the random forest on the tumor data\n", "ypred_rf = # TODO\n", "\n", "# Compute the ROC curve of the random forest\n", "fpr_rf, tpr_rf, thresholds = # TODO\n", "auc_rf = # TODO\n", "\n", "# Plot the ROC curve of the 5 decision trees from earlier\n", "fig = plt.figure(figsize=(5, 5))\n", "\n", "for tree_index in range(4):\n", " plt.plot(fpr_dt[tree_index], tpr_dt[tree_index], '-', color='grey') \n", "plt.plot(fpr_dt[-1], tpr_dt[-1], '-', color='grey', \n", " label='DT (AUC = %0.2f (+/- %0.2f))' % (np.mean(auc_dt), np.std(auc_dt)))\n", "\n", "# Plot the ROC curve of the bagging trees (5 trees)\n", "plt.plot(fpr_bt, tpr_bt, label='BT (AUC=%0.2f)' % auc_bt)\n", "\n", "# Plot the ROC curve of the random forest (5 trees)\n", "plt.plot(fpr_rf, tpr_rf, label='BT (AUC=%0.2f)' % auc_bt)\n", "\n", "\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curves', fontsize=16)\n", "plt.legend(loc=\"lower right\", fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question** What are the main parameters of Random Forest which can be optimized ?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Answer:__ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question** Use cross_validate_clf_optimize to optimize \n", "* the number of decision trees \n", "* the number of features to consider at each split.\n", "\n", "How many trees do you find to be an optimal choice? How does the optimal random forest compare to the optimal bagging trees? How do the training times of the random forest and the bagging trees compare?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define the grid of parameters to test\n", "param_grid = # TODO\n", "\n", "# Initialize a GridSearchCV object that will be used to cross-validate\n", "# a random forest with these parameters.\n", "# What scoring function do you want to use?\n", "clf = grid_search.GridSearchCV(# TODO\n", "\n", "# Cross-validate the GridSearchCV object \n", "ypred_rf_opt = cross_validate_clf_optimize(X, y, clf, folds)\n", "\n", "# Compute the ROC curve for the optimized random forest\n", "fpr_rf_opt, tpr_rf_opt, thresholds = metrics.roc_curve(y, ypred_rf_opt, pos_label=1)\n", "auc_rf_opt = metrics.auc(fpr_rf_opt, tpr_rf_opt)\n", "\n", "# Plot the ROC curve of the optimized DecisionTreeClassifier\n", "fig = plt.figure(figsize=(5, 5))\n", "\n", "plt.plot(fpr_dt_opt, tpr_dt_opt, color='grey', \n", " label='DT optimized (AUC=%0.2f)' % auc_dt_opt)\n", " \n", "# Plot the ROC curve of the optimized random forest\n", "plt.plot(fpr_bt_opt, tpr_bt_opt, \n", " label='BT optimized (AUC=%0.2f)' % auc_bt_opt)\n", "\n", "# Plot the ROC curve of the optimized bagging trees\n", "plt.plot(fpr_rf_opt, tpr_rf_opt, l\n", " abel='RF optimized (AUC = %0.2f' % (auc_rf_opt))\n", " \n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curves', fontsize=16)\n", "plt.legend(loc=\"lower right\", fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question** How do your tree-based classifiers compare to regularized logistic regression models? \n", "Plot the corresponding ROC curves." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import linear_model\n", "\n", "# Evaluate an optimized l1-regularized logistic regression\n", "param_grid = {'C': np.logspace(-3, 3, 7)}\n", "clf = grid_search.GridSearchCV(linear_model.LogisticRegression(penalty='l1'), \n", " param_grid, scoring='roc_auc')\n", "ypred_l1 = cross_validate_clf_optimize(X, y, clf, folds)\n", "fpr_l1, tpr_l1, thresholds_l1 = metrics.roc_curve(y, ypred_l1, pos_label=1)\n", "auc_l1 = metrics.auc(fpr_l1, tpr_l1)\n", "print('nb features of best sparse model:', len(np.where(clf.best_estimator_.coef_!=0)[0]))\n", "\n", "# Evaluate an optimized l2-regularized logistic regression\n", "clf = grid_search.GridSearchCV(linear_model.LogisticRegression(penalty='l2'), \n", " param_grid, scoring='roc_auc')\n", "ypred_l2 = cross_validate_clf_optimize(X, y, clf, folds)\n", "fpr_l2, tpr_l2, thresholds_l2 = metrics.roc_curve(y, ypred_l2, pos_label=1)\n", "auc_l2 = metrics.auc(fpr_l2, tpr_l2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the ROC curves\n", "fig = plt.figure(figsize=(5, 5))\n", "\n", "plt.plot(fpr_rf_opt, tpr_rf_opt, \n", " label='RF optimized (AUC = %0.2f)' % (auc_rf_opt))\n", "plt.plot(fpr_bt_opt, tpr_bt_opt, \n", " label='BT optimized (AUC = %0.2f)' % (auc_bt_opt))\n", "plt.plot(fpr_l1, tpr_l1, \n", " label='l1 optimized (AUC = %0.2f)' % (auc_l1))\n", "plt.plot(fpr_l2, tpr_l2, \n", " label='l2 optimized (AUC = %0.2f)' % (auc_l2))\n", "\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curves', fontsize=16)\n", "plt.legend(loc=\"lower right\", fontsize=12)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 2 }