{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 07: Nearest neighbors" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In this lab, we will apply nearest neighbors classification to the Endometrium vs. Uterus cancer data. For documentation, see:\n", "* http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification; and\n", "* http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier\n", "\n", "Let us start by setting up our environment, loading the data, and creating our cross-validation folds." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Load the data as in the previous labs. We will use `small_Endometrium_Uterus.csv` for this lab." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load the endometrium vs. uterus tumor data\n", "endometrium_data = pd.read_csv('data/small_Endometrium_Uterus.csv', sep=\",\")\n", "endometrium_data.head(n=5) # adjust n to view more data\n", "\n", "# Create the design matrix and target vector\n", "X_clf = endometrium_data.drop(['ID_REF', 'Tissue'], axis=1).values\n", "y_clf = pd.get_dummies(endometrium_data['Tissue']).values[:,1]\n", "\n", "print(X_clf.shape)\n", "plt.scatter(X_clf[:,0], X_clf[:,1], c=y_clf)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Recall the function we used to create cross-validation folds in the previous labs, and redefine it here.\n", "\n", "*Note*: We shall call this *m*-fold cross-validation, rather than *k*-fold as in our previous labs, to avoid confusion with the *k* of *k*-nearest neighbors: the two parameters are distinct." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import model_selection\n", "\n", "def stratifiedMFolds(y, num_folds):\n", "    kf = model_selection.StratifiedKFold(n_splits=num_folds)\n", "    folds_ = [(tr, te) for (tr, te) in kf.split(np.zeros(y.size), y)]\n", "    return folds_" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now create 10 cross-validation folds on the data." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cv_folds = stratifiedMFolds(y_clf, 10)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Redefine the previously written cross-validation function; this version standardizes the features within each fold." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# let's redefine the cross-validation procedure with standardization\n", "from sklearn import preprocessing\n", "def cross_validate(design_matrix, labels, classifier, cv_folds):\n", "    \"\"\" Perform a cross-validation and return the predictions.\n", "    Use a scaler to scale the features to mean 0, standard deviation 1.\n", "\n", "    Parameters:\n", "    -----------\n", "    design_matrix: (n_samples, n_features) np.array\n", "        Design matrix for the experiment.\n", "    labels: (n_samples, ) np.array\n", "        Vector of labels.\n", "    classifier: sklearn classifier instance; must have the following methods:\n", "        - fit(X, y) to train the classifier on the data X, y\n", "        - predict_proba(X) to apply the trained classifier to the data X and return probability estimates\n", "    cv_folds: list of (train_indices, test_indices) pairs\n", "        Cross-validation folds, as returned by stratifiedMFolds.\n", "\n", "    Return:\n", "    -------\n", "    pred: (n_samples, n_classes) np.array\n", "        Array of predicted class probabilities (rows in the same order as labels).\n", "    \"\"\"\n", "    n_classes = np.unique(labels).size\n", "    pred = np.zeros((labels.shape[0], n_classes))\n", "    for tr, te in cv_folds:\n", "        # Fit the scaler on the training fold only, then apply it to the test fold\n", "        scaler = preprocessing.StandardScaler()\n", "        Xtr = scaler.fit_transform(design_matrix[tr,:])\n", "        ytr = labels[tr]\n", "        Xte = scaler.transform(design_matrix[te,:])\n", "        classifier.fit(Xtr, ytr)\n", "        pred[te, :] = classifier.predict_proba(Xte)\n", "    return pred" ] },
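{ "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate how `cross_validate` is used, here is a minimal sketch (the classifier and the value k=5 are arbitrary choices for demonstration) that cross-validates a 5-nearest-neighbours classifier and computes its AUC:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal usage sketch of cross_validate; k=5 is an arbitrary example value\n", "from sklearn import neighbors\n", "from sklearn import metrics\n", "\n", "demo_clf = neighbors.KNeighborsClassifier(n_neighbors=5)\n", "demo_pred = cross_validate(X_clf, y_clf, demo_clf, cv_folds)\n", "fpr_demo, tpr_demo, _ = metrics.roc_curve(y_clf, demo_pred[:, 1])\n", "print('Cross-validated AUC for k=5: %.3f' % metrics.auc(fpr_demo, tpr_demo))" ] },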
{ "cell_type": "markdown", "metadata": {}, "source": [ "# 1. *k*-Nearest Neighbours Classifier" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "A *k*-nearest neighbours classifier can be initialised as `knn_clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)`.\n", "\n", "Cross-validate 20 *k*-NN classifiers (one for each odd value of *k* from 1 to 39) on the loaded dataset using `cross_validate`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import neighbors\n", "from sklearn import metrics\n", "\n", "aurocs_clf = []\n", "# Create a range of values of k. We will use this throughout the lab.\n", "k_range = range(1, 40, 2)\n", "\n", "for k in k_range:\n", "    clf = neighbors.KNeighborsClassifier(n_neighbors=k)\n", "    y_pred = cross_validate(X_clf, y_clf, clf, cv_folds)\n", "\n", "    fpr, tpr, thresholds = metrics.roc_curve(y_clf, y_pred[:,1])\n", "    aurocs_clf.append(metrics.auc(fpr, tpr))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Plot the AUC as a function of the number of nearest neighbours chosen." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot(k_range, aurocs_clf)\n", "plt.xlabel('Number of nearest neighbours', fontsize=14)\n", "plt.ylabel('Cross-validated AUC', fontsize=14)\n", "plt.title('Nearest neighbours classification - cross-validated AUC.', fontsize=14)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Question.** Find the best value for the parameter `n_neighbors` by finding the one that gives the maximum value of AUC." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# One possible solution: take the k that maximises the cross-validated AUC\n", "best_k = k_range[np.argmax(aurocs_clf)]\n", "print('Best n_neighbors: %d (AUC = %.3f)' % (best_k, max(aurocs_clf)))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let us now use `sklearn.model_selection.GridSearchCV` to do the same. The parameter to tune is the number of nearest neighbours; feed `GridSearchCV` an appropriate parameter grid to find the best value of `n_neighbors`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import model_selection\n", "from sklearn import metrics\n", "\n", "classifier = neighbors.KNeighborsClassifier()\n", "\n", "param_grid = {'n_neighbors': list(k_range)}\n", "clf_knn_opt = model_selection.GridSearchCV(classifier,\n", "                                           param_grid=param_grid,\n", "                                           cv=cv_folds,\n", "                                           scoring='roc_auc')\n", "clf_knn_opt.fit(X_clf, y_clf)" ] },
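{ "cell_type": "markdown", "metadata": {}, "source": [ "As a side note, `GridSearchCV` can tune several parameters at once. For instance, `KNeighborsClassifier` also supports distance-weighted voting through its `weights` parameter; the following sketch (an optional illustration, not part of the main exercise) searches over `n_neighbors` and `weights` jointly:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: joint grid search over the number of neighbours and the voting scheme.\n", "# 'uniform' weights all neighbours equally; 'distance' weights them by inverse distance.\n", "param_grid_ext = {'n_neighbors': list(k_range), 'weights': ['uniform', 'distance']}\n", "clf_knn_ext = model_selection.GridSearchCV(neighbors.KNeighborsClassifier(),\n", "                                           param_grid=param_grid_ext,\n", "                                           cv=cv_folds,\n", "                                           scoring='roc_auc')\n", "clf_knn_ext.fit(X_clf, y_clf)\n", "print(clf_knn_ext.best_params_)" ] },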
{ "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ What is now the optimal parameter?" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The best parameter and score found by the grid search\n", "print(clf_knn_opt.best_params_)\n", "print('Cross-validated AUC: %.3f' % clf_knn_opt.best_score_)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Try choosing different scoring metrics for GridSearchCV, and see how the result changes. You can find scoring metrics [here](http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now compare the performance of the *k*-nearest neighbours classifier with logistic regression (both regularised and non-regularised)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import linear_model\n", "\n", "# L2-regularised logistic regression; tune the inverse regularisation strength C\n", "clf_logreg_l2 = linear_model.LogisticRegression()\n", "logreg_params = {'C': [1e-3, 1e-2, 1e-1, 1., 1e2]}\n", "\n", "clf_logreg_opt = model_selection.GridSearchCV(clf_logreg_l2,\n", "                                              param_grid=logreg_params,\n", "                                              cv=cv_folds,\n", "                                              scoring='roc_auc')\n", "clf_logreg_opt.fit(X_clf, y_clf)\n", "ypred_clf_logreg_opt = cross_validate(X_clf, y_clf,\n", "                                      clf_logreg_opt.best_estimator_,\n", "                                      cv_folds)\n", "fpr_clf_logreg_opt, tpr_clf_logreg_opt, thresh = metrics.roc_curve(y_clf,\n", "                                                                   ypred_clf_logreg_opt[:,1])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# A very large C effectively disables the regularisation\n", "clf_logreg = linear_model.LogisticRegression(C=1e12)\n", "\n", "ypred_clf_logreg = cross_validate(X_clf, y_clf,\n", "                                  clf_logreg,\n", "                                  cv_folds)\n", "fpr_clf_logreg, tpr_clf_logreg, thresh = metrics.roc_curve(y_clf,\n", "                                                           ypred_clf_logreg[:, 1])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ypred_clf_knn_opt = cross_validate(X_clf, y_clf, clf_knn_opt.best_estimator_, cv_folds)\n", "fpr_clf_knn_opt, tpr_clf_knn_opt, thresh = metrics.roc_curve(y_clf, ypred_clf_knn_opt[:, 1])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(6, 6))\n", "\n", "logreg_l2_h, = plt.plot(fpr_clf_logreg_opt, tpr_clf_logreg_opt, 'b-')\n", "logreg_h, = plt.plot(fpr_clf_logreg, tpr_clf_logreg, 'g-')\n", "knn_h, = plt.plot(fpr_clf_knn_opt, tpr_clf_knn_opt, 'r-')\n", "logreg_l2_auc = metrics.auc(fpr_clf_logreg_opt, tpr_clf_logreg_opt)\n", "logreg_auc = metrics.auc(fpr_clf_logreg, tpr_clf_logreg)\n", "knn_auc = metrics.auc(fpr_clf_knn_opt, tpr_clf_knn_opt)\n", "\n", "logreg_legend = 'LogisticRegression. AUC=%.2f' % (logreg_auc)\n", "logreg_l2_legend = 'Regularised LogisticRegression. AUC=%.2f' % (logreg_l2_auc)\n", "knn_legend = 'KNeighborsClassifier. AUC=%.2f' % (knn_auc)\n", "plt.legend([logreg_h, logreg_l2_h, knn_h],\n", "           [logreg_legend, logreg_l2_legend, knn_legend], fontsize=12)\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curve comparison: logistic regression vs. k-nearest neighbours.')\n", "plt.show()" ] },
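{ "cell_type": "markdown", "metadata": {}, "source": [ "Incidentally, `sklearn.metrics.roc_auc_score` computes the AUC directly from labels and scores, a convenient shortcut for the `roc_curve` + `auc` combination used above:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# One-call AUC computation; this should match knn_auc from the previous cell\n", "print('k-NN AUC via roc_auc_score: %.3f' % metrics.roc_auc_score(y_clf, ypred_clf_knn_opt[:, 1]))" ] },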
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Setting the distance metric\n", "__Question:__ *k*-nearest neighbours classifiers measure distances between points to determine similarity; by default, `KNeighborsClassifier` uses the Euclidean distance. Other distance metrics can sometimes prove helpful. Change the distance metric by passing it as the `metric` argument when constructing the classifier." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "classifiers = {}\n", "y_preds = {}\n", "# Fix a set of distance metrics to use (these three are examples; see the\n", "# scikit-learn documentation for the full list of supported metrics)\n", "d_metrics = ['euclidean', 'manhattan', 'chebyshev']\n", "aurocs = {}\n", "\n", "for m in d_metrics:\n", "    aurocs[m] = []\n", "    for k in k_range:\n", "        classifiers[m] = neighbors.KNeighborsClassifier(n_neighbors=k, metric=m)\n", "        y_preds[m] = cross_validate(X_clf, y_clf, classifiers[m], cv_folds)\n", "\n", "        fpr, tpr, thresholds = metrics.roc_curve(y_clf, y_preds[m][:,1])\n", "        auc = metrics.auc(fpr, tpr)\n", "        aurocs[m].append(auc)\n", "\n", "        print('Metric = %-12s | k = %3d | AUC = %.3f.' % (m, k, aurocs[m][-1]))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let us now plot the cross-validated AUC curves for all the metrics together." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f = plt.figure(figsize=(9, 6))\n", "\n", "for m in d_metrics:\n", "    plt.plot(k_range, aurocs[m])\n", "\n", "# Horizontal reference line: the regularised logistic regression AUC\n", "plt.plot(k_range, [logreg_l2_auc for kval in k_range])\n", "\n", "plt.xlabel('Number of nearest neighbors', fontsize=14)\n", "plt.ylabel('Cross-validated AUC', fontsize=14)\n", "plt.title('Nearest neighbors classification', fontsize=14)\n", "\n", "legends = [m for m in d_metrics]\n", "legends.append('Logistic regression')\n", "plt.legend(legends, fontsize=12)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 2 }