# Lab 07: Nearest neighbors

In this lab, we will apply nearest neighbors classification to the Endometrium vs. Uterus cancer data. For documentation, see:
* http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification
* http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier


Let us start by setting up our environment, loading the data, and setting up our cross-validation.

In [None]:
import pandas as pd
%pylab inline

Load the data as in the previous labs. We will use `small_Endometrium_Uterus.csv` for this lab.

In [None]:
from sklearn import preprocessing

# load the endometrium vs. uterus tumor data
endometrium_data = pd.read_csv('data/small_Endometrium_Uterus.csv', sep=",") # load data
endometrium_data.head(n=5) # adjust n to view more data

# Create the design matrix and target vector
X_clf = endometrium_data.drop(['ID_REF', 'Tissue'], axis=1).values
y_clf = pd.get_dummies(endometrium_data['Tissue']).values[:,1]

print(X_clf.shape)
plt.scatter(X_clf[:,0], X_clf[:,1], c=y_clf)

Recall the functions we used to create cross-validation folds in the previous labs, and redefine them here.

*Note*: We call this *m*-fold cross-validation, rather than *k*-fold as in the previous labs, to distinguish the number of folds from the *k* of *k*-nearest neighbors classification. The two parameters are unrelated.

In [None]:
from sklearn import model_selection

def stratifiedMFolds(y, num_folds):
    # Stratified folds keep the class proportions roughly equal in every fold
    kf = model_selection.StratifiedKFold(n_splits=num_folds)
    folds_ = [(tr, te) for (tr, te) in kf.split(np.zeros(y.size), y)]
    return folds_

Now create 10 cross-validation folds on the data.

In [None]:
cv_folds = stratifiedMFolds(y_clf, 10)
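
To see what stratification does, here is a minimal sketch on toy labels. The names `y_toy` and `toy_folds` and the 20/10 class split are assumptions made for illustration; they are not part of the lab data.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical toy labels: 20 negatives and 10 positives (not the lab data)
y_toy = np.array([0] * 20 + [1] * 10)

kf = model_selection.StratifiedKFold(n_splits=5)
toy_folds = list(kf.split(np.zeros(y_toy.size), y_toy))

for tr, te in toy_folds:
    # Print train size, test size, and the class counts in the test fold
    print(len(tr), len(te), np.bincount(y_toy[te]))
```

Each test fold keeps the 2:1 class ratio of the full label vector, which is what we want when the classes are imbalanced.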

Redefine the cross-validation function from the previous labs. It standardizes the features within each fold before fitting the model.

In [None]:
# let's redefine the cross-validation procedure with standardization
from sklearn import preprocessing
def cross_validate(design_matrix, labels, classifier, cv_folds):
    """ Perform a cross-validation and return the predictions.
    Use a scaler to scale the features to mean 0, standard deviation 1.

    Parameters:
    -----------
    design_matrix: (n_samples, n_features) np.array
        Design matrix for the experiment.
    labels: (n_samples, ) np.array
        Vector of labels.
    classifier: Classifier instance; must have the following methods:
        - fit(X, y) to train the classifier on the data X, y
        - predict_proba(X) to apply the trained classifier to the data X and return predicted class probabilities
    cv_folds: list of (train_indices, test_indices) tuples
        Cross-validation folds.

    Return:
    -------
    pred: (n_samples, n_classes) np.array
        Predicted class probabilities (same row order as labels).
    """

    n_classes = np.unique(labels).size
    pred = np.zeros((labels.shape[0], n_classes))
    for tr, te in cv_folds:
        # Fit the scaler on the training fold only, then apply it to the test fold
        scaler = preprocessing.StandardScaler()
        Xtr = scaler.fit_transform(design_matrix[tr,:])
        ytr = labels[tr]
        Xte = scaler.transform(design_matrix[te,:])
        classifier.fit(Xtr, ytr)
        pred[te, :] = classifier.predict_proba(Xte)
    return pred
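
The scaling step inside `cross_validate` can be checked in isolation. The sketch below uses a hypothetical one-feature toy matrix (`X_train_toy`, `X_test_toy` are illustrative names, not lab variables) to show that the scaler learns its mean and standard deviation from the training fold only and reuses them on the test fold.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy folds with a single feature
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

scaler = preprocessing.StandardScaler()
Xtr_scaled = scaler.fit_transform(X_train_toy)  # learns mean and std from the training fold
Xte_scaled = scaler.transform(X_test_toy)       # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(Xtr_scaled.ravel(), Xte_scaled.ravel())
```

The scaled test point is not centered at 0, because it is scaled with the training statistics. This is intended: fitting the scaler on the test fold would leak information from the test data into the model.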

# 1. *k*-Nearest Neighbours Classifier

A k-neighbours classifier can be initialised as `knn_clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)`
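
As a minimal sketch of what such a classifier returns, consider the hypothetical one-dimensional toy data below (`X_toy`, `y_toy` and the query points are assumptions for illustration). With the default uniform weights, `predict_proba` gives, for each class, the fraction of the *k* nearest neighbours that carry that label.

```python
import numpy as np
from sklearn import neighbors

# Hypothetical 1-D toy data: three points per class
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn_toy = neighbors.KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(X_toy, y_toy)

# Query 1.5 has three class-0 neighbours; query 6.8 has neighbours 10, 11 and 2
proba_toy = knn_toy.predict_proba(np.array([[1.5], [6.8]]))
print(proba_toy)
```

Note that the predicted probabilities can only take the values 0, 1/3, 2/3 and 1 when *k* = 3, so ROC curves built from them are coarse for small *k*.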

Cross-validate 20 *k*-NN classifiers (one for each odd value of *k* from 1 to 39) on the loaded dataset using `cross_validate`.

In [None]:
from sklearn import neighbors
from sklearn import metrics

aurocs_clf = []
# Create a range of values of k. We will use this throughout the lab.
k_range = range(1,40,2) 

for k in k_range:
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    y_pred = cross_validate(X_clf, y_clf, clf, cv_folds)

    # ROC curve from the predicted probability of the positive class (column 1)
    fpr, tpr, thresholds = metrics.roc_curve(y_clf, y_pred[:,1])
    aurocs_clf.append(metrics.auc(fpr,tpr))

__Question:__ Plot the AUC as a function of the number of nearest neighbours chosen.

In [None]:
plt.plot(#TODO)
plt.xlabel('Number of nearest neighbours', fontsize=14)
plt.ylabel('Cross-validated AUC', fontsize=14)
plt.title('Nearest neighbours classification - cross validated AUC.', fontsize=14)

__Question:__ Find the best value for the parameter `n_neighbors`, that is, the one that gives the maximum cross-validated AUC.
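
One possible way to do this is with `np.argmax`. The sketch below runs on made-up numbers: `k_values_demo` and `aucs_demo` are hypothetical and are not results from the lab data.

```python
import numpy as np

# Hypothetical values, for illustration only (not results from the lab data)
k_values_demo = list(range(1, 10, 2))  # [1, 3, 5, 7, 9]
aucs_demo = [0.61, 0.70, 0.74, 0.72, 0.69]

best_idx = int(np.argmax(aucs_demo))   # index of the first maximum
best_k_demo = k_values_demo[best_idx]
print(best_k_demo, aucs_demo[best_idx])
```

If several values of *k* tie for the maximum, `np.argmax` returns the first one, which here means the smallest *k*.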

Let us now use `sklearn.model_selection.GridSearchCV` to do the same. The parameter to be cross-validated is the number of nearest neighbours. Pass an appropriate list of candidate values to `GridSearchCV` to find the best one.

In [None]:
from sklearn import model_selection
from sklearn import metrics

classifier = neighbors.KNeighborsClassifier()

param_grid = {'n_neighbors': k_range}
clf_knn_opt = model_selection.GridSearchCV(classifier, 
 param_grid=param_grid, 
 cv=cv_folds,
 scoring='roc_auc')
clf_knn_opt.fit(X_clf, y_clf)

__Question:__ What is now the optimal parameter?

In [None]:
# Find the best parameter
# TODO

Try choosing different scoring metrics for GridSearchCV, and see how the result changes. You can find scoring metrics [here](http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).
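
A minimal sketch of such a comparison follows. It assumes synthetic stand-in data from `make_classification` and a small grid of *k* values; `X_demo`, `y_demo` and `best_k_by_scoring` are illustrative names, and the selected values will differ on the lab data.

```python
from sklearn import datasets, model_selection, neighbors

# Synthetic stand-in data (an assumption; not the endometrium vs. uterus data)
X_demo, y_demo = datasets.make_classification(n_samples=120, n_features=10,
                                              random_state=0)

best_k_by_scoring = {}
for scoring in ['roc_auc', 'accuracy', 'f1']:
    search = model_selection.GridSearchCV(
        neighbors.KNeighborsClassifier(),
        param_grid={'n_neighbors': [1, 3, 5, 7, 9]},
        cv=5,
        scoring=scoring)
    search.fit(X_demo, y_demo)
    # best_params_ holds the grid point with the highest mean cross-validated score
    best_k_by_scoring[scoring] = search.best_params_['n_neighbors']

print(best_k_by_scoring)
```

Different scoring functions reward different behaviour (ranking quality for `roc_auc`, thresholded predictions for `accuracy` and `f1`), so they need not agree on the best *k*.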

Now compare the performance of the *k*-nearest neighbours classifier with logistic regression (both non-regularised and regularised).

In [None]:
from sklearn import linear_model

clf_logreg_l2 = linear_model.LogisticRegression()
logreg_params = {'C':[1e-3, 1e-2, 1e-1, 1., 1e2]}
 
clf_logreg_opt = model_selection.GridSearchCV(clf_logreg_l2, 
 param_grid=logreg_params, 
 cv=cv_folds,
 scoring='roc_auc')
clf_logreg_opt.fit(X_clf, y_clf)
ypred_clf_logreg_opt = cross_validate(X_clf, y_clf, 
 clf_logreg_opt.best_estimator_, 
 cv_folds)
fpr_clf_logreg_opt, tpr_clf_logreg_opt, thresh = metrics.roc_curve(y_clf, 
 ypred_clf_logreg_opt[:,1])

In [None]:
clf_logreg = linear_model.LogisticRegression(C=1e12)

ypred_clf_logreg = cross_validate(X_clf, y_clf, 
 clf_logreg, 
 cv_folds)
fpr_clf_logreg, tpr_clf_logreg, thresh = metrics.roc_curve(y_clf, 
 ypred_clf_logreg[:, 1])

In [None]:
ypred_clf_knn_opt = cross_validate(X_clf, y_clf, clf_knn_opt.best_estimator_, cv_folds)
fpr_clf_knn_opt, tpr_clf_knn_opt, thresh = metrics.roc_curve(y_clf, ypred_clf_knn_opt[:, 1])

In [None]:
plt.figure(figsize=(6, 6))

logreg_l2_h, = plt.plot(fpr_clf_logreg_opt, tpr_clf_logreg_opt, 'b-')
logreg_h, = plt.plot(fpr_clf_logreg, tpr_clf_logreg, 'g-')
knn_h, = plt.plot(fpr_clf_knn_opt, tpr_clf_knn_opt, 'r-')
logreg_l2_auc = metrics.auc(fpr_clf_logreg_opt, tpr_clf_logreg_opt)
logreg_auc = metrics.auc(fpr_clf_logreg, tpr_clf_logreg)
knn_auc = metrics.auc(fpr_clf_knn_opt, tpr_clf_knn_opt)

logreg_legend = 'LogisticRegression. AUC=%.2f' %(logreg_auc)
logreg_l2_legend = 'Regularised LogisticRegression. AUC=%.2f' %(logreg_l2_auc)
knn_legend = 'KNeighborsClassifier. AUC=%.2f' %(knn_auc)
plt.legend([logreg_h, logreg_l2_h, knn_h], 
 [logreg_legend, logreg_l2_legend, knn_legend], fontsize=12)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC Curves comparison for logistic regression and k-nearest neighbours classifier.')
plt.show()

### Setting the distance metric
__Question:__ *k*-nearest neighbours classifiers measure distances between points to determine similarity. By default, scikit-learn uses the Euclidean distance (the Minkowski metric with `p=2`). Other distance metrics can sometimes work better. Change the distance metric by passing the `metric` argument when creating the classifier.
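
To build some intuition for why the metric matters, here is a small NumPy sketch on hypothetical points (`query` and `candidates` are made up for illustration). It computes three common distances by hand and shows that the nearest neighbour of the same query can change with the metric.

```python
import numpy as np

# Hypothetical points, for illustration only
query = np.array([0.0, 0.0])
candidates = np.array([[3.0, 3.0], [0.0, 4.0]])

diff = np.abs(candidates - query)
dist = {
    'euclidean': np.sqrt((diff ** 2).sum(axis=1)),  # straight-line distance
    'manhattan': diff.sum(axis=1),                  # sum of absolute differences
    'chebyshev': diff.max(axis=1),                  # largest absolute difference
}

# Index of the closest candidate under each metric
nearest = {name: int(np.argmin(d)) for name, d in dist.items()}
print(nearest)
```

Under the Euclidean and Manhattan distances the second candidate is closer, while under the Chebyshev distance the first one is. The three names used as dictionary keys are also valid strings for the `metric` argument of `KNeighborsClassifier`.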

In [None]:
classifiers = {} 
y_preds = {} 
# Fix a set of distance metrics to use
d_metrics = # TODO 
aurocs = {} 

for m in d_metrics:
    aurocs[m] = []
    for k in k_range:
        classifiers[m] = neighbors.KNeighborsClassifier(n_neighbors=k, metric=m)
        y_preds[m] = cross_validate(X_clf, y_clf, classifiers[m], cv_folds)

        fpr, tpr, thresholds = metrics.roc_curve(y_clf, y_preds[m][:,1])
        auc = metrics.auc(fpr, tpr)
        aurocs[m].append(auc)

        print('Metric = %-12s | k = %3d | AUC = %.3f.' % (m, k, aurocs[m][-1]))

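Let us now plot the cross-validated AUC as a function of *k* for all the metrics together, with the AUC of the regularised logistic regression as a reference line.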

In [None]:
f = plt.figure(figsize=(9, 6))

for m in d_metrics:
    plt.plot(k_range, aurocs[m])

plt.plot(k_range, [logreg_l2_auc for kval in k_range]) 
 
plt.xlabel('Number of nearest neighbors', fontsize=14)
plt.ylabel('Cross-validated AUC', fontsize=14)
plt.title('Nearest neighbors classification', fontsize=14)

legends = [m for m in d_metrics]
legends.append('Logistic regression')
plt.legend(legends, fontsize=12)