# Lab 03: Linear and logistic regressions

The goal of this lab is to explore linear and logistic regression, implement them yourself and learn to use their respective scikit-learn implementation.

Let us start by loading some of the usual librairies

import pandas as pd
%pylab inline

1. Linear regression

We will now implement a linear regression, first using the closed form solution, and second with our gradient descent.

1.1 Linear regression data

Our first data set regards the quality ratings of a white vinho verde. Each wine is described by a number of physico-chemical descriptors such as acidity, sulfur dioxide content, density or pH.

# load the regression task data
wine_data = pd.read_csv('data/winequality-white.csv', sep=";")
wine_data.head(5)

# Load the data into X and y data arrays
X_regr = wine_data.drop(['quality'], axis=1).values
y_regr = wine_data['quality'].values

# Standardize the data
from sklearn import preprocessing
sc = preprocessing.StandardScaler()
sc.fit(X_regr)
X_regr = sc.transform(X_regr)

1.2 Cross-validation

Let us create a cross-validation utility function (similar to what we have done in Lab 3, but for regression).

# set up folds for cross_validation
from sklearn import model_selection
folds_regr = model_selection.KFold(n_splits=10, shuffle=True)

def cross_validate_regr(design_matrix, labels, regressor, cv_folds):
    """ Perform a cross-validation and returns the predictions.
    
    Parameters:
    -----------
    design_matrix: (n_samples, n_features) np.array
        Design matrix for the experiment.
    labels: (n_samples, ) np.array
        Vector of labels.
    regressor:  Regressor instance; must have the following methods:
        - fit(X, y) to train the regressor on the data X, y
        - predict(X) to apply the trained regressor to the data X and return estimates 
    cv_folds: sklearn cross-validation object
        Cross-validation iterator.
        
    Returns:
    -------
    pred: (n_samples, ) np.array
        Vectors of predictions (same order as labels).
    """
    pred = np.zeros(labels.shape)
    for tr, te in cv_folds:
        regressor.fit(design_matrix[tr,:], labels[tr])
        pred[te] = (regressor.predict(design_matrix[te,:]))
    return pred

1.3 Linear regression with scikit-learn

Question Cross-validate scikit-learn's linear_model.LinearRegression on your data.

from sklearn import linear_model

# Initialize a LinearRegression model
regr = # TODO

# Cross-validate it
pred = # TODO

from sklearn import metrics
print("Mean squared error: %.3f" % metrics.mean_squared_error(y_regr, pred))

2. Logistic regression

We will now implement a linear regression, first using the closed form solution, and second with our gradient descent.

2.1 Logistic regression data

Our second data set comes from the world of bioinformatics. In this data set, each observation is a tumor, and it is described by the expression of 3,000 genes. The expression of a gene is a measure of how much of that gene is present in the biological sample. Because this affects how much of the protein this gene codes for is produced, and because proteins dictacte what cells can do, gene expression gives us valuable information about the tumor. In particular, the expression of the same gene in the same individual is different in different tissues (although the DNA is the same): this is why blood cells look different from skin cells. In our data set, there are two types of tumors: breast tumors and ovary tumors. Let us see if gene expression can be used to separate them!

# Load the classification task data
breast_data = pd.read_csv('data/small_Breast_Ovary.csv')

# Drop the 'Tissue' column to create the design matrix
X_clf = np.array(breast_data.drop(['Tissue', 'ID_REF'], axis=1).values)

# Use the 'Tissue' column to create the labels (0=Breast, 1=Ovary)
y_clf = np.array(breast_data['Tissue'].values)
y_clf[np.where(y_clf == 'Breast')] = 0
y_clf[np.where(y_clf == 'Ovary')] = 1
y_clf = y_clf.astype(np.int)

#sc = preprocessing.StandardScaler()
#sc.fit(X_clf)
#X_clf = sc.transform(X_clf)

Question: How many samples do we have? How many belong to each class? How many features do we have?

# TODO

2.2 Cross-validation

Let us create a cross-validation utility function (similar to what we have done in Lab 3).

# Set up folds for cross_validation
from sklearn import model_selection
folds_clf = model_selection.StratifiedKFold(n_splits=10, shuffle=True)

def cross_validate_clf(design_matrix, labels, classifier, cv_folds):
    """ Perform a cross-validation and returns the predictions.
    
    Parameters:
    -----------
    design_matrix: (n_samples, n_features) np.array
        Design matrix for the experiment.
    labels: (n_samples, ) np.array
        Vector of labels.
    classifier:  sklearn classifier object
        Classifier instance; must have the following methods:
        - fit(X, y) to train the classifier on the data X, y
        - predict_proba(X) to apply the trained classifier to the data X and return probability estimates 
    cv_folds: sklearn cross-validation object
        Cross-validation iterator.
        
    Return:
    -------
    pred: (n_samples, ) np.array
        Vectors of predictions (same order as labels).
    """
    pred = np.zeros(labels.shape)
    for tr, te in cv_folds:
        classifier.fit(design_matrix[tr,:], labels[tr])
        pred[te] = classifier.predict_proba(design_matrix[te,:])[:,1]
    return pred

2.3 Logistic regression with scikit-learn

Question Cross-validate scikit-learn's linear_model.LogisticRegression on your data.

from sklearn import linear_model

# Initialize a LogisticRegression model. 
# Use C=1e7 to ensure there is no regularization (we'll talk about regularization next time!)
clf = # TODO

# Cross-validate it
ypred_logreg = # TODO

print("Accuracy: %.3f" % metrics.accuracy_score(ypred_logreg > 0.5, 1, 0))

Question : Plot the ROC curve. Use plt.semilogx to use a logarithmic scale on the x-axis. This "spreads out" the curve a little, making it easier to read.

fpr_logreg, tpr_logreg, thresholds = # TODO
auc_logreg = # TODO

plt.semilogx(fpr_logreg, tpr_logreg, '-', color='orange', 
             label='AUC = %0.3f' % auc_logreg)

plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve: Logistic regression', fontsize=16)
plt.legend(loc="lower right")

Data scaling

See preprocessing.StandardScaler

Question Scale the data, and compute the cross-validated predictions of the logistic regression on the scaled data.

from sklearn import preprocessing

# Scale the data with preprocessing.StandardScaler
# Initialize a scaler
scaler = # TODO
# Scale your design matrix
X_clf_scaled = # TODO

# Initialize a LogisticRegression model. 
# Use C=1e7 to ensure there is no regularization (we'll talk about regularization next time!)
clf = # TODO

# Cross-validate it for the scaled data
ypred_logreg_scaled = # TODO

print("Accuracy: %.3f" % metrics.accuracy_score(ypred_logreg_scaled > 0.5, 1, 0))

Question Plot the two ROC curves (one for the logistic regression on the original data, one for the logistic regression on the scaled data) on the same plot.

fpr_logreg_scaled, tpr_logreg_scaled, thresholds = # TODO
auc_logreg_scaled = # TODO

plt.semilogx(fpr_logreg_scaled, tpr_logreg_scaled, '-', color='blue', 
             label='scaled data; AUC = %0.3f' % auc_logreg_scaled)
plt.semilogx(fpr_logreg, tpr_logreg, '-', color='orange', 
             label='original data; AUC = %0.3f' % auc_logreg)

plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve: Logistic regression', fontsize=16)
plt.legend(loc="lower right")

In a cross-validation setting, we ignore the samples from the test fold when training the classifier. This also means that scaling should be done on the training data only.

In scikit-learn, we can use a scaler to make centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. The mean and standard deviation will be stored to be used on the test data.

Question Rewrite the cross_validate method to include a scaling step.

def cross_validate_clf_with_scaling(design_matrix, labels, classifier, cv_folds):
    """ Perform a cross-validation and returns the predictions.
    
    Parameters:
    -----------
    design_matrix: (n_samples, n_features) np.array
        Design matrix for the experiment.
    labels: (n_samples, ) np.array
        Vector of labels.
    classifier:  sklearn classifier object
        Classifier instance; must have the following methods:
        - fit(X, y) to train the classifier on the data X, y
        - predict_proba(X) to apply the trained classifier to the data X and return probability estimates 
    cv_folds: sklearn cross-validation object
        Cross-validation iterator.
        
    Return:
    -------
    pred: (n_samples, ) np.array
        Vectors of predictions (same order as labels).
    """
    pred = np.zeros(labels.shape)
    for tr, te in cv_folds:
        # TODO
    return pred

Question Now use the cross_validate_with_scaling method to cross-validate the logistic regression on our data.

clf = linear_model.LogisticRegression(C=1e6) 
ypred_logreg_scaled_ = cross_validate_clf_with_scaling(X_clf, y_clf, clf, folds_clf)
print(metrics.accuracy_score(y_clf,np.where(ypred_logreg_scaled_ > 0.5, 1, 0)))

Question Again, compare the AUROC and ROC curves with those obtained previously. What do you conclude?

fpr_logreg_scaled_, tpr_logreg_scaled_, thresholds = # TODO
auc_logreg_scaled_ = # TODO

plt.semilogx(fpr_logreg_scaled, tpr_logreg_scaled, '-', 
             color='blue', label='scaled data overfit; AUC = %0.3f' % auc_logreg_scaled)
plt.semilogx(fpr_logreg, tpr_logreg, '-', color='orange', 
             label='original data; AUC = %0.3f' % auc_logreg)
plt.semilogx(fpr_logreg_scaled_, tpr_logreg_scaled_, '-', color='black', 
             label='scaled data no overfit; AUC = %0.3f' % auc_logreg_scaled_)


plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve: Logistic regression', fontsize=16)
plt.legend(loc="lower right")

03-linear-and-logistic-regression.ipynb 16 KB Historie Surový