{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab 3: Cross validation and feature encoding."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data you will be using in this lab come from a Kaggle in-class competition. The goal is to predict the number of shares an article will get on social media from the article's topic, length, day of publication, and many other features.\n",
"\n",
"You are given labels, that is, the number of shares, for 5000 of these articles. We will explore how to set up cross-validation and do some feature encoding."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Populating the interactive namespace from numpy and matplotlib\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"%pylab inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Model Selection: setting up a cross validation\n",
"\n",
"Cross-validation is a good way to perform model selection empirically while avoiding overfitting. \n",
"\n",
"This procedure can be split into the following two steps: \n",
"* the dataset is randomly split into K folds \n",
"* the model is run K times, each run using K-1 folds as the training set and evaluating the performance on the remaining fold which is the test set. \n",
"\n",
"Prediction performances are averaged over all folds. \n",
"\n",
"When the model contains hyperparameters that need to be tuned, the CV scheme is repeated for each candidate value of the hyperparameters, and the values leading to the best prediction performance, averaged over all folds, are retained.\n",
"\n",
"Depending on the size of the dataset, 5 or 10 folds are usually considered."
]
},
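{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the procedure described above, the cell below selects the regularization strength `C` of a logistic regression by 5-fold cross-validation. The synthetic dataset, the candidate grid and the names `X_demo`, `y_demo` and `candidate_C` are assumptions made for illustration; they are not part of the lab data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import KFold, cross_val_score\n",
"\n",
"# Synthetic data, for illustration only\n",
"X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)\n",
"\n",
"# Hypothetical grid of values for the hyperparameter C\n",
"candidate_C = [0.01, 0.1, 1.0, 10.0]\n",
"\n",
"# The same 5 folds are reused for every candidate value\n",
"cv = KFold(n_splits=5, shuffle=True, random_state=0)\n",
"\n",
"mean_scores = []\n",
"for C in candidate_C:\n",
"    clf = LogisticRegression(C=C, max_iter=1000)\n",
"    # One accuracy score per fold, averaged over the K folds\n",
"    fold_scores = cross_val_score(clf, X_demo, y_demo, cv=cv)\n",
"    mean_scores.append(fold_scores.mean())\n",
"\n",
"# Retain the value with the best average performance\n",
"best_C = candidate_C[int(np.argmax(mean_scores))]\n",
"print('mean accuracy per C:', np.round(mean_scores, 3))\n",
"print('selected C:', best_C)"
]
},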
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Question:__ In a K-fold cross-validation, how many times does each sample appear in a test set? In a training set? "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Answer:__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question:** Implement a function which splits the _indices_ of the training data into K folds."
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [],
"source": [
"def make_Kfolds(n_instances, n_folds):\n",
"    \"\"\"\n",
"    Set up a K-fold cross-validation.\n",
"    \n",
"    Parameters:\n",
"    -----------\n",
"    n_instances: int\n",
"        the number of instances in the dataset.\n",
"    n_folds: int\n",
"        the number of folds of the cross-validation scheme\n",
"    \n",
"    Outputs:\n",
"    --------\n",
"    fold_list: list\n",
"        list of folds, a fold is a tuple of 2 lists, \n",
"        the first one containing the indices of instances of the training set,\n",
"        the second one containing the indices of instances of the test set\n",
"    \"\"\"\n",
"    # Create a list of the n_instances indices [0, 1, ..., n_instances-1]\n",
"    list_indices = list(range(n_instances))\n",
"    # Shuffle the list in place with np.random.shuffle\n",
"    np.random.shuffle(list_indices)\n",
"    \n",
"    # Compute the number of instances per fold (i.e. in each test set)\n",
"    n_instances_per_fold = n_instances // n_folds\n",
"    print(n_instances_per_fold)\n",
"    \n",
"    # For each of the first K-1 folds, create the lists of train set and test set indices\n",
"    fold_list = []\n",
"    for ind_fold in range(n_folds-1):\n",
"        start = ind_fold * n_instances_per_fold\n",
"        stop = start + n_instances_per_fold\n",
"        test_list = list_indices[start:stop]\n",
"        train_list = list_indices[:start] + list_indices[stop:]\n",
"        fold_list.append((train_list, test_list))\n",
"\n",
"    # Process the last fold separately: its test set takes all the remaining instances\n",
"    last_start = (n_folds-1) * n_instances_per_fold\n",
"    test_list = list_indices[last_start:]\n",
"    train_list = list_indices[:last_start]\n",
"    fold_list.append((train_list, test_list))\n",
"    \n",
"    return fold_list"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"200\n",
"Fold 0\n",
"\t 801 training points\n",
"\t 200 test points\n",
"Fold 1\n",
"\t 801 training points\n",
"\t 200 test points\n",
"Fold 2\n",
"\t 801 training points\n",
"\t 200 test points\n",
"Fold 3\n",
"\t 801 training points\n",
"\t 200 test points\n",
"Fold 4\n",
"\t 800 training points\n",
"\t 201 test points\n"
]
}
],
"source": [
"# Check whether your function does what is expected\n",
"perso_folds = make_Kfolds(1001, 5)\n",
"\n",
"for ix, (tr, te) in enumerate(perso_folds):\n",
"    print(\"Fold %d\" % ix)\n",
"    print(\"\\t %d training points\" % len(tr))\n",
"    print(\"\\t %d test points\" % len(te))\n",
"    if len(np.intersect1d(tr, te))>0:\n",
"        print('some instances are both in your training and test sets')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In practice, when using scikit-learn, you will not implement your cross-validation yourself, but rather rely on the library's functionalities for setting up cross-validation schemes. \n",
"\n",
"[Here](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) is the list of available tools in the scikit-learn library.\n",
"\n",
"We list here two of the most important ones:\n",
"* [K-fold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold): provides train/test indices to split the data into train/test sets. It splits the dataset into k consecutive folds (without shuffling by default). \n",
"* [stratified K-fold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) (to be used in case of classification): this cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.\n",
"\n",
"We will now explore the stratified K-fold on randomly generated data."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0 0 0 0 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 0 0 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0 1\n",
" 1 0 1 1 1 0 0 0 1 0 0 0 0 1 0 0 1 1 1 0 1 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0 0\n",
" 1 0 1 0 1 0 1 1 0 1 0 1 1 1 0 1 0 0 0 1 0 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 0\n",
" 0 0 1 1 1 0 0 0 0 0 1 0 1 1 0 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 1 0 0 1 1 0 0\n",
" 0 0 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0\n",
" 1 1 1 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 1 0 1 1\n",
" 1 1 0 1 0 0 1 0 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1 0 1 0 0 1 0 0 1\n",
" 1 1 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 1 1 0 0 1 0 0 1 1 0 0\n",
" 0 1 0 0 0 1 0 0 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 1 0 1 1 1 1 0 0 1 1 0 1 1\n",
" 0 0 1 1 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 0 0\n",
" 0 1 1 1 1 0 1 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0\n",
" 0 1 1 1 1 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0\n",
" 0 0 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 0 0 0 0 1 0 0 1 1 1\n",
" 0 0 0 1 0 1 0 1 0 0 1 1 0 1 1 1 1 1 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 0 1\n",
" 1 1 1 1 0 0 0 1 0 0 1 0 1 0 1 1 0 0 1 1 1 1 0 1 1 0 0 1 1 0 1 1 1 0 0 1 0\n",
" 1 0 0 1 1 1 0 1 0 0 1 0 1 1 0 1 1 1 0 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0 0 1\n",
" 0 0 0 1 1 0 0 1 0 0 1 1 1 0 1 0 1 0 1 0 0 1 0 1 1 0 1 0 1 1 1 1 0 1 0 1 0\n",
" 0 1 1 0 0 1 1 0 0 1 0 1 1 0 0 0 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 1 1 0 1 0 0\n",
" 0 0 1 0 0 1 1 0 1 0 1 1 0 1 1 1 0 0 1 0 1 1 0 1 0 0 0 0 1 1 1 0 1 0 0 1 1\n",
" 0 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 1 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1\n",
" 1 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 0 1 0 1 0 1\n",
" 0 1 1 1 1 0 0 0 0 1 1 0 0 1 1 0 1 1 0 0 0 0 1 0 1 1 1 1 1 0 1 0 0 1 1 0 1\n",
" 1 0 1 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1 1 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 1 1\n",
" 1 1 0 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 0 0 1 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0\n",
" 0 1 1 1 0 0 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 1 1 1 1 0 1 0 1 0 0 1 0 0 0\n",
" 1 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 1 0 1 1 1 0 1 0 0 0 1 0 1 0 0 1 1 0\n",
" 0 1 0 1 0 0 1 1 1 0 1 1 0 0 1 0 0 1 0 1 0 1 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0\n",
" 0]\n"
]
}
],
"source": [
"# Generate random data\n",
"n_instances, n_features = 1000, 7\n",
"# Design matrix\n",
"X = np.random.random((n_instances, n_features))\n",
"# Classification labels\n",
"y = np.where(np.random.random(n_instances) >=0.5, 1, 0)\n",
"print(y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question:** Using scikit-learn, set up a stratified 10-fold cross-validation for the above data."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from sklearn import model_selection\n",
"# Initialize a StratifiedKFold object with 10 folds, as the question asks\n",
"skf = model_selection.StratifiedKFold(n_splits=10)\n",
"# Split the data using skf\n",
"sk_folds = skf.split(X,y)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fold 0\n",
"\t 800 training points\n",
"\t 200 test points\n",
"Fold 1\n",
"\t 800 training points\n",
"\t 200 test points\n",
"Fold 2\n",
"\t 800 training points\n",
"\t 200 test points\n",
"Fold 3\n",
"\t 800 training points\n",
"\t 200 test points\n",
"Fold 4\n",
"\t 800 training points\n",
"\t 200 test points\n"
]
}
],
"source": [
"# This is one way to access the training and test points\n",
"for ix, (tr, te) in enumerate(sk_folds):\n",
"    print(\"Fold %d\" % ix)\n",
"    print(\"\\t %d training points\" % len(tr))\n",
"    print(\"\\t %d test points\" % len(te))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Important note:__ `sk_folds` is a [_generator_](https://wiki.python.org/moin/Generators), meaning that once you are done looping through it, it will be empty; call `skf.split(X, y)` again to obtain a fresh one. In practice, it avoids storing all the indices (if you were doing 10-fold cross-validation on a million samples, you would have $10^7$ values to store)."
]
},
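{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of this behaviour on toy data (the names `toy_X`, `toy_y`, `toy_skf` and `toy_folds` are illustrative and not part of the lab): the generator returned by `split` yields its folds only once, so store the folds in a list, or call `split` again, if you need them more than once."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import StratifiedKFold\n",
"\n",
"# Toy data: 20 samples, two balanced classes\n",
"toy_X = np.zeros((20, 2))\n",
"toy_y = np.array([0, 1] * 10)\n",
"\n",
"toy_skf = StratifiedKFold(n_splits=5)\n",
"toy_folds = toy_skf.split(toy_X, toy_y)\n",
"\n",
"print(len(list(toy_folds)))  # 5: the first pass consumes the generator\n",
"print(len(list(toy_folds)))  # 0: nothing is left for a second pass\n",
"\n",
"# Storing the folds in a list makes them reusable\n",
"stored_folds = list(toy_skf.split(toy_X, toy_y))\n",
"print(len(stored_folds))     # 5"
]
},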
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question:** Create a cross-validation function that takes a design matrix, label array, scikit-learn classifier, and scikit-learn cross-validation object and returns the corresponding list of cross-validated predictions. \n",
"\n",
"The function contains a loop that goes through all folds and for each fold:\n",
"* trains a model on the training data\n",
"* uses this model to make predictions on the test data. \n",
"\n",
"In this fashion you should be able to form *a single vector of predictions* `y_prob_cv` (as each point from the data appears once as a test point in the cross-validation).\n",
"\n",
"Make sure that you are returning the predictions in the correct order!\n",
"\n",
"Check the documentation of fit(X, y) and predict_proba(X) in [sklearn.naive_bayes.GaussianNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html). Every classifier implemented in scikit-learn has a fit(X, y) method, and most of them, including GaussianNB, also have a predict_proba(X) method. \n",
"Note that the predict_proba method returns a 2-dimensional array; you must find a way to keep only the probability of belonging to the positive class."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"def cross_validate(design_matrix, labels, classifier, cv_folds):\n",
"    \"\"\" Perform a cross-validation and return the predictions.\n",
"    \n",
"    Parameters:\n",
"    -----------\n",
"    design_matrix: (n_samples, n_features) np.array\n",
"        Design matrix for the experiment.\n",
"    labels: (n_samples, ) np.array\n",
"        Vector of labels.\n",
"    classifier: sklearn classifier object\n",
"        Classifier instance; must have the following methods:\n",
"        - fit(X, y) to train the classifier on the data X, y\n",
"        - predict_proba(X) to apply the trained classifier to the data X and return probability estimates \n",
"    cv_folds: sklearn cross-validation object\n",
"        Cross-validation iterator yielding (train indices, test indices) pairs.\n",
"    \n",
"    Return:\n",
"    -------\n",
"    pred: (n_samples, ) np.array\n",
"        Vector of predictions (same order as labels).\n",
"    \"\"\"\n",
"    pred = np.zeros(labels.shape)\n",
"    for tr, te in cv_folds:\n",
"        # Train on the training folds, then store the positive-class probabilities\n",
"        # at the test indices so that predictions stay aligned with the labels\n",
"        classifier.fit(design_matrix[tr,:], labels[tr])\n",
"        pred[te] = classifier.predict_proba(design_matrix[te,:])[:,1]\n",
"    return pred"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000\n",
"1000\n",
"0.478\n"
]
}
],
"source": [
"# To check whether your function runs properly, you can use the following\n",
"\n",
"# import Gaussian Naive Bayes\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn import metrics\n",
"\n",
"# create a GNB classifier\n",
"gnb = GaussianNB()\n",
"\n",
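"# run your cross_validate function on a fresh set of folds\n",
"# (the generator sk_folds was already consumed by the loop above, so passing it\n",
"# again would leave every prediction at its initial value of 0)\n",
"y_prob_cv = cross_validate(X, y, gnb, skf.split(X, y))\n",
"\n",
"# check y and y_prob_cv have the same length (the number of instances)\n",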
"print(len(y_prob_cv))\n",
"print(len(y))\n",
"\n",
"# check the accuracy of your prediction (it should be close to 0.5 as we're considering random matrices). \n",
"print(metrics.accuracy_score(y, np.where(y_prob_cv>=0.5, 1, 0)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Extensions**\n",
"* **Leave-one-out cross-validation:** in this case, the number of folds is the number of available points in the dataset. In other words, with n points, the model is trained n times on n-1 points, and tested each time on the left-out point. The LOO CV scheme is particularly convenient when the number of samples is very small. When the number of samples is large, it becomes computationally burdensome; moreover, the cross-validated error tends to have a very large variance, which makes it hard to interpret.\n",
"\n",
"* **Nested cross-validation:** The goal of the cross-validation scheme is to assess the performance of the model on _new_ data which were not used to train or optimize the model. From that perspective, the CV scheme is not rigorous when optimizing hyperparameters. Indeed, the test data are used both to assess the performance and to choose the set of parameters which led to the best performance. To avoid selecting a possibly over-fitted set of parameters, we can use the so-called nested cross-validation (_Nested CV_) scheme, which consists of a cross-validation (_inner-CV_) nested in another cross-validation (_outer-CV_). At each step of the _outer-CV_, the optimal parameters are found via the _inner-CV_ on the train set of the _outer-CV_, and the performance is assessed on the remaining test fold of the _outer-CV_. Therefore, in _Nested CV_, parameter optimization and performance assessment are performed on different _unseen_ data.\n"
]
},
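{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of nested cross-validation with scikit-learn, assuming synthetic data and an arbitrary grid for the `C` parameter of a logistic regression (both are illustrative choices, not part of the lab): `GridSearchCV` plays the role of the _inner-CV_, and `cross_val_score` runs the _outer-CV_ around it. The names `X_demo`, `y_demo`, `inner_cv` and `outer_cv` are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_classification\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import GridSearchCV, KFold, cross_val_score\n",
"\n",
"# Synthetic data, for illustration only\n",
"X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)\n",
"\n",
"inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)\n",
"outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)\n",
"\n",
"# Inner CV: choose C using only the training part of each outer fold\n",
"search = GridSearchCV(LogisticRegression(max_iter=1000),\n",
"                      param_grid={'C': [0.01, 0.1, 1.0, 10.0]},\n",
"                      cv=inner_cv)\n",
"\n",
"# Outer CV: score the whole tuning procedure on folds it has never seen\n",
"outer_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv)\n",
"print('nested CV accuracy: %.3f +/- %.3f' % (outer_scores.mean(), outer_scores.std()))"
]
},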
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Data loading and visualization\n",
"\n",
"Download the data from the competition link and unzip it."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"'nb_words_title' Number of words in the article's titles\n",
"'nb_words_content' Number of words in the article\n",
"'pp_uniq_words' Proportion of unique words in the article\n",
"'pp_stop_words' Proportion of stop words (i.e. words predefined to be too common to be of use for interpretation or queries, such as 'the', 'a', 'and', etc.)\n",
"'pp_uniq_non-stop_words' Proportion of non-stop words among unique words\n",
"'nb_links' Number of hyperlinks in the article\n",
"'nb_outside_links' Number of hyperlinks pointing to another website\n",
"'nb_images' Number of images in the article\n",
"'nb_videos' Number of videos in the article\n",
"'ave_word_length' Average word length\n",
"'nb_keywords' Number of keywords in the metadata\n",
"'category' Category of the article: 0-Lifestyle, 1-Entertainment, 2-Business, 3-Web, 4-Tech, 5-World\n",
"'nb_mina_mink' Minimum number of share counts among all articles with at least one keyword in common with the article\n",
"'nb_mina_maxk' Minimum number of maximum share counts per keyword\n",
"'nb_mina_avek' Minimum number of average share counts per keyword\n",
"'nb_maxa_mink' Maximum number of minimum share counts per keyword\n",
"'nb_maxa_maxk' Maximum number of share counts among all articles with at least one keyword in common with the article\n",
"'nb_maxa_avek' Maximum number of average share counts per keyword\n",
"'nb_avea_mink' Average number of minimum share counts per keyword\n",
"'nb_avea_maxk' Average number of maximum share counts per keyword\n",
"'nb_avea_avek' Average number of average share counts per keyword\n",
"'nb_min_linked' Minimum number of shares of articles from the same website linked within the article\n",
"'nb_max_linked' Maximum number of shares of articles from the same website linked within the article\n",
"'nb_ave_linked' Average number of shares of articles from the same website linked within the article\n",
"'weekday' Day of the week: 0-Monday, 1-Tuesday, 2-Wednesday, until 6-Sunday\n",
"'dist_topic_0' Distance to topic 0\n",
"'dist_topic_1' Distance to topic 1\n",
"'dist_topic_2' Distance to topic 2\n",
"'dist_topic_3' Distance to topic 3\n",
"'dist_topic_4' Distance to topic 4\n",
"'subj' Subjectivity\n",
"'polar' Sentiment polarity \n",
"'pp_pos_words' Proportion of positive words in the article\n",
"'pp_neg_words' Proportion of negative words in the article\n",
"'pp_pos_words_in_nonneutral' Proportion of positive words among the non-neutral words of the article\n",
"'ave_polar_pos' Average sentiment polarity of the positive words\n",
"'min_polar_pos' Minimum sentiment polarity of the positive words\n",
"'max_polar_pos' Maximum sentiment polarity of the positive words\n",
"'ave_polar_neg' Average sentiment polarity of the negative words\n",
"'min_polar_neg' Mimimum sentiment polarity of the negative words\n",
"'max_polar_neg' Maximum sentiment polarity of the negative words\n",
"'subj_title' Subjectivity of the title\n",
"'polar_title' Polarity of the title\n"
]
}
],
"source": [
"# we display the description of the features\n",
"!cat data/kaggle_data/features.txt"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/wbader/miniconda3/envs/tp-ml/lib/python3.6/site-packages/ipykernel_launcher.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.\n",
" \n"
]
},
{
"data": {
"text/plain": [
" feature_names \\\n",
"0 'nb_words_title' \n",
"1 'nb_words_content' \n",
"2 'pp_uniq_words' \n",
"3 'pp_stop_words' \n",
"4 'pp_uniq_non-stop_words' \n",
"5 'nb_links' \n",
"6 'nb_outside_links' \n",
"7 'nb_images' \n",
"8 'nb_videos' \n",
"9 'ave_word_length' \n",
"10 'nb_keywords' \n",
"11 'category' \n",
"12 'nb_mina_mink' \n",
"13 'nb_mina_maxk' \n",
"14 'nb_mina_avek' \n",
"15 'nb_maxa_mink' \n",
"16 'nb_maxa_maxk' \n",
"17 'nb_maxa_avek' \n",
"18 'nb_avea_mink' \n",
"19 'nb_avea_maxk' \n",
"20 'nb_avea_avek' \n",
"21 'nb_min_linked' \n",
"22 'nb_max_linked' \n",
"23 'nb_ave_linked' \n",
"24 'weekday' \n",
"25 'dist_topic_0' \n",
"26 'dist_topic_1' \n",
"27 'dist_topic_2' \n",
"28 'dist_topic_3' \n",
"29 'dist_topic_4' \n",
"30 'subj' \n",
"31 'polar' \n",
"32 'pp_pos_words' \n",
"33 'pp_neg_words' \n",
"34 'pp_pos_words_in_nonneutral' \n",
"35 'ave_polar_pos' \n",
"36 'min_polar_pos' \n",
"37 'max_polar_pos' \n",
"38 'ave_polar_neg' \n",
"39 'min_polar_neg' \n",
"40 'max_polar_neg' \n",
"41 'subj_title' \n",
"42 'polar_title' \n",
"\n",
" feature_description \n",
"0 Number of words in the article's titles \n",
"1 Number of words in the article \n",
"2 Proportion of unique words in the article \n",
"3 Proportion of stop words (i.e. words predefine... \n",
"4 Proportion of non-stop words among unique words \n",
"5 Number of hyperlinks in the article \n",
"6 Number of hyperlinks pointing to another website \n",
"7 Number of images in the article \n",
"8 Number of videos in the article \n",
"9 Average word length \n",
"10 Number of keywords in the metadata \n",
"11 Category of the article: 0-Lifestyle, 1-Entert... \n",
"12 Minimum number of share counts among all artic... \n",
"13 Minimum number of maximum share counts per key... \n",
"14 Minimum number of average share counts per key... \n",
"15 Maximum number of minimum share counts per key... \n",
"16 Maximum number of share counts among all artic... \n",
"17 Maximum number of average share counts per key... \n",
"18 Average number of minimum share counts per key... \n",
"19 Average number of maximum share counts per key... \n",
"20 Average number of average share counts per key... \n",
"21 Minimum number of shares of articles from the ... \n",
"22 Maximum number of shares of articles from the ... \n",
"23 Average number of shares of articles from the ... \n",
"24 Day of the week: 0-Monday, 1-Tuesday, 2-Wednes... \n",
"25 Distance to topic 0 \n",
"26 Distance to topic 1 \n",
"27 Distance to topic 2 \n",
"28 Distance to topic 3 \n",
"29 Distance to topic 4 \n",
"30 Subjectivity \n",
"31 Sentiment polarity \n",
"32 Proportion of positive words in the article \n",
"33 Proportion of negative words in the article \n",
"34 Proportion of positive words among the non-neu... \n",
"35 Average sentiment polarity of the positive words \n",
"36 Minimum sentiment polarity of the positive words \n",
"37 Maximum sentiment polarity of the positive words \n",
"38 Average sentiment polarity of the negative words \n",
"39 Mimimum sentiment polarity of the negative words \n",
"40 Maximum sentiment polarity of the negative words \n",
"41 Subjectivity of the title \n",
"42 Polarity of the title "
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_data = pd.read_csv('data/kaggle_data/features.txt', header=None, sep=\" \",\n",
" names=['feature_names', 'feature_description'])\n",
"feature_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Now, let's load and look at the distribution of the number of shares (output).**"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" feature_names \\\n",
"0 'nb_words_title' \n",
"1 'nb_words_content' \n",
"2 'pp_uniq_words' \n",
"3 'pp_stop_words' \n",
"4 'pp_uniq_non-stop_words' \n",
"5 'nb_links' \n",
"6 'nb_outside_links' \n",
"7 'nb_images' \n",
"8 'nb_videos' \n",
"9 'ave_word_length' \n",
"10 'nb_keywords' \n",
"11 'category' \n",
"12 'nb_mina_mink' \n",
"13 'nb_mina_maxk' \n",
"14 'nb_mina_avek' \n",
"15 'nb_maxa_mink' \n",
"16 'nb_maxa_maxk' \n",
"17 'nb_maxa_avek' \n",
"18 'nb_avea_mink' \n",
"19 'nb_avea_maxk' \n",
"20 'nb_avea_avek' \n",
"21 'nb_min_linked' \n",
"22 'nb_max_linked' \n",
"23 'nb_ave_linked' \n",
"24 'weekday' \n",
"25 'dist_topic_0' \n",
"26 'dist_topic_1' \n",
"27 'dist_topic_2' \n",
"28 'dist_topic_3' \n",
"29 'dist_topic_4' \n",
"30 'subj' \n",
"31 'polar' \n",
"32 'pp_pos_words' \n",
"33 'pp_neg_words' \n",
"34 'pp_pos_words_in_nonneutral' \n",
"35 'ave_polar_pos' \n",
"36 'min_polar_pos' \n",
"37 'max_polar_pos' \n",
"38 'ave_polar_neg' \n",
"39 'min_polar_neg' \n",
"40 'max_polar_neg' \n",
"41 'subj_title' \n",
"42 'polar_title' \n",
"\n",
" feature_description \n",
"0 Number of words in the article's titles \n",
"1 Number of words in the article \n",
"2 Proportion of unique words in the article \n",
"3 Proportion of stop words (i.e. words predefine... \n",
"4 Proportion of non-stop words among unique words \n",
"5 Number of hyperlinks in the article \n",
"6 Number of hyperlinks pointing to another website \n",
"7 Number of images in the article \n",
"8 Number of videos in the article \n",
"9 Average word length \n",
"10 Number of keywords in the metadata \n",
"11 Category of the article: 0-Lifestyle, 1-Entert... \n",
"12 Minimum number of share counts among all artic... \n",
"13 Minimum number of maximum share counts per key... \n",
"14 Minimum number of average share counts per key... \n",
"15 Maximum number of minimum share counts per key... \n",
"16 Maximum number of share counts among all artic... \n",
"17 Maximum number of average share counts per key... \n",
"18 Average number of minimum share counts per key... \n",
"19 Average number of maximum share counts per key... \n",
"20 Average number of average share counts per key... \n",
"21 Minimum number of shares of articles from the ... \n",
"22 Maximum number of shares of articles from the ... \n",
"23 Average number of shares of articles from the ... \n",
"24 Day of the week: 0-Monday, 1-Tuesday, 2-Wednes... \n",
"25 Distance to topic 0 \n",
"26 Distance to topic 1 \n",
"27 Distance to topic 2 \n",
"28 Distance to topic 3 \n",
"29 Distance to topic 4 \n",
"30 Subjectivity \n",
"31 Sentiment polarity \n",
"32 Proportion of positive words in the article \n",
"33 Proportion of negative words in the article \n",
"34 Proportion of positive words among the non-neu... \n",
"35 Average sentiment polarity of the positive words \n",
"36 Minimum sentiment polarity of the positive words \n",
"37 Maximum sentiment polarity of the positive words \n",
"38 Average sentiment polarity of the negative words \n",
"39 Mimimum sentiment polarity of the negative words \n",
"40 Maximum sentiment polarity of the negative words \n",
"41 Subjectivity of the title \n",
"42 Polarity of the title "
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Now, let's load and visualize the features.**"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"[Figure: scatter matrix of nb_words_content and pp_uniq_words, with KDE plots on the diagonal]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from pandas import plotting\n",
"plotting.scatter_matrix(train_data.get([\"nb_words_content\", \"pp_uniq_words\"]), alpha=0.2,\n",
" figsize=(6, 6), diagonal='kde')"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/wbader/miniconda3/envs/tp-ml/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.\n",
" FutureWarning\n"
]
},
{
"data": {
"text/plain": [
""
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"[Figure: joint plot of nb_words_content and pp_uniq_words with a regression fit]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import seaborn as sns\n",
"sns.set_style('whitegrid')\n",
"\n",
"sns.jointplot(x=\"nb_words_content\", y=\"pp_uniq_words\", data=train_data, \n",
"              kind='reg', height=6, space=0)"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/wbader/miniconda3/envs/tp-ml/lib/python3.6/site-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.\n",
" FutureWarning\n"
]
},
{
"data": {
"text/plain": [
""
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.jointplot(train_data[\"pp_neg_words\"], target_data['Prediction'], \n",
" kind='reg', ylim=5000, height=6, space=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** Question: ** Change the features you display to explore relationships. What conclusions are you drawing from this exploratory analysis? Are you going to keep all the features in your predictors?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Answer:__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Data transformation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1 feature engineering\n",
"This notion includes all kinds of manual modification and creation of features. All are of course problem dependant.\n",
"\n",
"* __Encoding categorical features:__ if a K-categorical feature is not ordered (categorie 1 is as far to categorie 2 as to categorie 3 etc), then it must not be encoded by a single integer specifying the categorie. We can encode such feature by creating K-1 binary features encoding the belonging to k-th category. (see [link](http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features))\n",
"\n",
"* __Feature binarization:__ some continuous features can gain predictive power when binarized. For exemple, in some prediction tasks, weekdays could be split into $working\\ days$ and $not\\ working\\ days$. (see [link](http://scikit-learn.org/stable/modules/preprocessing.html#binarization))\n",
"\n",
"* __Imputation of missing values:__ there are multiple strategies to input missing values when required (see [link](http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values)).\n",
"\n",
"* __Dealing with time features or other periodic features:__ when considering the hour of the day as a feature, we can't encode it by the an integer between 1 and 24 as midnigth is as close to 11pm to 1am. An easy strategy to encode periodic features is to apply this transformation $x \\mapsto \\sin(\\frac{2\\pi x}{T})$ (T is the period). In the case of the hour of the day, it is $x \\mapsto \\sin(\\frac{2\\pi x}{24})$. \n",
"\n",
"* __Generating new features:__ you might want to combine the existing features into new ones that seem informative to you. It can be useful for exemple, notably when working with linear models, to generate polynomial features from the original ones. You can also use external data to transform your features; for instance, if one feature is a date, adding a feature that qualifies whether the day is a working day, a weekday or a holiday can be useful. \n",
"* ...\n",
"\n",
"In many practical cases, feature engineering is the key to obtaining a huge improvement in performance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** Question: ** How do you want to engineer the features of the challenge (first, you can start encoding the categorical features)? Keep thinking of this question all along the challenge."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us encode weekdays as a categorical feature rather than a periodic one. Remember this transformation later in the challenge: does it help your performance?"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
weekday_1
\n",
"
weekday_2
\n",
"
weekday_3
\n",
"
weekday_4
\n",
"
weekday_5
\n",
"
weekday_6
\n",
"
nb_words_title
\n",
"
nb_words_content
\n",
"
pp_uniq_words
\n",
"
pp_stop_words
\n",
"
...
\n",
"
pp_neg_words
\n",
"
pp_pos_words_in_nonneutral
\n",
"
ave_polar_pos
\n",
"
min_polar_pos
\n",
"
max_polar_pos
\n",
"
ave_polar_neg
\n",
"
min_polar_neg
\n",
"
max_polar_neg
\n",
"
subj_title
\n",
"
polar_title
\n",
"
\n",
" \n",
" \n",
"
\n",
"
2000
\n",
"
1
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
9
\n",
"
843
\n",
"
0.5358
\n",
"
2.092000e-09
\n",
"
...
\n",
"
0.019230
\n",
"
0.7143
\n",
"
0.4437
\n",
"
0.03333
\n",
"
1.0
\n",
"
-0.3160
\n",
"
-0.8000
\n",
"
-0.05
\n",
"
0.0
\n",
"
0.0
\n",
"
\n",
"
\n",
"
2001
\n",
"
0
\n",
"
1
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
9
\n",
"
805
\n",
"
0.4196
\n",
"
2.165000e-09
\n",
"
...
\n",
"
0.025710
\n",
"
0.5349
\n",
"
0.3081
\n",
"
0.05000
\n",
"
0.8
\n",
"
-0.3463
\n",
"
-0.7143
\n",
"
-0.10
\n",
"
0.9
\n",
"
0.3
\n",
"
\n",
"
\n",
"
2002
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
8
\n",
"
145
\n",
"
0.7594
\n",
"
1.163000e-08
\n",
"
...
\n",
"
0.007519
\n",
"
0.8333
\n",
"
0.3673
\n",
"
0.13640
\n",
"
0.5
\n",
"
-0.2000
\n",
"
-0.2000
\n",
"
-0.20
\n",
"
0.0
\n",
"
0.0
\n",
"
\n",
"
\n",
"
2003
\n",
"
0
\n",
"
1
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
12
\n",
"
201
\n",
"
0.6359
\n",
"
9.259000e-09
\n",
"
...
\n",
"
0.027030
\n",
"
0.7368
\n",
"
0.3721
\n",
"
0.13640
\n",
"
0.6
\n",
"
-0.4000
\n",
"
-0.4000
\n",
"
-0.40
\n",
"
0.0
\n",
"
0.0
\n",
"
\n",
"
\n",
"
2004
\n",
"
0
\n",
"
1
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
13
\n",
"
673
\n",
"
0.4609
\n",
"
2.500000e-09
\n",
"
...
\n",
"
0.021440
\n",
"
0.5625
\n",
"
0.3500
\n",
"
0.05000
\n",
"
0.6
\n",
"
-0.2435
\n",
"
-0.8000
\n",
"
-0.10
\n",
"
0.0
\n",
"
0.0
\n",
"
\n",
" \n",
"
\n",
"
5 rows × 48 columns
\n",
"
"
],
"text/plain": [
" weekday_1 weekday_2 weekday_3 weekday_4 weekday_5 weekday_6 \\\n",
"2000 1 0 0 0 0 0 \n",
"2001 0 1 0 0 0 0 \n",
"2002 0 0 0 0 0 0 \n",
"2003 0 1 0 0 0 0 \n",
"2004 0 1 0 0 0 0 \n",
"\n",
" nb_words_title nb_words_content pp_uniq_words pp_stop_words ... \\\n",
"2000 9 843 0.5358 2.092000e-09 ... \n",
"2001 9 805 0.4196 2.165000e-09 ... \n",
"2002 8 145 0.7594 1.163000e-08 ... \n",
"2003 12 201 0.6359 9.259000e-09 ... \n",
"2004 13 673 0.4609 2.500000e-09 ... \n",
"\n",
" pp_neg_words pp_pos_words_in_nonneutral ave_polar_pos min_polar_pos \\\n",
"2000 0.019230 0.7143 0.4437 0.03333 \n",
"2001 0.025710 0.5349 0.3081 0.05000 \n",
"2002 0.007519 0.8333 0.3673 0.13640 \n",
"2003 0.027030 0.7368 0.3721 0.13640 \n",
"2004 0.021440 0.5625 0.3500 0.05000 \n",
"\n",
" max_polar_pos ave_polar_neg min_polar_neg max_polar_neg subj_title \\\n",
"2000 1.0 -0.3160 -0.8000 -0.05 0.0 \n",
"2001 0.8 -0.3463 -0.7143 -0.10 0.9 \n",
"2002 0.5 -0.2000 -0.2000 -0.20 0.0 \n",
"2003 0.6 -0.4000 -0.4000 -0.40 0.0 \n",
"2004 0.6 -0.2435 -0.8000 -0.10 0.0 \n",
"\n",
" polar_title \n",
"2000 0.0 \n",
"2001 0.3 \n",
"2002 0.0 \n",
"2003 0.0 \n",
"2004 0.0 \n",
"\n",
"[5 rows x 48 columns]"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the weekday data and encode it using a dummy categorical encoding\n",
"weekday_data = pd.get_dummies(train_data['weekday'], prefix='weekday', drop_first=True)\n",
"\n",
"# Get the rest of the data\n",
"other_data = train_data.drop(['weekday'], axis=1)\n",
"\n",
"# Create a new data set by concatenation of the new weekday data and the old rest of the data\n",
"training_data = pd.concat([weekday_data, other_data], axis=1)\n",
"\n",
"# Print the created training data.\n",
"training_data.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Question:__ Repeat the process for the other categorical variable(s) in your data. Do not forget to apply your transformation to the test dataset as well!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Preprocessing data: standardization\n",
"You might want to consider standardizing your data as seen in Lab 02."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 Unsupervised projection\n",
"If your number of features is high, it may be useful to reduce it with an unsupervised step prior to supervised steps. \n",
"\n",
"We have already worked on a widly used dimentionality reduction method in `Lab 1`, the Principal Component Analysis. \n",
"\n",
"We will discuss in `Lab 5` the combinaison of dimentionality reduction and a predictor."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 Feature selection\n",
"See [scikit-learn's documentation](http://scikit-learn.org/stable/modules/feature_selection.html).\n",
"\n",
"It may be useful to select a restricted number of important features to increase their predictive power. When the number of feature is particularly bigger than the number od instance, this issue of major importance.\n",
"\n",
"Multiple strategies can be considered depending on the problem such like:\n",
"* considering the most varying features, condering the most correlated features to the output etc\n",
"* using feed forward selection procedure: recursively adding features one by one by incresing improvement of performance\n",
"* using embbeded feature selection like lasso or ElasticNet (see lab 5)\n",
"* computing feature importance (via bagging procedure like [randomized lasso](https://stat.ethz.ch/~nicolai/stability.pdf) or bagging trees (see lab 5) for exemple) and thresholding the feature.\n",
"* ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Model evaluation and model selection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.1 Our first classifier: Gaussian Naive Bayes\n",
"Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html \n",
"\n",
"In order to start thinking about model evaluation and model selection, we will convert the regression problem of the KaggleInClass challenge into a classification task in order to to work with the first classifier we studied in class: the Gaussian Naive Bayes.\n",
"\n",
"Our goal here is to try to classify points between astonishingly and not astonishingly shared articles. Based on the distribution of the number of shares, the separation can be set at 1800 shares."
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(2991,)\n",
"(2009,)\n"
]
}
],
"source": [
"# Transform output into a classification task.\n",
"y_clf = np.where(y_tr >= 1800, 1, 0)\n",
"print(np.where(y_clf==0)[0].shape)\n",
"print(np.where(y_clf==1)[0].shape)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [],
"source": [
"X_clf = training_data.values"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [],
"source": [
"# import Gaussian Naive Bayes\n",
"from sklearn.naive_bayes import GaussianNB"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [],
"source": [
"# create a Gaussian Naive Bayes classifier i.e. an instance of GaussianNB\n",
"gnb = GaussianNB()"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"GaussianNB()"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# fit the classifier to the data\n",
"gnb.fit(X_clf, y_clf)"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [],
"source": [
"# predict on the same data\n",
"y_pred = gnb.predict(X_clf)"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of mislabeled points out of a total 5000 points : 1963\n"
]
}
],
"source": [
"# compute the number of mislabeled articles\n",
"print(\"Number of mislabeled points out of a total %d points : %d\" % \\\n",
" (X_clf.shape[0], (y_clf != y_pred).sum()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note than all predictors implemented in the sklearn library are trained and applied via the `fit` and `predict` (or `predict_proba`) methods.\n",
"\n",
"**Question:** What are the parameters of the model we have trained? How many of them are they? How can you access them?"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"{'priors': None,\n",
" 'var_smoothing': 1e-09,\n",
" 'n_features_in_': 48,\n",
" 'epsilon_': 47.915030130169605,\n",
" 'classes_': array([0, 1]),\n",
" 'theta_': array([[ 1.93580742e-01, 2.05951187e-01, 1.87228352e-01,\n",
" 1.40421264e-01, 4.78100970e-02, 4.71414243e-02,\n",
" 1.04881311e+01, 5.45836844e+02, 5.34619425e-01,\n",
" 2.97559394e-02, 6.78008526e-01, 1.00320963e+01,\n",
" 6.79271147e+00, 4.03610832e+00, 1.17953862e+00,\n",
" 4.01036443e+00, 7.08793046e+00, 3.19324641e+00,\n",
" 2.41805416e+01, 9.28673353e+02, 2.70002006e+02,\n",
" 1.27069137e+04, 7.54202508e+05, 2.55922586e+05,\n",
" 1.01061819e+03, 5.12257472e+03, 2.91886433e+03,\n",
" 3.02787563e+03, 7.79266332e+03, 4.90240328e+03,\n",
" 1.72781849e-01, 1.52756931e-01, 2.46015109e-01,\n",
" 2.06135426e-01, 2.22310033e-01, 4.37093310e-01,\n",
" 1.17109337e-01, 3.90776316e-02, 1.66966225e-02,\n",
" 6.77679445e-01, 3.51399733e-01, 9.54050585e-02,\n",
" 7.45757773e-01, -2.52517482e-01, -5.11862414e-01,\n",
" -1.03064072e-01, 2.82964510e-01, 6.50785553e-02],\n",
" [ 1.80189149e-01, 1.71727227e-01, 1.64758586e-01,\n",
" 1.45345943e-01, 8.56147337e-02, 9.05923345e-02,\n",
" 1.02190144e+01, 5.87604778e+02, 5.23768940e-01,\n",
" 3.08611296e-02, 6.65375610e-01, 1.17177700e+01,\n",
" 8.35988054e+00, 4.86610254e+00, 1.31557989e+00,\n",
" 3.98208064e+00, 7.34942758e+00, 3.27177700e+00,\n",
" 3.11791936e+01, 1.29430762e+03, 3.49331508e+02,\n",
" 1.14555396e+04, 7.42223942e+05, 2.58327544e+05,\n",
" 1.20617870e+03, 6.05944052e+03, 3.34863081e+03,\n",
" 5.04087705e+03, 1.19179980e+04, 7.73627442e+03,\n",
" 2.08163688e-01, 1.20169836e-01, 1.69376142e-01,\n",
" 2.34597611e-01, 2.67692997e-01, 4.51727038e-01,\n",
" 1.28721942e-01, 4.10291648e-02, 1.62101981e-02,\n",
" 6.93852364e-01, 3.57977840e-01, 9.18679343e-02,\n",
" 7.74440284e-01, -2.58563280e-01, -5.19116252e-01,\n",
" -1.06148371e-01, 2.96920179e-01, 8.21279189e-02]]),\n",
" 'sigma_': array([[4.80711374e+01, 4.80785654e+01, 4.80672040e+01, 4.80357333e+01,\n",
" 4.79605544e+01, 4.79599492e+01, 5.23574670e+01, 2.00484574e+05,\n",
" 4.79333339e+01, 4.79439006e+01, 4.79381009e+01, 1.46923696e+02,\n",
" 1.29730972e+02, 1.08452677e+02, 6.43746048e+01, 4.85357853e+01,\n",
" 5.16087360e+01, 5.09308454e+01, 4.58833111e+03, 2.70535460e+06,\n",
" 7.88311247e+04, 3.16905236e+09, 4.67115687e+10, 1.77537953e+10,\n",
" 1.12425412e+06, 1.99926793e+07, 1.41124704e+06, 2.14915154e+08,\n",
" 9.25165735e+08, 3.47575175e+08, 4.79779402e+01, 4.79686235e+01,\n",
" 4.80040740e+01, 4.79957998e+01, 4.79949409e+01, 4.79285792e+01,\n",
" 4.79239664e+01, 4.79153324e+01, 4.79151514e+01, 4.79526437e+01,\n",
" 4.79259205e+01, 4.79202788e+01, 4.79762149e+01, 4.79303124e+01,\n",
" 4.79998731e+01, 4.79224631e+01, 4.80161744e+01, 4.79789879e+01],\n",
" [4.80627511e+01, 4.80572671e+01, 4.80526433e+01, 4.80392506e+01,\n",
" 4.79933150e+01, 4.79974155e+01, 5.22811992e+01, 2.40446097e+05,\n",
" 4.79339690e+01, 4.79449388e+01, 4.79387435e+01, 1.83723380e+02,\n",
" 1.63495322e+02, 1.14929954e+02, 6.22773608e+01, 4.85239674e+01,\n",
" 5.13130898e+01, 5.06365880e+01, 5.69946978e+03, 1.16386774e+07,\n",
" 3.31657609e+05, 1.62349785e+09, 4.96209108e+10, 1.84107850e+10,\n",
" 1.49577628e+06, 2.61482758e+07, 1.52266109e+06, 3.85017724e+08,\n",
" 1.53945604e+09, 5.95637982e+08, 4.79922245e+01, 4.79543917e+01,\n",
" 4.79738443e+01, 4.80042993e+01, 4.80102771e+01, 4.79282621e+01,\n",
" 4.79241411e+01, 4.79153239e+01, 4.79151323e+01, 4.79496336e+01,\n",
" 4.79259402e+01, 4.79200300e+01, 4.79752515e+01, 4.79308426e+01,\n",
" 4.79983999e+01, 4.79238575e+01, 4.80248025e+01, 4.79879665e+01]]),\n",
" 'class_count_': array([2991., 2009.]),\n",
" 'class_prior_': array([0.5982, 0.4018])}"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Hint\n",
"gnb.__dict__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 Model Evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You must have a look at http://scikit-learn.org/stable/modules/model_evaluation.html which shows and details a list of metrics for evaluating regression or classification models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the case of regression, the most commonly used metrics are :\n",
"* `mean squared errors`\n",
"* `mean absolute errors` which gives less importance to errors of very bad prediction and more importance to errors of good predictions as the following plot shows than `mean squared errors`\n",
"* `R2` (coefficient of determination) which provides a measure of how well future samples are likely to be predicted by the model."
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"x = np.arange(-2,2,0.01)\n",
"plt.plot(x,x*x, 'blue', label='$x^2$')\n",
"plt.plot(x,abs(x), 'orange', label='|x|')\n",
"plt.legend(loc=\"upper center\")\n",
"plt.show()"
]
},
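{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small worked illustration of the two curves plotted above, take two prediction errors, $e_1 = 0.5$ and $e_2 = 10$ (values chosen for the example, not taken from the challenge data):\n",
"\n",
"$$\\text{squared: } e_1^2 = 0.25,\\quad e_2^2 = 100 \\qquad\\qquad \\text{absolute: } |e_1| = 0.5,\\quad |e_2| = 10$$\n",
"\n",
"$$\\text{MSE} = \\frac{0.25 + 100}{2} = 50.125 \\qquad\\qquad \\text{MAE} = \\frac{0.5 + 10}{2} = 5.25$$\n",
"\n",
"Squaring shrinks the error below 1 and inflates the error above 1, so the large error accounts for $100 / 100.25 \\approx 99.8\\%$ of the squared total but only $10 / 10.5 \\approx 95.2\\%$ of the absolute total."
]
},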
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the case of classification, lots of metrics are used depending on the considered problem:\n",
"\n",
"* `accuracy` is a default performance measure computing the proportion of missclassified tested instances\n",
"* `sensitivity` or 'true positive rate' is the proportion of well classified positive samples\n",
"* `specificity` or 'true negative rate' is the proportion of well classified negative samples\n",
"\n",
"* `precision` is the ability of the classifier not to label as positive a sample that is negative. Like in the case of cancer, we really want to avoid diagnose a cancer to somebody who does not have one.\n",
"* `recall` is the ability of the classifier to find all the positive samples.\n",
" \n",
"* `the area under the precision-recall curve`\n",
"* `the area under the Receiver operating characteristic (ROC) curve` "
]
},
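{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rates listed above can be written in terms of confusion-matrix counts. The symbols $TP$, $FP$, $TN$ and $FN$ (true/false positives/negatives) are notation introduced here for this sketch; they are not used elsewhere in the lab:\n",
"\n",
"$$\\text{accuracy} = \\frac{TP + TN}{TP + TN + FP + FN}, \\qquad \\text{sensitivity} = \\text{recall} = \\frac{TP}{TP + FN}$$\n",
"\n",
"$$\\text{specificity} = \\frac{TN}{TN + FP}, \\qquad \\text{precision} = \\frac{TP}{TP + FP}$$\n",
"\n",
"For the training-set predictions made earlier, 1963 of the 5000 points were mislabeled, so $TP + TN = 5000 - 1963 = 3037$ and the accuracy is $3037 / 5000 \\approx 0.607$."
]
},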
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question** Use the sklearn library to compute the accuracy score of the above prediction."
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.607\n"
]
}
],
"source": [
"from sklearn import metrics\n",
"# Score the predictions\n",
"print(\"Accuracy: %.3f\" % metrics.accuracy_score(y_clf, y_pred) # TODO\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Building an ROC curve requires to use the probability estimates for the test data points *before* they are thresholded."
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(5000, 2)\n"
]
}
],
"source": [
"# Predict probability estimates instead of 0/1 class labels\n",
"y_prob = gnb.predict_proba(X_clf)\n",
"print(y_prob.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question:** `y_prob` returns two values for each data point because it returns one probability estimate per class for each data point. The order in which the classes appear are given by `gnb.classes_ `. How do you get the 1-dimensional array that only contains the estimated probability for each point to belong to the positive class?"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pos_index = list(gnb.classes_).index(1)\n",
"\n",
"# ROC curve\n",
"fpr, tpr, thresholds = metrics.roc_curve(y_clf, y_prob[:, pos_index], pos_label=1)\n",
"\n",
"# Area under the ROC curve\n",
"auc = metrics.auc(fpr, tpr)\n",
"\n",
"# Plot the ROC curve\n",
"plt.plot(fpr, tpr, '-', color='orange', label='AUC = %0.3f' % auc)\n",
"\n",
"plt.xlabel('False Positive Rate', fontsize=16)\n",
"plt.ylabel('True Positive Rate', fontsize=16)\n",
"plt.title('ROC curve: Gaussian Naive Bayes', fontsize=16)\n",
"plt.legend(loc=\"lower right\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question:** What is it problematic to have evaluated our classifier on the training data? "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Answer:__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.3 Model Selection: cross-validation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will now use the function `make_Kfolds` you have implemented in the first section to evaluate the accuracy of your model via a 5-fold cross-validation scheme. We will compare the results you obtained with those you get with scikit-learn's implementation of the cross-validation scheme. "
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"200\n"
]
}
],
"source": [
"# Set up a cross-validation with make_Kfolds\n",
"n_instance = len(y)\n",
"perso_folds = make_Kfolds(n_instance, 5) # TODO\n",
"\n",
"# Set up a cross-validation with sklearn\n",
"sk_folds = model_selection.KFold(n_splits=5, shuffle=True).split(X,y) # TODO"
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Your own cv-scheme: Accuracy: 0.515\n",
"Sklearn cv-scheme: Accuracy: 0.478\n"
]
}
],
"source": [
"# Assess performance using the cross_validate function you have implemented\n",
"# On perso_folds\n",
"gnb = GaussianNB()\n",
"# TODO use cross_validate and perso_folds\n",
"y_prob_cv_perso = cross_validate(X, y, gnb, perso_folds) > 0.5\n",
"print(\"Your own cv-scheme: Accuracy: %.3f\" % metrics.accuracy_score(y, y_prob_cv_perso) # TODO\n",
" )\n",
"\n",
"# On sk_folds\n",
"gnb = GaussianNB()\n",
"# TODO use cross_validate and perso_folds\n",
"y_prob_cv_sk = cross_validate(X, y, gnb, sk_folds) > 0.5\n",
"print(\"Sklearn cv-scheme: Accuracy: %.3f\" % metrics.accuracy_score(y, y_prob_cv_sk) # TODO\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will now plot the ROC curve corresponding to your predictions."
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Your own cv-scheme: AUROC: 0.507\n"
]
},
{
"data": {
"text/plain": [
""
]
},
"execution_count": 137,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Compute the ROC curve corresponding to the y_prob_cv_perso predictions\n",
"fpr, tpr, thresholds = metrics.roc_curve(y, y_prob_cv_perso, pos_label=1)\n",
"\n",
"# Area under the ROC curve\n",
"auc = metrics.auc(fpr, tpr)\n",
"print(\"Your own cv-scheme: AUROC: %.3f\" % auc)\n",
"\n",
"# Plot the ROC curve\n",
"plt.plot(fpr, tpr, '-', color='orange', label='AUC = %0.3f' % auc)\n",
"\n",
"# TODO: plot in blue the ROC curve corresponding to the y_prob_cv_sk predictions\n",
"\n",
"plt.xlabel('False Positive Rate', fontsize=16)\n",
"plt.ylabel('True Positive Rate', fontsize=16)\n",
"plt.title('ROC curve: Gaussian Naive Bayes', fontsize=16)\n",
"plt.legend(loc=\"lower right\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Question:__ The `sklearn.cross_validation` module provides some utilities to make cross-validated predictions. Compare the results you obtained to what they return.\n",
"\n",
"Documentation: [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.515\n"
]
}
],
"source": [
"gnb = GaussianNB()\n",
"skf = model_selection.StratifiedKFold(5, shuffle=True, random_state=91)\n",
"\n",
"# Use model_selection.cross_val_score to compute the average cross-validated roc_auc score \n",
"# of gnb on (X_clf, y_clf), using the skf iterator.\n",
"cv_aucs = model_selection.cross_val_score(gnb, X, y, cv=5) # TODO\n",
"\n",
"print(np.mean(cv_aucs))\n",
"\n",
"# Note that averaging the AUCs obtained over 10 folds is not the same as \n",
"# globally computing the AUC for the predictions made within the cross-validation loop."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Question__ Compare scikit-learn's implementation of the cross-validation with yours."
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross-validated accuracy: 0.515\n",
"Cross-validated accuracy: 0.512\n"
]
}
],
"source": [
"gnb = GaussianNB()\n",
"skf = model_selection.StratifiedKFold(5, shuffle=True, random_state=91)\n",
"\n",
"# Compute the cross-validation accuracy using model_selection.cross_val_predict\n",
"y_pred_sk = model_selection.cross_val_predict(gnb, X, y, cv=5 # TODO\n",
" )\n",
"print(\"Cross-validated accuracy: %.3f\" % metrics.accuracy_score(y, y_pred_sk # TODO\n",
" ))\n",
"\n",
"# Compute the cross-validation accuracy using your own cross_validate function\n",
"y_prob_cv = cross_validate(X, y, gnb, skf.split(X, y)# TODO\n",
" )\n",
"# Transform y_prob_cv into a vector of binary predictions\n",
"y_pred_perso = y_prob_cv > 0.5 # TODO \n",
"print(\"Cross-validated accuracy: %.3f\" % metrics.accuracy_score(y, y_pred_perso # TODO\n",
" ))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Question:__ Does stratifying the cross-validation make a difference?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Answer:__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting started on the final project\n",
"\n",
"You will be evaluated on a final project in the form of a data challenge competition hosted on codalab.\n",
"The data challenge is not public: you can access it through the following link: https://codalab.lisn.upsaclay.fr/competitions/755?secret_key=95ee48c1-fc46-406f-b08b-25a7884d2a61\n",
"\n",
"- Register on the plateform\n",
"- Download the starting kit. It contains the data, a README file, and a script to help with creating the submission bundle.\n",
"- Read the README file.\n",
"- Run the submission script, and submit the submission bundle.\n",
"\n",
"You are all set to start working on the final project! Choose to create teams of two or work on your own, and get started training models!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}