3 years ago · cd3b30dcd4
--- a/01-PCA.ipynb
+++ b/01-PCA.ipynb
@@ -0,0 +1,618 @@
 
				+{
			
 
				+ "cells": [
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "# Lab 1: Principal Components Analysis"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## 1. How to use Jupyter\n",
			
 
				+    "All our labs will be done in Jupyter notebooks. You should run your own instance of Jupyter, so that you can interact with the notebook, modify it and run Python code in it! Follow the instructions at https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html \n",
			
 
				+    "\n",
			
 
				+    "A Jupyter notebook is a web application that allows you to create and share documents (such as this .ipynb notebook) that contain live code, visualisations and explanatory text (with equations).\n",
			
 
				+    "\n",
			
 
				+    "Here are some tips on using a Jupyter notebook:\n",
			
 
				+    "* Each block of text is contained in a _cell_. A cell can be either raw text, code, or markdown text (such as this cell). For more info on markdown syntax, follow the [guide](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html).\n",
			
 
				+    "* You can run a cell by clicking inside it and hitting `Shift+Enter` (or the play button in the toolbar)."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "2 + 2  # hit Shift+Enter to run"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "* If you want to create a new cell below the one you're running, hit `Alt+Enter` (or the plus button in the toolbar).\n",
			
 
				+    "\n",
			
 
				+    "Some tips on using a Jupyter notebook with Python:\n",
			
 
				+    "* A notebook behaves like an interactive Python shell! This means that\n",
			
 
				+    "    * classes, functions, and variables defined at the cell level have global scope throughout the notebook\n",
			
 
				+    "    * hitting `Tab` will autocomplete the keyword you have started typing\n",
			
 
				+    "    * typing a question mark after a function name will load the interactive help for this function.\n",
			
 
				+    "* Jupyter has special commands (shortcuts, if you will) called _magics_. For instance, `%%bash` will run the contents of a cell as bash code, `%paste` will allow you to paste a block of code while retaining its formatting, and `%matplotlib inline` will configure the visualization library matplotlib to display its plots inline, that is, below the cell. Here's a full list: http://ipython.readthedocs.io/en/stable/interactive/magics.html \n",
			
 
				+    "* Learn more about the interactive Python shell here: http://ipython.readthedocs.io/en/stable/interactive/tutorial.html\n",
			
 
				+    "\n",
			
 
				+    "For more info on Jupyter: https://jupyter.org/"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## 2. PCA of the Olympic Athletes data\n",
			
 
				+    "\n",
			
 
				+    "In this lab, we will import data (`./data/decathlon.txt`) relating to the top performances in the Men's decathlon at the 2004 summer Olympics in Athens (https://en.wikipedia.org/wiki/Athletics_at_the_2004_Summer_Olympics_%E2%80%93_Men%27s_decathlon) and Decastar 2004 in Talence (https://fr.wikipedia.org/wiki/D%C3%A9castar). (Both events were won by Roman Šebrle).\n",
			
 
				+    "\n",
			
 
				+    "###  Data description\n",
			
 
				+    "\n",
			
 
				+    "* The data set consists of 41 rows and 13 columns.\n",
			
 
				+    "* The file starts with a header row describing the content of the columns; each of the 41 remaining rows is one observation (an athlete's performance at one meeting), labelled by the athlete's name.\n",
			
 
				+    "* Columns 1 to 12 are continuous variables: the first ten columns correspond to the performance of the athletes for each event of the decathlon and columns 11 and 12 correspond respectively to the rank and the points obtained.\n",
			
 
				+    "* The last column is a categorical variable corresponding to the athletic meeting (2004 Olympic Games or 2004 Decastar).\n",
			
 
				+    "\n",
			
 
				+    "### Loading and manipulating the data with pandas\n",
			
 
				+    "pandas is a data analysis library for Python. With pandas we can import our Olympics athletes data into a structured object called a *data frame*, which we can then manipulate with pandas' built-in tools. Here we load the dataset into a data frame and begin to examine it with pandas."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "import pandas as pd\n",
			
 
				+    "my_data = pd.read_csv('data/decathlon.txt', sep=\"\\t\")  # load data"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "print(type(my_data))  # display my_data data type"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true,
			
 
				+    "scrolled": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "my_data.head(n=5)  # adjust n to view more data"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "### Accessing data\n",
			
 
				+    "\n",
			
 
				+    "* We can select a column by name. Note the returned object is also a pandas object (a *series*--a single-columned DataFrame), so we can use the `head()` function to view the first few rows only."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true,
			
 
				+    "scrolled": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "my_data['100m'].head()"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "* Or a list for multiple columns."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "columns = ['100m', '400m']\n",
			
 
				+    "my_data[columns].head()"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "* We can select the rows satisfying one or more conditions by passing a boolean series."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "my_data[my_data['Competition']=='OlympicG'].head()"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "* To *index* a row, we can use the data frame's `loc` object. This behaves like a dictionary whose keys are the data frame's index."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "my_data.loc['SEBRLE']"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true,
			
 
				+    "scrolled": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "my_data.count()  # summarise counts of data"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "### Manipulating data"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "list(my_data.columns)  # get the names of the columns"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "print(my_data.shape)  # get the shape (rows x columns)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true,
			
 
				+    "scrolled": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "print(my_data.values)  # get the content as a numpy array"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "print(my_data.dtypes)  # get the data type (dtype) of each column"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "### Visualisation"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "To create visualisations, we'll use `matplotlib`, the standard plotting library for Python. `matplotlib` can be used in different ways through its built-in `pyplot` interface, which supports a range of coding styles, from a high-level state-machine environment to a lower-level object-oriented approach. The latter is typically recommended. Another interface, `pylab`, is no longer recommended (http://matplotlib.org/faq/usage_faq.html#coding-styles).\n",
			
 
				+    "\n",
			
 
				+    "We also use a Jupyter magic command for inline plotting."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "%matplotlib inline\n",
			
 
				+    "# Explicit imports (preferred over the discouraged %pylab magic)\n",
			
 
				+    "import matplotlib.pyplot as plt\n",
			
 
				+    "import numpy as np"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "We can optionally toggle vector graphics for Jupyter display, giving us a crisper plot (this can be expensive though, so beware!):"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "from IPython.display import Image, set_matplotlib_formats \n",
			
 
				+    "# set_matplotlib_formats('pdf') # toggle vector graphics for a crisp plot!\n",
			
 
				+    "set_matplotlib_formats('svg')"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "# basic visualization: athletes' performances depending on two disciplines\n",
			
 
				+    "my_data.plot(kind='scatter', x='400m', y='Shot.put', s=50,)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "A scatterplot matrix allows us to visualize:\n",
			
 
				+    "  * on the diagonal, the density estimation for each of the features\n",
			
 
				+    "  * on each of the off-diagonal plots, a scatterplot between two of the features. Each dot represents a sample."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "from pandas.plotting import scatter_matrix\n",
			
 
				+    "scatter_matrix(my_data.get(['Shot.put','High.jump', '400m']), alpha=0.2,\n",
			
 
				+    "               figsize=(6, 6), diagonal='kde');"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true,
			
 
				+    "scrolled": false
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "# fancier plot with seaborn : https://seaborn.pydata.org/\n",
			
 
				+    "import seaborn as sns  # seaborn.apionly is deprecated\n",
			
 
				+    "sns.set_style('whitegrid')\n",
			
 
				+    "\n",
			
 
				+    "sns.jointplot(x='Shot.put', y='High.jump', data=my_data,\n",
			
 
				+    "              kind='kde', height=6, space=0)\n",
			
 
				+    "\n",
			
 
				+    "# looking at correlated features\n",
			
 
				+    "sns.jointplot(x='Shot.put', y='Discus', data=my_data,\n",
			
 
				+    "              kind='reg', height=6, space=0)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "### Cleaning data"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "# Remove columns we don't need (we're only interested in performance in the different sports)\n",
			
 
				+    "data = my_data.drop(['Points', 'Rank', 'Competition'], axis=1)\n",
			
 
				+    "\n",
			
 
				+    "# Verify new column headers\n",
			
 
				+    "data.columns"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "### Use scikit-learn to find the PCs\n",
			
 
				+    "\n",
			
 
				+    "In this course, we will rely heavily on the [scikit-learn](http://scikit-learn.org/stable/index.html) machine learning toolbox, which implements most classical (non-deep) machine learning algorithms. Here, we will use scikit-learn to compute the PCs. A useful resource is the online documentation: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "#### Data standardization\n",
			
 
				+    "Recall that PCA works best on standardised data (mean 0, standard deviation 1). "
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "import numpy as np\n",
			
 
				+    "\n",
			
 
				+    "# convert the data frame to a numpy array\n",
			
 
				+    "X = data.values\n",
			
 
				+    "\n",
			
 
				+    "# TODO: standardise the data"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "from sklearn import preprocessing\n",
			
 
				+    "\n",
			
 
				+    "std_scale = preprocessing.StandardScaler().fit(X)\n",
			
 
				+    "X_scaled = std_scale.transform(X)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "#### Computing 2 principal components"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "from sklearn import decomposition\n",
			
 
				+    "\n",
			
 
				+    "pca = decomposition.PCA(n_components=2)\n",
			
 
				+    "pca.fit(X_scaled)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "**Question:** Use `pca.transform` to project the data onto its principal components."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "# TODO: project X on principal components\n",
			
 
				+    "X_projected = "
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "`pca.explained_variance_ratio_` gives the fraction of the total variance explained by each of the components."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "print(pca.explained_variance_ratio_)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "**Question:** How is `pca.explained_variance_ratio_` computed? Check this is the case by computing it yourself."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "tot_var = np.var(X_scaled, axis=0).sum()\n",
			
 
				+    "print((1 / tot_var) * np.var(X_projected, axis=0))"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "#### Projecting the data onto its principal components\n",
			
 
				+    " \n",
			
 
				+    "We will plot the fraction of variance explained by each of the first 10 principal components.\n",
			
 
				+    "\n",
			
 
				+    "__Question:__ Compute the 10 first PCs"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "# TODO: compute the 10 first PCs\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "pca.explained_variance_ratio_.shape"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "plt.bar(np.arange(10), pca.explained_variance_ratio_)\n",
			
 
				+    "plt.xlim([-1, 9])\n",
			
 
				+    "plt.xlabel(\"Number of PCs\", fontsize=16)\n",
			
 
				+    "plt.ylabel(\"Fraction of variance explained\", fontsize=16)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "To better understand the information captured by the principal components, we can consider `pca.components_`. Each row of this array is one principal component, that is, one column of $\\mathbf{W}$ (for $M = 2$)."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "pcs = pca.components_\n",
			
 
				+    "print(pcs[0])"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "We can display each row of $\\mathbf{W}$ in a 2D plot whose x-axis gives its contribution to the first component and y-axis to the second component. Note, whereas before we were visualising the projected data, $\\mathbf{Z}$, now we are visualising the projections, $\\mathbf{W}$. This indicates how the features cluster i.e. if a pair of feature projections are close, observations will tend to be similarly-valued over those features."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "fig = plt.figure(figsize=(6, 5))\n",
			
 
				+    "ax = fig.add_subplot(1, 1, 1)\n",
			
 
				+    "ax.set_xlim([-0.7, 0.7])\n",
			
 
				+    "ax.set_ylim([-0.7, 0.7])\n",
			
 
				+    "\n",
			
 
				+    "for i, (x, y) in enumerate(zip(pcs[0, :], pcs[1, :])):\n",
			
 
				+    "    # plot line between origin and point (x, y)\n",
			
 
				+    "    ax.plot([0, x], [0, y], color='k')\n",
			
 
				+    "    # display the label of the point\n",
			
 
				+    "    ax.text(x, y, data.columns[i], fontsize=14)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "**Question:** Based on the two previous graphs, can you find a meaning for the two principal components?"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": []
			
 
				+  }
			
 
				+ ],
			
 
				+ "metadata": {
			
 
				+  "kernelspec": {
			
 
				+   "display_name": "Python 3",
			
 
				+   "language": "python",
			
 
				+   "name": "python3"
			
 
				+  },
			
 
				+  "language_info": {
			
 
				+   "codemirror_mode": {
			
 
				+    "name": "ipython",
			
 
				+    "version": 3
			
 
				+   },
			
 
				+   "file_extension": ".py",
			
 
				+   "mimetype": "text/x-python",
			
 
				+   "name": "python",
			
 
				+   "nbconvert_exporter": "python",
			
 
				+   "pygments_lexer": "ipython3",
			
 
				+   "version": "3.4.3"
			
 
				+  }
			
 
				+ },
			
 
				+ "nbformat": 4,
			
 
				+ "nbformat_minor": 2
			
 
				+}
			
--- a/02-FeatureProcessing.ipynb
+++ b/02-FeatureProcessing.ipynb
--- a/03-ProjectIntro.ipynb
+++ b/03-ProjectIntro.ipynb
--- a/LICENSE
+++ b/LICENSE
@@ -0,0 +1,21 @@
 
				+MIT License
			
 
				+
			
 
				+Copyright (c) 2019 Chloe-Agathe Azencott
			
 
				+
			
 
				+Permission is hereby granted, free of charge, to any person obtaining a copy
			
 
				+of this software and associated documentation files (the "Software"), to deal
			
 
				+in the Software without restriction, including without limitation the rights
			
 
				+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
			
 
				+copies of the Software, and to permit persons to whom the Software is
			
 
				+furnished to do so, subject to the following conditions:
			
 
				+
			
 
				+The above copyright notice and this permission notice shall be included in all
			
 
				+copies or substantial portions of the Software.
			
 
				+
			
 
				+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
			
 
				+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
			
 
				+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
			
 
				+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
			
 
				+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
			
 
				+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
			
 
				+SOFTWARE.
			
--- a/README.md
+++ b/README.md
@@ -0,0 +1,35 @@
 
				+# hpc-ai-ml-2019
			
 
				+This repository holds the computer labs for the Introduction to Machine Learning course of the 2019-2020 HPC-AI MSc 
			
 
				+https://www.hpc-ai.mines-paristech.fr/
			
 
				+
			
 
				+## Requirements
			
 
				+The labs were developed for Python3. All required packages are part of the [Anaconda platform](https://www.anaconda.com/download/) so you can simply install Anaconda3 on your machine. If you'd rather install just the required packages with pip, that is also possible. The labs were developed for Python 3.4.3, with the following libraries:
			
 
				+* numpy 1.16.5
			
 
				+* scipy 1.2.2
			
 
				+* matplotlib 2.2.4
			
 
				+* pandas 0.22.0
			
 
				+* seaborn 0.9.0
			
 
				+* sklearn 0.19.2
			
 
				+
			
 
				+To __check your installation__, try launching Jupyter (e.g. by typing `jupyter notebook` in a shell), navigating to where you have downloaded/pulled the first lab (.ipynb file) and opening it. In that notebook (or in a Python terminal), you should be able to run 
			
 
				+  ```python
			
 
				+  import sklearn
			
 
				+  import pandas
			
 
				+  import seaborn
			
 
				+  import matplotlib
			
 
				+  ```
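If the imports succeed but you want to compare your versions with the ones listed above, something like the following may help. It is only a sketch: the helper name `installed_versions` and the `EXPECTED_VERSIONS` dictionary are made up for this example, and it assumes each package exposes a `__version__` attribute (packages that do not are reported as "unknown").

```python
import importlib

# Versions listed under "Requirements"; keys are the import names
# (scikit-learn is imported as "sklearn").
EXPECTED_VERSIONS = {
    "numpy": "1.16.5",
    "scipy": "1.2.2",
    "matplotlib": "2.2.4",
    "pandas": "0.22.0",
    "seaborn": "0.9.0",
    "sklearn": "0.19.2",
}


def installed_versions(names):
    """Map each import name to its version string, or None if the import fails."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:  # a broken install may raise something other than ImportError
            found[name] = None
        else:
            found[name] = getattr(module, "__version__", "unknown")
    return found


for name, version in installed_versions(EXPECTED_VERSIONS).items():
    status = "not installed" if version is None else version
    print("{}: found {}, labs developed with {}".format(
        name, status, EXPECTED_VERSIONS[name]))
```

A different version is not necessarily a problem; the printout is mainly useful when a lab cell fails and you want to rule out a version mismatch.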
			
 
				+
			
 
				+## Program
			
 
				+* Lab 1: [Principal Component Analysis](https://github.com/chagaz/hpc-ai-ml-2019/blob/master/01-PCA.ipynb) (1h)
			
 
				+* Lab 2: [Data normalization](https://github.com/chagaz/hpc-ai-ml-2019/blob/master/02-FeatureProcessing.ipynb) (1h)
			
 
				+* Lab 3: [Introduction to the KaggleInClass project](https://github.com/chagaz/hpc-ai-ml-2019/blob/master/03-ProjectIntro.ipynb) (2h)
			
 
				+* Lab 4: [Linear and logistic regression](https://github.com/chagaz/hpc-ai-ml-2019/blob/master/04-Linear%20and%20logistic%20regressions.ipynb) (1h) 
			
 
				+* Lab 5: [Regularized linear regression](https://github.com/chagaz/hpc-ai-ml-2019/blob/master/05-Regularization.ipynb) (1h) 
			
 
				+* Lab 6: [Nearest neighbors](https://github.com/chagaz/hpc-ai-ml-2019/blob/master/06-NearestNeighbors.ipynb) (1h) 
			
 
				+* Lab 7: [Trees and Forests](https://github.com/chagaz/hpc-ai-ml-2019/blob/master/07-TreesAndForests.ipynb) (1h) 
			
 
				+* Lab 8: [Support Vector Machines](https://github.com/chagaz/hpc-ai-ml-2019/blob/master/08-SVM.ipynb) (1h)
			
 
				+* Lab 9: Work on the project (3h) 
			
 
				+
			
 
				+
			
 
				+### Acknowledgements
			
 
				+These notebooks are adapted from notebooks previously created with the help of Judith Abecassis, Joseph Boyd, Arthur Imber, Benoit Playe and Mihir 
			
--- a/data/decathlon.txt
+++ b/data/decathlon.txt
@@ -0,0 +1,42 @@
 
				+"100m"	"Long.jump"	"Shot.put"	"High.jump"	"400m"	"110m.hurdle"	"Discus"	"Pole.vault"	"Javeline"	"1500m"	"Rank"	"Points"	"Competition"
			
 
				+"SEBRLE"	11.04	7.58	14.83	2.07	49.81	14.69	43.75	5.02	63.19	291.7	1	8217	"Decastar"
			
 
				+"CLAY"	10.76	7.4	14.26	1.86	49.37	14.05	50.72	4.92	60.15	301.5	2	8122	"Decastar"
			
 
				+"KARPOV"	11.02	7.3	14.77	2.04	48.37	14.09	48.95	4.92	50.31	300.2	3	8099	"Decastar"
			
 
				+"BERNARD"	11.02	7.23	14.25	1.92	48.93	14.99	40.87	5.32	62.77	280.1	4	8067	"Decastar"
			
 
				+"YURKOV"	11.34	7.09	15.19	2.1	50.42	15.31	46.26	4.72	63.44	276.4	5	8036	"Decastar"
			
 
				+"WARNERS"	11.11	7.6	14.31	1.98	48.68	14.23	41.1	4.92	51.77	278.1	6	8030	"Decastar"
			
 
				+"ZSIVOCZKY"	11.13	7.3	13.48	2.01	48.62	14.17	45.67	4.42	55.37	268	7	8004	"Decastar"
			
 
				+"McMULLEN"	10.83	7.31	13.76	2.13	49.91	14.38	44.41	4.42	56.37	285.1	8	7995	"Decastar"
			
 
				+"MARTINEAU"	11.64	6.81	14.57	1.95	50.14	14.93	47.6	4.92	52.33	262.1	9	7802	"Decastar"
			
 
				+"HERNU"	11.37	7.56	14.41	1.86	51.1	15.06	44.99	4.82	57.19	285.1	10	7733	"Decastar"
			
 
				+"BARRAS"	11.33	6.97	14.09	1.95	49.48	14.48	42.1	4.72	55.4	282	11	7708	"Decastar"
			
 
				+"NOOL"	11.33	7.27	12.68	1.98	49.2	15.29	37.92	4.62	57.44	266.6	12	7651	"Decastar"
			
 
				+"BOURGUIGNON"	11.36	6.8	13.46	1.86	51.16	15.67	40.49	5.02	54.68	291.7	13	7313	"Decastar"
			
 
				+"Sebrle"	10.85	7.84	16.36	2.12	48.36	14.05	48.72	5	70.52	280.01	1	8893	"OlympicG"
			
 
				+"Clay"	10.44	7.96	15.23	2.06	49.19	14.13	50.11	4.9	69.71	282	2	8820	"OlympicG"
			
 
				+"Karpov"	10.5	7.81	15.93	2.09	46.81	13.97	51.65	4.6	55.54	278.11	3	8725	"OlympicG"
			
 
				+"Macey"	10.89	7.47	15.73	2.15	48.97	14.56	48.34	4.4	58.46	265.42	4	8414	"OlympicG"
			
 
				+"Warners"	10.62	7.74	14.48	1.97	47.97	14.01	43.73	4.9	55.39	278.05	5	8343	"OlympicG"
			
 
				+"Zsivoczky"	10.91	7.14	15.31	2.12	49.4	14.95	45.62	4.7	63.45	269.54	6	8287	"OlympicG"
			
 
				+"Hernu"	10.97	7.19	14.65	2.03	48.73	14.25	44.72	4.8	57.76	264.35	7	8237	"OlympicG"
			
 
				+"Nool"	10.8	7.53	14.26	1.88	48.81	14.8	42.05	5.4	61.33	276.33	8	8235	"OlympicG"
			
 
				+"Bernard"	10.69	7.48	14.8	2.12	49.13	14.17	44.75	4.4	55.27	276.31	9	8225	"OlympicG"
			
 
				+"Schwarzl"	10.98	7.49	14.01	1.94	49.76	14.25	42.43	5.1	56.32	273.56	10	8102	"OlympicG"
			
 
				+"Pogorelov"	10.95	7.31	15.1	2.06	50.79	14.21	44.6	5	53.45	287.63	11	8084	"OlympicG"
			
 
				+"Schoenbeck"	10.9	7.3	14.77	1.88	50.3	14.34	44.41	5	60.89	278.82	12	8077	"OlympicG"
			
 
				+"Barras"	11.14	6.99	14.91	1.94	49.41	14.37	44.83	4.6	64.55	267.09	13	8067	"OlympicG"
			
 
				+"Smith"	10.85	6.81	15.24	1.91	49.27	14.01	49.02	4.2	61.52	272.74	14	8023	"OlympicG"
			
 
				+"Averyanov"	10.55	7.34	14.44	1.94	49.72	14.39	39.88	4.8	54.51	271.02	15	8021	"OlympicG"
			
 
				+"Ojaniemi"	10.68	7.5	14.97	1.94	49.12	15.01	40.35	4.6	59.26	275.71	16	8006	"OlympicG"
			
 
				+"Smirnov"	10.89	7.07	13.88	1.94	49.11	14.77	42.47	4.7	60.88	263.31	17	7993	"OlympicG"
			
 
				+"Qi"	11.06	7.34	13.55	1.97	49.65	14.78	45.13	4.5	60.79	272.63	18	7934	"OlympicG"
			
 
				+"Drews"	10.87	7.38	13.07	1.88	48.51	14.01	40.11	5	51.53	274.21	19	7926	"OlympicG"
			
 
				+"Parkhomenko"	11.14	6.61	15.69	2.03	51.04	14.88	41.9	4.8	65.82	277.94	20	7918	"OlympicG"
			
 
				+"Terek"	10.92	6.94	15.15	1.94	49.56	15.12	45.62	5.3	50.62	290.36	21	7893	"OlympicG"
			
 
				+"Gomez"	11.08	7.26	14.57	1.85	48.61	14.41	40.95	4.4	60.71	269.7	22	7865	"OlympicG"
			
 
				+"Turi"	11.08	6.91	13.62	2.03	51.67	14.26	39.83	4.8	59.34	290.01	23	7708	"OlympicG"
			
 
				+"Lorenzo"	11.1	7.03	13.22	1.85	49.34	15.38	40.22	4.5	58.36	263.08	24	7592	"OlympicG"
			
 
				+"Karlivans"	11.33	7.26	13.3	1.97	50.54	14.98	43.34	4.5	52.92	278.67	25	7583	"OlympicG"
			
 
				+"Korkizoglou"	10.86	7.07	14.81	1.94	51.16	14.96	46.07	4.7	53.05	317	26	7573	"OlympicG"
			
 
				+"Uldal"	11.23	6.99	13.53	1.85	50.95	15.09	43.01	4.5	60	281.7	27	7495	"OlympicG"
			
 
				+"Casarsa"	11.36	6.68	14.92	1.94	53.2	15.39	48.66	4.4	58.62	296.12	28	7404	"OlympicG"
			
--- a/data/mushrooms.csv
+++ b/data/mushrooms.csv
--- a/data/small_Breast_Ovary.csv
+++ b/data/small_Breast_Ovary.csv
--- a/data/small_Endometrium_Uterus.csv
+++ b/data/small_Endometrium_Uterus.csv
--- a/data/winequality-white.csv
+++ b/data/winequality-white.csv