{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 1: Principal Components Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. How to use Jupyter\n", "All our labs will be done in Jupyter notebooks. You should run your own instance of Jupyter, so that you can interact with the notebook, modify it and run Python code in it! Follow the instructions at https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html \n", "\n", "A Jupyter notebook is a web application that allows you to create and share documents (such as this .ipynb notebook) that contain live code, visualisations and explanatory text (with equations).\n", "\n", "Here are some tips on using a Jupyter notebook:\n", "* Each block of text is contained in a _cell_. A cell can be either raw text, code, or markdown text (such as this cell). For more info on markdown syntax, follow the [guide](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html).\n", "* You can run a cell by clicking inside it and hitting `Shift+Enter` (or the play button in the toolbar)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "2 + 2 # hit Shift+Enter to run" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* If you want to create a new cell below the one you're running, hit `Alt+Enter` (or the plus button in the toolbar).\n", "\n", "Some tips on using a Jupyter notebook with Python:\n", "* A notebook behaves like an interactive python shell! This means that\n", " * classes, functions, and variables defined at the cell level have global scope throughout the notebok\n", " * hitting `Tab` will autocomplete the keyword you have started typing\n", " * typing a question mark after a function name will load the interactive help for this function.\n", "* Jupyter has special Python commands (shortcuts, if you will) called _magics_. For instance, `%bash` will allow you to run bash code, `%paste` will allow you to paste a block of code while retaining its formating, and `%matplotlib inline` will import the visualization library matplotlib, and automatically display its plots inline, that is, below the cell. Here's a full list: http://ipython.readthedocs.io/en/stable/interactive/magics.html \n", "* Learn more about the interactive Python shell here: http://ipython.readthedocs.io/en/stable/interactive/tutorial.html\n", "\n", "For more info on Jupyter: https://jupyter.org/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. PCA of the Olympic Athletes data\n", "\n", "In this lab, we will import data (`./data/decathlon.txt`) relating to the top performances in the Men's decathlon at the 2004 summer Olympics in Athens (https://en.wikipedia.org/wiki/Athletics_at_the_2004_Summer_Olympics_%E2%80%93_Men%27s_decathlon) and Decastar 2004 in Talence (https://fr.wikipedia.org/wiki/D%C3%A9castar). 
(Both events were won by Roman Šebrle).\n", "\n", "### Data description\n", "\n", "* The data set consists of 41 rows and 13 columns.\n", "* The first row is a header describing the content of the columns and the remaining rows refer to the 40 observations (athletes) considered in this dataset.\n", "* Columns 1 to 12 are continuous variables: the first ten columns correspond to the performance of the athletes for each event of the decathlon and columns 11 and 12 correspond respectively to the rank and the points obtained.\n", "* The last column is a categorical variable corresponding to the athletic meeting (2004 Olympic Games or 2004 Decastar).\n", "\n", "### Loading and manipulating the data with pandas\n", "pandas is a data analysis library for Python. With pandas we can import our Olympics athletes data into a structured object called a *data frame*, which we can then manipulate with pandas' built-in tools. Here we load the dataset into a data frame and begin to examine it with pandas." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "my_data = pd.read_csv('data/decathlon.txt', sep=\"\\t\") # load data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print(type(my_data)) # display my_data data type" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "my_data.head(n=5) # adjust n to view more data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing data\n", "\n", "* We can select a column by name. Note the returned object is also a pandas object (a *series*--a single-columned DataFrame), so we can use the `head()` function to view the first few rows only." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "my_data['100m'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Or a list for multiple columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "columns = ['100m', '400m']\n", "my_data[columns].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We can select rows satisfying a given condition(s) by passing a boolean series." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "my_data[my_data['Competition']=='OlympicG'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* To *index* a row, we can use the data frame's `loc` object. This behaves like a dictionary whose keys are the data frame's index." 
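] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick sanity check (a sketch, not one of the lab's questions): `loc` accepts names such as `'SEBRLE'` below because `read_csv` used the file's first, unnamed column, the athlete names, as the index. We can verify this assumption by inspecting the index directly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# if the assumption holds, these index labels are athlete names\n", "print(my_data.index[:5])"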
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "my_data.loc['SEBRLE']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "my_data.count() # summarise counts of data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Manipulating data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "list(my_data.columns) # get the names of the columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print(my_data.shape) # get the shape (rows x columns)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "print(my_data.values) # get the content as a numpy array" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print(my_data.dtypes) # get the data type (dtype) of each column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualisation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create visualisations, we'll use `matplotlib`, the primordial plotting library for Python. `matplotlib` may be used in different ways using a built-in interface called `pyplot`. This allows us to access matplotlib modules in a variety arrays from a high-level state-machine environment, to a low-level object-oriented approach). The latter is typically recommended. Another interface, `pylab`, is no longer recommended (http://matplotlib.org/faq/usage_faq.html#coding-styles).\n", "\n", "We also use a Jupyter magic command for inline plotting." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%pylab inline\n", "# This is equivalent to \n", "# import matplotlib.pyplot as plt\n", "# import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can optionally toggle vector graphics for Jupyter display, giving us a crisper plot (this can be expensive though, so beware!):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from IPython.display import Image, set_matplotlib_formats \n", "# set_matplotlib_formats('pdf') # toggle vector graphics for a crisp plot!\n", "set_matplotlib_formats('svg')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# basic visualization: athletes' performances depending on two disciplines\n", "my_data.plot(kind='scatter', x='400m', y='Shot.put', s=50,)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A scatterplot matrix allows us to visualize:\n", " * on the diagonal, the density estimation for each of the features\n", " * on each of the off-diagonal plots, a scatterplot between two of the features. Each dot represents a sample." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pandas.plotting import scatter_matrix\n", "scatter_matrix(my_data.get(['Shot.put','High.jump', '400m']), alpha=0.2,\n", " figsize=(6, 6), diagonal='kde');" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": false }, "outputs": [], "source": [ "# fancier plot with seaborn : https://seaborn.pydata.org/\n", "import seaborn as sns\n", "sns.set_style('whitegrid')\n", "\n", "sns.jointplot('Shot.put', 'High.jump', data = my_data, \n", " kind='kde', height=6, space=0)\n", "\n", "# loooking at correlated features\n", "sns.jointplot('Shot.put', 'Discus', data = my_data, \n", " kind='reg', height=6, space=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Remove columns we don't need (we're only interested in performance in the different sports)\n", "data = my_data.drop(['Points', 'Rank', 'Competition'], axis=1)\n", "\n", "# Verify new column headers\n", "data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use scikit-learn to find the PCs\n", "\n", "In this course, we will rely heavily on the [scikit-learn](http://scikit-learn.org/stable/index.html) machine learning toolbox, which implements most classical, (non-deep) machine learning algorithms. Here, we will use scikit-learn to compute the PCs, and compare the results to what we got before. A useful resource is the online documentation: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data standardization\n", "Recall that PCA works best on standardised data (mean 0, standard deviation 1). " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "\n", "# transform data from to numpy array\n", "X = data.values\n", "\n", "# TODO: standardise the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import preprocessing\n", "\n", "std_scale = preprocessing.StandardScaler().fit(X)\n", "X_scaled = std_scale.transform(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Computing 2 principal components" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import decomposition\n", "\n", "pca = decomposition.PCA(n_components=2)\n", "pca.fit(X_scaled)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** Use `pca.transform` to project the data onto its principal components." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# TODO: project X on principal components\n", "X_projected = " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pca.explained_variance_ratio_` gives the percentage of variance explained by each of the components." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print(pca.explained_variance_ratio_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** How is `pca.explained_variance_ratio_` computed? Check this is the case by computing it yourself." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tot_var = np.var(X_scaled, axis=0).sum()\n", "print((1 / tot_var) * np.var(X_projected, axis=0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Projecting the data onto its principal components\n", " \n", "We will plot the fraction of variance explained by each of the first 10 principal components.\n", "\n", "__Question:__ Compute the 10 first PCs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: compute the 10 first PCs\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca.explained_variance_ratio_.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "plt.bar(np.arange(10), pca.explained_variance_ratio_)\n", "plt.xlim([-1, 9])\n", "plt.xlabel(\"Number of PCs\", fontsize=16)\n", "plt.ylabel(\"Fraction of variance explained\", fontsize=16)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To better understand the information captured by the principal components, we can consider  `pca.components_`. These are the columns of $\\mathbf{W}$ (for $M = 2$)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pcs = pca.components_\n", "print(pcs[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can display each row of $\\mathbf{W}$ in a 2D plot whose x-axis gives its contribution to the first component and y-axis to the second component. Note, whereas before we were visualising the projected data, $\\mathbf{Z}$, now we are visualising the projections, $\\mathbf{W}$. This indicates how the features cluster i.e. if a pair of feature projections are close, observations will tend to be similarly-valued over those features." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "fig = plt.figure(figsize=(6, 5))\n", "ax = fig.add_subplot(1, 1, 1)\n", "ax.set_xlim([-0.7, 0.7])\n", "ax.set_ylim([-0.7, 0.7])\n", "\n", "for i, (x, y) in enumerate(zip(pcs[0, :], pcs[1, :])):\n", " # plot line between origin and point (x, y)\n", " ax.plot([0, x], [0, y], color='k')\n", " # display the label of the point\n", " ax.text(x, y, data.columns[i], fontsize='14')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** based on the two previous graphs, can you find a meaning for the two principal components ?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 2 }