# Lab 1: Principal Components Analysis
## 1. How to use Jupyter
All our labs will be done in Jupyter notebooks. You should run your own instance of Jupyter, so that you can interact with the notebook, modify it and run Python code in it! Follow the instructions at https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html
A Jupyter notebook is a web application that allows you to create and share documents (such as this .ipynb notebook) that contain live code, visualisations and explanatory text (with equations).
Here are some tips on using a Jupyter notebook:
* Each block of text is contained in a _cell_. A cell can be either raw text, code, or markdown text (such as this cell). For more info on markdown syntax, follow the [guide](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html).
* You can run a cell by clicking inside it and hitting `Shift+Enter` (or the play button in the toolbar).
* If you want to create a new cell below the one you're running, hit `Alt+Enter` (or the plus button in the toolbar).
Some tips on using a Jupyter notebook with Python:
* A notebook behaves like an interactive Python shell! This means that
    * classes, functions, and variables defined at the cell level have global scope throughout the notebook
    * hitting `Tab` will autocomplete the keyword you have started typing
    * typing a question mark after a function name will load the interactive help for this function.
* Jupyter has special commands (shortcuts, if you will) called _magics_. For instance, `%%bash` at the top of a cell will run that cell as bash code, `%paste` will allow you to paste a block of code while retaining its formatting, and `%matplotlib inline` will set up the visualisation library matplotlib to display its plots inline, that is, below the cell. Here's a full list: http://ipython.readthedocs.io/en/stable/interactive/magics.html
* Learn more about the interactive Python shell here: http://ipython.readthedocs.io/en/stable/interactive/tutorial.html
For more info on Jupyter: https://jupyter.org/
## 2. PCA of the Olympic Athletes data
In this lab, we will import data (`./data/decathlon.txt`) relating to the top performances in the men's decathlon at the 2004 Summer Olympics in Athens (https://en.wikipedia.org/wiki/Athletics_at_the_2004_Summer_Olympics_%E2%80%93_Men%27s_decathlon) and at Decastar 2004 in Talence (https://fr.wikipedia.org/wiki/D%C3%A9castar). Both events were won by Roman Šebrle.
### Data description
* The data set consists of 41 rows and 13 columns.
* The first row is a header describing the content of the columns and the remaining rows refer to the 40 observations (athletes) considered in this dataset.
* Columns 1 to 12 are continuous variables: the first ten columns correspond to the performance of the athletes for each event of the decathlon and columns 11 and 12 correspond respectively to the rank and the points obtained.
* The last column is a categorical variable corresponding to the athletic meeting (2004 Olympic Games or 2004 Decastar).
### Loading and manipulating the data with pandas
pandas is a data analysis library for Python. With pandas we can import our Olympics athletes data into a structured object called a *data frame*, which we can then manipulate with pandas' built-in tools. Here we load the dataset into a data frame and begin to examine it with pandas.
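As a minimal sketch of the loading step, the cell below reads a small in-memory table with `pd.read_csv` and inspects it. The toy rows, the column names, the tab separator and the use of the first column as the index are all assumptions made for illustration; check the real `./data/decathlon.txt` before reusing them.

```python
import io

import pandas as pd

# Toy stand-in for ./data/decathlon.txt: made-up values, hypothetical
# column names and an assumed tab separator. This is NOT the real data.
toy_file = io.StringIO(
    "Athlete\t100m\tLong.jump\tPoints\tCompetition\n"
    "A\t11.0\t7.5\t8200\tDecastar\n"
    "B\t10.8\t7.7\t8400\tOlympicG\n"
    "C\t11.3\t7.1\t7900\tOlympicG\n"
)

# For the real file, pass its path instead of the buffer, after checking
# which separator it uses and whether its first column is an index.
df = pd.read_csv(toy_file, sep="\t", index_col=0)

print(df.shape)    # (number of rows, number of columns), header excluded
print(df.head(2))  # first two rows
print(df.dtypes)   # type inferred for each column

# Summary statistics (count, mean, std, quartiles) of the numeric columns
summary = df.describe()
print(summary)
```

`head()`, `dtypes` and `describe()` are usually a good first look at a freshly loaded data frame: they show whether the header and the column types were parsed as expected.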
### Accessing data
* We can select a column by name. Note that the returned object is also a pandas object (a *series*, which behaves like a single-column data frame), so we can use its `head()` method to view only the first few rows.
* We can also pass a list of names to select multiple columns.
* We can select the rows satisfying one or more conditions by passing a boolean series.
* To *index* a row, we can use the data frame's `loc` object. This behaves like a dictionary whose keys are the data frame's index.
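The four access patterns above can be sketched as follows. The data frame is a made-up toy example (hypothetical athlete labels, column names and values), not the decathlon data.

```python
import pandas as pd

# Made-up toy frame; names and values are illustrative only.
df = pd.DataFrame(
    {
        "100m": [11.0, 10.8, 11.3],
        "Points": [8200, 8400, 7900],
        "Competition": ["Decastar", "OlympicG", "OlympicG"],
    },
    index=["A", "B", "C"],
)

# A single column name returns a Series
points = df["Points"]
print(points.head(2))

# A list of column names returns a DataFrame
subset = df[["100m", "Points"]]

# A boolean Series keeps the rows where it is True
mask = df["Competition"] == "OlympicG"
olympic = df[mask]

# Conditions are combined with & (and) or | (or), each one in parentheses
fast_olympic = df[mask & (df["100m"] < 11.0)]

# loc looks a row up by its index label and returns it as a Series
row_b = df.loc["B"]
print(row_b["Points"])
```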
### Manipulating data
### Visualisation
To create visualisations, we'll use `matplotlib`, the original and most widely used plotting library for Python. `matplotlib` may be used in different ways through a built-in interface called `pyplot`. This allows us to access matplotlib in a variety of ways (from a high-level state-machine environment to a low-level object-oriented approach). The latter is typically recommended. Another interface, `pylab`, is no longer recommended (http://matplotlib.org/faq/usage_faq.html#coding-styles).
We also use a Jupyter magic command for inline plotting.
We can optionally toggle vector graphics for Jupyter display, giving us a crisper plot (this can be expensive though, so beware!):
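A minimal sketch of the object-oriented `pyplot` style, using synthetic data generated only for this illustration. The two magics are IPython syntax rather than Python, so they appear here as comments; in a notebook cell they would be written without the leading `# `.

```python
import matplotlib.pyplot as plt
import numpy as np

# Notebook-only magics (uncomment inside Jupyter):
# %matplotlib inline
# %config InlineBackend.figure_format = "svg"   # vector output, can be slow

# Synthetic data for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)

# Object-oriented style: create the figure and axes explicitly,
# then call methods on the axes object rather than on pyplot itself.
fig, ax = plt.subplots(figsize=(4, 3))
ax.scatter(x, y, s=15)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Synthetic scatter plot")
fig.tight_layout()
```

Keeping explicit `fig` and `ax` handles makes it clear which plot each call modifies, which helps once a notebook contains several figures.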
A scatterplot matrix allows us to visualize:
* on the diagonal, the density estimation for each of the features
* on each of the off-diagonal plots, a scatterplot between two of the features. Each dot represents a sample.
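One way to draw such a plot is `pandas.plotting.scatter_matrix`, sketched below on a synthetic three-feature data frame (the feature names `f1` to `f3` are hypothetical). The `diagonal="kde"` option requests a kernel density estimate on the diagonal and relies on SciPy being installed.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic features for illustration: f1 and f2 are correlated, f3 is not.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
toy = pd.DataFrame(
    {
        "f1": base,
        "f2": base + rng.normal(scale=0.3, size=100),
        "f3": rng.normal(size=100),
    }
)

# One density estimate per feature on the diagonal,
# one scatterplot per pair of features elsewhere.
axes = scatter_matrix(toy, diagonal="kde", figsize=(6, 6), alpha=0.7)
```

The returned `axes` is a square array of matplotlib axes, one per panel, which can be used to adjust individual subplots afterwards.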
### Cleaning data
### Use scikit-learn to find the PCs
In this course, we will rely heavily on the [scikit-learn](http://scikit-learn.org/stable/index.html) machine learning toolbox, which implements most classical (non-deep) machine learning algorithms. Here, we will use scikit-learn to compute the PCs, and compare the results to what we got before. A useful resource is the online documentation: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
#### Data standardization
Recall that PCA works best on standardised data (mean 0, standard deviation 1).
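A minimal sketch of standardisation with scikit-learn's `StandardScaler`, applied to a synthetic matrix whose three columns have deliberately different scales (the data are generated only for this illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data: 40 samples, 3 features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 50.0, 200.0], scale=[0.5, 5.0, 30.0], size=(40, 3))

# Subtract each column's mean and divide by its standard deviation
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0))  # approximately 0 for every feature
print(X_std.std(axis=0))   # approximately 1 for every feature
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so the two can differ slightly on small data sets.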
#### Computing 2 principal components
**Question:** Use `pca.transform` to project the data onto its principal components.
`pca.explained_variance_ratio_` gives the fraction (a value between 0 and 1) of the variance explained by each of the components.
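As a minimal sketch, the cell below fits a two-component `PCA` on synthetic standardised data and reads the attribute. The data are generated only for this illustration (three features share a latent factor, two are pure noise), and the variable names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 features driven by one latent factor, plus 2 noise features
rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 1))
correlated = [latent + 0.3 * rng.normal(size=(40, 1)) for _ in range(3)]
noise = [rng.normal(size=(40, 2))]
X = np.hstack(correlated + noise)  # shape (40, 5)
X_std = StandardScaler().fit_transform(X)

# Fit PCA with M = 2 components
pca = PCA(n_components=2)
pca.fit(X_std)

ratios = pca.explained_variance_ratio_
print(ratios)        # one fraction per component, largest first
print(ratios.sum())  # fraction of the total variance kept by the 2 components
```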
**Question:** How is `pca.explained_variance_ratio_` computed? Check this is the case by computing it yourself.
#### Projecting the data onto its principal components
We will plot the fraction of variance explained by each of the first 10 principal components.
__Question:__ Compute the first 10 PCs.
To better understand the information captured by the principal components, we can consider `pca.components_`. Its rows are the columns of $\mathbf{W}$ (for $M = 2$), so it has one row per component and one column per feature.
We can display each row of $\mathbf{W}$ in a 2D plot whose x-axis gives its contribution to the first component and whose y-axis gives its contribution to the second component. Note that, whereas before we were visualising the projected data, $\mathbf{Z}$, now we are visualising the projections, $\mathbf{W}$. This indicates how the features cluster: if a pair of feature projections are close, observations will tend to take similar values on those features.
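A minimal sketch of such a plot on synthetic data. The feature names `f1` to `f4` and the data are hypothetical (two pairs of features, each pair driven by its own latent factor), and `W` is taken to be the transpose of `pca.components_`.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: f1, f2 follow one latent factor; f3, f4 follow another
rng = np.random.default_rng(0)
feature_names = ["f1", "f2", "f3", "f4"]
latent = rng.normal(size=(60, 2))
X = np.column_stack(
    [
        latent[:, 0] + 0.2 * rng.normal(size=60),
        latent[:, 0] + 0.2 * rng.normal(size=60),
        latent[:, 1] + 0.2 * rng.normal(size=60),
        latent[:, 1] + 0.2 * rng.normal(size=60),
    ]
)
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)

# components_ has shape (2, n_features); its transpose has one row per feature
W = pca.components_.T

fig, ax = plt.subplots(figsize=(4, 4))
for name, (w1, w2) in zip(feature_names, W):
    ax.plot([0, w1], [0, w2], color="tab:blue")  # segment from origin to feature
    ax.annotate(name, (w1, w2))
ax.axhline(0, color="grey", linewidth=0.5)
ax.axvline(0, color="grey", linewidth=0.5)
ax.set_xlabel("Contribution to PC 1")
ax.set_ylabel("Contribution to PC 2")
ax.set_aspect("equal")
```

In this toy setting, features built from the same latent factor are expected to land close together, which is the clustering behaviour described above.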
**Question:** Based on the two previous graphs, can you find a meaning for the two principal components?