{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 1: Principal Components Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. How to use Jupyter\n", "All our labs will be done in Jupyter notebooks. You should run your own instance of Jupyter, so that you can interact with the notebook, modify it and run Python code in it! Follow the instructions at https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html \n", "\n", "A Jupyter notebook is a web application that allows you to create and share documents (such as this .ipynb notebook) that contain live code, visualisations and explanatory text (with equations).\n", "\n", "Here are some tips on using a Jupyter notebook:\n", "* Each block of text is contained in a _cell_. A cell can be either raw text, code, or markdown text (such as this cell). For more info on markdown syntax, follow the [guide](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html).\n", "* You can run a cell by clicking inside it and hitting `Shift+Enter` (or the play button in the toolbar)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "2 + 2 # hit Shift+Enter to run" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* If you want to create a new cell below the one you're running, hit `Alt+Enter` (or the plus button in the toolbar).\n", "\n", "Some tips on using a Jupyter notebook with Python:\n", "* A notebook behaves like an interactive python shell! 
This means that\n", " * classes, functions, and variables defined at the cell level have global scope throughout the notebook\n", " * hitting `Tab` will autocomplete the keyword you have started typing\n", " * typing a question mark after a function name will load the interactive help for this function.\n", "* Jupyter has special Python commands (shortcuts, if you will) called _magics_. For instance, the cell magic `%%bash` will allow you to run bash code in a cell, `%paste` will allow you to paste a block of code while retaining its formatting, and `%matplotlib inline` will set up the visualization library matplotlib to automatically display its plots inline, that is, below the cell. Here's a full list: http://ipython.readthedocs.io/en/stable/interactive/magics.html \n", "* Learn more about the interactive Python shell here: http://ipython.readthedocs.io/en/stable/interactive/tutorial.html\n", "\n", "For more info on Jupyter: https://jupyter.org/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. PCA of the Olympic Athletes data\n", "\n", "In this lab, we will import data (`./data/decathlon.txt`) relating to the top performances in the men's decathlon at the 2004 Summer Olympics in Athens (https://en.wikipedia.org/wiki/Athletics_at_the_2004_Summer_Olympics_%E2%80%93_Men%27s_decathlon) and Decastar 2004 in Talence (https://fr.wikipedia.org/wiki/D%C3%A9castar). 
(Both events were won by Roman Šebrle).\n", "\n", "### Data description\n", "\n", "* The data set consists of 41 observations (rows) and 13 variables (columns).\n", "* The first line of the file is a header naming the columns; each remaining line is one observation (an athlete's performance at one meeting), labelled with the athlete's name.\n", "* Columns 1 to 12 are numerical variables: the first ten columns correspond to the performance of the athlete in each event of the decathlon, and columns 11 and 12 correspond respectively to the rank and the points obtained.\n", "* The last column is a categorical variable corresponding to the athletic meeting (2004 Olympic Games or 2004 Decastar).\n", "\n", "### Loading and manipulating the data with pandas\n", "pandas is a data analysis library for Python. With pandas we can import our Olympic athletes data into a structured object called a *data frame*, which we can then manipulate with pandas' built-in tools. Here we load the dataset into a data frame and begin to examine it with pandas." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "my_data = pd.read_csv('data/decathlon.txt', sep=\"\\t\") # load data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n" ] } ], "source": [ "print(type(my_data)) # display my_data data type" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ " 100m Long.jump Shot.put High.jump 400m 110m.hurdle Discus \\\n", "SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69 43.75 \n", "CLAY 10.76 7.40 14.26 1.86 49.37 14.05 50.72 \n", "KARPOV 11.02 7.30 14.77 2.04 48.37 14.09 48.95 \n", "BERNARD 11.02 7.23 14.25 1.92 48.93 14.99 40.87 \n", "YURKOV 11.34 7.09 15.19 2.10 50.42 15.31 46.26 \n", "\n", " Pole.vault Javeline 1500m Rank Points Competition \n", "SEBRLE 5.02 63.19 291.7 1 8217 Decastar \n", "CLAY 4.92 60.15 301.5 2 8122 Decastar \n", "KARPOV 4.92 50.31 300.2 3 8099 Decastar \n", "BERNARD 5.32 62.77 280.1 4 8067 Decastar \n", "YURKOV 4.72 63.44 276.4 5 8036 Decastar " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_data.head(n=5) # adjust n to view more data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing data\n", "\n", "* We can select a column by name. Note that the returned object is also a pandas object (a *Series*, the one-dimensional counterpart of a data frame), so we can use its `head()` method to view only the first few rows." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "SEBRLE 11.04\n", "CLAY 10.76\n", "KARPOV 11.02\n", "BERNARD 11.02\n", "YURKOV 11.34\n", "Name: 100m, dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_data['100m'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We can also pass a list of names to select multiple columns." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ " 100m 400m\n", "SEBRLE 11.04 49.81\n", "CLAY 10.76 49.37\n", "KARPOV 11.02 48.37\n", "BERNARD 11.02 48.93\n", "YURKOV 11.34 50.42" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns = ['100m', '400m']\n", "my_data[columns].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We can select the rows satisfying one or more conditions by passing a boolean series." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 100m Long.jump Shot.put High.jump 400m 110m.hurdle Discus \\\n", "Sebrle 10.85 7.84 16.36 2.12 48.36 14.05 48.72 \n", "Clay 10.44 7.96 15.23 2.06 49.19 14.13 50.11 \n", "Karpov 10.50 7.81 15.93 2.09 46.81 13.97 51.65 \n", "Macey 10.89 7.47 15.73 2.15 48.97 14.56 48.34 \n", "Warners 10.62 7.74 14.48 1.97 47.97 14.01 43.73 \n", "\n", " Pole.vault Javeline 1500m Rank Points Competition \n", "Sebrle 5.0 70.52 280.01 1 8893 OlympicG \n", "Clay 4.9 69.71 282.00 2 8820 OlympicG \n", "Karpov 4.6 55.54 278.11 3 8725 OlympicG \n", "Macey 4.4 58.46 265.42 4 8414 OlympicG \n", "Warners 4.9 55.39 278.05 5 8343 OlympicG " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_data[my_data['Competition']=='OlympicG'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* To *index* a row, we can use the data frame's `loc` object. This behaves like a dictionary whose keys are the data frame's index." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [ { "data": { "text/plain": [ "100m 11.04\n", "Long.jump 7.58\n", "Shot.put 14.83\n", "High.jump 2.07\n", "400m 49.81\n", "110m.hurdle 14.69\n", "Discus 43.75\n", "Pole.vault 5.02\n", "Javeline 63.19\n", "1500m 291.7\n", "Rank 1\n", "Points 8217\n", "Competition Decastar\n", "Name: SEBRLE, dtype: object" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_data.loc['SEBRLE']" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "100m 41\n", "Long.jump 41\n", "Shot.put 41\n", "High.jump 41\n", "400m 41\n", "110m.hurdle 41\n", "Discus 41\n", "Pole.vault 41\n", "Javeline 41\n", "1500m 41\n", "Rank 41\n", "Points 41\n", "Competition 41\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_data.count() # summarise counts of 
data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Manipulating data" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [ { "data": { "text/plain": [ "['100m',\n", " 'Long.jump',\n", " 'Shot.put',\n", " 'High.jump',\n", " '400m',\n", " '110m.hurdle',\n", " 'Discus',\n", " 'Pole.vault',\n", " 'Javeline',\n", " '1500m',\n", " 'Rank',\n", " 'Points',\n", " 'Competition']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(my_data.columns) # get the names of the columns" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(41, 13)\n" ] } ], "source": [ "print(my_data.shape) # get the shape (rows x columns)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[11.04 7.58 14.83 2.07 49.81 14.69 43.75 5.02 63.19 291.7 1 8217\n", " 'Decastar']\n", " [10.76 7.4 14.26 1.86 49.37 14.05 50.72 4.92 60.15 301.5 2 8122\n", " 'Decastar']\n", " [11.02 7.3 14.77 2.04 48.37 14.09 48.95 4.92 50.31 300.2 3 8099\n", " 'Decastar']\n", " [11.02 7.23 14.25 1.92 48.93 14.99 40.87 5.32 62.77 280.1 4 8067\n", " 'Decastar']\n", " [11.34 7.09 15.19 2.1 50.42 15.31 46.26 4.72 63.44 276.4 5 8036\n", " 'Decastar']\n", " [11.11 7.6 14.31 1.98 48.68 14.23 41.1 4.92 51.77 278.1 6 8030\n", " 'Decastar']\n", " [11.13 7.3 13.48 2.01 48.62 14.17 45.67 4.42 55.37 268.0 7 8004\n", " 'Decastar']\n", " [10.83 7.31 13.76 2.13 49.91 14.38 44.41 4.42 56.37 285.1 8 7995\n", " 'Decastar']\n", " [11.64 6.81 14.57 1.95 50.14 14.93 47.6 4.92 52.33 262.1 9 7802\n", " 'Decastar']\n", " [11.37 7.56 14.41 1.86 51.1 15.06 44.99 4.82 57.19 285.1 10 7733\n", " 'Decastar']\n", " [11.33 6.97 14.09 1.95 49.48 14.48 42.1 4.72 55.4 282.0 11 7708\n", " 'Decastar']\n", " [11.33 7.27 12.68 1.98 
49.2 15.29 37.92 4.62 57.44 266.6 12 7651\n", " 'Decastar']\n", " [11.36 6.8 13.46 1.86 51.16 15.67 40.49 5.02 54.68 291.7 13 7313\n", " 'Decastar']\n", " [10.85 7.84 16.36 2.12 48.36 14.05 48.72 5.0 70.52 280.01 1 8893\n", " 'OlympicG']\n", " [10.44 7.96 15.23 2.06 49.19 14.13 50.11 4.9 69.71 282.0 2 8820\n", " 'OlympicG']\n", " [10.5 7.81 15.93 2.09 46.81 13.97 51.65 4.6 55.54 278.11 3 8725\n", " 'OlympicG']\n", " [10.89 7.47 15.73 2.15 48.97 14.56 48.34 4.4 58.46 265.42 4 8414\n", " 'OlympicG']\n", " [10.62 7.74 14.48 1.97 47.97 14.01 43.73 4.9 55.39 278.05 5 8343\n", " 'OlympicG']\n", " [10.91 7.14 15.31 2.12 49.4 14.95 45.62 4.7 63.45 269.54 6 8287\n", " 'OlympicG']\n", " [10.97 7.19 14.65 2.03 48.73 14.25 44.72 4.8 57.76 264.35 7 8237\n", " 'OlympicG']\n", " [10.8 7.53 14.26 1.88 48.81 14.8 42.05 5.4 61.33 276.33 8 8235\n", " 'OlympicG']\n", " [10.69 7.48 14.8 2.12 49.13 14.17 44.75 4.4 55.27 276.31 9 8225\n", " 'OlympicG']\n", " [10.98 7.49 14.01 1.94 49.76 14.25 42.43 5.1 56.32 273.56 10 8102\n", " 'OlympicG']\n", " [10.95 7.31 15.1 2.06 50.79 14.21 44.6 5.0 53.45 287.63 11 8084\n", " 'OlympicG']\n", " [10.9 7.3 14.77 1.88 50.3 14.34 44.41 5.0 60.89 278.82 12 8077\n", " 'OlympicG']\n", " [11.14 6.99 14.91 1.94 49.41 14.37 44.83 4.6 64.55 267.09 13 8067\n", " 'OlympicG']\n", " [10.85 6.81 15.24 1.91 49.27 14.01 49.02 4.2 61.52 272.74 14 8023\n", " 'OlympicG']\n", " [10.55 7.34 14.44 1.94 49.72 14.39 39.88 4.8 54.51 271.02 15 8021\n", " 'OlympicG']\n", " [10.68 7.5 14.97 1.94 49.12 15.01 40.35 4.6 59.26 275.71 16 8006\n", " 'OlympicG']\n", " [10.89 7.07 13.88 1.94 49.11 14.77 42.47 4.7 60.88 263.31 17 7993\n", " 'OlympicG']\n", " [11.06 7.34 13.55 1.97 49.65 14.78 45.13 4.5 60.79 272.63 18 7934\n", " 'OlympicG']\n", " [10.87 7.38 13.07 1.88 48.51 14.01 40.11 5.0 51.53 274.21 19 7926\n", " 'OlympicG']\n", " [11.14 6.61 15.69 2.03 51.04 14.88 41.9 4.8 65.82 277.94 20 7918\n", " 'OlympicG']\n", " [10.92 6.94 15.15 1.94 49.56 15.12 45.62 5.3 50.62 290.36 21 
7893\n", " 'OlympicG']\n", " [11.08 7.26 14.57 1.85 48.61 14.41 40.95 4.4 60.71 269.7 22 7865\n", " 'OlympicG']\n", " [11.08 6.91 13.62 2.03 51.67 14.26 39.83 4.8 59.34 290.01 23 7708\n", " 'OlympicG']\n", " [11.1 7.03 13.22 1.85 49.34 15.38 40.22 4.5 58.36 263.08 24 7592\n", " 'OlympicG']\n", " [11.33 7.26 13.3 1.97 50.54 14.98 43.34 4.5 52.92 278.67 25 7583\n", " 'OlympicG']\n", " [10.86 7.07 14.81 1.94 51.16 14.96 46.07 4.7 53.05 317.0 26 7573\n", " 'OlympicG']\n", " [11.23 6.99 13.53 1.85 50.95 15.09 43.01 4.5 60.0 281.7 27 7495\n", " 'OlympicG']\n", " [11.36 6.68 14.92 1.94 53.2 15.39 48.66 4.4 58.62 296.12 28 7404\n", " 'OlympicG']]\n" ] } ], "source": [ "print(my_data.values) # get the content as a numpy array" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100m float64\n", "Long.jump float64\n", "Shot.put float64\n", "High.jump float64\n", "400m float64\n", "110m.hurdle float64\n", "Discus float64\n", "Pole.vault float64\n", "Javeline float64\n", "1500m float64\n", "Rank int64\n", "Points int64\n", "Competition object\n", "dtype: object\n" ] } ], "source": [ "print(my_data.dtypes) # get the data type (dtype) of each column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualisation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create visualisations, we'll use `matplotlib`, the core plotting library for Python. `matplotlib` can be used in different ways through a built-in interface called `pyplot`, ranging from a high-level state-machine environment to a lower-level object-oriented approach. The latter is typically recommended. Another interface, `pylab`, is no longer recommended (http://matplotlib.org/faq/usage_faq.html#coding-styles).\n", "\n", "We also use a Jupyter magic command for inline plotting." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "source": [ "%pylab inline\n", "# Among other things, this makes the following imports available:\n", "# import matplotlib.pyplot as plt\n", "# import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can optionally toggle vector graphics for Jupyter display, giving us a crisper plot (this can be expensive though, so beware!):" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# (in recent IPython versions, set_matplotlib_formats is provided by matplotlib_inline.backend_inline instead)\n", "from IPython.display import Image, set_matplotlib_formats\n", "# set_matplotlib_formats('pdf') # toggle vector graphics for a crisp plot!\n", "set_matplotlib_formats('svg')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/svg+xml": "[Figure: scatter plot of Shot.put against 400m]", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# basic visualization: athletes' performances in two disciplines\n", "my_data.plot(kind='scatter', x='400m', y='Shot.put', s=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A scatterplot matrix allows us to visualize:\n", " * on the diagonal, the density estimation for each of the features\n", " * on each of the off-diagonal plots, a scatterplot between two of the features. Each dot represents a sample." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [ { "data": { "image/svg+xml": "[Figure: scatterplot matrix of Shot.put, High.jump and 400m, with kernel density estimates on the diagonal]", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from pandas.plotting import scatter_matrix\n", "scatter_matrix(my_data[['Shot.put', 'High.jump', '400m']], alpha=0.2,\n", " figsize=(6, 6), diagonal='kde');" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true, "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/svg+xml": "[Figure: joint kernel density plot of Shot.put and High.jump]", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/svg+xml": "[Figure: joint regression plot of Shot.put and Discus]", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# fancier plots with seaborn: https://seaborn.pydata.org/\n", "import seaborn as sns\n", "sns.set_style('whitegrid')\n", "\n", "# x and y are passed as keyword arguments, as required from seaborn 0.12 onwards\n", "sns.jointplot(x='Shot.put', y='High.jump', data=my_data,\n", " kind='kde', height=6, space=0)\n", "\n", "# looking at correlated features\n", "sns.jointplot(x='Shot.put', y='Discus', data=my_data,\n", " kind='reg', height=6, space=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning data" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [ { "data": { "text/plain": [ "Index(['100m', 'Long.jump', 'Shot.put', 'High.jump', '400m', '110m.hurdle',\n", " 'Discus', 'Pole.vault', 'Javeline', '1500m'],\n", " dtype='object')" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Remove columns we don't need (we're only interested in performance in the different events)\n", "data = my_data.drop(['Points', 'Rank', 'Competition'], axis=1)\n", "\n", "# Verify new column headers\n", "data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use scikit-learn to find the PCs\n", "\n", "In this course, we will rely heavily on the [scikit-learn](http://scikit-learn.org/stable/index.html) machine learning toolbox, which implements most classical (non-deep) machine learning algorithms. Here, we will use scikit-learn to compute the PCs. A useful resource is the online documentation: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data standardization\n", "Recall that PCA is sensitive to the scale of the features, so it is usually applied to standardised data (mean 0, standard deviation 1)." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "\n", "# convert the data frame to a numpy array\n", "X = data.values\n", "\n", "# TODO: standardise the data\n", "# (centre each column, then divide it by its standard deviation)\n", "X_scaled_ = (X - np.mean(X, axis=0))/np.std(X, axis=0)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import preprocessing\n", "\n", "# the same standardisation, done with scikit-learn\n", "std_scale = preprocessing.StandardScaler().fit(X)\n", "X_scaled = std_scale.transform(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Computing 2 principal components" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PCA(n_components=2)" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import decomposition\n", "\n", "pca = decomposition.PCA(n_components=2)\n", "pca.fit(X_scaled)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** Use `pca.transform` to project the data onto its principal components." ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# TODO: project X on principal components\n", "X_projected = pca.transform(X_scaled)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pca.explained_variance_ratio_` gives the fraction of the total variance explained by each of the components." ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.32719055 0.1737131 ]\n" ] } ], "source": [ "print(pca.explained_variance_ratio_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** How is `pca.explained_variance_ratio_` computed? Check this is the case by computing it yourself." 
] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.32719055 0.1737131 ]\n" ] } ], "source": [ "# total variance of the standardised data, summed over all features\n", "tot_var = np.var(X_scaled, axis=0).sum()\n", "# variance along each PC, as a fraction of the total variance\n", "print((1 / tot_var) * np.var(X_projected, axis=0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Projecting the data onto its principal components\n", " \n", "We will plot the fraction of variance explained by each of the first 10 principal components.\n", "\n", "__Question:__ Compute the first 10 PCs." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PCA(n_components=10)" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# TODO: compute the first 10 PCs\n", "pca = decomposition.PCA(n_components=10)\n", "pca.fit(X_scaled)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(10,)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pca.explained_variance_ratio_.shape" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": true }, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'Fraction of variance explained')" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[Figure: bar chart of the fraction of variance explained by each of the 10 principal components]
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# one bar per principal component (0-indexed)\n", "plt.bar(np.arange(10), pca.explained_variance_ratio_)\n", "plt.xlim([-1, 10])  # leave room so the last bar is not clipped\n", "plt.xlabel(\"Principal component\", fontsize=16)\n", "plt.ylabel(\"Fraction of variance explained\", fontsize=16)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To better understand the information captured by the principal components, we can consider `pca.components_`. Each row of `pca.components_` is one principal axis, that is, one column of $\\mathbf{W}$; below we use the first two (the case $M = 2$)." ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-0.42829627 0.41015201 0.34414444 0.31619436 -0.3757157 -0.41255442\n", " 0.30542571 0.02783081 0.15319802 -0.03210733]\n" ] } ], "source": [ "pcs = pca.components_\n", "print(pcs[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can display each row of $\\mathbf{W}$ (one row per feature) in a 2D plot whose x-axis gives its contribution to the first component and whose y-axis gives its contribution to the second component. Note that, whereas before we were visualising the projected data, $\\mathbf{Z}$, we are now visualising the projections, $\\mathbf{W}$. This indicates how the features cluster: if the projections of two features are close, observations will tend to take similar values on those features." 
] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": true }, "outputs": [ { "data": { "text/plain": [ "[Figure: each feature drawn as a labelled line from the origin to its coordinates on the first two principal components]
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = plt.figure(figsize=(6, 5))\n", "ax = fig.add_subplot(1, 1, 1)\n", "ax.set_xlim([-0.7, 0.7])\n", "ax.set_ylim([-0.7, 0.7])\n", "\n", "# pcs[0] and pcs[1] hold the weights of each feature on PC 1 and PC 2\n", "for i, (x, y) in enumerate(zip(pcs[0, :], pcs[1, :])):\n", "    # plot line between origin and point (x, y)\n", "    ax.plot([0, x], [0, y], color='k')\n", "    # display the label of the point\n", "    ax.text(x, y, data.columns[i], fontsize=14)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** Based on the two previous plots, can you give an interpretation of the first two principal components?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 2 }