11 years ago · 265a7570de
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
 
				+sol_021.py
			
--- a/Preliminaries.ipynb
+++ b/Preliminaries.ipynb
@@ -0,0 +1,356 @@
 
				+{
			
 
				+ "cells": [
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "<h1 style=\"text-align: center;\">NumPy Array Tutorial @ EuroScipy 2015</h1>\n",
			
 
				+    "\n",
			
 
				+    "<img src=\"images/euroscipy_logo.png\" />"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "# Goal of this Tutorial\n",
			
 
				+    "\n",
			
 
				+    "- **Introduce the basics of Numpy**, along with some more advanced topics;\n",
			
 
				+    "- **Provide some concrete examples** where Numpy takes a central role.\n",
			
 
				+    "        "
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "# Schedule\n",
			
 
				+    "\n",
			
 
				+    "Outline:\n",
			
 
				+    "\n",
			
 
				+    "**14:00 - 14:15** Preliminaries\n",
			
 
				+    "\n",
			
 
				+    "- Making sure your computer is set up and everything is up and running\n",
			
 
				+    "\n",
			
 
				+    "**PART 1** Numpy Basics (14:15 - 16:30)\n",
			
 
				+    "\n",
			
 
				+    "**14:15 - 15:00** (45 mins) Introduction to Numpy\n",
			
 
				+    "\n",
			
 
				+    "- What is Numpy?\n",
			
 
				+    "- Introduction to Numpy Arrays\n",
			
 
				+    "    - Understand the importance of numpy arrays over lists\n",
			
 
				+    "- Numpy Data Types\n",
			
 
				+    "    - Conversion and Type Casting\n",
			
 
				+    "    - Numerical Representations\n",
			
 
				+    "- Record Array\n",
			
 
				+    "\n",
			
 
				+    "**15:00 - 15:30** (30 mins) Indexing and Slicing\n",
			
 
				+    "\n",
			
 
				+    "- Indexing numpy arrays\n",
			
 
				+    "    - fancy indexing\n",
			
 
				+    "    - memory management\n",
			
 
				+    "- Slicing \n",
			
 
				+    "- Vectorization\n",
			
 
				+    "- Using arrays in Conditions\n",
			
 
				+    "\n",
			
 
				+    "**15:30 - 16:00** (30 mins) Coffee Break\n",
			
 
				+    "\n",
			
 
				+    "**16:00 - 16:30** (30 mins) Numpy Operations\n",
			
 
				+    "\n",
			
 
				+    "- Linear Algebra\n",
			
 
				+    "- Array and Matrix\n",
			
 
				+    "- Reshaping and Resizing\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "**PART 2** Advanced Numpy Functions and Applications (16:30 - 17:30)\n",
			
 
				+    "\n",
			
 
				+    "**16:30 - 17:00** (30 mins) Data Processing\n",
			
 
				+    "\n",
			
 
				+    "- File I/O\n",
			
 
				+    "- Data Processing\n",
			
 
				+    "- Memmap and Serialization\n",
			
 
				+    "- `numexpr`\n",
			
 
				+    "\n",
			
 
				+    "**17:00 - 17:25** Connecting Numpy with the Rest of the World\n",
			
 
				+    "\n",
			
 
				+    "- Machine Learning with scikit-learn\n",
			
 
				+    "\n",
			
 
				+    "**17:25 - 17:30** A look at the future (of Numpy)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "# Requirements"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "This tutorial requires the following packages:\n",
			
 
				+    "\n",
			
 
				+    "- Python version 2.7, 3.4+\n",
			
 
				+    "- `numpy` version 1.5 or later: http://www.numpy.org/\n",
			
 
				+    "- `scipy` version 0.9 or later: http://www.scipy.org/\n",
			
 
				+    "- `matplotlib` version 1.0 or later: http://matplotlib.org/\n",
			
 
				+    "- `ipython` version 1.0 or later, with notebook support: http://ipython.org\n",
			
 
				+    "\n",
			
 
				+    "(and for the *second part* of the tutorial):\n",
			
 
				+    "\n",
			
 
				+    "- `scikit-learn` version 0.12 or later: http://scikit-learn.org\n",
			
 
				+    "- `networkx` version 1.9.1 or later: https://networkx.github.io\n",
			
 
				+    "\n",
			
 
				+    "The easiest way to get these is to use an all-in-one installer such as [Anaconda](http://www.continuum.io/downloads) from Continuum. These are available for multiple architectures."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "source": [
			
 
				+    "# How to set up your environment"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## The simplest way"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "The easiest way to get these is to use the [conda](https://store.continuum.io) environment manager. \n",
			
 
				+    "\n",
			
 
				+    "I suggest downloading and installing [miniconda](http://conda.pydata.org/miniconda.html).\n",
			
 
				+    "\n",
			
 
				+    "The following command will install all required packages:\n",
			
 
				+    "\n",
			
 
				+    "    $ conda install numpy scipy matplotlib scikit-learn ipython-notebook\n",
			
 
				+    "    \n",
			
 
				+    "Alternatively, you can download and install the (very large) **Anaconda software distribution**, found at [https://store.continuum.io/](https://store.continuum.io/)."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## The \"longest\" way"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "1. Create your **Virtual Environment** (highly suggested)\n",
			
 
				+    "\n",
			
 
				+    "    - `$ virtualenv -p <path to the python interpreter you want to fork> numpy_training`\n",
			
 
				+    "    - `$ source numpy_training/bin/activate`\n",
			
 
				+    "\n",
			
 
				+    "2. **pip** on the run\n",
			
 
				+    "    - `pip install numpy`\n",
			
 
				+    "    - `pip install scipy`\n",
			
 
				+    "    - `pip install matplotlib`\n",
			
 
				+    "    - `pip install \"ipython[all]\"  # don't forget the quotation marks!`\n",
			
 
				+    "    - `pip install scikit-learn`"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## Alternatives\n",
			
 
				+    "\n",
			
 
				+    "- **Linux**: If you're on Linux, you can use your distribution's package manager\n",
			
 
				+    "\n",
			
 
				+    "    - Type, for example, `apt-get install python3-numpy` or `yum install numpy` (package names vary by distribution).\n",
			
 
				+    "    \n",
			
 
				+    "    \n",
			
 
				+    "\n",
			
 
				+    "- **Mac**: If you're on OSX, there are similar tools such as MacPorts or HomeBrew which contain pre-compiled versions of these packages.\n",
			
 
				+    "\n",
			
 
				+    "    - Just type `brew install numpy` in your terminal (if you're using HomeBrew)\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "- **Windows**: Windows can be challenging: the best bet is probably to use a package installer such as Anaconda, above."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "### Python Version"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "I'm currently running this tutorial with **Python 3** on **Anaconda**"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 10,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "Python 3.4.3 :: Anaconda 2.3.0 (x86_64)\r\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "!python --version"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "# How to test if everything is up and running"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## 1. Try running IPython with notebook support"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "!ipython notebook  # run this in your terminal"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## 2. Try to import everything"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 2,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "import numpy as np\n",
			
 
				+    "import scipy as sp\n",
			
 
				+    "import matplotlib.pyplot as plt\n",
			
 
				+    "import pandas as pd\n",
			
 
				+    "import sklearn"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## 3. Check Installed Versions "
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 6,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "numpy: 1.9.2\n",
			
 
				+      "scipy: 0.15.1\n",
			
 
				+      "matplotlib: 1.4.3\n",
			
 
				+      "iPython: 3.2.0\n",
			
 
				+      "scikit-learn: 0.16.1\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "import numpy\n",
			
 
				+    "print('numpy:', numpy.__version__)\n",
			
 
				+    "\n",
			
 
				+    "import scipy\n",
			
 
				+    "print('scipy:', scipy.__version__)\n",
			
 
				+    "\n",
			
 
				+    "import matplotlib\n",
			
 
				+    "print('matplotlib:', matplotlib.__version__)\n",
			
 
				+    "\n",
			
 
				+    "import IPython\n",
			
 
				+    "print('iPython:', IPython.__version__)\n",
			
 
				+    "\n",
			
 
				+    "import sklearn\n",
			
 
				+    "print('scikit-learn:', sklearn.__version__)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## 4. Enable the inline visualisation of plots"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 8,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "%matplotlib inline"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "<br>\n",
			
 
				+    "<hr>\n",
			
 
				+    "<h1 style=\"text-align: center;\">If everything worked up to here, you're ready to start!</h1>"
			
 
				+   ]
			
 
				+  }
			
 
				+ ],
			
 
				+ "metadata": {
			
 
				+  "kernelspec": {
			
 
				+   "display_name": "Python 3",
			
 
				+   "language": "python",
			
 
				+   "name": "python3"
			
 
				+  },
			
 
				+  "language_info": {
			
 
				+   "codemirror_mode": {
			
 
				+    "name": "ipython",
			
 
				+    "version": 3
			
 
				+   },
			
 
				+   "file_extension": ".py",
			
 
				+   "mimetype": "text/x-python",
			
 
				+   "name": "python",
			
 
				+   "nbconvert_exporter": "python",
			
 
				+   "pygments_lexer": "ipython3",
			
 
				+   "version": "3.4.3"
			
 
				+  }
			
 
				+ },
			
 
				+ "nbformat": 4,
			
 
				+ "nbformat_minor": 0
			
 
				+}
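The version check in the notebook above prints what is installed but leaves the comparison against the Requirements cell to the reader. A minimal sketch of automating that comparison follows. The helper names (`parse_version`, `at_least`, `check`) and the `MINIMUMS` table are hypothetical and not part of the tutorial; the minimum versions are copied from the Requirements cell, and the version parsing is deliberately naive (it reads only the leading numeric components).

```python
import importlib
import re

# Minimum versions copied from the Requirements cell of Preliminaries.ipynb.
# Keys are import names, so scikit-learn appears as "sklearn".
MINIMUMS = {
    "numpy": "1.5",
    "scipy": "0.9",
    "matplotlib": "1.0",
    "IPython": "1.0",
    "sklearn": "0.12",
}


def parse_version(text):
    """Turn '1.9.2' into (1, 9, 2); stop at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def at_least(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check(module_name, minimum):
    """Return (installed_version_or_None, requirement_satisfied)."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None, False
    installed = getattr(module, "__version__", None)
    if installed is None:
        return None, False
    return installed, at_least(installed, minimum)


if __name__ == "__main__":
    for name, minimum in MINIMUMS.items():
        installed, ok = check(name, minimum)
        status = "OK" if ok else "MISSING or too old"
        print("{0}: installed={1}, required>={2} -> {3}".format(
            name, installed, minimum, status))
```

A missing package is reported rather than raising, so the script can be run before anything is installed.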
			
--- a/Numpy.ipynb
+++ b/Numpy.ipynb
--- a/Slicing.ipynb
+++ b/Slicing.ipynb
--- a/03_Numpy_Operations.ipynb
+++ b/03_Numpy_Operations.ipynb
--- a/04_Data_Processing.ipynb
+++ b/04_Data_Processing.ipynb
--- a/05_Memmapping.ipynb
+++ b/05_Memmapping.ipynb
@@ -0,0 +1,540 @@
 
				+{
			
 
				+ "cells": [
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "# Memmapping"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "The numpy package makes it possible to memory map large contiguous chunks of binary files as shared memory for all the Python processes running on a given host:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 1,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "import numpy as np"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "* Creating a `numpy.memmap` instance with the `w+` mode creates a file on the filesystem and zeros its content. "
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 2,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "# Cleanup any existing file from past session (necessary for windows)\n",
			
 
				+    "import os\n",
			
 
				+    "\n",
			
 
				+    "current_dir = os.path.abspath(os.path.curdir)\n",
			
 
				+    "mmap_filepath = os.path.join(current_dir, 'files', 'small.mmap')\n",
			
 
				+    "if os.path.exists(mmap_filepath):\n",
			
 
				+    "    os.unlink(mmap_filepath)\n",
			
 
				+    "\n",
			
 
				+    "mm_w = np.memmap(mmap_filepath, shape=10, dtype=np.float32, mode='w+')\n",
			
 
				+    "print(mm_w)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "* This binary file can then be mapped as a new numpy array by all the engines having access to the same filesystem. \n",
			
 
				+    "* The `mode='r+'` opens this shared memory area in read-write mode:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 3,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "mm_r = np.memmap('files/small.mmap', dtype=np.float32, mode='r+')\n",
			
 
				+    "print(mm_r)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 4,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "[ 42.   0.   0.   0.   0.   0.   0.   0.   0.   0.]\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "mm_w[0] = 42\n",
			
 
				+    "print(mm_w)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 5,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "[ 42.   0.   0.   0.   0.   0.   0.   0.   0.   0.]\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "print(mm_r)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "* Memory mapped arrays created with `mode='r+'` can be modified, and the modifications are shared\n",
			
 
				+    "    - across multiple processes that map the same file"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 12,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "mm_r[1] = 43"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 13,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "[ 42.  43.   0.   0.   0.   0.   0.   0.   0.   0.]\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "print(mm_r)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "### Memmap Operations"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Memmap arrays generally behave very much like regular in-memory numpy arrays:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 14,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "85.0\n",
			
 
				+      "sum=85.0, mean=8.5, std=17.0014705657959\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "print(mm_r.sum())\n",
			
 
				+    "print(\"sum={0}, mean={1}, std={2}\".format(mm_r.sum(), \n",
			
 
				+    "                                          np.mean(mm_r), np.std(mm_r)))"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Before allocating more data let us define a couple of utility functions from the previous exercise (and more) to monitor what is used by which engine and what is still free on the cluster as a whole:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "* Let's allocate an 80MB memmap array:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 15,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "data": {
			
 
				+      "text/plain": [
			
 
				+       "memmap([ 0.,  0.,  0., ...,  0.,  0.,  0.])"
			
 
				+      ]
			
 
				+     },
			
 
				+     "execution_count": 15,
			
 
				+     "metadata": {},
			
 
				+     "output_type": "execute_result"
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "# Cleanup any existing file from past session (necessary for windows)\n",
			
 
				+    "import os\n",
			
 
				+    "if os.path.exists('files/big.mmap'):\n",
			
 
				+    "    os.unlink('files/big.mmap')\n",
			
 
				+    "\n",
			
 
				+    "np.memmap('files/big.mmap', shape=10 * int(1e6), dtype=np.float64, mode='w+')"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "No significant memory was used in this operation: we only asked the OS to allocate the buffer on the hard drive and to maintain a virtual memory area as a cheap reference to this buffer.\n",
			
 
				+    "\n",
			
 
				+    "Let's open new references to the same buffer from all the engines at once:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 17,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "CPU times: user 393 µs, sys: 577 µs, total: 970 µs\n",
			
 
				+      "Wall time: 773 µs\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "%time big_mmap = np.memmap('files/big.mmap', dtype=np.float64, mode='r+')"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 18,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "data": {
			
 
				+      "text/plain": [
			
 
				+       "memmap([ 0.,  0.,  0., ...,  0.,  0.,  0.])"
			
 
				+      ]
			
 
				+     },
			
 
				+     "execution_count": 18,
			
 
				+     "metadata": {},
			
 
				+     "output_type": "execute_result"
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "big_mmap"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "* Let's trigger an actual load of the data from the drive into the in-memory disk cache of the OS. This can take some time depending on the speed of the hard drive (on the order of 100MB/s to 300MB/s, hence roughly 0.3s to 0.8s for this 80MB dataset):"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 19,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "CPU times: user 39.4 ms, sys: 89.6 ms, total: 129 ms\n",
			
 
				+      "Wall time: 602 ms\n"
			
 
				+     ]
			
 
				+    },
			
 
				+    {
			
 
				+     "data": {
			
 
				+      "text/plain": [
			
 
				+       "memmap(0.0)"
			
 
				+      ]
			
 
				+     },
			
 
				+     "execution_count": 19,
			
 
				+     "metadata": {},
			
 
				+     "output_type": "execute_result"
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "%time np.sum(big_mmap)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "* Now run the same sum again, this time served from the in-memory disk cache:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 20,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "CPU times: user 16.6 ms, sys: 2.2 ms, total: 18.8 ms\n",
			
 
				+      "Wall time: 16.3 ms\n"
			
 
				+     ]
			
 
				+    },
			
 
				+    {
			
 
				+     "data": {
			
 
				+      "text/plain": [
			
 
				+       "memmap(0.0)"
			
 
				+      ]
			
 
				+     },
			
 
				+     "execution_count": 20,
			
 
				+     "metadata": {},
			
 
				+     "output_type": "execute_result"
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "%time np.sum(big_mmap)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "This strategy is very useful for loading the read-only datasets of machine learning problems, especially when the same data is reused over and over by concurrent processes, as can be the case when doing learning curve analysis or grid search (**Hyperparameter Optimisation** & **Model Selection**).\n",
			
 
				+    "\n",
			
 
				+    "This is of great importance in the case of multiple **embarrassingly** parallel processes (like **Grid Search**)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## Memmapping Nested Numpy-based Data Structures with Joblib"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "**joblib** is a utility library included in the **sklearn** package (as `sklearn.externals.joblib` in the version used here). Among other things, it provides tools to serialize objects that comprise large numpy arrays and reload them as memmap-backed data structures.\n",
			
 
				+    "\n",
			
 
				+    "To demonstrate it, let's create an arbitrary Python data structure involving numpy arrays:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 21,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "data": {
			
 
				+      "text/plain": [
			
 
				+       "(array([[ 0.,  0.,  0.,  0.],\n",
			
 
				+       "        [ 0.,  0.,  0.,  0.],\n",
			
 
				+       "        [ 0.,  0.,  0.,  0.]], dtype=float32), array([[1, 1, 1, 1],\n",
			
 
				+       "        [1, 1, 1, 1],\n",
			
 
				+       "        [1, 1, 1, 1]]))"
			
 
				+      ]
			
 
				+     },
			
 
				+     "execution_count": 21,
			
 
				+     "metadata": {},
			
 
				+     "output_type": "execute_result"
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "import numpy as np\n",
			
 
				+    "\n",
			
 
				+    "class MyDataStructure(object):\n",
			
 
				+    "    \n",
			
 
				+    "    def __init__(self, shape):\n",
			
 
				+    "        self.float_zeros = np.zeros(shape, dtype=np.float32)\n",
			
 
				+    "        self.integer_ones = np.ones(shape, dtype=np.int64)\n",
			
 
				+    "        \n",
			
 
				+    "data_structure = MyDataStructure((3, 4))\n",
			
 
				+    "data_structure.float_zeros, data_structure.integer_ones"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "We can now persist this data structure to disk:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 22,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "data": {
			
 
				+      "text/plain": [
			
 
				+       "['files/data_structure.pkl',\n",
			
 
				+       " 'files/data_structure.pkl_01.npy',\n",
			
 
				+       " 'files/data_structure.pkl_02.npy']"
			
 
				+      ]
			
 
				+     },
			
 
				+     "execution_count": 22,
			
 
				+     "metadata": {},
			
 
				+     "output_type": "execute_result"
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "from sklearn.externals import joblib\n",
			
 
				+    "joblib.dump(data_structure, 'files/data_structure.pkl')"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 23,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "-rw-r--r--  1 valerio  staff  267 Jul 21 10:17 files/data_structure.pkl\r\n",
			
 
				+      "-rw-r--r--  1 valerio  staff  176 Jul 21 10:17 files/data_structure.pkl_01.npy\r\n",
			
 
				+      "-rw-r--r--  1 valerio  staff  128 Jul 21 10:17 files/data_structure.pkl_02.npy\r\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "!ls -l files/data_structure*"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "A memmapped copy of this data structure can then be loaded:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 24,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "data": {
			
 
				+      "text/plain": [
			
 
				+       "(memmap([[ 0.,  0.,  0.,  0.],\n",
			
 
				+       "        [ 0.,  0.,  0.,  0.],\n",
			
 
				+       "        [ 0.,  0.,  0.,  0.]], dtype=float32), memmap([[1, 1, 1, 1],\n",
			
 
				+       "        [1, 1, 1, 1],\n",
			
 
				+       "        [1, 1, 1, 1]]))"
			
 
				+      ]
			
 
				+     },
			
 
				+     "execution_count": 24,
			
 
				+     "metadata": {},
			
 
				+     "output_type": "execute_result"
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "memmaped_data_structure = joblib.load('files/data_structure.pkl', \n",
			
 
				+    "                                      mmap_mode='r+')\n",
			
 
				+    "memmaped_data_structure.float_zeros, memmaped_data_structure.integer_ones"
			
 
				+   ]
			
 
				+  }
			
 
				+ ],
			
 
				+ "metadata": {
			
 
				+  "kernelspec": {
			
 
				+   "display_name": "Python 3",
			
 
				+   "language": "python",
			
 
				+   "name": "python3"
			
 
				+  },
			
 
				+  "language_info": {
			
 
				+   "codemirror_mode": {
			
 
				+    "name": "ipython",
			
 
				+    "version": 3
			
 
				+   },
			
 
				+   "file_extension": ".py",
			
 
				+   "mimetype": "text/x-python",
			
 
				+   "name": "python",
			
 
				+   "nbconvert_exporter": "python",
			
 
				+   "pygments_lexer": "ipython3",
			
 
				+   "version": "3.4.3"
			
 
				+  }
			
 
				+ },
			
 
				+ "nbformat": 4,
			
 
				+ "nbformat_minor": 0
			
 
				+}
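The memmapping notebook above covers the `w+` and `r+` modes. As a complement, here is a minimal, self-contained sketch of the other two `numpy.memmap` modes, `r` (read-only) and `c` (copy-on-write), using a temporary directory so that nothing is left behind. The file name `demo.mmap` and the variable names are illustrative assumptions, not part of the tutorial.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.mmap")  # hypothetical file name

    # 'w+' creates (or overwrites) the file and zero-fills it.
    writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(10,))
    writer[0] = 42
    writer.flush()  # push pending changes to the file on disk

    # 'r' maps the existing file read-only; the shape is inferred from
    # the file size and the dtype (40 bytes / 4 bytes per float32 = 10).
    reader = np.memmap(path, dtype=np.float32, mode="r")
    first_value = float(reader[0])
    n_items = reader.shape[0]

    # 'c' is copy-on-write: assignments change the in-memory copy only
    # and are never written back to the file.
    scratch = np.memmap(path, dtype=np.float32, mode="c")
    scratch[1] = 43
    unchanged_on_disk = float(np.fromfile(path, dtype=np.float32)[1])

    print(first_value, n_items, unchanged_on_disk)

    # Release the mappings before the directory is removed
    # (necessary on Windows, harmless elsewhere).
    del writer, reader, scratch
```

Read-only mode is the natural fit for the shared, read-only datasets discussed at the end of the notebook, since no process can accidentally modify the data seen by the others.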
			
--- a/06_Numexpr.ipynb
+++ b/06_Numexpr.ipynb
@@ -0,0 +1,252 @@
 
				+{
			
 
				+ "cells": [
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "# Numexpr\n",
			
 
				+    "\n",
			
 
				+    "**Numexpr** is a fast numerical expression evaluator for NumPy. \n",
			
 
				+    "\n",
			
 
				+    "With it, expressions that operate on arrays (like `3*a+4*b`) are accelerated and use less memory than doing the same calculation in Python.\n",
			
 
				+    "\n",
			
 
				+    "In addition, its **multi-threaded capabilities** can make use of all your cores, which may accelerate computations, especially if they are not memory-bound.\n",
			
 
				+    "\n",
			
 
				+    "Last but not least, `numexpr` can make use of Intel's VML (Vector Math Library, normally integrated in its Math Kernel Library, or MKL). This allows further acceleration of transcendental (i.e., non-polynomial) expressions.\n",
			
 
				+    "\n",
			
 
				+    "**GitHub**: [https://github.com/pydata/numexpr#what-it-is-numexpr](https://github.com/pydata/numexpr#what-it-is-numexpr)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## Some Examples \n",
			
 
				+    "\n",
			
 
				+    "(gathered from the `numexpr` documentation)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 1,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "import numpy as np\n",
			
 
				+    "import numexpr as ne"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 2,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "a = np.arange(1e6)   # Choose large arrays for better speedups\n",
			
 
				+    "b = np.arange(1e6)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 3,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "data": {
			
 
				+      "text/plain": [
			
 
				+       "array([  1.00000000e+00,   2.00000000e+00,   3.00000000e+00, ...,\n",
			
 
				+       "         9.99998000e+05,   9.99999000e+05,   1.00000000e+06])"
			
 
				+      ]
			
 
				+     },
			
 
				+     "execution_count": 3,
			
 
				+     "metadata": {},
			
 
				+     "output_type": "execute_result"
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "ne.evaluate(\"a + 1\")   # a simple expression"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 4,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "data": {
			
 
				+      "text/plain": [
			
 
				+       "array([False, False, False, ...,  True,  True,  True], dtype=bool)"
			
 
				+      ]
			
 
				+     },
			
 
				+     "execution_count": 4,
			
 
				+     "metadata": {},
			
 
				+     "output_type": "execute_result"
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "ne.evaluate('a*b-4.1*a > 2.5*b')   # a more complex one"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 5,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "data": {
			
 
				+      "text/plain": [
			
 
				+       "array([        nan,  1.72284457,  1.79067101, ...,  1.09567006,\n",
			
 
				+       "        0.17523598, -0.09597844])"
			
 
				+      ]
			
 
				+     },
			
 
				+     "execution_count": 5,
			
 
				+     "metadata": {},
			
 
				+     "output_type": "execute_result"
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "ne.evaluate(\"sin(a) + arcsinh(a/b)\")   # you can also use functions"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## Time Comparison with Numpy"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 8,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "100 loops, best of 3: 3.11 ms per loop\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "%timeit a+1"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 9,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "100 loops, best of 3: 2.8 ms per loop\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "%timeit ne.evaluate(\"a + 1\")"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 10,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "100 loops, best of 3: 15.9 ms per loop\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "%timeit a*b-4.1*a > 2.5*b"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 12,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "100 loops, best of 3: 3.13 ms per loop\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "%timeit ne.evaluate('a*b-4.1*a > 2.5*b')"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "### (some) preliminary conclusions\n",
			
 
				+    "\n",
			
 
				+    "* numexpr is (generally) slow with small arrays\n",
			
 
				+    "* numexpr is very fast with large arrays and complex operations (here about 5x faster: 3.13 ms vs 15.9 ms for `a*b-4.1*a > 2.5*b`)\n",
			
 
				+    "* numpy is terrific with in-place operations"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## Numexpr Supported Datatypes\n",
			
 
				+    "\n",
			
 
				+    "* 8-bit boolean (bool)\n",
			
 
				+    "* 32-bit signed integer (int or int32)\n",
			
 
				+    "* 64-bit signed integer (long or int64)\n",
			
 
				+    "* 32-bit single-precision floating point number (float or float32)\n",
			
 
				+    "* 64-bit, double-precision floating point number (double or float64)\n",
			
 
				+    "* 2x64-bit, double-precision complex number (complex or complex128)\n",
			
 
				+    "* Raw string of bytes (str)"
			
 
				+   ]
			
 
				+  }
			
 
				+ ],
			
 
				+ "metadata": {
			
 
				+  "kernelspec": {
			
 
				+   "display_name": "Python 3",
			
 
				+   "language": "python",
			
 
				+   "name": "python3"
			
 
				+  },
			
 
				+  "language_info": {
			
 
				+   "codemirror_mode": {
			
 
				+    "name": "ipython",
			
 
				+    "version": 3
			
 
				+   },
			
 
				+   "file_extension": ".py",
			
 
				+   "mimetype": "text/x-python",
			
 
				+   "name": "python",
			
 
				+   "nbconvert_exporter": "python",
			
 
				+   "pygments_lexer": "ipython3",
			
 
				+   "version": "3.4.3"
			
 
				+  }
			
 
				+ },
			
 
				+ "nbformat": 4,
			
 
				+ "nbformat_minor": 0
			
 
				+}
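
The Numexpr notebook above attributes part of the speedup to lower memory use. As a rough illustration only, and not numexpr's actual implementation, the sketch below evaluates the same comparison block by block in plain NumPy, so that the temporaries for `a*b`, `4.1*a` and `2.5*b` never grow beyond one block. The helper name `chunked_compare` and the block size of 4096 elements are assumptions chosen for this example.

```python
import numpy as np


def chunked_compare(a, b, chunk=4096):
    """Evaluate a*b - 4.1*a > 2.5*b one block at a time (hypothetical helper)."""
    out = np.empty(a.shape, dtype=bool)
    for start in range(0, a.size, chunk):
        stop = start + chunk
        x = a[start:stop]          # views, no copy
        y = b[start:stop]
        # Temporaries created here hold at most `chunk` elements
        out[start:stop] = x * y - 4.1 * x > 2.5 * y
    return out


a = np.arange(1e5)
b = np.arange(1e5)

result = chunked_compare(a, b)
reference = a * b - 4.1 * a > 2.5 * b   # whole-array NumPy evaluation
print(np.array_equal(result, reference))
```

In pure Python the loop overhead usually cancels any gain; the point is only to show why keeping intermediates small can help on large arrays.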
			
--- a/07_0_MachineLearning_Data.ipynb
+++ b/07_0_MachineLearning_Data.ipynb
--- a/07_1_Sparse_Matrices.ipynb
+++ b/07_1_Sparse_Matrices.ipynb
@@ -0,0 +1,500 @@
 
				+{
			
 
				+ "cells": [
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {
			
 
				+    "slideshow": {
			
 
				+     "slide_type": "skip"
			
 
				+    }
			
 
				+   },
			
 
				+   "source": [
			
 
				+    "<small><i>This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2014. Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_pycon2014/).</i></small>"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {
			
 
				+    "slideshow": {
			
 
				+     "slide_type": "slide"
			
 
				+    }
			
 
				+   },
			
 
				+   "source": [
			
 
				+    "# Scipy Sparse Matrices"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "**Sparse matrices** store only the non-zero entries of a matrix, which saves memory and time when most entries are zero.  \n",
			
 
				+    "\n",
			
 
				+    "For example, in some machine learning tasks, especially those associated\n",
			
 
				+    "with textual analysis, the data may be mostly zeros.  \n",
			
 
				+    "\n",
			
 
				+    "Storing all these zeros is very inefficient.  \n",
			
 
				+    "\n",
			
 
				+    "We can create and manipulate sparse matrices as follows:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 2,
			
 
				+   "metadata": {
			
 
				+    "collapsed": true
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "import numpy as np"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 4,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false,
			
 
				+    "slideshow": {
			
 
				+     "slide_type": "subslide"
			
 
				+    }
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "[[ 0.92071168  0.66941621  0.30097014  0.8668366   0.94764952]\n",
			
 
				+      " [ 0.16978456  0.59292571  0.78884569  0.76910071  0.56415941]\n",
			
 
				+      " [ 0.096867    0.96869327  0.8643055   0.0297782   0.11921581]\n",
			
 
				+      " [ 0.22387061  0.71015351  0.45882072  0.34433871  0.85566776]\n",
			
 
				+      " [ 0.22217957  0.83387745  0.40605966  0.41212024  0.65548993]\n",
			
 
				+      " [ 0.53416368  0.92406734  0.66444729  0.57218427  0.48198361]\n",
			
 
				+      " [ 0.37469397  0.33167227  0.9107519   0.03360275  0.20205017]\n",
			
 
				+      " [ 0.39939621  0.61025928  0.14715445  0.86871212  0.25921407]\n",
			
 
				+      " [ 0.07210422  0.99690991  0.31477122  0.49698491  0.34563232]\n",
			
 
				+      " [ 0.10310154  0.3806856   0.77690381  0.46116052  0.43330533]]\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "# Create a random array (most entries are set to zero in the next cell)\n",
			
 
				+    "X = np.random.random((10, 5))\n",
			
 
				+    "print(X)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 5,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false,
			
 
				+    "slideshow": {
			
 
				+     "slide_type": "subslide"
			
 
				+    }
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "[[ 0.92071168  0.          0.          0.8668366   0.94764952]\n",
			
 
				+      " [ 0.          0.          0.78884569  0.76910071  0.        ]\n",
			
 
				+      " [ 0.          0.96869327  0.8643055   0.          0.        ]\n",
			
 
				+      " [ 0.          0.71015351  0.          0.          0.85566776]\n",
			
 
				+      " [ 0.          0.83387745  0.          0.          0.        ]\n",
			
 
				+      " [ 0.          0.92406734  0.          0.          0.        ]\n",
			
 
				+      " [ 0.          0.          0.9107519   0.          0.        ]\n",
			
 
				+      " [ 0.          0.          0.          0.86871212  0.        ]\n",
			
 
				+      " [ 0.          0.99690991  0.          0.          0.        ]\n",
			
 
				+      " [ 0.          0.          0.77690381  0.          0.        ]]\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "X[X < 0.7] = 0\n",
			
 
				+    "print(X)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 6,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false,
			
 
				+    "slideshow": {
			
 
				+     "slide_type": "subslide"
			
 
				+    }
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "  (0, 0)\t0.920711681384\n",
			
 
				+      "  (0, 3)\t0.866836604396\n",
			
 
				+      "  (0, 4)\t0.947649515452\n",
			
 
				+      "  (1, 2)\t0.788845688727\n",
			
 
				+      "  (1, 3)\t0.769100712548\n",
			
 
				+      "  (2, 1)\t0.968693269052\n",
			
 
				+      "  (2, 2)\t0.864305496772\n",
			
 
				+      "  (3, 1)\t0.710153508323\n",
			
 
				+      "  (3, 4)\t0.855667757095\n",
			
 
				+      "  (4, 1)\t0.833877448584\n",
			
 
				+      "  (5, 1)\t0.924067342994\n",
			
 
				+      "  (6, 2)\t0.910751902907\n",
			
 
				+      "  (7, 3)\t0.868712121221\n",
			
 
				+      "  (8, 1)\t0.996909907387\n",
			
 
				+      "  (9, 2)\t0.776903807028\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "from scipy import sparse\n",
			
 
				+    "\n",
			
 
				+    "# turn X into a csr (Compressed-Sparse-Row) matrix\n",
			
 
				+    "X_csr = sparse.csr_matrix(X)\n",
			
 
				+    "print(X_csr)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 7,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false,
			
 
				+    "slideshow": {
			
 
				+     "slide_type": "subslide"
			
 
				+    }
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "[[ 0.92071168  0.          0.          0.8668366   0.94764952]\n",
			
 
				+      " [ 0.          0.          0.78884569  0.76910071  0.        ]\n",
			
 
				+      " [ 0.          0.96869327  0.8643055   0.          0.        ]\n",
			
 
				+      " [ 0.          0.71015351  0.          0.          0.85566776]\n",
			
 
				+      " [ 0.          0.83387745  0.          0.          0.        ]\n",
			
 
				+      " [ 0.          0.92406734  0.          0.          0.        ]\n",
			
 
				+      " [ 0.          0.          0.9107519   0.          0.        ]\n",
			
 
				+      " [ 0.          0.          0.          0.86871212  0.        ]\n",
			
 
				+      " [ 0.          0.99690991  0.          0.          0.        ]\n",
			
 
				+      " [ 0.          0.          0.77690381  0.          0.        ]]\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "# convert the sparse matrix to a dense array\n",
			
 
				+    "print(X_csr.toarray())"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 8,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false,
			
 
				+    "slideshow": {
			
 
				+     "slide_type": "subslide"
			
 
				+    }
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "data": {
			
 
				+      "text/plain": [
			
 
				+       "True"
			
 
				+      ]
			
 
				+     },
			
 
				+     "execution_count": 8,
			
 
				+     "metadata": {},
			
 
				+     "output_type": "execute_result"
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "# Sparse matrices support linear algebra:\n",
			
 
				+    "y = np.random.random(X_csr.shape[1])\n",
			
 
				+    "z1 = X_csr.dot(y)\n",
			
 
				+    "z2 = X.dot(y)\n",
			
 
				+    "np.allclose(z1, z2)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {
			
 
				+    "slideshow": {
			
 
				+     "slide_type": "subslide"
			
 
				+    }
			
 
				+   },
			
 
				+   "source": [
			
 
				+    "* The CSR representation can be very efficient for computations, but it is not as good for adding elements.  \n",
			
 
				+    "\n",
			
 
				+    "* For that, the **LIL** (List of Lists) representation is better:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 9,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false,
			
 
				+    "slideshow": {
			
 
				+     "slide_type": "fragment"
			
 
				+    }
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "  (0, 2)\t2.0\n",
			
 
				+      "  (1, 1)\t2.0\n",
			
 
				+      "  (1, 2)\t3.0\n",
			
 
				+      "  (1, 3)\t4.0\n",
			
 
				+      "  (2, 0)\t2.0\n",
			
 
				+      "  (2, 3)\t5.0\n",
			
 
				+      "  (2, 4)\t6.0\n",
			
 
				+      "  (3, 0)\t3.0\n",
			
 
				+      "  (3, 1)\t4.0\n",
			
 
				+      "  (3, 4)\t7.0\n",
			
 
				+      "  (4, 2)\t6.0\n",
			
 
				+      "  (4, 3)\t7.0\n",
			
 
				+      "[[ 0.  0.  2.  0.  0.]\n",
			
 
				+      " [ 0.  2.  3.  4.  0.]\n",
			
 
				+      " [ 2.  0.  0.  5.  6.]\n",
			
 
				+      " [ 3.  4.  0.  0.  7.]\n",
			
 
				+      " [ 0.  0.  6.  7.  0.]]\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "# Create an empty LIL matrix and add some items\n",
			
 
				+    "X_lil = sparse.lil_matrix((5, 5))\n",
			
 
				+    "\n",
			
 
				+    "for i, j in np.random.randint(0, 5, (15, 2)):\n",
			
 
				+    "    X_lil[i, j] = i + j\n",
			
 
				+    "\n",
			
 
				+    "print(X_lil)\n",
			
 
				+    "print(X_lil.toarray())"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {
			
 
				+    "slideshow": {
			
 
				+     "slide_type": "subslide"
			
 
				+    }
			
 
				+   },
			
 
				+   "source": [
			
 
				+    "* Often, once an LIL matrix is created, it is useful to convert it to CSR format\n",
			
 
				+    "    * **Note**: many scikit-learn algorithms require CSR or CSC format"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 10,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false,
			
 
				+    "slideshow": {
			
 
				+     "slide_type": "fragment"
			
 
				+    }
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "name": "stdout",
			
 
				+     "output_type": "stream",
			
 
				+     "text": [
			
 
				+      "  (0, 2)\t2.0\n",
			
 
				+      "  (1, 1)\t2.0\n",
			
 
				+      "  (1, 2)\t3.0\n",
			
 
				+      "  (1, 3)\t4.0\n",
			
 
				+      "  (2, 0)\t2.0\n",
			
 
				+      "  (2, 3)\t5.0\n",
			
 
				+      "  (2, 4)\t6.0\n",
			
 
				+      "  (3, 0)\t3.0\n",
			
 
				+      "  (3, 1)\t4.0\n",
			
 
				+      "  (3, 4)\t7.0\n",
			
 
				+      "  (4, 2)\t6.0\n",
			
 
				+      "  (4, 3)\t7.0\n"
			
 
				+     ]
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "X_csr = X_lil.tocsr()\n",
			
 
				+    "print(X_csr)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {
			
 
				+    "slideshow": {
			
 
				+     "slide_type": "subslide"
			
 
				+    }
			
 
				+   },
			
 
				+   "source": [
			
 
				+    "There are several other sparse formats that can be useful for various problems:\n",
			
 
				+    "\n",
			
 
				+    "- `CSC` (compressed sparse column)\n",
			
 
				+    "- `BSR` (block sparse row)\n",
			
 
				+    "- `COO` (coordinate)\n",
			
 
				+    "- `DIA` (diagonal)\n",
			
 
				+    "- `DOK` (dictionary of keys)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## CSC - Compressed Sparse Column\n",
			
 
				+    "\n",
			
 
				+    "**Advantages of the CSC format**\n",
			
 
				+    "\n",
			
 
				+    "* efficient arithmetic operations CSC + CSC, CSC * CSC, etc.\n",
			
 
				+    "* efficient column slicing\n",
			
 
				+    "* fast matrix vector products (CSR, BSR may be faster)\n",
			
 
				+    "\n",
			
 
				+    "**Disadvantages of the CSC format**\n",
			
 
				+    "\n",
			
 
				+    "* slow row slicing operations (consider CSR)\n",
			
 
				+    "* changes to the sparsity structure are expensive (consider LIL or DOK)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## BSR - Block Sparse Row\n",
			
 
				+    "\n",
			
 
				+    "The Block Compressed Row (`BSR`) format is very similar to the Compressed Sparse Row (`CSR`) format. \n",
			
 
				+    "\n",
			
 
				+    "BSR is appropriate for sparse matrices with *dense sub-matrices* like the example below. \n",
			
 
				+    "\n",
			
 
				+    "Block matrices often arise in *vector-valued* finite element discretizations. \n",
			
 
				+    "\n",
			
 
				+    "In such cases, BSR is **considerably more efficient** than CSR and CSC for many sparse arithmetic operations."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 12,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "data": {
			
 
				+      "text/plain": [
			
 
				+       "array([[1, 1, 0, 0, 2, 2],\n",
			
 
				+       "       [1, 1, 0, 0, 2, 2],\n",
			
 
				+       "       [0, 0, 0, 0, 3, 3],\n",
			
 
				+       "       [0, 0, 0, 0, 3, 3],\n",
			
 
				+       "       [4, 4, 5, 5, 6, 6],\n",
			
 
				+       "       [4, 4, 5, 5, 6, 6]])"
			
 
				+      ]
			
 
				+     },
			
 
				+     "execution_count": 12,
			
 
				+     "metadata": {},
			
 
				+     "output_type": "execute_result"
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "from scipy.sparse import bsr_matrix\n",
			
 
				+    "\n",
			
 
				+    "indptr = np.array([0, 2, 3, 6])\n",
			
 
				+    "indices = np.array([0, 2, 2, 0, 1, 2])\n",
			
 
				+    "data = np.array([1, 2, 3, 4, 5, 6]).repeat(4).reshape(6, 2, 2)\n",
			
 
				+    "bsr_matrix((data,indices,indptr), shape=(6, 6)).toarray()"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## COO - Coordinate Sparse Matrix\n",
			
 
				+    "\n",
			
 
				+    "**Advantages of the COO format**\n",
			
 
				+    "\n",
			
 
				+    "* facilitates fast conversion among sparse formats\n",
			
 
				+    "* permits duplicate entries (see example)\n",
			
 
				+    "* very fast conversion to and from CSR/CSC formats\n",
			
 
				+    "\n",
			
 
				+    "**Disadvantages of the COO format**\n",
			
 
				+    "\n",
			
 
				+    "* does not directly support arithmetic operations and slicing\n",
			
 
				+    "\n",
			
 
				+    "**Intended Usage**\n",
			
 
				+    "\n",
			
 
				+    "* COO is a fast format for constructing sparse matrices\n",
			
 
				+    "* Once a matrix has been constructed, convert to CSR or CSC format for fast arithmetic and matrix vector operations\n",
			
 
				+    "* By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like.\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## DOK - Dictionary of Keys\n",
			
 
				+    "\n",
			
 
				+    "Sparse matrices can be used in arithmetic operations: they support addition, subtraction, multiplication, division, and matrix power.\n",
			
 
				+    "\n",
			
 
				+    "The DOK format allows efficient O(1) access to individual elements. Duplicates are not allowed. Once constructed, a DOK matrix can be efficiently converted to a `coo_matrix`."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": 15,
			
 
				+   "metadata": {
			
 
				+    "collapsed": false
			
 
				+   },
			
 
				+   "outputs": [
			
 
				+    {
			
 
				+     "data": {
			
 
				+      "text/plain": [
			
 
				+       "array([[ 0.,  1.,  2.,  3.,  4.],\n",
			
 
				+       "       [ 0.,  2.,  3.,  4.,  5.],\n",
			
 
				+       "       [ 0.,  0.,  4.,  5.,  6.],\n",
			
 
				+       "       [ 0.,  0.,  0.,  6.,  7.],\n",
			
 
				+       "       [ 0.,  0.,  0.,  0.,  8.]], dtype=float32)"
			
 
				+      ]
			
 
				+     },
			
 
				+     "execution_count": 15,
			
 
				+     "metadata": {},
			
 
				+     "output_type": "execute_result"
			
 
				+    }
			
 
				+   ],
			
 
				+   "source": [
			
 
				+    "from scipy.sparse import dok_matrix\n",
			
 
				+    "S = dok_matrix((5, 5), dtype=np.float32)\n",
			
 
				+    "for i in range(5):\n",
			
 
				+    "    for j in range(i, 5):\n",
			
 
				+    "        S[i,j] = i+j\n",
			
 
				+    "        \n",
			
 
				+    "S.toarray()"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "The ``scipy.sparse`` submodule also has a lot of functions for sparse matrices\n",
			
 
				+    "including linear algebra, sparse solvers, graph algorithms, and much more."
			
 
				+   ]
			
 
				+  }
			
 
				+ ],
			
 
				+ "metadata": {
			
 
				+  "kernelspec": {
			
 
				+   "display_name": "Python 3",
			
 
				+   "language": "python",
			
 
				+   "name": "python3"
			
 
				+  },
			
 
				+  "language_info": {
			
 
				+   "codemirror_mode": {
			
 
				+    "name": "ipython",
			
 
				+    "version": 3
			
 
				+   },
			
 
				+   "file_extension": ".py",
			
 
				+   "mimetype": "text/x-python",
			
 
				+   "name": "python",
			
 
				+   "nbconvert_exporter": "python",
			
 
				+   "pygments_lexer": "ipython3",
			
 
				+   "version": "3.4.3"
			
 
				+  }
			
 
				+ },
			
 
				+ "nbformat": 4,
			
 
				+ "nbformat_minor": 0
			
 
				+}
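
The COO section of the sparse-matrix notebook above mentions that duplicate entries are permitted and are summed on conversion to CSR or CSC, but no cell demonstrates it. A minimal sketch of that behaviour follows; the index and value arrays are made up for illustration, and the variable names `X_coo` and `X_csr` are assumptions chosen to match the notebook's naming style.

```python
import numpy as np
from scipy import sparse

# Coordinates (row, col) with deliberate duplicates:
# (0, 0) appears three times and (1, 1) appears twice.
row = np.array([0, 0, 1, 3, 1, 0, 0])
col = np.array([0, 2, 1, 3, 1, 0, 0])
data = np.array([1, 1, 1, 1, 1, 1, 1])

# COO keeps every (row, col, value) triple as given, duplicates included
X_coo = sparse.coo_matrix((data, (row, col)), shape=(4, 4))

# Converting to CSR sums the values that share the same coordinates
X_csr = X_coo.tocsr()
dense = X_csr.toarray()
print(dense)
```

This is why COO is convenient for assembling a matrix from many small contributions: each contribution is appended as a triple, and the summation happens once at conversion time.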
			
--- a/files/big.mmap
+++ b/files/big.mmap
--- a/files/data_structure.pkl
+++ b/files/data_structure.pkl
--- a/files/data_structure.pkl_01.npy
+++ b/files/data_structure.pkl_01.npy
--- a/files/data_structure.pkl_02.npy
+++ b/files/data_structure.pkl_02.npy
--- a/files/matlab_test_data_01.mat
+++ b/files/matlab_test_data_01.mat
--- a/files/matlab_test_data_02.mat
+++ b/files/matlab_test_data_02.mat
--- a/files/random-matrix.csv
+++ b/files/random-matrix.csv
@@ -0,0 +1,3 @@
 
				+0.31318 0.20088 0.41317
			
 
				+0.73103 0.06485 0.65212
			
 
				+0.48175 0.95090 0.55600
			
--- a/files/random-matrix.npy
+++ b/files/random-matrix.npy
--- a/files/small.mmap
+++ b/files/small.mmap
--- a/files/stockholm_td_adj.dat
+++ b/files/stockholm_td_adj.dat
--- a/files/test.mat
+++ b/files/test.mat
--- a/images/cluster_0.png
+++ b/images/cluster_0.png
--- a/images/cluster_1.png
+++ b/images/cluster_1.png
--- a/images/euroscipy_logo.png
+++ b/images/euroscipy_logo.png
--- a/images/iris_setosa.jpg
+++ b/images/iris_setosa.jpg
--- a/images/iris_versicolor.jpg
+++ b/images/iris_versicolor.jpg
--- a/images/iris_virginica.jpg
+++ b/images/iris_virginica.jpg
--- a/images/modeling_data_flow.png
+++ b/images/modeling_data_flow.png
--- a/images/ndarray.png
+++ b/images/ndarray.png
--- a/images/ndarray_with_details.png
+++ b/images/ndarray_with_details.png
--- a/images/reference.png
+++ b/images/reference.png
--- a/images/storage_index.png
+++ b/images/storage_index.png
--- a/images/storage_simple.png
+++ b/images/storage_simple.png