{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to SciPy\n",
    "Tutorial at EuroSciPy 2019, Bilbao\n",
    "## 1. Statistical analysis – Gyroscope data taken in a TGV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy import optimize, stats\n",
    "%matplotlib notebook\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Importing the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our gyroscope data are stored in a compressed file `data/TGV_data.csv.bz2`. Each row of the uncompressed file contains entries separated by commas and the first row contains labels explaining the content of the respective column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = np.genfromtxt('data/TGV_data.csv.bz2', delimiter=',', names=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are five columns identified by names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.dtype.names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "time = data['Time_s']\n",
    "omega_x = data['Gyroscope_x_rads']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us first get an idea of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = plt.plot(time, omega_x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Statistical analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the data for $\\omega_x$ to demonstrate some aspects of statistical analysis with SciPy. Let us first take a look at a histogram of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n, bins = np.histogram(omega_x, bins=100, density=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because we have set `density=True`, the data can be considered as a normalized probability distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.sum(n)*(bins[1]-bins[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"color: DarkBlue;background-color: LightYellow\">\n",
    "    <i>Exercise:</i> Check the values of n. What happens if you set density=False?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The array `bins` contains the edges of the bins. To plot the histogram, we need their centers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bincenters = 0.5*(bins[1:]+bins[:-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = plt.plot(bincenters, n)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With SciPy we can easily determine some statistical characteristics from the original data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "description = stats.describe(omega_x, ddof=False)\n",
    "description"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* mean $\\langle x\\rangle$\n",
    "\n",
    "\n",
    "* variance $\\langle (x-\\langle x\\rangle)^2\\rangle$\n",
    "\n",
    "\n",
    "* skewness $\\dfrac{\\langle(x-\\langle x\\rangle)^3\\rangle}{\\langle(x-\\langle x\\rangle)^2\\rangle^{3/2}}$\n",
    "\n",
    "\n",
    "* kurtosis $\\dfrac{\\langle(x-\\langle x\\rangle)^4\\rangle}{\\langle(x-\\langle x\\rangle)^2\\rangle^2}-3$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"color: DarkBlue;background-color: LightYellow\">\n",
    "    <i>Exercise:</i> Check the result for the kurtosis as there are two different definitions with 3 subtracted or not.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can compare our histogram with a Gaussian for the mean and variance just obtained.\n",
    "\n",
    "Probability density function of the normal distribution\n",
    "$$p(x) = \\frac{1}{\\sqrt{2\\pi}}\\exp\\left(-\\frac{x^2}{2}\\right)$$\n",
    "\n",
    "For general mean and variance:\n",
    "$$x = \\frac{y-\\mathrm{loc}}{\\mathrm{scale}}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loc = description.mean\n",
    "scale = np.sqrt(description.variance)\n",
    "loc, scale"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.linspace(-0.1, 0.1, 200)\n",
    "_ = plt.plot(x, stats.norm.pdf(x, loc, scale))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A maximum likelihood fit with a Gaussian will yield the same result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stats.norm.fit(omega_x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How can we fit the histogram to a Gaussian? Use nonlinear least-squares curve fitting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def gaussian(x, loc, scale):\n",
    "    return stats.norm.pdf(x, loc, scale)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "popt, pcov = optimize.curve_fit(gaussian, bincenters, n)\n",
    "popt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = plt.plot(x, gaussian(x, *popt))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How likely is it that our data follow a normal distribution?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stats.normaltest(omega_x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"color: DarkBlue;background-color: LightYellow\">\n",
    "    <i>Exercise:</i> Differ the results for the <i>y</i>-component of the angular velocity from those for the <i>x</i>-component?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generation of normally distributed data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "normdata = stats.norm.rvs(1, 0.2, 5000)\n",
    "_ = plt.plot(normdata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n, bins = np.histogram(normdata, bins=100, density=True)\n",
    "bincenters = 0.5*(bins[1:]+bins[:-1])\n",
    "plt.plot(bincenters, n)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.linspace(0, 1.6, 200)\n",
    "_ = plt.plot(x, stats.norm.pdf(x, *stats.norm.fit(normdata)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "popt, pcov = optimize.curve_fit(gaussian, bincenters, n)\n",
    "popt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = plt.plot(x, gaussian(x, *popt))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stats.normaltest(normdata)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"color: DarkBlue;background-color: LightYellow\">\n",
    "    <i>Exercise:</i> Do these results depend on the number of samples and the realization of the random variates?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Distribution with skewness"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for a in range(-4, 5, 2):\n",
    "    x = np.linspace(stats.skewnorm.ppf(0.001, a),\n",
    "                    stats.skewnorm.ppf(0.999, a), 100)\n",
    "    plt.plot(x, stats.skewnorm.pdf(x, a))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}