{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to SciPy\n", "Tutorial at EuroSciPy 2019, Bilbao\n", "## 1. Statistical analysis – Gyroscope data taken in a TGV" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy import optimize, stats\n", "%matplotlib notebook\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Importing the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our gyroscope data are stored in a compressed file `data/TGV_data.csv.bz2`. Each row of the uncompressed file contains entries separated by commas and the first row contains labels explaining the content of the respective column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = np.genfromtxt('data/TGV_data.csv.bz2', delimiter=',', names=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are five columns identified by names:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.dtype.names" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "time = data['Time_s']\n", "omega_x = data['Gyroscope_x_rads']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us first get an idea of the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "_ = plt.plot(time, omega_x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Statistical analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the data for $\\omega_x$ to demonstrate some aspects of statistical analysis with SciPy. Let us first take a look at a histogram of the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n, bins = np.histogram(omega_x, bins=100, density=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we have set `density=True`, the data can be considered as a normalized probability distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.sum(n)*(bins[1]-bins[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Exercise: Check the values of n. What happens if you set density=False?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The array `bins` contains the edges of the bins. To plot the histogram, we need their centers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bincenters = 0.5*(bins[1:]+bins[:-1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "_ = plt.plot(bincenters, n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With SciPy we can easily determine some statistical characteristics from the original data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "description = stats.describe(omega_x, ddof=False)\n", "description" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* mean $\\langle x\\rangle$\n", "\n", "\n", "* variance $\\langle (x-\\langle x\\rangle)^2\\rangle$\n", "\n", "\n", "* skewness $\\dfrac{\\langle(x-\\langle x\\rangle)^3\\rangle}{\\langle(x-\\langle x\\rangle)^2\\rangle^{3/2}}$\n", "\n", "\n", "* kurtosis $\\dfrac{\\langle(x-\\langle x\\rangle)^4\\rangle}{\\langle(x-\\langle x\\rangle)^2\\rangle^2}-3$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Exercise: Check the result for the kurtosis as there are two different definitions with 3 subtracted or not.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compare our histogram with a Gaussian for the mean and variance just obtained.\n", "\n", "Probability density function of the normal distribution\n", "$$p(x) = \\frac{1}{\\sqrt{2\\pi}}\\exp\\left(-\\frac{x^2}{2}\\right)$$\n", "\n", "For general mean and variance:\n", "$$x = \\frac{y-\\mathrm{loc}}{\\mathrm{scale}}$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "loc = description.mean\n", "scale = np.sqrt(description.variance)\n", "loc, scale" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.linspace(-0.1, 0.1, 200)\n", "_ = plt.plot(x, stats.norm.pdf(x, loc, scale))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A maximum likelihood fit with a Gaussian will yield the same result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stats.norm.fit(omega_x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How can we fit the histogram to a Gaussian? Use nonlinear least-squares curve fitting." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def gaussian(x, loc, scale):\n", " return stats.norm.pdf(x, loc, scale)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "popt, pcov = optimize.curve_fit(gaussian, bincenters, n)\n", "popt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "_ = plt.plot(x, gaussian(x, *popt))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How likely is it that our data follow a normal distribution?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stats.normaltest(omega_x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Exercise: Differ the results for the y-component of the angular velocity from those for the x-component?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generation of normally distributed data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "normdata = stats.norm.rvs(1, 0.2, 5000)\n", "_ = plt.plot(normdata)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n, bins = np.histogram(normdata, bins=100, density=True)\n", "bincenters = 0.5*(bins[1:]+bins[:-1])\n", "plt.plot(bincenters, n)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.linspace(0, 1.6, 200)\n", "_ = plt.plot(x, stats.norm.pdf(x, *stats.norm.fit(normdata)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "popt, pcov = optimize.curve_fit(gaussian, bincenters, n)\n", "popt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "_ = plt.plot(x, gaussian(x, *popt))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stats.normaltest(normdata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Exercise: Do these results depend on the number of samples and the realization of the random variates?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Distribution with skewness" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for a in range(-4, 5, 2):\n", " x = np.linspace(stats.skewnorm.ppf(0.001, a),\n", " stats.skewnorm.ppf(0.999, a), 100)\n", " plt.plot(x, stats.skewnorm.pdf(x, a))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }