{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", " \n", "\n", "

\n", "\n", "## Interactive Spurious Correlations Demonstration\n", "\n", "### Too Few Samples May Result in Spurious Correlations\n", "\n", "* in class I bring in 3 red balls, 2 green balls and my cowboy hat, yes I have one, recall I was a farmhand in Alberta, Canada\n", "\n", "* then I have students volunteer, one holds the hat, one draws balls with replacement and one records the results on the board\n", "\n", "* through multiple bootstrap sample sets we demonstrate the use of bootstrap to calculate uncertainty in the proportion from the sample itself through sampling with replacement\n", "\n", "* with this workflow we all provide an interactive plot demonstration with matplotlib and ipywidget packages to demonstrate this virtually\n", "\n", "#### Michael Pyrcz, Associate Professor, University of Texas at Austin \n", "\n", "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n", "\n", "#### Source of Spurious Correlations\n", "\n", "Let's explore the source of spurious correlations:\n", "\n", "* too few sample data\n", "\n", "* this issue can be exagerated when sampling from skewed distributions with possibility for extreme values \n", "\n", "What's the issue?\n", "\n", "* anomalously large absolute correlations between independent features\n", "\n", "We 'data mine' relationships that don't exist! Great examples are available at the [Spurious Correlations](https://www.tylervigen.com/spurious-correlations) website.\n", "\n", "#### The Correlation Coefficient\n", "\n", "Pearson’s Product‐Moment Correlation Coefficient\n", "* Provides a measure of the degree of linear relationship.\n", "* We refer to it as the 'correlation coefficient'\n", "\n", "Let's review the sample variance of variable $x$. Of course, I'm truncating our notation as $x$ is a set of samples a locations in our modeling space, $x(\\bf{u_\\alpha}), \\, \\forall \\, \\alpha = 0, 1, \\dots, n - 1$.\n", "\n", "\\begin{equation}\n", "\\sigma^2_{x} = \\frac{\\sum_{i=1}^{n} (x_i - \\overline{x})^2}{(n-1)}\n", "\\end{equation}\n", "\n", "We can expand the the squared term and replace on of them with $y$, another variable in addition to $x$.\n", "\n", "\\begin{equation}\n", "C_{xy} = \\frac{\\sum_{i=1}^{n} (x_i - \\overline{x})(y_i - \\overline{y})}{(n-1)}\n", "\\end{equation}\n", "\n", "We now have a measure that represents the manner in which variables $x$ and $y$ co-vary or vary together. We can standardized the covariance by the product of the standard deviations of $x$ and $y$ to calculate the correlation coefficent. \n", "\n", "\\begin{equation}\n", "\\rho_{xy} = \\frac{\\sum_{i=1}^{n} (x_i - \\overline{x})(y_i - \\overline{y})}{(n-1)\\sigma_x \\sigma_y}, \\, -1.0 \\le \\rho_{xy} \\le 1.0\n", "\\end{equation}\n", "\n", "In summary we can state that the correlation coefficient is related to the covariance as:\n", "\n", "\\begin{equation}\n", "\\rho_{xy} = \\frac{C_{xy}}{\\sigma_x \\sigma_y}\n", "\\end{equation}\n", "\n", "\n", "#### Objective \n", "\n", "Provide an example and demonstration for:\n", "\n", "1. interactive plotting in Jupyter Notebooks with Python packages matplotlib and ipywidgets\n", "2. provide an intuitive hands-on example to explore spurious correlations \n", "\n", "#### Getting Started\n", "\n", "Here's the steps to get setup in Python with the GeostatsPy package:\n", "\n", "1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). \n", "2. Open Jupyter and in the top block get started by copy and pasting the code block below from this Jupyter Notebook to start using the geostatspy functionality. \n", "\n", "#### Load the Required Libraries\n", "\n", "The following code loads the required libraries." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "from ipywidgets import interactive # widgets and interactivity\n", "from ipywidgets import widgets \n", "from ipywidgets import Layout\n", "from ipywidgets import Label\n", "from ipywidgets import VBox, HBox\n", "import matplotlib.pyplot as plt # plotting\n", "from matplotlib.colors import ListedColormap\n", "import numpy as np # working with arrays\n", "import pandas as pd # working with DataFrames\n", "import seaborn as sns # for matrix scatter plots\n", "from scipy.stats import triang # parametric distributions\n", "from scipy.stats import binom\n", "from scipy.stats import norm\n", "from scipy.stats import uniform\n", "from scipy.stats import triang\n", "from scipy.stats import lognorm\n", "from scipy import stats # statistical calculations\n", "import random # random drawing / bootstrap realizations of the data\n", "from matplotlib.gridspec import GridSpec # control of subplots \n", "import seaborn as sns # for matrix scatter plots" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Make a Synthetic Dataset\n", "\n", "This is an interactive method to:\n", "\n", "* select a parametric distribution\n", "\n", "* select the distribution parameters\n", "\n", "* select the number of samples\n", "\n", "* select the number of features \n", "\n", "Then we will view the lower triangular correlation matrix\n", "\n", "* we will color the correlations that are large (in absolute value $\\gt 0.8$)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "bins = np.linspace(-1,1,100) # set histogram bins\n", "\n", "# interactive calculation of the random sample set (control of source parametric distribution and number of samples)\n", "l = widgets.Text(value=' Spurious Correlation Demonstration, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n", "dist = widgets.Dropdown(\n", " options=['Triangular', 'Uniform', 'Gaussian', 'LogNorm'],\n", " value='Gaussian',\n", " description='Dataset Distribution:',\n", " disabled=False,\n", " layout=Layout(width='200px', height='30px')\n", ")\n", "a = widgets.FloatSlider(min=0.0, max = 100.0, value = 0.5, description = 'Sample: Mean/Mode',orientation='vertical',layout=Layout(width='170px', height='200px'))\n", "a.style.handle_color = 'blue'\n", "d = widgets.FloatSlider(min=0.01, max = 30.0, value = 5.0, step = 1.0, description = 'Sample: St.Dev.',orientation='vertical',layout=Layout(width='110px', height='200px'))\n", "d.style.handle_color = 'green'\n", "b = widgets.FloatSlider(min = 0, max = 100.0, value = 0.5, description = 'Sample: Min.',orientation='vertical',layout=Layout(width='110px', height='200px'))\n", "b.style.handle_color = 'red'\n", "c = widgets.IntSlider(min = 0, max = 100, value = 100, description = 'Sample: Max.',orientation='vertical',layout=Layout(width='110px', height='200px'))\n", "c.style.handle_color = 'orange'\n", "n = widgets.IntSlider(min = 2, max = 1000, value = 4, description = 'Number Samples',orientation='vertical',layout=Layout(width='110px', height='200px'))\n", "n.style.handle_color = 'gray'\n", "m = widgets.IntSlider(min = 2, max = 20, value = 10, description = 'Number Features',orientation='vertical',layout=Layout(width='110px', height='200px'))\n", "m.style.handle_color = 'gray'\n", "\n", "uia = widgets.HBox([dist,a,d,b,c,n,m],kwargs = {'justify_content':'center'}) # basic widget formatting\n", "#uib = widgets.HBox([n, m],kwargs = {'justify_content':'center'}) # basic widget formatting \n", "ui2 = widgets.VBox([l,uia],)\n", "\n", "def f_make(dist,a, b, c, d, n, m): # function to take parameters, make sample and plot\n", " dataset = make_data(dist,a, b, c, d, n, m)\n", " df = pd.DataFrame(data = dataset)\n", " corr = df.corr()\n", "\n", "# build a mask to remove the upper triangle\n", " mask = np.triu(np.ones_like(corr, dtype=bool))\n", " corr_values = corr.values\n", " corr_values2 = corr_values[mask != True]\n", " \n", "# make a custom colormap\n", " my_colormap = plt.cm.get_cmap('RdBu_r', 256)\n", " newcolors = my_colormap(np.linspace(0, 1, 256))\n", " white = np.array([256/256, 256/256, 256/256, 1])\n", " newcolors[26:230, :] = white # mask all correlations less than abs(0.8)\n", " newcmp = ListedColormap(newcolors)\n", "\n", "# Draw the heatmap with the mask and correct aspect ratio\n", " fig, (ax1) = plt.subplots(1, 1)\n", " sns.set(font_scale = 0.8)\n", " sns.heatmap(corr, ax = ax1, annot = True, mask=mask, cmap=newcmp, vmin = -1.0, vmax=1.0, center=0,\n", " square=True, linewidths=.5, linecolor = 'white', linewidth = 1, cbar_kws={'shrink': .5, 'label': 'Correlation Coefficents'})\n", " ax1.set_xlabel('Random Independent Features'); ax1.set_ylabel('Random Independent Features')\n", " ax1.set_title('Lower Triangular Correlation Matrix Heat Map')\n", " \n", "# ax2.hist(corr_values2, alpha=0.2,color=\"red\",edgecolor=\"black\", bins = bins)\n", "# ax2.set_title('Lower Triangular Correlation Coefficent Distribution'); ax2.set_xlabel('Correlation Coefficent'); ax2.set_ylabel('Frequency') \n", "# ax2.set_facecolor('white'); ax2.grid(True);\n", " \n", " plt.subplots_adjust(left=0.0, bottom=0.0, right=1.2, top=3.2, wspace=0.2, hspace=0.2)\n", " plt.show()\n", "\n", "def make_data(dist,a, b, c, d, n, m): # function to check parameters and make sample \n", " if dist == 'Uniform':\n", " if b >= c:\n", " print('Invalid uniform distribution parameters')\n", " return None\n", " dataset = uniform.rvs(size=[n,m], loc = b, scale = c, random_state = 73073).tolist()\n", " return dataset\n", " elif dist == 'Triangular':\n", " interval = c - b\n", " if b >= a or a >= c or interval <= 0:\n", " print('Invalid triangular distribution parameters')\n", " return None \n", " dataset = triang.rvs(size=[n,m], loc = b, c = (a-b)/interval, scale = interval, random_state = 73073).tolist()\n", " return dataset\n", " elif dist == 'Gaussian':\n", " dataset = norm.rvs(size=[n,m], loc = a, scale = d, random_state = 73073).tolist()\n", " return dataset\n", " elif dist == 'LogNorm':\n", " dataset = lognorm.rvs(size=[n,m], loc = a, scale = np.exp(a), s = d, random_state = 73073).tolist()\n", " return dataset\n", " \n", "# connect the function to make the samples and plot to the widgets \n", "interactive_plot = widgets.interactive_output(f_make, {'dist': dist,'a': a, 'd': d, 'b': b, 'c': c, 'n': n, 'm': m})\n", "interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Spurious Correlations Demonstration\n", "\n", "* spurious correlations due to a combination of too few samples and skewed distribution\n", "\n", "* interactive plot demonstration with ipywidget, matplotlib packages\n", "\n", "#### Michael Pyrcz, Associate Professor, University of Texas at Austin \n", "\n", "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n", "\n", "### The Problem\n", "\n", "Let's simulate bootstrap, resampling with replacement from a hat with $n_{red}$ and $n_{green}$ balls\n", "\n", "* **$n_{red}$**: number of red balls in the sample (placed in the hat)\n", "\n", "* **$n_{green}$**: number of green balls in the sample (placed in the hat)\n", "\n", "* **$L$**: number of bootstrap realizations" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "fa7e36176895433b8064d693754bf2f9", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(Text(value=' Spurious Correlation Demonstration, Michael P…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0961920d80ef49548349b9e21407f084", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(ui2, interactive_plot) # display the interactive plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Observations\n", "\n", "Some observations:\n", "\n", "* spurious correlations due to a combination of too few samples and skewed distribution\n", "\n", "* interactive plot demonstration with ipywidget, matplotlib packages\n", "\n", "\n", "#### Comments\n", "\n", "This was a simple demonstration of interactive plots in Jupyter Notebook Python with the ipywidgets and matplotlib packages. \n", "\n", "I have many other demonstrations on data analytics and machine learning, e.g. on the basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations, trend modeling and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy. \n", " \n", "I hope this was helpful,\n", "\n", "*Michael*\n", "\n", "#### The Author:\n", "\n", "### Michael Pyrcz, Associate Professor, University of Texas at Austin \n", "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n", "\n", "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n", "\n", "For more about Michael check out these links:\n", "\n", "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n", "\n", "#### Want to Work Together?\n", "\n", "I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n", "\n", "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n", "\n", "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n", "\n", "* I can be reached at mpyrcz@austin.utexas.edu.\n", "\n", "I'm always happy to discuss,\n", "\n", "*Michael*\n", "\n", "Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n", "\n", "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 2 }