{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "

\n", " \n", "\n", "

\n", "\n", "### Interactive Correlation Coefficient Demonstration\n", "\n", "\n", "#### Michael Pyrcz, Professor, The University of Texas at Austin \n", "\n", "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Correlation Coefficient\n", "\n", "Here is an interactive workflows demonstrationing the correlation coefficient, a common metric for analysis the relationship between two feautures, commonly used in inferential and predictive machine learning.\n", "\n", "I have recorded a walk-through of this interactive dashboard in my [Data Science Interactive Python Demonstrations](https://www.youtube.com/playlist?list=PLG19vXLQHvSDy26fM3hDLg3VCU7U5BGZl) series on my [YouTube](https://www.youtube.com/@GeostatsGuyLectures) channel.\n", "\n", "* Join me for walk-through of this dashboard [06 Data Science Interactive: Correlation Coefficient](TBD). I'm stoked to guide you and share observations and things to try out! \n", "\n", "* I have a lecture on [Correlation Analysis](https://www.youtube.com/watch?v=wZwYEDqB4A4&list=PLG19vXLQHvSB-D4XKYieEku9GQMQyAzjJ&index=21) as part of my [Data Analytics and Geostatistics](https://www.youtube.com/playlist?list=PLG19vXLQHvSB-D4XKYieEku9GQMQyAzjJ) course. Note, for all my recorded lecture the interactive and well-documented workflow demononstrations are available on my GitHub repository [GeostatsGuy's Python Numerical Demos](https://github.com/GeostatsGuy/PythonNumericalDemos).\n", "\n", "* Also, I have a lecture on [Univariate Statistics](https://www.youtube.com/watch?v=wAcbA2cIqec&list=PLG19vXLQHvSB-D4XKYieEku9GQMQyAzjJ&index=10) as part of my [Data Analytics and Geostatistics](https://www.youtube.com/playlist?list=PLG19vXLQHvSB-D4XKYieEku9GQMQyAzjJ) course.\n", "\n", "* Also, I have a lecture on [Principal Components Analysis](https://www.youtube.com/watch?v=-to3JXiae9Y&list=PLG19vXLQHvSC2ZKFIkgVpI9fCjkN38kwf&index=17) as part of my [Machine Learning](https://www.youtube.com/playlist?list=PLG19vXLQHvSC2ZKFIkgVpI9fCjkN38kwf) course.\n", "\n", "#### Bivariate Analysis\n", "\n", "Quantify the relationship between two features:\n", "\n", "* e.g., relationship between porosity and permeability, nickle and copper grade, etc.\n", "\n", "What would be the impact if we ignore this relationship and simply modeled eacf of our features independently?\n", "\n", "* no relationship beyond constraints at data locations\n", "* independent away from data\n", "* nonphysical results, unrealistic uncertainty models\n", "\n", "We must quantify relationships between our features and integrate them in our models.\n", "\n", "#### Correlation Coefficient\n", "\n", "Pearson’s Product‐Moment Correlation Coefficient\n", "* provides a measure of the degree of linear relationship.\n", "* we refer to it as the 'correlation coefficient'\n", "\n", "Let's review the sample variance of feature, $\\sigma^2_x$, where $X$ feature is represented by a set of samples a locations in our modeling space, $x(\\bf{u}_\\alpha)$, $\\alpha = 1, \\dots, n$, but we will use the shorthand, $x_{i}$, $i = 1, \\dots, n$.\n", "\n", "\\begin{equation}\n", "\\sigma^2_{x} = \\frac{\\sum_{i=1}^{n} (x_i - \\overline{x})^2}{(n-1)}\n", "\\end{equation}\n", "\n", "We can expand the the squared term and replace on of them with $Y$, another feature in addition to $X$.\n", "\n", "\\begin{equation}\n", "C_{x,y} = \\frac{\\sum_{i=1}^{n} (x_i - \\overline{x})(y_i - \\overline{y})}{(n-1)}\n", "\\end{equation}\n", "\n", "We now have a measure that represents the manner in which features $X$ and $Y$ co-vary or vary together. We can standardized the covariance by the product of the standard deviations of $X$ and $Y$ to calculate the correlation coefficent. \n", "\n", "\\begin{equation}\n", "\\rho_{x,y} = \\frac{\\sum_{i=1}^{n} (x_i - \\overline{x})(y_i - \\overline{y})}{(n-1)\\sigma_x \\sigma_y}, \\, -1.0 \\le \\rho_{x,y} \\le 1.0\n", "\\end{equation}\n", "\n", "#### Interpreting the Correlation Coefficient\n", "\n", "Given a bivariate Gaussian distribution, we can make the following interpretations of correlation coefficients:\n", "\n", "* $\\rho_{x,y} = 1.0$ - perfect positive linear relationship\n", "* $\\rho_{x,y} = 0.0$ - no relationship\n", "* $\\rho_{x,y} = -1.0$ - perfect negative linear relationship\n", "\n", "if not bivariate Gaussian this does not hold.\n", "\n", "In summary we can state that the correlation coefficient is related to the covariance as:\n", "\n", "\\begin{equation}\n", "\\rho_{x,y} = \\frac{C_{x,y}}{\\sigma_x \\sigma_y}\n", "\\end{equation}\n", "\n", "The correlation coefficient provides a useful metrics to quantify relationships between two features at a time. We can also consider bivariate scatter plots and matrix scatter plots to visualize multivariate data.\n", "\n", "Other applications in machine learning with correlation coefficients, e.g.:\n", "\n", "- the variance explained metric for linear regression is the correlation coefficient squared\n", "\n", "\\begin{equation}\n", "R^2_{X,Y} = \\left( \\rho_{x,y} \\right)^2 \n", "\\end{equation}\n", "\n", "- principal components analysis are calculated as the Eigen values and vectors from the features' covariance matrix, the unstandardized correlation coefficient.\n", "\n", "#### Limitations \n", "\n", "These are important limitations of the Pearson's product moment correlation coefficient:\n", "\n", "1. measures the strength of the linear relationship between the two features. \n", "2. sensitive to outliers\n", "3. assumes finite covariance, $C_{X,Y}$, and finite variances, $\\sigma^2_X$, and $\\sigma^2_Y$. \n", "4. only when the features are bivariate normal does the correlation coefficient provide a complete, unique description of the relationship between the two features, this includes homoscedasticity (constant conditional variance).\n", "\n", "#### Correlation Coefficient Dashboard\n", "\n", "To demonstrate the correlation coefficient, I built out 1 dashboard:\n", "\n", "* change the correlation coefficient and observe the associated standard normal, bivariate Gaussian univariate and bivarate distributions.\n", "\n", "#### Getting Started\n", "\n", "Here's the steps to get setup in Python with the GeostatsPy package:\n", "\n", "1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). \n", "\n", "That's all!\n", "\n", "#### Load the Required Libraries\n", "\n", "We will also need some standard Python packages. These should have been installed with Anaconda 3." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import os # to set current working directory \n", "import sys # supress output to screen for interactive variogram modeling\n", "import io\n", "import numpy as np # arrays and matrix math\n", "import pandas as pd # DataFrames\n", "import matplotlib.pyplot as plt # plotting\n", "from matplotlib.pyplot import cm # color maps\n", "from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks\n", "from matplotlib.patches import Ellipse # plot an ellipse\n", "import math # sqrt operator\n", "import random # random simulation locations\n", "from copy import copy # copy a colormap\n", "from scipy.stats import norm\n", "from ipywidgets import interactive # widgets and interactivity\n", "from ipywidgets import widgets \n", "from ipywidgets import Layout\n", "from ipywidgets import Label\n", "from ipywidgets import VBox, HBox\n", "from scipy.stats import norm # Gaussian distribution\n", "import scipy.stats as st # statistical methods\n", "from scipy.interpolate import make_interp_spline # smooth curves\n", "plt.rc('axes', axisbelow=True) # set axes and grids in the background for all plots" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing 'python -m pip install [package-name]'. More assistance is available with the respective package docs. \n", "\n", "#### Declare Functions\n", "\n", "I just added a convenience function for adding major and minor gridlines." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def add_grid(sub_plot):\n", " sub_plot.grid(True, which='major',linewidth = 1.0); sub_plot.grid(True, which='minor',linewidth = 0.2) # add y grids\n", " sub_plot.tick_params(which='major',length=7); sub_plot.tick_params(which='minor', length=4)\n", " sub_plot.xaxis.set_minor_locator(AutoMinorLocator()); sub_plot.yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Interactive Correlation Coefficient\n", "\n", "Draw random values from a bivariate Gaussian distribution parameterized by:\n", "\n", "* **$\\overline{X}_1$, $\\overline{X}_2$** - mean of features $X_1$ and $X_2$\n", "\n", "* **$\\sigma_{X_1}$,$\\sigma_{X_1}$** - standard deviation of features $X_1$ and $X_2$ \n", "\n", "* **$\\rho_{x,y}$** - Pearson product-moment correlation coefficient " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's set up our dash board." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "global sample\n", "import warnings; warnings.simplefilter('ignore')\n", "\n", "# dashboard: number of simulation locations and variogram parameters\n", "style = {'description_width': 'initial'}\n", "l = widgets.Text(value=' Correlation Coefficient, Michael Pyrcz, Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n", "ndata = widgets.IntSlider(min = 0, max = 10000, value = 5000, step = 1000, description = r'$n_{samples}$',orientation='horizontal',continuous_update=True,\n", " layout=Layout(width='600px', height='40px'))\n", "ndata.style.handle_color = 'gray'\n", "\n", "corr = widgets.FloatSlider(min = -1.0, max = 1.0, value = 0, step = 0.1, description = r'$\\rho_{x_1,x_2}$',orientation='horizontal',continuous_update=True,\n", " layout=Layout(width='600px', height='40px'))\n", "corr.style.handle_color = 'gray'\n", "\n", "cond = widgets.Checkbox(value=False,description='Show Conditionals',disabled=False,indent=False)\n", "\n", "raster = widgets.Checkbox(value=False,description='Show Joint',disabled=False,indent=False)\n", "\n", "uipars = widgets.HBox([ndata,corr,cond,raster],) \n", "\n", "uik = widgets.VBox([l,uipars],)\n", "\n", "def f_make(ndata,corr,cond,raster): # function to take parameters, make sample and plot\n", " text_trap = io.StringIO() # suppress all text function output to dashboard to avoid clutter \n", " sys.stdout = text_trap\n", " cmap = cm.inferno\n", " np.random.seed(seed = 73072) # ensure same results for all runs\n", " mean = np.array([0,0])\n", " correl = np.array([[1.0,corr],[corr,1.0]],dtype=float)\n", " sample = np.random.multivariate_normal(mean,correl,size = ndata)\n", " df = pd.DataFrame(sample,columns=['X1','X2'])\n", " \n", " #slope, intercept, r_value, p_value, std_err = st.linregress(sample[:,0],sample[:,1])\n", " #x1 = np.array([-3.0,3.0])\n", " #x2 = x1*slope + intercept\n", " \n", " plt_scatter = plt.subplot2grid((3, 3), (1, 0), rowspan=2, colspan=2)\n", " plt_x1 = plt.subplot2grid((3, 3), (0, 0), colspan=2,\n", " sharex=plt_scatter)\n", " plt_x2 = plt.subplot2grid((3, 3), (1, 2), rowspan=2,\n", " sharey=plt_scatter) \n", " \n", " #plt.plot([0,0],[1.0,1.0],color = 'black')\n", " if raster == True:\n", " plt_scatter.hist2d(df['X1'],df['X2'], bins=30, range=[[-3,3],[-3,3]], density=False, weights=None, cmin=None, cmax=None,\n", " cmap=plt.cm.Reds,alpha=1.0,zorder=1)\n", " else:\n", " plt_scatter.scatter(sample[:,0],sample[:,1],color = 'red',alpha = 0.2,edgecolors='black',label = 'Samples',zorder=100)\n", " plt_scatter.legend(loc='upper left')\n", " \n", " plt_scatter.set_xlabel(r'$x_1$'); plt_scatter.set_ylabel(r'$x_2$')\n", " plt_scatter.set_xlim([-3.0,3.0]); plt_scatter.set_ylim([-3.0,3.0]); add_grid(plt_scatter)\n", "\n", " \n", " for x in np.linspace(-3.0,3.0,31):\n", " plt_scatter.plot([x,x],[-3,3],color='grey',lw=0.2)\n", " plt_scatter.plot([-3,3],[x,x],color='grey',lw=0.2)\n", " \n", " for x in np.linspace(-2.0,2.0,5):\n", " plt_scatter.plot([x,x],[-3,3],color='grey',lw=0.5)\n", " plt_scatter.plot([-3,3],[x,x],color='grey',lw=0.5)\n", " \n", " \n", " if cond == True:\n", " nbins = 6\n", " X1_new = np.linspace(-2.0,2.0,300) # for smooth splines\n", " X1_bins = np.linspace(-2.5,2.5,nbins) # set the bin boundaries and then the centroids for plotting\n", " X1_centroids = np.linspace((X1_bins[0]+X1_bins[1])*0.5,(X1_bins[-2]+X1_bins[-1])*0.5,nbins-1)\n", " df['X1_bins'] = pd.cut(df['X1'], X1_bins,labels = X1_centroids) # cut on bondaries and lable with centroids \n", " \n", " cond_exp = df.groupby('X1_bins')['X2'].mean()\n", " cond_P90 = df.groupby('X1_bins')['X2'].quantile(.9)\n", " cond_P10 = df.groupby('X1_bins')['X2'].quantile(.1)\n", " \n", " spl_exp = make_interp_spline(X1_centroids, cond_exp, k=3)\n", " spl_P90 = make_interp_spline(X1_centroids, cond_P90, k=3)\n", " spl_P10 = make_interp_spline(X1_centroids, cond_P10, k=3)\n", " cond_exp_spl = spl_exp(X1_new)\n", " cond_P90_spl = spl_P90(X1_new)\n", " cond_P10_spl = spl_P10(X1_new)\n", "\n", " plt_scatter.plot(X1_new,cond_exp_spl,color='white',lw=4,zorder=100)\n", " plt_scatter.plot(X1_new,cond_exp_spl,color='black',lw=2,zorder=200)\n", " plt_scatter.plot(X1_new,cond_P90_spl,color='white',lw = 4,zorder=100)\n", " plt_scatter.plot(X1_new,cond_P90_spl,'r--',color='black',lw = 2,zorder=200)\n", " plt_scatter.plot(X1_new,cond_P10_spl,color='white',lw = 4,zorder=100)\n", " plt_scatter.plot(X1_new,cond_P10_spl,'r--',color='black',lw = 2,zorder=200)\n", " \n", " plt_scatter.annotate('Exp',[X1_new[0]-0.3,cond_exp_spl[0]]); plt_scatter.annotate('Exp',[X1_new[-1]+0.05,cond_exp_spl[-1]])\n", " plt_scatter.annotate('P10',[X1_new[0]-0.3,cond_P10_spl[0]]); plt_scatter.annotate('P10',[X1_new[-1]+0.05,cond_P10_spl[-1]])\n", " plt_scatter.annotate('P90',[X1_new[0]-0.3,cond_P90_spl[0]]); plt_scatter.annotate('P90',[X1_new[-1]+0.05,cond_P90_spl[-1]])\n", " \n", " counts = plt_x1.hist(sample[:,0],density = True,alpha=0.0,color=None,edgecolor=None,bins=np.linspace(-3.0,3.0,31))[0]\n", " N, bins, patches = plt_x1.hist(sample[:,0],density = True,alpha=1.0,edgecolor='black',bins=np.linspace(-3.0,3.0,31))\n", " for i in range(0,30):\n", " patches[i].set_facecolor(plt.cm.Reds(counts[i]-np.min(counts)/(np.max(counts)-np.min(counts))))\n", " \n", " plt_x1.set_ylim([0.0,0.8]); add_grid(plt_x1)\n", " plt_x1.set_xlabel(r'$x_1$'); plt_x1.set_ylabel(r'Density')\n", " plt_x1.set_title(r'Bivariate Standard Gaussian Distributed Data with $\\rho =$' + str(np.round(corr,2)) + '.')\n", " \n", " counts = plt_x2.hist(sample[:,1],density = True,alpha=0.0,color=None,edgecolor=None,bins=np.linspace(-3.0,3.0,31))[0]\n", " N, bins, patches = plt_x2.hist(sample[:,1],orientation='horizontal',density = True,alpha=1.0,edgecolor='black',bins=np.linspace(-3.0,3.0,31))\n", " for i in range(0,30):\n", " patches[i].set_facecolor(plt.cm.Reds(counts[i]-np.min(counts)/(np.max(counts)-np.min(counts))))\n", "\n", " #plt_x2.hist(sample[:,1],orientation='horizontal',density = True,color='red',alpha=0.8,edgecolor='black',bins=np.linspace(-3.0,3.0,31))\n", " plt_x2.set_xlim([0.0,0.8]); add_grid(plt_x2)\n", " plt_x2.set_ylabel(r'$x_2$'); plt_x2.set_xlabel(r'Density')\n", " plt_scatter.set_ylabel(r'$x_2$')\n", " \n", " plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=1.7, wspace=0.3, hspace=0.3)\n", " plt.show()\n", " \n", "# connect the function to make the samples and plot to the widgets \n", "interactive_plot = widgets.interactive_output(f_make, {'ndata':ndata,'corr':corr,'cond':cond,'raster':raster})\n", "#interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interactive Correlation Coefficient Demonstration\n", "\n", "* select the number of data and correlation coefficient and observe the samples and linear regression of $X_2 = f(X_1)$ \n", "\n", "#### Michael Pyrcz, Professor, The University of Texas at Austin \n", "\n", "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n", "\n", "### The Inputs\n", "\n", "Select the number of samples and the Pearson product-moment correlation coefficient:\n", "\n", "* **$n_{samples}$**: number of samples\n", "\n", "* **$\\rho_{x_1,x_2}$**: the Pearson product-moment correlation" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": false }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a0d4a87540ee4a8cb0b5ea4d9e856c94", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(Text(value=' Correlation Coefficient, Micha…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "125b4a6c36334853b60ace63875160fc", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(uik, interactive_plot) # display the interactive plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Comments\n", "\n", "This was an interactive demonstration of the correlation coefficient. Providing students an opportunity to play with data analytics, geostatistics and machine learning for experiential learning.\n", " \n", "#### The Author:\n", "\n", "### Michael Pyrcz, Professor, The University of Texas at Austin \n", "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n", "\n", "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n", "\n", "For more about Michael check out these links:\n", "\n", "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n", "\n", "#### Want to Work Together?\n", "\n", "I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n", "\n", "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n", "\n", "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n", "\n", "* I can be reached at mpyrcz@austin.utexas.edu.\n", "\n", "I'm always happy to discuss,\n", "\n", "*Michael*\n", "\n", "Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin\n", "\n", "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) \n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 2 }