\n",
"\n",
"## Bootstrap with a Cowbow Hat Demonstration\n",
"\n",
"* in class I bring in 3 red blocks, 2 yellow blocks and my cowboy hat, yes I have one, recall I was a farmhand in Northern Alberta, Canada. Aside, ever heard me say, 'Howdy'?\n",
"\n",
"* then I have 3 students volunteer, one holds the hat, one draws balls with replacement and one records the results on the board\n",
"\n",
"* through multiple bootstrap realizations we demonstrate the use of bootstrap to calculate uncertainty in the proportion from the sample itself\n",
"\n",
"* with this workflow we all provide an interactive plot demonstration with matplotlib and ipywidget packages\n",
"\n",
"#### Michael Pyrcz, Professor, The University of Texas at Austin \n",
"\n",
"##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
"\n",
"#### Bootstrap\n",
"\n",
"Uncertainty in sample statistics:\n",
"* one source of uncertainty is the paucity of data.\n",
"* do 200 or even less samples provide a precise (and accurate estimate) of the mean? standard deviation? skew? P13?\n",
"\n",
"Would it be useful to know the uncertainty in these statistics due to the limited number of samples?\n",
"* what is the impact of uncertainty in the mean porosity, e.g., 20%+/-2%?\n",
"\n",
"**Bootstrap** is a method to assess the uncertainty in a sample statistic by repeated random sampling with replacement, new realizations of the data.\n",
"\n",
"#### Bootstrap Assumptions and Limitations\n",
"\n",
"Assumptions:\n",
"* sufficient, representative sampling, identical (stationarity over the population), idependent samples (no correlation between the samples)\n",
"\n",
"Limitations:\n",
"1. assumes the samples are representative, does not correct for model bias \n",
"2. assumes stationarity, identically distributed data over the area of interest\n",
"3. only accounts for uncertainty due to too few samples, e.g., no uncertainty due to changes away from data\n",
"4. does not account for boundary of area of interest \n",
"5. assumes the samples are independent, no correlations over time or space\n",
"6. does not account for other local information sources, no local model conditioning\n",
"\n",
"#### The Empirical Bootstrap Approach (Efron, 1982)\n",
"\n",
"Statistical resampling procedure to calculate accuracy in a sample estimate, i.e., uncertainty in a calculated statistic, from the data itself.\n",
"\n",
"* **mimicks** the sampling process, to proceed with multiple realizations of the data sample\n",
"\n",
"* **one of resampling methods** like permutation tests for hypothesis testing and cross validation for testing predictive model performance\n",
"\n",
"* **inferential**, approximates the distribution of uncertainty with the empirical distribution function, realizations of the statistic of interest \n",
"\n",
"* **flexible**, may be applied to calculate the uncertainty in any statistic\n",
"\n",
"* **machine learning** applies bootstrap to build ensembles of models for ensemble estimate aggragation to reduce model variance and improve model accuracy, this is known as model bagging\n",
"\n",
"* **spatial bootstrap** is an advanced form that accounts for spatial correlation, conditioning and nonstationarity (Journel, 1993). This is based on general model resampling methods. See my demonstrations:\n",
"\n",
"* [PythonDataBasics_SpatialBootstrap](https://github.com/GeostatsGuy/PythonNumericalDemos/blob/master/PythonDataBasics_SpatialBootstrap.ipynb)\n",
"\n",
"* [Spatial_Bootstrap](https://github.com/GeostatsGuy/PythonNumericalDemos/blob/master/Spatial_Bootstrap.ipynb) where I demonstrate the LU simulation-based spatial bootstrap method (Deutsch, 2004).\n",
"\n",
"#### Bootstrap Workflow\n",
"\n",
"Here's the general steps of bootstrap:\n",
"\n",
"1. assemble a sample set, must be representative, and reasonable to assume independence between samples\n",
"\n",
"2. optional: build a cumulative distribution function (CDF)\n",
" * may account for declustering weights, tail extrapolation\n",
" * could use analogous data to support, and impute missing values, or fit a parametric distribution supported by theory\n",
" * also consider data smoothing via added zero centered kernels around the data, known as smooth bootstrap \n",
"\n",
"3. For $\\ell = 1, \\ldots, L$ realizations, do the following:\n",
"\n",
" * For $i = \\alpha, \\ldots, n$ data, do the following:\n",
"\n",
" * draw a random sample with replacement from the sample set or Monte Carlo simulate from the CDF (if available). \n",
"\n",
"6. Calculate a realization of the summary statistic of interest from the $n$ samples, e.g., $m^\\ell$, $\\sigma^2_{\\ell}$. Return to 3 for another realization.\n",
"\n",
"7. Compile and summarize the $L$ realizations of the statistic of interest.\n",
"\n",
"#### Bootstrap vs. Theory\n",
"\n",
"Does this work? You can compare bootstrap to theory, e.g., for uncertainty in the mean the analytical solution is known as standard error: \n",
"\n",
"\\begin{equation}\n",
"\\sigma^2_\\overline{x} = \\frac{\\sigma^2_s}{n}\n",
"\\end{equation}\n",
"\n",
"Extremely powerful - could calculate uncertainty in any statistic! e.g. P13, skew etc.\n",
"* Would not be possible access general uncertainty in any statistic without bootstrap.\n",
"\n",
"This is a very powerful method. Let's try it out.\n",
"\n",
"#### Getting Started\n",
"\n",
"Here's the steps to get setup in Python with the GeostatsPy package:\n",
"\n",
"1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). \n",
"\n",
"That's all!\n",
"\n",
"#### Load the Required Libraries\n",
"\n",
"We will also need some standard Python packages. These should have been installed with Anaconda 3."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"supress_warnings = True # ignore warnings\n",
"%matplotlib inline\n",
"from ipywidgets import interactive # widgets and interactivity\n",
"from ipywidgets import widgets \n",
"from ipywidgets import Layout\n",
"from ipywidgets import Label\n",
"from ipywidgets import VBox, HBox\n",
"import matplotlib.pyplot as plt # plotting\n",
"from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks\n",
"plt.rc('axes', axisbelow=True) # set axes and grids in the background for all plots\n",
"import numpy as np # working with arrays\n",
"import pandas as pd # working with DataFrames\n",
"import seaborn as sns # for matrix scatter plots\n",
"from scipy.stats import triang # parametric distributions\n",
"from scipy.stats import binom\n",
"from scipy.stats import norm\n",
"from scipy.stats import uniform\n",
"from scipy.stats import triang\n",
"from scipy import stats # statistical calculations\n",
"import random # random drawing / bootstrap realizations of the data\n",
"from matplotlib.gridspec import GridSpec\n",
"import warnings\n",
"plt.rc('axes', axisbelow=True) # grid behind plotting elements\n",
"if supress_warnings == True:\n",
" import warnings # supress any warnings for this demonstration\n",
" warnings.filterwarnings('ignore') "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Bootstrap Realizations from the Cowbow Hat\n",
"\n",
"It's a lot of work to draw $n$ colored blocks from a cowbow hat with replacement, calculate the realization of the proportion of red and yellow blocks and repeat $L$ times to calculate the uncertainty distribution for the proportion.\n",
"\n",
"* Let's let the compute sample for us! \n",
"\n",
"This dashboard makes and visualizes bootstrap realizations."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"seed = 73073\n",
"# parameters for the synthetic dataset\n",
"bins = np.linspace(0,1000,1000)\n",
"\n",
"# interactive calculation of the sample set (control of source parametric distribution and number of samples)\n",
"l = widgets.Text(value=' Simple Boostrap Demonstration, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n",
"\n",
"a = widgets.IntSlider(min=0, max = 100, value = 2, step = 1, description = '$n_{red}$',orientation='horizontal',layout=Layout(width='400px', height='30px'),continuous_update=False)\n",
"a.style.handle_color = 'red'\n",
"\n",
"b = widgets.IntSlider(min=0, max = 100, value = 3, step = 1, description = '$n_{green}$',orientation='horizontal',layout=Layout(width='400px', height='30px'),continuous_update=False)\n",
"b.style.handle_color = 'yellow'\n",
"\n",
"c = widgets.IntSlider(min=1, max = 16, value = 3, step = 1, description = '$L$',orientation='horizontal',layout=Layout(width='400px', height='30px'),continuous_update=False)\n",
"c.style.handle_color = 'black'\n",
"\n",
"ui = widgets.HBox([a,b,c],) # basic widget formatting \n",
"ui2 = widgets.VBox([l,ui],)\n",
"\n",
"def f_make(a, b, c): # function to take parameters, make sample and plot\n",
" red_freq = make_data(a, b, c)\n",
" np.random.seed(seed=seed) \n",
" labels = ['Red', 'Yellow']\n",
" #nrows = int(np.round(np.sqrt(c)+0.4,0)); ncols = int(np.round(c / nrows + 0.4,0))\n",
" nrows = 4; ncols = 4\n",
" plt.clf()\n",
" \n",
" for i in range(0, c): \n",
" plt.subplot(ncols,nrows,i + 1) \n",
" draw = [red_freq[i],a + b - red_freq[i]]\n",
" plt.grid(zorder=0, color='black', axis = 'y', alpha = 0.2); plt.ylim(0,a + b); \n",
" plt.ylabel('Frequency'); plt.xlabel('Blocks Drawn')\n",
" plt.yticks(np.arange(0,a + b + 1,max(1,round((a+b)/10))))\n",
" barlist = plt.bar(labels,draw,edgecolor = \"black\",linewidth = 1,alpha = 0.8); plt.title('Realization #' + str(i+1),zorder = 1) \n",
" barlist[0].set_color('red'); barlist[1].set_color('yellow')\n",
" barlist[0].set_edgecolor('black'); barlist[1].set_edgecolor('black')\n",
" \n",
" plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.2 * nrows, wspace=0.2, hspace=0.2)\n",
" plt.show()\n",
"\n",
"def make_data(a, b, c): # function to check parameters and make sample \n",
" prop_red = np.zeros(c)\n",
" for i in range(0, c): \n",
" prop_red[i] = np.random.multinomial(a+b,[a/(a+b),b/(a+b)], size = 1)[0][0]\n",
" return prop_red\n",
"\n",
"# connect the function to make the samples and plot to the widgets \n",
"interactive_plot = widgets.interactive_output(f_make, {'a': a, 'b': b, 'c': c})\n",
"interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simple Bootstrap Demonstration - Drawing Red and Green Balls from a Virtual Cowboy Hat\n",
"\n",
"* drawing red and green balls from a hat with replacement to access uncertainty in the proportion \n",
"\n",
"* interactive plot demonstration with ipywidget, matplotlib packages\n",
"\n",
"#### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
"\n",
"##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
"\n",
"### The Problem\n",
"\n",
"Let's simulate bootstrap, resampling with replacement from a hat with $n_{red}$ and $n_{green}$ balls\n",
"\n",
"* **$n_{red}$**: number of red balls in the sample (placed in the hat)\n",
"\n",
"* **$n_{green}$**: number of green balls in the sample (placed in the hat)\n",
"\n",
"* **$L$**: number of bootstrap realizations"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "8cbe019970e248e6a6abdaa48e326931",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VBox(children=(Text(value=' Simple Boostrap Demonstration, Michael Pyrcz, Assoc…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "f293f5034aa848bbbc902b6bde09e647",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output(outputs=({'output_type': 'display_data', 'data': {'text/plain': '', 'i…"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(ui2, interactive_plot) # display the interactive plot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's look at a line chart of proportion of red and green balls over many bootstrap realizations. \n",
"\n",
"* for this we will use a line chart, as it is difficult to visualize many histograms."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# parameters for the synthetic dataset\n",
"bins = np.linspace(0,1000,1000)\n",
"\n",
"# interactive calculation of the sample set (control of source parametric distribution and number of samples)\n",
"l = widgets.Text(value=' Boostrap from a Cowbow Hat Demonstration, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n",
"\n",
"a = widgets.IntSlider(min=0, max = 100, value = 2, step = 1, description = '$n_{red}$',orientation='horizontal',layout=Layout(width='400px', height='30px'),continuous_update=False)\n",
"a.style.handle_color = 'red'\n",
"\n",
"b = widgets.IntSlider(min=0, max = 100, value = 3, step = 1, description = '$n_{yellow}$',orientation='horizontal',layout=Layout(width='400px', height='30px'),continuous_update=False)\n",
"b.style.handle_color = 'yellow'\n",
"\n",
"c = widgets.IntSlider(min=1, max = 1000, value = 20, step = 1, description = '$L$',orientation='horizontal',layout=Layout(width='400px', height='30px'),continuous_update=False)\n",
"c.style.handle_color = 'gray'\n",
"\n",
"ui = widgets.HBox([a,b,c],) # basic widget formatting \n",
"ui3 = widgets.VBox([l,ui],)\n",
"\n",
"def f_make3(a, b, c): # function to take parameters, make sample and plot\n",
" np.random.seed(seed=seed)\n",
" start_CI = 3; Plow = 30; Phigh = 70\n",
" red_freq = make_data(a, b, c) \n",
" green_freq = (a + b) - red_freq\n",
" prop_red = red_freq / (a + b)\n",
" prop_green = green_freq / (a + b)\n",
" labels = ['Red', 'Yellow']\n",
" trials = np.arange(1,c+1,1)\n",
"\n",
" prop_red_cumul = np.cumsum(prop_red, axis=0, dtype=None, out=None)/np.arange(1,len(prop_red)+1,1)\n",
" prop_green_cumul = 1.0 - prop_red_cumul\n",
" \n",
" cumul_per_trial = np.arange(start_CI,c+1,1)\n",
" Plow_red = []; Phigh_red = []; Plow_green = []; Phigh_green = [];\n",
" for l in range(start_CI,len(prop_red)+1):\n",
" Plow_red.append(np.percentile(prop_red[:l],Plow))\n",
" Phigh_red.append(np.percentile(prop_red[:l],Phigh))\n",
" Plow_green.append(np.percentile(prop_green[:l],Plow)) \n",
" Phigh_green.append(np.percentile(prop_green[:l],Phigh)) \n",
" \n",
" plt.subplot(1,2,1) \n",
" plt.plot([1,max(c,2)],[a/(a+b),a/(a+b)],color='red',alpha=1.0,lw=2,zorder=3)\n",
" plt.plot([1,max(c,2)],[a/(a+b),a/(a+b)],color='black',alpha=0.6,lw=4,zorder=1)\n",
" plt.annotate(r'$prop_{red}$',[(max(c,2)-1)*0.02+1,a/(a+b)+0.02],zorder=1001)\n",
" \n",
" plt.plot([1,max(c,2)],[b/(a+b),b/(a+b)],color='yellow',alpha=1.0,lw=2,zorder=3)\n",
" plt.plot([1,max(c,2)],[b/(a+b),b/(a+b)],color='black',alpha=0.6,lw=4,zorder=1)\n",
" plt.annotate(r'$prop_{yellow}$',[(max(c,2)-1)*0.02+1,b/(a+b)+0.02],zorder=1001)\n",
" \n",
" plt.plot(trials,prop_red,color='red',alpha=0.6,lw=1,ls='--',zorder=10)\n",
" plt.scatter(trials,prop_red,color='red',edgecolor='black',alpha=1.0,s=20,label=r'$prop_{red}^{\\ell}$',zorder=100)\n",
" plt.plot(trials,prop_green,color='yellow',alpha=1.0,lw=1,ls='--',zorder=10)\n",
" plt.scatter(trials,prop_green,color='yellow',edgecolor='black',alpha=1.0,s=20,label=r'$prop_{yellow}^{\\ell}$',zorder=100)\n",
"\n",
" plt.xlim([1,max(c,2)]); plt.ylim([0,1])\n",
" plt.xlabel('Bootstrap Realizations, $L$'); plt.ylabel('Proportion'); plt.title('Bootstrap Realizations of Proportions of Blocks')\n",
" plt.legend(loc='upper right')\n",
" if c < 30:\n",
" plt.gca().set_xticks(np.arange(1,c+1,1.0))\n",
" plt.gca().set_yticks(np.arange(0,1.1,0.1))\n",
" plt.gca().grid(True, which='major',axis='y',linewidth = 1.0); plt.gca().grid(True, which='minor',axis='y',linewidth = 0.2) # add y grids\n",
" plt.gca().tick_params(which='major',axis='y'); plt.gca().tick_params(which='minor',axis='y',length=4)\n",
" plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks \n",
" #plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); \n",
" plt.gca().grid(True, which='major',axis='y',linewidth = 1.0);\n",
" \n",
" \n",
" plt.subplot(1,2,2) \n",
" plt.plot([1,max(c,2)],[a/(a+b),a/(a+b)],color='red',alpha=1.0,lw=2,zorder=3)\n",
" plt.plot([1,max(c,2)],[a/(a+b),a/(a+b)],color='black',alpha=0.6,lw=4,zorder=1)\n",
" plt.annotate(r'$prop_{red}$',[(max(c,2)-1)*0.02+1,a/(a+b)+0.02],zorder=1001)\n",
" \n",
" plt.plot([1,max(c,2)],[b/(a+b),b/(a+b)],color='yellow',alpha=1.0,lw=2,zorder=3)\n",
" plt.plot([1,max(c,2)],[b/(a+b),b/(a+b)],color='black',alpha=0.6,lw=4,zorder=1)\n",
" plt.annotate(r'$prop_{yellow}$',[(max(c,2)-1)*0.02+1,b/(a+b)+0.02],zorder=1001) \n",
"\n",
" plt.plot(trials,prop_red_cumul,color='red',alpha=0.6,lw=1,ls='--',zorder=10)\n",
" plt.scatter(trials,prop_red_cumul,color='red',edgecolor='black',alpha=1.0,s=20,label=r'$Prop_{red}^{\\leq \\ell}$',zorder=100)\n",
" plt.plot(trials,prop_green_cumul,color='yellow',alpha=1.0,lw=1,ls='--',zorder=10)\n",
" plt.scatter(trials,prop_green_cumul,color='yellow',edgecolor='black',alpha=1.0,s=20,label=r'$Prop_{yellow}^{\\leq \\ell}$',zorder=100)\n",
" \n",
" plt.fill_between(cumul_per_trial,Phigh_red,Plow_red,color='red',edgecolor=None,alpha=0.15,zorder=1)\n",
" plt.plot(cumul_per_trial,Plow_red,color='red',alpha=0.3,lw=1,zorder=2) \n",
" plt.plot(cumul_per_trial,Phigh_red,color='red',alpha=0.3,lw=1,zorder=2) \n",
" \n",
" plt.fill_between(cumul_per_trial,Phigh_green,Plow_green,color='yellow',edgecolor=None,alpha=0.15,zorder=1)\n",
" plt.plot(cumul_per_trial,Plow_green,color='yellow',alpha=0.3,lw=1,zorder=2) \n",
" plt.plot(cumul_per_trial,Phigh_green,color='yellow',alpha=0.3,lw=1,zorder=2) \n",
" \n",
" plt.xlim([1,max(c,2)]); plt.ylim([0,1])\n",
" plt.xlabel('Bootstrap Realizations, $L$'); plt.ylabel('Proportion'); plt.title('Cumulative Statistics Over Bootstrap Realizations')\n",
" plt.legend(loc='upper right')\n",
" if c < 30:\n",
" plt.gca().set_xticks(np.arange(1,c+1,1.0))\n",
" plt.gca().set_yticks(np.arange(0,1.1,0.1))\n",
" plt.gca().grid(True, which='major',axis='y',linewidth = 1.0); plt.gca().grid(True, which='minor',axis='y',linewidth = 0.2) # add y grids\n",
" plt.gca().tick_params(which='major',axis='y'); plt.gca().tick_params(which='minor',axis='y',length=4)\n",
" plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks \n",
" plt.gca().grid(True, which='major',axis='y',linewidth = 1.0);\n",
" \n",
" plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.2, wspace=0.2, hspace=0.2)\n",
" plt.show()\n",
"\n",
"def make_data(a, b, c): # function to check parameters and make sample \n",
" prop_red = np.zeros(c)\n",
" for i in range(0, c): \n",
" prop_red[i] = np.random.multinomial(a+b,[a/(a+b),b/(a+b)], size = 1)[0][0]\n",
" return prop_red\n",
"\n",
"# connect the function to make the samples and plot to the widgets \n",
"interactive_plot3 = widgets.interactive_output(f_make3, {'a': a, 'b': b, 'c': c})\n",
"interactive_plot3.clear_output(wait = True) # reduce flickering by delaying plot updating"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simple Bootstrap Demonstration with a Cowbow Hat - Uncertainty in Proportions\n",
"\n",
"Let's look at the individual bootstrap realizations and cumulative uncertainty model over those realizations.\n",
"\n",
"#### Michael Pyrcz, Professor, The University of Texas at Austin \n",
"\n",
"##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
"\n",
"### The Problem\n",
"\n",
"Let's simulate bootstrap, resampling with replacement from a hat with $n_{red}$ and $n_{yellow}$ blocks\n",
"\n",
"* **$n_{red}$**: number of red balls in the sample (placed in the hat)\n",
"\n",
"* **$n_{yellow}$**: number of green balls in the sample (placed in the hat)\n",
"\n",
"* **$L$**: number of bootstrap realizations"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d4bc99d93064437196a10a40bb78bc1e",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VBox(children=(Text(value=' Boostrap from a Cowbow Hat Demonstration, Michael P…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "5a973c3acb9f4c55a38b4b806650dd3e",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output(outputs=({'output_type': 'display_data', 'data': {'text/plain': '', 'i…"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(ui3, interactive_plot3) # display the interactive plot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Summarizing Bootstrap Uncertainty\n",
"\n",
"* Run more bootstrap realizations and evaluate the uncertainty model\n",
"\n",
"Now instead of looking at each bootstrap result, let's look at all realizations and summarize with:\n",
"\n",
"* **box and whisker plot** of the red and green ball frequencies\n",
"\n",
"* **histograms** of the red and green ball frequencies."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# parameters for the synthetic dataset\n",
"bins = np.linspace(0,1000,1000)\n",
"\n",
"# interactive calculation of the sample set (control of source parametric distribution and number of samples)\n",
"l2 = widgets.Text(value=' Bootstrap Demonstration with a Cowbow Hat, Uncertainty in Proportion of Blocks, Michael Pyrcz, Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n",
"\n",
"a2 = widgets.IntSlider(min=0, max = 100, value = 2, step = 1, description = '$n_{red}$',orientation='horizontal',layout=Layout(width='400px', height='20px'),continuous_update=False)\n",
"a2.style.handle_color = 'red'\n",
"\n",
"b2 = widgets.IntSlider(min=0, max = 100, value = 3, step = 1, description = '$n_{yellow}$',orientation='horizontal',layout=Layout(width='400px', height='20px'),continuous_update=False)\n",
"b2.style.handle_color = 'yellow'\n",
"\n",
"c2 = widgets.IntSlider(min=1, max = 1000, value = 30, step = 1, description = '$L$',orientation='horizontal',layout=Layout(width='400px', height='20px'),continuous_update=False)\n",
"c2.style.handle_color = 'gray'\n",
"\n",
"uib = widgets.HBox([a2,b2,c2],) # basic widget formatting \n",
"uib2 = widgets.VBox([l2,uib],)\n",
"\n",
"def s_make(a, b, c): # function to take parameters, make sample and plot\n",
" np.random.seed(seed=seed)\n",
" red_freq = make_data(a, b, c)\n",
" green_freq = (a + b) - red_freq\n",
" prop_red = red_freq / (a + b)\n",
" prop_yellow = green_freq / (a + b)\n",
" labels = ['Red Blocks', 'Yellow Blocks']\n",
" bins = np.linspace(0,1.0,100)\n",
"\n",
" fig = plt.figure(constrained_layout=False)\n",
" gs = GridSpec(2, 2, figure=fig)\n",
" \n",
" ax1 = fig.add_subplot(gs[:, 0])\n",
" boxplot = ax1.boxplot([prop_red,prop_yellow],labels = labels, notch = True, sym = '+',patch_artist=True) \n",
" ax1.set_ylabel('Proportion of Blocks'); ax1.set_xlabel('Block Color');ax1.set_title('Bootstrap Uncertainty - Proportion Distributions')\n",
" ax1.set_yticks(np.arange(0,1.1,0.1)); ax1.set_ylim([0,1])\n",
" ax1.grid(True, which='major',axis='y',linewidth = 1.0); ax1.grid(True, which='minor',axis='y',linewidth = 0.2) # add y grids\n",
" ax1.tick_params(which='major',length=7); ax1.tick_params(which='minor', length=4)\n",
" ax1.xaxis.set_minor_locator(AutoMinorLocator()); ax1.yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks\n",
" \n",
" colors = ['red','yellow']\n",
" for patch, color in zip(boxplot['boxes'], colors):\n",
" patch.set_facecolor(color)\n",
" for patch, color in zip(boxplot['medians'], colors):\n",
" patch.set_color('black')\n",
" \n",
" ax2 = fig.add_subplot(gs[0, 1])\n",
" ax2.hist(prop_red,alpha=0.8,color=\"red\",edgecolor=\"black\", bins = bins)\n",
" ax2.set_title('Red Block Bootstrap Uncertainty - Proportion Distributions'); ax2.set_xlabel('Proportions of Red Blocks'); ax2.set_ylabel('Frequency') \n",
" ax2.grid(True, which='major',linewidth = 1.0); ax2.grid(True, which='minor',linewidth = 0.2) # add y grids\n",
" ax2.tick_params(which='major',length=7); ax2.tick_params(which='minor', length=4)\n",
" ax2.xaxis.set_minor_locator(AutoMinorLocator()); ax2.yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks\n",
" ax2.set_xlim([0,1])\n",
" \n",
" ax3 = fig.add_subplot(gs[1, 1])\n",
" ax3.hist(prop_yellow,alpha=0.8,color=\"yellow\",edgecolor=\"black\", bins = bins)\n",
" ax3.set_title('Yellow Block Bootstrap Uncertainty - Proportion Distribution'); ax3.set_xlabel('Proportion of Yellow Blocks'); ax3.set_ylabel('Frequency') \n",
" ax3.grid(True, which='major',linewidth = 1.0); ax3.grid(True, which='minor',linewidth = 0.2) # add y grids\n",
" ax3.tick_params(which='major',length=7); ax3.tick_params(which='minor', length=4)\n",
" ax3.xaxis.set_minor_locator(AutoMinorLocator()); ax3.yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks\n",
" ax3.set_xlim([0,1])\n",
" \n",
"# print('Uncertainty in Proportion of Red Balls: ')\n",
"# print('P10: ' + str(np.round(np.percentile(prop_red,10),2)) + ', P50: ' + str(np.round(np.percentile(prop_red,50),2)) + ', P90: ' + str(np.round(np.percentile(prop_red,90),2)))\n",
"\n",
" plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=2.1, wspace=0.2, hspace=0.3)\n",
" plt.show()\n",
"\n",
"# connect the function to make the samples and plot to the widgets \n",
"interactive_plot = widgets.interactive_output(s_make, {'a': a2, 'b': b2, 'c': c2})\n",
"interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simple Bootstrap Demonstration with a Cowbow Hat - Uncertainty in Proportions\n",
"\n",
"Let's look at bootstrap sample distributions over all realizations.\n",
"\n",
"#### Michael Pyrcz, Professor, The University of Texas at Austin \n",
"\n",
"##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
"\n",
"### The Problem\n",
"\n",
"Let's simulate bootstrap, resampling with replacement from a hat with $n_{red}$ and $n_{yellow}$ blocks\n",
"\n",
"* **$n_{red}$**: number of red balls in the sample (placed in the hat)\n",
"\n",
"* **$n_{yellow}$**: number of green balls in the sample (placed in the hat)\n",
"\n",
"* **$L$**: number of bootstrap realizations"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "eccd973751d84a779e62f340cf93de04",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VBox(children=(Text(value=' Bootstrap Demonstration with a Cowbow Hat, Uncertainty in Proportion of…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "cebe64e2f8a947a8af79ec912ab8f4b1",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output(outputs=({'output_type': 'display_data', 'data': {'text/plain': '', 'i…"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(uib2, interactive_plot) # display the interactive plot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Observations\n",
"\n",
"Some observations:\n",
"\n",
"* sampling distribution for proportions become discrete with too few samples, only $n$ cases are possible\n",
"\n",
"* enough bootstrap realizations are rquired for stable statistics\n",
"\n",
"\n",
"#### Comments\n",
"\n",
"This was a simple demonstration of interactive plots in Jupyter Notebook Python with the ipywidgets and matplotlib packages. \n",
"\n",
"I have many other demonstrations on data analytics and machine learning, e.g. on the basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations, trend modeling and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy. \n",
" \n",
"I hope this was helpful,\n",
"\n",
"*Michael*\n",
"\n",
"#### The Author:\n",
"\n",
"### Michael Pyrcz, Professor, The University of Texas at Austin \n",
"*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n",
"\n",
"With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n",
"\n",
"For more about Michael check out these links:\n",
"\n",
"#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
"\n",
"#### Want to Work Together?\n",
"\n",
"I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n",
"\n",
"* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n",
"\n",
"* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n",
"\n",
"* I can be reached at mpyrcz@austin.utexas.edu.\n",
"\n",
"I'm always happy to discuss,\n",
"\n",
"*Michael*\n",
"\n",
"Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin\n",
"\n",
"#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}