\n",
"\n",
"## Interactive Hypothesis Testing Demonstration\n",
"\n",
"### Boostrap and Analytical Methods for Hypothesis Testing, Difference in Means\n",
"\n",
"* we calculate the hypothesis test for different in means with boostrap and compare to the analytical expression\n",
"\n",
"* **Welch's t-test**: we assume the features are Gaussian distributed and the variance are unequal\n",
"\n",
"#### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
"\n",
"##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
"\n",
"#### Hypothesis Testing\n",
"\n",
"Powerful methodology for spatial data analytics:\n",
"\n",
"1. extracted sample set 1 and 2, the means look different, but are they? \n",
"2. should we suspect that the samples are in fact from 2 different populations?\n",
"\n",
"Now, let's try the t-test, hypothesis test for difference in means. This test assumes that the variances are similar along with the data being Gaussian distributed (see the course notes for more on this). This is our test:\n",
"\n",
"\\begin{equation}\n",
"H_0: \\mu_{X1} = \\mu_{X2}\n",
"\\end{equation}\n",
"\n",
"\\begin{equation}\n",
"H_1: \\mu_{X1} \\ne \\mu_{X2}\n",
"\\end{equation}\n",
"\n",
"To test this we will calculate the t statistic with the bootstrap and analytical approaches.\n",
"\n",
"#### The Welch's t-test for Difference in Means by Analytical and Empirical Methods\n",
"\n",
"We work with the following test statistic, *t statistic*, from the two sample sets.\n",
"\n",
"\\begin{equation}\n",
"\\hat{t} = \\frac{\\overline{x}_1 - \\overline{x}_2}{\\sqrt{\\frac{s^2_1}{n_1} + \\frac{s^2_2}{n_2}}}\n",
"\\end{equation}\n",
"\n",
"where $\\overline{x}_1$ and $\\overline{x}_2$ are the sample means, $s^2_1$ and $s^2_2$ are the sample variances and $n_1$ and $n_2$ are the numer of samples from the two datasets.\n",
"\n",
"The critical value, $t_{critical}$ is calculated by the analytical expression by:\n",
"\n",
"\\begin{equation}\n",
"t_{critical} = \\left|t(\\frac{\\alpha}{2},\\nu)\\right|\n",
"\\end{equation}\n",
"\n",
"The degrees of freedom, $\\nu$, is calculated as follows:\n",
"\n",
"\\begin{equation}\n",
"\\nu = \\frac{\\left(\\frac{1}{n_1} + \\frac{\\mu}{n_2}\\right)^2}{\\frac{1}{n_1^2(n_1-1)} + \\frac{\\mu^2}{n_2^2(n_2-1)}}\n",
"\\end{equation}\n",
"\n",
"Alternatively, the sampling distribution of the t_{statistic} and t_{critical} may be calculated empirically with bootstrap.\n",
"\n",
"The workflow proceeds as:\n",
"\n",
"* shift both sample sets to have the mean of the combined data, $x_1$ → $x^*_1$, $x_2$ → $x^*_2$ \n",
"\n",
"* for each bootstrap realization, $\\ell=1\\ldots,L$\n",
"\n",
" * perform $n_1$ Monte Carlo simulations, draws with replacement, from sample set $x^*_1$\n",
" \n",
" * perform $n_2$ Monte Carlo simulations, draws with replacement, from sample set $x^*_2$\n",
" \n",
" * calculate the t_{statistic} realization, $\\hat{t}^{\\ell}$ given the resulting sample means $\\overline{x}^{*,\\ell}_1$ and $\\overline{x}^{*,\\ell}_2$ and the sample variances $s^{*,2,\\ell}_1$ and $s^{*,2,\\ell}_2$\n",
" \n",
"* pool the results to assemble the $t_{statistic}$ sampling distribution\n",
"\n",
"* calculate the cumulative probability of the observed t_{statistic}m, $\\hat{t}$, from the boostrap distribution based on $\\hat{t}^{\\ell}$, $\\ell = 1,\\ldots,L$.\n",
"\n",
"Here's some prerequisite information on the boostrap.\n",
"\n",
"#### Bootstrap\n",
"\n",
"Uncertainty in the sample statistics\n",
"* one source of uncertainty is the paucity of data.\n",
"* do 200 or even less wells provide a precise (and accurate estimate) of the mean? standard deviation? skew? P13?\n",
"\n",
"Would it be useful to know the uncertainty in these statistics due to limited sampling?\n",
"* what is the impact of uncertainty in the mean porosity e.g. 20%+/-2%?\n",
"\n",
"**Bootstrap** is a method to assess the uncertainty in a sample statistic by repeated random sampling with replacement.\n",
"\n",
"Assumptions\n",
"* sufficient, representative sampling, identical, idependent samples\n",
"\n",
"Limitations\n",
"1. assumes the samples are representative \n",
"2. assumes stationarity\n",
"3. only accounts for uncertainty due to too few samples, e.g. no uncertainty due to changes away from data\n",
"4. does not account for boundary of area of interest \n",
"5. assumes the samples are independent\n",
"6. does not account for other local information sources\n",
"\n",
"The Bootstrap Approach (Efron, 1982)\n",
"\n",
"Statistical resampling procedure to calculate uncertainty in a calculated statistic from the data itself.\n",
"* Does this work? Prove it to yourself, for uncertainty in the mean solution is standard error: \n",
"\n",
"\\begin{equation}\n",
"\\sigma^2_\\overline{x} = \\frac{\\sigma^2_s}{n}\n",
"\\end{equation}\n",
"\n",
"Extremely powerful - could calculate uncertainty in any statistic! e.g. P13, skew etc.\n",
"* Would not be possible access general uncertainty in any statistic without bootstrap.\n",
"* Advanced forms account for spatial information and sampling strategy (game theory and Journel’s spatial bootstrap (1993).\n",
"\n",
"Steps: \n",
"\n",
"1. assemble a sample set, must be representative, reasonable to assume independence between samples\n",
"\n",
"2. optional: build a cumulative distribution function (CDF)\n",
" * may account for declustering weights, tail extrapolation\n",
" * could use analogous data to support\n",
"\n",
"3. For $\\ell = 1, \\ldots, L$ realizations, do the following:\n",
"\n",
" * For $i = \\alpha, \\ldots, n$ data, do the following:\n",
"\n",
" * Draw a random sample with replacement from the sample set or Monte Carlo simulate from the CDF (if available). \n",
"\n",
"6. Calculate a realization of the sammary statistic of interest from the $n$ samples, e.g. $m^\\ell$, $\\sigma^2_{\\ell}$. Return to 3 for another realization.\n",
"\n",
"7. Compile and summarize the $L$ realizations of the statistic of interest.\n",
"\n",
"This is a very powerful method. Let's try it out and compare the result to the analytical form of the confidence interval for the sample mean. \n",
"\n",
"\n",
"#### Objective \n",
"\n",
"Provide an example and demonstration for:\n",
"\n",
"1. interactive plotting in Jupyter Notebooks with Python packages matplotlib and ipywidgets\n",
"2. provide an intuitive hands-on example of confidence intervals and compare to statistical boostrap \n",
"\n",
"#### Getting Started\n",
"\n",
"Here's the steps to get setup in Python with the GeostatsPy package:\n",
"\n",
"1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). \n",
"2. Open Jupyter and in the top block get started by copy and pasting the code block below from this Jupyter Notebook to start using the geostatspy functionality. \n",
"\n",
"#### Load the Required Libraries\n",
"\n",
"The following code loads the required libraries."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"from ipywidgets import interactive # widgets and interactivity\n",
"from ipywidgets import widgets \n",
"from ipywidgets import Layout\n",
"from ipywidgets import Label\n",
"from ipywidgets import VBox, HBox\n",
"import matplotlib.pyplot as plt # plotting\n",
"import numpy as np # working with arrays\n",
"import pandas as pd # working with DataFrames\n",
"from scipy import stats # statistical calculations\n",
"import random # random drawing / bootstrap realizations of the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Make a Synthetic Dataset\n",
"\n",
"This is an interactive method to:\n",
"\n",
"* select a parametric distribution\n",
"\n",
"* select the distribution parameters\n",
"\n",
"* select the number of samples and visualize the synthetic dataset distribution"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"\n",
"# interactive calculation of the sample set (control of source parametric distribution and number of samples)\n",
"l = widgets.Text(value=' Interactive Hypothesis Testing, Difference in Means, Analytical & Bootstrap Methods, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n",
"\n",
"n1 = widgets.IntSlider(min=4, max = 100, value = 10, step = 1, description = '$n_{1}$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False)\n",
"n1.style.handle_color = 'red'\n",
"\n",
"m1 = widgets.FloatSlider(min=0, max = 50, value = 3, step = 1.0, description = '$\\overline{x}_{1}$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False)\n",
"m1.style.handle_color = 'red'\n",
"\n",
"s1 = widgets.FloatSlider(min=0, max = 10, value = 3, step = 0.25, description = '$s_1$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False)\n",
"s1.style.handle_color = 'red'\n",
"\n",
"ui1 = widgets.VBox([n1,m1,s1],) # basic widget formatting \n",
"\n",
"n2 = widgets.IntSlider(min=4, max = 100, value = 10, step = 1, description = '$n_{2}$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False)\n",
"n2.style.handle_color = 'yellow'\n",
"\n",
"m2 = widgets.FloatSlider(min=0, max = 50, value = 3, step = 1.0, description = '$\\overline{x}_{2}$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False)\n",
"m2.style.handle_color = 'yellow'\n",
"\n",
"s2 = widgets.FloatSlider(min=0, max = 10, value = 3, step = 0.25, description = '$s_2$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False)\n",
"s2.style.handle_color = 'yellow'\n",
"\n",
"ui2 = widgets.VBox([n2,m2,s2],) # basic widget formatting \n",
"\n",
"L = widgets.IntSlider(min=10, max = 1000, value = 100, step = 1, description = '$L$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False)\n",
"L.style.handle_color = 'gray'\n",
"\n",
"alpha = widgets.FloatSlider(min=0, max = 50, value = 3, step = 1.0, description = '$α$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False)\n",
"alpha.style.handle_color = 'gray'\n",
"\n",
"ui3 = widgets.VBox([L,alpha],) # basic widget formatting \n",
"\n",
"ui4 = widgets.HBox([ui1,ui2,ui3],) # basic widget formatting \n",
"\n",
"ui2 = widgets.VBox([l,ui4],)\n",
"\n",
"def f_make(n1, m1, s1, n2, m2, s2, L, alpha): # function to take parameters, make sample and plot\n",
"\n",
" \n",
" np.random.seed(73073)\n",
" x1 = np.random.normal(loc=m1,scale=s1,size=n1)\n",
" np.random.seed(73074)\n",
" x2 = np.random.normal(loc=m2,scale=s2,size=n2)\n",
" \n",
" mu = (s2*s2)/(s1*s1)\n",
" nu = ((1/n1 + mu/n2)*(1/n1 + mu/n2))/(1/(n1*n1*(n1-1)) + ((mu*mu)/(n2*n2*(n2-1))))\n",
" \n",
" prop_values = np.linspace(-8.0,8.0,100)\n",
" analytical_distribution = stats.t.pdf(prop_values,df = nu) \n",
" analytical_tcrit = stats.t.ppf(1.0-alpha*0.005,df = nu)\n",
" \n",
" # Analytical Method with SciPy\n",
" t_stat_observed, p_value_analytical = stats.ttest_ind(x1,x2,equal_var=False)\n",
" \n",
" # Bootstrap Method\n",
" global_average = np.average(np.concatenate([x1,x2])) # shift the means to be equal to the globla mean\n",
" x1s = x1 - np.average(x1) + global_average\n",
" x2s = x2 - np.average(x2) + global_average\n",
" \n",
" t_stat = np.zeros(L); p_value = np.zeros(L)\n",
" \n",
" random.seed(73075)\n",
" for l in range(0, L): # loop over realizations\n",
" samples1 = random.choices(x1s, weights=None, cum_weights=None, k=len(x1s))\n",
" #print(samples1)\n",
" samples2 = random.choices(x2s, weights=None, cum_weights=None, k=len(x2s))\n",
" #print(samples2)\n",
" t_stat[l], p_value[l] = stats.ttest_ind(samples1,samples2,equal_var=False)\n",
" \n",
" bootstrap_lower = np.percentile(t_stat,alpha * 0.5)\n",
" bootstrap_upper = np.percentile(t_stat,100.0 - alpha * 0.5)\n",
" \n",
" plt.subplot(121)\n",
" #print(t_stat)\n",
" \n",
" plt.hist(x1,cumulative = False, density = True, alpha=0.4,color=\"red\",edgecolor=\"black\", bins = np.linspace(0,50,50), label = '$x_1$')\n",
" plt.hist(x2,cumulative = False, density = True, alpha=0.4,color=\"yellow\",edgecolor=\"black\", bins = np.linspace(0,50,50), label = '$x_2$')\n",
" plt.ylim([0,0.4]); plt.xlim([0.0,30.0])\n",
" plt.title('Sample Distributions'); plt.xlabel('Value'); plt.ylabel('Density')\n",
" plt.legend()\n",
" \n",
" #plt.hist(x2)\n",
" \n",
" plt.subplot(122)\n",
" plt.ylim([0,0.6]); plt.xlim([-8.0,8.0])\n",
" plt.title('Bootstrap and Analytical $t_{statistic}$ Sampling Distributions'); plt.xlabel('$t_{statistic}$'); plt.ylabel('Density')\n",
" plt.plot([t_stat_observed,t_stat_observed],[0.0,0.6],color = 'black',label='observed $t_{statistic}$')\n",
" plt.plot([bootstrap_lower,bootstrap_lower],[0.0,0.6],color = 'blue',linestyle='dashed',label = 'bootstrap interval')\n",
" plt.plot([bootstrap_upper,bootstrap_upper],[0.0,0.6],color = 'blue',linestyle='dashed')\n",
" plt.plot(prop_values,analytical_distribution, color = 'red',label='analytical $t_{statistic}$')\n",
" plt.hist(t_stat,cumulative = False, density = True, alpha=0.2,color=\"blue\",edgecolor=\"black\", bins = np.linspace(-8.0,8.0,50), label = 'bootstrap $t_{statistic}$')\n",
"\n",
" plt.fill_between(prop_values, 0, analytical_distribution, where = prop_values <= -1*analytical_tcrit, facecolor='red', interpolate=True, alpha = 0.2)\n",
" plt.fill_between(prop_values, 0, analytical_distribution, where = prop_values >= analytical_tcrit, facecolor='red', interpolate=True, alpha = 0.2)\n",
" ax = plt.gca()\n",
" handles,labels = ax.get_legend_handles_labels()\n",
" handles = [handles[0], handles[2], handles[3], handles[1]]\n",
" labels = [labels[0], labels[2], labels[3], labels[1]]\n",
"\n",
" plt.legend(handles,labels,loc=1)\n",
" \n",
" \n",
" plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.2, wspace=0.2, hspace=0.2)\n",
" plt.show()\n",
"\n",
"\n",
"# connect the function to make the samples and plot to the widgets \n",
"interactive_plot = widgets.interactive_output(f_make, {'n1': n1, 'm1': m1, 's1': s1, 'n2': n2, 'm2': m2, 's2': s2, 'L': L, 'alpha': alpha})\n",
"interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Boostrap and Analytical Methods for Hypothesis Testing, Difference in Means\n",
"\n",
"* including the analytical and bootstrap methods for testing the difference in means\n",
"* interactive plot demonstration with ipywidget, matplotlib packages\n",
"\n",
"#### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
"\n",
"##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
"\n",
"### The Problem\n",
"\n",
"Let's simulate bootstrap, resampling with replacement from a hat with $n_{red}$ and $n_{green}$ balls\n",
"\n",
"* **$n_1$**, **$n_2$** number of samples, **$\\overline{x}_1$**, **$\\overline{x}_2$** means and **$s_1$**, **$s_2$** standard deviation of the 2 sample sets\n",
"* **$L$**: number of bootstrap realizations\n",
"* **$\\alpha$**: alpha level"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "45485dbfed6640fca25f0890a27884ee",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VBox(children=(Text(value=' Interactive Hypothesis Testing, Difference in Means, Analytical & Bootstrap Method…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "29cf73d9d685453a823fd123dc3c2284",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(ui2, interactive_plot) # display the interactive plot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Observations\n",
"\n",
"Some observations:\n",
"\n",
"* lower dispersion and higher difference in means increases the absolute magnitude of the observed $t_{statistic}$\n",
"\n",
"* the bootstrap distribution closely matches the analytical distribution if $L$ is large enough\n",
"\n",
"* it is possible to use bootstrap to calculate the sampling distribution instead of relying on the theoretical express distribution, in this case the Student's t distribution. \n",
"\n",
"\n",
"#### Comments\n",
"\n",
"This was a demonstration of interactive hypothesis testing for the significance in difference in means aboserved between 2 sample sets in Jupyter Notebook Python with the ipywidgets and matplotlib packages. \n",
"\n",
"I have many other demonstrations on data analytics and machine learning, e.g. on the basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations, trend modeling and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy. \n",
" \n",
"I hope this was helpful,\n",
"\n",
"*Michael*\n",
"\n",
"#### The Author:\n",
"\n",
"### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
"*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n",
"\n",
"With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n",
"\n",
"For more about Michael check out these links:\n",
"\n",
"#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
"\n",
"#### Want to Work Together?\n",
"\n",
"I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n",
"\n",
"* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n",
"\n",
"* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n",
"\n",
"* I can be reached at mpyrcz@austin.utexas.edu.\n",
"\n",
"I'm always happy to discuss,\n",
"\n",
"*Michael*\n",
"\n",
"Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n",
"\n",
"#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}