{ "cells": [ { "cell_type": "markdown", "id": "7a524981", "metadata": {}, "source": [ "
\n", "\n", "## Sampling Methods Demonstration\n", "\n", "\n", "### Michael Pyrcz, Associate Professor, University of Texas at Austin \n", "\n", "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n", "\n", "\n", "### The Interactive Workflow\n", "\n", "Here's a simple workflow for comparing random and orthogonal sampling. \n", "\n", "* we use a 'toy problem' to demonstrate and compare these sampling methods \n", "\n", "#### Sampling\n", "\n", "While statistical theory supports random sampling, the fluctuation in sample statistics can be large for small sample sizes. For example, the variance of the sample mean is the squared standard error:\n", "\n", "\\begin{equation}\n", "\\sigma_{\\overline{x}}^2 = \\frac{\\sigma_s^2}{n}\n", "\\end{equation}\n", "\n", "where $\\sigma_s^2$ is the sample variance and $n$ is the number of samples. To suppress these statistical fluctuations, alternative sampling methods are available:\n", "\n", "1. **Random Sampling** - each sample is drawn without consideration of the previously drawn samples\n", "2. **Latin Hypercube Sampling** - apply $k$ equiprobability bins to each feature, $X_m^k, m=1,...,M$. Then draw one sample from each bin, $n(X_m^k)=1$. \n", "3. **Orthogonal Sampling** - divide the joint probability density function into $k$ equal probability subspaces and then randomly draw an equal number of samples, $\\frac{n}{k}$, from each subspace. 
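
The variance formula and the equal-probability binning idea above can be checked numerically. The sketch below repeats both sampling schemes many times on a synthetic Gaussian population and compares the spread of the resulting sample means. It is a minimal illustration under stated assumptions, not the dashboard code: the helper names (`make_bins`, `random_sample`, `stratified_sample`), the seed, the population size and the number of trials are all hypothetical choices, and the bins are taken from population percentiles in one dimension.

```python
import numpy as np

def make_bins(pop, k):
    # split the population into k equal-probability bins using its percentiles
    edges = np.percentile(pop, np.linspace(0, 100, k + 1))
    labels = np.digitize(pop, edges[1:-1])  # bin index 0..k-1 for every value
    return [pop[labels == j] for j in range(k)]

def random_sample(rng, pop, n):
    # simple random sample: each draw ignores the previous ones
    return rng.choice(pop, size=n, replace=False)

def stratified_sample(rng, bins, n):
    # cycle through the bins so each one contributes about n/k samples
    k = len(bins)
    return np.array([rng.choice(bins[i % k]) for i in range(n)])

rng = np.random.default_rng(79079)                 # assumed seed, for repeatability
pop = rng.normal(loc=10.0, scale=2.0, size=20000)  # synthetic population (assumption)
n, k, trials = 20, 4, 1000                         # illustrative settings (assumption)
bins = make_bins(pop, k)

# repeat each scheme many times and record the sample mean of every trial
means_random = np.array([random_sample(rng, pop, n).mean() for _ in range(trials)])
means_binned = np.array([stratified_sample(rng, bins, n).mean() for _ in range(trials)])

var_theory = pop.var() / n  # squared standard error of the mean
print('theory :', round(var_theory, 4))
print('random :', round(means_random.var(), 4))
print('binned :', round(means_binned.var(), 4))
```

With these settings the variance of the random-sampling means should land close to the population variance divided by n, while the binned draws should give a noticeably smaller value.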
\n", "\n", "#### Objective \n", "\n", "An interactive exercise to try out and compare random and orthogonal sampling.\n", "\n", "* observe the stabilization of the sample statistics\n", "* observe the impact of the number of subspaces on the results for orthogonal sampling\n", "\n", "#### Getting Started\n", "\n", "Here are the steps to get set up in Python with the GeostatsPy package:\n", "\n", "1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). \n", "2. From Anaconda Navigator (within Anaconda3 group), go to the environment tab, click on base (root) green arrow and open a terminal. \n", "3. In the terminal type: pip install geostatspy. \n", "4. Open Jupyter and in the top block get started by copying and pasting the code block below from this Jupyter Notebook to start using the geostatspy functionality. \n", "\n", "You will need to copy the data file to your working directory. It is available here:\n", "\n", "* Tabular data - sample_data.csv at https://git.io/fh4gm.\n", "\n", "There are examples below with these functions. You can go here to see a list of the available functions, https://git.io/fh4eX, other example workflows and source code. \n", "\n", "#### Load the required libraries\n", "\n", "The following code loads the required libraries." 
] }, { "cell_type": "code", "execution_count": 1, "id": "c4f2c0f3", "metadata": {}, "outputs": [], "source": [ "import numpy as np # arrays and array math\n", "import pandas as pd # tabular data and tabular data math\n", "import matplotlib.pyplot as plt # data visualization\n", "from matplotlib import colors # color normalization\n", "import scipy.stats as stats # Gaussian PDF and random sampling\n", "from ipywidgets import interactive # widgets and interactivity\n", "from ipywidgets import widgets \n", "from ipywidgets import Layout\n", "from ipywidgets import Label\n", "from ipywidgets import VBox, HBox" ] }, { "cell_type": "markdown", "id": "31ceda7a", "metadata": {}, "source": [ "#### Interactive Sampling Methods\n", "\n", "The following code includes:\n", "\n", "* dashboard with data and orthogonal sampling parameters, number of samples and number of subspaces\n", "\n", "* plots of the data distribution and random and orthogonal samples" ] }, { "cell_type": "code", "execution_count": 2, "id": "18f16228", "metadata": {}, "outputs": [], "source": [ "# interactive calculation of the sample set (control of source parametric distribution and number of samples)\n", "style = {'description_width': 'initial'}\n", "l = widgets.Text(value=' Sampling Methods, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n", "nsamp = widgets.IntSlider(min = 1, max = 1000, value = 10, step = 5, description = '$n_{sample}$',orientation='horizontal',\n", "                          layout=Layout(width='500px', height='30px'),continuous_update = False)\n", "nsamp.style.handle_color = 'darkorange'\n", "npart = widgets.IntSlider(min = 1, max = 20, value = 4, step = 1, description = '$n_{subspace}$',orientation='horizontal',\n", "                          layout=Layout(width='500px', height='30px'),continuous_update = False)\n", "npart.style.handle_color = 'darkorange'\n", "\n", "uipars = widgets.HBox([nsamp,npart],) \n", "uik = widgets.VBox([l,uipars],)\n", "\n", "def 
f_make_sample(nsamp,npart): # function to take parameters, make sample and plot\n", "    mean = 10.0; stdev = 2.0\n", "    npop = 100000\n", "    parts = []\n", "    np.random.seed(seed = 79079)\n", "    nbin = 70\n", "    hbins = np.linspace(0,20,nbin)\n", "    shbins = np.linspace(0,20,nbin*100)\n", "    \n", "    cmap = plt.cm.hot; norm = colors.Normalize(vmin=1, vmax=npart+1)\n", "    \n", "    x = np.random.normal(loc=mean,scale=stdev,size=npop)\n", "    xs = np.random.choice(x,nsamp,replace = False)\n", "    yhat = stats.norm.pdf(shbins,loc=mean,scale=stdev)\n", "    \n", "    ax1 = plt.subplot(121)\n", "    ax1.plot(shbins,yhat,color='black',lw=2.0,zorder=1)\n", "    \n", "    ax2 = ax1.twinx()\n", "    hist2ax,_,_ = ax2.hist(xs,bins=hbins,color='grey',alpha=1.0,edgecolor='black',zorder=10,density=True,\n", "                           histtype=u'step',linewidth=2,label='Samples'); ax1.set_xlabel('Porosity (%)')\n", "    ax2.hist(xs,bins=hbins,color='grey',alpha=0.2,zorder=20,density=True)\n", "    ax1.fill_between(shbins,0,yhat,color='darkorange',alpha=0.8,zorder=1)\n", "    ax1.set_xlabel('Porosity (%)'); ax1.set_ylabel('Population Density'); ax1.set_title('Population and Random Sample'); \n", "    ax1.set_ylim([0,0.3]); ax1.set_xlim([2,18])\n", "    plt.legend(loc='upper right')\n", "    ax2.set_ylabel('Sample Density',rotation=270,labelpad=20);\n", "    \n", "    pbins = np.percentile(x,np.linspace(0,100,npart+1))\n", "    int_values = pd.cut(x, pbins,labels = np.arange(1,npart+1,1))\n", "    \n", "    for i in range(0,npart):\n", "        parts.append(x[int_values == int(i+1)])\n", "    \n", "    latin_samples = np.zeros(nsamp)\n", "    ipart = 0\n", "    for isamp in range(0,nsamp):\n", "        latin_samples[isamp] = np.random.choice(parts[ipart],1,replace = False)[0] # take the single drawn value\n", "        ipart = ipart + 1\n", "        if ipart >= npart:\n", "            ipart = 0\n", "    \n", "    ax3 = plt.subplot(122)\n", "    ax3.plot(shbins,yhat,color='black',lw=2.0,zorder=1)\n", "    \n", "    ax4 = ax3.twinx()\n", "    hist2bx,_,_ = ax4.hist(latin_samples,bins=hbins,color='grey',alpha=1.0,edgecolor='black',zorder=10,density=True,\n", "                           
histtype=u'step',linewidth=2,label='Samples')\n", "    ax4.hist(latin_samples,bins=hbins,color='grey',alpha=0.2,zorder=20,density=True) \n", "    ax3.set_xlabel('Porosity (%)'); ax3.set_title('Population and Orthogonal Samples')\n", "    ax3.set_ylabel('Population Density'); ax3.set_ylim([0,0.3]); ax3.set_xlim([2,18])\n", "    plt.legend(loc='upper right'); ax4.set_ylabel('Sample Density',rotation=270,labelpad=20)\n", "    \n", "    i = 0\n", "    for fbin in pbins[1:]:\n", "        ax3.vlines(fbin,0,stats.norm.pdf(fbin,loc=mean,scale=stdev),color='black',lw=1.0)\n", "        ax3.fill_between(shbins,0,yhat,color=plt.cm.inferno(i/(npart+1)),alpha=0.8,where=(np.logical_and([shbins > pbins[i]],[shbins < pbins[i+1]])[0]),zorder=1)\n", "        i = i + 1\n", "    \n", "    ylim_24 = max(np.max(hist2ax)*(1.1 + (1.25-1.1)/1000 * nsamp),np.max(hist2bx)*(1.1 + (1.25-1.1)/1000 * nsamp))\n", "    ax2.set_ylim([0.0,ylim_24])\n", "    ax4.set_ylim([0.0,ylim_24])\n", "    \n", "    plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.2, wspace=0.3, hspace=0.2); plt.show()\n", "    \n", "# connect the function to make the samples and plot to the widgets \n", "interactive_plot = widgets.interactive_output(f_make_sample, {'nsamp':nsamp, 'npart':npart})\n", "interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating" ] }, { "cell_type": "markdown", "id": "4fc3b000", "metadata": {}, "source": [ "### Interactive Sampling Demonstration\n", "\n", "Compare random and orthogonal sampling. 
Select the number of samples and number of subspaces and observe the sample distribution.\n", "\n", "#### Michael Pyrcz, Associate Professor, University of Texas at Austin \n", "\n", "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n", "\n", "### The Inputs\n", "\n", "* **$n_{sample}$** - the number of samples, **$n_{subspace}$** - the number of orthogonal subspaces" ] }, { "cell_type": "code", "execution_count": 3, "id": "29c114d5", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "befce3a17b46493d99095e1e5c42bbce", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(Text(value=' Sampling Methods, Michael Pyrcz, Asso…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7065cb80ca4946f2b3247fa088cdd8ef", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(uik, interactive_plot) # display the interactive plot" ] }, { "cell_type": "markdown", "id": "2b0d85f7", "metadata": {}, "source": [ "#### Comments\n", "\n", "This was an interactive demonstration of sampling for data analytics. 
Much more could be done. I have other demonstrations on the basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy. \n", " \n", "#### The Author:\n", "\n", "### Michael Pyrcz, Associate Professor, University of Texas at Austin \n", "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n", "\n", "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n", "\n", "For more about Michael check out these links:\n", "\n", "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n", "\n", "#### Want to Work Together?\n", "\n", "I hope this content is helpful to those who want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n", "\n", "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n", "\n", "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? 
My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n", "\n", "* I can be reached at mpyrcz@austin.utexas.edu.\n", "\n", "I'm always happy to discuss,\n", "\n", "*Michael*\n", "\n", "Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n", "\n", "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) " ] }, { "cell_type": "code", "execution_count": null, "id": "090362ad", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }