{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "

\n", " \n", "\n", "

\n", "\n", "## Interactive Spatial Data Declustering vs. Naive Demonstration\n", "\n", "\n", "### Michael Pyrcz, Professor, University of Texas at Austin \n", "\n", "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n", "\n", "\n", "### The Interactive Workflow\n", "\n", "Here's an interactive demonstration of the impact of declustering, by comparing the declustered mean and naive mean sequentially as the data is collected. \n", "\n", "* first phase is regular sampling - naive mean and declustered mean are the same and converge to the truth mean as the number of samples increases and covers the area of interest\n", "\n", "* second phase is infill sampling - targeted samples near high values, the naive mean is increasingly biased high while the declustered mean is stable at the truth mean\n", "\n", "Why does this work so well? \n", "\n", "* the dataset is sampled on a perfectly regular grid for the original phase resulting in a obvious choice of cell-based declustering cell size. \n", "\n", "* the infill drilling fits within the cells sharing the weight with the regular spaced sample from the first phase.\n", "\n", "This is an ideal situation for cell-based desclustering.\n", "\n", "#### Geostatistical Sampling Representativity\n", "\n", "In general, we should assume that all spatial data that we work with is biased.\n", "\n", "##### Source of Spatial Sampling Bias\n", "\n", "Data is collected to answer questions:\n", "* how far does the contaminant plume extend? – sample peripheries\n", "* where is the fault? – drill based on seismic interpretation\n", "* what is the highest mineral grade? – sample the best part\n", "* who far does the reservoir extend? – offset drilling\n", "and to maximize NPV directly:\n", "* maximize production rates\n", "\n", "**Random Sampling**: when every item in the population has a equal chance of being chosen. Selection of every item is independent of every other selection.\n", "Is random sampling sufficient for subsurface? Is it available?\n", "* it is not usually available, would not be economic\n", "* data is collected answer questions\n", " * how large is the reservoir, what is the thickest part of the reservoir \n", "* and wells are located to maximize future production\n", " * dual purpose appraisal and injection / production wells!\n", "\n", "**Regular Sampling**: when samples are taken at regular intervals (equally spaced). \n", "* less reliable than random sampling.\n", "* Warning: may resonate with some unsuspected environmental variable.\n", "\n", "What do we have?\n", "* we usually have biased, opportunity sampling \n", "* we must account for bias (debiasing will be discussed later)\n", "\n", "So if we were designing sampling for representativity of the sample set and resulting sample statistics, by theory we have 2 options, random sampling and regular sampling.\n", "\n", "* What would happen if you proposed random sampling in the Gulf of Mexico at $150M per well?\n", "\n", "We should not change current sampling methods as they result in best economics, we should address sampling bias in the data.\n", "\n", "Never use raw spatial data without access sampling bias / correcting.\n", "\n", "##### Mitigating Sampling Bias\n", "\n", "In this demonstration we will take a biased spatial sample data set and apply declustering using **GeostatsPy** functionality.\n", "\n", "#### Objective \n", "\n", "In the PGE 383: Stochastic Subsurface Modeling class I want to provide hands-on experience with building subsurface modeling workflows. Python provides an excellent vehicle to accomplish this. I have coded a package called GeostatsPy with GSLIB: Geostatistical Library (Deutsch and Journel, 1998) functionality that provides basic building blocks for building subsurface modeling workflows. \n", "\n", "The objective is to remove the hurdles of subsurface modeling workflow construction by providing building blocks and sufficient examples. This is not a coding class per se, but we need the ability to 'script' workflows working with numerical methods. \n", "\n", "#### Getting Started\n", "\n", "Here's the steps to get setup in Python with the GeostatsPy package:\n", "\n", "1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). \n", "2. From Anaconda Navigator (within Anaconda3 group), go to the environment tab, click on base (root) green arrow and open a terminal. \n", "3. In the terminal type: pip install geostatspy. \n", "4. Open Jupyter and in the top block get started by copy and pasting the code block below from this Jupyter Notebook to start using the geostatspy functionality. \n", "\n", "You will need to copy the data file to your working directory. They are available here:\n", "\n", "* Tabular data - sample_data_biased.csv at https://git.io/fh0CW\n", "\n", "There are exampled below with these functions. You can go here to see a list of the available functions, https://git.io/fh4eX, other example workflows and source code. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.environ[\"TQDM_DISABLE\"] = \"True\" \n", "\n", "import geostatspy.GSLIB as GSLIB # GSLIB utilies, visualization and wrapper\n", "import geostatspy.geostats as geostats # GSLIB methods convert to Python " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will also need some standard packages. These should have been installed with Anaconda 3." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np # ndarrys for gridded data\n", "import pandas as pd # DataFrames for tabular data\n", "import os # set working directory, run executables\n", "import matplotlib.pyplot as plt # for plotting\n", "from scipy import stats # summary statistics\n", "from ipywidgets import interactive # widgets and interactivity\n", "from ipywidgets import widgets \n", "from ipywidgets import Layout\n", "from ipywidgets import Label\n", "from ipywidgets import VBox, HBox\n", "from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks\n", "plt.rc('axes', axisbelow=True) # set axes and grids in the background for all plots\n", "from matplotlib.patches import Rectangle # drawing shapes on plots\n", "from statsmodels.stats.weightstats import DescrStatsW\n", "cmap = plt.cm.inferno\n", "burnt_orange = \"#BF5700\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing 'python -m pip install [package-name]'. More assistance is available with the respective package docs. \n", "\n", "#### Declare Functions\n", "\n", "These functions read in the multiple realizations and produce local statistical summaries that we will cover below. They will shortly be added to GeostatsPy." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def add_grid():\n", " plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add y grids\n", " plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)\n", " plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks \n", "\n", "def weighted_percentile(data,perc,weights=None): # from Luca Jokull (https://stackoverflow.com/questions/21844024/weighted-percentile-using-numpy)\n", " if weights is None:\n", " return nanpercentile(data,perc)\n", " else:\n", " d=data[(~np.isnan(data))&(~np.isnan(weights))]\n", " ix=np.argsort(d)\n", " d=d[ix]\n", " wei=weights[ix]\n", " wei_cum=100.*np.cumsum(wei*1./sum(wei))\n", " return np.interp(perc,wei_cum,d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Set the working directory\n", "\n", "I always like to do this so I don't lose files and to simplify subsequent read and writes (avoid including the full address each time). " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "#os.chdir(\"c:/PGE383\") # set the working directory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Loading Tabular Data\n", "\n", "Here's the command to load our comma delimited data file in to a Pandas' DataFrame object. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/2level_cluster.csv') # load data from Dr. Pyrcz's GitHub \n", "pormin=4.0; pormax = 16.0\n", "xmin = 0.0; xmax = 10000.0; ymin = 0.0; ymax = 10000\n", "df.head()\n", "truth_average = 10.2 # from the synthetic data generation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interactive Cell-based Declustered and Naive Mean Comparison as Data is Collected\n", "\n", "* use the slider to walk through the data collection and observe the difference in declustered and niave means.\n", "\n", "#### Michael Pyrcz, Professor, University of Texas at Austin \n", "\n", "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n", "\n", "### The Inputs\n", "\n", "* **Number of Samples" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "96c3c443ba904df0a42446df59a81a30", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(Text(value=' Comparison Cell-based Declustered vs.…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "af44744023f944ffa9510c62dd976115", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output(layout=Layout(height='650px'))" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%capture --no-display\n", "\n", "from tqdm.auto import tqdm\n", "import sys\n", "from contextlib import contextmanager\n", "\n", "@contextmanager\n", "def suppress_stdout():\n", " with open(os.devnull, \"w\") as devnull:\n", " old_stdout = sys.stdout\n", " sys.stdout = devnull\n", " try:\n", " yield\n", " finally:\n", " sys.stdout = old_stdout\n", "\n", "pormin = 4.0; pormax = 16.0\n", "ncell = 1; cmin = 1000; cmax = 1000; noff = 5; iminmax = 1\n", "\n", "# interactive calculation of the sample set (control of source parametric distribution and number of samples)\n", "style = {'description_width': 'initial'}\n", "l = widgets.Text(value=' Comparison Cell-based Declustered vs. Naïve Statistics, Michael Pyrcz, Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n", "nsample = widgets.BoundedIntText(min = 1, max = len(df), value = 1, step = 3, description = 'Number of Samples',style=style,orientation='horizontal',\n", " layout=Layout(width='450px', height='50px'),continuous_update = False)\n", "nsample.style.handle_color = 'gray'\n", "\n", "ui = widgets.VBox([l,nsample],)\n", "\n", "def f_make_declus(nsample):\n", "\n", "\n", " \n", " declus_por_mean = np.zeros(nsample)\n", " df_subset = df.loc[:nsample-1,:].copy(deep=True)\n", " por = df_subset[\"Porosity\"].values\n", " cumulative_avg = np.cumsum(por) / np.arange(1, len(por) + 1)\n", " df_subset[\"Porosity_Cumulative_Mean\"] = cumulative_avg\n", " #for isample in range(0, nsample):\n", " for isample in tqdm(range(nsample), disable=True):\n", " with suppress_stdout():\n", " wts, cell_sizes, dmeans = geostats.declus(\n", " df_subset.loc[:isample,:],\n", " 'X','Y','Porosity',\n", " iminmax=iminmax,\n", " noff=noff,\n", " ncell=ncell,\n", " cmin=cmin,\n", " cmax=cmax\n", " )\n", " declus_por_mean[isample] = np.average(\n", " df_subset.loc[:isample,'Porosity'],\n", " weights=wts\n", " )\n", " df_subset[\"Porosity_Cumulative_Declus_Mean\"] = declus_por_mean\n", "\n", " with out:\n", " out.clear_output(wait=True)\n", " \n", " plt.subplot(121)\n", " sc = plt.scatter(df_subset['X'],df_subset['Y'],c=df_subset['Porosity'],s=30,edgecolor='black',cmap = cmap, vmin = pormin,vmax = pormax)\n", " cbar = plt.colorbar(sc)\n", " cbar.set_label(\"Porosity (%)\")\n", " \n", " add_grid(); plt.xlim([xmin,xmax]); plt.ylim([ymin,ymax])\n", " plt.xlabel('X (m)'); plt.ylabel('Y (m)'); plt.title('Porosity Data')\n", " \n", " plt.subplot(122)\n", " plt.plot(np.arange(0,nsample)+1,df_subset[\"Porosity_Cumulative_Mean\"],c='black',lw=2,label='Naive Average',zorder=3)\n", " plt.plot(np.arange(0,nsample)+1,df_subset[\"Porosity_Cumulative_Declus_Mean\"],c=burnt_orange,lw=4,label='Declustered Average',zorder=2)\n", " plt.plot([0,len(df)],[truth_average,truth_average],c='red',lw=1,ls='--',label='Truth',zorder=1)\n", " plt.axvline(x=len(df_subset), linestyle=\"--\", color=\"black\", linewidth=1, zorder=0)\n", " plt.legend(loc='upper right')\n", " plt.xlim([1,len(df)]); plt.ylim([7.0,12.0]); plt.xlabel('Number of Samples'); plt.ylabel('Average Porosity (%)'); plt.title('Running Average Porosity')\n", " add_grid()\n", " \n", " plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.2, wspace=0.15, hspace=0.2); plt.show()\n", " plt.figure(figsize=(12,5))\n", "\n", "\n", "out = widgets.Output(layout=widgets.Layout(height='650px')) # fixed height\n", "\n", "interactive_plot = widgets.interactive_output(\n", " f_make_declus,\n", " {'nsample': nsample}\n", ")\n", "\n", "display(ui, out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Comments\n", "\n", "This is an interactive demonstration of cell-based declustering to correct for sampling bias. Much more could be done, I have other demonstrations on the basics of working with DataFrames, ndarrays and many other workflows availble at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy.\n", "\n", "I hope this was helpful,\n", "\n", "*Michael*\n", "\n", "Michael Pyrcz, Ph.D., P.Eng. Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n", "\n", "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n", "\n", "#### About the Author\n", "\n", "Michael Pyrcz is a professor in the [Cockrell School of Engineering](https://cockrell.utexas.edu/faculty-directory/alphabetical/p), and the [Jackson School of Geosciences](https://www.jsg.utexas.edu/researcher/michael_pyrcz/), at [The University of Texas at Austin](https://www.utexas.edu/), where he researches and teaches subsurface, spatial data analytics, geostatistics, and machine learning. Michael is also,\n", "\n", "* the principal investigator of the [Energy Analytics](https://fri.cns.utexas.edu/energy-analytics) freshmen research initiative and a core faculty in the Machine Learn Laboratory in the College of Natural Sciences, The University of Texas at Austin\n", "\n", "* an associate editor for [Computers and Geosciences](https://www.sciencedirect.com/journal/computers-and-geosciences/about/editorial-board), and a board member for [Mathematical Geosciences](https://link.springer.com/journal/11004/editorial-board), the International Association for Mathematical Geosciences. \n", "\n", "Michael has written over 80 [peer-reviewed publications](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en), a [Python package](https://pypi.org/project/geostatspy/) for spatial data analytics, co-authored a textbook on spatial data analytics, [Geostatistical Reservoir Modeling](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) and author of two recently released e-books, [Applied Geostatistics in Python: a Hands-on Guide with GeostatsPy](https://geostatsguy.github.io/GeostatsPyDemos_Book/intro.html) and [Applied Machine Learning in Python: a Hands-on Guide with Code](https://geostatsguy.github.io/MachineLearningDemos_Book/intro.html).\n", "\n", "All of Michael’s university lectures are available on his [YouTube Channel](https://www.youtube.com/@GeostatsGuyLectures) with links to 100s of Python interactive dashboards and well-documented workflows in over 40 repositories on his [GitHub account](https://github.com/GeostatsGuy), to support any interested students and working professionals with evergreen content. To find out more about Michael’s work and shared educational resources visit his [Website](www.michaelpyrcz.com).\n", "\n", "#### Want to Work Together?\n", "\n", "I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n", "\n", "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n", "\n", "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PI is Professor John Foster)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n", "\n", "* I can be reached at mpyrcz@austin.utexas.edu.\n", "\n", "I'm always happy to discuss,\n", "\n", "*Michael*\n", "\n", "Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin\n", "\n", "More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Geostatistics Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [Applied Geostats in Python e-book](https://geostatsguy.github.io/GeostatsPyDemos_Book/intro.html) | [Applied Machine Learning in Python e-book](https://geostatsguy.github.io/MachineLearningDemos_Book/) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.9" } }, "nbformat": 4, "nbformat_minor": 4 }