{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", " \n", "\n", "

\n", "\n", "## Interactive Python Data Science Dashboards \n", "\n", "### Bootstrap and Machine Learning Bagging with Custom Plots\n", "\n", "#### Michael Pyrcz, Professor, The University of Texas at Austin \n", "\n", "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a tutorial for / demonstration of **Bootstrap and Machine Learning Bagging with Custom Plots**. \n", "\n", "**YouTube Lecture**: check out my lectures on:\n", "\n", "* [Bootstrap](https://youtu.be/wCgdoImlLY0?si=lpTWz2H7QTdxHBy9). \n", "* [Ensemble Tree Methods](https://www.youtube.com/watch?v=m5_wk310fho&list=PLG19vXLQHvSC2ZKFIkgVpI9fCjkN38kwf&index=39)\n", "\n", "For your convenience here's a summary of salient points.\n", "\n", "#### Bootstrap\n", "\n", "Uncertainty in the sample statistics\n", "* one source of uncertainty is the paucity of data.\n", "* do 200 or even less wells provide a precise (and accurate estimate) of the mean? standard deviation? skew? P13?\n", "\n", "Would it be useful to know the uncertainty in these statistics due to limited sampling?\n", "* what is the impact of uncertainty in the mean porosity e.g. 20%+/-2%?\n", "\n", "**Bootstrap** is a method to assess the uncertainty in a sample statistic by repeated random sampling with replacement.\n", "\n", "Assumptions\n", "* sufficient, representative sampling, identical, idependent samples\n", "\n", "Limitations\n", "1. assumes the samples are representative \n", "2. assumes stationarity\n", "3. only accounts for uncertainty due to too few samples, e.g. no uncertainty due to changes away from data\n", "4. does not account for boundary of area of interest \n", "5. assumes the samples are independent\n", "6. does not account for other local information sources\n", "\n", "The Bootstrap Approach (Efron, 1982)\n", "\n", "Statistical resampling procedure to calculate uncertainty in a calculated statistic from the data itself.\n", "* Does this work? Prove it to yourself, for uncertainty in the mean solution is standard error: \n", "\n", "\\begin{equation}\n", "\\sigma^2_\\overline{x} = \\frac{\\sigma^2_s}{n}\n", "\\end{equation}\n", "\n", "Extremely powerful - could calculate uncertainty in any statistic! e.g. P13, skew etc.\n", "* Would not be possible access general uncertainty in any statistic without bootstrap.\n", "* Advanced forms account for spatial information and sampling strategy (game theory and Journel’s spatial bootstrap (1993).\n", "\n", "Steps: \n", "\n", "1. assemble a sample set, must be representative, reasonable to assume independence between samples\n", "\n", "2. optional: build a cumulative distribution function (CDF)\n", " * may account for declustering weights, tail extrapolation\n", " * could use analogous data to support\n", "\n", "3. For $\\ell = 1, \\ldots, L$ realizations, do the following:\n", "\n", " * For $i = \\alpha, \\ldots, n$ data, do the following:\n", "\n", " * Draw a random sample with replacement from the sample set or Monte Carlo simulate from the CDF (if available). \n", "\n", "6. 
Calculate a realization of the summary statistic of interest from the $n$ samples, e.g., $m^\ell$, $\sigma^2_{\ell}$. Return to 3 for another realization.\n", "\n", "5. Compile and summarize the $L$ realizations of the statistic of interest.\n", "\n", "#### Machine Learning Bagging\n", "\n", "The fundamental idea is to use multiple datasets to build an ensemble of prediction models and then aggregate over the multiple predictions to reduce model variance.\n", "\n", "* But, to build the ensemble of models we need multiple training datasets, and these are typically not available.\n", "\n", "* the solution is to **bootstrap** the entire dataset to build multiple bootstrap realizations of training data, $X_1^b,...,X_m^b$\n", "\n", "* a deep decision tree is fit to each realization of the training data, $\hat{f}^b(X_1^b,...,X_m^b)$\n", "\n", "* a prediction estimate is calculated by each tree in the ensemble, $Y^b = \hat{f}^b(X_1^b,...,X_m^b)$\n", "\n", "* for **regression** the ensemble prediction is the average of the predictions from each member of the ensemble, $Y = \frac{1}{B} \sum_{b=1}^{B} Y^b$\n", "\n", "* for **classification** the ensemble prediction is the majority rule of the ensemble classifications, $Y = \operatorname{argmax}_k(Y^b_k)$\n", "\n", "These are very powerful methods; see the short bagging code sketch just before the plotting functions below. Let's demonstrate them both with useful custom plots in interactive Python dashboards.\n", "\n", "#### Load the required libraries\n", "\n", "The following code loads the required libraries. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt # plotting\n", "import statsmodels.api as sm # make PDFs\n", "import numpy as np # working with arrays\n", "import pandas as pd # working with DataFrames\n", "from sklearn.linear_model import LinearRegression # linear regression base learner for bagging\n", "from ipywidgets import interactive # widgets and interactivity\n", "from ipywidgets import widgets \n", "from ipywidgets import Layout\n", "from ipywidgets import Label\n", "from ipywidgets import VBox, HBox\n", "import random # random drawing / bootstrap realizations of the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing 'python -m pip install [package-name]'. More assistance is available in the respective package documentation. \n", "\n", "#### Declare Functions\n", "\n", "I added convenience functions for:\n", "\n", "1. a custom bootstrap plot\n", "2. a custom machine learning bagging plot\n", "\n", "For these plots we make our own histogram and scatter plot with linear model plots to concisely visualize the entire workflow."
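, "\n", "Before moving to the dashboards, here is a minimal, non-interactive sketch of the bagging recipe summarized above: bootstrap resample the training data, fit one model per realization, and average the ensemble predictions. The summary above uses deep decision trees as the base learner; for simplicity this sketch, like the bagging dashboard below, uses linear regression. The synthetic data, seed, number of realizations and prediction location here are made up for illustration only.\n", "\n", "```python\n", "import numpy as np # arrays and random sampling\n", "from sklearn.linear_model import LinearRegression # base learner for the ensemble\n", "\n", "rng = np.random.default_rng(seed=73073) # illustrative seed\n", "X = rng.normal(loc=2.2,scale=0.2,size=30) # synthetic predictor, e.g., density\n", "y = 10.0 + 12.0*(X - 2.2) + rng.normal(0.0,1.0,size=30) # synthetic response, e.g., porosity\n", "\n", "B = 100 # number of bootstrap realizations / ensemble members\n", "x_new = np.array([[2.3]]) # predictor value to estimate\n", "preds = np.zeros(B)\n", "for b in range(B):\n", "    idx = rng.integers(0,len(X),size=len(X)) # draw data indices with replacement\n", "    model = LinearRegression().fit(X[idx].reshape(-1, 1),y[idx]) # fit one model to this realization\n", "    preds[b] = model.predict(x_new)[0] # prediction from ensemble member b\n", "\n", "print(np.round(np.average(preds),2),np.round(np.std(preds),2)) # bagging estimate and ensemble spread\n", "```\n", "\n", "The average of `preds` is the bagging regression estimate, and the spread of `preds` is a simple summary of model variance across the ensemble.\n"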
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def custom_hist(data,ax,xpmin,xpmax,ypmin,ypmax,xmin,xmax,ymin,ymax,nbin,nybin,xlabel,ylabel,title,tsize,labelpad,\n", " plot_avg,plot_PDF):\n", " \n", " plt.plot([xpmin,xpmax],[ypmin,ypmin],color='black'); plt.plot([xpmin,xpmin],[ypmin,ypmax],color='black')\n", " xrange = xmax - xmin; yrange = ymax - ymin\n", " xprange = xpmax - xpmin; yprange = ypmax - ypmin\n", " xhalf = (xmax-xmin)/(nbin-1)/2.0; xsize = xhalf*2.0\n", " xphalf = xhalf*xprange/xrange\n", " xpsize = xsize*xprange/xrange\n", " xbins = np.linspace(xmin,xmax,nbin)\n", " ybins = np.linspace(0.0,ymax,nybin)\n", " xcents = np.linspace(xmin + xhalf,xmax-xhalf,nbin-1)\n", " \n", " xvalues = np.linspace(xmin,xmax,100)\n", " for ibin, xbin in enumerate(xbins):\n", " xx = ((xbin-xmin)/xrange*xprange)+xpmin\n", " plt.plot([xx,xx],[ypmin,ypmax],color='grey',lw=0.2,zorder=1) \n", " if ibin % 2 == 0:\n", " plt.plot([xx,xx],[ypmin-yprange*0.1,ypmin],color='black',zorder=3)\n", " plt.annotate(np.round(xbin,1),(xx,ypmin-yprange*0.24),ha='center',size=tsize)\n", " else: \n", " plt.plot([xx,xx],[ypmin-yprange*0.05,ypmin],color='black',zorder=3)\n", " plt.annotate(np.round(xbin,1),(xx,ypmin-yprange*0.15),ha='center',size=tsize)\n", " \n", " for ybin in ybins:\n", " yy = (ybin)*yrange/ymax*yprange/yrange+ypmin\n", " plt.plot([xpmin-xprange*0.04,xpmin],[yy,yy],color='black',zorder=3)\n", " plt.plot([xpmin,xpmax],[yy,yy],color='grey',lw=0.2,zorder=1)\n", " plt.annotate(np.round(ybin,1),(xpmin-xprange*0.05,yy),ha='right',size=tsize)\n", " \n", " for idata in range(0,len(data)):\n", " xx = ((data[idata]-xmin)/xrange*xprange)+xpmin \n", " plt.scatter(xx,ypmin,color='red',edgecolor='black',s=20,alpha=0.4,zorder=100)\n", " \n", " histboot = np.histogram(data, bins=xbins, weights=None)[0]\n", " average = np.average(data)\n", " \n", " for ibin, prop in enumerate(histboot):\n", " xx = ((xcents[ibin]-xmin)/xrange*xprange)+xpmin\n", " yy = histboot[ibin]*yrange/ymax*yprange/yrange\n", " ax.add_patch(plt.Rectangle((xx-xphalf,ypmin), xpsize, yy, lw=1, fc = 'darkorange',color='black', ))\n", " xavg = ((average-xmin)/xrange*xprange)+xpmin\n", " \n", " if plot_avg:\n", " xx = ((np.average(data)-xmin)/xrange*xprange)+xpmin\n", " plt.plot([xx,xx],[ypmin,ypmax],color='red',lw=2,ls='--')\n", " \n", " if plot_PDF:\n", " PDFModel = sm.nonparametric.KDEUnivariate(data).fit()\n", " yPDF = PDFModel.evaluate(xvalues)*len(data)*xsize\n", " xxvalues = ((xvalues-xmin)/xrange*xprange)+xpmin; yyPDF = ((yPDF-ymin)/yrange*yprange)+ypmin\n", " plt.plot(xxvalues,yyPDF,color='black',zorder=200)\n", " \n", " plt.annotate(title,(xpmin+xprange*0.5,ypmax+yprange*0.08),ha='center') \n", " plt.annotate(xlabel,(xpmin+xprange*0.5,ypmin-labelpad*1.5),ha='center') \n", " plt.annotate(ylabel,(xpmin-labelpad*0.6,ypmin+yprange*0.5),va='center',rotation = 90.0) \n", " \n", "def custom_scatter(X,y,ax,xpmin,xpmax,ypmin,ypmax,xmin,xmax,ymin,ymax,nbin,nybin,xlabel,ylabel,title,tsize,labelpad,\n", " add_model,show_pred,x_input):\n", "\n", " plt.plot([xpmin,xpmax],[ypmin,ypmin],color='black'); plt.plot([xpmin,xpmin],[ypmin,ypmax],color='black')\n", " xrange = xmax - xmin; yrange = ymax - ymin\n", " xprange = xpmax - xpmin; yprange = ypmax - ypmin\n", " xbins = np.linspace(xmin,xmax,nbin)\n", " ybins = np.linspace(ymin,ymax,nybin)\n", " \n", " for ibin, xbin in enumerate(xbins):\n", " xx = ((xbin-xmin)/xrange*xprange)+xpmin\n", " plt.plot([xx,xx],[ypmin,ypmax],color='grey',lw=0.2,zorder=1) \n", " if ibin % 2 
== 0:\n", " plt.plot([xx,xx],[ypmin-yprange*0.1,ypmin],color='black',zorder=3)\n", " plt.annotate(np.round(xbin,1),(xx,ypmin-yprange*0.24),ha='center',size=tsize)\n", " else: \n", " plt.plot([xx,xx],[ypmin-yprange*0.05,ypmin],color='black',zorder=3)\n", " plt.annotate(np.round(xbin,1),(xx,ypmin-yprange*0.15),ha='center',size=tsize)\n", " \n", " for ibin, ybin in enumerate(ybins):\n", " yy = ((ybin-ymin)/yrange*yprange)+ypmin\n", " plt.plot([xpmin,xpmax],[yy,yy],color='grey',lw=0.2,zorder=1)\n", " if ibin % 2 == 0:\n", " plt.plot([xpmin-xprange*0.015,xpmin],[yy,yy],color='black',zorder=3)\n", " plt.annotate(np.round(ybin,1),(xpmin-xprange*0.08,yy),ha='center',size=tsize)\n", " else: \n", " plt.plot([xpmin-xprange*0.05,xpmin],[yy,yy],color='black',zorder=3)\n", " plt.annotate(np.round(ybin,1),(xpmin-xprange*0.12,yy),ha='center',size=tsize) \n", " \n", " XX = ((X-xmin)/xrange*xprange)+xpmin\n", " yy = ((y-ymin)/yrange*yprange)+ypmin\n", " \n", " plt.scatter(XX,yy,color='darkorange',edgecolor='black',s=30)\n", " \n", " plt.annotate(title,(xpmin+xprange*0.5,ypmax+yprange*0.08),ha='center') \n", " plt.annotate(xlabel,(xpmin+xprange*0.5,ypmin-labelpad*1.5),ha='center') \n", " plt.annotate(ylabel,(xpmin-labelpad*0.7,ypmin+yprange*0.5),va='center',rotation = 90.0) \n", " \n", " y_input = -1\n", " if add_model == True:\n", " model = LinearRegression().fit(X.reshape(-1, 1),y)\n", " y_pred = model.predict(xbins.reshape(-1, 1))\n", " xxbins = ((xbins-xmin)/xrange*xprange)+xpmin\n", " yy_pred = ((y_pred-ymin)/yrange*yprange)+ypmin \n", " plt.plot(xxbins,yy_pred,color='red',ls='--')\n", " \n", " if show_pred == True:\n", " xx_input = ((x_input-xmin)/xrange*xprange)+xpmin\n", " y_input = model.coef_[0]*x_input + model.intercept_\n", " yy_input = ((y_input-ymin)/yrange*yprange)+ypmin\n", " plt.plot([xx_input,xx_input],[ypmin,yy_input],color='black')\n", " plt.plot([xpmin,xx_input],[yy_input,yy_input],color='black')\n", " plt.annotate('f(' + str(np.round(x_input,1)) + ')=' + str(np.round(y_input,2)),(xpmin + 0.02*xprange,yy_input+0.05*yprange),size=tsize)\n", " return y_input " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Custom Bootstrap Visualization\n", "\n", "We apply bootstrap data and statistic realizations to calculate the uncertainty in the average.\n", "\n", "* In the display we make a simple data set for demonstration.\n", "* Step through the bootstrap realizations or change the random number seed with the dashboard." 
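, "\n", "For reference, the resampling at the core of this dashboard can be written without any widgets in a few lines. This is only a minimal sketch: the synthetic normal sample (mean 10, standard deviation 3, n = 25) mirrors the data generated in the dashboard code below, while the seed and the number of realizations here are illustrative.\n", "\n", "```python\n", "import numpy as np # arrays and random sampling\n", "\n", "rng = np.random.default_rng(seed=73073) # illustrative seed\n", "porosity = rng.normal(loc=10.0,scale=3.0,size=25) # synthetic sample set, n = 25\n", "\n", "L = 1000 # number of bootstrap realizations\n", "boot_means = np.zeros(L)\n", "for l in range(L):\n", "    resample = rng.choice(porosity,size=len(porosity),replace=True) # sample with replacement\n", "    boot_means[l] = np.average(resample) # statistic of interest for this realization\n", "\n", "print(np.round(np.average(boot_means),2),np.round(np.std(boot_means),2)) # mean and st.dev. of the bootstrap means\n", "print(np.round(np.percentile(boot_means,10),2),np.round(np.percentile(boot_means,90),2)) # P10 and P90\n", "```\n"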
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# parameters for the synthetic dataset\n", "bins = np.linspace(0,1000,1000)\n", "\n", "l = widgets.Text(value=' Boostrap Demonstration for Uncertainty in the Average, Michael Pyrcz, Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n", "\n", "nreal = widgets.IntSlider(min = 1, max = 16, value = 1, description = r'$L$',orientation='horizontal',layout=Layout(width='800px', height='20px'),continuous_update=False)\n", "nreal.style.handle_color = 'gray'\n", "\n", "seed = widgets.IntSlider(min = 1, max = 16, value = 1, description = r'$s$',orientation='horizontal',layout=Layout(width='800px', height='20px'),continuous_update=False)\n", "seed.style.handle_color = 'gray'\n", "\n", "ui5 = widgets.HBox([nreal,seed,],kwargs = {'justify_content':'center'})\n", "ui_all = widgets.VBox([l,ui5],)\n", "\n", "def make_viz3(nreal,seed):\n", " random.seed(seed); np.random.seed(seed=seed)\n", " random_values = np.random.normal(loc=10.0,scale=3.0,size=25)\n", " bootstrap_sample = np.zeros((nreal,len(random_values),))\n", " \n", " for ireal in range(0,nreal):\n", " bootstrap_sample[ireal] = random.choices(random_values, weights=None, cum_weights=None, k=len(random_values),)\n", " \n", " bootstrap_means = np.average(bootstrap_sample,axis = 1) \n", " \n", " ax1 = plt.subplot(111)\n", " plt.axis('off'); \n", " plt.xlim([0, 21]); plt.ylim([0, 30])\n", " xmin = 0; xmax = 20; ymin = 0; ymax = 10; nbin = 12; nybin = 5\n", " tsize = 6; labelpad = 1.2\n", " \n", " xorig = 1; yorig = 17\n", " xpmin = xorig; xpmax = xorig + 3.0; ypmin = yorig; ypmax =yorig + 4.0;\n", " custom_hist(random_values,ax1,xpmin,xpmax,ypmin,ypmax,xmin,xmax,ymin,ymax,nbin,nybin,'Porosity (%)','Frequency',\n", " 'Original Sample Data',tsize,labelpad,plot_avg = False,plot_PDF = False)\n", " \n", " xorig = 5; yorig = 24\n", " lx = xorig; ly = yorig\n", " for ireal in range(0,nreal):\n", " xpmin = lx; xpmax = lx + 3.0; ypmin = ly; ypmax = ly + 4.0;\n", " custom_hist(bootstrap_sample[ireal],ax1,xpmin,xpmax,ypmin,ypmax,xmin,xmax,ymin,ymax,nbin,nybin,'Porosity (%)','Frequency',\n", " 'Bootstrap Realization #' + str(ireal+1),tsize,labelpad,plot_avg = True,plot_PDF = False)\n", " lx = lx + 4\n", " if lx > 18:\n", " lx = xorig; ly = ly - 7\n", " \n", " xorig = 1; yorig = 9\n", " xpmin = xorig; xpmax = xorig + 3.0; ypmin = yorig; ypmax =yorig + 4.0; xmin = 8.0; xmax = 12.0\n", " if nreal < 5: \n", " plot_PDF = False\n", " else: \n", " plot_PDF = True\n", " custom_hist(bootstrap_means,ax1,xpmin,xpmax,ypmin,ypmax,xmin,xmax,ymin,ymax,int(nbin),nybin,'Porosity (%)','Frequency',\n", " 'Bootstrap Averages',tsize,labelpad,plot_avg = False,plot_PDF = plot_PDF)\n", " \n", " plt.annotate('Bootstrap Uncertainty in Mean:',(0,5.8))\n", " plt.annotate('Mean: ' + str(np.round(np.average(bootstrap_means),2)),(1,4.5),ha='left')\n", " plt.annotate('St.Dev.: ' + str(np.round(np.std(bootstrap_means),2)),(0.8,3.5),ha='left')\n", " plt.annotate('P10: ' + str(np.round(np.percentile(bootstrap_means,10),2)),(1.2,2.5),ha='left')\n", " plt.annotate('P90: ' + str(np.round(np.percentile(bootstrap_means,90),2)),(1.2,1.5),ha='left')\n", " \n", " plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.2, wspace=0.2, hspace=0.2); \n", " plt.savefig('Bootstrap.jpg',dpi=600,bbox_inches='tight') \n", " plt.show()\n", "\n", "interactive_plot1 = widgets.interactive_output(make_viz3, {'nreal':nreal,'seed':seed})\n", "interactive_plot1.clear_output(wait = True) # 
reduce flickering by delaying plot updating " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": false }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8c99f782a9cf49baa285f90fbd128ba6", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(Text(value=' Boostrap Demonstration for Uncertainty in the Av…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c006f6b20fd840ea93ad034af6045bee", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(ui_all, interactive_plot1) # display the interactive plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Custom Machine Learning Bagging Visualization\n", "\n", "We apply bootstrap data and model prediction ensemble to demonstrate machine learning bagging.\n", "\n", "* We make a simple data set for demonstration.\n", "* Step through the bagging bootstrap data and model realizations or change the random number seed with the dashboard.\n", "* Select a predictor feature value for the demonstrated prediction" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "seed = 13\n", "np.random.seed(seed=seed)\n", "num_samples = 30\n", "mean = [0.0, 0.0] # Mean vector\n", "cov = [[1.0, 0.95], [0.95, 1.0]] # Covariance matrix\n", "X,y = np.random.multivariate_normal(mean, cov, num_samples).T\n", "X = X * 0.2 + 2.2; y = y * 2.5 + 10.0\n", "\n", "bins = np.linspace(0,1000,1000)\n", "l = widgets.Text(value=' Demonstration of Machine Learning Ensemble Prediction with Bagging Linear Regression, Michael Pyrcz, Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n", "nreal = widgets.IntSlider(min = 1, max = 16, value = 1, description = r'$L$',orientation='horizontal',layout=Layout(width='800px', height='20px'),continuous_update=False)\n", "nreal.style.handle_color = 'gray'\n", "\n", "x_input = widgets.FloatSlider(min = 1.8, max = 2.6, value = 2.3, description = r'$x$',orientation='horizontal',layout=Layout(width='800px', height='20px'),continuous_update=False)\n", "x_input.style.handle_color = 'gray'\n", "\n", "seed = widgets.IntSlider(min = 1, max = 16, value = 1, description = r'$s$',orientation='horizontal',layout=Layout(width='800px', height='20px'),continuous_update=False)\n", "seed.style.handle_color = 'gray'\n", "\n", "ui7 = widgets.HBox([nreal,x_input,seed,],kwargs = {'justify_content':'center'})\n", "ui_all7 = widgets.VBox([l,ui7],)\n", "\n", "def make_viz7(nreal,x_input,seed):\n", " \n", " ax1 = plt.subplot(111)\n", " \n", " xlabel = r'Density ($\\frac{g}{cm^3}$)'; ylabel = 'Porosity (%)'; title = 'Original Sample Data'\n", " add_model = True; show_pred = True\n", " \n", " xorig = 1; yorig = 17; labelpad = 1.0\n", " xpmin = xorig; xpmax = xorig + 3.0; ypmin = yorig; ypmax =yorig + 3.0;\n", " random.seed(seed); np.random.seed(seed=seed)\n", " X_bootstrap = np.zeros((nreal,len(X))); y_bootstrap= np.zeros((nreal,len(X)))\n", " for ireal in range(0,nreal): # bootstrap from a bivariate dataset, sampling pairs\n", " index = random.choices(np.arange(0,len(X),1), weights=None, cum_weights=None, k=len(X),)\n", " X_bootstrap[ireal,:] = X[index]; y_bootstrap[ireal,:] = y[index]\n", " \n", " ax1 = plt.subplot(111)\n", " plt.axis('off'); \n", " plt.xlim([0, 21]); plt.ylim([0, 30])\n", " xmin = 1.8; xmax = 2.6; ymin = 5; ymax = 15; 
nbin = 9; nybin = 5\n", " tsize = 6; labelpad = 1.2\n", " \n", " xorig = 1; yorig = 17\n", " xpmin = xorig; xpmax = xorig + 3.0; ypmin = yorig; ypmax = yorig + 5.0;\n", " \n", " custom_scatter(X,y,ax1,xpmin,xpmax,ypmin,ypmax,xmin,xmax,ymin,ymax,nbin,nybin,r'Density ($\\frac{g}{cm^3}$)','Porosity (%)',\n", " 'Original Sample Data',tsize,labelpad,add_model = False,show_pred = False,x_input = -1.0)\n", " \n", " xorig = 5; yorig = 24\n", " lx = xorig; ly = yorig\n", " est_ensemble = np.zeros(nreal)\n", " for ireal in range(0,nreal):\n", " xpmin = lx; xpmax = lx + 3.0; ypmin = ly; ypmax = ly + 4.0;\n", " est_ensemble[ireal] = custom_scatter(X_bootstrap[ireal,:],y_bootstrap[ireal,:],ax1,xpmin,xpmax,ypmin,ypmax,\n", " xmin,xmax,ymin,ymax,nbin,nybin,r'Density ($\\frac{g}{cm^3}$)','Porosity (%)',\n", " 'Bootstrap Realization #' + str(ireal+1),tsize,labelpad,add_model,show_pred,x_input)\n", " lx = lx + 4\n", " if lx > 18:\n", " lx = xorig; ly = ly - 7\n", " \n", " xorig = 1; yorig = 9\n", " xpmin = xorig; xpmax = xorig + 3.0; ypmin = yorig; ypmax =yorig + 4.0\n", " xmin = 10.6; xmax = 12.0; ymin = 0; ymax = 10; nbin = 11\n", " if nreal < 5: \n", " plot_PDF = False\n", " else: \n", " plot_PDF = True \n", " bag_est = np.average(est_ensemble)\n", " xbmin = np.round((bag_est - 0.5),1); xbmax = np.round((bag_est + 0.5),1);\n", " custom_hist(est_ensemble,ax1,xpmin,xpmax,ypmin,ypmax,xbmin,xbmax,ymin,ymax,nbin,nybin,r'Porosity Estimate (%)','Frequency',\n", " 'Bagging Ensemble Estimate',tsize,labelpad,plot_avg = True,plot_PDF = plot_PDF)\n", " \n", " plt.annotate('Bagging Ensemble Predictions:',(0,5.8))\n", " plt.annotate('Mean: ' + str(np.round(np.average(est_ensemble),2)),(1,4.5),ha='left')\n", " plt.annotate('St.Dev.: ' + str(np.round(np.std(est_ensemble),2)),(0.8,3.5),ha='left')\n", " plt.annotate('P10: ' + str(np.round(np.percentile(est_ensemble,10),2)),(1.2,2.5),ha='left')\n", " plt.annotate('P90: ' + str(np.round(np.percentile(est_ensemble,90),2)),(1.2,1.5),ha='left')\n", " \n", " plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.2, wspace=0.2, hspace=0.2); \n", " plt.savefig('Bootstrap.jpg',dpi=600,bbox_inches='tight') \n", " plt.show() \n", " \n", "interactive_plot7 = widgets.interactive_output(make_viz7, {'nreal':nreal,'x_input':x_input,'seed':seed})\n", "interactive_plot7.clear_output(wait = True) # reduce flickering by delaying plot updating " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": false }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9038787ad2354974ba4379310d1d707d", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(Text(value=' Demonstration of Machine Learning Ensemble Prediction with Bagging Linear R…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "dec47e58b7ba4e3285203450bc1836be", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(ui_all7, interactive_plot7) # display the interactive plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Comments\n", "\n", "This was a basic demonstration bootstrap and machine learning bagging with custom plots in an interactive Python dashboard. 
Much more can be done; I have other demonstrations for modeling workflows with GeostatsPy in the GitHub repository [GeostatsPy_Demos](https://github.com/GeostatsGuy/GeostatsPy_Demos/tree/main).\n", "\n", "Note, the custom plots are rough; more could be done to format everything in a flexible manner, e.g., calculate the best label locations based on the size of the text in plot coordinates, etc.\n", "\n", "I hope this is helpful,\n", "\n", "*Michael*\n", "\n", "#### The Author:\n", "\n", "### Michael Pyrcz, Professor, The University of Texas at Austin \n", "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n", "\n", "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n", "\n", "For more about Michael check out these links:\n", "\n", "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n", "\n", "#### Want to Work Together?\n", "\n", "I hope this content is helpful to those who want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n", "\n", "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n", "\n", "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n", "\n", "* I can be reached at mpyrcz@austin.utexas.edu.\n", "\n", "I'm always happy to discuss,\n", "\n", "*Michael*\n", "\n", "Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin\n", "\n", "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 2 }