\n",
"\n",
"## Subsurface Data Analytics \n",
"\n",
"## Interactive Demonstration of Machine Learning Model Tuning, Generalization & Overfit\n",
"\n",
"#### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
"\n",
"##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
"\n",
"### PGE 383 Exercise: Interactive Predictive Model Complexity Tuning, Generalization & Overfit\n",
"\n",
"Here's a simple workflow, demonstration of predictive machine learning model training and testing for overfit. We use a:\n",
"\n",
"* simple polynomial model\n",
"\n",
"* 1 preditor feature and 1 response feature\n",
"\n",
"for an high interpretability model/ simple illustration.\n",
"\n",
"#### Train / Test Split\n",
"\n",
"The available data is split into training and testing subsets.\n",
"\n",
"* in general 15-30% of the data is withheld from training to apply as testing data\n",
"\n",
"* testing data selection should be fair, the same difficulty of predictions (offset/different from the training dat\n",
"\n",
"#### Machine Learning Model Traing\n",
"\n",
"The training data is applied to train the model parameters such that the model minimizes mismatch with the training data\n",
"\n",
"* it is common to use **mean square error** (known as a **L2 norm**) as a loss function summarizing the model mismatch\n",
"\n",
"* **miminizing the loss function** for simple models an anlytical solution may be available, but for most machine this requires an iterative optimization method to find the best model parameters\n",
"\n",
"This process is repeated over a range of model complexities specified by hyperparameters. \n",
"\n",
"#### Machine Learning Model Tuning\n",
"\n",
"The withheld testing data is retrieved and loss function (usually the **L2 norm** again) is calculated to summarize the error over the testing data\n",
"\n",
"* this is repeated over over the range of specified hypparameters\n",
"\n",
"* the model complexity / hyperparameters that minimize the loss function / error summary in testing is selected\n",
"\n",
"This is known are model hypparameter tuning.\n",
"\n",
"#### Machine Learning Model Overfit\n",
"\n",
"More model complexity/flexibility than can be justified with the available data, data accuracy, frequency and coverage\n",
"\n",
"* Model explains “idiosyncrasies” of the data, capturing data noise/error in the model\n",
"\n",
"* High accuracy in training, but low accuracy in testing / real-world use away from training data cases – poor ability of the model to generalize\n",
"\n",
"\n",
"#### Workflow Goals\n",
"\n",
"Learn the basics of machine learning training, tuning for model generalization while avoiding model overfit.\n",
"\n",
"This includes:\n",
"\n",
"* Demonstrate model training and tuning by hand with an interactive exercies\n",
"\n",
"* Demonstrate the role of data error in leading to model overfit with complicated models\n",
"\n",
"#### Getting Started\n",
"\n",
"You will need to copy the following data files to your working directory. They are available [here](https://github.com/GeostatsGuy/GeoDataSets):\n",
"\n",
"* Tabular data - [Stochastic_1D_por_perm_demo.csv](https://github.com/GeostatsGuy/GeoDataSets/blob/master/Stochastic_1D_por_perm_demo.csv)\n",
"* Tabular data - [Random_Parabola.csv](https://github.com/GeostatsGuy/GeoDataSets/blob/master/Random_Parabola.csv)\n",
"\n",
"These datasets are available in the folder: https://github.com/GeostatsGuy/GeoDataSets.\n",
"\n",
"\n",
"#### Import Required Packages\n",
"\n",
"We will also need some standard packages. These should have been installed with Anaconda 3."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import geostatspy.GSLIB as GSLIB # GSLIB utilies, visualization and wrapper\n",
"import geostatspy.geostats as geostats # GSLIB methods convert to Python "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will also need some standard packages. These should have been installed with Anaconda 3."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import os # to set current working directory \n",
"import sys # supress output to screen for interactive variogram modeling\n",
"import io\n",
"import numpy as np # arrays and matrix math\n",
"import pandas as pd # DataFrames\n",
"import matplotlib.pyplot as plt # plotting\n",
"from sklearn.model_selection import train_test_split # train and test split\n",
"from sklearn.metrics import mean_squared_error # model error calculation\n",
"import scipy # kernel density estimator for PDF plot\n",
"from matplotlib.pyplot import cm # color maps\n",
"from ipywidgets import interactive # widgets and interactivity\n",
"from ipywidgets import widgets \n",
"from ipywidgets import Layout\n",
"from ipywidgets import Label\n",
"from ipywidgets import VBox, HBox\n",
"import warnings\n",
"warnings.filterwarnings('ignore') # supress warnings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing 'python -m pip install [package-name]'. More assistance is available with the respective package docs. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Build the Interactive Dashboard\n",
"\n",
"The following code:\n",
"\n",
"* makes a random dataset, change the random number seed and number of data for a different dataset\n",
"* loops over polygonal fits of 1st-12th order, loops over mulitple realizations and calculates the average MSE and P10 and P90 vs. order\n",
"* calculates a specific model example\n",
"* plots the example model with training and testing data, the error distributions and the MSE envelopes vs. complexity"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"l = widgets.Text(value=' Machine Learning Overfit/Generalization Demo, Prof. Michael Pyrcz, The University of Texas at Austin',\n",
" layout=Layout(width='950px', height='30px'))\n",
"\n",
"n = widgets.IntSlider(min=15, max = 80, value=30, step = 1, description = 'n',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n",
"split = widgets.FloatSlider(min=0.05, max = .95, value=0.20, step = 0.05, description = 'Test %',orientation='horizontal',style = {'description_width': 'initial'}, continuous_update=False)\n",
"std = widgets.FloatSlider(min=0, max = 50, value=0, step = 1.0, description = 'Noise StDev',orientation='horizontal',style = {'description_width': 'initial'}, continuous_update=False)\n",
"degree = widgets.IntSlider(min=1, max = 12, value=1, step = 1, description = 'Model Order',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n",
"\n",
"ui = widgets.HBox([n,split,std,degree],)\n",
"ui2 = widgets.VBox([l,ui],)\n",
"\n",
"def run_plot(n,split,std,degree):\n",
" seed = 13014; nreal = 20\n",
" np.random.seed(seed) # seed the random number generator\n",
" \n",
" # make the datastet\n",
" X_seq = np.linspace(0,20,100)\n",
" X = np.random.rand(n)*20\n",
" y = X*X + 50.0 # fit a parabola\n",
" y = y + np.random.normal(loc = 0.0,scale=std,size=n) # add noise\n",
" \n",
" # calculate the MSE train and test over a range of complexity over multiple realizations of test/train split\n",
" cdegrees = np.arange(1,13)\n",
" cmse_train = np.zeros([len(cdegrees),nreal]); cmse_test = np.zeros([len(cdegrees),nreal])\n",
" for j in range(0,nreal):\n",
" for i, cdegree in enumerate(cdegrees):\n",
" X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split, random_state=seed+j)\n",
" ccoefs = np.polyfit(X_train,y_train,cdegree)\n",
" y_pred_train = np.polyval(ccoefs, X_train)\n",
" y_pred_test = np.polyval(ccoefs, X_test)\n",
" cmse_train[i,j] = mean_squared_error(y_train, y_pred_train)\n",
" cmse_test[i,j] = mean_squared_error(y_test, y_pred_test)\n",
" # summarize over the realizations\n",
" cmse_train_avg = cmse_train.mean(axis=1)\n",
" cmse_test_avg = cmse_test.mean(axis=1)\n",
" cmse_train_high = np.percentile(cmse_train,q=90,axis=1)\n",
" cmse_train_low = np.percentile(cmse_train,q=10,axis=1) \n",
" cmse_test_high = np.percentile(cmse_test,q=90,axis=1)\n",
" cmse_test_low = np.percentile(cmse_test,q=10,axis=1)\n",
" \n",
"# cmse_train_high = np.amax(cmse_train,axis=1)\n",
"# cmse_train_low = np.amin(cmse_train,axis=1) \n",
"# cmse_test_high = np.amax(cmse_test,axis=1)\n",
"# cmse_test_low = np.amin(cmse_test,axis=1)\n",
" \n",
" # build the one model example to show\n",
" X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split, random_state=seed)\n",
" coefs = np.polyfit(X_train,y_train,degree) \n",
" \n",
" # calculate error\n",
" error_seq = np.linspace(-100.0,100.0,100)\n",
" error_train = np.polyval(coefs, X_train) - y_train\n",
" #print(np.polyval(coefs, X_train))\n",
" #print('truth')\n",
" #print(X_train)\n",
" error_test = np.polyval(coefs, X_test) - y_test\n",
" \n",
" mse_train = mean_squared_error(y_train, np.polyval(coefs, X_train))\n",
" mse_test = mean_squared_error(y_test, np.polyval(coefs, X_test))\n",
" \n",
" error_train_std = np.std(error_train)\n",
" error_test_std = np.std(error_test)\n",
" \n",
" kde_error_train = scipy.stats.gaussian_kde(error_train)\n",
" kde_error_test = scipy.stats.gaussian_kde(error_test)\n",
" \n",
" plt.subplot(131)\n",
" plt.plot(X_seq, np.polyval(coefs, X_seq), color=\"black\")\n",
" plt.title(\"Polynomial Model of Degree = \"+str(degree))\n",
" plt.scatter(X_train,y_train,c =\"red\",alpha=0.2,edgecolors=\"black\")\n",
" plt.scatter(X_test,y_test,c =\"blue\",alpha=0.2,edgecolors=\"black\")\n",
" plt.ylim([0,500]); plt.xlim([0,20]); plt.grid()\n",
" plt.xlabel('Porosity (%)'); plt.ylabel('Permeability (mD)')\n",
" \n",
" plt.subplot(132)\n",
" plt.hist(error_train, facecolor='red',bins=np.linspace(-50.0,50.0,10),alpha=0.2,density=True,edgecolor='black',label='Train')\n",
" plt.hist(error_test, facecolor='blue',bins=np.linspace(-50.0,50.0,10),alpha=0.2,density=True,edgecolor='black',label='Test')\n",
" #plt.plot(error_seq,kde_error_train(error_seq),lw=2,label='Train',c='red')\n",
" #plt.plot(error_seq,kde_error_test(error_seq),lw=2,label='Test',c='blue') \n",
" plt.xlim([-55.0,55.0]); plt.ylim([0,0.1])\n",
" plt.xlabel('Model Error'); plt.ylabel('Frequency'); plt.title('Training and Testing Error, Model of Degree = '+str((degree)))\n",
" plt.legend(loc='upper left')\n",
" plt.grid(True)\n",
" \n",
" plt.subplot(133); ax = plt.gca()\n",
" plt.plot(cdegrees,cmse_train_avg,lw=2,label='Train',c='red')\n",
" ax.fill_between(cdegrees,cmse_train_high,cmse_train_low,facecolor='red',alpha=0.05)\n",
" \n",
" plt.plot(cdegrees,cmse_test_avg,lw=2,label='Test',c='blue') \n",
" ax.fill_between(cdegrees,cmse_test_high,cmse_test_low,facecolor='blue',alpha=0.05)\n",
" plt.xlim([1,12]); plt.yscale('log'); plt.ylim([0.0000001,10000])\n",
" plt.xlabel('Complexity - Polynomial Order'); plt.ylabel('Mean Square Error'); plt.title('Training and Testing Error vs. Model Complexity')\n",
" plt.legend(loc='upper left')\n",
" plt.grid(True)\n",
" \n",
" plt.plot([degree,degree],[.0000001,100000],c = 'black',linewidth=3,alpha = 0.8)\n",
" \n",
" plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.6, wspace=0.2, hspace=0.3)\n",
" plt.show()\n",
" \n",
"# connect the function to make the samples and plot to the widgets \n",
"interactive_plot = widgets.interactive_output(run_plot, {'n':n,'split':split,'std':std,'degree':degree})\n",
"interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interactive Predictive Machine Learning Overfitting Demonstation \n",
"\n",
"#### Michael Pyrcz, Associate Professor, The University of Texas at Austin \n",
"\n",
"Change the number of sample data, train/test split and the data noise and observe overfit! Change the model order to observe a specific model example.\n",
"\n",
"### The Inputs\n",
"\n",
"* **n** - number of data\n",
"* **Test %** - percentage of sample data withheld as testing data\n",
"* **Noise StDev** - standard deviation of random Gaussian error added to the data\n",
"* **Model Order** - the order of the "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VBox(children=(Text(value=' Machine Learning Overfit/Generalization Demo…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(ui2, interactive_plot) # display the interactive plot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Comments\n",
"\n",
"This was a basic demonstration of machine learning model training and tuning, model generalization and complexity. I have many other demonstrations and even basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy. \n",
" \n",
"#### The Author:\n",
"\n",
"### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
"*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n",
"\n",
"With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n",
"\n",
"For more about Michael check out these links:\n",
"\n",
"#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
"\n",
"#### Want to Work Together?\n",
"\n",
"I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n",
"\n",
"* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n",
"\n",
"* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n",
"\n",
"* I can be reached at mpyrcz@austin.utexas.edu.\n",
"\n",
"I'm always happy to discuss,\n",
"\n",
"*Michael*\n",
"\n",
"Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n",
"\n",
"#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) \n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}