{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "

\n", " \n", "\n", "

\n", "\n", "## Interactive Demonstration of Ridge Regression for an Introduction to Hyperparameter Tuning\n", "\n", "#### Michael J. Pyrcz, Professor, The University of Texas at Austin \n", "\n", "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n", "\n", "### The Interactive Workflow\n", "\n", "In the Fall of 2019 my students requested a demonstration to show the value of ridge regression. \n", "\n", "* I wrote this interactive demonstration to show cases in which the use of regularization coefficient, a hyperparameter, that reduces the model flexibilty / sensivity to training data (reduces model variance) improves the prediction accuracy. \n", "\n", "* I have a lecture on [Ridge Regression](https://www.youtube.com/watch?v=pMGO40yXZ5Y&list=PLG19vXLQHvSC2ZKFIkgVpI9fCjkN38kwf&index=22) as part of my [Machine Learning](https://www.youtube.com/playlist?list=PLG19vXLQHvSC2ZKFIkgVpI9fCjkN38kwf) course. Note, all recorded lectures, interactive and well-documented workflow demononstrations are available on my GitHub repository [GeostatsGuy's Python Numerical Demos](https://github.com/GeostatsGuy/PythonNumericalDemos).\n", "\n", "Here's some basic details about predictive machine learning ridge regression models, let's start with linear regression first: \n", "\n", "### Linear Regression\n", "\n", "Linear regression for prediction. I have a [Linear Regression Lecture](https://www.youtube.com/watch?v=0fzbyhWiP84&list=PLG19vXLQHvSC2ZKFIkgVpI9fCjkN38kwf&index=21) with linked interactive and well-documented demonstration. Here are some key aspects of linear regression:\n", "\n", "**Parametric Model**\n", "\n", "* the fit model is a simple weighted linear additive model based on all the available features, $x_1,\\ldots,x_m$.\n", "\n", "* the parametric model takes the form of: \n", "\n", "\\begin{equation}\n", "y = \\sum_{\\alpha = 1}^m b_{\\alpha} x_{\\alpha} + b_0\n", "\\end{equation}\n", "\n", "**Least Squares**\n", "\n", "* least squares optimization is applied to select the model parameters, $b_1,\\ldots,b_m,b_0$ \n", "\n", "* we minize the error, residual sum of squares (RSS) over the training data: \n", "\n", "\\begin{equation}\n", "RSS = \\sum_{i=1}^n (y_i - (\\sum_{\\alpha = 1}^m b_{\\alpha} x_{\\alpha} + b_0))^2\n", "\\end{equation}\n", "\n", "* this could be simplified as the sum of square error over the training data, \n", "\n", "\\begin{equation}\n", "\\sum_{i=1}^n (\\Delta y_i)^2\n", "\\end{equation}\n", "\n", "**Assumptions**\n", "\n", "* **Error-free** - predictor variables are error free, not random variables \n", "* **Linearity** - response is linear combination of feature(s)\n", "* **Constant Variance** - error in response is constant over predictor(s) value\n", "* **Independence of Error** - error in response are uncorrelated with each other\n", "* **No multicollinearity** - none of the features are redundant with other features \n", "\n", "#### Other Linear Regression Resources\n", "\n", "This is a tutorial / demonstration of **Linear Regression**. 
In $Python$, the $SciPy$ package, specifically the $Stats$ functions (https://docs.scipy.org/doc/scipy/reference/stats.html) provide excellent tools for efficient use of statistics. \n", "I have previously provided this example in R and posted it on GitHub:\n", "\n", "* [Linear Regression in R](https://github.com/GeostatsGuy/geostatsr/blob/master/linear_regression_demo_v2.R)\n", "* [Linear Regression in R markdown](https://github.com/GeostatsGuy/geostatsr/blob/master/linear_regression_demo_v2.Rmd) with docs \n", "* [Linear Regression in R document](https://github.com/GeostatsGuy/geostatsr/blob/master/linear_regression_demo_v2.html) knit as an HTML document\n", "\n", "and also in Excel:\n", "\n", "* [Linear Regression in Excel](https://github.com/GeostatsGuy/ExcelNumericalDemos/blob/master/Linear_Regression_Demo_v2.xlsx)\n", "\n", "#### Ridge Regression\n", "\n", "With ridge regression we add a hyperparameter, $\\lambda$, to our minimization, with a shrinkage penalty term, $\\sum_{j=1}^m b_{\\alpha}^2$.\n", "\n", "\\begin{equation}\n", "\\sum_{i=1}^n (y_i - (\\sum_{\\alpha = 1}^m b_{\\alpha} x_{\\alpha} + b_0))^2 + \\lambda \\sum_{j=1}^m b_{\\alpha}^2\n", "\\end{equation}\n", "\n", "As a result ridge regression has 2 criteria:\n", "\n", "* set the model parameters to minimize the error with training data\n", "\n", "* shrink the estimates of the slope parameters towards zero\n", "\n", "Note: the intercept is not affected by lambda.\n", "\n", "The $\\lambda$ is a hyperparameter that controls the degree of fit of the model and may be related to the model variance and bias trade-off.\n", "\n", "* for $\\lambda \\rightarrow 0$ the solution approaches linear regression, there is no bias (relative to a linear model fit), but the variance is high\n", "\n", "* as $\\lambda$ increases the model variance decreases and the model bias increases\n", "\n", "* for $\\lambda \\rightarrow \\infty$ the coefficients approach 0.0 and the model approaches the global mean\n", "\n", "#### Train / Test Split\n", "\n", "The available data is split into training and testing subsets.\n", "\n", "* in general 15-30% of the data is withheld from training to apply as testing data\n", "\n", "* testing data selection should be fair, the same difficulty of predictions (offset/different from the training dat\n", "\n", "#### Machine Learning Model Training\n", "\n", "The training data is applied to train the model parameters such that the model minimizes mismatch with the training data\n", "\n", "* it is common to use **mean square error** (known as a **L2 norm**) as a loss function summarizing the model mismatch\n", "\n", "* **miminizing the loss function** for simple models an anlytical solution may be available, but for most machine this requires an iterative optimization method to find the best model parameters\n", "\n", "This process is repeated over a range of model complexities specified by hyperparameters. 
\n", "\n", "#### Machine Learning Model Tuning\n", "\n", "The withheld testing data is retrieved and loss function (usually the **L2 norm** again) is calculated to summarize the error over the testing data\n", "\n", "* this is repeated over over the range of specified hypparameters\n", "\n", "* the model complexity / hyperparameters that minimize the loss function / error summary in testing is selected\n", "\n", "This is known are model hypparameter tuning.\n", "\n", "#### Machine Learning Model Overfit\n", "\n", "More model complexity/flexibility than can be justified with the available data, data accuracy, frequency and coverage\n", "\n", "* Model explains “idiosyncrasies” of the data, capturing data noise/error in the model\n", "\n", "* High accuracy in training, but low accuracy in testing / real-world use away from training data cases – poor ability of the model to generalize\n", "\n", "#### The Interactive Demonstrations\n", "\n", "Here's a simple workflow, demonstration of predictive machine learning model training and testing for overfit. We use a:\n", "\n", "* simple polynomial model\n", "\n", "* 1 preditor feature and 1 response feature\n", "\n", "for an high interpretability model/ simple illustration.\n", "\n", "#### Workflow Goals\n", "\n", "Learn the basics of machine learning training, tuning for model generalization while avoiding model overfit. We use the very simple case of ridge regression that introduces a hyperparameter to linear regression.\n", "\n", "We consider 2 examples:\n", "\n", "1. **A linear model + noise** to make a random dataset with 1 predictor feature and 1 response feature. You can make the model, by-hand set the lambda hyperparameter and observe the impact. The code actually runs many lambda values so you can explore accurate and inacturate (over and underfit models).\n", "\n", "2. **A loaded multivariate subsurface dataset** with random resampling to explore uncertainty in your result. Like above you can set the lambda hyperparameter by-hand and observe the model accuracy in train and test over a range model flexibility. In this case, since the model is quite multidimensional, you can see the cross validation plot instead of the visualization of the model.\n", "\n", "#### Getting Started\n", "\n", "You will need to copy the following data file to your working directory. They are available [here](https://github.com/GeostatsGuy/GeoDataSets):\n", "\n", "* [unconv_MV.csv](https://github.com/GeostatsGuy/GeoDataSets/blob/master/unconv_MV.csv)\n", "\n", "The dataset is available in this repository, https://github.com/GeostatsGuy/GeoDataSets. You have two options: \n", "\n", "* download this file to your working directory\n", "\n", "* load the data directly from my GitHub repository\n", "\n", "#### Import Required Packages\n", "\n", "We will also need some standard packages. These should have been installed with Anaconda 3." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline \n", "supress_warnings = False\n", "import os # to set current working directory \n", "import sys # supress output to screen\n", "import io\n", "import numpy as np # arrays and matrix math\n", "import pandas as pd # DataFrames\n", "import matplotlib.pyplot as plt # plotting\n", "from sklearn.model_selection import train_test_split # train and test split\n", "from sklearn.metrics import mean_squared_error # model error calculation\n", "from sklearn.linear_model import Ridge # ridge regression\n", "import scipy # kernel density estimator for PDF plot\n", "from matplotlib.pyplot import cm # color maps\n", "from ipywidgets import interactive # widgets and interactivity\n", "from ipywidgets import widgets \n", "from ipywidgets import Layout\n", "from ipywidgets import Label\n", "from ipywidgets import VBox, HBox\n", "cmap = plt.cm.inferno # default color bar, no bias and friendly for color vision defeciency\n", "plt.rc('axes', axisbelow=True) # grid behind plotting elements\n", "if supress_warnings == True:\n", " import warnings # supress any warnings for this demonstration\n", " warnings.filterwarnings('ignore') " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing 'python -m pip install [package-name]'. More assistance is available with the respective package docs. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Demonstration 1, Simple Linear Model + Noise\n", "\n", "Let's build the code and dashboard in one block for concisenss. I have other examples to cover basics if you need that.\n", "\n", "#### Build the Interactive Dashboard\n", "\n", "The following code:\n", "\n", "* makes a random dataset, change the random number seed and number of data for a different dataset\n", "* loops over polygonal fits of 1st-12th order, loops over mulitple realizations and calculates the average MSE and P10 and P90 vs. order\n", "* calculates a specific model example\n", "* plots the example model with training and testing data, the error distributions and the MSE envelopes vs. complexity" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "text_trap = io.StringIO()\n", "sys.stdout = text_trap\n", "\n", "l = widgets.Text(value=' Machine Learning Ridge Regression Hyperparameter Tuning Demo, Prof. 
Michael Pyrcz, The University of Texas at Austin',\n", " layout=Layout(width='950px', height='30px'))\n", "\n", "n = widgets.IntSlider(min=5, max = 200, value=30, step = 1, description = 'n',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n", "split = widgets.FloatSlider(min=0.05, max = .95, value=0.20, step = 0.05, description = 'Test %',orientation='horizontal',style = {'description_width': 'initial'}, continuous_update=False)\n", "std = widgets.FloatSlider(min=0, max = 200, value=0, step = 1.0, description = 'Noise StDev',orientation='horizontal',style = {'description_width': 'initial'}, continuous_update=False)\n", "lam = widgets.FloatLogSlider(min=-5.0, max = 5.0, value=1,base=10,step = 0.2,description = 'Regularization',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n", "\n", "ui = widgets.HBox([n,split,std,lam],)\n", "ui2 = widgets.VBox([l,ui],)\n", "\n", "def run_plot(n,split,std,lam):\n", " seed = 13014; nreal = 20; slope = 20.0; intercept = 50.0\n", " np.random.seed(seed) # seed the random number generator\n", " lam_min = 1.0E-5;lam_max = 1.0e5\n", " \n", " # make the datastet\n", " X_seq = np.linspace(0.0,100,100)\n", " X = np.random.rand(n)*20\n", " y = X*slope + intercept # assume linear + error\n", " y = y + np.random.normal(loc = 0.0,scale=std,size=n) # add noise\n", " \n", " # calculate the MSE train and test over a range of complexity over multiple realizations of test/train split\n", " clam_list = np.logspace(-5,5,20,base=10.0)\n", " #print(clam_list)\n", " cmse_train = np.zeros([len(clam_list),nreal]); cmse_test = np.zeros([len(clam_list),nreal])\n", " cmse_truth = np.zeros([len(clam_list),nreal])\n", " for j in range(0,nreal):\n", " for i, clam in enumerate(clam_list):\n", " #print('lam' + str(clam))\n", " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split, random_state=seed+j)\n", " n_train = len(X_train); n_test = len(X_test)\n", " ridge_reg = Ridge(alpha=clam)\n", " #print(X_train.reshape(n_train,1))\n", " ridge_reg.fit(X_train.reshape(n_train,1),y_train)\n", " #print('here')\n", " y_pred_train = ridge_reg.predict(X_train.reshape(n_train,1))\n", " y_pred_test = ridge_reg.predict(X_test.reshape(n_test,1))\n", " y_pred_truth = ridge_reg.predict(X_seq.reshape(len(X_seq),1))\n", " y_truth = X_seq*slope + intercept\n", " cmse_train[i,j] = mean_squared_error(y_train, y_pred_train)\n", " cmse_test[i,j] = mean_squared_error(y_test, y_pred_test)\n", " cmse_truth[i,j] = mean_squared_error(y_truth, y_pred_truth)\n", " # summarize over the realizations\n", " cmse_train_avg = cmse_train.mean(axis=1)\n", " cmse_test_avg = cmse_test.mean(axis=1)\n", " cmse_truth_avg = cmse_truth.mean(axis=1)\n", " cmse_train_high = np.percentile(cmse_train,q=90,axis=1)\n", " cmse_train_low = np.percentile(cmse_train,q=10,axis=1) \n", " cmse_test_high = np.percentile(cmse_test,q=90,axis=1)\n", " cmse_test_low = np.percentile(cmse_test,q=10,axis=1)\n", " cmse_truth_high = np.percentile(cmse_truth,q=90,axis=1)\n", " cmse_truth_low = np.percentile(cmse_truth,q=10,axis=1)\n", " \n", "# cmse_train_high = np.amax(cmse_train,axis=1)\n", "# cmse_train_low = np.amin(cmse_train,axis=1) \n", "# cmse_test_high = np.amax(cmse_test,axis=1)\n", "# cmse_test_low = np.amin(cmse_test,axis=1)\n", " \n", " # build the one model example to show\n", " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split, random_state=seed)\n", " n_train = len(X_train); n_test = len(X_test)\n", " ridge_reg 
= Ridge(alpha=lam)\n", " ridge_reg.fit(X_train.reshape(n_train,1),y_train)\n", " y_pred_train = ridge_reg.predict(X_train.reshape(n_train,1))\n", " y_pred_test = ridge_reg.predict(X_test.reshape(n_test,1))\n", " y_pred_truth = ridge_reg.predict(X_seq.reshape(len(X_seq),1))\n", " \n", " # calculate error\n", " error_seq = np.linspace(-100.0,100.0,100)\n", " error_train = y_pred_train - y_train\n", " error_test = y_pred_test - y_test\n", " error_truth = X_seq*slope + intercept - y_truth\n", " mse_train = mean_squared_error(y_train,y_pred_train)\n", " mse_test = mean_squared_error(y_test,y_pred_test)\n", " mse_truth = mean_squared_error(y_truth,y_pred_truth)\n", " \n", " error_train_std = np.std(error_train)\n", " error_test_std = np.std(error_test)\n", " error_truth_std = np.std(error_truth)\n", " \n", " #kde_error_train = scipy.stats.gaussian_kde(error_train)\n", " #kde_error_test = scipy.stats.gaussian_kde(error_test)\n", " \n", " plt.subplot(131)\n", " plt.plot(X_seq, ridge_reg.predict(X_seq.reshape(len(X_seq),1)), color=\"black\")\n", " plt.plot(X_seq, X_seq*slope+intercept, color=\"black\",alpha = 0.2)\n", " plt.title(\"Ridge Regression Model, Lambda = \"+str(round(lam,2)))\n", " plt.scatter(X_train,y_train,c =\"red\",alpha=0.2,edgecolors=\"black\")\n", " plt.scatter(X_test,y_test,c =\"blue\",alpha=0.2,edgecolors=\"black\")\n", " plt.ylim([0,500]); plt.xlim([0,20]); plt.grid()\n", " plt.xlabel('Porosity (%)'); plt.ylabel('Permeability (mD)')\n", " \n", " plt.subplot(132)\n", " plt.hist(error_train, facecolor='red',bins=np.linspace(-200.0,200.0,10),alpha=0.2,density=True,edgecolor='black',label='Train')\n", " plt.hist(error_test, facecolor='blue',bins=np.linspace(-200.0,200.0,10),alpha=0.2,density=True,edgecolor='black',label='Test')\n", " #plt.plot(error_seq,kde_error_train(error_seq),lw=2,label='Train',c='red')\n", " #plt.plot(error_seq,kde_error_test(error_seq),lw=2,label='Test',c='blue') \n", " #plt.xlim([-55.0,55.0]); \n", " plt.ylim([0,0.1])\n", " plt.xlabel('Model Error'); plt.ylabel('Frequency'); plt.title('Training and Testing Error, Lambda = '+str((round(lam,2))))\n", " plt.legend(loc='upper left')\n", " plt.grid(True)\n", " \n", " plt.subplot(133); ax = plt.gca()\n", " plt.plot(clam_list,cmse_train_avg,lw=2,label='Train',c='red')\n", " ax.fill_between(clam_list,cmse_train_high,cmse_train_low,facecolor='red',alpha=0.05)\n", " \n", " plt.plot(clam_list,cmse_test_avg,lw=2,label='Test',c='blue') \n", " ax.fill_between(clam_list,cmse_test_high,cmse_test_low,facecolor='blue',alpha=0.05)\n", " \n", " plt.plot(clam_list,cmse_truth_avg,lw=2,label='Truth',c='black') \n", " ax.fill_between(clam_list,cmse_truth_high,cmse_truth_low,facecolor='black',alpha=0.05)\n", " \n", " plt.xscale('log'); plt.xlim([lam_max,lam_min]); plt.yscale('log'); #plt.ylim([10,10000])\n", " plt.xlabel('Model Flexibility by Regularization, Lambda'); plt.ylabel('Mean Square Error'); plt.title('Training and Testing Error vs. 
Model Complexity')\n", " plt.legend(loc='upper left')\n", " plt.grid(True)\n", " \n", " plt.plot([lam,lam],[1.0e-5,1.0e5],c = 'black',linewidth=3,alpha = 0.8)\n", " \n", " plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.6, wspace=0.2, hspace=0.3)\n", " plt.show()\n", " \n", "# connect the function to make the samples and plot to the widgets \n", "interactive_plot = widgets.interactive_output(run_plot, {'n':n,'split':split,'std':std,'lam':lam})\n", "interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interactive Machine Learning Ridge Regression Hyperparameter Tuning Demonstration \n", "\n", "#### Michael Pyrcz, Associate Professor, University of Texas at Austin \n", "\n", "Change the number of sample data, train/test split and the data noise and observe hyperparameter tuning! Change the regularization hyperparameter, lambda, to observe a specific model example. Based on a simple, linear truth model + noise.\n", "\n", "### The Inputs\n", "\n", "* **n** - number of data\n", "* **Test %** - percentage of sample data withheld as testing data\n", "* **Noise StDev** - standard deviation of random Gaussian error added to the data\n", "* **Regularization Hyperparameter** - the lambda coefficient, the weight in the loss function on the magnitude of the model parameters " ] },
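 { "cell_type": "markdown", "metadata": {}, "source": [ "As an aside, and not part of the dashboard above: outside of an interactive demonstration you can also let scikit-learn tune lambda by cross validation with RidgeCV. A minimal sketch, where X_train and y_train are hypothetical stand-ins for your own training data:\n", "\n", "```python\n", "import numpy as np\n", "from sklearn.linear_model import RidgeCV\n", "\n", "# search the same lambda grid as the dashboard and keep the best value by 5-fold cross validation\n", "cv_fit = RidgeCV(alphas=np.logspace(-5, 5, 20), cv=5).fit(X_train, y_train)\n", "print('cross validated lambda =', cv_fit.alpha_)\n", "```" ] },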
 { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c15c5526c26b4b7ab01a6314270f122f", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(Text(value=' Machine Learning Ridge Regression Hyperparameter Tuning…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "750225d3a93e4b719091eca0ff8ab9ea", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output(outputs=({'output_type': 'display_data', 'data': {'text/plain': '', 'i…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(ui2, interactive_plot) # display the interactive plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Observations\n", "\n", "What did we learn? \n", "\n", "* regularization reduces the model sensitivity to the data; this results in reduced model variance\n", "* regularization is helpful with sparse, noisy data\n", "\n", "When the noise is removed from the dataset, increasing regularization (reducing model flexibility and sensitivity) reduces the quality of the model. \n", "\n", "Let's check out a more extreme example to really see this effect. \n", "\n", "* we can enhance data sparsity by increasing the problem dimensionality\n", "* we will use more realistic, noisy data\n", "\n", "### Demonstration 2: Multivariate, Realistic Dataset\n", "\n", "Let's load a multivariate dataset for a more realistic demonstration. \n", "\n", "* the combination of too few data samples and high dimensionality should better demonstrate the benefit of regularization\n", "\n", "You may want to copy the data file to your working directory. It is available here:\n", "\n", "* Tabular data - [unconv_MV.csv](https://github.com/GeostatsGuy/GeoDataSets/blob/master/unconv_MV.csv)\n", "\n", "or you can use the code below to load the data directly from my GitHub [GeoDataSets](https://github.com/GeostatsGuy/GeoDataSets) repository.\n", "\n", "#### Set the working directory\n", "\n", "I always like to do this so I don't lose files and to simplify subsequent reads and writes (avoiding the full file path each time). \n", "\n", "* Also, in this case make sure to place the required (see below) data file in this working directory. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "#os.chdir(\"d:\\PGE337\") # set the working directory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load the data\n", "\n", "The following loads the data into a DataFrame and then previews the first 5 samples." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
WellIndexPorLogPermAIBrittleTOCVRProduction
0115.911.673.0614.051.361.85177.381958
1215.341.652.6031.881.371.791479.767778
2320.452.023.1363.671.792.534421.221583
3411.951.143.9058.810.402.031488.317629
4519.531.832.5743.751.402.115261.094919
\n", "
" ], "text/plain": [ " WellIndex Por LogPerm AI Brittle TOC VR Production\n", "0 1 15.91 1.67 3.06 14.05 1.36 1.85 177.381958\n", "1 2 15.34 1.65 2.60 31.88 1.37 1.79 1479.767778\n", "2 3 20.45 2.02 3.13 63.67 1.79 2.53 4421.221583\n", "3 4 11.95 1.14 3.90 58.81 0.40 2.03 1488.317629\n", "4 5 19.53 1.83 2.57 43.75 1.40 2.11 5261.094919" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#df_mv = pd.read_csv(\"unconv_MV.csv\") \n", "df_mv = pd.read_csv(r\"https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/unconv_MV.csv\") # load from Prof. Pyrcz's GitHub \n", "df_mv.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Check the data\n", "\n", "We should do much more data investigation. I do this in most of my workflows, but for this one, let's just calculate the summary statistics and call it a day! " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countmeanstdmin25%50%75%max
WellIndex1000.0500.500000288.8194361.000000250.75000500.50000750.2500001000.00000
Por1000.014.9504603.0296345.40000012.8575014.9850017.08000024.65000
LogPerm1000.01.3988800.4059660.1200001.130001.390001.6800002.58000
AI1000.02.9826100.5776290.9600002.577503.010003.3600004.70000
Brittle1000.049.71948015.077006-10.50000039.7225049.6800059.17000093.47000
TOC1000.01.0038100.504978-0.2600000.640000.995001.3600002.71000
VR1000.01.9911700.3081940.9000001.810002.000002.1725002.90000
Production1000.02247.2958091464.2563122.7135351191.369561976.487823023.59421412568.64413
\n", "
" ], "text/plain": [ " count mean std min 25% \\\n", "WellIndex 1000.0 500.500000 288.819436 1.000000 250.75000 \n", "Por 1000.0 14.950460 3.029634 5.400000 12.85750 \n", "LogPerm 1000.0 1.398880 0.405966 0.120000 1.13000 \n", "AI 1000.0 2.982610 0.577629 0.960000 2.57750 \n", "Brittle 1000.0 49.719480 15.077006 -10.500000 39.72250 \n", "TOC 1000.0 1.003810 0.504978 -0.260000 0.64000 \n", "VR 1000.0 1.991170 0.308194 0.900000 1.81000 \n", "Production 1000.0 2247.295809 1464.256312 2.713535 1191.36956 \n", "\n", " 50% 75% max \n", "WellIndex 500.50000 750.250000 1000.00000 \n", "Por 14.98500 17.080000 24.65000 \n", "LogPerm 1.39000 1.680000 2.58000 \n", "AI 3.01000 3.360000 4.70000 \n", "Brittle 49.68000 59.170000 93.47000 \n", "TOC 0.99500 1.360000 2.71000 \n", "VR 2.00000 2.172500 2.90000 \n", "Production 1976.48782 3023.594214 12568.64413 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_mv.describe().transpose()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Build the Interactive Dashboard\n", "\n", "The following code:\n", "\n", "* makes a random sample of the dataset\n", "* loops over ridge regression fits varying the lambda between 1.0e-5 to 1.0e5, loops over mulitple realizations and calculates the average MSE and P10 and P90 vs. lambda.\n", "* calculates specified model cross validation\n", "* plots the example model with training and testing data, the error distributions and the MSE envelopes vs. complexity/inverse of lambda" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "text_trap = io.StringIO()\n", "sys.stdout = text_trap\n", "\n", "l = widgets.Text(value=' Machine Learning Ridge Regression Hyperparameter Tuning Demo, Prof. 
Michael Pyrcz, The University of Texas at Austin',\n", " layout=Layout(width='950px', height='30px'))\n", "\n", "n = widgets.IntSlider(min=5, max = 200, value=30, step = 1, description = 'n',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n", "split = widgets.FloatSlider(min=0.05, max = .95, value=0.20, step = 0.05, description = 'Test %',orientation='horizontal',style = {'description_width': 'initial'}, continuous_update=False)\n", "lam = widgets.FloatLogSlider(min=-5.0, max = 5.0, value=1,base=10,step = 0.2,description = 'Regularization',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n", "\n", "ui = widgets.HBox([n,split,lam],)\n", "ui2 = widgets.VBox([l,ui],)\n", "\n", "def run_plot(n,split,lam):\n", " seed = 13014; nreal = 20; slope = 20.0; intercept = 50.0\n", " np.random.seed(seed) # seed the random number generator\n", " lam_min = 1.0E-5;lam_max = 1.0e5\n", " \n", " df_sample_mv = df_mv.sample(n=n)\n", " df_X_mv = df_sample_mv[['Por','LogPerm','AI','Brittle','TOC','VR']]\n", " df_y_mv = df_sample_mv[['Production']]\n", " \n", " # calculate the MSE train and test over a range of complexity over multiple realizations of test/train split\n", " clam_list = np.logspace(-5,5,20,base=10.0)\n", " #print(clam_list)\n", " cmse_train = np.zeros([len(clam_list),nreal]); cmse_test = np.zeros([len(clam_list),nreal])\n", " for j in range(0,nreal):\n", " for i, clam in enumerate(clam_list):\n", " #print('lam' + str(clam))\n", " df_X_mv_train, df_X_mv_test, df_y_mv_train, df_y_mv_test = train_test_split(df_X_mv, df_y_mv, test_size=split, random_state=seed+j)\n", " n_train = len(df_X_mv_train); n_test = len(df_X_mv_test)\n", " ridge_reg = Ridge(alpha=clam)\n", " #print(X_train.reshape(n_train,1))\n", " ridge_reg.fit(df_X_mv_train.values,df_y_mv_train.values)\n", " #print('here')\n", " y_pred_train = ridge_reg.predict(df_X_mv_train.values)\n", " y_pred_test = ridge_reg.predict(df_X_mv_test.values)\n", " cmse_train[i,j] = mean_squared_error(df_y_mv_train.values, y_pred_train)\n", " cmse_test[i,j] = mean_squared_error(df_y_mv_test.values, y_pred_test)\n", " # summarize over the realizations\n", " cmse_train_avg = cmse_train.mean(axis=1)\n", " cmse_test_avg = cmse_test.mean(axis=1)\n", " cmse_train_high = np.percentile(cmse_train,q=90,axis=1)\n", " cmse_train_low = np.percentile(cmse_train,q=10,axis=1) \n", " cmse_test_high = np.percentile(cmse_test,q=90,axis=1)\n", " cmse_test_low = np.percentile(cmse_test,q=10,axis=1)\n", " \n", "# cmse_train_high = np.amax(cmse_train,axis=1)\n", "# cmse_train_low = np.amin(cmse_train,axis=1) \n", "# cmse_test_high = np.amax(cmse_test,axis=1)\n", "# cmse_test_low = np.amin(cmse_test,axis=1)\n", " \n", " # build the one model example to show\n", " \n", " df_X_mv_train, df_X_mv_test, df_y_mv_train, df_y_mv_test = train_test_split(df_X_mv, df_y_mv, test_size=split, random_state=seed+j)\n", " n_train = len(df_X_mv_train); n_test = len(df_X_mv_test)\n", " ridge_reg = Ridge(alpha=lam)\n", " ridge_reg.fit(df_X_mv_train.values,df_y_mv_train.values)\n", " y_pred_train = ridge_reg.predict(df_X_mv_train.values)\n", " y_pred_test = ridge_reg.predict(df_X_mv_test.values)\n", " \n", " # calculate error\n", " error_seq = np.linspace(-100.0,100.0,100)\n", " error_train = y_pred_train - df_y_mv_train.values\n", " error_test = y_pred_test - df_y_mv_test.values\n", " mse_train = mean_squared_error(df_y_mv_train.values,y_pred_train)\n", " mse_test = 
mean_squared_error(df_y_mv_test.values,y_pred_test)\n", "\n", " error_train_std = np.std(error_train)\n", " error_test_std = np.std(error_test)\n", " \n", " #kde_error_train = scipy.stats.gaussian_kde(error_train)\n", " #kde_error_test = scipy.stats.gaussian_kde(error_test)\n", " \n", " plt.subplot(131)\n", " plt.plot([0,8000],[0,8000],c = 'black',linewidth=3,alpha = 0.4)\n", " plt.scatter(df_y_mv_train.values,y_pred_train,color=\"red\",edgecolors='black',label='Train',alpha=0.2)\n", " plt.scatter(df_y_mv_test.values,y_pred_test,color=\"blue\",edgecolors='black',label='Test',alpha=0.2)\n", " plt.title(\"Cross Validation, Lambda = \"+str(round(lam,2)))\n", " plt.ylim([0,8000]); plt.xlim([0,8000]); plt.grid(); plt.legend(loc = 'upper left')\n", " plt.xlabel('True Production (MCFPD)'); plt.ylabel('Estimated Production (MCFPD)')\n", " \n", " plt.subplot(132)\n", " plt.hist(error_train, facecolor='red',bins=np.linspace(-2000.0,2000.0,20),alpha=0.2,density=False,edgecolor='black',label='Train')\n", " plt.hist(error_test, facecolor='blue',bins=np.linspace(-2000.0,2000.0,20),alpha=0.2,density=False,edgecolor='black',label='Test')\n", " #plt.plot(error_seq,kde_error_train(error_seq),lw=2,label='Train',c='red')\n", " #plt.plot(error_seq,kde_error_test(error_seq),lw=2,label='Test',c='blue') \n", " #plt.xlim([-55.0,55.0]); \n", " plt.ylim([0,10])\n", " plt.xlabel('Model Error'); plt.ylabel('Frequency'); plt.title('Training and Testing Error, Lambda = '+str((round(lam,2))))\n", " plt.legend(loc='upper left')\n", " plt.grid(True)\n", " \n", " plt.subplot(133); ax = plt.gca()\n", " plt.plot(clam_list,cmse_train_avg,lw=2,label='Train',c='red')\n", " ax.fill_between(clam_list,cmse_train_high,cmse_train_low,facecolor='red',alpha=0.05)\n", " \n", " plt.plot(clam_list,cmse_test_avg,lw=2,label='Test',c='blue') \n", " ax.fill_between(clam_list,cmse_test_high,cmse_test_low,facecolor='blue',alpha=0.05)\n", " \n", " plt.xscale('log'); plt.xlim([lam_max,lam_min]); plt.yscale('log'); plt.ylim([1.0e5,1.0e7])\n", " plt.xlabel('Model Flexibility by Regularization, Lambda'); plt.ylabel('Mean Square Error'); plt.title('Training and Testing Error vs. Model Complexity')\n", " plt.legend(loc='upper left')\n", " plt.grid(True)\n", " \n", " plt.plot([lam,lam],[1.0e5,1.0e7],c = 'black',linewidth=3,alpha = 0.8)\n", " \n", " plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.0, wspace=0.2, hspace=0.3)\n", " plt.show()\n", " \n", "# connect the function to make the samples and plot to the widgets \n", "interactive_plot = widgets.interactive_output(run_plot, {'n':n,'split':split,'lam':lam})\n", "interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interactive Machine Learning Ridge Regression Hyperparameter Tuning Demonstation \n", "\n", "#### Michael Pyrcz, Associate Professo, University of Texas at Austin \n", "\n", "Change the number of sample data, train/test split and the data noise and observe hyperparameter tuning! Change the regularization hyperparameter, lambda, to observe a specific model example. 
Based on a more complicated (6 predictor features, some non-linearity and sampling error).\n", "\n", "### The Inputs\n", "\n", "* **n** - number of data\n", "* **Test %** - percentage of sample data withheld as testing data\n", "* **Noise StDev** - standard deviation of random Gaussian error added to the data\n", "* **Regularization Hyperparameter** - the lambda coefficient, weight in loss the function on minimization of the model parameters " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "61f6f04e3df84b1182b815d18f0804ba", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(Text(value=' Machine Learning Ridge Regression Hyperparameter Tuning D…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c1f0ca7007ce427db3b4afc9c63d305a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output(outputs=({'output_type': 'display_data', 'data': {'text/plain': '
', 'i…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(ui2, interactive_plot) # display the interactive plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Comments\n", "\n", "This was a basic demonstration of machine learning model training and tuning, with ridge regression. I have many other demonstrations and even basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy. \n", " \n", "#### The Author:\n", "\n", "### Michael J. Pyrcz, Professor, The University of Texas at Austin \n", "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n", "\n", "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n", "\n", "For more about Michael check out these links:\n", "\n", "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n", "\n", "#### Want to Work Together?\n", "\n", "I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n", "\n", "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n", "\n", "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n", "\n", "* I can be reached at mpyrcz@austin.utexas.edu.\n", "\n", "I'm always happy to discuss,\n", "\n", "*Michael*\n", "\n", "Michael Pyrcz, Ph.D., P.Eng. 
Professor, Cockrell School of Engineering, Bureau of Economic Geology, and The Jackson School of Geosciences, The University of Texas at Austin\n", "\n", "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) \n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 2 }