\n",
"\n",
"## Data Analytics \n",
"\n",
"### Interactive Shapley Value Demonstration in Python \n",
"\n",
"\n",
"#### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
"\n",
"##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Shapley Values\n",
"\n",
"Here's a demonstration of Shapley values for subsurface modeling in Python. This is part of my Subsuface Machine Learning Course at the Cockrell School of Engineering at the University of Texas at Austin. \n",
"\n",
"#### Motivation\n",
"\n",
"Complicated models are often required to model our natural settings, but have low interpretability. We have 2 choices to improve model interpretability: \n",
"\n",
"1. reduce the complexity of the models, but also reduce model accuracy\n",
"\n",
"2. develop improved, agnostic (for any model) model diagnostics\n",
"\n",
"Shapley values are the later, they treat the model as a black box and use a data subset, known as background, with predictor features $X_1,\\ldots,X_m$ and model predictions $\\hat{y}$ to discover structures between each predictor feature and the model response.\n",
"\n",
"\\begin{equation}\n",
"f(x_1,\\ldots,x_m) = \\sum_{i=1}^m \\phi_i\n",
"\\end{equation}\n",
"\n",
"where $\\phi_i$ is the Shapely value of the $i^{th}$ feature calculated by averaging over the combinatorial of marginal contributions for all feature orders and sequences.\n",
"\n",
"* **order** - the number of previous features included before adding the $i^{th}$ feature\n",
"\n",
"* **sequence** - the sequence of the addition of features\n",
"\n",
"\n",
"#### Description\n",
"\n",
"Shapley values are adpated from game theory where the approach is used to calculate:\n",
"\n",
"* allocating resources between ‘players’ based on a summarization of marginal contributions to the game. Dividing up winnings between all players. \n",
"\n",
"* contribution of each predictor feature to push the response prediction away from the mean value of the response over training\n",
"\n",
"* an example of an marginal contribution include $f(x_2,x_1) - f(x_2)$ and $f(x_1) - E[y]$ to quantify the impact on the estimate of feature $X_1$, locally for an estimate with values $x_1$ and $x_2$.\n",
"\n",
"Note, Shapley values are based completely on the model and do not access the accuracy of the model\n",
"\n",
"* a poor model will potentially lead to incorrect interpretation with Shapley values\n",
"\n",
"#### Variable Ranking\n",
"\n",
"There are often many predictor features, input variables, available for us to work with for subsurface prediction. There are good reasons to be selective, throwing in every possible feature is not a good idea! In general, for the best prediction model, careful selection of the fewest features that provide the most amount of information is the best practice. \n",
"\n",
"Here's why:\n",
"\n",
"* more variables result in more complicated workflows that require more professional time and have increased opportunity for blunders\n",
"* higher dimensional feature sets are more difficult to visualize\n",
"* more complicated models may be more difficult to interrogate, interpret and QC\n",
"* inclusion of highly redundant and colinear variables increases model instability and decreases prediction accuracy in testing\n",
"* more variables generally increase the computational time required to train the model and the model may be less compact and portable\n",
"* the risk of overfit increases with the more variables, more complexity\n",
"\n",
"#### What is Feature Ranking?\n",
"\n",
"Feature ranking is a set of metrics that assign relative importance or value to each feature with respect to information contained for inference and importance in predicting a response feature. There are a wide variety of possible methods to accomplish this. My recommendation is a **'wide-array'** approach with multiple metric, while understanding the assumptions and limitations of each metric. \n",
"\n",
"Here's the general types of metrics that we will consider for feature ranking.\n",
"\n",
"1. Visual Inspection of Data Distributions and Scatter Plots\n",
"2. Statistical Summaries\n",
"3. Model-based\n",
"4. Recursive Feature Elimination \n",
"\n",
"Also, we should not neglect expert knowledge. If additional information is known about physical processes, causation, reliability and availability of features this should be integrated into assigning feature ranks."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Import Packages\n",
"\n",
"We will also need some standard packages. These should have been installed with Anaconda 3."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np # ndarrys for gridded data\n",
"import pandas as pd # DataFrames for tabular data\n",
"import os # set working directory, run executables\n",
"import matplotlib.pyplot as plt # for plotting\n",
"from sklearn.ensemble import RandomForestRegressor # random forest method\n",
"from ipywidgets import interactive # widgets and interactivity\n",
"from ipywidgets import widgets \n",
"from ipywidgets import Layout\n",
"from ipywidgets import Label\n",
"from ipywidgets import VBox, HBox"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have added Shapley value, let's install the shap package"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n"
]
},
{
"data": {
"text/html": [
"
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#import sys # uncomment to install shapley if not installed\n",
"#!{sys.executable} -m pip install shap\n",
"import shap\n",
"shap.initjs() # enable javascript for Shapley plots"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Set the Demonstration Parameters\n",
"\n",
"Let's assume some parameters for our itneractive demonstrations."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"n = 50; nb = 25 # number of training and background data\n",
"b0 = 0.3\n",
"\n",
"max_leaf_nodes = 5 # default random forest parameters\n",
"num_tree = 300\n",
"max_features = 0.3\n",
"\n",
"seed = 73073 # random number seed for consistent interactive displays\n",
"\n",
"xvals = np.linspace(-3,3,1000) # grid sampling to visualize the models in 1D and 2D\n",
"x1values = np.linspace(-3, 3, 100)\n",
"x2values = np.linspace(-3, 3, 100)\n",
"x1grid, x2grid = np.meshgrid(x1values, x2values)\n",
"xgrid = np.stack((x1grid.flatten(), x2grid.flatten()), axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Experiment \\#1: One Predictor Feature, Linear Relationship with Noise\n",
"\n",
"Let's start very simple with one predictor feature and a linear relationship with noise added."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "1fcf03ab5a734e0187ec164163cb0363",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VBox(children=(Text(value=' One Predictor Feature Shapley Dependency Demonstration, Mi…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "704ac8111690459c88687cd7021c24cc",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"l = widgets.Text(value=' One Predictor Feature Shapley Dependency Demonstration, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n",
"\n",
"b1 = widgets.FloatSlider(min=0.00, max = 1.0, value = 0.5, step = 0.1, description = r'$b_1$',orientation='horizontal',layout=Layout(width='400px', height='50px'),continuous_update=False)\n",
"b1.style.handle_color = 'red'\n",
"noise = widgets.FloatSlider(min=0.01, max = 1.0, value = 0.20, step = 0.01, description = r'$\\sigma_{noise}$',orientation='horizontal',layout=Layout(width='400px', height='50px'),continuous_update=False)\n",
"noise.style.handle_color = 'red'\n",
"\n",
"ui1 = widgets.HBox([b1,noise],kwargs = {'justify_content':'center'}) \n",
"ui = widgets.VBox([l,ui1],kwargs = {'justify_content':'center'})\n",
"\n",
"def f_make(b1,noise):\n",
" np.random.seed(seed=seed)\n",
" features = ['X1']\n",
" X1 = np.random.normal(loc = 0.0,scale = 1.0,size = n)\n",
" X1b = np.random.normal(loc = 0.0,scale = 1.0,size = nb)\n",
" y = b1*X1 + b0 + np.random.normal(loc = 0.0, scale = noise, size = n)\n",
" yb = b1*X1b + b0 + np.random.normal(loc = 0.0, scale = noise, size = nb)\n",
" \n",
" fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)\n",
" #axes[0].scatter(X1,y,color = 'red', edgecolor = 'black',label = 'training',alpha = 0.3)\n",
" axes[0].scatter(X1b,yb,color = 'white', edgecolor = 'black',label = '$y_{background}$', alpha = 0.3)\n",
" axes[0].set_xlim(-3,3); axes[0].set_ylim(-2,2)\n",
" \n",
" random_forest = RandomForestRegressor(max_leaf_nodes=max_leaf_nodes, random_state=seed,n_estimators=num_tree, max_features=max_features)\n",
" random_forest.fit(X = X1.reshape(-1, 1), y = y.reshape(-1,))\n",
" y_hat = random_forest.predict(xvals.reshape(-1, 1))\n",
" yb_hat = random_forest.predict(X1b.reshape(-1, 1))\n",
" axes[0].plot(xvals,y_hat,color='black', label = '$\\\\widehat{y}$', alpha = 0.3)\n",
" axes[0].plot([-3,3],[np.average(y),np.average(y)],'--',color = 'red',alpha = 0.3)\n",
" axes[0].scatter(X1b,yb_hat,color = 'blue', edgecolor = 'black',label = '$\\\\widehat{y}_{background}$')\n",
" axes[0].legend(); axes[0].set_xlabel('$X_1$'); axes[0].set_ylabel('y')\n",
" axes[0].annotate(xy = [-2.8,np.average(y)+0.2],text='$E[y]$'); axes[0].set_title('Data and Model')\n",
" \n",
" shap_values = shap.TreeExplainer(random_forest).shap_values(X1b.reshape(-1, 1))\n",
" shap.dependence_plot(ax = axes[1],ind = features[0],show=False,feature_names = features, shap_values = shap_values, features = X1b.reshape(-1, 1), cmap = plt.cm.inferno)\n",
" axes[1].plot(xvals,y_hat-np.average(y),color='black',alpha = 0.2,label = '$\\\\Delta \\widehat{y}$')\n",
" axes[1].plot([-3,3],[0,0],'--',color = 'red', alpha = 0.3)\n",
" axes[1].legend(); axes[1].set_title('Dependency Plot')\n",
" axes[1].set_xlim(-3,3); axes[1].set_ylim(-2,2)\n",
"\n",
"interactive_plot = widgets.interactive_output(f_make, {'b1':b1,'noise':noise})\n",
"interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating \n",
" \n",
"display(ui, interactive_plot) # display the interactive plot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can make the following observations:\n",
"\n",
"* the Shapley value is just the $f(X_1) - E[y]$ marginal contribution. \n",
"\n",
"* we can observe that the Shapley values are the departure of the predictions of $y$ form the global average of $\\overline{y}$\n",
"\n",
"* noise disrupts the original linear structure of the relationship\n",
"\n",
"#### Experiment \\#2: Two Predictor Features, Multilinear Relationship with Noise and Control Model Hyperparameter\n",
"\n",
"Now let's add another feature $X_2$ and include control of the model hyperparameter, number of leaf nodes, so we can underfit the model."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "1976ee912ae949f9b0c49b3b9cf8f98c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VBox(children=(Text(value=' Two Predictor Features Shapley Dependency Demonstration, M…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "42709129260844b2b5dd6353710ec377",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# interactive calculation of the random sample set (control of source parametric distribution and number of samples)\n",
"l = widgets.Text(value=' Two Predictor Features Shapley Dependency Demonstration, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n",
"\n",
"b1 = widgets.FloatSlider(min=0.00, max = 5.0, value = 0.5, step = 0.1, description = '$b_1$',orientation='horizontal',layout=Layout(width='300px', height='50px'),continuous_update=False)\n",
"b1.style.handle_color = 'red'\n",
"b2 = widgets.FloatSlider(min=0.00, max = 5.0, value = 0.5, step = 0.1, description = '$b_2$',orientation='horizontal',layout=Layout(width='300px', height='50px'),continuous_update=False)\n",
"b2.style.handle_color = 'red'\n",
"noise = widgets.FloatSlider(min=0.01, max = 5.0, value = 0.20, step = 0.1, description = '$\\\\sigma_{noise}$',orientation='horizontal',layout=Layout(width='300px', height='50px'),continuous_update=False)\n",
"noise.style.handle_color = 'red'\n",
"nodes = widgets.IntSlider(min=2, max = 100, value = 5, step = 1, description = 'leaf nodes',orientation='horizontal',layout=Layout(width='300px', height='50px'),continuous_update=False)\n",
"nodes.style.handle_color = 'red'\n",
"\n",
"ui1 = widgets.HBox([b1,b2,noise,nodes],kwargs = {'justify_content':'center'}) \n",
"ui = widgets.VBox([l,ui1],kwargs = {'justify_content':'center'})\n",
"num_tree = 5\n",
"\n",
"def f_make(b1,b2,noise,nodes):\n",
" np.random.seed(seed=seed)\n",
" max_leaf_nodes = 25; n = 300; nb = 200\n",
" features = ['X1','X2']\n",
" X1 = np.random.normal(loc = 0.0,scale = 1.0,size = n)\n",
" X1b = np.random.normal(loc = 0.0,scale = 1.0,size = nb)\n",
" X2 = np.random.normal(loc = 0.0,scale = 1.0,size = n)\n",
" X2b = np.random.normal(loc = 0.0,scale = 1.0,size = nb)\n",
" y = b2*X2 + b1*X1 + b0 + np.random.normal(loc = 0.0, scale = noise, size = n)\n",
" yb = b2*X2b + b1*X1b + b0 + np.random.normal(loc = 0.0, scale = noise, size = nb)\n",
" \n",
" X = np.stack((X1, X2), axis=1)\n",
" Xb = np.stack((X1b, X2b), axis=1)\n",
" \n",
" fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize=(20, 5), sharey=False, gridspec_kw = {\"wspace\":0.3})\n",
" \n",
" random_forest = RandomForestRegressor(max_leaf_nodes=nodes, random_state=seed,n_estimators=num_tree, max_features=max_features)\n",
" random_forest.fit(X = X, y = y.reshape(-1,))\n",
" y_hat = random_forest.predict(xgrid)\n",
" yb_hat = random_forest.predict(Xb)\n",
" ymin = np.min(y_hat.reshape([100,100])); ymax = np.max(y_hat.reshape([100,100]))\n",
" \n",
" #axes[0].scatter(X1,X2,c = y,edgecolor = 'black',marker = 'v',s = 20, alpha = 0.3, cmap = plt.cm.inferno,label = 'training')\n",
" axes[0].scatter(X1b,X2b,c = yb_hat,edgecolor = 'black',s = 20, alpha = 0.8, cmap = plt.cm.inferno,label = 'background')\n",
" ct = axes[0].contour(x1grid, x2grid,y_hat.reshape([100,100]),levels = 20, cmap = plt.cm.inferno, alpha = 0.2, vmin = ymin, vmax = ymax)\n",
" axes[0].legend(); axes[0].set_xlabel('$X_1$'); axes[0].set_ylabel('$X_2$')\n",
" axes[0].set_title('Data and Model')\n",
" axes[0].set_xlim(-3,3)\n",
" fig.colorbar(ct, ax=axes[0])\n",
" \n",
" shap_values = shap.TreeExplainer(random_forest).shap_values(Xb)\n",
" shap.dependence_plot(ax = axes[1],ind = features[0],show=False,feature_names = features, shap_values = shap_values, features = Xb, cmap = plt.cm.inferno)\n",
" axes[1].plot([-3,3],[0,0],'--',color = 'red', alpha = 0.3)\n",
" axes[1].set_title('Dependency Plot')\n",
" #axes[1].set_xlim(-3,3); axes[1].set_ylim(-1*ymax,ymax)\n",
" axes[1].set_xlim(-3,3); axes[1].set_ylim(-1.5,1.5) \n",
"\n",
" shap.dependence_plot(ax = axes[2],ind = features[1],show=False,feature_names = features, shap_values = shap_values, features = Xb, cmap = plt.cm.inferno)\n",
" axes[2].plot([-3,3],[0,0],'--',color = 'red', alpha = 0.3)\n",
" axes[2].set_title('Dependency Plot')\n",
" #axes[2].set_xlim(-3,3); axes[2].set_ylim(-1*ymax,ymax)\n",
" axes[2].set_xlim(-3,3); axes[2].set_ylim(-1.5,1.5)\n",
" \n",
" axes[0].set_ylim(-3,3)\n",
"\n",
"interactive_plot = widgets.interactive_output(f_make, {'b1':b1,'b2':b2,'noise':noise,'nodes':nodes})\n",
"interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating \n",
" \n",
"display(ui, interactive_plot) # display the interactive plot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can make the following observations:\n",
"\n",
"* the Shapley value now includes:\n",
"\n",
" * $f(X_1) - E[y]$ and $f(X_1,X_2) - f(X_2)$ marginal contributions for the $X_1$ Shapley value\n",
" \n",
" * $f(X_2) - E[y]$ and $f(X_1,X_2) - f(X_1)$ marginal contributions for the $X_2$ Shapley value\n",
" \n",
"* as a result the Shapley values extract the individual contribution structures for each feature even when there is significant conditional variance, $\\sigma_{y | X_1}$ and $\\sigma_{y | X_2}$ \n",
"\n",
"* the influence of the features over the range of departures from the global mean, $E{y}$, are approximately indepedendent (if model complexity is well tuned)\n",
"\n",
"* noise still disrupts the original linear structure of the relationship \n",
"\n",
"* noise and complexity (large number of leaf nodes) can result in an overfit model, adding further noise to the Shapley values\n",
"\n",
"#### Experiment \\#3: Two Predictor Features, Non-linear Relationship with Noise and Control Model Hyperparameter\n",
"\n",
"Now let's add a square term to the $X_1$ feature and retain control of the model hyperparameter, number of leaf nodes, so we can underfit the model."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a22b236c4516406f9ccbce0b7b961e77",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VBox(children=(Text(value=' Two Predictor Features Shapley Dependency Demonstration, M…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2ad1b6ae8319484a8d5ec0db7970c270",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# interactive calculation of the random sample set (control of source parametric distribution and number of samples)\n",
"l = widgets.Text(value=' Two Predictor Features Shapley Dependency Demonstration, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n",
"\n",
"b1 = widgets.FloatSlider(min=0.00, max = 5.0, value = 0.5, step = 0.1, description = '$b_1$',orientation='horizontal',layout=Layout(width='300px', height='50px'),continuous_update=False)\n",
"b1.style.handle_color = 'red'\n",
"b2 = widgets.FloatSlider(min=0.00, max = 5.0, value = 0.5, step = 0.1, description = '$b_2$',orientation='horizontal',layout=Layout(width='300px', height='50px'),continuous_update=False)\n",
"b2.style.handle_color = 'red'\n",
"noise = widgets.FloatSlider(min=0.01, max = 5.0, value = 0.20, step = 0.01, description = '$\\\\sigma_{noise}$',orientation='horizontal',layout=Layout(width='300px', height='50px'),continuous_update=False)\n",
"noise.style.handle_color = 'red'\n",
"nodes = widgets.IntSlider(min=2, max = 1000, value = 50, step = 1, description = 'leaf nodes',orientation='horizontal',layout=Layout(width='300px', height='50px'),continuous_update=False)\n",
"nodes.style.handle_color = 'red'\n",
"\n",
"ui1 = widgets.HBox([b1,b2,noise,nodes],kwargs = {'justify_content':'center'}) \n",
"ui = widgets.VBox([l,ui1],kwargs = {'justify_content':'center'})\n",
"\n",
"def f_make(b1,b2,noise,nodes):\n",
" np.random.seed(seed=seed)\n",
" max_leaf_nodes = 25; n = 300; nb = 200\n",
" features = ['X1','X2']\n",
" X1 = np.random.normal(loc = 0.0,scale = 1.0,size = n)\n",
" X1b = np.random.normal(loc = 0.0,scale = 1.0,size = nb)\n",
" X2 = np.random.normal(loc = 0.0,scale = 1.0,size = n)\n",
" X2b = np.random.normal(loc = 0.0,scale = 1.0,size = nb)\n",
" y = b2*X2 + b1*X1*X1 + b0 + np.random.normal(loc = 0.0, scale = noise, size = n)\n",
" yb = b2*X2b + b1*X1b*X1b + b0 + np.random.normal(loc = 0.0, scale = noise, size = nb)\n",
" \n",
" X = np.stack((X1, X2), axis=1)\n",
" Xb = np.stack((X1b, X2b), axis=1)\n",
" \n",
" fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize=(20, 5), sharey=False, gridspec_kw = {\"wspace\":0.3})\n",
" \n",
" random_forest = RandomForestRegressor(max_leaf_nodes=nodes, random_state=seed,n_estimators=num_tree, max_features=max_features)\n",
" random_forest.fit(X = X, y = y.reshape(-1,))\n",
" y_hat = random_forest.predict(xgrid)\n",
" yb_hat = random_forest.predict(Xb)\n",
" ymin = np.min(y_hat.reshape([100,100])); ymax = np.max(y_hat.reshape([100,100]))\n",
" \n",
" #axes[0].scatter(X1,X2,c = y,edgecolor = 'black',marker = 'v',s = 20, alpha = 0.3, cmap = plt.cm.inferno,label = 'training')\n",
" axes[0].scatter(X1b,X2b,c = yb,edgecolor = 'black',s = 20, alpha = 0.8, cmap = plt.cm.inferno,label = 'background')\n",
" ct = axes[0].contour(x1grid, x2grid,y_hat.reshape([100,100]),levels = 20, cmap = plt.cm.inferno, alpha = 0.2, vmin = ymin, vmax = ymax)\n",
" axes[0].legend(); axes[0].set_xlabel('$X_1$'); axes[0].set_ylabel('$X_2$')\n",
" axes[0].set_title('Data and Model')\n",
" axes[0].set_xlim(-3,3)\n",
" fig.colorbar(ct, ax=axes[0])\n",
" \n",
" shap_values = shap.TreeExplainer(random_forest).shap_values(Xb)\n",
" shap.dependence_plot(ax = axes[1],ind = features[0],show=False,feature_names = features, shap_values = shap_values, features = Xb, cmap = plt.cm.inferno)\n",
" axes[1].plot([-3,3],[0,0],'--',color = 'red', alpha = 0.3)\n",
" axes[1].set_title('Dependency Plot')\n",
" axes[1].set_xlim(-3,3); axes[1].set_ylim(-1*ymax,ymax)\n",
" \n",
" shap.dependence_plot(ax = axes[2],ind = features[1],show=False,feature_names = features, shap_values = shap_values, features = Xb, cmap = plt.cm.inferno)\n",
" axes[2].plot([-3,3],[0,0],'--',color = 'red', alpha = 0.3)\n",
" axes[2].set_title('Dependency Plot')\n",
" axes[2].set_xlim(-3,3); axes[2].set_ylim(-1*ymax,ymax)\n",
" \n",
"interactive_plot = widgets.interactive_output(f_make, {'b1':b1,'b2':b2,'noise':noise,'nodes':nodes})\n",
"interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating \n",
" \n",
"display(ui, interactive_plot) # display the interactive plot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can make the following observations:\n",
"\n",
"* the Shapley values are able to reveal the linear ($X_2$) and non-linear, quadratic ($X_1$) relationships:\n",
"\n",
" * $f(X_1) - E[y]$ and $f(X_1,X_2) - f(X_2)$ marginal contributions for the $X_1$ Shapley value\n",
" \n",
" * $f(X_2) - E[y]$ and $f(X_1,X_2) - f(X_1)$ marginal contributions for the $X_2$ Shapley value\n",
" \n",
"* as a result the Shapley values extract the individual contribution structures for each feature even when there is significant conditional variance, $\\sigma_{y | X_1}$ and $\\sigma_{y | X_2}$ \n",
"\n",
"* the influence of the features over the range of departures from the global mean, $E{y}$, are approximately indepedendent (if model complexity is well tuned)\n",
"\n",
"* noise still disrupts the original linear structure of the relationship \n",
"\n",
"* noise and complexity (large number of leaf nodes) can result in an overfit model, adding further noise to the Shapley values\n",
"\n",
"#### Experiment \\#4: Two Predictor Features, Multilinear Relationship with Correlation, Noise and Control Model Hyperparameter\n",
"\n",
"Now let's add correlation between the predictor features, $X_1$ and $X_2$, and retain control of the model hyperparameter, number of leaf nodes, while adding number of trees so we can underfit the model."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "f7b39c3def7d4400949f1d9f2bee69ac",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VBox(children=(Text(value=' Two Predictor Features Shapley Dependency Demonstration, M…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "eff072626242432da269ecb053298b4e",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# interactive calculation of the random sample set (control of source parametric distribution and number of samples)\n",
"l = widgets.Text(value=' Two Predictor Features Shapley Dependency Demonstration, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n",
"\n",
"b1 = widgets.FloatSlider(min=0.00, max = 5.0, value = 0.5, step = 0.1, description = '$b_1$',orientation='horizontal',layout=Layout(width='400px', height='50px'),continuous_update=False)\n",
"b1.style.handle_color = 'red'\n",
"b2 = widgets.FloatSlider(min=0.00, max = 5.0, value = 0.5, step = 0.1, description = '$b_2$',orientation='horizontal',layout=Layout(width='400px', height='50px'),continuous_update=False)\n",
"b2.style.handle_color = 'red'\n",
"corr = widgets.FloatSlider(min=-1.0, max = 1.0, value = 0.0, step = 0.1, description = '$\\\\rho$',orientation='horizontal',layout=Layout(width='400px', height='50px'),continuous_update=False)\n",
"corr.style.handle_color = 'red'\n",
"noise = widgets.FloatSlider(min=0.01, max = 5.0, value = 0.20, step = 0.1, description = '$\\\\sigma_{noise}$',orientation='horizontal',layout=Layout(width='400px', height='50px'),continuous_update=False)\n",
"noise.style.handle_color = 'red'\n",
"nodes = widgets.IntSlider(min=2, max = 100, value = 50, step = 1, description = 'leaf nodes',orientation='horizontal',layout=Layout(width='400px', height='50px'),continuous_update=False)\n",
"nodes.style.handle_color = 'red'\n",
"ntree = widgets.IntSlider(min=1, max = 100, value = 50, step = 1, description = '# tree',orientation='horizontal',layout=Layout(width='400px', height='50px'),continuous_update=False)\n",
"ntree.style.handle_color = 'red'\n",
"\n",
"ui1 = widgets.HBox([b1,b2,corr],kwargs = {'justify_content':'center'}) \n",
"ui2 = widgets.HBox([noise,nodes,ntree],kwargs = {'justify_content':'center'}) \n",
"ui = widgets.VBox([l,ui1,ui2],kwargs = {'justify_content':'center'})\n",
"\n",
"def f_make(b1,b2,corr,noise,nodes,ntree):\n",
" np.random.seed(seed=seed)\n",
" max_leaf_nodes = 25; n = 300; nb = 200\n",
" features = ['X1','X2']\n",
" \n",
" mean = [0.0,0.0]; stdev = [1.0,1.0]\n",
" cov = np.array([1.0,corr,corr,1.0]); cov = cov.reshape((2,2))\n",
" X = np.random.multivariate_normal(mean, cov, size = n)\n",
" X1 = X[:,0]; X2 = X[:,1]\n",
" Xb = np.random.multivariate_normal(mean, cov, size = nb)\n",
" X1b = Xb[:,0]; X2b = Xb[:,1]\n",
"\n",
" y = b2*X2 + b1*X1 + b0 + np.random.normal(loc = 0.0, scale = noise, size = n)\n",
" yb = b2*X2b + b1*X1b + b0 + np.random.normal(loc = 0.0, scale = noise, size = nb)\n",
" \n",
" fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize=(20, 5), sharey=False, gridspec_kw = {\"wspace\":0.3})\n",
" \n",
" random_forest = RandomForestRegressor(max_leaf_nodes=nodes, random_state=seed,n_estimators=ntree, max_features=max_features)\n",
" random_forest.fit(X = X, y = y.reshape(-1,))\n",
" y_hat = random_forest.predict(xgrid)\n",
" yb_hat = random_forest.predict(Xb)\n",
" #ymin = np.min(y_hat.reshape([100,100])); ymax = np.max(y_hat.reshape([100,100]))\n",
" #ymin = np.min(y); ymax = np.max(y)\n",
" \n",
" axes[0].scatter(X1,X2,c = y,edgecolor = 'black',marker = 'v',s = 20, alpha = 0.3, cmap = plt.cm.inferno,label = 'training')\n",
" axes[0].scatter(X1b,X2b,c = yb,edgecolor = 'black',s = 20, alpha = 0.8, cmap = plt.cm.inferno,label = 'background')\n",
" #ct = axes[0].contour(x1grid, x2grid,y_hat.reshape([100,100]),levels = 20, cmap = plt.cm.inferno, alpha = 0.2, vmin = ymin, vmax = ymax)\n",
" ct = axes[0].contour(x1grid, x2grid,y_hat.reshape([100,100]),levels = 20, cmap = plt.cm.inferno, alpha = 0.2, vmin = -3, vmax = 3)\n",
" axes[0].legend(); axes[0].set_xlabel('$X_1$'); axes[0].set_ylabel('$X_2$')\n",
" axes[0].set_title('Data and Model')\n",
" axes[0].set_xlim(-3,3)\n",
" fig.colorbar(ct, ax=axes[0])\n",
" \n",
" shap_values = shap.TreeExplainer(random_forest).shap_values(Xb)\n",
" shap.dependence_plot(ax = axes[1],ind = features[0],show=False,feature_names = features, shap_values = shap_values, features = Xb, cmap = plt.cm.inferno)\n",
" axes[1].plot([-3,3],[0,0],'--',color = 'red', alpha = 0.3)\n",
" axes[1].set_title('Dependency Plot')\n",
" #axes[1].set_xlim(-3,3); axes[1].set_ylim(ymin,ymax)\n",
" axes[1].set_xlim(-3,3); axes[1].set_ylim(-3,3)\n",
" \n",
" shap.dependence_plot(ax = axes[2],ind = features[1],show=False,feature_names = features, shap_values = shap_values, features = Xb, cmap = plt.cm.inferno)\n",
" axes[2].plot([-3,3],[0,0],'--',color = 'red', alpha = 0.3)\n",
" axes[2].set_title('Dependency Plot')\n",
" #axes[2].set_xlim(-3,3); axes[2].set_ylim(ymin,ymax)\n",
" axes[2].set_xlim(-3,3); axes[2].set_ylim(-3,3)\n",
" \n",
" axes[0].set_ylim(-3,3)\n",
"\n",
"interactive_plot = widgets.interactive_output(f_make, {'b1':b1,'b2':b2,'corr':corr,'noise':noise,'nodes':nodes,'ntree':ntree})\n",
"interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating \n",
" \n",
"display(ui, interactive_plot) # display the interactive plot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can add the following observations:\n",
"\n",
"* correlation between the predictor features, $X_1$ and $X_2$, adds structure and ordering to the dependency plots\n",
"\n",
"* this is evident with ordering and stratification relationships \n",
" \n",
"#### Experiment \\#5: Global Shapley for Two Predictor Features, Multilinear Relationship with Correlation, Noise and Control Model Hyperparameter\n",
"\n",
"Now let's take the previous experiment and apply it to global Shapley measures."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "6720c1dd2cc443afaa4aadf35b102ec8",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VBox(children=(Text(value=' Global Shapley Demonstration, Michael Pyrcz, Associate Pro…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9234870bea9d4e6da4011cee4164b16a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# interactive calculation of the random sample set (control of source parametric distribution and number of samples)\n",
"l = widgets.Text(value=' Global Shapley Demonstration, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n",
"\n",
"b1 = widgets.FloatSlider(min=0.00, max = 5.0, value = 0.5, step = 0.1, description = '$b_1$',orientation='horizontal',layout=Layout(width='400px', height='50px'),continuous_update=False)\n",
"b1.style.handle_color = 'red'\n",
"b2 = widgets.FloatSlider(min=0.00, max = 5.0, value = 0.5, step = 0.1, description = '$b_2$',orientation='horizontal',layout=Layout(width='400px', height='50px'),continuous_update=False)\n",
"b2.style.handle_color = 'red'\n",
"corr = widgets.FloatSlider(min=-1.0, max = 1.0, value = 0.0, step = 0.1, description = '$\\\\rho$',orientation='horizontal',layout=Layout(width='400px', height='50px'),continuous_update=False)\n",
"corr.style.handle_color = 'red'\n",
"noise = widgets.FloatSlider(min=0.01, max = 5.0, value = 0.20, step = 0.1, description = '$\\\\sigma_{noise}$',orientation='horizontal',layout=Layout(width='400px', height='50px'),continuous_update=False)\n",
"noise.style.handle_color = 'red'\n",
"nodes = widgets.IntSlider(min=2, max = 100, value = 50, step = 1, description = 'leaf nodes',orientation='horizontal',layout=Layout(width='400px', height='50px'),continuous_update=False)\n",
"nodes.style.handle_color = 'red'\n",
"ntree = widgets.IntSlider(min=1, max = 100, value = 50, step = 1, description = '# tree',orientation='horizontal',layout=Layout(width='400px', height='50px'),continuous_update=False)\n",
"ntree.style.handle_color = 'red'\n",
"\n",
"ui1 = widgets.HBox([b1,b2,corr],kwargs = {'justify_content':'center'}) \n",
"ui2 = widgets.HBox([noise,nodes,ntree],kwargs = {'justify_content':'center'}) \n",
"ui = widgets.VBox([l,ui1,ui2],kwargs = {'justify_content':'center'})\n",
"\n",
"def f_make(b1,b2,corr,noise,nodes,ntree):\n",
" np.random.seed(seed=seed)\n",
" max_leaf_nodes = 25; n = 300; nb = 200\n",
" features = ['X1','X2']\n",
" \n",
" mean = [0.0,0.0]; stdev = [1.0,1.0]\n",
" cov = np.array([1.0,corr,corr,1.0]); cov = cov.reshape((2,2))\n",
" X = np.random.multivariate_normal(mean, cov, size = n)\n",
" X1 = X[:,0]; X2 = X[:,1]\n",
" Xb = np.random.multivariate_normal(mean, cov, size = nb)\n",
" X1b = Xb[:,0]; X2b = Xb[:,1]\n",
"\n",
" y = b2*X2 + b1*X1 + b0 + np.random.normal(loc = 0.0, scale = noise, size = n)\n",
" yb = b2*X2b + b1*X1b + b0 + np.random.normal(loc = 0.0, scale = noise, size = nb)\n",
" \n",
" fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize=(20, 5), sharey=False, gridspec_kw = {\"wspace\":0.3})\n",
" \n",
" random_forest = RandomForestRegressor(max_leaf_nodes=nodes, random_state=seed,n_estimators=ntree, max_features=max_features)\n",
" random_forest.fit(X = X, y = y.reshape(-1,))\n",
" y_hat = random_forest.predict(xgrid)\n",
" yb_hat = random_forest.predict(Xb)\n",
" #ymin = np.min(y_hat.reshape([100,100])); ymax = np.max(y_hat.reshape([100,100]))\n",
" ymin = np.min(y); ymax = np.max(y)\n",
" \n",
" shap_values = shap.TreeExplainer(random_forest).shap_values(Xb)\n",
"\n",
" plt.subplot(131)\n",
" shap.summary_plot(show=False,feature_names = features, shap_values = shap_values, features = Xb, plot_type=\"bar\",color = \"gray\", cmap = plt.cm.inferno)\n",
" plt.ylabel('Predictor Features')\n",
" \n",
" plt.subplot(132)\n",
" shap.summary_plot(show=False,feature_names = features, shap_values = shap_values, features = Xb,cmap = plt.cm.inferno)\n",
" \n",
" plt.subplot(133)\n",
" shap.summary_plot(show=False,feature_names = features, shap_values = shap_values, features = Xb,plot_type = \"violin\")\n",
" \n",
" plt.subplots_adjust(left=0.0, bottom=0.0, right=2.2, top=1.2, wspace=0.2, hspace=0.2)\n",
" plt.show()\n",
"\n",
"interactive_plot = widgets.interactive_output(f_make, {'b1':b1,'b2':b2,'corr':corr,'noise':noise,'nodes':nodes,'ntree':ntree})\n",
"interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating \n",
" \n",
"display(ui, interactive_plot) # display the interactive plot\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can observed the competition for global feature importance, and the impact of model overfit.\n",
"\n",
"#### Comments\n",
"\n",
"This was a basic interactive demonstration of Shapley values. \n",
"\n",
"I have other demonstrations on the basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations, trend modeling and many other workflows available at [Python Demos](https://github.com/GeostatsGuy/PythonNumericalDemos) and a Python package for data analytics and geostatistics at [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy). \n",
" \n",
"I hope this was helpful,\n",
"\n",
"*Michael*\n",
"\n",
"#### The Author:\n",
"\n",
"### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
"*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n",
"\n",
"With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n",
"\n",
"For more about Michael check out these links:\n",
"\n",
"#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
"\n",
"#### Want to Work Together?\n",
"\n",
"I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n",
"\n",
"* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n",
"\n",
"* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n",
"\n",
"* I can be reached at mpyrcz@austin.utexas.edu.\n",
"\n",
"I'm always happy to discuss,\n",
"\n",
"*Michael*\n",
"\n",
"Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n",
"\n",
"#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}