{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Interactive Distribution Transformations in Python \n", "\n", "\n", "#### Michael Pyrcz, Professor, The University of Texas at Austin \n", "\n", "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature Engineering with Distribution Transformations\n", "\n", "Here are 2 interactive workflows demonstrating distribution transformations, a common, essential tool for feature engineering for predictive machine learning.\n", "\n", "I have recorded a walk-through of this interactive dashboard in my [Data Science Interactive Python Demonstrations](https://www.youtube.com/playlist?list=PLG19vXLQHvSDy26fM3hDLg3VCU7U5BGZl) series on my [YouTube](https://www.youtube.com/@GeostatsGuyLectures) channel.\n", "\n", "* Join me for a walk-through of this dashboard [04 Data Science Interactive: Distribution Transformations](TBD). I'm stoked to guide you and share observations and things to try out! \n", "\n", "* I have a lecture on [Distribution Transformation](https://www.youtube.com/watch?v=ZDIpE3OkAIU&list=PLG19vXLQHvSB-D4XKYieEku9GQMQyAzjJ&index=14) as part of my [Data Analytics and Geostatistics](https://www.youtube.com/playlist?list=PLG19vXLQHvSB-D4XKYieEku9GQMQyAzjJ) course. 
Note, for all my recorded lectures the interactive and well-documented workflow demonstrations are available on my GitHub repository [GeostatsGuy's Python Numerical Demos](https://github.com/GeostatsGuy/PythonNumericalDemos).\n", "\n", "* Also, I have a lecture on [Feature Transformations for Machine Learning](https://www.youtube.com/watch?v=6QJjZoWknEI&list=PLG19vXLQHvSC2ZKFIkgVpI9fCjkN38kwf&index=9) as part of my [Machine Learning](https://www.youtube.com/playlist?list=PLG19vXLQHvSC2ZKFIkgVpI9fCjkN38kwf) course.\n", "\n", "* Finally, I have a lecture on [Q-Q plots](https://www.youtube.com/watch?v=RETZus4XBNM&list=PLG19vXLQHvSB-D4XKYieEku9GQMQyAzjJ&index=23) as part of my [Data Analytics and Geostatistics](https://www.youtube.com/playlist?list=PLG19vXLQHvSB-D4XKYieEku9GQMQyAzjJ) course.\n", "\n", "#### Distribution Transformations Motivation\n", "\n", "With distribution transformation we map our feature to a new feature with a new distribution, e.g., we could transform our sample data to be Gaussian distributed. \n", "\n", "Why do we do this? Here are some reasons for distribution transformation in data science:\n", "\n", "* **Inference**: the feature distribution has an expected shape from theory, so we are adding important information\n", "\n", "* **Data Preparation / Cleaning**: correcting for data paucity, data noise and data outliers\n", "\n", "* **Theory**: a specific distribution assumption is required for a method, so we correct the data to have this distribution to avoid invalidating an assumption\n", "\n", "#### Distribution Transformations\n", "\n", "How do we do it? 
We apply the following to all sample data, $x_{\\alpha}$ $\\forall$ $\\alpha = 1,\\ldots,n$.\n", "\n", "\\begin{equation}\n", "y_{\\alpha} = G^{-1}_Y\\left(F_X(x_{\\alpha})\\right)\n", "\\end{equation}\n", "\n", "where $X$ is the original feature with cumulative distribution function $F_X$, and $Y$ is the transformed feature with cumulative distribution function $G_Y$.\n", "\n", "Let's summarize this operation:\n", "\n", "* Mapping from one distribution to another through percentiles\n", "\n", "* This may be applied to any parametric or nonparametric distributions\n", "\n", "* This is a rank preserving transform, e.g. P50 of 𝑋 is P50 of 𝑌\n", "\n", "#### Distribution Transformation Dashboards\n", "\n", "To demonstrate distribution transformations, I built out 2 dashboards:\n", "\n", "1. data to a nonparametric distribution (to match the distribution of another dataset)\n", "2. data to a parametric distribution (to a Gaussian distribution)\n", "\n", "#### Getting Started\n", "\n", "Here are the steps to get set up in Python with the GeostatsPy package:\n", "\n", "1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). \n", "\n", "That's all!\n", "\n", "#### Load the Required Libraries\n", "\n", "We will also need some standard Python packages. These should have been installed with Anaconda 3." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np # ndarrays for gridded data\n", "import pandas as pd # DataFrames for tabular data\n", "import matplotlib.pyplot as plt # plotting\n", "from scipy import stats # summary statistics\n", "import math # trigonometry etc.\n", "import random # random numbers\n", "from scipy.stats import norm # Gaussian parametric distribution\n", "import geostatspy.GSLIB as GSLIB\n", "from ipywidgets import interactive # widgets and interactivity\n", "from ipywidgets import widgets \n", "from ipywidgets import Layout\n", "from ipywidgets import Label\n", "from ipywidgets import VBox, HBox\n", "import warnings\n", "warnings.filterwarnings('ignore') # suppress warnings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Set the Random Number Seed\n", "\n", "Set the random number seed so that we have a repeatable workflow." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "seed = 73073\n", "np.random.seed(seed) # seed NumPy's global random number generator so the added noise is repeatable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Loading Tabular Data\n", "\n", "Here's the command to load our comma delimited data file into a pandas DataFrame object. For fun try misspelling the name. You will get an ugly, long error. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "data_url = \"https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/sample_data.csv\"\n", "df = pd.read_csv(data_url) # load our data table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It worked, we loaded our file into our DataFrame called 'df'. But how do you really know that it worked? Visualizing the DataFrame would be useful and we already learned about these methods in this demo (https://git.io/fNgRW). \n", "\n", "We can preview the DataFrame by printing a slice or by utilizing the 'head' DataFrame member function (with a nice and clean format, see below). 
With the slice we could look at any subset of the data table and with the head command, add parameter 'n=13' to see the first 13 rows of the dataset. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>X</th><th>Y</th><th>Facies</th><th>Porosity</th><th>Perm</th><th>AI</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>100.0</td><td>900.0</td><td>1.0</td><td>0.100187</td><td>1.363890</td><td>5110.699751</td></tr>\n",
"<tr><th>1</th><td>100.0</td><td>800.0</td><td>0.0</td><td>0.107947</td><td>12.576845</td><td>4671.458560</td></tr>\n",
"<tr><th>2</th><td>100.0</td><td>700.0</td><td>0.0</td><td>0.085357</td><td>5.984520</td><td>6127.548006</td></tr>\n",
"<tr><th>3</th><td>100.0</td><td>600.0</td><td>0.0</td><td>0.108460</td><td>2.446678</td><td>5201.637996</td></tr>\n",
"<tr><th>4</th><td>100.0</td><td>500.0</td><td>0.0</td><td>0.102468</td><td>1.952264</td><td>3835.270322</td></tr>\n",
"<tr><th>5</th><td>100.0</td><td>400.0</td><td>0.0</td><td>0.110579</td><td>3.691908</td><td>5295.267191</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " X Y Facies Porosity Perm AI\n", "0 100.0 900.0 1.0 0.100187 1.363890 5110.699751\n", "1 100.0 800.0 0.0 0.107947 12.576845 4671.458560\n", "2 100.0 700.0 0.0 0.085357 5.984520 6127.548006\n", "3 100.0 600.0 0.0 0.108460 2.446678 5201.637996\n", "4 100.0 500.0 0.0 0.102468 1.952264 3835.270322\n", "5 100.0 400.0 0.0 0.110579 3.691908 5295.267191" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(n=6) # we could also use this command for a table preview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Calculating and Plotting a CDF by Hand\n", "\n", "Let's demonstrate the calculation and plotting of a non-parametric CDF by hand\n", "\n", "1. make a copy of the feature as a 1D array (ndarray from NumPy)\n", "2. sort the data in ascending order\n", "3. assign cumulative probabilities based on the tails assumptions\n", "4. plot cumulative probability vs. value" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The ndarray has a shape of (261,).\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "por = df['Porosity'].copy(deep = True).values # make a deepcopy of the feature from the DataFrame\n", "print('The ndarray has a shape of ' + str(por.shape) + '.')\n", "\n", "por = np.sort(por) # sort the data in ascending order\n", "n = por.shape[0] # get the number of data samples\n", "\n", "cprob = np.zeros(n)\n", "for i in range(0,n):\n", " index = i + 1\n", " cprob[i] = index / n # known upper tail\n", " # cprob[i] = (index - 1)/n # known lower tail\n", " # cprob[i] = (index - 1)/(n - 1) # known upper and lower tails\n", " # cprob[i] = index/(n+1) # unknown tails \n", "\n", "plt.subplot(111)\n", "plt.plot(por,cprob, alpha = 0.2, c = 'black') # plot piecewise linear interpolation\n", "plt.scatter(por,cprob,s = 10, alpha = 1.0, c = 'red', edgecolor = 'black') # plot the CDF points\n", "plt.grid(); plt.xlim([0.05,0.25]); plt.ylim([0.0,1.0])\n", "plt.xlabel(\"Porosity (fraction)\"); plt.ylabel(\"Cumulative Probability\"); plt.title(\"Non-parametric Porosity Cumulative Distribution Function\")\n", "\n", "plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.2, wspace=0.1, hspace=0.2)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Distribution Transformation to a Parametric Distribution\n", "\n", "We can transform our data feature distribution to any parametric distribution with this workflow.\n", "\n", "1. Calculate the cumulative probability value of each of our data values, $p_{\\alpha} = F_x(x_\\alpha)$, $\\forall$ $\\alpha = 1,\\ldots, n$.\n", "\n", "2. Apply the inverse of the target parametric cumulative distribution function (CDF) to calculate the transformed values. $y_{\\alpha} = G_y^{-1}\\left(F_x(x_\\alpha)\\right)$, $\\forall$ $\\alpha = 1,\\ldots, n$.\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "y = np.zeros(n)\n", "\n", "for i in range(0,n):\n", " y[i] = norm.ppf(cprob[i],loc=0.0,scale=1.0)\n", "\n", "plt.subplot(121)\n", "plt.plot(por,cprob, alpha = 0.2, c = 'black') # plot piecewise linear interpolation\n", "plt.scatter(por,cprob,s = 10, alpha = 1.0, c = 'red', edgecolor = 'black') # plot the CDF points\n", "plt.grid(); plt.xlim([0.05,0.25]); plt.ylim([0.0,1.0])\n", "plt.xlabel(\"Porosity (fraction)\"); plt.ylabel(\"Cumulative Probability\"); plt.title(\"Non-parametric Porosity Cumulative Distribution Function\")\n", "\n", "plt.subplot(122)\n", "plt.plot(y,cprob, alpha = 0.2, c = 'black') # plot piecewise linear interpolation\n", "plt.scatter(y,cprob,s = 10, alpha = 1.0, c = 'red', edgecolor = 'black') # plot the CDF points\n", "plt.grid(); plt.xlim([-3.0,3.0]); plt.ylim([0.0,1.0])\n", "plt.xlabel(\"Porosity (fraction)\"); plt.ylabel(\"Cumulative Probability\"); plt.title(\"After Distribution Transformation to Gaussian\")\n", "\n", "plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.2, wspace=0.2, hspace=0.2)\n", "plt.show()\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make an interactive version of this plot to visualize the transformation." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# widgets and dashboard\n", "l = widgets.Text(value=' Data Analytics, Distribution Transformation, Prof. 
Michael Pyrcz, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n", "\n", "data_index = widgets.IntSlider(min=1, max = n-1, value=1.0, step = 10.0, description = 'Data Index, $\\\\alpha$',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n", "\n", "ui = widgets.VBox([l,data_index],)\n", "\n", "def run_plot(data_index): # make data, fit models and plot\n", " plt.subplot(131)\n", " plt.plot(por,cprob, alpha = 0.2, c = 'black') # plot piecewise linear interpolation\n", " plt.scatter(por,cprob,s = 10, alpha = 1.0, c = 'red', edgecolor = 'black') # plot the CDF points\n", " plt.grid(); plt.xlim([0.05,0.25]); plt.ylim([0.0,1.0])\n", " plt.xlabel(\"Original Feature, $x$\"); plt.ylabel(\"Cumulative Probability\"); plt.title(\"Original Cumulative Distribution Function\")\n", " plt.plot([por[data_index-1],por[data_index-1]],[0.0,cprob[data_index-1]],color = 'red',linestyle='dashed')\n", " plt.plot([por[data_index-1],3.0],[cprob[data_index-1],cprob[data_index-1]],color = 'red',linestyle='dashed')\n", " plt.annotate('x = ' + str(round(por[data_index-1],2)), xy=(por[data_index-1]+0.003, 0.01))\n", " plt.annotate('p = ' + str(round(cprob[data_index-1],2)), xy=(0.225, cprob[data_index-1]+0.02))\n", "\n", " \n", " plt.subplot(132)\n", " plt.plot(y,cprob, alpha = 0.2, c = 'black') # plot piecewise linear interpolation\n", " plt.scatter(y,cprob,s = 10, alpha = 1.0, c = 'red', edgecolor = 'black') # plot the CDF points\n", " plt.grid(); plt.xlim([-3.0,3.0]); plt.ylim([0.0,1.0])\n", " plt.xlabel(\"Gaussian Transformed Feature, $y$\"); plt.ylabel(\"Cumulative Probability\"); plt.title(\"After Distribution Transformation to Gaussian\")\n", " plt.plot([-3.0,y[data_index-1]],[cprob[data_index-1],cprob[data_index-1]],color = 'red',linestyle='dashed')\n", " plt.plot([y[data_index-1],y[data_index-1]],[0.0,cprob[data_index-1]],color = 'red',linestyle='dashed')\n", " 
#plt.arrow(y[data_index-1],cprob[data_index-1],0.0,-1.0*(cprob[data_index-1]-0.01),color = 'red',width = 0.02, head_width = 0.1, linestyle='dashed', head_length = 0.01)\n", " plt.annotate('p = ' + str(round(cprob[data_index-1],2)), xy=(-2.90, cprob[data_index-1]+0.02)) \n", " plt.annotate('y = ' + str(round(y[data_index-1],2)), xy=(y[data_index-1]+0.1, 0.01))\n", " \n", " plt.subplot(133)\n", " plt.plot(por,y, alpha = 0.2, c = 'black') # plot piecewise linear interpolation\n", " plt.grid(); plt.xlim([0.05,0.25]); plt.ylim([-3.0,3.0])\n", " plt.xlabel(\"Original Porosity (fraction)\"); plt.ylabel(\"Gaussian Transformed Porosity (N[fraction])\"); plt.title(\"Parametric Distribution Transformation, Q-Q Plot\")\n", " #plt.plot([0.05,0.25],[0.05,0.25],color = 'red',linestyle='dashed', alpha = 0.4)\n", " plt.scatter(por[data_index-1],y[data_index-1],s = 50, c = 'red', edgecolor = 'black', alpha = 1.0, zorder=200) # plot the selected Q-Q point\n", " plt.scatter(por,y,s = 20, c = 'red', edgecolor = 'black', alpha = 0.1, zorder=100) # plot all Q-Q points\n", " \n", " plt.subplots_adjust(left=0.0, bottom=0.0, right=3.0, top=1.2, wspace=0.2, hspace=0.2)\n", " plt.show()\n", " \n", "# connect the function to make the samples and plot to the widgets \n", "interactive_plot = widgets.interactive_output(run_plot, {'data_index':data_index})\n", "interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interactive Data Analytics Distribution Transformation Demonstration \n", "\n", "#### Michael Pyrcz, Professor, The University of Texas at Austin \n", "\n", "Select any data value and observe the distribution transform by mapping through cumulative probability.\n", "\n", "#### The Inputs\n", "\n", "* **data_index** - the data index, $\\alpha$, in sorted ascending order, from 1 to $n-1$ in steps of 10; the largest sample is excluded because its cumulative probability of 1.0 has no finite Gaussian value" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { 
"application/vnd.jupyter.widget-view+json": { "model_id": "902f1ea179b24d43beba7f1a4aea9a98", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(Text(value=' Data Analytics, Distribution Tran…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8e02c9196d0d47ffbf9e7bf60f11236d", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output(outputs=({'output_type': 'display_data', 'data': {'text/plain': '
', 'i…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(ui, interactive_plot) # display the interactive plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Distribution Transform to a Non-Parametric Distribution\n", "\n", "We can apply the mapping through cumulative probabilities to transform from any distribution to any other distribution.\n", "\n", "* let's make a new data set by randomly sampling from the previous one and adding error\n", "\n", "Then we can demonstrate transforming this dataset to match the original distribution\n", "\n", "* this is mimicking the situation where we transform a dataset to match the distribution of a better sampled analog distribution\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The sample ndarray has a shape of (30,).\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "n_sample = 30\n", "df_sample = df.sample(n_sample,random_state = seed)\n", " \n", "df_sample = df_sample.copy(deep = True) # make a deepcopy of the feature from the DataFrame\n", "\n", "df_sample['Porosity'] = df_sample['Porosity'].values + np.random.normal(loc = 0.0, scale = 0.01, size = n_sample)\n", "\n", "df_sample = df_sample.sort_values(by = 'Porosity') # sort the DataFrame\n", "por_sample = df_sample['Porosity'].values\n", "print('The sample ndarray has a shape of ' + str(por_sample.shape) + '.')\n", "\n", "cprob_sample = np.zeros(n_sample)\n", "for i in range(0,n_sample):\n", " index = i + 1\n", " cprob_sample[i] = index / n_sample # known upper tail\n", " # cprob[i] = (index - 1)/n # known lower tail\n", " # cprob[i] = (index - 1)/(n - 1) # known upper and lower tails\n", " # cprob[i] = index/(n+1) # unknown tails \n", "\n", "plt.subplot(121)\n", "plt.plot(por_sample,cprob_sample, alpha = 0.2, c = 'black') # plot piecewise linear interpolation\n", "plt.scatter(por_sample,cprob_sample,s = 10, alpha = 1.0, c = 'red', edgecolor = 'black') # plot the CDF points\n", "plt.grid(); plt.xlim([0.05,0.25]); plt.ylim([0.0,1.0])\n", "plt.xlabel(\"Porosity (fraction)\"); plt.ylabel(\"Cumulative Probability\"); plt.title(\"Sparse Sample with Noise Cumulative Distribution Function\")\n", "\n", "plt.subplot(122)\n", "plt.plot(por,cprob, alpha = 0.2, c = 'black') # plot piecewise linear interpolation\n", "plt.scatter(por,cprob,s = 10, alpha = 1.0, c = 'red', edgecolor = 'black') # plot the CDF points\n", "plt.grid(); plt.xlim([0.05,0.25]); plt.ylim([0.0,1.0])\n", "plt.xlabel(\"Porosity (fraction)\"); plt.ylabel(\"Cumulative Probability\"); plt.title(\"Non-parametric Porosity Cumulative Distribution Function\")\n", "\n", "plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.2, wspace=0.2, hspace=0.2)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "Let's transform the values and show them on the target distribution." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "y_sample = np.zeros(n_sample)\n", "\n", "for i in range(0,n_sample):\n", " y_sample[i] = np.percentile(por,cprob_sample[i]*100, interpolation = 'linear') # piecewise linear interpolation of inverse of target CDF \n", " \n", "plt.subplot(121)\n", "plt.plot(por_sample,cprob_sample, alpha = 0.2, c = 'black') # plot piecewise linear interpolation\n", "plt.scatter(por_sample,cprob_sample,s = 30, alpha = 1.0, c = 'green', edgecolor = 'black', zorder = 100) # plot the CDF points\n", "plt.grid(); plt.xlim([0.05,0.25]); plt.ylim([0.0,1.0])\n", "plt.xlabel(\"Porosity (fraction)\"); plt.ylabel(\"Cumulative Probability\"); plt.title(\"Sparse Sample with Noise Cumulative Distribution Function\")\n", "\n", "plt.subplot(122)\n", "plt.plot(por,cprob, alpha = 0.2, c = 'black') # plot piecewise linear interpolation\n", "plt.scatter(por,cprob,s = 10, c = 'red', edgecolor = 'black', alpha = 0.3) # plot the CDF points\n", "plt.scatter(y_sample,cprob_sample,s = 30, c = 'green', edgecolor = 'black', alpha = 1.0, zorder = 100) # plot the CDF points\n", "plt.grid(); plt.xlim([0.05,0.25]); plt.ylim([0.0,1.0])\n", "plt.xlabel(\"Porosity (fraction)\"); plt.ylabel(\"Cumulative Probability\"); plt.title(\"Non-parametric Porosity Cumulative Distribution Function\")\n", "\n", "plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.2, wspace=0.2, hspace=0.2)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make an interactive version of this plot to visualize the transformation." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# widgets and dashboard\n", "l_sample = widgets.Text(value=' Data Analytics, Distribution Transformation, Prof. 
Michael Pyrcz, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n", "\n", "data_index_sample = widgets.IntSlider(min=1, max = n_sample, value=1.0, step = 1.0, description = 'Data Sample Index, $\\\\beta$',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n", "\n", "ui_sample = widgets.VBox([l_sample,data_index_sample],)\n", "\n", "def run_plot_sample(data_index_sample): # make data, fit models and plot\n", " plt.subplot(131)\n", " plt.plot(por_sample,cprob_sample, alpha = 0.2, c = 'black') # plot piecewise linear interpolation\n", " plt.scatter(por_sample,cprob_sample,s = 30, alpha = 1.0, c = 'green', edgecolor = 'black',zorder = 100) # plot the CDF points\n", " plt.grid(); plt.xlim([0.05,0.25]); plt.ylim([0.0,1.0])\n", " plt.xlabel(\"Porosity (fraction)\"); plt.ylabel(\"Cumulative Probability\"); plt.title(\"Original Sparse Sample with Noise, Cumulative Distribution Function\")\n", " plt.plot([por_sample[data_index_sample-1],por_sample[data_index_sample-1]],[0.0,cprob_sample[data_index_sample-1]],color = 'red',linestyle='dashed')\n", " plt.plot([por_sample[data_index_sample-1],3.0],[cprob_sample[data_index_sample-1],cprob_sample[data_index_sample-1]],color = 'red',linestyle='dashed')\n", " plt.annotate('x = ' + str(round(por_sample[data_index_sample-1],2)), xy=(por_sample[data_index_sample-1]+0.003, 0.01))\n", " plt.annotate('p = ' + str(round(cprob_sample[data_index_sample-1],2)), xy=(0.225, cprob_sample[data_index_sample-1]+0.02))\n", " \n", " plt.subplot(132)\n", " plt.plot(por,cprob, alpha = 0.2, c = 'black') # plot piecewise linear interpolation\n", " plt.scatter(por,cprob,s = 10, c = 'red', edgecolor = 'black', alpha = 1.0) # plot the CDF points\n", " plt.grid(); plt.xlim([0.05,0.25]); plt.ylim([0.0,1.0])\n", " plt.xlabel(\"Porosity (fraction)\"); plt.ylabel(\"Cumulative Probability\"); plt.title(\"Non-parametric Target Porosity Cumulative Distribution Function\")\n", " 
plt.plot([0.0,y_sample[data_index_sample-1]],[cprob_sample[data_index_sample-1],cprob_sample[data_index_sample-1]],color = 'red',linestyle='dashed')\n", " plt.plot([y_sample[data_index_sample-1],y_sample[data_index_sample-1]],[0.0,cprob_sample[data_index_sample-1]],color = 'red',linestyle='dashed')\n", " plt.annotate('p = ' + str(round(cprob_sample[data_index_sample-1],2)), xy=(0.053, cprob_sample[data_index_sample-1]+0.02)) \n", " plt.annotate('y = ' + str(round(y_sample[data_index_sample-1],2)), xy=(y_sample[data_index_sample-1]+0.003, 0.01))\n", " plt.scatter(y_sample[data_index_sample-1],cprob_sample[data_index_sample-1],s = 50, c = 'green', edgecolor = 'black', alpha = 1.0, zorder=100) # plot the CDF points\n", " \n", " plt.subplot(133)\n", " plt.plot(por_sample,y_sample, alpha = 0.2, c = 'black') # plot piecewise linear interpolation\n", " plt.grid(); plt.xlim([0.05,0.25]); plt.ylim([0.05,0.25])\n", " plt.xlabel(\"Original Porosity (fraction)\"); plt.ylabel(\"Transformed Porosity (fraction)\"); plt.title(\"Non-parametric Distribution Transformation, Q-Q Plot\")\n", " plt.plot([0.05,0.25],[0.05,0.25],color = 'red',linestyle='dashed', alpha = 0.4)\n", " plt.scatter(por_sample[data_index_sample-1],y_sample[data_index_sample-1],s = 50, c = 'green', edgecolor = 'black', alpha = 1.0, zorder=200) # plot the CDF points\n", " plt.scatter(por_sample,y_sample,s = 20, c = 'green', edgecolor = 'black', alpha = 0.3, zorder=100) # plot the CDF points\n", " \n", " plt.subplots_adjust(left=0.0, bottom=0.0, right=3.0, top=1.2, wspace=0.2, hspace=0.2)\n", " plt.show()\n", " \n", " \n", " \n", "# connect the function to make the samples and plot to the widgets \n", "interactive_plot_s = widgets.interactive_output(run_plot_sample, {'data_index_sample':data_index_sample})\n", "#interactive_plot_sample.clear_output(wait = True) # reduce flickering by delaying plot updating" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interactive Data Analytics Distribution 
Transformation Demonstration \n", "\n", "#### Michael Pyrcz, Professor, The University of Texas at Austin \n", "\n", "Select any data value and observe the distribution transform by mapping through cumulative probability.\n", "\n", "#### The Inputs\n", "\n", "* **data_index_sample** - the sample data index, $\\beta$, from 1 to $n_{sample}$ in sorted ascending order" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b5d144bc40844eab96e1b7e6c63b5ed2", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(Text(value=' Data Analytics, Distribution Tran…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "71a295d8760748d0a61403e72091ec91", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output(outputs=({'output_type': 'display_data', 'data': {'text/plain': '
', 'i…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(ui_sample, interactive_plot_s) # display the interactive plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To summarize, let's look at a DataFrame with the original noisy sample and the values transformed to match the original distribution.\n", "\n", "* we're making and showing a table of original values, $x_{\\beta}$ $\\forall$ $\\beta = 1, \\ldots, n_{sample}$, and the transformed values, $y_{\\beta}$ $\\forall$ $\\beta = 1, \\ldots, n_{sample}$.\n", "\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>X</th><th>Y</th><th>Facies</th><th>Porosity</th><th>Perm</th><th>AI</th><th>Transformed_Por</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>207</th><td>201.0</td><td>426.0</td><td>0.0</td><td>0.089195</td><td>0.400658</td><td>5263.542112</td><td>0.081044</td></tr>\n",
"<tr><th>189</th><td>201.0</td><td>456.0</td><td>0.0</td><td>0.094670</td><td>0.546396</td><td>5018.355476</td><td>0.085867</td></tr>\n",
"<tr><th>80</th><td>900.0</td><td>100.0</td><td>0.0</td><td>0.096598</td><td>1.280257</td><td>4573.656072</td><td>0.091834</td></tr>\n",
"<tr><th>47</th><td>600.0</td><td>700.0</td><td>0.0</td><td>0.101975</td><td>12.384496</td><td>3595.586977</td><td>0.095628</td></tr>\n",
"<tr><th>3</th><td>100.0</td><td>600.0</td><td>0.0</td><td>0.102261</td><td>2.446678</td><td>5201.637996</td><td>0.099378</td></tr>\n",
"<tr><th>226</th><td>211.0</td><td>396.0</td><td>0.0</td><td>0.103762</td><td>6.368529</td><td>5725.334803</td><td>0.101987</td></tr>\n",
"<tr><th>5</th><td>100.0</td><td>400.0</td><td>0.0</td><td>0.108687</td><td>3.691908</td><td>5295.267191</td><td>0.103970</td></tr>\n",
"<tr><th>218</th><td>251.0</td><td>416.0</td><td>0.0</td><td>0.109159</td><td>1.003374</td><td>5822.467914</td><td>0.106704</td></tr>\n",
"<tr><th>72</th><td>900.0</td><td>900.0</td><td>0.0</td><td>0.111040</td><td>12.433996</td><td>6242.704810</td><td>0.108460</td></tr>\n",
"<tr><th>41</th><td>500.0</td><td>400.0</td><td>0.0</td><td>0.112703</td><td>6.312198</td><td>5515.918646</td><td>0.111491</td></tr>\n",
"<tr><th>210</th><td>231.0</td><td>426.0</td><td>0.0</td><td>0.116965</td><td>5.584040</td><td>4919.074871</td><td>0.113887</td></tr>\n",
"<tr><th>71</th><td>800.0</td><td>100.0</td><td>0.0</td><td>0.131186</td><td>7.739105</td><td>5274.532660</td><td>0.117984</td></tr>\n",
"<tr><th>53</th><td>600.0</td><td>100.0</td><td>1.0</td><td>0.143861</td><td>42.396044</td><td>4204.150893</td><td>0.121592</td></tr>\n",
"<tr><th>245</th><td>690.0</td><td>529.0</td><td>1.0</td><td>0.172408</td><td>316.905689</td><td>4271.013148</td><td>0.127131</td></tr>\n",
"<tr><th>84</th><td>955.0</td><td>559.0</td><td>1.0</td><td>0.183562</td><td>74.215058</td><td>3386.182722</td><td>0.137062</td></tr>\n",
"<tr><th>150</th><td>985.0</td><td>489.0</td><td>1.0</td><td>0.185171</td><td>73.133040</td><td>2672.294567</td><td>0.153759</td></tr>\n",
"<tr><th>158</th><td>975.0</td><td>479.0</td><td>1.0</td><td>0.185172</td><td>69.336576</td><td>2493.128177</td><td>0.176536</td></tr>\n",
"<tr><th>93</th><td>955.0</td><td>549.0</td><td>1.0</td><td>0.186417</td><td>374.298925</td><td>3181.557281</td><td>0.185270</td></tr>\n",
"<tr><th>165</th><td>955.0</td><td>469.0</td><td>1.0</td><td>0.195644</td><td>26.197239</td><td>2889.196647</td><td>0.188136</td></tr>\n",
"<tr><th>149</th><td>975.0</td><td>489.0</td><td>1.0</td><td>0.197035</td><td>21.109085</td><td>2412.875330</td><td>0.191655</td></tr>\n",
"<tr><th>151</th><td>995.0</td><td>489.0</td><td>1.0</td><td>0.198042</td><td>460.494986</td><td>2792.804322</td><td>0.195998</td></tr>\n",
"<tr><th>113</th><td>975.0</td><td>529.0</td><td>1.0</td><td>0.203166</td><td>1548.094062</td><td>3167.185377</td><td>0.198182</td></tr>\n",
"<tr><th>125</th><td>1005.0</td><td>519.0</td><td>1.0</td><td>0.206458</td><td>54.667195</td><td>2577.714678</td><td>0.199315</td></tr>\n",
"<tr><th>181</th><td>935.0</td><td>449.0</td><td>1.0</td><td>0.206652</td><td>368.507601</td><td>4249.477923</td><td>0.201943</td></tr>\n",
"<tr><th>111</th><td>955.0</td><td>529.0</td><td>1.0</td><td>0.209992</td><td>1113.971076</td><td>3177.635737</td><td>0.206934</td></tr>\n",
"<tr><th>136</th><td>935.0</td><td>499.0</td><td>1.0</td><td>0.212483</td><td>523.287810</td><td>2579.032897</td><td>0.209051</td></tr>\n",
"<tr><th>99</th><td>925.0</td><td>539.0</td><td>1.0</td><td>0.213423</td><td>211.163296</td><td>3442.885245</td><td>0.211638</td></tr>\n",
"<tr><th>129</th><td>955.0</td><td>509.0</td><td>1.0</td><td>0.220778</td><td>1525.247066</td><td>2512.061434</td><td>0.215686</td></tr>\n",
"<tr><th>174</th><td>955.0</td><td>459.0</td><td>1.0</td><td>0.221159</td><td>45.002088</td><td>3394.563038</td><td>0.224214</td></tr>\n",
"<tr><th>141</th><td>985.0</td><td>499.0</td><td>1.0</td><td>0.224078</td><td>68.276148</td><td>2547.526113</td><td>0.242298</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " X Y Facies Porosity Perm AI \\\n", "207 201.0 426.0 0.0 0.089195 0.400658 5263.542112 \n", "189 201.0 456.0 0.0 0.094670 0.546396 5018.355476 \n", "80 900.0 100.0 0.0 0.096598 1.280257 4573.656072 \n", "47 600.0 700.0 0.0 0.101975 12.384496 3595.586977 \n", "3 100.0 600.0 0.0 0.102261 2.446678 5201.637996 \n", "226 211.0 396.0 0.0 0.103762 6.368529 5725.334803 \n", "5 100.0 400.0 0.0 0.108687 3.691908 5295.267191 \n", "218 251.0 416.0 0.0 0.109159 1.003374 5822.467914 \n", "72 900.0 900.0 0.0 0.111040 12.433996 6242.704810 \n", "41 500.0 400.0 0.0 0.112703 6.312198 5515.918646 \n", "210 231.0 426.0 0.0 0.116965 5.584040 4919.074871 \n", "71 800.0 100.0 0.0 0.131186 7.739105 5274.532660 \n", "53 600.0 100.0 1.0 0.143861 42.396044 4204.150893 \n", "245 690.0 529.0 1.0 0.172408 316.905689 4271.013148 \n", "84 955.0 559.0 1.0 0.183562 74.215058 3386.182722 \n", "150 985.0 489.0 1.0 0.185171 73.133040 2672.294567 \n", "158 975.0 479.0 1.0 0.185172 69.336576 2493.128177 \n", "93 955.0 549.0 1.0 0.186417 374.298925 3181.557281 \n", "165 955.0 469.0 1.0 0.195644 26.197239 2889.196647 \n", "149 975.0 489.0 1.0 0.197035 21.109085 2412.875330 \n", "151 995.0 489.0 1.0 0.198042 460.494986 2792.804322 \n", "113 975.0 529.0 1.0 0.203166 1548.094062 3167.185377 \n", "125 1005.0 519.0 1.0 0.206458 54.667195 2577.714678 \n", "181 935.0 449.0 1.0 0.206652 368.507601 4249.477923 \n", "111 955.0 529.0 1.0 0.209992 1113.971076 3177.635737 \n", "136 935.0 499.0 1.0 0.212483 523.287810 2579.032897 \n", "99 925.0 539.0 1.0 0.213423 211.163296 3442.885245 \n", "129 955.0 509.0 1.0 0.220778 1525.247066 2512.061434 \n", "174 955.0 459.0 1.0 0.221159 45.002088 3394.563038 \n", "141 985.0 499.0 1.0 0.224078 68.276148 2547.526113 \n", "\n", " Transformed_Por \n", "207 0.081044 \n", "189 0.085867 \n", "80 0.091834 \n", "47 0.095628 \n", "3 0.099378 \n", "226 0.101987 \n", "5 0.103970 \n", "218 0.106704 \n", "72 0.108460 \n", "41 0.111491 \n", "210 0.113887 \n", "71 
0.117984 \n", "53 0.121592 \n", "245 0.127131 \n", "84 0.137062 \n", "150 0.153759 \n", "158 0.176536 \n", "93 0.185270 \n", "165 0.188136 \n", "149 0.191655 \n", "151 0.195998 \n", "113 0.198182 \n", "125 0.199315 \n", "181 0.201943 \n", "111 0.206934 \n", "136 0.209051 \n", "99 0.211638 \n", "129 0.215686 \n", "174 0.224214 \n", "141 0.242298 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_sample['Transformed_Por'] = y_sample\n", "df_sample.head(n=n_sample)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It would be straightforward to modify the code above to perform distribution transformations:\n", "\n", "* to a parametric distribution like Gaussian\n", "\n", "* to a non-parametric distribution from actual data (build a CDF and interpolate between the data samples)\n", "\n", "#### Comments\n", "\n", "This was a basic demonstration of distribution transformations. \n", "\n", "I have other demonstrations on the basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations, trend modeling and many other workflows available at [Python Demos](https://github.com/GeostatsGuy/PythonNumericalDemos) and a Python package for data analytics and geostatistics at [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy). \n", " \n", "I hope this was helpful,\n", "\n", "*Michael*\n", "\n", "#### The Author:\n", "\n", "### Michael Pyrcz, Associate Professor, University of Texas at Austin \n", "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n", "\n", "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. 
\n", "\n", "For more about Michael check out these links:\n", "\n", "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n", "\n", "#### Want to Work Together?\n", "\n", "I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n", "\n", "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n", "\n", "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n", "\n", "* I can be reached at mpyrcz@austin.utexas.edu.\n", "\n", "I'm always happy to discuss,\n", "\n", "*Michael*\n", "\n", "Michael Pyrcz, Ph.D., P.Eng. 
Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 2 }