\n",
"\n",
"## Interactive k-Means Clustering from Inferential Machine Learning\n",
"\n",
"\n",
"### Michael Pyrcz, Professor, University of Texas at Austin \n",
"\n",
"##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
"\n",
"\n",
"### The Interactive Workflow\n",
"\n",
"Here's an interative workflow for demonstrating the behavoir of k-means clustering's solution heuristic. This solution heuristic proposed by Hartigan and Wong (1979) iterates over these steps of:\n",
"\n",
"* assign initial random prototype with labels\n",
"\n",
"* assign samples to the nearest prototype label\n",
"\n",
"* update prototype based on centroids of samples belonging to this prototype\n",
"\n",
"* iterate until no sample assignments change\n",
"\n",
"This is a solution heuristic that avoids exploration of the entire possible solution space, $n_{sol}$, defined as:\n",
"\n",
"\\begin{equation}\n",
"n_{sol} = K^n\n",
"\\end{equation}\n",
"\n",
"where $K$ is the number of clusters (categories) and $n$ is the number of data. When I teach this heuristic I state that it is practical, well-behaved and efficient. We avoid local minimums with a few multiple starts (change the seed and assign new initial prototypes).\n",
"\n",
"But how well does this heuristic perform? Hartigan and Wong (1979) state that their heuristic finds local minimum at which point \"no movemoment of a point from one cluster to another will reduce the within-cluster sum of squares\". But, does it find the global minimum? Is it stable? How often does it get stuck in a 'bad' local minimum? \n",
"\n",
"* let's use interactive Python with MatPlotLib to build some models and observe the results over multiple seeds!\n",
"\n",
"#### k-Means Clustering\n",
"\n",
"The K-means clustering approach is primaryly applied as an unsupervised method for classification:\n",
"\n",
"* **Prototype Method** - represents the training data with number of synthetic cases in the features space. For K-means clustering we assign and iteratively update $K$ prototypes.\n",
"\n",
"* **Iterative Solution** - the initial prototypes are assigned randomly in the feature space, the labels for each training sample are updated to the nearest prototype, then the prototypes are adjusted to the centroid of their assigned training data, repeat until there is no further update to the training data assignments.\n",
"\n",
"* **Unsupervised Learning** - the training data are not labeled and are assigned $K$ labels based on their proximity to the prototypes in the feature space. The idea is that similar things, proximity in feature space, should belong to the same category. \n",
"\n",
"* **Feature Weighting** - the procedure depends on the 'distance' between training samples and prototypes in feature space. Distance is treated as the 'inverse' of similarity. If the features have significantly different magnitudes, the feature(s) with the largest magnitudes and ranges will dominate the process. One approach is to sandardize / normalize the variables. Also, by-feature weighting may be applied. In this demonstration we normalize the features to range from 0.0 to 1.0.\n",
"\n",
"* Supervised Learning Variant for Classification of the Feature Space - applies multiple prototypes in each category to then constructs a decision boundary based on nearest prototype. More prototypes per category results in a more complicated decision boundary in the feature space. \n",
"\n",
"See my YouTube channel for lectures on [k-means clustering](https://www.youtube.com/watch?v=oFE10cLl0Fs&list=PLG19vXLQHvSC2ZKFIkgVpI9fCjkN38kwf&index=13). \n",
"\n",
"#### Getting Started\n",
"\n",
"Here's the steps to get setup in Python with the GeostatsPy package:\n",
"\n",
"1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). \n",
"2. From Anaconda Navigator (within Anaconda3 group), go to the environment tab, click on base (root) green arrow and open a terminal. \n",
"3. In the terminal type: pip install geostatspy. \n",
"4. Open Jupyter and in the top block get started by copy and pasting the code block below from this Jupyter Notebook to start using the geostatspy functionality. \n",
"\n",
"You will need to copy the data file to your working directory. They are available here:\n",
"\n",
"* Tabular data - unconv_MV.csv at https://git.io/fjmBH.\n",
"\n",
"There are exampled below with these functions. You can go here to see a list of the available functions, https://git.io/fh4eX, other example workflows and source code. \n",
"\n",
"#### Install Packages\n",
"\n",
"We will include the standard packages for DataFrames and ndarrays and add sci-kit-learn (sklearn) for machine learning."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import os # to set current working directory \n",
"import sys # supress output to screen for interactive variogram modeling\n",
"import io\n",
"import numpy as np # arrays and matrix math\n",
"import pandas as pd # DataFrames\n",
"import matplotlib.pyplot as plt # plotting\n",
"from sklearn.cluster import KMeans # k-means clustering\n",
"from matplotlib.pyplot import cm # color maps\n",
"from matplotlib.patches import Ellipse # plot an ellipse\n",
"from matplotlib.patches import Rectangle\n",
"import math # sqrt operator\n",
"import alphashape # convex hull for visualization\n",
"from scipy.stats import norm\n",
"from ipywidgets import interactive # widgets and interactivity\n",
"from ipywidgets import widgets \n",
"from ipywidgets import Layout\n",
"from ipywidgets import Label\n",
"from ipywidgets import VBox, HBox\n",
"plt.rc('axes', axisbelow=True) # grid behind plotting elements\n",
"import warnings\n",
"warnings.filterwarnings('ignore') # supress any warnings for this demonstration"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing 'python -m pip install [package-name]'. More assistance is available with the respective package docs. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'1.21.5'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy\n",
"numpy.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Loading Data\n",
"\n",
"Let's load the provided multivariate, spatial dataset '12_sample_data.csv'. It is a comma delimited file with: \n",
"\n",
"* X and Y coordinates ($m$)\n",
"* facies 0 and 1 \n",
"* porosity (fraction)\n",
"* permeability ($mD$)\n",
"* acoustic impedance ($\\frac{kg}{m^3} \\cdot \\frac{m}{s} \\cdot 10^3$). \n",
"\n",
"We load it with the pandas 'read_csv' function into a data frame we called 'df' and then preview it to make sure it loaded correctly.\n",
"\n",
"**Python Tip: using functions from a package** just type the label for the package that we declared at the beginning:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"```\n",
"\n",
"so we can access the pandas function 'read_csv' with the command: \n",
"\n",
"```python\n",
"pd.read_csv()\n",
"```\n",
"\n",
"but read csv has required input parameters. The essential one is the name of the file. For our circumstance all the other default parameters are fine. If you want to see all the possible parameters for this function, just go to the docs [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). \n",
"\n",
"* The docs are always helpful\n",
"* There is often a lot of flexibility for Python functions, possible through using various inputs parameters\n",
"\n",
"also, the program has an output, a pandas DataFrame loaded from the data. So we have to specficy the name / variable representing that new object.\n",
"\n",
"```python\n",
"df = pd.read_csv(\"12_sample_data.csv\") \n",
"```\n",
"\n",
"Let's run this command to load the data and then this command to extract a random subset of the data.\n",
"\n",
"```python\n",
"df = df.sample(frac=.30, random_state = 73073); \n",
"df = df.reset_index()\n",
"```\n",
"\n",
"We do this to reduce the number of data for ease of visualization (hard to see if too many points on our plots)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"df = pd.read_csv('https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/12_sample_data.csv')\n",
"df = df.sample(frac=.30, random_state = 73073); df = df.reset_index() # extract 30% random to reduce the number of data\n",
"\n",
"pormin = df['Porosity'].min(); pormax = df['Porosity'].max()\n",
"AImin = df['AI'].min(); AImax = df['AI'].max()\n",
"\n",
"df['Norm_Porosity'] = (df['Porosity']-pormin)/(pormax - pormin)\n",
"df['Norm_AI'] = (df['AI']-AImin)/(AImax - AImin)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interactive k-Means Clustering Method\n",
"\n",
"The following code includes:\n",
"\n",
"* the dashboard with widgets linked to plots to run k-means clustering over many seeds and number of iterations and to observe the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"l = widgets.Text(value=' Machine Learning k-Means Clustering Heuristic Demo, Prof. Michael Pyrcz, The University of Texas at Austin',\n",
" layout=Layout(width='930px', height='30px'))\n",
"\n",
"K = widgets.IntSlider(min=2, max = 20, value=2, step = 1, description = r'$K$',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n",
"niter = widgets.IntSlider(min=0, max = 1000, value=1, step = 1, description = r'$n_{iter}$',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n",
"seed = widgets.IntSlider(min=101, max = 999, value=777, step = 1, description = r'$seed$',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n",
"\n",
"ui = widgets.HBox([K,niter,seed],)\n",
"ui2 = widgets.VBox([l,ui],)\n",
"\n",
"def run_plot(K,niter,seed):\n",
" np.random.seed(seed=seed)\n",
" cent_init = np.random.rand(K,2)\n",
" plt.subplot(121)\n",
" if niter == 0:\n",
" plt.scatter(df['Norm_Porosity'], df['Norm_AI'], c='black', alpha = 0.4, linewidths=1.0, edgecolors=\"black\",cmap=plt.cm.inferno,vmin=1,vmax=K)\n",
"\n",
" for icent in range(0,K):\n",
" plt.scatter([cent_init[icent,0]],[cent_init[icent,1]],c=icent+1,s=30,alpha = 0.8,linewidths=1.0,\n",
" edgecolors=\"black\",marker='s',cmap=plt.cm.inferno,vmin=1,vmax=K)\n",
" if niter > 0:\n",
" kmeans = KMeans(n_clusters=K,max_iter = niter,init=cent_init,random_state=14, n_init = 1).fit(df.loc[:,['Norm_Porosity','Norm_AI']].values)\n",
" df['kMeans'] = kmeans.labels_ + 1\n",
" cent = kmeans.cluster_centers_ # (n_clusters, n_features)\n",
" inertia = kmeans.inertia_\n",
" \n",
" plt.scatter(df['Norm_Porosity'], df['Norm_AI'], c=df['kMeans'], alpha = 0.4, linewidths=1.0, edgecolors=\"black\",cmap=plt.cm.inferno,vmin=1,vmax=K)\n",
" cent_cur = KMeans(n_clusters=K,max_iter = 1,init=cent_init,random_state=14, n_init = 1).fit(df.loc[:,['Norm_Porosity','Norm_AI']].values).cluster_centers_\n",
" for icent in range(0,K):\n",
" plt.scatter([cent_init[icent,0]],[cent_init[icent,1]],c=icent+1,s=30,alpha = 0.8,linewidths=1.0,\n",
" edgecolors=\"black\",marker='s',cmap=plt.cm.inferno,vmin=1,vmax=K)\n",
" plt.scatter([cent[icent,0]],[cent[icent,1]],c=icent+1,s=200,alpha = 0.8,linewidths=1.0,\n",
" edgecolors=\"black\",marker='*',cmap=plt.cm.inferno,vmin=1,vmax=K,zorder=10)\n",
" plt.plot([cent_init[icent,0],cent_cur[icent,0]],[cent_init[icent,1],cent_cur[icent,1]],\n",
" alpha = 0.4,lw=2.0,ls='--',color='black',zorder=1)\n",
" \n",
" if niter > 1:\n",
" for iters in range(2,niter+1):\n",
" alpha = 0.4 + 0.6*((iters+1)/(niter+1))\n",
" cent_cur = KMeans(n_clusters=K,max_iter = iters,init=cent_init,random_state=14, n_init = 1).fit(df.loc[:,['Norm_Porosity','Norm_AI']].values).cluster_centers_\n",
" cent_prev = KMeans(n_clusters=K,max_iter = iters-1,init=cent_init,random_state=14, n_init = 1).fit(df.loc[:,['Norm_Porosity','Norm_AI']].values).cluster_centers_\n",
" for icent in range(0,K):\n",
" plt.plot([cent_prev[icent,0],cent_cur[icent,0]],[cent_prev[icent,1],cent_cur[icent,1]],\n",
" alpha = 0.4,lw=2.0,ls='--',color='black',zorder=1)\n",
" \n",
" plt.title('k-Means Clusters with Cluster Prototype Iteration'); plt.xlabel('Predictor Feature #1 (normalized)'); plt.ylabel('Predictor Feature #2 (normalized)')\n",
" plt.xlim(0, 1); plt.ylim(0, 1); plt.grid()\n",
" \n",
" nreal = 100; inertias = np.zeros(nreal)\n",
" plt.subplot(122) \n",
"\n",
" if niter > 0:\n",
" plt.vlines(inertia,0,1,color='black',ls='--',lw=2,label='Shown Model')\n",
" for icent in range(0,nreal):\n",
" inertias[icent] = KMeans(n_clusters=K,max_iter = niter,init='random',random_state=14+icent,n_init = 1).fit(df.loc[:,['Norm_Porosity','Norm_AI']].values).inertia_\n",
" \n",
" #plt.hist(inertias,bins=np.linspace(0,5,100),color='red',alpha=0.6,edgecolor='black')\n",
" plt.plot(np.sort(inertias),np.linspace(0,1,nreal),color='red',lw=4,alpha=0.6,label='Realizations')\n",
" plt.legend(loc='lower right')\n",
" plt.xlim([0,12]); plt.ylim([0,1]); plt.xlabel('k-Means Clustering Inertia'); plt.ylabel('Cumulative Probability')\n",
" plt.grid(); plt.title('k-Means Heurstic Performance Over Many Random Starts')\n",
" plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.1, wspace=0.2, hspace=0.2)\n",
" \n",
"# connect the function to make the samples and plot to the widgets \n",
"interactive_plot = widgets.interactive_output(run_plot, {'K':K,'niter':niter,'seed':seed})\n",
"interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interactive k-Means Clustering Heuristic Demonstration\n",
"\n",
"* select $K$, random number $seed$ and step over the iterations of prototype location, observe the CDF of clustering inertia vs. the demonstrated case.\n",
"\n",
"#### Michael Pyrcz, Professor, University of Texas at Austin \n",
"\n",
"##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
"\n",
"### The Inputs\n",
"\n",
"Observe the k-means clustering soultion heuristic, e.g., step over iterations and change the random seed:\n",
"\n",
"* $K$: number of k-means clusters, $n_{iter}$: number of iterations, $seed$: random number seed"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"display(ui2, interactive_plot) # display the interactive plot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Improved Summary Over Many Realizations to Explore Solution Uniqueness and Stability\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"l = widgets.Text(value=' Machine Learning k-Means Clustering Heuristic Demo, Prof. Michael Pyrcz, The University of Texas at Austin',\n",
" layout=Layout(width='930px', height='30px'))\n",
"\n",
"K_summary = widgets.IntSlider(min=2, max = 20, value=2, step = 1, description = r'$K$',orientation='horizontal', \n",
" style = {'description_width': 'initial'},layout=Layout(width='200px', height='30px'),continuous_update=False)\n",
"niter_summary = widgets.IntSlider(min=1, max = 500, value=1, step = 1, description = r'$n_{iter}$',orientation='horizontal', \n",
" style = {'description_width': 'initial'},layout=Layout(width='200px', height='30px'), continuous_update=False)\n",
"seed_summary = widgets.IntSlider(min=101, max = 999, value=777, step = 1, description = r'$seed$',orientation='horizontal', \n",
" style = {'description_width': 'initial'},layout=Layout(width='200px', height='30px'), continuous_update=False)\n",
"ease_summary = widgets.FloatSlider(min=0.0,max=1.0,value=0.0,step=0.1,description = r'$ease$',orientation='horizontal', \n",
" style = {'description_width': 'initial'},layout=Layout(width='200px', height='30px'), continuous_update=False)\n",
"\n",
"ui_summary = widgets.HBox([K_summary,niter_summary,seed_summary,ease_summary],)\n",
"ui2_summary = widgets.VBox([l,ui_summary],)\n",
"\n",
"#from sklearn.cluster import MeanShift, estimate_bandwidth\n",
"from sklearn.cluster import DBSCAN\n",
"from matplotlib import colors\n",
"from scipy.stats import rankdata\n",
"\n",
"def run_plot_summary(K_summary,niter_summary,seed_summary,ease_summary):\n",
" np.random.seed(seed=seed_summary)\n",
" nreal = 40; max_sol = 5\n",
" cent = np.zeros([nreal,K_summary,2])\n",
" inertia = np.zeros([nreal]); inertia_cluster = np.zeros([nreal])\n",
" \n",
" # Modify the data\n",
" df_copy = df.copy(deep = 'True')\n",
" data_kmeans = KMeans(n_clusters=K_summary,init='random',max_iter = 2,random_state=12,n_init = 1).fit(df.loc[:,['Norm_Porosity','Norm_AI']].values)\n",
" data_cent = data_kmeans.cluster_centers_\n",
" #print(data_cent);\n",
" data_labels = data_kmeans.labels_\n",
" #print(data_labels)\n",
" for idata in range(0,len(df)):\n",
" px = data_cent[data_labels[idata],0]; py = data_cent[data_labels[idata],1] \n",
" #print(px,py)\n",
" df_copy.loc[idata,'Norm_Porosity'] = (1-ease_summary)*df.loc[idata,'Norm_Porosity'] + ease_summary*px\n",
" df_copy.loc[idata,'Norm_AI'] = (1-ease_summary)*df.loc[idata,'Norm_AI'] + ease_summary*py\n",
" #print (df)\n",
" \n",
" plt.subplot(121)\n",
" plt.scatter(df_copy['Norm_Porosity'], df_copy['Norm_AI'], c='white', alpha = 0.3, linewidths=1.0, edgecolors=\"black\")\n",
"\n",
" for ireal in range(0,nreal):\n",
" kmeans = KMeans(n_clusters=K_summary,init='random',max_iter = niter_summary,random_state=13+ireal,n_init = 1).fit(df_copy.loc[:,['Norm_Porosity','Norm_AI']].values)\n",
" cur_cent = kmeans.cluster_centers_\n",
" for k in range(0,K_summary):\n",
" cent[ireal,k,0] = cur_cent[k,0]; cent[ireal,k,1] = cur_cent[k,1]\n",
" inertia[ireal] = kmeans.inertia_\n",
"# bandwidth = max(estimate_bandwidth(inertia.reshape(-1,1), quantile=0.2, n_samples=nreal),0.001)\n",
"# mean_shift = MeanShift(bandwidth=bandwidth, bin_seeding=True)\n",
"# mean_shift.fit(inertia.reshape(-1,1))\n",
" dbscan = DBSCAN(eps=0.1, min_samples=2).fit(inertia.reshape(-1,1))\n",
" #cent_flatten = np.reshape(cent,[nreal,K_summary*2])\n",
" #print(cent_flatten)\n",
" #dbscan = DBSCAN(eps=0.02, min_samples=10).fit(cent_flatten)\n",
" inertia_cluster = dbscan.labels_+1\n",
" \n",
" #nsolution = int(np.max(inertia_cluster))+1\n",
" nsolution = len(np.unique(inertia_cluster))\n",
" if nsolution == 1:\n",
" inertia_cluster = np.zeros(nreal)\n",
" else:\n",
" #inertia_cluster = np.round(nsolution*(inertia_cluster - np.min(inertia_cluster))/(np.max(inertia_cluster)-np.min(inertia_cluster)),0)\n",
" inertia_cluster = rankdata(inertia_cluster,method = 'dense') - 1\n",
" #print(nsolution)\n",
" #print(inertia_cluster)\n",
" #cluster_centers = ms.cluster_centers_\n",
" \n",
" #bounds = np.linspace(0,nsolution,nsolution+1)\n",
" bounds = np.linspace(0,max_sol,max_sol+1)\n",
" norm = colors.BoundaryNorm(bounds, plt.cm.inferno.N)\n",
"\n",
" for ireal in range(0,nreal): \n",
" for k in range(0,K_summary):\n",
" if inertia_cluster[ireal] >= 0:\n",
" im = plt.scatter([cent[ireal,k,0]],[cent[ireal,k,1]],c=inertia_cluster[ireal],s=30,alpha = 0.6,linewidths=1.0,\n",
" edgecolors=\"black\",marker='s',norm=norm,cmap=plt.cm.inferno)\n",
" else:\n",
" im = plt.scatter([cent[ireal,k,0]],[cent[ireal,k,1]],s=30,alpha = 0.6,linewidths=1.0,\n",
" c=\"black\",marker='x')\n",
" \n",
" ireal = 7\n",
"# for k in range(0,K_summary):\n",
"# plt.scatter([cent[ireal,k,0]],[cent[ireal,k,1]],c='black',s=90,alpha = 0.6,linewidths=1.0,\n",
"# edgecolors=\"black\",marker='*',cmap=plt.cm.inferno)\n",
" \n",
" plt.title('k-Means Prototypes Iteration'); plt.xlabel('Predictor Feature #1 (normalized)'); plt.ylabel('Predictor Feature #2 (normalized)')\n",
" plt.xlim(0, 1); plt.ylim(0, 1); plt.grid()\n",
" # define the bins and normalize\n",
" \n",
" cbar = plt.colorbar(\n",
" im, orientation=\"vertical\",ticks=np.linspace(1,max_sol,max_sol)\n",
" )\n",
" cbar.set_label(label='Solution Groups', rotation=270, labelpad=20)\n",
" \n",
" plt.subplot(122) \n",
" \n",
" #plt.hist(inertias,bins=np.linspace(0,5,100),color='red',alpha=0.6,edgecolor='black')\n",
" \n",
" plt.plot(np.sort(inertia_cluster)+1,np.linspace(0,1,nreal),color='red',lw=4,alpha=0.6,label='Realizations')\n",
" plt.legend(loc='lower right')\n",
" plt.xlim([0,max_sol]); plt.ylim([0,1]); plt.xlabel('Unique k-means Prototypes Solution Index'); plt.ylabel('Cumulative Probability')\n",
" plt.grid(); plt.title('k-Means Heurstic Prototype Groups CDF')\n",
" plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.1, wspace=0.2, hspace=0.2)\n",
" \n",
"# connect the function to make the samples and plot to the widgets \n",
"interactive_plot_summary = widgets.interactive_output(run_plot_summary, {'K_summary':K_summary,'niter_summary':niter_summary,'seed_summary':seed_summary,'ease_summary':ease_summary,})\n",
"interactive_plot_summary.clear_output(wait = True) # reduce flickering by delaying plot updating"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"display(ui2_summary, interactive_plot_summary) # display the interactive plot"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"l = widgets.Text(value=' Machine Learning k-Means Clustering Heuristic Demo, Prof. Michael Pyrcz, The University of Texas at Austin',\n",
" layout=Layout(width='930px', height='30px'))\n",
"\n",
"K_summary = widgets.IntSlider(min=2, max = 20, value=3, step = 1, description = r'$K$',orientation='horizontal', \n",
" style = {'description_width': 'initial'},layout=Layout(width='200px', height='30px'),continuous_update=False)\n",
"niter_summary = widgets.IntSlider(min=1, max = 500, value=1, step = 1, description = r'$n_{iter}$',orientation='horizontal', \n",
" style = {'description_width': 'initial'},layout=Layout(width='200px', height='30px'), continuous_update=False)\n",
"seed_summary = widgets.IntSlider(min=101, max = 999, value=777, step = 1, description = r'$seed$',orientation='horizontal', \n",
" style = {'description_width': 'initial'},layout=Layout(width='200px', height='30px'), continuous_update=False)\n",
"ease_summary = widgets.FloatSlider(min=0.0,max=1.0,value=0.0,step=0.1,description = r'$ease$',orientation='horizontal', \n",
" style = {'description_width': 'initial'},layout=Layout(width='200px', height='30px'), continuous_update=False)\n",
"\n",
"ui_summary = widgets.HBox([K_summary,niter_summary,seed_summary,ease_summary],)\n",
"ui2_summary = widgets.VBox([l,ui_summary],)\n",
"\n",
"#from sklearn.cluster import MeanShift, estimate_bandwidth\n",
"from sklearn.cluster import DBSCAN\n",
"from matplotlib import colors\n",
"from scipy.stats import rankdata\n",
"from scipy.spatial import ConvexHull, convex_hull_plot_2d\n",
"import alphashape\n",
"from scipy.spatial import Delaunay\n",
"\n",
"def run_plot_summary(K_summary,niter_summary,seed_summary,ease_summary):\n",
" np.random.seed(seed=seed_summary)\n",
" nreal = 10; max_sol = 7\n",
" cent = np.zeros([nreal,K_summary,2])\n",
" inertia = np.zeros([nreal]); inertia_cluster = np.zeros([nreal])\n",
" \n",
" # Modify the data\n",
" df_copy = df.copy(deep = 'True')\n",
" data_kmeans = KMeans(n_clusters=K_summary,init='random',max_iter = 3,random_state=10,n_init = 1).fit(df.loc[:,['Norm_Porosity','Norm_AI']].values)\n",
" data_cent = data_kmeans.cluster_centers_\n",
" #print(data_cent);\n",
" data_labels = data_kmeans.labels_\n",
" #print(data_labels)\n",
" for idata in range(0,len(df)):\n",
" px = data_cent[data_labels[idata],0]; py = data_cent[data_labels[idata],1] \n",
" #print(px,py)\n",
" df_copy.loc[idata,'Norm_Porosity'] = (1-ease_summary)*df.loc[idata,'Norm_Porosity'] + ease_summary*px\n",
" df_copy.loc[idata,'Norm_AI'] = (1-ease_summary)*df.loc[idata,'Norm_AI'] + ease_summary*py\n",
" #print (df)\n",
" \n",
" plt.subplot(121)\n",
" plt.scatter(df_copy['Norm_Porosity'], df_copy['Norm_AI'], c='white', alpha = 0.3, linewidths=1.0, edgecolors=\"black\")\n",
"\n",
" for ireal in range(0,nreal):\n",
" kmeans = KMeans(n_clusters=K_summary,init='random',max_iter = niter_summary,random_state=13+ireal,n_init = 1).fit(df_copy.loc[:,['Norm_Porosity','Norm_AI']].values)\n",
" cur_cent = kmeans.cluster_centers_\n",
" for k in range(0,K_summary):\n",
" cent[ireal,k,0] = cur_cent[k,0]; cent[ireal,k,1] = cur_cent[k,1]\n",
" inertia[ireal] = kmeans.inertia_\n",
"# bandwidth = max(estimate_bandwidth(inertia.reshape(-1,1), quantile=0.2, n_samples=nreal),0.001)\n",
"# mean_shift = MeanShift(bandwidth=bandwidth, bin_seeding=True)\n",
"# mean_shift.fit(inertia.reshape(-1,1))\n",
" dbscan = DBSCAN(eps=0.1, min_samples=2).fit(inertia.reshape(-1,1))\n",
" #cent_flatten = np.reshape(cent,[nreal,K_summary*2])\n",
" #print(cent_flatten)\n",
" #dbscan = DBSCAN(eps=0.02, min_samples=10).fit(cent_flatten)\n",
" inertia_cluster = dbscan.labels_+1\n",
" \n",
" #nsolution = int(np.max(inertia_cluster))+1\n",
" nsolution = len(np.unique(inertia_cluster))\n",
" if nsolution == 1:\n",
" inertia_cluster = np.zeros(nreal)\n",
" else:\n",
" #inertia_cluster = np.round(nsolution*(inertia_cluster - np.min(inertia_cluster))/(np.max(inertia_cluster)-np.min(inertia_cluster)),0)\n",
" inertia_cluster = rankdata(inertia_cluster,method = 'dense') - 1\n",
" #print(nsolution)\n",
" #print(inertia_cluster)\n",
" #cluster_centers = ms.cluster_centers_\n",
" \n",
" #bounds = np.linspace(0,nsolution,nsolution+1)\n",
" bounds = np.linspace(0,max_sol,max_sol+1)\n",
" norm = colors.BoundaryNorm(bounds, plt.cm.inferno.N)\n",
"\n",
" #print(cent[ireal,:,:])\n",
" for ireal in range(0,nreal): \n",
" \n",
" #hull = ConvexHull(cent[ireal,:,:])\n",
" alpha_shape = alphashape.alphashape(cent[ireal,:,:], 0.)\n",
" #plt.plot(cent[ireal,:,:][hull.vertices,0], cent[ireal,:,:][hull.vertices,1], 'r--', lw=2)\n",
" \n",
" #print(cent[ireal,:,:][hull.vertices,0])\n",
"# print(type(alpha_shape))\n",
" #x,y = alpha_shape.xy\n",
" x,y = alpha_shape.exterior.coords.xy\n",
" poly = np.vstack((x,y))\n",
" \n",
" #print(np.transpose(poly))\n",
" t2 = plt.Polygon(np.transpose(poly), color=plt.cm.inferno(ireal/nreal),alpha=0.01)\n",
" plt.gca().add_patch(t2)\n",
"\n",
"\n",
" #poly2 = np.vstack((cent[ireal,:,:][hull.vertices,0],cent[ireal,:,:][hull.vertices,1]))\n",
" poly2 = np.vstack((cent[ireal,:,0],cent[ireal,:,1]))\n",
" #print(poly2)\n",
" tri = Delaunay(np.transpose(poly2))\n",
" plt.triplot(cent[ireal,:,0],cent[ireal,:,1],tri.simplices,\n",
" c=plt.cm.inferno(ireal/nreal),alpha=0.2,zorder=1)\n",
" \n",
" for k in range(0,K_summary):\n",
" if inertia_cluster[ireal] >= 0:\n",
" im = plt.scatter([cent[ireal,k,0]],[cent[ireal,k,1]],c=ireal,s=30,alpha = 0.2,linewidths=1.0,\n",
" edgecolors=\"black\",marker='s',cmap=plt.cm.inferno,vmin=0,vmax=nreal,zorder=10)\n",
" #if k > 0:\n",
" #plt.plot([cent[ireal,k,0],cent[ireal,k-1,0]],[cent[ireal,k,1],cent[ireal,k-1,1]],c='black')\n",
" \n",
" else:\n",
" im = plt.scatter([cent[ireal,k,0]],[cent[ireal,k,1]],s=30,alpha = 0.6,linewidths=1.0,\n",
" c=\"black\",marker='x',zorder=10)\n",
" \n",
" ireal = 7\n",
"# for k in range(0,K_summary):\n",
"# plt.scatter([cent[ireal,k,0]],[cent[ireal,k,1]],c='black',s=90,alpha = 0.6,linewidths=1.0,\n",
"# edgecolors=\"black\",marker='*',cmap=plt.cm.inferno)\n",
" \n",
" plt.title('k-Means Prototypes Iteration'); plt.xlabel('Predictor Feature #1 (normalized)'); plt.ylabel('Predictor Feature #2 (normalized)')\n",
" plt.xlim(0, 1); plt.ylim(0, 1); plt.grid()\n",
" # define the bins and normalize\n",
" \n",
"# cbar = plt.colorbar(\n",
"# im, orientation=\"vertical\",ticks=np.linspace(1,max_sol,max_sol)\n",
"# )\n",
"# cbar.set_label(label='Solution Groups', rotation=270, labelpad=20)\n",
" \n",
" plt.subplot(122) \n",
" \n",
" #plt.hist(inertias,bins=np.linspace(0,5,100),color='red',alpha=0.6,edgecolor='black')\n",
" \n",
"# plt.plot(np.sort(inertia_cluster)+1,np.linspace(0,1,nreal),color='red',lw=4,alpha=0.6,label='Realizations')\n",
" plt.plot(np.sort(inertia),np.linspace(0,1,nreal),color='red',lw=4,alpha=0.6,label='Realizations')\n",
" plt.legend(loc='lower right')\n",
"# plt.xlim([0,max_sol]); \n",
" plt.xlim([3.0,6.0])\n",
" plt.ylim([0,1]); plt.xlabel('Inertia'); plt.ylabel('Cumulative Probability')\n",
" plt.grid(); plt.title('k-Means Heurstic Solutions Inertia CDF')\n",
" plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.1, wspace=0.2, hspace=0.2)\n",
" \n",
"# connect the function to make the samples and plot to the widgets \n",
"interactive_plot_summary = widgets.interactive_output(run_plot_summary, {'K_summary':K_summary,'niter_summary':niter_summary,'seed_summary':seed_summary,'ease_summary':ease_summary,})\n",
"interactive_plot_summary.clear_output(wait = True) # reduce flickering by delaying plot updating"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"display(ui2_summary, interactive_plot_summary) # display the interactive plot"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"points = np.array([[0, 0], [0, 1.1], [1, 0], [1, 1]])\n",
"plt.plot(points,c=0.1,cmap=plt.cm.inferno,vmin=0,vmax=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Comments\n",
"\n",
"This was an interactive demonstration of simple kriging behavoir, the impact of data closeness and redudancy on kriging weights for spatial data analytics. Much more could be done, I have other demonstrations on the basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy. \n",
" \n",
"#### The Author:\n",
"\n",
"### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
"*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n",
"\n",
"With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n",
"\n",
"For more about Michael check out these links:\n",
"\n",
"#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
"\n",
"#### Want to Work Together?\n",
"\n",
"I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n",
"\n",
"* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n",
"\n",
"* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n",
"\n",
"* I can be reached at mpyrcz@austin.utexas.edu.\n",
"\n",
"I'm always happy to discuss,\n",
"\n",
"*Michael*\n",
"\n",
"Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n",
"\n",
"#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) \n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}