{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Interactive Workflow on the Impact of Correlation and Standardization on Multidimensional Scaling\n", "\n", "### The University of Texas at Austin, PGE 2020 SURI, Undergraduation Research\n", "### Alan Scherman, Undergraduate Student, Rice University, UT PGE SURI 2020\n", "### Supervised by:\n", "### Michael Pyrcz, Associate Professor, University of Texas at Austin\n", "***\n", "### Introduction\n", "Correlation coefficients and standardization are two elementary concepts from the realm of multivariate statistics (the latter present even in univariate statistics) which seemingly bear no connection to the outcome of complicated algorithms such as those of multidimensional scaling (MDS). Nevertheless, correlation and standardization fundamentally influence how events in a spatial data set behave (e.g. how far apart they are situated - relevant to MDS purposes). This workflow seeks to qualitatively demonstrate through an user-controlled interactive how combinations of correlation coefficients and standardization do impact the visualization and accuracy of MDS projections.\n", "\n", "***\n", "### Theory\n", "**Correlation coefficients**, also known by their full name of Pearson's Product Moment Correlation Coefficient, are bivariate measures of the degree of linear dependency between two features. They are defined as:\n", "\n", "
\n", "\\begin{equation}\n", "\\rho_{xy} = \\frac{\\sum_{i=1}^{n}{(x_{i}-\\bar{x})}{(y_{i}-\\bar{y})}}{(n-1)\\sigma_{x}\\sigma_{y}}\n", "\\end{equation}\n", "\n", "where $\\rho_{xy}$ is the correlation coefficient between features $x$ and $y$, $n$ is the total number of events, $\\bar{x}$ and $\\bar{y}$ are the means for features $x$ and $y$, and $\\sigma_{x}$ and $\\sigma_{y}$ are the standard deviations of features $x$ and $y$. \n", "\n", "Note that $\\rho$ can vary in the range $[-1,1]$, where $1$ illustrates perfect positive correlation (linear behavior - along $y=x$) and $-1$ shows perfect negative correlation (negative linear behavior - along $y=-x$).\n", "\n", "Rather than a measure, **standardization** provides a univariate process by which different features are transferred to a common range while retaining the significance of their original values. To standardize a data point, apply the following formula:\n", "\n", "
\n", "\\begin{equation}\n", "x_{i,standardized} = \\frac{x_{i} - \\bar{x}}{\\sigma_{x}}\n", "\\end{equation}\n", "\n", "where $x_{i}$ is the value of feature $x$ for event $i$, $\\bar{x}$ is the mean of feature $x$, and $\\sigma_{x}$ is the standard deviation of feature $x$. \n", "\n", "Note that standardization is especially useful when dealing with features of different orders of magnitude because it creates Gaussian distributions of all variables (mean of $0$ and variance of $1$). This prevents one feature from outweighing another in processes like multidimensional scaling.\n", "\n", "**Multidimensional scaling (MDS)** is a powerful application from inferential machine learning employed to project high-dimensional data sets onto lower dimensions. The resulting $2D$ or $3D$ projection may reveal clusters, trends or event outliers which help describe the data set and determine events worth further study. A key concern for MDS is to preserve original pairwise distances in the resulting projection (i.e. decreasing distortions). Several types of MDS exist, such as metric and nonmetric MDS, each with their own algorithms to reduce projection stress. There is much more to learn on MDS, so we encourage the reader to check out the \"MDS_cities_workflow\" demonstration which can be found under the References section for a more detailed, step-by-step walkthrough of MDS. In this interactive, we will utilize the SMACOF algorithm for nonmetric MDS provided by the *sklearn* Python package.\n", "\n", "***\n", "### Objective\n", "In this demonstration, we will determine if different correlation coefficients and standardization influence the outcome and the accuracy (i.e. how it affects a projection's stress) in multidimensional scaling projections. To do so, we will need to generate realistic spatial data sets and conduct multidimensional scaling on them.\n", "***\n", "### Structure of workflow\n", "This workflow was designed to *allow the user full control and freedom to try out different configurations of correlation coefficients and standardization* in an effort to identify the impact on MDS. As such, this interactive workflow includes not only the code used to generate and analyze spatial data, but also widgets which provide inputs to the processing functions. The interactive utilizes a *combination of the Python packages Ipywidgets and Matplotlib*.\n", "\n", "First, we will walk through the functions which generate sample data and project into MDS space. Then, we present the widget configurations. Next, we invite the user to experiment with different widget combinations and identify potential impacts on the MDS projections. Finally, we list a set of conclusions observed from multiple runs of the interactive. *Note that the code for data set generation is computationally expensive and so the interactive might not produce an immediate answer depending on your machine's processing capability.*\n", "\n", "#### Import libraries\n", "We begin by importing some standard Python libraries. Note that typically these would be called within the functions below, but because the interactive supposes repeated use of these libraries, we import these packages outside the functions for greater efficiency. Most of these libraries should come installed if you have Anaconda or other environment software." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import geostatspy.geostats as geostats # reimplimentation of GSLIB algorithms in Python\n", "import geostatspy.GSLIB as GSLIB # GSLIB visualization, subroutines and algorithm wrappers\n", "from copy import copy # deep copies\n", "import sys, os\n", "\n", "from sklearn.manifold import MDS # Multidimensional scaling\n", "import matplotlib.pyplot as plt # Plotting and settings\n", "from mpl_toolkits.mplot3d import Axes3D # 3D scatter plots\n", "\n", "# Interactive resources:\n", "from ipywidgets import interactive, interact_manual\n", "from ipywidgets import widgets \n", "from ipywidgets import Layout\n", "from ipywidgets import Label\n", "from ipywidgets import VBox, HBox" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing 'python -m pip install [package-name]'. More assistance is available with the respective package docs.\n", "\n", "#### *make_sample_data()* Function\n", "The function below creates the spatial data sets which undergo multidimensional scaling. Note that the code below inherits from an interactive workflow written by Dr. Michael Pyrcz (see more detail on function docstring). *make_sample_data* will create a 4 feature (facies, porosity, permeability, and acoustic impedance) data set, where porosity is the truth model with a mean of $0$ and variance of $1$, *corr1* is the correlation coefficient between porosity and permeability, and *corr2* is the correlation coefficient between porosity and acoustic impedance. The spatial data set is generated by a combination of regular, random, and rejection sampling methods which aim to emulate a real-world sample as much as possible." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def make_sample_data(corr1, corr2, standardize=True):\n", " '''\n", " Function used to create nonlinear, multivariate spatial data sets through a combination of\n", " regular, random, and biased sampling techniques. Output data set contains sample Cartesian coordinates,\n", " porosity, permeability, acoustic impedance, and facies information.\n", " \n", " :param corr1: correlation coefficient between permeability and porosity (truth model)\n", " :type: float in range [-1,1] with increments of 0.05\n", " :param corr2: correlation coefficient beween acoustic impedance and porosity (truth model)\n", " :type: float in range [-1,1] with increments of 0.05\n", " :param standardize: default True to standardize features of data set\n", " :type: boolean\n", " :return:biased_samples: 4-feature data set generated by regular, random, and biased sampling methods\n", " :type: Pandas DataFrame\n", " \n", " Note: All credit to Dr. Michael Pyrcz, Associate Professor at UT Austin; the code below is modified from his \n", " \"make_nonlinear_MV_spatial_data\" workflow available on his PythonNumericalDemos GitHub repository at GeoStatsGuy \n", " (https://github.com/GeostatsGuy/PythonNumericalDemos).\n", " \n", " '''\n", " \n", " #***********************************************************************************************************************\n", " # BLOCK / ENABLE PRINT FUNCTIONS\n", " #***********************************************************************************************************************\n", " \n", " # To prevent unnecessary outputs from geostats.sgsim() calls\n", " \n", " # Disable printing\n", " def blockPrint():\n", " sys.stdout = open(os.devnull, 'w')\n", "\n", " # Restore printing\n", " def enablePrint():\n", " sys.stdout = sys.__stdout__\n", "\n", " #***********************************************************************************************************************\n", " # TRUTH MODEL PARAMETERS\n", " #***********************************************************************************************************************\n", "\n", " # Truth Model Parameters \n", " seed = 73073 # random number seed for repeatable random samples and rejection sampler \n", " name = 'Por' # name of the feature of interest\n", " nug = 0.0 # nugget effect - variogram model\n", " it = 1 # type of structure (1 -spherical, 2 - exponential, 3 - Gaussian)\n", " azi = 45.0 # primary direction/azimuth of continuity (0 = y positive, 90 = x positive)\n", " hmaj = 800 # variogram range in the major direction\n", " hmin = 300 # variogram range in the minor direction (major direction + 90)\n", "\n", " # Make the variogram object\n", " vario = GSLIB.make_variogram(nug,nst=1,it1=it,cc1=1.0,azi1=azi,hmaj1=hmaj,hmin1=hmin) # make model object\n", "\n", " # Make the dummy dataset (should be outside the range of correlation of the model)\n", " df = pd.DataFrame({'X':np.full(100,-9999),'Y':np.full(100,-9999),name:np.random.normal(0.0,1.0,100)})\n", "\n", " # Truth Model Grid Parameters\n", " nx = 100; ny = 100 # number of cells in the x and y directions\n", " xsiz = 10.0; ysiz = 10.0 # size of the cells in the x and y directions\n", " xmn = xsiz * 0.5; ymn = ysiz * 0.5 # assume origin at 0,0, calculate the 1 cell centroid\n", " xmin = xmn - 0.5*xsiz; ymin = ymn - 0.5*ysiz # assume origin at 0,0, calculate the min and max x/y coordinates\n", " xmax = xmin + nx * xsiz; ymax = ymin + ny * ysiz;\n", " vmin = -3; vmax = 3 # feature min and max for color bars\n", " sim_ns = np.zeros([6,ny,nx])\n", "\n", " #***********************************************************************************************************************\n", " # GAUSSIAN SPATIAL REALIZATIONS\n", " #***********************************************************************************************************************\n", " \n", " blockPrint()\n", " \n", " # Calculate the Simulated Truth Model over the Specified Grid and Variogram \n", " sim_ns[0,:,:] = geostats.sgsim(df,'X','Y',name,wcol=-1,scol=-1,tmin=-9999,tmax=9999,itrans=0,ismooth=0,dftrans=0,tcol=0,\n", " twtcol=0,zmin=-3.0,zmax=3.0,ltail=1,ltpar=-3.0,utail=1,utpar=3.0,nsim=1,\n", " nx=nx,xmn=xmn,xsiz=xsiz,ny=ny,ymn=ymn,ysiz=ysiz,seed=73073,\n", " ndmin=0,ndmax=100,nodmax=10,mults=0,nmult=2,noct=-1,radius=hmaj,radius1=1,sang1=0,\n", " mxctx=10,mxcty=10,ktype=0,colocorr=0.0,sec_map=0,vario=vario); # corr = 0 bc Por is the feature of interest\n", " sim_ns[0,:,:] = GSLIB.affine(sim_ns[0,:,:],0.0,1.0); # correct the mean and variance\n", "\n", "\n", " sim_ns[1,:,:] = geostats.sgsim(df,'X','Y',name,wcol=-1,scol=-1,tmin=-9999,tmax=9999,itrans=0,ismooth=0,dftrans=0,tcol=0,\n", " twtcol=0,zmin=-3.0,zmax=3.0,ltail=1,ltpar=-3.0,utail=1,utpar=3.0,nsim=1,\n", " nx=nx,xmn=xmn,xsiz=xsiz,ny=ny,ymn=ymn,ysiz=ysiz,seed=73074,\n", " ndmin=0,ndmax=100,nodmax=10,mults=0,nmult=2,noct=-1,radius=hmaj,radius1=1,sang1=0,\n", " mxctx=10,mxcty=10,ktype=4,colocorr=corr1,sec_map=sim_ns[0,:,:],vario=vario); # corr to Por\n", " sim_ns[1,:,:] = GSLIB.affine(sim_ns[1,:,:],0.0,1.0); # correct the mean and variance\n", "\n", " sim_ns[2,:,:] = geostats.sgsim(df,'X','Y',name,wcol=-1,scol=-1,tmin=-9999,tmax=9999,itrans=0,ismooth=0,dftrans=0,tcol=0,\n", " twtcol=0,zmin=-3.0,zmax=3.0,ltail=1,ltpar=-3.0,utail=1,utpar=3.0,nsim=1,\n", " nx=nx,xmn=xmn,xsiz=xsiz,ny=ny,ymn=ymn,ysiz=ysiz,seed=73075,\n", " ndmin=0,ndmax=100,nodmax=10,mults=0,nmult=2,noct=-1,radius=hmaj,radius1=1,sang1=0,\n", " mxctx=10,mxcty=10,ktype=4,colocorr=corr2,sec_map=sim_ns[0,:,:],vario=vario); # corr to Por\n", " sim_ns[2,:,:] = GSLIB.affine(sim_ns[2,:,:],0.0,1.0); # correct the mean and variance\n", "\n", " sim_ns[3,:,:] = geostats.sgsim(df,'X','Y',name,wcol=-1,scol=-1,tmin=-9999,tmax=9999,itrans=0,ismooth=0,dftrans=0,tcol=0,\n", " twtcol=0,zmin=-3.0,zmax=3.0,ltail=1,ltpar=-3.0,utail=1,utpar=3.0,nsim=1,\n", " nx=nx,xmn=xmn,xsiz=xsiz,ny=ny,ymn=ymn,ysiz=ysiz,seed=73076,\n", " ndmin=0,ndmax=100,nodmax=10,mults=0,nmult=2,noct=-1,radius=hmaj,radius1=1,sang1=0,\n", " mxctx=10,mxcty=10,ktype=0,colocorr=0.0,sec_map=0,vario=vario); # corr = 0 bc Por is the feature of interest\n", " sim_ns[3,:,:] = GSLIB.affine(sim_ns[3,:,:],0.0,1.0); # correct the mean and variance\n", "\n", " sim_ns[4,:,:] = geostats.sgsim(df,'X','Y',name,wcol=-1,scol=-1,tmin=-9999,tmax=9999,itrans=0,ismooth=0,dftrans=0,tcol=0,\n", " twtcol=0,zmin=-3.0,zmax=3.0,ltail=1,ltpar=-3.0,utail=1,utpar=3.0,nsim=1,\n", " nx=nx,xmn=xmn,xsiz=xsiz,ny=ny,ymn=ymn,ysiz=ysiz,seed=73077,\n", " ndmin=0,ndmax=100,nodmax=10,mults=0,nmult=2,noct=-1,radius=hmaj,radius1=1,sang1=0,\n", " mxctx=10,mxcty=10,ktype=4,colocorr=corr1,sec_map=sim_ns[3,:,:],vario=vario); # corr to Por\n", " sim_ns[4,:,:] = GSLIB.affine(sim_ns[4,:,:],0.0,1.0); # correct the mean and variance\n", "\n", " sim_ns[5,:,:] = geostats.sgsim(df,'X','Y',name,wcol=-1,scol=-1,tmin=-9999,tmax=9999,itrans=0,ismooth=0,dftrans=0,tcol=0,\n", " twtcol=0,zmin=-3.0,zmax=3.0,ltail=1,ltpar=-3.0,utail=1,utpar=3.0,nsim=1,\n", " nx=nx,xmn=xmn,xsiz=xsiz,ny=ny,ymn=ymn,ysiz=ysiz,seed=73078,\n", " ndmin=0,ndmax=100,nodmax=10,mults=0,nmult=2,noct=-1,radius=hmaj,radius1=1,sang1=0,\n", " mxctx=10,mxcty=10,ktype=4,colocorr=corr2,sec_map=sim_ns[3,:,:],vario=vario); # corr to Por\n", " sim_ns[5,:,:] = GSLIB.affine(sim_ns[5,:,:],0.0,1.0); # correct the mean and variance\n", "\n", " \n", " enablePrint()\n", " \n", " #***********************************************************************************************************************\n", " # GAUSSIAN TO PROPERTY DISTRIBUTION (SAND)\n", " #***********************************************************************************************************************\n", "\n", " df_ns = pd.DataFrame({'NS1':sim_ns[0,:,:].flatten(),'NS2':sim_ns[1,:,:].flatten(),\n", " 'NS3':sim_ns[2,:,:].flatten(),'NS4':sim_ns[3,:,:].flatten(),\n", " 'NS5':sim_ns[4,:,:].flatten(),'NS6':sim_ns[5,:,:].flatten()})\n", "\n", " df_ns['Por1'] = GSLIB.affine(df_ns['NS1'],13.0,4.5)\n", "\n", " a = 5.5; b = 35; c = 100\n", " # impose nonlinearity through a quadratic transform of the permeability \n", " df_ns['Perm1'] = a * df_ns['NS2'] * df_ns['NS2'] + b * df_ns['NS2'] + c \n", " df_ns['Perm1'] = GSLIB.affine(df_ns['Perm1'],4.3,1.0)*100\n", "\n", " a = 5.5; b = 50; c = 100\n", " df_ns['AI1'] = a * df_ns['NS3'] * df_ns['NS3'] + b * df_ns['NS3'] + c \n", " df_ns['AI1'] = GSLIB.affine(df_ns['AI1'],4.0,0.3)*1000\n", "\n", " df_ns['facies1'] = np.full(len(df_ns),1)\n", "\n", " #***********************************************************************************************************************\n", " # GAUSSIAN TO PROPERTY DISTRIBUTION (SHALE)\n", " #***********************************************************************************************************************\n", "\n", " df_ns['Por2'] = GSLIB.affine(df_ns['NS4'],7.0,2.5)\n", "\n", " a = 5.5; b = 35; c = 100\n", " # impose nonlinearity through a quadratic transform of the permeability \n", " df_ns['Perm2'] = a * df_ns['NS5'] * df_ns['NS5'] + b * df_ns['NS5'] + c \n", " df_ns['Perm2'] = GSLIB.affine(df_ns['Perm2'],2.0,0.3)*100\n", "\n", " a = 5.5; b = 50; c = 100\n", " df_ns['AI2'] = a * df_ns['NS6'] * df_ns['NS6'] + b * df_ns['NS6'] + c \n", " df_ns['AI2'] = GSLIB.affine(df_ns['AI2'],5.0,0.4)*1000\n", "\n", " df_ns['facies2'] = np.full(len(df_ns),0)\n", "\n", "\n", " #***********************************************************************************************************************\n", " # MAKE FACIES MODEL\n", " #***********************************************************************************************************************\n", "\n", " # Sequential Indicator Simulation with Simple Kriging Multiple Realizations \n", " nx = 100; ny = 100; xsiz = 10.0; ysiz = 10.0; xmn = 5.0; ymn = 5.0; nxdis = 1; nydis = 1\n", " ndmin = 0; ndmax = 10; nodmax = 10; radius = 400; skmean = 0\n", " tmin = -999; tmax = 999\n", " df_dummy_facies = pd.DataFrame({'X':np.full(100,-9999),'Y':np.full(100,-9999),'Facies':np.full(100,0)})\n", " dummy_trend = np.zeros((10,10,2)) # the current version requires trend input - if wrong size it is ignored \n", "\n", " ncut = 2 # number of facies\n", " thresh = [0,1] # the facies categories (use consisten order)\n", " gcdf = [0.4,0.6] # the global proportions of the categories\n", " varios = [] # the variogram list\n", " varios.append(GSLIB.make_variogram(nug=0.0,nst=1,it1=1,cc1=1.0,azi1=45,hmaj1=300,hmin1=100)) # shale indicator variogram\n", " varios.append(GSLIB.make_variogram(nug=0.0,nst=1,it1=1,cc1=1.0,azi1=45,hmaj1=300,hmin1=100)) # sand indicator variogram\n", "\n", " sisim = geostats.sisim(df_dummy_facies,'X','Y','Facies',ivtype=0,koption=0,ncut=2,thresh=thresh,gcdf=gcdf,trend=dummy_trend,\n", " tmin=tmin,tmax=tmax,zmin=0.0,zmax=1.0,ltail=1,ltpar=1,middle=1,mpar=0,utail=1,utpar=2,\n", " nx=nx,xmn=xmn,xsiz=xsiz,ny=ny,ymn=ymn,ysiz=ysiz,seed = 73073,\n", " ndmin=ndmin,ndmax=ndmax,nodmax=nodmax,mults=1,nmult=3,noct=-1,radius=radius,ktype=0,vario=varios)\n", "\n", " #***********************************************************************************************************************\n", " # MERGE CONTINUOUS PROPERTIES BY FACIES (COOKIE CUTTER)\n", " #***********************************************************************************************************************\n", "\n", " por1 = df_ns['Por1'].values.reshape([ny,nx])\n", " perm1 = df_ns['Perm1'].values.reshape([ny,nx])\n", " AI1 = df_ns['AI1'].values.reshape([ny,nx])\n", "\n", " por2 = df_ns['Por2'].values.reshape([ny,nx])\n", " perm2 = df_ns['Perm2'].values.reshape([ny,nx])\n", " AI2 = df_ns['AI2'].values.reshape([ny,nx])\n", "\n", " facies = np.zeros([ny,nx])\n", " por = np.zeros([ny,nx])\n", " perm = np.zeros([ny,nx])\n", " AI = np.zeros([ny,nx])\n", "\n", " for iy in range(0,ny):\n", " for ix in range(0,nx):\n", " if sisim[iy,ix] > 0.5: # current location is assumed to be sand\n", " facies[iy,ix] = 1\n", " por[iy,ix] = por1[iy,ix];\n", " perm[iy,ix] = perm1[iy,ix];\n", " AI[iy,ix] = AI1[iy,ix]\n", " else: # current location is assumed to be shale\n", " facies[iy,ix] = 0\n", " por[iy,ix] = por2[iy,ix];\n", " perm[iy,ix] = perm2[iy,ix];\n", " AI[iy,ix] = AI2[iy,ix]\n", "\n", " #***********************************************************************************************************************\n", " # REGULAR, RANDOM, REJECT SAMPLE FUNCTIONS\n", " #***********************************************************************************************************************\n", "\n", " def regular_sample_MV(array,array2,array3,array4,xmin,xmax,ymin,ymax,spacing,\n", " minsx=xmin,maxsx=xmax,minsy=ymin,maxsy=ymax,name='Value',name2='Value2',name3='Value3',name4=\"Value4\"):\n", " x = []; y = []; v = []; v2 = []; v3 = []; v4 = []\n", " nx = array.shape[1]; ny = array.shape[0]\n", " xsiz = (xmax-xmin)/nx; ysiz = (ymax-ymin)/ny\n", " xmn = xmin + 0.5*xsiz; ymn = ymin + 0.5*ysiz\n", " xx, yy = np.meshgrid(np.arange(xmin, xmax, spacing), np.arange(ymax, ymin, -1 * spacing))\n", " xx = xx + spacing*0.5; yy = yy - spacing*0.5\n", " for ix,iy in np.ndindex(xx.shape):\n", " iix = geostats.getindex(nx,xmn,xsiz,xx[iy,ix])\n", " iiy = geostats.getindex(ny,ymn,ysiz,yy[iy,ix])\n", " if yy[iy,ix] >= minsy and yy[iy,ix] <= maxsy: \n", " if xx[iy,ix] >= minsx and xx[iy,ix] >= minsx: \n", " x.append(xx[iy, ix])\n", " y.append(yy[iy, ix])\n", " v.append(array[ny - iiy - 1, iix])\n", " v2.append(array2[ny - iiy - 1, iix])\n", " v3.append(array3[ny - iiy - 1, iix])\n", " v4.append(array4[ny - iiy - 1, iix])\n", " df = pd.DataFrame(np.c_[x, y, v, v2, v3, v4], columns=[\"X\", \"Y\", name,name2,name3,name4])\n", " return df\n", "\n", " def random_sample_MV(array,array2,array3,array4,xmin,xmax,ymin,ymax,nsamp=10,minsx=xmin,maxsx=xmax,minsy=ymin,maxsy=ymax,name='Value',name2='Value2',name3='Value3',name4='Value4'):\n", " x = []; y = []; v = []; v2 = []; v3 = []; v4 = []\n", " ny, nx = array.shape \n", " xx, yy = np.meshgrid(\n", " np.arange(xmin, xmax,(xmax-xmin)/float(nx)), np.arange(ymax - 1, ymin - 1, -1 * (ymax-ymin)/float(ny))\n", " )\n", " mask = np.zeros([ny,nx])\n", " for iy in range(0,ny):\n", " for ix in range(0,nx):\n", " if xx[iy,ix] >= minsx and xx[iy,ix] <= maxsx:\n", " if yy[iy,ix] >= minsy and yy[iy,ix] <= maxsy:\n", " mask[iy,ix] = 1.0\n", "\n", " if nsamp*1.2 > np.sum(mask):\n", " print('ERROR - too few locations available for number of samples requested!') \n", " return pd.DataFrame()\n", " isamp = 0\n", " while isamp < nsamp:\n", " sample_index = np.random.choice(range(nx * ny), 1)\n", " iy = int(sample_index[0] / ny)\n", " ix = sample_index[0] - iy * nx\n", " if mask[iy,ix] == 1: \n", " if xx[iy,ix] >= xmin and xx[iy,ix] <= xmax:\n", " if yy[iy,ix] >= ymin and yy[iy,ix] <= ymax: \n", " x.append(xx[iy, ix])\n", " y.append(yy[iy, ix])\n", " v.append(array[iy, ix])\n", " v2.append(array2[iy, ix])\n", " v3.append(array3[iy, ix]) \n", " v4.append(array4[iy, ix]) \n", " mask[iy,ix] = 0.0\n", " isamp = isamp + 1\n", " df = pd.DataFrame(np.c_[x, y, v, v2, v3, v4], columns=[\"X\", \"Y\", name,name2,name3,name4])\n", " return df\n", "\n", " def rejection_sample(df,vcol,frac,wt_min,wt_max):\n", " value = [np.min(df[vcol].values),np.max(df[vcol].values)]\n", " wt = [wt_min,wt_max]\n", " df_copy = df.copy(deep = True)\n", " df_copy['weights'] = np.interp(df[vcol],value,wt)\n", " df_sample = df_copy.sample(frac = frac, replace = False, weights = 'weights')\n", " return df_sample\n", "\n", " #***********************************************************************************************************************\n", " # SAMPLE FROM TRUTH MODEL FOR EACH FEATURE (USE OF REGULAR, RANDOM, REJECT SAMPLE FUNCTIONS)\n", " #***********************************************************************************************************************\n", "\n", " # Data Set Sampling Paramters\n", " spacing = 30 # spacing of regular samples\n", " nrandom_sample = 400 # number of random samples\n", " wt_min = 0.1; wt_max = 1.0 # weigths of minimum and maximum feature values - control on degree of bias \n", " bias_frac = 0.7 # fraction of dataset to remove with rejection sampler\n", " seed = 73073 # random number seed\n", " vmin = 0.0; vmax = 25.0\n", "\n", " np.random.seed(seed = seed) # set the random number seed for repeatability of samples\n", "\n", " # Regular Sampling\n", " regular_samples = regular_sample_MV(por,perm,AI,facies,xmin=xmin,xmax=xmax,ymin=ymin,ymax=ymax,spacing=spacing,\n", " minsx=xmin,maxsx=xmax,minsy=ymin,maxsy=ymax,name=\"Por\",name2=\"Perm\",name3=\"AI\",name4=\"Facies\")\n", "\n", " # Random Sampling\n", " random_samples = random_sample_MV(por,perm,AI,facies,xmin=xmin,xmax=xmax,ymin=ymin,ymax=ymax,\n", " nsamp = nrandom_sample,minsx=xmin,maxsx=xmax,minsy=ymin,maxsy=ymax,\n", " name=\"Por\",name2=\"Perm\",name3=\"AI\",name4=\"Facies\")\n", "\n", " # Concatenate (Combine) Regular and Random Samples into One DataFrame \n", " samples = pd.concat([regular_samples,random_samples])\n", "\n", " # Rejection Sampler\n", " biased_samples = rejection_sample(samples,\"Por\",1-bias_frac,wt_min,wt_max)\n", "\n", " # Remove Dupicates \n", " biased_samples = biased_samples[np.invert(biased_samples.duplicated(subset=['X','Y'],keep='first'))] \n", "\n", " # Drop the weights column\n", " biased_samples = biased_samples.drop('weights',axis=1) \n", " \n", " #***********************************************************************************************************************\n", " # STANDARDIZATION OF RESULT SAMPLE SET\n", " #***********************************************************************************************************************\n", " \n", " if standardize == True:\n", " biased_samples = (biased_samples - biased_samples.mean())/biased_samples.std()\n", " return biased_samples\n", " elif standardize == False:\n", " return biased_samples\n", " else:\n", " raise TypeError('\"standardize\" parameter must be boolean')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### *make_MDS()* Function\n", "Now that we have an utility which creates a spatial data set from user-selected correlation coefficients, we can code up another function to do MDS on the data set and produce plots of the MDS projection and its accuracy. Observe that most of the function below is dedicated to calculating pairwise distances and the Kruskal stress for the second plot. The core of the (non-metric) multidimensional scaling analysis is executed with the *sklearn.manifold.MDS* module." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def make_MDS(sample_df, n_dim, corr1, corr2, standardize=True):\n", " '''\n", " Function that computes non-metric MDS of selected sample features and outputs plots of MDS projection and\n", " model accuracy check with Kruskal stress. Relies on sklearn.manifold.MDS.\n", " \n", " :param sample_df: spatial data set with select features for MDS analysis\n", " :type: Pandas DataFrame\n", " :param n_dim: number of dimensions of MDS projection\n", " :type: str '2D' or '3D'\n", " :param corr1: correlation coefficient between porosity and permeability (for plot label purposes)\n", " :type: float in range [-1,1] with increments of 0.05\n", " :param corr2: correlation coefficient between porosity and acoustic impedance (for plot label purposes)\n", " :type: float in range [-1,1] with increments of 0.05\n", " :param standardize: default True to standardize data set by feature (for plot label purposes)\n", " :type: boolean\n", " :return:scatter plot of MDS projection\n", " :type: 2D or 3D scatter plot\n", " :plot of original versus projected pairwise distances w/ stress\n", " :type: 2D scatter plot\n", " \n", " '''\n", " \n", " sample = sample_df.to_numpy() # Convert to NumPy array\n", " \n", " #***********************************************************************************************************************\n", " # NONMETRIC MDS\n", " #***********************************************************************************************************************\n", " \n", " embedding = MDS(n_components=int(n_dim.replace('D','')))\n", " proj_coords = embedding.fit_transform(sample) # Compute and store the MDS coordinates\n", " \n", " #***********************************************************************************************************************\n", " # CALCULATE ORIGINAL DISTANCES\n", " #***********************************************************************************************************************\n", " \n", " orig_dists = np.zeros(shape=(np.shape(proj_coords)[0],np.shape(proj_coords)[0]))\n", " \n", " for i in range(np.shape(sample)[0]): # Span rows\n", " for j in range(np.shape(sample)[0]): # Span columns\n", " if i==j: # Distance of event to itself is null\n", " continue\n", " else:\n", " inside = 0\n", " for k in range(np.shape(sample)[1]): # Span features\n", " inside += (sample[i,k] - sample[j,k])**2\n", " \n", " orig_dists[i,j] = np.sqrt(inside)\n", " \n", " #***********************************************************************************************************************\n", " # CALCULATE PROJECTED DISTANCES\n", " #***********************************************************************************************************************\n", " \n", " proj_dists = np.zeros(shape=(np.shape(proj_coords)[0],np.shape(proj_coords)[0])) # MDS projected distances\n", "\n", " for i in range(np.shape(proj_coords)[0]): # Span rows\n", " for j in range(np.shape(proj_coords)[0]): # Span columns\n", " if i==j:\n", " continue # Distance of event to itself is null\n", " else:\n", " inside = 0\n", " for k in range(np.shape(proj_coords)[1]): # Span features\n", " inside += (proj_coords[i,k] - proj_coords[j,k])**2 \n", " \n", " proj_dists[i,j] = np.sqrt(inside)\n", " \n", " #***********************************************************************************************************************\n", " # STRESS\n", " #***********************************************************************************************************************\n", " num = 0; den = 0 # Preallocating numerator and denominator\n", "\n", " for i in range(np.shape(proj_coords)[0]): # Span rows\n", " for j in range(np.shape(proj_coords)[0]): # Span columns\n", " if i==j: # num, den would be 0 so skip for efficiency\n", " continue\n", " else:\n", " num += (proj_dists[i,j] - orig_dists[i,j])**2\n", " den += (orig_dists[i,j])**2\n", " \n", " stress = np.sqrt(num/den)\n", "\n", " #***********************************************************************************************************************\n", " # PLOTTING\n", " #***********************************************************************************************************************\n", " plt.style.use('ggplot')\n", " \n", " plt.figure(figsize=(14,10)) # MDS projection\n", " \n", " if np.shape(proj_coords)[1]==2: # 2D projection\n", " plt.scatter(proj_coords[:,0],proj_coords[:,1],marker='o',s=10, color='blue');\n", " else: # 3D projection\n", " ax = plt.axes(projection='3d')\n", " ax.scatter3D(proj_coords[:,0],proj_coords[:,1],proj_coords[:,2],marker='o',s=10, color='blue');\n", " ax.set_zlabel('Third MDS Dimension',size=20);\n", " \n", " plt.xlabel('First MDS Dimension',size=20); plt.ylabel('Second MDS Dimension',size=20);\n", " plt.suptitle('{dim}D MDS Projection of Subsurface Sample'.format(dim=int(n_dim.replace('D',''))),size=25)\n", " plt.title('Correlation Coefficients: {c1},{c2} \\n Standardization: {s}'.format(c1=corr1,c2=corr2,s=standardize),size=15)\n", " \n", " plt.figure(figsize=(14,10)) # Original vs projected pairwise distances and Kruskal stress\n", " \n", " plt.scatter(orig_dists,proj_dists,label='Projected vs Original Distances', color='red',s=3);\n", " plt.plot(np.max(orig_dists)*np.array([0,1]),np.max(orig_dists)*np.array([0,1]),color='black',linewidth=2,label='Null Stress (y=x)');\n", " plt.title('Projected MDS versus Original Distances w/Stress',size=25)\n", " plt.xlabel('Original Distances',size=20); plt.ylabel('Projected Distances',size=20); \n", " plt.text(0.1*np.max(orig_dists),0.9*np.max(orig_dists), 'Kruskal Stress = \\n %.3f' %stress, size = 15, horizontalalignment = 'center', \\\n", " bbox=dict(facecolor='none', edgecolor='black'))\n", " plt.legend(loc='lower right');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Widgets and *widget2MDS()* Function\n", "The last step before launching the interactive is to code up the widgets which will provide inputs to the functions above, and to create a function to encapsulate *make_sample_data()* and *make_MDS()*. This global function which we call *widget2MDS()* does not perform any additional processing of the sample data or the MDS projection. Rather, it just helps call the interactive." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "header = widgets.Text(value='Impact of Correlation and Standardization in Multidimensional Scaling',\\\n", " layout=Layout(width='900px', height='50px',justify_content='center'))\n", "\n", "style = {'description_width': 'initial'}\n", "\n", "corr1 = widgets.FloatSlider(min = -1.0, max = 1.0, value = 0, step = 0.05, description = 'Permeability Correlation Coefficient',\\\n", " orientation='horizontal',continuous_update=False,style=style,layout=Layout(width='500px', height='50px'))\n", "corr1.style.handle_color = 'red'\n", "\n", "corr2 = widgets.FloatSlider(min = -1.0, max = 1.0, value = 0, step = 0.05, description = 'Acoustic Imp. Correlation Coefficient',\\\n", " style=style,orientation='horizontal',continuous_update=False,layout=Layout(width='500px', height='50px'))\n", "corr2.style.handle_color = 'blue'\n", "\n", "standardize = widgets.Checkbox(value=False,description='Standardize by feature?',disabled=False,indent=False,layout=Layout(width='400px', height='50px'))\n", "\n", "n_dim = widgets.Dropdown(options=['2D', '3D'],value='2D',description='Number of dimensions in MDS projection:',disabled=False,\\\n", " style=style,layout=Layout(width='400px', height='50px'))\n", "\n", "\n", "def widget2MDS(corr1, corr2, n_dim, standardize=True):\n", " '''\n", " Function to encapsulate make_sample_data() and make_MDS(). \n", " Takes widget inputs and produces MDS projections and Kruskal stress plot.\n", " \n", " :param corr1: correlation coefficient between permeability and porosity (truth model)\n", " :type: float in range [-1,1] with increments of 0.05\n", " :param corr2: correlation coefficient beween acoustic impedance and porosity (truth model)\n", " :type: float in range [-1,1] with increments of 0.05\n", " :param standardize: default True to standardize features of data set\n", " :type: boolean\n", " :param n_dim: number of dimensions of MDS projection\n", " :type: str '2D' or '3D'\n", " :return:scatter plot of MDS projection\n", " :type: 2D or 3D scatter plot\n", " :plot of original versus projected pairwise distances w/ stress\n", " :type: 2D scatter plot\n", " '''\n", " \n", " sample_df = make_sample_data(corr1=corr1, corr2=corr2, standardize=standardize)\n", " \n", " make_MDS(sample_df=sample_df, n_dim=n_dim, corr1=corr1, corr2=corr2, standardize=standardize)\n", " \n", "\n", "# Links widget inputs to widget2MDS()\n", "play_button = interactive(widget2MDS,{'manual':True},corr1=corr1,corr2=corr2,n_dim=n_dim,standardize=standardize)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "### Interactive \n", "With all functions and widgets properly set up and linked together, it is time to run our interactive. Note that much more could have been done to improve the visualization of the widgets, and we invite the user willing to collaborate to do so.\n", "\n", "##### Instructions:\n", " - Red dial: refers to the **correlation coefficient between the sample's permeability and porosity** and ranges from -1 (for perfect inverse behavior) to 1 (for perfect direct proportionality) with increments of 0.05;\n", " - Blue dial: refers to the **correlation coefficient between acoustic impedance and porosity**; \n", " - Number of dimensions in MDS projection: in **how many dimensions to display the MDS projection** scatter plot (2D or 3D);\n", " - Standardize by Feature: when checked **will standardize the selected features**;\n", "\n", "Once you have your settings picked, make sure to press **\"Run Interact\"** otherwise the interactive will not start. Enjoy!" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": false }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9c99a87cbec7403cad08460a25e9fead", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Text(value='Impact of Correlation and Standardization in Multidimensional Scaling', layout=Layout(height='50px…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "84d60a2b0ed24791a8017fc2630ff705", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(FloatSlider(value=0.0, continuous_update=False, description='Permeability Correlation Co…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(header, play_button)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "### Conclusion\n", "By running the interactive above for different combinations of initial settings (large, medium, and small correlation coefficients; standardized and non-standardized data), the user can draw three main qualitative conclusions:\n", "\n", "\n", "1) Standardizing data by feature facilitates the appearance of clusters in spite of the correlation coefficients and projection dimensionality chosen;\n", "\n", "2) For both standardized and non-standardized data sets, the variance in projection stress increases with the magnitude of the correlation coefficients. For example, 3 iterations with standardized data and large correlation coefficients ($|\\rho|=1$) yield stress values of $0.179$, $0.190$, and $0.145$, while 3 iterations with standardized data and small correlation coefficients ($|\\rho|=0.1$) yield stress values of $0.191$, $0.194$, and $0.192$;\n", "\n", "3) The projection stress decreases for greater projection dimensionality. Hence, a $3D$ MDS projection will certainly distort less the original pairwise distances than a $2D$ MDS projection.\n", "\n", "***\n", "### References\n", "The sources below provide further reading and video material on correlation coefficients, standardization, and MDS:\n", "\n", "1. [MDS_cities_workflow](https://github.com/AlanScherman/SURI2020/blob/master/MDS%20-%20Cities%20demo/MDS_cities_workflow.ipynb)\n", "2. [Interpreting MDS projections](http://www.analytictech.com/borgatti/mds.htm)\n", "3. [sklearn.manifold.MDS Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html) \n", "4. [Dr. Michael Pyrcz's lecture on correlation](https://www.youtube.com/watch?v=wZwYEDqB4A4&list=PLG19vXLQHvSB-D4XKYieEku9GQMQyAzjJ&index=20)\n", "5. [Dr. Michael Pyrcz's \"make_nonlinear_MV_spatial_data\" workflow](https://github.com/GeostatsGuy/PythonNumericalDemos/blob/master/make_nonlinear_MV_spatial_data.ipynb)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 4 }