{ "cells": [ { "cell_type": "markdown", "id": "56d479f5", "metadata": {}, "source": [ "

\n", " \n", "\n", "

\n", "\n", "### Interactive Workflow of Principal Component Analysis, Eigen Values and Eigen Vectors\n", " \n", "#### Michael Pyrcz, Associate Professor, University of Texas at Austin,\n", " \n", " ##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy),\n", "\n", "#### Introduction\n", "\n", "This semester I had students asking about how Eigen vectors and values behave in PCA. They wanted to see how they response to structure in covariance matrix. So I made this interactivity to demonstrate and visualize this! \n", "\n", "For more check out my YouTube channel [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig). For the walkthrough video of this workflow go here [walkthrough](TBD). Here's some basic concepts on Principal Component Analysis.\n", "\n", "#### Principal Component Analysis\n", "\n", "Principal Component Analysis one of a variety of methods for dimensional reduction:\n", "\n", "Dimensional reduction transforms the data to a lower dimension\n", "\n", "* Given features, $𝑋_1,\\dots,𝑋_π‘š$ we would require ${m \\choose 2}=\\frac{π‘š \\cdot (π‘šβˆ’1)}{2}$ scatter plots to visualize just the two-dimensional scatter plots.\n", "\n", "* Once we have 4 or more variables understanding our data gets very hard.\n", "* Recall the curse of dimensionality, impact inference, modeling and visualization. \n", "\n", "One solution, is to find a good lower dimensional, $𝑝$, representation of the original dimensions $π‘š$\n", "\n", "Benefits of Working in a Reduced Dimensional Representation:\n", "\n", "1. Data storage / Computational Time\n", "2. Easier visualization\n", "3. Also takes care of multicollinearity \n", "\n", "#### Orthogonal Transformation \n", "\n", "Convert a set of observations into a set of linearly uncorrelated variables known as principal components\n", "\n", "* The number of principal components ($k$) available are min⁑($π‘›βˆ’1,π‘š$) \n", "\n", "* Limited by the variables/features, $π‘š$, and the number of data\n", "\n", "Components are ordered\n", "\n", "* First component describes the larges possible variance / accounts for as much variability as possible\n", "* Next component describes the largest possible remaining variance \n", "* Up to the maximum number of principal components\n", "\n", "Eigen Values / Eigen Vectors\n", "\n", "* The Eigen values are the variance explained for each component\n", "* The Eigen vectors of the data covariance matrix are the principal components' loadings\n", "\n", "#### Install Packages\n", "\n", "For this interactive workflow to work, we need to install several packages relating to display features, widgets and data analysis interpretation." ] }, { "cell_type": "code", "execution_count": 1, "id": "b9458018", "metadata": {}, "outputs": [], "source": [ "import pandas as pd # DataFrames and plotting\n", "import numpy as np\n", "import matplotlib.pyplot as plt # plotting\n", "from matplotlib.colors import ListedColormap # custom color maps\n", "import matplotlib.ticker as mtick\n", "from matplotlib.patches import Rectangle\n", "import matplotlib as mpl\n", "from mpl_toolkits.axes_grid1 import make_axes_locatable\n", "from numpy.linalg import eig # Eigen values and Eigen vectors\n", "from sklearn.decomposition import PCA # PCA program from scikit learn (package for machine learning)\n", "from sklearn.preprocessing import StandardScaler # normalize synthetic data\n", "from ipywidgets import interactive # widgets and interactivity\n", "from ipywidgets import widgets \n", "from ipywidgets import Layout\n", "from ipywidgets import Label\n", "from ipywidgets import VBox, HBox\n", "import warnings\n", "warnings.filterwarnings('ignore') # ignore warnings\n", "plt.rc('axes', axisbelow=True) # grids behind plot elements" ] }, { "cell_type": "markdown", "id": "014a3dd1", "metadata": {}, "source": [ "#### Make the Dashbaord\n", "\n", "The numerical methods for this dashboard are:\n", "\n", "1. make a covariance matrix\n", "2. sample jointly from the multiGaussian distribution based on this covariance matrix\n", "3. standardize the samples to correct the mean and variance to 0.0 and 1.0 repsectively\n", "4. calculate the actual covariance matrix / this ensures that the covariance matrix is positive semidefiniate (because it is based on actual data)\n", "5. calculate the Eigen values and vectors, and sort by descending Eigen values\n", "6. plot the original feature variances, Eigen vectors and values." ] }, { "cell_type": "code", "execution_count": 2, "id": "36698d45", "metadata": {}, "outputs": [], "source": [ "l = widgets.Text(value=' PCA Eigen Vector / Component Loadings Demo, Prof. Michael Pyrcz, The University of Texas at Austin',\n", " layout=Layout(width='900px', height='30px'))\n", "# P_happening_label = widgets.Text(value='Probability of Happening',layout=Layout(width='50px',height='30px',line-size='0 px'))\n", "cstr = widgets.FloatSlider(min=0.0, max = 1.0, value=0.0, step = 0.1, description = r'$\\rho_{strength}$',orientation='horizontal', \n", " style = {'description_width':'initial','button_color':'green'},layout=Layout(width='600px',height='40px'),continuous_update=False,readout_format='.3f')\n", "\n", "ui_summary = widgets.HBox([cstr],)\n", "ui_summary1 = widgets.VBox([l,ui_summary],)\n", "\n", "def run_plot_summary(cstr):\n", " \n", " m = 4;\n", " \n", " mean = np.zeros((m)) # make inputs for multivariate dataset\n", " #cov = np.zeros((m,m))\n", " cov = np.full((m,m),0.0)\n", " for i in range(0,m):\n", " cov[i,i] = 1.0\n", " cov[0,1] = cov[1,0] = 0.99*cstr; cov[1,2] = cov[2,1] = -0.9*cstr; cov[0,2] = cov[2,0] = -0.7*cstr;\n", " \n", " data = np.random.multivariate_normal(mean = mean, cov = cov, size = 1000) # draw samples from MV Gaussian\n", " data = StandardScaler(copy=True, with_mean=True, with_std=True).fit(data).transform(data)\n", " \n", " cov_actual = np.cov(data,rowvar = False)\n", " \n", " eigen_values,eigen_vectors = eig(cov_actual) # Eigen values and vectors \n", " sorted_indices = np.argsort(-eigen_values)\n", " sorted_eigen_vectors = eigen_vectors[:, sorted_indices]\n", " sorted_eigen_values = np.sort(-eigen_values)*-1\n", " \n", " fig = plt.figure(figsize=(6, 6))\n", " gs = fig.add_gridspec(2,2 ,width_ratios=(1.0, 1.0))\n", " \n", " plt_center = fig.add_subplot(gs[1, 1])\n", " plt_x = fig.add_subplot(gs[1, 0],sharey=plt_center) \n", " plt_y = fig.add_subplot(gs[0, 1],sharex=plt_center) \n", " plt_extra = fig.add_subplot(gs[0, 0]) \n", " \n", " for i in range(0,m):\n", " for j in range(0,m):\n", " color = (sorted_eigen_vectors[j,i] + 1.0)/(2.0)\n", " plt_center.add_patch(Rectangle((i-0.5,j-0.5), 1, 1,color = plt.cm.RdGy_r(color),fill=True))\n", " \n", " if abs(sorted_eigen_vectors[j,i]) > 0.5:\n", " plt_center.annotate(np.round(sorted_eigen_vectors[j,i],1),(i-0.1,j-0.05),color='white')\n", " else:\n", " plt_center.annotate(np.round(sorted_eigen_vectors[j,i],1),(i-0.1,j-0.05))\n", " \n", " plt_center.set_xlim([-0.5,3.5]); plt_center.set_ylim([-0.5,3.5])\n", " plt_center.set_xticks([0,1, 2, 3],[1,2,3,4]); plt_center.set_yticks([0,1, 2, 3],[1,2,3,4])\n", " for x in np.arange(0.5,3.5,1.0):\n", " plt_center.plot([x,x],[-0.5,3.5],c='black',lw=3)\n", " plt_center.plot([-0.5,3.5],[x,x],c='black',lw=1,ls='--')\n", " plt_center.set_title('Eigen Vectors / Principal Component Loadings') \n", " plt_center.set_xlabel('Eigen Vector'); plt_center.set_ylabel('Feature')\n", " \n", " plt_x.barh(y=np.array([0,1,2,3],dtype='float'),width=np.var(data,axis=0),color='darkorange',edgecolor='black')\n", " plt_x.set_xlim([3.0,0]); plt_x.set_yticks([0,1, 2, 3],[1,2,3,4])\n", " plt_x.plot([1,1],[-0.5,3.5],c='black',ls='--'); plt_x.annotate('Equal Variance',(1.13,2.6),rotation=90.0,size=9)\n", " plt_x.set_ylabel('Feature'); plt_x.set_xlabel('Variance')\n", " plt_x.set_title('Original Feature Variance') \n", " plt_x.grid(axis='x',which='minor', color='#EEEEEE', linestyle=':', linewidth=0.5)\n", " plt_x.grid(axis='x',which='major', color='#DDDDDD', linewidth=0.8); plt_x.minorticks_on()\n", " for x in np.arange(0.5,3.5,1.0):\n", " plt_x.plot([-0.5,3.5],[x,x],c='black',lw=1,ls='--')\n", " \n", " plt_y.bar(x=np.array([0,1,2,3],dtype='float'),height=sorted_eigen_values,color='darkorange',edgecolor='black')\n", " plt_y.set_ylim([0,3.0]); plt_y.set_xticks([0,1, 2, 3],[1,2,3,4]); \n", " plt_y.plot([-0.5,3.5],[1,1],c='black',ls='--'); plt_y.annotate('Equal Variance',(2.55,1.05),size=9)\n", " plt_y.set_xlabel('Eigen Value'); plt_y.set_ylabel('Variance')\n", " plt_y.set_title('Sorted, Projected Feature Variance') \n", " plt_y.grid(axis='y',which='minor', color='#EEEEEE', linestyle=':', linewidth=0.5)\n", " plt_y.grid(axis='y',which='major', color='#DDDDDD', linewidth=0.8); plt_y.minorticks_on() \n", " for x in np.arange(0.5,3.5,1.0):\n", " plt_y.plot([x,x],[-0.5,3.5],c='black',lw=3)\n", "\n", " for i in range(0,m):\n", " for j in range(0,m):\n", " color = (cov_actual[j,i] + 1.0)/(2.0)\n", " plt_extra.add_patch(Rectangle((i-0.5,j-0.5), 1, 1,color = plt.cm.BrBG(color),fill=True))\n", " \n", " plt_extra.set_xlim([-0.5,3.5]); plt_extra.set_ylim([3.5,-0.5])\n", " plt_extra.set_xticks([0,1, 2, 3],[1,2,3,4]); plt_extra.set_yticks([0,1, 2, 3],[1,2,3,4])\n", " for x in np.arange(0.5,3.5,1.0):\n", " plt_extra.plot([x,x],[-0.5,3.5],c='black',lw=2)\n", " plt_extra.plot([-0.5,3.5],[x,x],c='black',lw=2)\n", " plt_extra.set_title('Covariance Matrix') \n", " \n", " cplt_extra = make_axes_locatable(plt_extra).append_axes('left', size='5%', pad=0.3)\n", " fig.colorbar(mpl.cm.ScalarMappable(norm=mpl.colors.Normalize(vmin=-1.0, vmax=1.0), cmap=plt.cm.BrBG),\n", " cax=cplt_extra, orientation='vertical')\n", " cplt_extra.yaxis.set_ticks_position('left')\n", " \n", " plt.subplots_adjust(left=0.0, bottom=0.0, right=1.51, top=1.50, wspace=0.2, hspace=0.2); plt.show()\n", " \n", "interactive_plot_summary = widgets.interactive_output(run_plot_summary, {'cstr':cstr,})\n", "interactive_plot_summary.clear_output(wait = True) " ] }, { "cell_type": "markdown", "id": "9f1f60e1", "metadata": {}, "source": [ "### Interactive Principal Components Analysis, Component Loadings & Variance Explained Demostration\n", "\n", "* add data correlation / redundancy and observe the impact on the component loadings (Eigen vectors) and variance explained (Eigen values).\n", "\n", "#### Michael Pyrcz, Professor, The University of Texas at Austin \n", "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n", "\n", "The Inputs: **$\\rho_{strenth}$**: the strength of the correlation between features, scaler applied to $X_1$, $X_2$, & $X_3$ correlation." ] }, { "cell_type": "code", "execution_count": 3, "id": "b2108939", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "408af31c27ac462baa3f1779d3c164ca", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(Text(value=' PCA Eigen Vector / Component Loadings Demo…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "e78bad5d43544f04a40fc3dc26cc7b03", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Output()" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(ui_summary1, interactive_plot_summary) # display the interactive plot" ] }, { "cell_type": "markdown", "id": "15b997fe", "metadata": {}, "source": [ "#### Comments\n", "\n", "This was an interactive demonstrations of the Eigen values (variance explained) and Eigen vectors (component loadings) for Principal Components Analysis (PCA) with varianble between feature correlation. \n", "\n", "I have many other demonstrations on data analytics and machine learning, e.g. on the basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations, trend modeling and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy. \n", " \n", "I hope this was helpful,\n", "\n", "*Michael*\n", "\n", "#### The Author:\n", "\n", "### Michael J Pyrcz, Professor, The University of Texas at Austin \n", "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n", "\n", "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n", "\n", "For more about Michael check out these links:\n", "\n", "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n", "\n", "#### Want to Work Together?\n", "\n", "I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n", "\n", "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n", "\n", "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n", "\n", "* I can be reached at mpyrcz@austin.utexas.edu.\n", "\n", "I'm always happy to discuss,\n", "\n", "*Michael*\n", "\n", "Michael Pyrcz, Ph.D., P.Eng. Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n", "\n", "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d23f7326", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }