<p align="center">
    <img src="https://github.com/GeostatsGuy/GeostatsPy/blob/master/TCG_color_logo.png?raw=true" width="220" height="240" />

</p>

## P-P Plot Interactive Demonstration

### P-P (Probability-Probability) Plots in Python

An interactive demonstration of P-P plots for comparing the cumulative distribution functions (CDFs) of two distributions.

* A P-P plot evaluates both CDFs at the same data values and then scatter plots the resulting pairs of cumulative probabilities against each other.

* A lecture covering these concepts is available: [Q-Q plots and P-P plots](https://www.youtube.com/watch?v=RETZus4XBNM&list=PLG19vXLQHvSB-D4XKYieEku9GQMQyAzjJ&index=23&t=4s).
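
The mapping described in the first bullet above can be sketched on a toy example. This is a minimal sketch, not the dashboard's code: the helper name `ecdf_at` and the two small arrays are made up for illustration, and the cumulative probability is taken here as the fraction of samples less than or equal to each value.

```python
import numpy as np

def ecdf_at(sample, values):
    """Fraction of `sample` less than or equal to each entry of `values` (hypothetical helper)."""
    sample = np.sort(np.asarray(sample, dtype=float))
    # side='right' counts the samples <= each value
    return np.searchsorted(sample, values, side='right') / sample.size

# two made-up samples, for illustration only
x1 = np.array([0.10, 0.20, 0.30, 0.40])
x2 = np.array([0.20, 0.30, 0.40, 0.50])

values = np.array([0.15, 0.25, 0.35, 0.45])   # common values at which both CDFs are evaluated
p1 = ecdf_at(x1, values)                      # cumulative probabilities in sample 1
p2 = ecdf_at(x2, values)                      # cumulative probabilities in sample 2

for v, a, b in zip(values, p1, p2):
    print(f"x = {v:.2f}: P-P point = ({a:.2f}, {b:.2f})")
```

Each pair `(p1, p2)` is one point on the P-P plot. In this toy case the second sample is shifted to higher values, so its cumulative probabilities lag those of the first and the points fall below the 1:1 line.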

This interactive dashboard may be applied to support teaching data science.

#### Jason Bott, Undergraduate Student, The University of Texas at Austin

####  [GitHub](https://github.com/jasonbott124) | [GoogleScholar](https://scholar.google.com/citations?user=31Ae8UkAAAAJ&hl=en) | [LinkedIn](https://www.linkedin.com/in/jason-bott-a52944270/) | [Eportfolio](https://jasonseportfolio5.wordpress.com/) | Email: jbott@utexas.edu

#### Michael Pyrcz, Professor, The University of Texas at Austin 

##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig)  | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)


#### Importing Packages

We will need some standard packages. These should have been installed with Anaconda 3.

In [1]:
%matplotlib inline
from ipywidgets import interactive        # widgets and interactivity
from ipywidgets import widgets                            
from ipywidgets import Layout
from ipywidgets import Label
from ipywidgets import VBox, HBox

import numpy as np                        # ndarrays for gridded data
import pandas as pd                       # DataFrames for tabular data
from scipy import stats                   # inverse percentiles, percentileofscore function for P-P plots
import os                                 # set working directory, run executables

import matplotlib.pyplot as plt           # plotting
import matplotlib.gridspec as gridspec
plt.rc('axes', axisbelow=True)
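
Since `stats.percentileofscore` does most of the work for the P-P plot, a quick check of what it returns may help. The small list below is illustrative only. The function returns a percentile on a 0 to 100 scale, so dividing by 100 gives a cumulative probability.

```python
from scipy import stats

data = [1, 2, 3, 4]                                        # illustrative values, not from the demonstration

pct = stats.percentileofscore(data, 3)                     # default kind='rank'
strict = stats.percentileofscore(data, 3, kind='strict')   # counts only values strictly below the score
prob = pct / 100                                           # convert percent to cumulative probability

print(pct, strict, prob)
```

Here the default returns 75.0 (3 of the 4 values are at or below 3), while `kind='strict'` returns 50.0 (only 2 values are below 3).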

#### Widgets and Display
Next, we create the widgets and arrange them for display.

In [2]:
# interactive calculation of the sample set (control of source parametric distribution and number of samples)
l = widgets.Text(value='           Interactive P-P Plot | Jason Bott, Undergraduate Student, the University of Texas at Austin | Michael Pyrcz, Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'),continuous_update=True)

n1 = widgets.IntSlider(min=10, max = 1000, value = 100, step = 10, description = '$n_{1}$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False) # min of 10, an empty sample has no min or max
n1.style.handle_color = 'red'

m1 = widgets.FloatSlider(min=0.2, max = 0.8, value = 0.3, step = 0.1, description = r'$\overline{x}_{1}$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False)
m1.style.handle_color = 'red'

s1 = widgets.FloatSlider(min=0.0, max = 0.2, value = 0.03, step = 0.005, description = '$s_1$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False)
s1.style.handle_color = 'red'

ui1 = widgets.VBox([n1,m1,s1],)                               # basic widget formatting 

n2 = widgets.IntSlider(min=10, max = 1000, value = 100, step = 10, description = '$n_{2}$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False) # min of 10, an empty sample has no min or max
n2.style.handle_color = 'blue'

m2 = widgets.FloatSlider(min=0.2, max = 0.8, value = 0.2, step = 0.1, description = r'$\overline{x}_{2}$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False)
m2.style.handle_color = 'blue'

s2 = widgets.FloatSlider(min=0, max = 0.2, value = 0.03, step = 0.005, description = '$s_2$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False)
s2.style.handle_color = 'blue'

ui2 = widgets.VBox([n2,m2,s2],)                               # basic widget formatting 

nq = widgets.IntSlider(min=10, max = 1000, value = 100, step = 1, description = '$n_q$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=False)
nq.style.handle_color = 'gray'

# plot = widgets.Checkbox(value=False,description='Make Plot')

ui3 = widgets.VBox([nq,],)                                # basic widget formatting 

ui4 = widgets.HBox([ui1,ui2,ui3],)                               # basic widget formatting 

ui2 = widgets.VBox([l,ui4],)                                  # reuses the name ui2 for the full dashboard: title above the slider row

#### P-P plot Function

We create a function that draws the two samples, calculates the cumulative probability of each common data value in both samples, and plots the paired cumulative probabilities.

In [3]:
#function to take parameters, make sample and PP plot
def double_p(n1, m1, s1, n2, m2, s2, nq):
    
#     n1 = 100; mean1 = 0.35; stdev1 = 0.06 
#     n2 = 50; mean2 = 0.3; stdev2 = 0.05
    
    seed = 73073; #nq = 100
    xmin=0.0; xmax=0.6
    np.random.seed(seed=seed)
    
    X1 = np.random.normal(loc=m1,scale=s1,size=n1)
    X2 = np.random.normal(loc=m2,scale=s2,size=n2)
    
    min_X = min(X1.min(),X2.min())
    max_X = max(X1.max(),X2.max())
    
    X_values = np.linspace(min_X,max_X,nq)

    X1_cumul_probs = []; X2_cumul_probs = []

    for X in X_values:
        X1_cumul_probs.append(stats.percentileofscore(X1,X)/100)
        X2_cumul_probs.append(stats.percentileofscore(X2,X)/100)
    
    X1_cumul_probs = np.asarray(X1_cumul_probs); X2_cumul_probs = np.asarray(X2_cumul_probs)
    fig = plt.figure()
    spec = fig.add_gridspec(2, 3)

    #P-P plot
    ax0 = fig.add_subplot(spec[:, 1:])
    plt.scatter(X1_cumul_probs,X2_cumul_probs,color='darkorange',edgecolor='black',s=20,label='P-P plot')
    plt.plot([0,1.0],[0,1.0],ls='--',color='red')
    plt.grid(); plt.xlim([0.0,1.0]); plt.ylim([0.0,1.0]); plt.xlabel(r'$F_{X_1}(x)$ - Cumulative Probability'); plt.ylabel(r'$F_{X_2}(x)$ - Cumulative Probability'); 
    plt.title('P-P Plot'); plt.legend(loc='lower right')

    #Histogram
    ax10 = fig.add_subplot(spec[0, 0])
    plt.hist(X1,bins=np.linspace(xmin,xmax,30),color='red',alpha=0.5,edgecolor='black',label=r'$X_1$',density=True)
    plt.hist(X2,bins=np.linspace(xmin,xmax,30),color='yellow',alpha=0.5,edgecolor='black',label=r'$X_2$',density=True)
    plt.grid(); plt.xlim([xmin,xmax]); plt.ylim([0,15]); plt.xlabel('Porosity (fraction)'); plt.ylabel('Density')
    plt.title('Histograms'); plt.legend(loc='upper right')
    
    #CDF
    ax11 = fig.add_subplot(spec[1, 0])
    plt.scatter(np.sort(X1),np.linspace(0,1,len(X1)),color='red',alpha=0.5,edgecolor='black',s=30,label=r'$X_1$')
    plt.scatter(np.sort(X2),np.linspace(0,1,len(X2)),color='yellow',alpha=0.5,edgecolor='black',s=30,label=r'$X_2$')
    plt.grid(); plt.xlim([xmin,xmax]); plt.ylim([0,1]); plt.xlabel('Porosity (fraction)'); plt.title('CDFs'); plt.legend(loc='lower right')

    plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=1.4, wspace=0.3, hspace=0.3); plt.show()

interactive_plot = widgets.interactive_output(double_p, {'n1': n1, 'm1': m1, 's1': s1, 'n2': n2, 'm2': m2, 's2': s2, 'nq': nq}) #creates an object called interactive_plot that calls the double_p() function
interactive_plot.clear_output(wait = True) #reduce flickering by delaying plot updating
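
The loop over `X_values` can also be expressed as a sorted-array lookup. The following is a sketch of an alternative, not the function used above: `ecdf_vectorized` is a hypothetical helper, and it is expected to match `percentileofscore` only when the sample has no tied values (which holds almost surely for draws from a continuous distribution).

```python
import numpy as np
from scipy import stats

def ecdf_vectorized(sample, values):
    """Empirical cumulative probabilities of `values` in `sample` (hypothetical helper)."""
    ordered = np.sort(sample)
    # side='right' counts the samples <= each value
    return np.searchsorted(ordered, values, side='right') / ordered.size

rng = np.random.default_rng(73073)                 # seed value borrowed from the demonstration
X1 = rng.normal(loc=0.3, scale=0.03, size=100)     # same defaults as the sliders
X_values = np.linspace(X1.min(), X1.max(), 50)

# loop form, as in double_p(), against the vectorized form
loop_probs = np.array([stats.percentileofscore(X1, x) / 100 for x in X_values])
fast_probs = ecdf_vectorized(X1, X_values)

print(np.allclose(loop_probs, fast_probs))
```

For the sample sizes in this dashboard the loop is fast enough, so this is mainly a way to check what the loop computes.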

### P-P Plot for Comparing Distributions

* demonstration of P-P plots to compare distributions, while interactively varying the distributions

#### Jason Bott, Undergraduate Student and Michael Pyrcz, Professor, The University of Texas at Austin 

Let's make 2 random datasets, $\color{red}{X_1}$ and $\color{blue}{X_2}$ and calculate their P-P plot.

* **$\color{red}{n_1}$**, **$\color{blue}{n_2}$** number of samples, **$\color{red}{\overline{x}_1}$**, **$\color{blue}{\overline{x}_2}$** means and **$\color{red}{s_1}$**, **$\color{blue}{s_2}$** standard deviations of the 2 sample sets
* **$\color{grey}{n_q}$**: number of evenly spaced values, spanning the combined range of both samples, at which the cumulative probabilities are calculated
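
To see what the sliders change in the P-P plot, the same comparison can be made with parametric normal CDFs instead of random samples. This is an illustrative sketch that assumes the default slider settings (means 0.3 and 0.2, standard deviations 0.03) and uses `scipy.stats.norm.cdf` rather than the sampled values used by the dashboard.

```python
import numpy as np
from scipy import stats

m1, s1 = 0.3, 0.03      # default slider values for X1
m2, s2 = 0.2, 0.03      # default slider values for X2
nq = 100                # number of evaluation points

x = np.linspace(0.0, 0.6, nq)                   # porosity range used for the plots
F1 = stats.norm.cdf(x, loc=m1, scale=s1)        # cumulative probability under distribution 1
F2 = stats.norm.cdf(x, loc=m2, scale=s2)        # cumulative probability under distribution 2

# X2 has the lower mean, so at every x it has accumulated at least as much probability as X1
print(bool(np.all(F2 >= F1)))

# with identical parameters every P-P point falls on the 1:1 line
F2_same = stats.norm.cdf(x, loc=m1, scale=s1)
print(bool(np.allclose(F1, F2_same)))
```

With $F_{X_1}$ on the horizontal axis and $F_{X_2}$ on the vertical axis, a lower mean for $X_2$ pushes the points on or above the 1:1 line, while matching parameters put them on the line.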

In [4]:
display(ui2, interactive_plot) #displays the widgets and plots


#### Comments

This was a basic interactive demonstration of a P-P plot in Python.

#### The Authors:

### Jason Bott, Undergraduate Student, University of Texas at Austin
*Geostatistics, Geophysics, Polar Geophysics, Volcanism, Subglacial Volcanism, Exploration Geophysics*

Just a passionate student who enjoys all things geoscience. If you would like to contact me, I can be reached through email at: jbott@utexas.edu

For more about Jason check out these links:

####  [GitHub](https://github.com/jasonbott124) | [GoogleScholar](https://scholar.google.com/citations?user=31Ae8UkAAAAJ&hl=en) | [LinkedIn](https://www.linkedin.com/in/jason-bott-a52944270/) | [Eportfolio](https://jasonseportfolio5.wordpress.com/) 


### Michael Pyrcz, Associate Professor, University of Texas at Austin 
*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*

With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. 

For more about Michael check out these links:

#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig)  | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)

#### Want to Work With Michael?

I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.

* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! 

* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!

* I can be reached at mpyrcz@austin.utexas.edu.

I'm always happy to discuss,

*Michael*

Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin

