
Merge pull request #11 from mozhgan-kch/nways_md

Nways md
Bharatkumar Sharma 3 years ago
parent commit f7a9033ad1

+ 2 - 2
README.md

@@ -10,8 +10,8 @@ This repository consists of GPU bootcamp material for both HPC and AI:
 # System Requirements
 Each lab contains Docker and Singularity definition files. Follow the README file inside each lab for instructions on how to build the container and run the labs inside it. 
 
-# Slides:
-The slides associated with these training materials can be downloaded from [Google Slides](https://drive.google.com/drive/folders/1laRYdu6mtSA29M6Xthc1jP8AEOtVnbBo?usp=sharing)
+<!--# Slides:
+The slides associated with these training materials can be downloaded from [Google Slides](https://drive.google.com/drive/folders/1laRYdu6mtSA29M6Xthc1jP8AEOtVnbBo?usp=sharing)-->
 
 ## Questions?
 Please join [OpenACC Slack Channel](https://openacclang.slack.com/messages/openaccusergroup) for questions.

+ 2 - 2
hpc/README.md

@@ -6,9 +6,9 @@
 Each lab contains Docker and Singularity definition files. Follow the README file inside each lab for instructions on how to build the container and run the labs inside it. 
 
 
-# Slides:
+<!--# Slides:
 The slides associated with these training materials can be downloaded from [Google Slides](https://drive.google.com/drive/folders/1nYd_oHbmA4cxdDPesg5CwQkrvr0E3ruf?usp=sharing)
-
+-->
 
 ## Questions?
 Please join [OpenACC Slack Channel](https://openacclang.slack.com/messages/openaccusergroup) for questions.

BIN
hpc/nways/nways_labs/nways_MD/English/C/jupyter_notebook/images/compute_collapse.png


BIN
hpc/nways/nways_labs/nways_MD/English/C/jupyter_notebook/images/compute_collapse_roofline.png


BIN
hpc/nways/nways_labs/nways_MD/English/C/jupyter_notebook/images/compute_split.png


BIN
hpc/nways/nways_labs/nways_MD/English/C/jupyter_notebook/images/openmp_feedback_offload_split.png


BIN
hpc/nways/nways_labs/nways_MD/English/C/jupyter_notebook/images/openmp_gpu_collapse.png


+ 7 - 92
hpc/nways/nways_labs/nways_MD/English/C/jupyter_notebook/openmp/nways_openmp.ipynb

@@ -445,98 +445,6 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "\n",
-    "# Optional Exercise\n",
-    "\n",
-    "## collapse clause\n",
-    "\n",
-    "Specifying the collapse(n) clause takes the next n tightly-nested loops, folds them into one, and applies the OpenMP directives to the new loop. Collapsing loops means that two loops of trip counts N and M respectively will be automatically turned\n",
-    "into a single loop with a trip count of N times M. By collapsing two or more parallel loops into a single\n",
-    "loop the compiler has an increased amount of parallelism to use when mapping the code to the device. On\n",
-    "highly parallel architectures, such as GPUs, this can result in improved performance\n",
-    "\n",
-    "Try using the collapse clause and observe any performance difference. How much this optimization will speed-up the code will vary according to the application and the target accelerator, but it is not uncommon to see large speed-ups by using collapse on loop nests.\n",
-    "\n",
-    "Sample usage of collapse clause is given as follows:\n",
-    "\n",
-    "```cpp\n",
-    "#pragma omp target teams distribute parallel for collapse(2) \n",
-    "for (int i = 0; i < N; i++ )\n",
-    "{\n",
-    "    for (int j=0;j< N;j++)\n",
-    "        < loop code >\n",
-    "} \n",
-    "```\n",
-    "\n",
-    "Now, lets start modifying the original code and add the collapse clause. From the top menu, click on *File*, and *Open* `rdf.cpp` and `dcdread.h` from the current directory at `C/source_code/openmp` directory. Remember to **SAVE** your code after changes, before running below cells."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#compile for Tesla GPU\n",
-    "!cd ../../source_code/openmp && nvc++ -mp=gpu -Minfo=mp -o rdf rdf.cpp "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Make sure to validate the output by running the executable and validate the output."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#Run on Nvidia GPU and check the output\n",
-    "!cd ../../source_code/openmp && ./rdf && cat Pair_entropy.dat"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The output should be the following:\n",
-    "\n",
-    "```\n",
-    "s2 value is -2.43191\n",
-    "s2bond value is -3.87014\n",
-    "```"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#profile and see output of nvptx\n",
-    "!cd ../../source_code/openmp && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o rdf_collapse ./rdf"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Let's checkout the profiler's report. [Download the profiler output](../../source_code/openmp/rdf_collapse.qdrep) and open it via the GUI. Have a look at the example expected profiler report below:\n",
-    "\n",
-    "<img src=\"../images/openmp_gpu_collapse.png\">\n",
-    "\n",
-    "Compare the execution time for the `Pair_Calculation` from the NVTX row (annotated in Red rectangle in the example screenshot) with the previous section. It is clear the using collapse clause improved the performance by extracting more parallelisim.\n",
-    "\n",
-    "Feel free to checkout the [solution](../../source_code/openmp/SOLUTION/rdf_offload_collapse.cpp) to help you understand better."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
     "## Post-Lab Summary\n",
     "\n",
     "If you would like to download this lab for later viewing, it is recommend you go to your browsers File menu (not the Jupyter notebook file menu) and save the complete web page.  This will ensure the images are copied down as well. You can also execute the following cell block to create a zip-file of the files you've been working on, and download it with the link below."
@@ -560,6 +468,10 @@
    "source": [
     "**After** executing the above zip command, you should be able to download the zip file [here](../nways_files.zip). Let us now go back to parallelizing our code using other approaches.\n",
     "\n",
+    "<!--\n",
+    "**IMPORTANT**: If you would like to continue and optimize this application further with OpenMP, please click on the **NEXT** button, otherwise click on **HOME** to go back to the main notebook for *N ways of GPU programming for MD* code.\n",
+    "-->\n",
+    "\n",
     "**IMPORTANT**: Please click on **HOME** to go back to the main notebook for *N ways of GPU programming for MD* code.\n",
     "\n",
     "-----\n",
@@ -568,6 +480,8 @@
     "\n",
     "-----\n",
     "\n",
+    "<!-- <p style=\"text-align:center;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em\"> <a href=../../../nways_MD_start.ipynb>HOME</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style=\"float:center\"> <a href=nways_openmp_opt.ipynb>NEXT</a></span> </p>\n",
+    "-->\n",
     "\n",
     "# Links and Resources\n",
     "[OpenMP Programming Model](https://computing.llnl.gov/tutorials/openMP/)\n",
@@ -576,6 +490,7 @@
     "\n",
     "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n",
     "\n",
+    "\n",
     "**NOTE**: To be able to see the Nsight System profiler output, please download Nsight System latest version from [here](https://developer.nvidia.com/nsight-systems).\n",
     "\n",
     "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n",

File diff suppressed because it is too large
+ 490 - 0
hpc/nways/nways_labs/nways_MD/English/C/jupyter_notebook/openmp/nways_openmp_opt.ipynb


+ 124 - 126
hpc/nways/nways_labs/nways_MD/English/C/source_code/cudac/rdf.cu

@@ -82,131 +82,129 @@ int main(int argc , char* argv[] )
 
 	//Todo: Allocate memory on GPU.
 	HANDLE_ERROR(cudaMalloc(); //For d_x
-			HANDLE_ERROR(cudaMalloc(); //For d_y
-				HANDLE_ERROR(cudaMalloc(); //For d_z
-					HANDLE_ERROR(cudaMalloc(); //For d_g2
-
-						HANDLE_ERROR (cudaPeekAtLastError());
-
-						memset(h_g2,0,sizebin);
-
-
-
-						/////////reading cordinates//////////////////////////////////////////////
-						nvtxRangePush("Read_File");
-						double ax[numatm],ay[numatm],az[numatm];
-						for (int i=0;i<nconf;i++) {
-						dcdreadframe(ax,ay,az,infile,numatm,xbox,ybox,zbox);
-						for (int j=0;j<numatm;j++){
-						h_x[i*numatm+j]=ax[j];
-						h_y[i*numatm+j]=ay[j];
-						h_z[i*numatm+j]=az[j];
-						}
-						}
-						nvtxRangePop(); //pop for REading file
-
-
-						nvtxRangePush("Pair_Calculation");
-						//Todo: Copy the data from Host to Device before calculation on GPU
-						HANDLE_ERROR(cudaMemcpy(dest, source, ,));
-						HANDLE_ERROR(cudaMemcpy(dest, source, , ));
-						HANDLE_ERROR(cudaMemcpy(dest, source, , ));
-						HANDLE_ERROR(cudaMemcpy(dest, source, , ));
-
-						cout<<"Reading of input file and transfer to gpu is completed"<<endl;
-						//////////////////////////////////////////////////////////////////////////
-
-						near2=nthreads*(int(0.5*numatm*(numatm-1)/nthreads)+1);
-						unsigned long long int nblock = (near2/nthreads);
-
-						cout<<"Initial blocks are "<<nblock<<" "<<", now changing to ";
-
-						int maxblock=65535;
-						int bl;
-						int blockloop= int(nblock/maxblock);
-						if (blockloop != 0) {
-							nblock=maxblock;
-						}
-						cout<<nblock<<" and will run over "<<(blockloop+1)<<" blockloops"<<endl;
-
-						for (bl=0;bl<(blockloop+1);bl++) {
-							//cout <<bl<<endl;
-							//Todo: Fill the number of blocks and threads and pass the right device pointers
-							pair_gpu<<< , >>> (, , , , numatm, nconf, xbox, ybox, zbox, nbin, bl);
-
-							HANDLE_ERROR (cudaPeekAtLastError());
-							HANDLE_ERROR(cudaDeviceSynchronize());
-						}
-
-						//Todo: Copy d_ge back from Device to Host
-						HANDLE_ERROR(cudaMemcpy(dest, source, , ));
-
-						nvtxRangePop(); //Pop for Pair Calculation
-
-						double pi=acos(-1.0l);
-						double rho=(numatm)/(xbox*ybox*zbox);
-						double norm=(4.0l*pi*rho)/3.0l;
-						double rl,ru,nideal;
-						double g2[nbin];
-						double r,gr,lngr,lngrbond,s2=0.0l,s2bond=0.0l;
-						double box=min(xbox,ybox);
-						box=min(box,zbox);
-						double del=box/(2.0l*nbin);
-						nvtxRangePush("Entropy_Calculation");
-						for (int i=0;i<nbin;i++) {
-							//      cout<<i+1<<" "<<h_g2[i]<<endl;
-							rl=(i)*del;
-							ru=rl+del;
-							nideal=norm*(ru*ru*ru-rl*rl*rl);
-							g2[i]=(double)h_g2[i]/((double)nconf*(double)numatm*nideal);
-							r=(i)*del;
-							pairfile<<(i+0.5l)*del<<" "<<g2[i]<<endl;
-							if (r<2.0l) {
-								gr=0.0l;
-							}
-							else {
-								gr=g2[i];
-							}
-							if (gr<1e-5) {
-								lngr=0.0l;
-							}
-							else {
-								lngr=log(gr);
-							}
-
-							if (g2[i]<1e-6) {
-								lngrbond=0.0l;
-							}
-							else {
-								lngrbond=log(g2[i]);
-							}
-							s2=s2-2.0l*pi*rho*((gr*lngr)-gr+1.0l)*del*r*r;
-							s2bond=s2bond-2.0l*pi*rho*((g2[i]*lngrbond)-g2[i]+1.0l)*del*r*r;
-
-						}
-						nvtxRangePop(); //Pop for Entropy Calculation
-						stwo<<"s2 value is "<<s2<<endl;
-						stwo<<"s2bond value is "<<s2bond<<endl;
-
-
-
-						//Note: Freeing up the GPU memory
-						cout<<"\n\n\n#Freeing Device memory"<<endl;
-						HANDLE_ERROR(cudaFree(d_x));
-						HANDLE_ERROR(cudaFree(d_y));
-						HANDLE_ERROR(cudaFree(d_z));
-						HANDLE_ERROR(cudaFree(d_g2));
-
-						cout<<"#Freeing Host memory"<<endl;
-						HANDLE_ERROR(cudaFreeHost ( h_x ) );
-						HANDLE_ERROR(cudaFreeHost ( h_y ) );
-						HANDLE_ERROR(cudaFreeHost ( h_z ) );
-						HANDLE_ERROR(cudaFreeHost ( h_g2 ) );
-
-						cout<<"#Number of atoms processed: "<<numatm<<endl<<endl;
-						cout<<"#Number of confs processed: "<<nconf<<endl<<endl;
-						cout<<"#number of threads used: "<<nthreads<<endl<<endl;
-						return 0;
+	HANDLE_ERROR(cudaMalloc(); //For d_y
+	HANDLE_ERROR(cudaMalloc(); //For d_z
+	HANDLE_ERROR(cudaMalloc(); //For d_g2
+
+	HANDLE_ERROR (cudaPeekAtLastError());
+
+	memset(h_g2,0,sizebin);
+
+	/////////reading coordinates//////////////////////////////////////////////
+	nvtxRangePush("Read_File");
+	double ax[numatm],ay[numatm],az[numatm];
+	for (int i=0;i<nconf;i++) {
+	dcdreadframe(ax,ay,az,infile,numatm,xbox,ybox,zbox);
+	for (int j=0;j<numatm;j++){
+	h_x[i*numatm+j]=ax[j];
+	h_y[i*numatm+j]=ay[j];
+	h_z[i*numatm+j]=az[j];
+	}
+	}
+	nvtxRangePop(); //pop for reading file
+
+
+	nvtxRangePush("Pair_Calculation");
+	//Todo: Copy the data from Host to Device before calculation on GPU
+	HANDLE_ERROR(cudaMemcpy(dest, source, ,));
+	HANDLE_ERROR(cudaMemcpy(dest, source, , ));
+	HANDLE_ERROR(cudaMemcpy(dest, source, , ));
+	HANDLE_ERROR(cudaMemcpy(dest, source, , ));
+
+	cout<<"Reading of input file and transfer to gpu is completed"<<endl;
+	//////////////////////////////////////////////////////////////////////////
+
+	near2=nthreads*(int(0.5*numatm*(numatm-1)/nthreads)+1);
+	unsigned long long int nblock = (near2/nthreads);
+
+	cout<<"Initial blocks are "<<nblock<<" "<<", now changing to ";
+
+	int maxblock=65535;
+	int bl;
+	int blockloop= int(nblock/maxblock);
+	if (blockloop != 0) {
+		nblock=maxblock;
+	}
+	cout<<nblock<<" and will run over "<<(blockloop+1)<<" blockloops"<<endl;
+
+	for (bl=0;bl<(blockloop+1);bl++) {
+		//cout <<bl<<endl;
+		//Todo: Fill the number of blocks and threads and pass the right device pointers
+		pair_gpu<<< , >>> (, , , , numatm, nconf, xbox, ybox, zbox, nbin, bl);
+
+		HANDLE_ERROR (cudaPeekAtLastError());
+		HANDLE_ERROR(cudaDeviceSynchronize());
+	}
+
+	//Todo: Copy d_ge back from Device to Host
+	HANDLE_ERROR(cudaMemcpy(dest, source, , ));
+
+	nvtxRangePop(); //Pop for Pair Calculation
+
+	double pi=acos(-1.0l);
+	double rho=(numatm)/(xbox*ybox*zbox);
+	double norm=(4.0l*pi*rho)/3.0l;
+	double rl,ru,nideal;
+	double g2[nbin];
+	double r,gr,lngr,lngrbond,s2=0.0l,s2bond=0.0l;
+	double box=min(xbox,ybox);
+	box=min(box,zbox);
+	double del=box/(2.0l*nbin);
+	nvtxRangePush("Entropy_Calculation");
+	for (int i=0;i<nbin;i++) {
+		//      cout<<i+1<<" "<<h_g2[i]<<endl;
+		rl=(i)*del;
+		ru=rl+del;
+		nideal=norm*(ru*ru*ru-rl*rl*rl);
+		g2[i]=(double)h_g2[i]/((double)nconf*(double)numatm*nideal);
+		r=(i)*del;
+		pairfile<<(i+0.5l)*del<<" "<<g2[i]<<endl;
+		if (r<2.0l) {
+			gr=0.0l;
+		}
+		else {
+			gr=g2[i];
+		}
+		if (gr<1e-5) {
+			lngr=0.0l;
+		}
+		else {
+			lngr=log(gr);
+		}
+
+		if (g2[i]<1e-6) {
+			lngrbond=0.0l;
+		}
+		else {
+			lngrbond=log(g2[i]);
+		}
+		s2=s2-2.0l*pi*rho*((gr*lngr)-gr+1.0l)*del*r*r;
+		s2bond=s2bond-2.0l*pi*rho*((g2[i]*lngrbond)-g2[i]+1.0l)*del*r*r;
+
+	}
+	nvtxRangePop(); //Pop for Entropy Calculation
+	stwo<<"s2 value is "<<s2<<endl;
+	stwo<<"s2bond value is "<<s2bond<<endl;
+
+
+
+	//Note: Freeing up the GPU memory
+	cout<<"\n\n\n#Freeing Device memory"<<endl;
+	HANDLE_ERROR(cudaFree(d_x));
+	HANDLE_ERROR(cudaFree(d_y));
+	HANDLE_ERROR(cudaFree(d_z));
+	HANDLE_ERROR(cudaFree(d_g2));
+
+	cout<<"#Freeing Host memory"<<endl;
+	HANDLE_ERROR(cudaFreeHost ( h_x ) );
+	HANDLE_ERROR(cudaFreeHost ( h_y ) );
+	HANDLE_ERROR(cudaFreeHost ( h_z ) );
+	HANDLE_ERROR(cudaFreeHost ( h_g2 ) );
+
+	cout<<"#Number of atoms processed: "<<numatm<<endl<<endl;
+	cout<<"#Number of confs processed: "<<nconf<<endl<<endl;
+	cout<<"#number of threads used: "<<nthreads<<endl<<endl;
+	return 0;
 }
 
 //Todo: Convert the call to GPU call by adding right keyword
@@ -227,7 +225,7 @@ void pair_gpu(
 	double n;
 
 	//Todo: Write indexing logic using threads and blocks
-	int i = 
+	int i =
 
 
 		int maxi = min(int(0.5*numatm*(numatm-1)-(bl*65535*128)),(65535*128));
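For readers working through the `Todo` placeholders above, one plausible way to fill them in is sketched below. This is an unofficial fragment meant to be read against `rdf.cu`, not the lab's SOLUTION file; it assumes the `d_x`/`d_y`/`d_z`/`d_g2` device pointers and the `sizef`/`sizebin` byte counts declared earlier in that file:

```cpp
// Unofficial sketch of the Todo placeholders in rdf.cu. Assumes the device
// pointers (d_x, d_y, d_z, d_g2) and byte counts (sizef, sizebin) declared
// earlier in the file; not the lab's official SOLUTION.

// Allocate device memory, one buffer per host array.
HANDLE_ERROR(cudaMalloc(&d_x, sizef));    // For d_x
HANDLE_ERROR(cudaMalloc(&d_y, sizef));    // For d_y
HANDLE_ERROR(cudaMalloc(&d_z, sizef));    // For d_z
HANDLE_ERROR(cudaMalloc(&d_g2, sizebin)); // For d_g2

// Copy the data from host to device before the pair calculation.
HANDLE_ERROR(cudaMemcpy(d_x, h_x, sizef, cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(d_y, h_y, sizef, cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(d_z, h_z, sizef, cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(d_g2, h_g2, sizebin, cudaMemcpyHostToDevice));

// Inside the block loop: launch with the computed counts and device pointers.
pair_gpu<<<nblock, nthreads>>>(d_x, d_y, d_z, d_g2,
                               numatm, nconf, xbox, ybox, zbox, nbin, bl);

// After the block loop: copy the histogram back from device to host.
HANDLE_ERROR(cudaMemcpy(h_g2, d_g2, sizebin, cudaMemcpyDeviceToHost));

// Inside pair_gpu: a conventional global thread index; the existing maxi
// logic bounds it against the pairs remaining in the current bl-th pass.
int i = blockIdx.x * blockDim.x + threadIdx.x;
```

Because `nblock` is capped at 65535, the `blockloop` passes cover the remaining atom pairs in chunks, which is why the kernel also receives `bl`.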

+ 195 - 0
hpc/nways/nways_labs/nways_MD/English/C/source_code/openmp/SOLUTION/rdf_offload_collapse_num.cpp

@@ -0,0 +1,195 @@
+#include <stdio.h>
+#include <iostream>
+#include <fstream>
+#include <math.h>
+#include <cstring>
+#include <cstdio>
+#include <iomanip>
+#include <omp.h>
+#include "dcdread.h"
+#include <assert.h>
+#include <nvtx3/nvToolsExt.h>
+
+void pair_gpu(const double *d_x, const double *d_y, const double *d_z,
+              unsigned int *d_g2, int numatm, int nconf,
+              const double xbox, const double ybox, const double zbox,
+              int d_bin);
+
+int main(int argc, char *argv[])
+{
+    double xbox, ybox, zbox;
+    double *h_x, *h_y, *h_z;
+    unsigned int *h_g2;
+    int nbin;
+    int numatm, nconf, inconf;
+    string file;
+
+    ///////////////////////////////////////////////////////////////
+
+    inconf = 10;
+    nbin = 2000;
+    file = "../input/alk.traj.dcd";
+    ///////////////////////////////////////
+    std::ifstream infile;
+    infile.open(file.c_str());
+    if (!infile)
+    {
+        cout << "file " << file.c_str() << " not found\n";
+        return 1;
+    }
+    assert(infile);
+
+    ofstream pairfile, stwo;
+    pairfile.open("RDF.dat");
+    stwo.open("Pair_entropy.dat");
+
+    /////////////////////////////////////////////////////////
+    dcdreadhead(&numatm, &nconf, infile);
+    cout << "Dcd file has " << numatm << " atoms and " << nconf << " frames" << endl;
+    if (inconf > nconf)
+        cout << "nconf is reset to " << nconf << endl;
+    else
+    {
+        nconf = inconf;
+    }
+    cout << "Calculating RDF for " << nconf << " frames" << endl;
+    ////////////////////////////////////////////////////////
+
+    unsigned long long int sizef = nconf * numatm * sizeof(double);
+    unsigned long long int sizebin = nbin * sizeof(unsigned int);
+
+    h_x = (double *)malloc(sizef);
+    h_y = (double *)malloc(sizef);
+    h_z = (double *)malloc(sizef);
+    h_g2 = (unsigned int *)malloc(sizebin);
+
+    memset(h_g2, 0, sizebin);
+
+    /////////reading coordinates//////////////////////////////////////////////
+    nvtxRangePush("Read_File");
+
+    double ax[numatm], ay[numatm], az[numatm];
+    for (int i = 0; i < nconf; i++)
+    {
+        dcdreadframe(ax, ay, az, infile, numatm, xbox, ybox, zbox);
+        for (int j = 0; j < numatm; j++)
+        {
+            h_x[i * numatm + j] = ax[j];
+            h_y[i * numatm + j] = ay[j];
+            h_z[i * numatm + j] = az[j];
+        }
+    }
+    nvtxRangePop(); //pop for reading file
+    cout << "Reading of input file is completed" << endl;
+//////////////////////////////////////////////////////////////////////////
+#pragma omp target data map(h_x [0:nconf * numatm], h_y [0:nconf * numatm], h_z [0:nconf * numatm], h_g2 [0:nbin])
+    {
+        nvtxRangePush("Pair_Calculation");
+        pair_gpu(h_x, h_y, h_z, h_g2, numatm, nconf, xbox, ybox, zbox, nbin);
+        nvtxRangePop(); //Pop for Pair Calculation
+    }
+    ////////////////////////////////////////////////////////////////////////
+    double pi = acos(-1.0);
+    double rho = (numatm) / (xbox * ybox * zbox);
+    double norm = (4.0l * pi * rho) / 3.0l;
+    double rl, ru, nideal;
+    double g2[nbin];
+    double r, gr, lngr, lngrbond, s2 = 0.0l, s2bond = 0.0l;
+    double box = min(xbox, ybox);
+    box = min(box, zbox);
+    double del = box / (2.0l * nbin);
+    nvtxRangePush("Entropy_Calculation");
+    for (int i = 0; i < nbin; i++)
+    {
+        rl = (i)*del;
+        ru = rl + del;
+        nideal = norm * (ru * ru * ru - rl * rl * rl);
+        g2[i] = (double)h_g2[i] / ((double)nconf * (double)numatm * nideal);
+        r = (i)*del;
+        pairfile << (i + 0.5l) * del << " " << g2[i] << endl;
+        if (r < 2.0l)
+        {
+            gr = 0.0l;
+        }
+        else
+        {
+            gr = g2[i];
+        }
+        if (gr < 1e-5)
+        {
+            lngr = 0.0l;
+        }
+        else
+        {
+            lngr = log(gr);
+        }
+
+        if (g2[i] < 1e-6)
+        {
+            lngrbond = 0.0l;
+        }
+        else
+        {
+            lngrbond = log(g2[i]);
+        }
+        s2 = s2 - 2.0l * pi * rho * ((gr * lngr) - gr + 1.0l) * del * r * r;
+        s2bond = s2bond - 2.0l * pi * rho * ((g2[i] * lngrbond) - g2[i] + 1.0l) * del * r * r;
+    }
+    nvtxRangePop(); //Pop for Entropy Calculation
+    stwo << "s2 value is " << s2 << endl;
+    stwo << "s2bond value is " << s2bond << endl;
+
+    cout << "#Freeing Host memory" << endl;
+    free(h_x);
+    free(h_y);
+    free(h_z);
+    free(h_g2);
+
+    cout << "#Number of atoms processed: " << numatm << endl
+         << endl;
+    cout << "#Number of confs processed: " << nconf << endl
+         << endl;
+    return 0;
+}
+void pair_gpu(const double *d_x, const double *d_y, const double *d_z,
+              unsigned int *d_g2, int numatm, int nconf,
+              const double xbox, const double ybox, const double zbox, int d_bin)
+{
+    double r, cut, dx, dy, dz;
+    int ig2;
+    double box;
+    int myround;
+    box = min(xbox, ybox);
+    box = min(box, zbox);
+
+    double del = box / (2.0 * d_bin);
+    cut = box * 0.5;
+    int count = 0;
+    printf("\n %d %d ", nconf, numatm);
+    for (int frame = 0; frame < nconf; frame++)
+    {
+        printf("\n %d  ", frame);
+#pragma omp target teams distribute parallel for private(dx, dy, dz, r, ig2) collapse(2) num_threads(256)
+        for (int id1 = 0; id1 < numatm; id1++)
+        {
+            for (int id2 = 0; id2 < numatm; id2++)
+            {
+                dx = d_x[frame * numatm + id1] - d_x[frame * numatm + id2];
+                dy = d_y[frame * numatm + id1] - d_y[frame * numatm + id2];
+                dz = d_z[frame * numatm + id1] - d_z[frame * numatm + id2];
+
+                dx = dx - xbox * (round(dx / xbox));
+                dy = dy - ybox * (round(dy / ybox));
+                dz = dz - zbox * (round(dz / zbox));
+
+                r = sqrtf(dx * dx + dy * dy + dz * dz);
+                if (r < cut)
+                {
+                    ig2 = (int)(r / del);
+#pragma omp atomic
+                    d_g2[ig2] = d_g2[ig2] + 1;
+                }
+            }
+        }
+    } //frame ends
+}
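
The new SOLUTION above combines a `target data` region, which keeps the coordinate arrays resident on the device across the offloaded loop nest, with an `#pragma omp atomic` that protects the shared histogram from racing increments. A minimal self-contained sketch of the same pattern follows; the names `x`, `hist`, `N`, and `NBIN` are placeholders, not the lab's identifiers:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cmath>

// Minimal sketch of the pattern in rdf_offload_collapse_num.cpp: a target
// data region keeps arrays resident on the device, and an atomic update
// protects the shared histogram. x, hist, N, NBIN are placeholder names.
int main() {
    const int N = 4096, NBIN = 100;
    double *x = (double *)malloc(N * sizeof(double));
    unsigned int hist[NBIN] = {0};
    for (int i = 0; i < N; i++) x[i] = (double)rand() / RAND_MAX;

    #pragma omp target data map(to: x[0:N]) map(tofrom: hist[0:NBIN])
    {
        #pragma omp target teams distribute parallel for collapse(2)
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                int bin = (int)(fabs(x[i] - x[j]) * NBIN);
                if (bin < NBIN) {
                    // Without the atomic, two threads incrementing the same
                    // bin concurrently could lose updates.
                    #pragma omp atomic
                    hist[bin]++;
                }
            }
        }
    }

    printf("hist[0] = %u\n", hist[0]);
    free(x);
    return 0;
}
```

The solution's extra `num_threads(256)` clause pins the thread-team size; it is a tuning knob, and the best value will vary by device.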