{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before we begin, let's execute the cell below to display information about the CUDA driver and GPUs running on the server by running the `nvidia-smi` command. To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nvidia-smi" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Learning objectives\n", "The **goal** of this lab is to:\n", "\n", "- Learn how to find bottlenecks and performance limiters using Nsight tools\n", "- Learn about ways to extract more parallelism in loops\n", "\n", "\n", "In this section, we will optimize the parallel [RDF](../serial/rdf_overview.ipynb) application using OpenMP offloading. Before we begin, feel free to have a look at the parallel version of the code and inspect it once again.\n", "\n", "[RDF Parallel Code](../../source_code/openmp/SOLUTION/rdf_offload.f90)\n", "\n", "Now, let's compile the code and profile it with Nsight Systems first." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#compile for Tesla GPU\n", "!cd ../../source_code/openmp && nvfortran -mp=gpu -Minfo=mp -o rdf nvtx.f90 SOLUTION/rdf_offload.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#profile the application with Nsight Systems\n", "!cd ../../source_code/openmp && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o rdf_offload ./rdf" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's check out the profiler's report. Download and save the report file by holding down Shift and Right-Clicking [Here](../../source_code/openmp/rdf_offload.qdrep) and open it via the GUI." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Further Optimization\n", "Currently, both our distributed and workshared parallelism come from the same loop. In other words, in this example, we used `teams` and `distribute` to spread parallelism across the full GPU. As a result, all of the parallelism comes from the outer loop. There are multiple ways to increase parallelism:\n", "\n", "- Moving the `parallel do` to the inner loop\n", "- Collapsing the loops together\n", "\n", "In the following sections, we will explore each of these methods.\n", "\n", "\n", "### Split Teams Distribute from Parallel Do\n", "\n", "In this method, we distribute the outer loop across thread teams and workshare the inner loop across the threads within each team. The example below shows an OpenMP target construct that splits the `teams distribute` from the `parallel do`; the code contains two levels of OpenMP offloading parallelism in a doubly nested loop.\n", "\n", "```fortran\n", "!$omp target teams distribute\n", "do i=1,N\n", "   !$omp parallel do\n", "   do j=1,N\n", "      ! loop body\n", "   end do\n", "end do\n", "```\n", "\n", "Look at the profiler report from the previous section again. From the timeline, have a close look at the kernel functions and check out the theoretical [occupancy](../GPU_Architecture_Terminologies.ipynb). As shown in the example screenshot below, the `nvkernel_MAIN__F1L98_1_` kernel has a theoretical occupancy of 50%. This clearly shows that occupancy is a limiting factor.\n", "*Occupancy* is a measure of how well the GPU compute resources are being utilized: the ratio of how much parallelism is actually running to how much parallelism the hardware could run.\n", "\n", "\n", "\n", "NVIDIA GPUs are comprised of multiple [streaming multiprocessors (SMs)](../GPU_Architecture_Terminologies.ipynb), each of which can manage up to 2048 concurrent threads (not all actively running at the same time). Low occupancy shows that there are not enough active [threads](../GPU_Architecture_Terminologies.ipynb) to fully utilize the computing resources; higher occupancy implies that the scheduler has more active threads to choose from and hence achieves higher performance. For example, if resource limits allow a kernel only 1024 concurrent threads per SM out of the possible 2048, its theoretical occupancy is 1024/2048 = 50%. So, what does this mean in the OpenMP execution model? Worksharing the inner loop exposes its iterations as additional thread-level parallelism, giving the scheduler more active threads and thus a chance at higher occupancy.\n", "\n", "\n", "Now, let's start modifying the original code and split the `teams distribute` from the `parallel do`. From the top menu, click on *File*, then *Open*, and open `rdf.f90` from the `Fortran/source_code/openmp` directory. Remember to **SAVE** your code after making changes, before running the cells below." ] },
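{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the pattern concrete before you edit, below is a small, self-contained sketch of the split construct. This is an illustrative toy program, not the RDF code: the array `a`, the size `n`, and the loop body are assumptions made up for the example.\n", "\n", "```fortran\n", "program split_example\n", "   implicit none\n", "   integer, parameter :: n = 2048\n", "   integer :: i, j\n", "   real, allocatable :: a(:,:)\n", "\n", "   allocate(a(n,n))\n", "\n", "   ! Outer loop: iterations are distributed across thread teams;\n", "   ! inner loop: iterations are workshared across the threads of each team\n", "   !$omp target teams distribute map(from:a)\n", "   do i = 1, n\n", "      !$omp parallel do\n", "      do j = 1, n\n", "         a(j,i) = real(i + j)   ! each element is written exactly once\n", "      end do\n", "   end do\n", "\n", "   print *, 'a(1,1) =', a(1,1), ', a(n,n) =', a(n,n)\n", "end program split_example\n", "```\n", "\n", "Compiled with `nvfortran -mp=gpu`, the `teams distribute` spreads the `i` iterations across the GPU's SMs, while the threads of each team cooperate on the `j` iterations. This is the same restructuring you are about to apply to the pair-calculation loops in `rdf.f90`." ] },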
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#compile for Tesla GPU\n", "!cd ../../source_code/openmp && nvfortran -mp=gpu -Minfo=mp -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Inspect the compiler feedback (you should get output similar to the example below); you can see from *Line 98* that it separated the `teams distribute` from the `parallel do`.\n", "\n", "\n", "\n", "Now, validate the output by running the executable, and then **profile** your code with the Nsight Systems command line tool `nsys`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Run on the NVIDIA GPU and check the output\n", "!cd ../../source_code/openmp && ./rdf && cat Pair_entropy.dat" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The output should be the following:\n", "\n", "```\n", "s2 : -2.452690945278331 \n", "s2bond : -24.37502820694527 \n", "```" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#profile the application with Nsight Systems\n", "!cd ../../source_code/openmp && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o rdf_offload_split ./rdf" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's check out the profiler's report. Download and save the report file by holding down Shift and Right-Clicking [Here](../../source_code/openmp/rdf_offload_split.qdrep) and open it via the GUI. Compare the execution time for the `Pair Calculation` NVTX row with the previous section and see how much the performance improved. You should get screenshots similar to those below, where 1) is the base version and 2) is the current version after splitting the \"Teams Distribute\" from the \"Parallel Do\". The \"Pair Calculation\" time dropped from 61.2 ms to 45.3 ms.\n", "\n", "" ] },
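{ "cell_type": "markdown", "metadata": {}, "source": [ "As an aside, the `Pair Calculation` row exists in the timeline because the source is instrumented with NVTX ranges; that is why we compile `nvtx.f90` and link with `-lnvToolsExt`. Below is a minimal sketch of such instrumentation, assuming the helper module is named `nvtx` and exposes `nvtxStartRange`/`nvtxEndRange` wrappers; check `nvtx.f90` for the exact interface used in this lab.\n", "\n", "```fortran\n", "use nvtx   ! helper module built from nvtx.f90 (assumed name)\n", "\n", "! Open a named range: it shows up as the \"Pair Calculation\" NVTX row\n", "call nvtxStartRange(\"Pair Calculation\")\n", "\n", "! ... offloaded pair-calculation loop nest ...\n", "\n", "! Close the range so Nsight Systems can measure its duration\n", "call nvtxEndRange\n", "```\n", "\n", "Because the range brackets only the pair-calculation loops, the NVTX row isolates exactly the region you are optimizing, independent of initialization and I/O time." ] },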
{ "cell_type": "markdown", "metadata": {}, "source": [ "Feel free to check out the [solution](../../source_code/openmp/SOLUTION/rdf_offload_split.f90) to help you understand better." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Post-Lab Summary\n", "\n", "If you would like to download this lab for later viewing, it is recommended that you go to your browser's File menu (not the Jupyter notebook File menu) and save the complete web page. This will ensure the images are copied down as well.\n", "You can also execute the following cell block to create a zip file of the files you've been working on, and download it with the link below." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd ..\n", "rm -f nways_files.zip\n", "zip -r nways_files.zip *" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**After** executing the above zip command, you should be able to download and save the zip file by holding down Shift and Right-Clicking [Here](../nways_files.zip).\n", "Let us now go back to parallelizing our code using other approaches.\n", "\n", "**IMPORTANT**: If you wish to dig deeper and profile the kernel with the Nsight Compute profiler, go to the next notebook. Otherwise, please click on **HOME** to go back to the main notebook for *N ways of GPU programming for MD* code.\n", "\n", "-----\n", "\n",
"# HOME &emsp; NEXT\n",
"\n", "-----\n", "\n", "\n", "# Links and Resources\n", "[OpenMP Programming Model](https://computing.llnl.gov/tutorials/openMP/)\n", "\n", "[OpenMP Target Directive](https://www.openmp.org/wp-content/uploads/openmp-examples-4.5.0.pdf)\n", "\n", "[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)\n", "\n", "[NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute)\n", "\n", "**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).\n", "\n", "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", "\n", "--- \n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }