{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before we begin, let's execute the cell below to display information about the CUDA driver and GPUs running on the server by running the `nvidia-smi` command. To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nvidia-smi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning objectives\n", "The **goal** of this lab is to:\n", "\n", "- Learn how to run the same code on both a multicore CPU and a GPU using the OpenMP Target programming model\n", "- Understand the key directives and steps involved in making a sequential code parallel\n", "\n", "We do not intend to cover:\n", "- Optimization techniques in details\n", "\n", "\n", "# OpenMP Directives\n", "- OpenMP has been formed in 1997 to focus on vendor-neutral Shared Memory Parallelism.\n", "- OpenMP 4.0 in 2013 expanded its focus beyond shared memory parallel computers including accelerators. \n", "- The OpenMP 4.0 target construct provides the means to offload data and computation to accelerators.\n", "\n", "Like OpenACC, OpenMP is directive based. Compiler directives appear as comments in your source code and are ignored by compilers unless you tell them otherwise - usually by specifying the appropriate compiler flag.\n", "\n", "In this notebook we will be using the OpenMP target construct to offload data and computation to GPU. Multiple compilers are in development to support OpenMP offloading to NVIDIA GPUs. We will using NVIDIA HPC SDK compiler for this tutorial.\n", "\n", "\n", "## OpenMP Syntax\n", "\n", "```!$omp directive ``` \n", "\n", "**!$** in Fortran is what's known as a \"compiler hint.\" These are very similar to programmer comments, however, the compiler will actually read our comments. They are a way for the programmer to \"guide\" the compiler, without running the chance damaging the code. If the compiler does not understand the comment, it can ignore it, rather than throw a syntax error.\n", "\n", "**omp** is an addition is known as the “sentinel”. It specifies that this is an OpenMP instruction. Any non-OpenMP compiler will ignore this. \n", "\n", "**directives** are commands in OpenMP that will tell the compiler to do some action. For now, we will only use directives that allow the compiler to parallelize our code.\n", "\n", "For beginners who are new to OpenMP directive, we will be introducing some terminologies and concepts before starting to add ```target``` directives to our code to offload onto GPU computation and data. \n", "\n", "## OpenMP Fork-Join Model\n", "\n", "OpenMP uses the fork-join model of parallel execution. All OpenMP programs begin as a single process: the master thread. 
"\n", "For readers who are new to OpenMP directives, we will introduce some terminology and concepts before starting to add ```target``` directives to our code to offload computation and data onto the GPU. \n", "\n", "## OpenMP Fork-Join Model\n", "\n", "OpenMP uses the fork-join model of parallel execution. All OpenMP programs begin as a single process: the master thread. The master thread executes sequentially until the first parallel region construct is encountered.\n", "\n", "**FORK**: the master thread then creates a team of parallel threads. The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the various team threads.\n", "\n", "**JOIN**: when the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.\n", "\n", "\n", "\n", "## OpenMP Parallel Region\n", "\n", "A parallel region is a block of code that will be executed by multiple threads. This is the fundamental OpenMP parallel construct. When a thread reaches a PARALLEL directive, it creates a team of threads and becomes the master of the team. The master is a member of that team. Starting from the beginning of the parallel region, the code is duplicated and all threads will execute it redundantly. There is an implied barrier at the end of a parallel region, and only the master thread continues execution past this point.\n", "\n", "```fortran\n", "program hello\n", " use omp_lib\n", " implicit none\n", " integer :: omp_rank\n", "!$omp parallel private(omp_rank)\n", " omp_rank = omp_get_thread_num()\n", " print *, 'Hello world! by thread ', omp_rank\n", "!$omp end parallel\n", "end program hello\n", "```\n", "\n", "\n", "\n", "## OpenMP Data-sharing\n", "In OpenMP, several constructs accept clauses that allow the user to control the data sharing. For example, in the code given previously we explicitly declared omp_rank to be private. You can use one of the clauses below in a *parallel* construct.\n", "\n", "- `private`: Declares variables to be private to each thread in a team. Each thread gets its own uninitialized copy (use `firstprivate` if the copies must be initialized from the original object).\n", "- `shared`: Shares variables among all the threads in a team.\n", "- `default`: Sets the default data-sharing attribute for the variables in the region.\n", "\n", "Developers can set the default data-sharing scope using the `default` clause:\n", "\n", "```fortran\n", "\n", "program hello\n", " use omp_lib\n", " implicit none\n", " integer :: omp_rank\n", "!$omp parallel default(shared) private(omp_rank)\n", " omp_rank = omp_get_thread_num()\n", " print *, 'Hello world! by thread ', omp_rank\n", "!$omp end parallel\n", "end program hello\n", "\n", "```\n", "\n", "## OpenMP Work-sharing\n", "\n", "As described before, the ```parallel``` construct creates a team of threads, and execution continues redundantly on all threads of the team. Ideally, we want the threads within the team to share the work, i.e. split it among themselves. A work-sharing construct divides the execution of the enclosed code region among the members of the team that encounter it. Work-sharing constructs do not launch new threads; they divide (\"workshare\") the iterations of the loop across the threads in the team. There is no implied barrier upon entry to a work-sharing construct; however, there is an implied barrier at the end of a work-sharing construct. \n", "\n", "There are multiple work-sharing constructs; since our code is Fortran, the code below makes use of the ```do``` construct to divide the iterations of the loop among the threads. The two directives are often merged into a single combined directive, as shown in the sketch after this example.\n", "\n", "```fortran\n", "\n", "!Create a team of threads\n", "!$omp parallel\n", "!workshare this loop across those threads.\n", " !$omp do\n", " do i=1,N\n", " < loop code >\n", " end do\n", " !$omp end do\n", "!$omp end parallel\n", "\n", "```\n",
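"\n", "In practice, the ```parallel``` and ```do``` directives are frequently combined into a single ```parallel do``` directive; the longer combined constructs used later in this notebook (such as ```target teams distribute parallel do```) follow the same pattern. Below is a minimal sketch of our own illustrating the combined form:\n", "\n", "```fortran\n", "\n", "!Create the team and workshare the loop in one combined directive\n", "!$omp parallel do\n", " do i=1,N\n", " < loop code >\n", " end do\n", "!$omp end parallel do\n", "\n", "```\n",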
"\n", "\n", "\n", "\n", "## OpenMP Target Offloading\n", "\n", "By now you should be familiar with the OpenMP programming model. Now let us start introducing the key directives and constructs used to add GPU offloading. \n", "\n", "\n", "### ```target ```\n", "\n", "The ```target``` construct consists of a target directive and an execution region. The ```target``` directive defines a target region, which is a block of computation that operates within a distinct data environment and is intended to be offloaded onto a parallel computation device during execution (a GPU in our case). Data used within the region may be implicitly or explicitly mapped to the device. All of OpenMP is allowed within target regions, but only a subset will run well on GPUs.\n", "\n", "The example below shows usage of the target directive with implicitly mapped data:\n", "```fortran\n", " !Moves this region of code to the GPU and implicitly maps data.\n", " !$omp target\n", " !$omp parallel do\n", " do i=2,N-1 \n", " ANew(i) = A(i-1) + A(i+1)\n", " end do\n", " !$omp end target \n", "```\n", "\n", "### ```target data``` to explicitly map the data\n", "\n", "The map clause maps a variable to/from the device, and helps the developer explicitly define and reduce data copies. The ```target data``` construct is used to mark such regions:\n", "\n", "```fortran\n", "!$omp target data map(map-type: list)\n", "```\n", "\n", "Examples of map types are as follows: \n", "- to (list)\n", " - Allocates memory on the device and copies data in when entering the region; the values are not copied back\n", "- from (list)\n", " - Allocates memory on the device and copies the data to the host when exiting the region\n", "- alloc (list)\n", " - Allocates memory on the device. If the data is already present on the device, a reference counter is incremented\n", "\n", "Note the ```target``` region nested inside the ```target data``` region below: ```target data``` only manages the device data environment, while ```target``` is what offloads the computation.\n", "\n", "```fortran\n", "!Moves this region of code to the GPU and explicitly maps data.\n", "!$omp target data map(to:A(:)) map(from:ANew(:))\n", " !$omp target\n", " !$omp parallel do\n", " do i=2,N-1 \n", " ANew(i) = A(i-1) + A(i+1)\n", " end do\n", " !$omp end target\n", "!$omp end target data\n", "\n", "\n", "```\n", "\n", "\n", "### ```teams``` directive\n", "The ```teams``` directive creates a league of thread teams, where the master thread of each team executes the region. Each of these master threads executes sequentially. In other words, the teams directive spawns one or more thread teams, each with the same number of threads. Execution continues on the master threads of each team (redundantly). There is no synchronization allowed between teams. \n", "\n", "A team might map to a CPU thread, a CUDA thread block, or an OpenCL work-group. The runtime chooses how many teams to create based on where you are running: only a few on a CPU (for example, one per CPU core) or a great many on a GPU (possibly thousands). ```teams``` allows OpenMP code to scale from small CPUs to large GPUs because each team works completely independently of the others.\n",
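"\n", "As a hedged illustration of this redundant execution (our own sketch, not part of the lab code), the program below requests up to four teams and records each team's number; the master thread of every team executes the region once. `omp_get_team_num()` and `omp_get_num_teams()` are OpenMP runtime routines usable inside a ```teams``` region:\n", "\n", "```fortran\n", "program teams_demo\n", " use omp_lib\n", " implicit none\n", " integer :: team_ids(4)\n", " team_ids = -1\n", " !Each team's master thread executes this region redundantly.\n", " !$omp target teams num_teams(4) map(tofrom:team_ids)\n", " team_ids(omp_get_team_num()+1) = omp_get_team_num()\n", " !$omp end target teams\n", " print *, 'teams that executed the region: ', team_ids\n", "end program teams_demo\n", "```\n",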
"\n", "\n", "### ```distribute``` \n", "There is a good chance that we do not want the loop to be run redundantly in every master thread of the ```teams``` region, though; that seems wasteful and potentially dangerous. With the ```distribute``` construct, the iterations of the next loop are broken into groups that are \"distributed\" to the master threads of the teams. The iterations are distributed statically, and there are no guarantees about the order in which the teams will execute them. Also, ```distribute``` does not by itself generate parallelism/work-sharing within the thread teams.\n", "\n", "\n", "\n", "The example below of a simple stencil code shows the usage of ```distribute``` along with ```teams```:\n", "\n", "```fortran\n", " !$omp target teams distribute\n", " do j=2,N-1 \n", " do i=2,N-1 \n", " ANew(j,i) = 0.25 * (A(j-1,i) + A(j+1,i) + A(j,i-1) + A(j,i+1))\n", " enddo\n", " enddo\n", " !$omp end target teams distribute \n", "```\n", "\n", "\n", "\n", "\n", "### Work sharing to improve parallelism\n", "\n", "With ```distribute``` alone, only the master thread of each team performs the computation, which is not optimal on a GPU architecture. To solve this problem we will make use of work-sharing as we did before. When any team encounters a work-sharing construct, the work inside the construct is divided among the members of the team and executed cooperatively, instead of being executed redundantly by every thread. There are many work-sharing constructs defined; the one that we plan to use is: \n", "\n", "```\n", "!$omp parallel do\n", "```\n", "\n", "```fortran\n", " !$omp target teams distribute parallel do\n", " do j=2,N-1 \n", " do i=2,N-1 \n", " ANew(j,i) = 0.25 * (A(j-1,i) + A(j+1,i) + A(j,i-1) + A(j,i+1))\n", " enddo\n", " enddo\n", " !$omp end target teams distribute parallel do \n", "```\n", "\n", "\n", "\n", "\n", "## Atomic Construct\n", "\n", "In the code you will also require one more construct, which will help you get the right results. The OpenMP atomic construct ensures that a particular variable is accessed and/or updated atomically, to prevent indeterminate results and race conditions. In other words, it prevents one thread from stepping on the toes of other threads by accessing a variable simultaneously, which would give different results from run to run. For example, if we want to count the number of elements that have a value greater than zero, we could write the following:\n", "\n", "\n", "```fortran\n", "if (val > 0) then\n", "!$omp atomic\n", " cnt = cnt + 1\n", "end if\n", "```\n", "\n", "Now, let us start modifying the original code and adding the OpenMP target directives. Click on the [rdf.f90](../../source_code/openmp/rdf.f90) link and modify `rdf.f90`. Remember to **SAVE** your code after changes, before running below cells." ] },
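{ "cell_type": "markdown", "metadata": {}, "source": [ "As a hedged hint before you edit: the directive added around the main pair-calculation loop nest will look broadly like the sketch below. This is one possible shape rather than the exact solution; `natoms` and the loop structure here are illustrative, while `g`, `x`, `y`, and `z` are the arrays the compiler feedback later refers to.\n", "\n", "```fortran\n", "!$omp target teams distribute parallel do\n", "do i = 1, natoms\n", " do j = 1, natoms\n", " ! compute the pair distance from x, y, z and bin it into the histogram g;\n", " ! the update of g needs the atomic construct described above\n", " end do\n", "end do\n", "!$omp end target teams distribute parallel do\n", "```" ] },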
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Compile the code for muticore\n", "!cd ../../source_code/openmp && nvfortran -mp=multicore -Minfo=mp -I/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/include -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspect the compiler feedback (you should get a similar output as below) you can see from *Line 98* that it is generating a multicore code `98, Generating Multicore code`.\n", "\n", "```\n", "\tMulticore output\n", "\t\trdf:\n", " 98, !$omp target teams distribute parallel for\n", " 98, Generating Multicore code\n", " 99, Loop parallelized across teams and threads, schedule(static)\n", "```\n", "\n", "Make sure to validate the output by running the executable and validate the output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Run the multicore code and check the output\n", "!cd ../../source_code/openmp && ./rdf && cat Pair_entropy.dat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output should be the following:\n", "\n", "```\n", "s2 : -2.452690945278331 \n", "s2bond : -24.37502820694527 \n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#profile and see output of nvptx\n", "!cd ../../source_code/openmp && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_multicore ./rdf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's checkout the profiler's report. Download and save the report file by holding down Shift and Right-Clicking [Here](../../source_code/openmp/rdf_multicore.qdrep) and open it via the GUI. Have a look at the example expected profiler report below:\n", "\n", "\n", "\n", "Feel free to checkout the [solution](../../source_code/openmp/SOLUTION/rdf_offload.f90) to help you understand better." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compile and Run for an NVIDIA GPU\n", "\n", "Without changing the code now let us try to recompile the code for NVIDIA GPU and rerun.\n", "The only difference is that now we pass `gpu` value to the `-mp` compiler option.`-mp=gpu`. **Understand and analyze** the solution present at:\n", "\n", "[RDF Code](../../source_code/openmp/SOLUTION/rdf_offload.f90)\n", "\n", "Open the downloaded files for inspection. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#compile for Tesla GPU\n", "!cd ../../source_code/openmp && nvfortran -mp=gpu -Minfo=mp -o rdf nvtx.f90 rdf.f90 -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64 -lnvToolsExt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspect the compiler feedback (you should get a similar output as below) and you can see below: \n", "\n", "- *Line 97* shows variables mapped to the device\n", "- *Line 97* shows the GPU kernel is generated\n", "\n", "```\n", "rdf.f90:\n", "rdf:\n", " 98, !$omp target teams distribute parallel do\n", " 94, Generating map(tofrom:g(:),x(z_b_0:z_b_1,z_b_3:z_b_4),y(z_b_7:z_b_8,z_b_10:z_b_11),z(z_b_14:z_b_15,z_b_17:z_b_18)) \n", " 98, Generating Tesla and Multicore code\n", " Generating \"nvkernel_MAIN__F1L98_1\" GPU kernel\n", " 98, Generating implicit map(tofrom:g(:),x(:,:),z(:,:),y(:,:)) \n", "```\n", "\n", "Make sure to validate the output by running the executable and validate the output." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Run on Nvidia GPU and check the output\n", "!cd ../../source_code/openmp && ./rdf && cat Pair_entropy.dat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output should be the following:\n", "\n", "```\n", "s2 : -2.452690945278331 \n", "s2bond : -24.37502820694527 \n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#profile and see output of nvptx\n", "!cd ../../source_code/openmp && nsys profile -t nvtx,cuda --stats=true --force-overwrite true -o rdf_offload_gpu ./rdf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's checkout the profiler's report. Download and save the report file by holding down Shift and Right-Clicking [Here](../../source_code/openmp/rdf_offload_gpu.qdrep) and open it via the GUI. Have a look at the example expected profiler report below:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you expand the CUDA row (Timeline view), you can see memory movements as well as Kernels. Checkout the NVTX row and compare the execution time for the `Pair_Calculation` for the multicore version and the GPU offload version. In the *example screenshot*, we were able to reduce the timing from 9.3 seconds to 82.6 mseconds.\n", "\n", "\n", "# OpenMP Analysis\n", "\n", "**Usage Scenarios**\n", "- Legacy codes with sizeable codebase needs to be ported to GPUs with minimal code changes to sequential code.\n", "- Developers want to see if the code structure favors GPU SIMD/SIMT style or as we say test the waters before moving a large piece of code to a GPU.\n", "\n", "\n", "**Limitations/Constraints**\n", "- Directive based programming model like OpenMP depends on a compiler to understand and convert your sequential code to CUDA constructs. OpenMP compiler with target offload support are evloving and they it cannot match the best performance that say using CUDA C constructs directly can give. Things like controlling execution at warp level or limiting the register counts etc are some of the examples\n", " \n", "**Which Compilers Support OpenMP on GPU?**\n", "As of March 2020 here are the compilers that support OpenMP on GPU:\n", "\n", "| Compiler | Latest Version | Maintained by | Full or Partial Support |\n", "| --- | --- | --- | --- |\n", "| GCC | 10 | Mentor Graphics | 4.5 partial spec supported |\n", "| CCE| latest | Cray | 4.5 partial spec supported | \n", "| XL | latest | IBM | 4.5 partial spec supported |\n", "| Clang | 9.0 | Community | 4.5 partial spec supported |\n", "| HPC SDK | 21.7 | NVIDIA HPC SDK | 5.0 spec supported |\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Post-Lab Summary\n", "\n", "If you would like to download this lab for later viewing, it is recommend you go to your browsers File menu (not the Jupyter notebook file menu) and save the complete web page. This will ensure the images are copied down as well. You can also execute the following cell block to create a zip-file of the files you've been working on, and download it with the link below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd ..\n", "rm -f nways_files.zip\n", "zip -r nways_files.zip *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**After** executing the above zip command, you should be able to download and save the zip file by holding down Shift and Right-Clicking [Here](../nways_files.zip).\n", "Let us now go back to parallelizing our code using other approaches.\n", "\n", "**IMPORTANT**: Please click on **HOME** to go back to the main notebook for *N ways of GPU programming for MD* code.\n", "\n", "-----\n", "\n", "#

\n", "\n", "-----\n", "\n", "\n", "# Links and Resources\n", "[OpenMP Programming Model](https://computing.llnl.gov/tutorials/openMP/)\n", "\n", "[OpenMP Target Directive](https://www.openmp.org/wp-content/uploads/openmp-examples-4.5.0.pdf)\n", "\n", "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n", "\n", "**NOTE**: To be able to see the Nsight System profiler output, please download Nsight System latest version from [here](https://developer.nvidia.com/nsight-systems).\n", "\n", "Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n", "\n", "--- \n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }