{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OpenACC Loop Optimizations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This version of the lab is intended for C/C++ programmers. The Fortran version of this lab is available [here](../../Fortran/jupyter_notebook/openacc_fortran_lab3.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will receive a warning five minutes before the lab instance shuts down. Remember to save your work! If you are about to run out of time, please see the [Post-Lab](#Post-Lab-Summary) section for saving this lab to view offline later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "Let's execute the cell below to display information about the GPUs running on the server. To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nvaccelinfo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Introduction\n", "\n", "Our goal for this lab is to use the OpenACC Loop clauses to optimize our Parallel Loops.\n", " \n", "\"OpenACC\n", "\n", "This is the OpenACC 3-Step development cycle.\n", "\n", "**Analyze** your code, and predict where potential parallelism can be uncovered. Use profiler to help understand what is happening in the code, and where parallelism may exist.\n", "\n", "**Parallelize** your code, starting with the most time consuming parts. Focus on maintaining correct results from your program.\n", "\n", "**Optimize** your code, focusing on maximizing performance. Performance may not increase all-at-once during early parallelization.\n", "\n", "We are currently tackling the **optimize** step. We will include the OpenACC loop clauses to optimize the execution of our parallel loop nests." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Run the Code\n", "\n", "In the previous labs, we have built up a working parallel code that can run on both a multicore CPU and a GPU. Let's run the code and note the performance, so that we can compare the runtime to any future optimizations we make. The code should take about a second to run at this point." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cd ../source_code/lab3 && make clean && make && ./laplace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optional: Analyze the Code\n", "\n", "If you would like a refresher on the code files that we are working on, you may view both of them using the two links below by opening the downloaded file.\n", "\n", "[jacobi.c](../source_code/lab3/jacobi.c) \n", "[laplace2d.c](../source_code/lab3/laplace2d.c) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Optimize Loop Schedules\n", "\n", "The compiler has analyzed the loops in our two main functions and scheduled the iterations of the loops to run in parallel on our GPU and Multicore CPU. The compiler is usually pretty good at choosing how to break up loop iterations to run well on parallel accelerators, but sometimes we can eke out just a little more performance by guiding the compiler to make specific choices. First, let's look at the choices the compiler made for us. 
"We'll focus on the `calcNext` routine, but you should look at the `swap` routine too. Here's the compiler feedback for `calcNext`:\n", "\n", "```\n", "calcNext:\n", "     48, Generating copyin(A[:m*n])\n", "         Accelerator kernel generated\n", "         Generating Tesla code\n", "         49, #pragma acc loop gang /* blockIdx.x */\n", "             Generating reduction(max:error)\n", "         51, #pragma acc loop vector(128) /* threadIdx.x */\n", "     48, Generating implicit copy(error)\n", "         Generating copyout(Anew[:m*n])\n", "     51, Loop is parallelizable\n", "```\n", "\n", "The main loops of interest in `calcNext` are on lines 49 and 51. I see that the compiler has told me which loop clauses it chose for each of those loops. The outermost loop is treated as a *gang* loop, meaning the compiler broke that loop up into chunks that can easily be spread out across the GPU or the CPU cores. If you have programmed in CUDA before, you'll recognize that the compiler is mapping this loop to the CUDA thread blocks. The innermost loop is instead mapped to *vector* parallelism. You can think of a vector as some number of data cells that get the same operation applied to them at the same time. On any modern processor technology you need this mixture of *coarse-grained* and *fine-grained* parallelism to use the hardware effectively. Vector (fine-grained) parallelism operates extremely efficiently when performing the same operation on lots of data, but there are limits to how long a vector can be. Gang (coarse-grained) parallelism is highly scalable, because each chunk of work can execute completely independently of every other chunk, which makes it ideal for spreading work across many processor cores.\n", "\n", "Let's look at some loop clauses that allow you to tune how the compiler maps our loop iterations to these different types of parallelism." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Collapse Clause\n", "\n", "The `collapse` clause allows us to transform a multi-dimensional loop nest into a single-dimensional loop. This is helpful for increasing the overall iteration count of our loops (which usually exposes more parallelism), and it will often help with memory locality. In our case, instead of looking at our loops as an `n` iteration loop and an `m` iteration loop, the compiler looks at them as a single `n * m` iteration loop, which gives it more flexibility in how to break up the iterations. Let's look at the syntax.\n", "\n", "```cpp\n", "#pragma acc parallel loop collapse( N )\n", "```\n", "\n", "Where `N` is the number of loops to collapse.\n", "\n", "```cpp\n", "#pragma acc parallel loop collapse( 3 )\n", "for(int i = 0; i < N; i++)\n", "{\n", "    for(int j = 0; j < M; j++)\n", "    {\n", "        for(int k = 0; k < Q; k++)\n", "        {\n", "            < loop code >\n", "        }\n", "    }\n", "}\n", "```\n", "\n", "This code will combine the 3-dimensional loop nest into a single 1-dimensional loop. The loops in our example code already have plenty of iterations, so I don't expect a lot of speed-up from collapsing them together, but let's try it anyway." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Implementing the Collapse Clause\n", "\n", "Click on the [laplace2d.c](../source_code/lab3/laplace2d.c) link and modify `laplace2d.c`. Use the **collapse** clause to collapse our multi-dimensional loops into a single-dimensional loop.\n", "\n", "Remember to **SAVE** your code after making changes, before running the cells below.\n", "\n", "Then run the following cell to see how the code performs."
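, "\n",
"If you get stuck, here is a minimal sketch of what the collapsed `calcNext` loop nest *might* look like. It assumes the `OFFSET` macro, function signature, and loop bounds from the `laplace2d.c` used in the previous labs, and it omits any data clauses you may already have on the directive; keep those as they are. The only real change is adding `collapse(2)` to the outer directive and making sure the two loops are tightly nested (remove any separate inner `#pragma acc loop`).\n", "\n",
"```cpp\n",
"double calcNext(double *restrict A, double *restrict Anew, int m, int n)\n",
"{\n",
"    double error = 0.0;\n",
"\n",
"    // Collapse the two tightly nested loops into one long (n-2)*(m-2) iteration loop.\n",
"    // Keep any data clauses you added in the previous labs on this same directive.\n",
"    #pragma acc parallel loop collapse(2) reduction(max:error)\n",
"    for( int j = 1; j < n-1; j++ )\n",
"    {\n",
"        for( int i = 1; i < m-1; i++ )\n",
"        {\n",
"            Anew[OFFSET(j, i, m)] = 0.25 * ( A[OFFSET(j, i+1, m)] + A[OFFSET(j, i-1, m)]\n",
"                                           + A[OFFSET(j-1, i, m)] + A[OFFSET(j+1, i, m)] );\n",
"            error = fmax( error, fabs(Anew[OFFSET(j, i, m)] - A[OFFSET(j, i, m)]) );\n",
"        }\n",
"    }\n",
"    return error;\n",
"}\n",
"```\n", "\n",
"Note that `collapse` requires the loops to be tightly nested: nothing but the inner loop may appear between the two `for` statements."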
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cd ../source_code/lab3 && make clean && make && ./laplace" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Did your code speed up at all?\n", "\n", "So when should you use the `collapse` clause? The clause is particularly useful in two specific cases, both of which occur when you have very deep loop nests (4, 5, or more loops nested together). The first case is when the innermost loops have very few iterations. The compiler will generally favor inner loops for *vector* parallelism, so if there aren't enough loop iterations to fill our vector, we're going to be wasting computational power. Take a look at this code:\n", "\n", "```cpp\n", "#pragma acc parallel loop\n", "for(int i = 0; i < N; i++)\n", "{\n", "    for(int j = 0; j < 8; j++)\n", "    {\n", "        for(int k = 0; k < 8; k++)\n", "        {\n", "            < loop code >\n", "        }\n", "    }\n", "}\n", "```\n", "\n", "In this code, our innermost loop, which the compiler will likely want to vectorize, has just 8 iterations. On a CPU this may be OK, but on a GPU we generally want longer vectors. If I collapse the two innermost loops together, that gives me 64 iterations, which is starting to get into the range where GPUs make sense. So instead, I should write this:\n", "\n", "```cpp\n", "#pragma acc parallel loop\n", "for(int i = 0; i < N; i++)\n", "{\n", "    #pragma acc loop collapse(2)\n", "    for(int j = 0; j < 8; j++)\n", "    {\n", "        for(int k = 0; k < 8; k++)\n", "        {\n", "            < loop code >\n", "        }\n", "    }\n", "}\n", "```\n", "\n", "The other common case happens when the outermost loops of a loop nest have fairly few iterations. The outermost loops are where the compiler looks first for *coarse-grained* parallelism to spread across the CPU or GPU. If there's not enough parallelism here, then we're limited in how many CPU cores, or how large a GPU, we can effectively use. In the example below, I took two 32-iteration loops and turned them into a single 1024-iteration loop to give the compiler the opportunity to parallelize the region on larger GPUs.\n", "\n", "```cpp\n", "#pragma acc parallel loop collapse( 2 )\n", "for(int i = 0; i < 32; i++)\n", "{\n", "    for(int j = 0; j < 32; j++)\n", "    {\n", "        for(int k = 0; k < N; k++)\n", "        {\n", "            < loop code >\n", "        }\n", "    }\n", "}\n", "```\n", "\n", "As a rule of thumb, if your code has loops that are tightly nested, meaning there's nothing inside one loop except the nested loop, it's worth trying to collapse the loops completely. This won't always give you the best performance, but it will frequently perform better than the uncollapsed version." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at another clause that may help our code.\n", "\n", "### Tile Clause\n", "\n", "The `tile` clause allows us to break up a multi-dimensional loop nest into *tiles*, or *blocks*. This is often useful for increasing memory locality in codes like ours. Let's look at the syntax.\n", "\n", "```cpp\n", "#pragma acc parallel loop tile( x, y, z, ... )\n", "```\n", "\n", "Our tiles can have as many dimensions as we want, though we must be careful not to create a tile that is too large.\n",
"Let's look at an example:\n", "\n", "```cpp\n", "#pragma acc parallel loop tile( 32, 32 )\n", "for(int i = 0; i < N; i++)\n", "{\n", "    for(int j = 0; j < M; j++)\n", "    {\n", "        < loop code >\n", "    }\n", "}\n", "```\n", "\n", "The above code will break our loop iterations up into 32x32 tiles (or blocks), and then execute those blocks in parallel. Let's look at a slightly more specific example.\n", "\n", "```cpp\n", "#pragma acc parallel loop tile( 32, 32 )\n", "for(int i = 0; i < 128; i++)\n", "{\n", "    for(int j = 0; j < 128; j++)\n", "    {\n", "        < loop code >\n", "    }\n", "}\n", "```\n", "\n", "In this code, we have 128x128 loop iterations, which are being broken up into 32x32 tiles. This means that we will have 16 tiles, each of size 32x32." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Implementing the Tile Clause\n", "\n", "Click on the [laplace2d.c](../source_code/lab3/laplace2d.c) link and modify `laplace2d.c`. Replace the `collapse` clause with the `tile` clause to break our multi-dimensional loops into smaller tiles. Try a variety of different tile sizes, but for now keep one of the dimensions a **multiple of 32**. We will talk later about why this is important. (If you get stuck, there is a reference sketch in the Conclusion section below.)\n", "\n", "Remember to **SAVE** your code after making changes, before running the cells below.\n", "\n", "Then run the following cell to see how the code performs." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cd ../source_code/lab3 && make clean && make && ./laplace" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Unlike the `collapse` clause, the `tile` clause requires some experimentation to find the best values for our code. Try a few different values and fill out the table below. We know from the compiler feedback above that the compiler is using a vector length of 128, so you might start with values whose product is 128, but also try some other values. For the speed-up column, compare against the result from using the `collapse` clause.\n", "\n", "| Clause      | Time (s) | Speed-up |\n", "|-------------|----------|----------|\n", "| collapse(2) |          | 1.00X    |\n", "| tile(M, N)  |          |          |\n", "| tile(M, N)  |          |          |\n", "| tile(M, N)  |          |          |\n", "| tile(M, N)  |          |          |\n", "| tile(M, N)  |          |          |\n", "| tile(M, N)  |          |          |\n", "| tile(M, N)  |          |          |\n", "| tile(M, N)  |          |          |\n", "| tile(M, N)  |          |          |\n", "\n", "It's a good idea to ensure that the product of `M` and `N` is divisible by 32. NVIDIA GPUs always operate in groups of 32 threads, so it's best to make sure there's some multiple of 32 worth of work for the threads to do. Don't go any higher than 32x32 though, because NVIDIA GPUs are limited to at most 1024 threads in a gang." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "There's quite a bit more we could say about how to optimize your loops with OpenACC, but we'll leave that to a more advanced lab. With just the `tile` and `collapse` clauses in your toolbox, you should already be in good shape for optimizing your loops with OpenACC." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Conclusion\n", "\n", "Our primary goal when using OpenACC is to parallelize our large for loops. To accomplish this, we must use the OpenACC loop directive and loop clauses. There are many ways to alter and optimize our loops, though it is up to the programmer to decide which route is best to take.\n",
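"\n",
"For reference, here is a minimal sketch of what the tiled version of the `calcNext` loop nest *might* look like, under the same assumptions as the collapse sketch earlier (the `OFFSET` macro, loop bounds, and any existing data and reduction clauses come from the previous labs). The 32x32 tile is just a reasonable starting point; your experiments above may well find a better size.\n", "\n",
"```cpp\n",
"// Same calcNext loops as in the collapse sketch, but broken into 32x32 tiles.\n",
"#pragma acc parallel loop tile( 32, 32 ) reduction(max:error)\n",
"for( int j = 1; j < n-1; j++ )\n",
"{\n",
"    for( int i = 1; i < m-1; i++ )\n",
"    {\n",
"        < loop body unchanged >\n",
"    }\n",
"}\n",
"```\n", "\n",
"Like `collapse`, the `tile` clause applies to tightly nested loops, so nothing but the inner loop should sit between the two `for` statements.\n", "\n",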
"At this point in the lab series, you should be able to begin parallelizing your own code and to achieve relatively high performance using OpenACC." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Bonus Task (More about Gangs, Workers, and Vectors)\n", "\n", "This lab's bonus task is to learn a bit more about how OpenACC breaks up the loop iterations into *gangs*, *workers*, and *vectors*, which was discussed very briefly in the first lab. [Click Here](openacc_c_lab3-bonus.ipynb) for more information about these *levels of parallelism*.\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Post-Lab Summary\n", "\n", "If you would like to download this lab for later viewing, it is recommended that you go to your browser's File menu (not the Jupyter notebook File menu) and save the complete web page. This will ensure the images are copied down as well.\n", "\n", "You can also execute the following cell block to create a zip file of the files you've been working on, and download it with the link below." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd ..\n", "rm -f openacc_files.zip\n", "zip -r openacc_files.zip *" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**After** executing the above zip command, you should be able to download and save the zip file by holding down Shift and right-clicking [Here](../openacc_files.zip).\n", "\n", "### Other Bootcamps\n", "The contents of this Bootcamp originate from the [OpenACC GPU Bootcamp Github](https://github.com/gpuhackathons-org/gpubootcamp). Here are some additional Bootcamps which might be of interest: \n", "\n", "- [N-Ways to GPU Programming](https://github.com/gpuhackathons-org/gpubootcamp/tree/master/hpc/nways)\n", "\n", "--- \n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] }
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }