{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bonus Task (More about Gangs, Workers, and Vectors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gang/Worker/Vector\n", "\n", "This is our last optimization, and arguably the most important one. In OpenACC, **Gang Worker Vector** is used to define additional levels of parallelism. Specifically for NVIDIA GPUs, gang, worker, and vector will specify the *decomposition* of our loop iterations to GPU threads. Each loop will have an optimal Gang/Worker/Vector implementation, and finding the correct implementation will often take a bit of thinking and possibly some trial and error. So let's explain how the `gang`, `worker`, and `vector` clauses actually work.\n", "\n", "![gang_worker_vector.png](images/gang_worker_vector.png)\n", "\n", "This image represents a single **gang**. When parallelizing our **do loops**, the **loop iterations** will be **broken up evenly** among a number of gangs. Each gang will contain a number of **threads**. These threads are organized into **blocks**. A **worker** is a row of threads. In the above graphic, there are 3 **workers**, which means that there are 3 rows of threads. The **vector** refers to how long each row is. So in the above graphic, the vector is 8, because each row is 8 threads long.\n", "\n", "By default, when programming for a GPU, **gang** and **vector** parallelism is automatically applied. Let's look at a simple GPU sample code where we explicitly show how gang and vector work.\n", "\n", "```fortran\n", "!$acc parallel loop gang\n", "do i = 1, N\n", " !$acc loop vector\n", " do j = 1, M\n", " < loop code >\n", " end do\n", "end do\n", "```\n", "\n", "The outer loop will be evenly spread across a number of **gangs**. Then, within those gangs, the inner-loop will be executed in parallel across the **vector**. 
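As a rough illustration of this decomposition (with hypothetical sizes -- the real iteration schedule is chosen by the OpenACC compiler, not by user code), you can think of each gang receiving a ceiling-divided chunk of the outer iterations. A small C sketch of that bookkeeping:\n", "\n", "```c\n", "#include <assert.h>\n", "#include <stdio.h>\n", "\n", "/* Hypothetical bookkeeping only -- the real schedule belongs to the compiler. */\n", "static int chunk_size(int iters, int gangs)\n", "{\n", "    return (iters + gangs - 1) / gangs; /* ceiling division */\n", "}\n", "\n", "int main(void)\n", "{\n", "    int n = 1000, gangs = 8;\n", "    int chunk = chunk_size(n, gangs); /* iterations given to each gang */\n", "    int covered = 0;\n", "    for (int g = 0; g < gangs; g++) {\n", "        int start = g * chunk;\n", "        int end = (start + chunk < n) ? start + chunk : n;\n", "        if (end > start) covered += end - start;\n", "    }\n", "    assert(covered == n); /* the gangs cover every iteration exactly once */\n", "    printf(\"%d iterations per gang\\n\", chunk);\n", "    return 0;\n", "}\n", "```\n", "\n", "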
This decomposition usually happens automatically; however, we can often achieve better performance by tuning the gang, worker, and vector mapping ourselves.\n", "\n", "Let's look at an example where gang, worker, and vector can greatly increase a loop's parallelism.\n", "\n", "```fortran\n", "!$acc parallel loop gang\n", "do i = 1, N\n", " !$acc loop vector\n", " do j = 1, M\n", " do k = 1, Q\n", " < loop code >\n", " end do\n", " end do\n", "end do\n", "```\n", "\n", "In this loop, we have **gang level** parallelism on the outer-loop, and **vector level** parallelism on the middle-loop. However, the inner-loop does not have any parallelism. This means that each thread will run the entire inner-loop by itself; GPU threads, however, aren't really made to run whole loops on their own. To fix this, we could use **worker level** parallelism to add another layer.\n", "\n", "```fortran\n", "!$acc parallel loop gang\n", "do i = 1, N\n", " !$acc loop worker\n", " do j = 1, M\n", " !$acc loop vector\n", " do k = 1, Q\n", " < loop code >\n", " end do\n", " end do\n", "end do\n", "```\n", "\n", "Now, the outer-loop will be split across the gangs, the middle-loop will be split across the workers, and the inner-loop will be executed by the threads within the vector." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Gang, Worker, and Vector Syntax\n", "\n", "So far we have shown fairly general examples of gang, worker, and vector. One of the largest benefits of these clauses is the ability to explicitly define how many gangs and workers you need, and how many threads should be in the vector. 
Let's look at the syntax for the parallel directive:\n", "\n", "```fortran\n", "!$acc parallel num_gangs( 2 ) num_workers( 4 ) vector_length( 32 )\n", " !$acc loop gang worker\n", " do i = 1, N\n", " !$acc loop vector\n", " do j = 1, M\n", " < loop code >\n", " end do\n", " end do\n", "!$acc end parallel\n", "```\n", "\n", "And now the syntax for the kernels directive:\n", "\n", "```fortran\n", "!$acc kernels loop gang( 2 ) worker( 4 )\n", "do i = 1, N\n", " !$acc loop vector( 32 )\n", " do j = 1, M\n", " < loop code >\n", " end do\n", "end do\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Avoid Wasting Threads\n", "\n", "When parallelizing small arrays, you have to be careful that the number of threads within your vector is not larger than the number of loop iterations. Let's look at a simple example:\n", "\n", "```fortran\n", "!$acc kernels loop gang\n", "do i = 1, 1000000000\n", " !$acc loop vector(256)\n", " do j = 1, 32\n", " < loop code >\n", " end do\n", "end do\n", "```\n", "\n", "In this code, we are parallelizing an inner-loop that has 32 iterations. However, our vector is 256 threads long. This means that when we run this code, we will have far more threads than loop iterations, and many of the threads will be sitting idle. We could fix this in a few different ways, but let's use **worker level** parallelism.\n", "\n", "```fortran\n", "!$acc kernels loop gang worker(8)\n", "do i = 1, 1000000000\n", " !$acc loop vector(32)\n", " do j = 1, 32\n", " < loop code >\n", " end do\n", "end do\n", "```\n", "\n", "Originally we had 1 (implied) worker that contained 256 threads. Now, we have 8 workers that each have only 32 threads. We have eliminated all of our wasted threads by reducing the length of the **vector** and increasing the number of **workers**." 
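, "\n", "\n", "As a quick back-of-the-envelope check of the numbers above (using a hypothetical C helper, not part of any OpenACC API), the waste per worker is simply the vector length minus the loop trip count:\n", "\n", "```c\n", "#include <assert.h>\n", "#include <stdio.h>\n", "\n", "/* Hypothetical helper: threads left idle when a vector of length vlen\n", "   executes a loop with iters iterations. */\n", "static int idle_threads(int vlen, int iters)\n", "{\n", "    return (vlen > iters) ? vlen - iters : 0;\n", "}\n", "\n", "int main(void)\n", "{\n", "    assert(idle_threads(256, 32) == 224); /* 1 worker, vector(256), 32 iterations */\n", "    assert(idle_threads(32, 32) == 0);    /* worker(8) x vector(32): no waste */\n", "    printf(\"idle threads with vector(256): %d\\n\", idle_threads(256, 32));\n", "    return 0;\n", "}\n", "```"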
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Rule of 32 (Warps)\n", "\n", "The general rule of thumb when programming for NVIDIA GPUs is to always ensure that your vector length is a multiple of 32 (that is, 32, 64, 96, 128, ..., 512, ..., 1024, etc.). This is because NVIDIA GPUs are optimized to use **warps**. Warps are groups of 32 threads that execute the same instruction at the same time. So, as a reference:\n", "\n", "```fortran\n", "!$acc kernels loop gang\n", "do i = 1, N\n", " !$acc loop vector(32)\n", " do j = 1, M\n", " < loop code >\n", " end do\n", "end do\n", "```\n", "\n", "will perform much better than:\n", "\n", "```fortran\n", "!$acc kernels loop gang\n", "do i = 1, N\n", " !$acc loop vector(31)\n", " do j = 1, M\n", " < loop code >\n", " end do\n", "end do\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implementing the Gang, Worker, and Vector\n", "\n", "From the top menu, click on *File*, then *Open*, and open `laplace2d.f90` from the `Fortran/source_code/lab3` directory. Replace our earlier clauses with **gang, worker, and vector** to reorganize our thread blocks. Try a few different numbers, but always keep the vector length a **multiple of 32** to fully utilize **warps**.\n", "\n", "Remember to **SAVE** your code after making changes, and before running the cells below.\n", "\n", "Then run the following script to see how the code runs." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cd ../source_code/lab3 && make clean && make && ./laplace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our tests, it was difficult to beat the earlier `tile` version by hand-tuning the `gang`, `worker`, and `vector` clauses. Even so, when optimizing real OpenACC codes it is very common to tweak loop mappings with these clauses and to adjust the vector length, so keep them in the back of your mind for the future.\n", "\n", "--- \n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }