
Fixed Review Comments for Multi GPU HPC NWays

Bharat Kumar 2 years ago
parent commit a0f7a8e24e

File diff suppressed because it is too large
+ 16 - 5
hpc/multi_gpu_nways/README.md


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/jacobi_algo.jpg


+ 4 - 24
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/advanced_concepts/single_node_topology.ipynb

@@ -2,26 +2,6 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "251d3000",
-   "metadata": {},
-   "source": [
-    "Before we begin, let's get an overview of the CUDA driver version and the GPUs running on the server by executing the `nvidia-smi` command below. Highlight the cell below by clicking on it and then either hit `Ctrl+Enter` on the keyboard or click on the `Run` button on the toolbar above. The output will be visible below the cell."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "e6fa8e78",
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "!nvidia-smi"
-   ]
-  },
-  {
-   "cell_type": "markdown",
    "id": "790904cd",
    "metadata": {},
    "source": [
@@ -60,11 +40,11 @@
     "\n",
     "![open_terminal_session](../../images/open_terminal_session.png)\n",
     "\n",
-    "On our DGX-1 system, the output is as follows:\n",
+    "On a DGX-1 system, the output is as follows:\n",
     "\n",
     "![nvidia_smi_topo_output](../../images/nvidia_smi_topo_output.png)\n",
     "\n",
-    "Focus one a particular row, say GPU 0. The output states that GPUs 1 through 4 are connected to it via NVLink (in addition to PCIe) and GPUs 5 through 7 are connected to it via PCIe as well as an \"SMP\" interconnect. We have a dual-socket system and the CPUs in these sockets are connected by an interconnect known as SMP interconnect.\n",
+    "Focus on a particular row, say GPU 0. The output states that GPUs 1 through 4 are connected to it via NVLink (in addition to PCIe) and GPUs 5 through 7 are connected to it via PCIe as well as an \"SMP\" interconnect. We have a dual-socket system and the CPUs in these sockets are connected by an interconnect known as SMP interconnect.\n",
     "\n",
     "Thus, GPU 0 to GPU 5 communication happens via not just PCIe, but also over the inter-socket interconnect within the same node. Clearly, this is a longer path than say the one between GPU 0 and GPU 1, which are connected via NVLink directly. We will discuss the NIC to GPU connection in the inter-node section of this bootcamp.\n",
     "\n",
@@ -228,7 +208,7 @@
     "\n",
     "Here's a link to the home notebook through which all other notebooks are accessible:\n",
     "\n",
-    "# [HOME](../../../introduction.ipynb)\n",
+    "# [HOME](../../../start_here.ipynb)\n",
     "\n",
     "---\n",
     "## Links and Resources\n",
@@ -264,7 +244,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.5"
+   "version": "3.6.9"
   }
  },
  "nbformat": 4,

+ 8 - 25
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/cuda/memcpy.ipynb

@@ -2,24 +2,6 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "dd0ae66a",
-   "metadata": {},
-   "source": [
-    "Before we begin, let's get an overview of the CUDA driver version and the GPUs running on the server by executing the `nvidia-smi` command below. Highlight the cell below by clicking on it and then either hit `Ctrl+Enter` on the keyboard or click on the `Run` button on the toolbar above. The output will be visible below the cell."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "b7d483e8",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!nvidia-smi"
-   ]
-  },
-  {
-   "cell_type": "markdown",
    "id": "e4ddba18",
    "metadata": {},
    "source": [
@@ -38,7 +20,7 @@
     "Before we begin, we define two important terms:\n",
     "\n",
     "* **Latency:** The amount of time it takes to take a unit of data from point A to point B. For example, if 4B of data can be transferred from point A to B in 4 $\\mu$s, that is the latency of transfer.\n",
-    "* **Bandwidth:** The amount of data that can be transferred from point A to point B in a unit of time. For example, if the width of the bus is 64KiB and latency of transfer between point A and B is 4 $\\mu$s, the bandwidth is 64KiB * (1/4$\\mu$s) = 1.6 GiB/s.\n",
+    "* **Bandwidth:** The amount of data that can be transferred from point A to point B in a unit of time. For example, if the width of the bus is 64KB and latency of transfer between point A and B is 4 $\\mu$s, the bandwidth is 64KB * (1/4$\\mu$s) = 1.6 GB/s.\n",
     "\n",
     "To parallelize our application to multi-GPUs, we first review the different methods of domain decomposition available to us for splitting the data among the GPUs, thereby distributing the work. Broadly, we can divide data into either stripes or tiles.\n",
     "\n",
@@ -78,7 +60,7 @@
    "id": "62d045bd",
    "metadata": {},
    "source": [
-    "The command should output more than one GPU. Inside a program, the number of GPU in the node can be obtained using the `cudaGetDeviceCount(int *count)` function and to perform any task, like running a CUDA kernel, copy operation, etc. on a particular GPU, we use the `cudaSetDevice(int device)` function.\n",
+    "`nvidia-smi` utility shows the available GPU on a node, but inside a CUDA program, the number of GPU in the node can be obtained using the `cudaGetDeviceCount(int *count)` function. To perform any task, like running a CUDA kernel, copy operation, etc. on a particular GPU, we use the `cudaSetDevice(int device)` function.\n",
     "\n",
     "### Copying between GPUs\n",
     "\n",
@@ -109,7 +91,7 @@
     "} // Serializes operations with respect to the host\n",
     "```\n",
     "\n",
-    "As this code results in serialized execution:\n",
+    "This code results in serialized execution as shown in diagram below:\n",
     "\n",
     "![memcpy_serialized](../../images/memcpy_serialized.png)\n",
     "\n",
@@ -160,7 +142,7 @@
     "2. Asynchronously copy GPU-local L2 norm back to CPU and implement top and bottom halo exchanges.\n",
     "3. Synchronize the devices at the end of each iteration using `cudaDeviceSynchronize` function.\n",
     "\n",
-    "Review the topic above on Asynchronous Operations if in doubt. Recall the utility of using separate `for` loops for launching device kernels and initiating copy operations.\n",
+    "Review the topic above on Asynchronous Operations if in doubt. We will be using separate `for` loops for launching device kernels and initiating copy operations.\n",
     "\n",
     "After implementing these, let's compile the code:"
    ]
@@ -227,7 +209,7 @@
    "id": "c4ac727d",
    "metadata": {},
    "source": [
-    "In the profiler timeline, the first few seconds denote the single-GPU code running on one of the GPUs. This version is executed so we can compare the multi-GPU version with it and we have already analyzed it. Let's analyze the multi-GPU timeline.\n",
+    "[Download the profiler report here for visualization](../../source_code/cuda/jacobi_memcpy_report.qdrep). In the profiler timeline, the first few seconds denote the single-GPU code running on one of the GPUs. This version is executed so we can compare the multi-GPU version with it. Let's analyze the multi-GPU timeline.\n",
     "\n",
     "![jacobi_memcpy_report_overview](../../images/jacobi_memcpy_report_overview.png)\n",
     "\n",
@@ -385,6 +367,7 @@
    "id": "4b801eb0",
    "metadata": {},
    "source": [
+    "[Download the profiler report here for visualization](../../source_code/cuda/jacobi_memcpy_p2p_report.qdrep).\n",
     "The output we obtain is shared below:\n",
     "\n",
     "![jacobi_memcpy_p2p_report](../../images/jacobi_memcpy_p2p_report.png)\n",
@@ -399,7 +382,7 @@
     "\n",
     "Here's a link to the home notebook through which all other notebooks are accessible:\n",
     "\n",
-    "# [HOME](../../../introduction.ipynb)\n",
+    "# [HOME](../../../start_here.ipynb)\n",
     "\n",
     "---\n",
     "\n",
@@ -437,7 +420,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.5"
+   "version": "3.6.9"
   }
  },
  "nbformat": 4,

+ 5 - 23
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/cuda/streams.ipynb

@@ -2,24 +2,6 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "18638d64",
-   "metadata": {},
-   "source": [
-    "Before we begin, let's get an overview of the CUDA driver version and the GPUs running on the server by executing the `nvidia-smi` command below. Highlight the cell below by clicking on it and then either hit `Ctrl+Enter` on the keyboard or click on the `Run` button on the toolbar above. The output will be visible below the cell."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "6ddeeccc",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!nvidia-smi"
-   ]
-  },
-  {
-   "cell_type": "markdown",
    "id": "a7c63ff6",
    "metadata": {},
    "source": [
@@ -27,7 +9,7 @@
     "\n",
     "We will learn about the following in this lab:\n",
     "\n",
-    "* Concept of overlapping computation withEventEvent communication\n",
+    "* Concept of overlapping computation with Memory transfer\n",
     "* CUDA Streams overview and implementation\n",
     "* CUDA Events overview and implementation\n",
     "* Synchronization primitives in CUDA for the whole device, stream, event, etc.\n",
@@ -192,7 +174,7 @@
    "id": "a6f5bf5e",
    "metadata": {},
    "source": [
-    "Open the report in GUI and measure the total time between two Jacobi iterations as shown below.\n",
+    "[Download the report here](../../source_code/cuda/jacobi_streams_p2p_report.qdrep) and open the report in GUI and measure the total time between two Jacobi iterations as shown below.\n",
     "\n",
     "![streams_util_selection](../../images/streams_util_selection.png)\n",
     "\n",
@@ -339,7 +321,7 @@
    "id": "0e330889-77e3-4fe3-9782-b4a13425c9bb",
    "metadata": {},
    "source": [
-    "Download the `.qdrep` report file and open it in the Nsight Systems GUI application:\n",
+    "[Download the .qdrep report file](../../source_code/cuda/jacobi_streams_events_p2p_report.qdrep) and open it in the Nsight Systems GUI application:\n",
     "\n",
     "![jacobi_memcpy_streams_events_p2p_report](../../images/jacobi_memcpy_streams_events_p2p_report.png)\n",
     "\n",
@@ -355,7 +337,7 @@
     "\n",
     "Here's a link to the home notebook through which all other notebooks are accessible:\n",
     "\n",
-    "# [HOME](../../../introduction.ipynb)\n",
+    "# [HOME](../../../start_here.ipynb)\n",
     "\n",
     "---\n",
     "## Links and Resources\n",
@@ -393,7 +375,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.5"
+   "version": "3.6.9"
   }
  },
  "nbformat": 4,

+ 5 - 22
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/cuda_aware.ipynb

@@ -2,28 +2,9 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "da592767",
-   "metadata": {},
-   "source": [
-    "Before we begin, let's get an overview of the CUDA driver version and the GPUs running on the server by executing the `nvidia-smi` command below. Highlight the cell below by clicking on it and then either hit `Ctrl+Enter` on the keyboard or click on the `Run` button on the toolbar above. The output will be visible below the cell."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "9b23b406",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!nvidia-smi"
-   ]
-  },
-  {
-   "cell_type": "markdown",
    "id": "a89881ae",
    "metadata": {},
    "source": [
-    "**Note:** Execution results can vary significantly based on the MPI installation, supporting libraries, workload manager, and underlying CPU and GPU hardware configuration and topology. The codes in this lab have been tested on DGX-1 8 Tesla V100 16 GB nodes connected by Mellanox InfiniBand NICs running OpenMPI v4.1.1 with HPCX 2.8.1 and CUDA v11.3.0.0.\n",
     "\n",
     "# Learning Objectives\n",
     "\n",
@@ -33,6 +14,8 @@
     "* Impact of fine-tuning CUDA-aware MPI on application performance.\n",
     "* Underlying GPUDirect technologies like P2P and RDMA.\n",
     "\n",
+    "**Note:** Execution results can vary significantly based on the MPI installation, supporting libraries, workload manager, and underlying CPU and GPU hardware configuration and topology. The codes in this lab have been tested on DGX-1 8 Tesla V100 16 GB nodes connected by Mellanox InfiniBand NICs running OpenMPI v4.1.1 with HPCX 2.8.1 and CUDA v11.3.0.0.\n",
+    "\n",
     "# Improving Application Performance\n",
     "\n",
     "## Analysis\n",
@@ -186,7 +169,7 @@
     "\n",
     "Moreover, we are not interested in profiling the single GPU version as profiling it increases both profiling time and the `.qdrep` file size. So, we will skip running the single-GPU version by passing the `-skip_single_gpu` flag to binary. Note that we will not get the speedup and efficiency numbers.\n",
     "\n",
-    "That isn't a problem, however. As NVTX statistics provide the runtime for our multi-GPU Jacobi loop as well as the time taken for halo exchange, we can use them for comparison.\n",
+    "That isn't a problem, however as NVTX statistics provide the runtime for our multi-GPU Jacobi loop as well as the time taken for halo exchange, we can use them for comparison.\n",
     "\n",
     "Now, let us profile only the multi-GPU version for the baseline 1K iterations:\n",
     "\n"
@@ -337,7 +320,7 @@
     "\n",
     "Here's a link to the home notebook through which all other notebooks are accessible:\n",
     "\n",
-    "# [HOME](../../../introduction.ipynb)\n",
+    "# [HOME](../../../start_here.ipynb)\n",
     "\n",
     "---\n",
     "## Links and Resources\n",
@@ -374,7 +357,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.5"
+   "version": "3.6.9"
   }
  },
  "nbformat": 4,

+ 6 - 23
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/memcpy.ipynb

@@ -2,28 +2,9 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "4ecc207b-52c7-463a-8731-19203d384a30",
-   "metadata": {},
-   "source": [
-    "Before we begin, let's get an overview of the CUDA driver version and the GPUs running on the server by executing the `nvidia-smi` command below. Highlight the cell below by clicking on it and then either hit `Ctrl+Enter` on the keyboard or click on the `Run` button on the toolbar above. The output will be visible below the cell."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "4d6d1387-f525-40d4-bf3a-f7403bdce2b5",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!nvidia-smi"
-   ]
-  },
-  {
-   "cell_type": "markdown",
    "id": "ed9d6f0d-cfa6-4ffd-b970-bee700bf1a90",
    "metadata": {},
    "source": [
-    "**Note:** Execution results can vary significantly based on the MPI installation, supporting libraries, workload manager, and underlying CPU and GPU hardware configuration and topology. The codes in this lab have been tested on DGX-1 8 Tesla V100 16 GB nodes connected by Mellanox InfiniBand NICs running OpenMPI v4.1.1 with HPCX 2.8.1 and CUDA v11.3.0.0.\n",
     "\n",
     "# Learning Objectives\n",
     "\n",
@@ -33,6 +14,8 @@
     "* Managing the two-level hierarchy created by global and local rank of a process and how it accesses GPU(s).\n",
     "* OpenMPI process mappings and its effect on application performance.\n",
     "\n",
+    "**Note:** Execution results can vary significantly based on the MPI installation, supporting libraries, workload manager, and underlying CPU and GPU hardware configuration and topology. The codes in this lab have been tested on DGX-1 8 Tesla V100 16 GB nodes connected by Mellanox InfiniBand NICs running OpenMPI v4.1.1 with HPCX 2.8.1 and CUDA v11.3.0.0.\n",
+    "\n",
     "## MPI Inter-Process Communication\n",
     "\n",
     "Let us learn more about how MPI communicates between processes.\n",
@@ -117,7 +100,7 @@
     "\n",
     "### Nodel-Level Local Rank\n",
     "\n",
-    "As we will run on multiple nodes, for example 2, the number of processes launched, 16, will not map one-to-one with GPU Device ID, which runs from 0 to 7 on each node. Thus, we need to create a local rank at the node level.\n",
+    "As we will run on multiple nodes, for example 2 nodes, the number of processes launched will be 16 ( assuming 8 GPU per node like in DGX). This requires addional mapping of process id to GPU Device ID, which runs from 0 to 7 on each node. Thus, we need to create a local rank at the node level.\n",
     "\n",
     "To achieve this, we split the `MPI_COMM_WORLD` communicator between the nodes and store it in a `local_comm` communicator. Then, we get the local rank by calling the familiar `MPI_Comm_rank` function. Finally, we free the `local_comm` communicator as we don't require it anymore. \n",
     "\n",
@@ -356,7 +339,7 @@
    "id": "c89cc9bd-aed4-4ae4-bd5c-ae4698d44d92",
    "metadata": {},
    "source": [
-    "Download the report and view it via the GUI. \n",
+    "[Download the report](../../source_code/mpi/jacobi_memcpy_mpi_report.qdrep) and view it via the GUI. \n",
     "\n",
     "You may notice that only 8 MPI processes are visible even though we launched 16 MPI processes. Nsight Systems displays the output from a single node and inter-node transactions (copy operations) are visible. This is for ease of viewing and doesn't impede our analysis.\n",
     "\n",
@@ -386,7 +369,7 @@
     "\n",
     "Here's a link to the home notebook through which all other notebooks are accessible:\n",
     "\n",
-    "# [HOME](../../../introduction.ipynb)\n",
+    "# [HOME](../../../start_here.ipynb)\n",
     "\n",
     "---\n",
     "## Links and Resources\n",
@@ -422,7 +405,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.5"
+   "version": "3.6.9"
   }
  },
  "nbformat": 4,

+ 5 - 23
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/multi_node_intro.ipynb

@@ -2,24 +2,6 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "eb0fd5ff",
-   "metadata": {},
-   "source": [
-    "Before we begin, let's get an overview of the CUDA driver version and the GPUs running on the server by executing the `nvidia-smi` command below. Highlight the cell below by clicking on it and then either hit `Ctrl+Enter` on the keyboard or click on the `Run` button on the toolbar above. The output will be visible below the cell."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "aa6710a0",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!nvidia-smi"
-   ]
-  },
-  {
-   "cell_type": "markdown",
    "id": "403d97bc",
    "metadata": {},
    "source": [
@@ -28,7 +10,7 @@
     "In this lab we will learn about:\n",
     "\n",
     "* Multi-node Multi-GPU programming and importance of inter-process communication frameworks.\n",
-    "* Introduction MPI specification and APIs.\n",
+    "* Introduction to MPI specification and APIs.\n",
     "* Execution of Hello World MPI binary on single as well as multiple nodes.\n",
     "\n",
     "# Multi-Node Multi-GPU Programming\n",
@@ -37,7 +19,7 @@
     "\n",
     "A single process can spawn threads that can be spread within a node (potentially on multiple sockets) but it cannot cross the node boundary. Thus, scalable multi-node programming requires the use of multiple processes.\n",
     "\n",
-    "Inter-process communication is usually done by libraries like Open MPI. They expose communication APIs, synchronization constructs, etc. to the user. Let us now learn about programming in MPI.\n",
+    "Inter-process communication is usually done by libraries like OpenMPI. They expose communication APIs, synchronization constructs, etc. to the user. Let us now learn about programming in MPI.\n",
     "\n",
     "## MPI\n",
     "\n",
@@ -210,7 +192,7 @@
     "Hello world from processor <node_0_name>, rank 2 out of 4 processors\n",
     "```\n",
     "\n",
-    "**Note:** Promptly request help if you face difficulty at any step. Subsequent labs will assume the reader understands how to run a multi-node MPI job.\n",
+    "**Note:** Subsequent labs will assume the reader understands how to run a multi-node MPI job.\n",
     "\n",
     "Now, let us learn more MPI concepts and code a CUDA Memcpy and MPI-based Jacobi solver. Click below to move to the next lab:\n",
     "\n",
@@ -218,7 +200,7 @@
     "\n",
     "Here's a link to the home notebook through which all other notebooks are accessible:\n",
     "\n",
-    "# [HOME](../../../introduction.ipynb)\n",
+    "# [HOME](../../../start_here.ipynb)\n",
     "\n",
     "---\n",
     "## Links and Resources\n",
@@ -254,7 +236,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.5"
+   "version": "3.6.9"
   }
  },
  "nbformat": 4,

+ 3 - 21
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/nccl/nccl.ipynb

@@ -2,24 +2,6 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "79103e8e-da47-4528-ac30-2cdd2adccaea",
-   "metadata": {},
-   "source": [
-    "Before we begin, let's get an overview of the CUDA driver version and the GPUs running on the server by executing the `nvidia-smi` command below. Highlight the cell below by clicking on it and then either hit `Ctrl+Enter` on the keyboard or click on the `Run` button on the toolbar above. The output will be visible below the cell."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "6ca7ab3b-aef8-41d6-a568-8458bce7c7d6",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!nvidia-smi"
-   ]
-  },
-  {
-   "cell_type": "markdown",
    "id": "e96ca37c-9955-4b7a-83ff-c99761bb51e9",
    "metadata": {},
    "source": [
@@ -270,7 +252,7 @@
    "id": "d197bb8c-6fe0-4afa-831a-96b320b936fd",
    "metadata": {},
    "source": [
-    "On the cell output above, view the NVTX Push-Pop stats. NCCL has already been instrumented using NVTX annotations so we don't need to add our own. However, since NCCL communication calls are asynchronous with respect to the host and execute mostly on the GPU, NVTX stats are not very helpful. \n",
+    "Downlaod the report from [here](../../source_code/nccl/jacobi_nccl_report.qdrep). On the cell output above, view the NVTX Push-Pop stats. NCCL has already been instrumented using NVTX annotations so we don't need to add our own. However, since NCCL communication calls are asynchronous with respect to the host and execute mostly on the GPU, NVTX stats are not very helpful. \n",
     "\n",
     "Download the `.qdrep` report file and open it in GUI. \n",
     "\n",
@@ -297,7 +279,7 @@
     "\n",
     "Here's a link to the home notebook through which all other notebooks are accessible:\n",
     "\n",
-    "# [HOME](../../../introduction.ipynb)\n",
+    "# [HOME](../../../start_here.ipynb)\n",
     "\n",
     "---\n",
     "## Links and Resources\n",
@@ -334,7 +316,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.5"
+   "version": "3.6.9"
   }
  },
  "nbformat": 4,

+ 3 - 21
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/nvshmem/nvshmem.ipynb

@@ -2,24 +2,6 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "4fe78e41-7a43-46ff-b6f8-60787bfaded6",
-   "metadata": {},
-   "source": [
-    "Before we begin, let's get an overview of the CUDA driver version and the GPUs running on the server by executing the `nvidia-smi` command below. Highlight the cell below by clicking on it and then either hit `Ctrl+Enter` on the keyboard or click on the `Run` button on the toolbar above. The output will be visible below the cell."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "c0a532a3-a19e-42d6-81c8-dfbf248604d2",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!nvidia-smi"
-   ]
-  },
-  {
-   "cell_type": "markdown",
    "id": "b016bc32-95ed-4298-a475-0b2209ec8c1a",
    "metadata": {},
    "source": [
@@ -405,7 +387,7 @@
    "id": "19b62965-149a-43a3-9d12-d1919cbadefd",
    "metadata": {},
    "source": [
-    "Notice that most NVSHMEM calls are available in NVTX Push-Pop stats on the CLI output. Open `.qdrep` file in the GUI to view the Timeline:\n",
+    "Download the report from [here](../../source_code/nvshmem/jacobi_nvshmem_report.dqrep).Notice that most NVSHMEM calls are available in NVTX Push-Pop stats on the CLI output. Open `.qdrep` file in the GUI to view the Timeline:\n",
     "\n",
     "![nvshmem_profiler_report](../../images/nvshmem_profiler_report.png)\n",
     "\n",
@@ -417,7 +399,7 @@
     "\n",
     "Click link to the home notebook below through which all other notebooks are accessible:\n",
     "\n",
-    "# [HOME](../../../introduction.ipynb)\n",
+    "# [HOME](../../../start_here.ipynb)\n",
     "\n",
     "---\n",
     "## Links and Resources\n",
@@ -453,7 +435,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.5"
+   "version": "3.6.9"
   }
  },
  "nbformat": 4,

+ 44 - 15
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/single_gpu/single_gpu_overview.ipynb

@@ -10,7 +10,7 @@
     "The goal of this lab is to:\n",
     "\n",
     "* Review the scientific problem for which the Jacobi solver application has been developed.\n",
-    "* Understand the run the single-GPU code of the application.\n",
+    "* Understand the single-GPU code of the application.\n",
     "* Learn about NVIDIA Nsight Systems profiler and how to use it to analyze our application.\n",
     "\n",
     "# The Application\n",
@@ -21,17 +21,23 @@
     "\n",
     "Laplace Equation is a well-studied linear partial differential equation that governs steady state heat conduction, irrotational fluid flow, and many other phenomena. \n",
     "\n",
-    "In this lab, we will consider the 2D Laplace Equation on a rectangle with Dirichlet boundary conditions on the left and right boundary and periodic boundary conditions on top and bottom boundary. We wish to solve the following equation:\n",
+    "In this lab, we will consider the 2D Laplace Equation on a rectangle with [Dirichlet boundary conditions](https://en.wikipedia.org/wiki/Dirichlet_boundary_condition) on the left and right boundary and periodic boundary conditions on top and bottom boundary. We wish to solve the following equation:\n",
     "\n",
     "$\\Delta u(x,y) = 0\\;\\forall\\;(x,y)\\in\\Omega,\\delta\\Omega$\n",
     "\n",
     "### Jacobi Method\n",
     "\n",
-    "The Jacobi method is an iterative algorithm to solve a linear system of strictly diagonally dominant equations. The governing Laplace equation is discretized and converted to a matrix amenable to Jacobi-method based solver.\n",
+    "The Jacobi method is an iterative algorithm to solve a linear system of strictly diagonally dominant equations. The governing Laplace equation is discretized and converted to a matrix amenable to Jacobi-method based solver. The pseudo code for Jacobi iterative process can be seen in diagram below:\n",
+    "\n",
+    "![gpu_programming_process](../../images/jacobi_algo.jpg)\n",
+    "\n",
+    "\n",
+    "The outer loop defines the convergence point, which could either be defined as reaching max number of iterations or when [L2 Norm](https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-73003-5_1070) reaches a max/min value. \n",
+    "\n",
     "\n",
     "### The Code\n",
     "\n",
-    "The GPU processing flow follows 3 key steps:\n",
+    "The GPU processing flow in general follows 3 key steps:\n",
     "\n",
     "1. Copy data from CPU to GPU\n",
     "2. Launch GPU Kernel\n",
@@ -39,7 +45,7 @@
     "\n",
     "![gpu_programming_process](../../images/gpu_programming_process.png)\n",
     "\n",
-    "Let's understand the single-GPU code first. \n",
+    "We follow the same 3 steps in our code. Let's understand the single-GPU code first. \n",
     "\n",
     "The source code file, [jacobi.cu](../../source_code/single_gpu/jacobi.cu) (click to open), is present in `CFD/English/C/source_code/single_gpu/` directory. \n",
     "\n",
@@ -49,12 +55,35 @@
     "\n",
     "Similarly, have look at the [Makefile](../../source_code/single_gpu/Makefile). \n",
     "\n",
-    "Refer to the `single_gpu(...)` function. The important steps at each iteration of the Jacobi Solver (that is, the `while` loop) are:\n",
+    "Refer to the `single_gpu(...)` function. The important steps at each iteration of the Jacobi Solver inside `while` loop are:\n",
     "1. The norm is set to 0 using `cudaMemset`.\n",
     "2. The device kernel `jacobi_kernel` is called to update the interier points.\n",
     "3. The norm is copied back to the host using `cudaMemcpy` (DtoH), and\n",
     "4. The periodic boundary conditions are re-applied for the next iteration using `cudaMemcpy` (DtoD).\n",
     "\n",
+    "```\n",
+    "    while (l2_norm > tol && iter < iter_max) {\n",
+    "        cudaMemset(l2_norm_d, 0, sizeof(float));\n",
+    "\n",
+    "\t   // Compute grid points for this iteration\n",
+    "        jacobi_kernel<<<dim_grid, dim_block>>>(a_new, a, l2_norm_d, iy_start, iy_end, nx);\n",
+    "       \n",
+    "        cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(float), cudaMemcpyDeviceToHost));\n",
+    "\n",
+    "        // Apply periodic boundary conditions\n",
+    "        cudaMemcpy(a_new, a_new + (iy_end - 1) * nx, nx * sizeof(float), cudaMemcpyDeviceToDevice);\n",
+    "        cudaMemcpy(a_new + iy_end * nx, a_new + iy_start * nx, nx * sizeof(float),cudaMemcpyDeviceToDevice);\n",
+    "\n",
+    "\t    cudaDeviceSynchronize());\n",
+    "\t    l2_norm = *l2_norm_h;\n",
+    "\t    l2_norm = std::sqrt(l2_norm);\n",
+    "\n",
+    "        iter++;\n",
+    "\t    if ((iter % 100) == 0) printf(\"%5d, %0.6f\\n\", iter, l2_norm);\n",
+    "        std::swap(a_new, a);\n",
+    "    }\n",
+    "```\n",
+    "\n",
     "Note that we run the Jacobi solver for 1000 iterations over the grid.\n",
     "\n",
     "### Compilation and Execution\n",
@@ -135,7 +164,7 @@
     "\n",
     "# Profiling\n",
     "\n",
-    "While the program in our labs gives the execution time in its output, it may not always be convinient to time the execution from within the program. Moreover, just timing the execution does not reveal the bottlenecks directly. For that purpose, we profile the program with NVIDIA's NSight Systems profiler's command-line interface (CLI), `nsys`. \n",
+    "While the program in our labs gives the execution time in its output, it may not always be convinient to time the execution from within the program. Moreover, just timing the execution does not reveal the bottlenecks directly. For that purpose, we profile the program with NVIDIA's Nsight Systems profiler's command-line interface (CLI), `nsys`. \n",
     "\n",
     "### NVIDIA Nsight Systems\n",
     "\n",
@@ -146,9 +175,9 @@
     "![Nsight Systems timeline](../../images/nsys_overview.png)\n",
     "\n",
     "The highlighted portions are identified as follows:\n",
-    "* <span style=\"color:red\">Red</span>: The CPU tab provides thread-level core utilization data. \n",
-    "* <span style=\"color:blue\">Blue</span>: The CUDA HW tab displays GPU kernel and memory transfer activities and API calls.\n",
-    "* <span style=\"color:orange\">Orange</span>: The Threads tab gives a detailed view of each CPU thread's activity including from OS runtime libraries, MPI, NVTX, etc.\n",
+    "* <span style=\"color:red\">Red</span>: The CPU row provides thread-level core utilization data. \n",
+    "* <span style=\"color:blue\">Blue</span>: The CUDA HW row displays GPU kernel and memory transfer activities and API calls.\n",
+    "* <span style=\"color:orange\">Orange</span>: The Threads row gives a detailed view of each CPU thread's activity including from OS runtime libraries, MPI, NVTX, etc.\n",
     "\n",
     "#### `nsys` CLI\n",
     "\n",
@@ -183,7 +212,7 @@
     "\n",
     "### Improving performance\n",
     "\n",
-    "Any code snippet can be taken up for optimizations. However, it is important to realize that our current code is limited to a single GPU. Usually a very powerful first optimization is to parallelize the code, which in our case means running it on multiple GPUs. Thus, we generally follow the cyclical process:\n",
+    "Any code can be taken up for optimizations. We will follow the cyclic process to optimize our code and get best scaling results across multiple GPU:\n",
     "\n",
     "* **Analyze** the code using profilers to identify bottlenecks and hotspots.\n",
     "* **Parallelize** the routines where most of the time in the code is spent.\n",
@@ -215,17 +244,17 @@
    "id": "6db3c3c7",
    "metadata": {},
    "source": [
-    "Now, download the report and view it via the GUI. This is the analysis step. Right click on the NVTX tab and select the Events View.\n",
+    "Now, [download the report](../../source_code/jacobi_report.qdrep) and view it via the GUI. This is the analysis step. Right click on the NVTX tab and select the Events View.\n",
     "\n",
     "![nsys single_gpu_analysis](../../images/nsys_single_gpu_analysis.png)\n",
     "\n",
-    "Clearly, we need to parallelize the \"Jacobi Solve\" routine, which is essentially the iterative Jacobi solver loop. Click on the link to continue to the next lab where we parallelize the code using cudaMemcpy and understand concepts like Peer-to-Peer Memory Access.\n",
+    "Clearly, we need to parallelize the \"Jacobi Solve\" routine, which is essentially the iterative Jacobi solver loop. Click on the link to continue to the next lab where we parallelize the code using `cudaMemcpy` and understand concepts like Peer-to-Peer Memory Access.\n",
     "\n",
     "# [Next: CUDA Memcpy and Peer-to-Peer Memory Access](../cuda/memcpy.ipynb)\n",
     "\n",
     "Here's a link to the home notebook through which all other notebooks are accessible:\n",
     "\n",
-    "# [HOME](../../../introduction.ipynb)\n",
+    "# [HOME](../../../start_here.ipynb)\n",
     "\n",
     "---\n",
     "\n",
@@ -263,7 +292,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.5"
+   "version": "3.6.9"
   }
  },
  "nbformat": 4,

+ 9 - 7
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/cuda/jacobi_memcpy.cu

@@ -224,20 +224,22 @@ int main(int argc, char* argv[]) {
         if (p2p == true) {
             const int top = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
             int canAccessPeer = 0;
-            // TODO: Part 2- Check whether GPU "devices[dev_id]" can access peer "devices[top]"
-            CUDA_RT_CALL(cudaDeviceCanAccessPeer(&canAccessPeer, /*Fill me*/, /*Fill me*/));
+            // TODO: Part 2- Check whether GPU "devices[dev_id]" can access peer "devices[top]" 
+            // Fill in and uncomment the line below
+            //CUDA_RT_CALL(cudaDeviceCanAccessPeer(&canAccessPeer, /*Fill me*/, /*Fill me*/));
             if (canAccessPeer) {
-            // TODO: Part 2- Enable peer access from GPU "devices[dev_id]" to "devices[top]"
-                CUDA_RT_CALL(cudaDeviceEnablePeerAccess(/*Fill me*/, 0));
+            // TODO: Part 2- Enable peer access from GPU "devices[dev_id]" to "devices[top]" 
+            // Fill in and uncomment the line below
+                //CUDA_RT_CALL(cudaDeviceEnablePeerAccess(/*Fill me*/, 0));
             }
             const int bottom = (dev_id + 1) % num_devices;
             if (top != bottom) {
                 canAccessPeer = 0;
                 // TODO: Part 2- Check and enable peer access from GPU "devices[dev_id]" to
-                // "devices[bottom]", whenever possible
-                CUDA_RT_CALL(cudaDeviceCanAccessPeer(&canAccessPeer, /*Fill me*/, /*Fill me*/));
+                // "devices[bottom]", whenever possible. Fill and uncomment line below
+                //CUDA_RT_CALL(cudaDeviceCanAccessPeer(&canAccessPeer, /*Fill me*/, /*Fill me*/));
                 if (canAccessPeer) {
-                    CUDA_RT_CALL(cudaDeviceEnablePeerAccess(/*Fill me*/, 0));
+                    //CUDA_RT_CALL(cudaDeviceEnablePeerAccess(/*Fill me*/, 0));
                 }
             }
         }

+ 1 - 1
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/single_gpu/Makefile

@@ -1,6 +1,6 @@
 # Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
 NVCC=nvcc
-CUDA_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/
+#CUDA_HOME=hpc_sdk_path/Linux_x86_64/21.3/cuda/11.2/
 GENCODE_SM30	:= -gencode arch=compute_30,code=sm_30
 GENCODE_SM35	:= -gencode arch=compute_35,code=sm_35
 GENCODE_SM37	:= -gencode arch=compute_37,code=sm_37

+ 5 - 0
hpc/multi_gpu_nways/labs/CFD/English/Presentations/README.md

@@ -0,0 +1,5 @@
+Partners interested in delivering the critical hands-on skills needed to advance science in the form of a Bootcamp can reach out to us via the [GPU Hackathon Partner](https://gpuhackathons.org/partners) website. In addition to the current bootcamp material, Partners will be provided with the following:
+
+- Presentation: All Bootcamps are accompanied by training material presentations which can be used during the Bootcamp session.
+- Mini challenge: To test the knowledge gained during this Bootcamp, a mini application challenge is provided along with a sample solution.
+- Additional Support: On a case-by-case basis, Partners can also be trained on how to deliver the Bootcamp effectively with maximal impact.

+ 2 - 2
hpc/multi_gpu_nways/labs/CFD/English/introduction.ipynb

@@ -51,7 +51,7 @@
     "\n",
     "## Licensing \n",
     "\n",
-    "This material is released) by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
+    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
    ]
   }
  ],
@@ -71,7 +71,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.5"
+   "version": "3.6.9"
   }
  },
  "nbformat": 4,