
application overview Jupyter NB and code for single GPU, cudaMemcpy, and MPI

Anish Saxena committed 4 years ago
commit 4a9f630945
28 files changed, 2004 insertions and 11 deletions
  1. +53 -1
      hpc/multi_gpu_nways/Singularity
  2. +149 -0
      hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
  3. +32 -0
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/memcpy/Makefile
  4. Binary
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/memcpy/jacobi
  5. +510 -0
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/memcpy/jacobi.cu
  6. +13 -0
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/memcpy/mgpm
  7. +42 -0
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/Makefile
  8. Binary
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi
  9. +407 -0
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi.cpp
  10. Binary
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi.o
  11. +113 -0
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi_kernels.cu
  12. Binary
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi_kernels.o
  13. Binary
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports.zip
  14. Binary
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report1.qdrep
  15. Binary
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report2.qdrep
  16. Binary
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report3.qdrep
  17. Binary
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report4.qdrep
  18. Binary
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report5.qdrep
  19. Binary
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report6.qdrep
  20. Binary
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report7.qdrep
  21. Binary
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report8.qdrep
  22. +33 -0
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/single_gpu/Makefile
  23. Binary
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/single_gpu/jacobi
  24. +266 -0
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/single_gpu/jacobi.cu
  25. +315 -0
      hpc/multi_gpu_nways/labs/CFD/English/C/source_code/single_gpu/jacobi.cu.old
  26. +4 -10
      hpc/multi_gpu_nways/labs/CFD/English/mgp_01_begin.ipynb
  27. +10 -0
      hpc/multi_gpu_nways/mgpm
  28. +57 -0
      hpc/multi_gpu_nways/slurm-165592.out

+ 53 - 1
hpc/multi_gpu_nways/Singularity

@@ -1 +1,53 @@
-# To be populated
+# Copyright (c) 2021 NVIDIA Corporation.  All rights reserved. 
+
+Bootstrap: docker
+#FROM: nvcr.io/nvidia/nvhpc:20.11-devel-cuda_multi-ubuntu20.04
+FROM: nvcr.io/nvidia/nvhpc:21.3-devel-cuda_multi-ubuntu20.04
+
+%environment
+    export XDG_RUNTIME_DIR=
+    export PATH="$PATH:/usr/local/bin:/opt/anaconda3/bin:/usr/bin"
+    export PATH=/opt/nvidia/nsight-systems/2020.5.1/bin:/opt/nvidia/nsight-compute/2020.2.1:$PATH
+    export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/lib64/"
+
+%post
+    build_tmp=$(mktemp -d) && cd ${build_tmp}
+
+    apt-get -y update
+    apt-get -y dist-upgrade 
+    DEBIAN_FRONTEND=noninteractive apt-get -yq install --no-install-recommends \
+	    m4 vim-nox emacs-nox nano zip \
+ 	    python3-pip python3-setuptools git-core inotify-tools \
+	    curl git-lfs \
+	    build-essential libtbb-dev
+    rm -rf /var/lib/apt/cache/* 
+
+    pip3 install --upgrade pip
+    pip3 install --no-cache-dir jupyter
+    pip3 install gdown
+
+    apt-get install --no-install-recommends -y build-essential 
+
+# NVIDIA nsight-systems-2020.5.1, nsight-compute-2020.2.1
+    apt-get update -y   
+    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends apt-transport-https ca-certificates gnupg wget
+    apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys F60F4B3D7FA2AF80
+    echo "deb https://developer.download.nvidia.com/devtools/repos/ubuntu2004/amd64/ /" >> /etc/apt/sources.list.d/nsight.list 
+    apt-get update -y 
+    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends nsight-systems-2020.5.1 nsight-compute-2020.2.1 
+    apt-get install --no-install-recommends -y build-essential
+
+    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh 
+    bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/anaconda3 
+    rm Miniconda3-latest-Linux-x86_64.sh 
+    
+    cd /
+    rm -rf ${build_tmp}
+
+%files
+    labs/ /labs
+%runscript
+    "$@"
+
+%labels
+    AUTHOR mozhgank
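
A minimal sketch of building and running this container, assuming a recent Singularity installation; the image name `multi_gpu_nways.simg` is a hypothetical choice:

    # Build the image from the definition file above (may require root or --fakeroot)
    singularity build multi_gpu_nways.simg Singularity

    # The %runscript simply executes its arguments, so Jupyter can be started
    # inside the container with NVIDIA GPU support (--nv) like this:
    singularity run --nv multi_gpu_nways.simg jupyter notebook --notebook-dir=/labs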

+ 149 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/jacobi/overview.ipynb

@@ -0,0 +1,149 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "39ad569e",
+   "metadata": {},
+   "source": [
+    "# Laplace Equation\n",
+    "\n",
+    "Laplace Equation is a well-studied linear partial differential equation that governs steady state heat conduction, irrotational fluid flow, and many other phenomena. \n",
+    "\n",
+    "In this lab, we will consider the 2D Laplace Equation on a rectangle with Dirichlet boundary conditions on the left and right boundary and period boundary conditions on top and bottom boundary. We wish to solve the following equation:\n",
+    "\n",
+    "$\\Delta u(x,y) = 0\\;\\forall\\;(x,y)\\in\\Omega,\\delta\\Omega$\n",
+    "\n",
+    "# Jacobi Method\n",
+    "\n",
+    "The Jacobi method is an iterative algorithm to solve a linear system of strictly diagonally dominant equations. The governing equation is discretized and converted to a matrix amenable to Jacobi-method based solver.\n",
+    "\n",
+    "## The Code\n",
+    "\n",
+    "Let's understand the single-GPU code first. The source code file is available here: [jacobi.cu](../../source_code/single_gpu/jacobi.cu).\n",
+    "\n",
+    "Alternatively, you can open the `File` menu and click on the `Open...` option which opens Jupyter's file explorer in a new tab. Then, navigate to `CFD/English/C/source_code/single_gpu/` directory in which you can view the `jacobi.cu` file. \n",
+    "\n",
+    "Similarly, have look at the [Makefile](../../source_code/single_gpu/Makefile). \n",
+    "\n",
+    "Refer to the `single_gpu(...)` function. The important steps at iteration of the Jacobi Solver (that is, the `while` loop) are:\n",
+    "1. The norm is set to 0.\n",
+    "2. The device kernel is called to update the interier points.\n",
+    "3. The norm is copied back to the host, and\n",
+    "4. The boundary conditions are re-applied for the next iteration.\n",
+    "\n",
+    "Note that we run the Jacobi solver for 1000 iterations over the grid.\n",
+    "\n",
+    "## Compilation and Execution\n",
+    "\n",
+    "Let's compile the single-GPU code:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "id": "eac2daf7",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "rm -f jacobi jacobi.qdrep\r\n",
+      "nvcc -DHAVE_CUB -Xcompiler -fopenmp -lineinfo -DUSE_NVTX -lnvToolsExt -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80 -std=c++14 jacobi.cu -o jacobi\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "!cd ../../source_code/single_gpu && make clean && make"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "33345661",
+   "metadata": {},
+   "source": [
+    "Now, let us execute the program: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "id": "e234f430",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Single GPU jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations\n",
+      "    0, 31.999022\n",
+      "  100, 0.897983\n",
+      "  200, 0.535684\n",
+      "  300, 0.395651\n",
+      "  400, 0.319039\n",
+      "  500, 0.269961\n",
+      "  600, 0.235509\n",
+      "  700, 0.209829\n",
+      "  800, 0.189854\n",
+      "  900, 0.173818\n",
+      "16384x16384: 1 GPU:   3.3650 s\n"
+     ]
+    }
+   ],
+   "source": [
+    "!cd ../../source_code/single_gpu && ./jacobi"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "14bb863e",
+   "metadata": {},
+   "source": [
+    "The output reports the norm value every 100 iterations and the total execution time of the Jacobi Solver. We would like to decrease the overall execution time of the program. To quantify the performance gain, we denote the single-GPU execution time as $T_s$ and multi-GPU execution time for $P$ GPUs as $T_p$. using this, we obtain the figures-of-merit, speedup $S = T_s/T_p$ (optimal is $P$) and efficiency $E = S/P$ (optimal is $1$). "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b0c6f16a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    " !cd ../../source_code/mpi && make clean && make"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0979d23b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!cd ../../source_code/mpi && mpirun -np 8 nsys profile --trace=mpi,cuda,nvtx ./jacobi"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
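
For reference, the update performed by the device kernel in `jacobi.cu` is the standard 5-point finite-difference stencil for the Laplacian on a uniform grid. One Jacobi sweep over the interior points, and the residual norm the solver reports every 100 iterations, are:

$u^{k+1}_{i,j} = \frac{1}{4}\left(u^{k}_{i-1,j} + u^{k}_{i+1,j} + u^{k}_{i,j-1} + u^{k}_{i,j+1}\right)$

$\|r\|_2 = \sqrt{\sum_{i,j}\left(u^{k+1}_{i,j} - u^{k}_{i,j}\right)^2}$

As a purely hypothetical illustration of the figures of merit defined in the notebook: with $T_s = 3.4$ s and a run on $P = 8$ GPUs taking $T_p = 0.5$ s, the speedup would be $S = T_s/T_p = 6.8$ and the efficiency $E = S/P = 0.85$.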

+ 32 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/memcpy/Makefile

@@ -0,0 +1,32 @@
+# Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
+NVCC=nvcc
+GENCODE_SM30	:= -gencode arch=compute_30,code=sm_30
+GENCODE_SM35	:= -gencode arch=compute_35,code=sm_35
+GENCODE_SM37	:= -gencode arch=compute_37,code=sm_37
+GENCODE_SM50	:= -gencode arch=compute_50,code=sm_50
+GENCODE_SM52	:= -gencode arch=compute_52,code=sm_52
+GENCODE_SM60    := -gencode arch=compute_60,code=sm_60
+GENCODE_SM70    := -gencode arch=compute_70,code=sm_70
+GENCODE_SM80    := -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80
+GENCODE_FLAGS	:= $(GENCODE_SM70) $(GENCODE_SM80)
+ifdef DISABLE_CUB
+        NVCC_FLAGS = -Xptxas --optimize-float-atomics
+else
+        NVCC_FLAGS = -DHAVE_CUB
+endif
+NVCC_FLAGS += -Xcompiler -fopenmp -lineinfo -DUSE_NVTX -lnvToolsExt $(GENCODE_FLAGS) -std=c++14
+jacobi: Makefile jacobi.cu
+	$(NVCC) $(NVCC_FLAGS) jacobi.cu -o jacobi
+
+.PHONY: clean
+clean:
+	rm -f jacobi jacobi.qdrep
+
+sanitize: jacobi
+	compute-sanitizer ./jacobi -niter 10
+
+run: jacobi
+	./jacobi
+
+profile: jacobi
+	nsys profile --trace=cuda,nvtx -o jacobi ./jacobi -niter 10

Binary
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/memcpy/jacobi


+ 510 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/memcpy/jacobi.cu

@@ -0,0 +1,510 @@
+/* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NVIDIA CORPORATION nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <algorithm>
+#include <cmath>
+#include <cstdio>
+#include <iostream>
+#include <sstream>
+
+#include <omp.h>
+
+#ifdef HAVE_CUB
+#include <cub/block/block_reduce.cuh>
+#endif  // HAVE_CUB
+
+#ifdef USE_NVTX
+#include <nvToolsExt.h>
+
+const uint32_t colors[] = {0x0000ff00, 0x000000ff, 0x00ffff00, 0x00ff00ff,
+                           0x0000ffff, 0x00ff0000, 0x00ffffff};
+const int num_colors = sizeof(colors) / sizeof(uint32_t);
+
+#define PUSH_RANGE(name, cid)                              \
+    {                                                      \
+        int color_id = cid;                                \
+        color_id = color_id % num_colors;                  \
+        nvtxEventAttributes_t eventAttrib = {0};           \
+        eventAttrib.version = NVTX_VERSION;                \
+        eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;  \
+        eventAttrib.colorType = NVTX_COLOR_ARGB;           \
+        eventAttrib.color = colors[color_id];              \
+        eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII; \
+        eventAttrib.message.ascii = name;                  \
+        nvtxRangePushEx(&eventAttrib);                     \
+    }
+#define POP_RANGE nvtxRangePop();
+#else
+#define PUSH_RANGE(name, cid)
+#define POP_RANGE
+#endif
+
+#define CUDA_RT_CALL(call)                                                                  \
+    {                                                                                       \
+        cudaError_t cudaStatus = call;                                                      \
+        if (cudaSuccess != cudaStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
+    }
+
+constexpr int MAX_NUM_DEVICES = 32;
+
+typedef float real;
+constexpr real tol = 1.0e-8;
+
+const real PI = 2.0 * std::asin(1.0);
+
+__global__ void initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
+                                      const real pi, const int offset, const int nx,
+                                      const int my_ny, const int ny) {
+    for (int iy = blockIdx.x * blockDim.x + threadIdx.x; iy < my_ny; iy += blockDim.x * gridDim.x) {
+        const real y0 = sin(2.0 * pi * (offset + iy) / (ny - 1));
+        a[iy * nx + 0] = y0;
+        a[iy * nx + (nx - 1)] = y0;
+        a_new[iy * nx + 0] = y0;
+        a_new[iy * nx + (nx - 1)] = y0;
+    }
+}
+
+template <int BLOCK_DIM_X, int BLOCK_DIM_Y>
+__global__ void jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
+                              real* __restrict__ const l2_norm, const int iy_start,
+                              const int iy_end, const int nx, const bool calculate_norm) {
+#ifdef HAVE_CUB
+    typedef cub::BlockReduce<real, BLOCK_DIM_X, cub::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y>
+        BlockReduce;
+    __shared__ typename BlockReduce::TempStorage temp_storage;
+#endif  // HAVE_CUB
+    int iy = blockIdx.y * blockDim.y + threadIdx.y + iy_start;
+    int ix = blockIdx.x * blockDim.x + threadIdx.x + 1;
+    real local_l2_norm = 0.0;
+
+    if (iy < iy_end && ix < (nx - 1)) {
+        const real new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
+                                     a[(iy + 1) * nx + ix] + a[(iy - 1) * nx + ix]);
+        a_new[iy * nx + ix] = new_val;
+        if (calculate_norm) {
+            real residue = new_val - a[iy * nx + ix];
+            local_l2_norm += residue * residue;
+        }
+    }
+    if (calculate_norm) {
+#ifdef HAVE_CUB
+        real block_l2_norm = BlockReduce(temp_storage).Sum(local_l2_norm);
+        if (0 == threadIdx.y && 0 == threadIdx.x) atomicAdd(l2_norm, block_l2_norm);
+#else
+        atomicAdd(l2_norm, local_l2_norm);
+#endif  // HAVE_CUB
+    }
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, real* const a_ref_h,
+                  const int nccheck, const bool print);
+
+template <typename T>
+T get_argval(char** begin, char** end, const std::string& arg, const T default_val) {
+    T argval = default_val;
+    char** itr = std::find(begin, end, arg);
+    if (itr != end && ++itr != end) {
+        std::istringstream inbuf(*itr);
+        inbuf >> argval;
+    }
+    return argval;
+}
+
+bool get_arg(char** begin, char** end, const std::string& arg) {
+    char** itr = std::find(begin, end, arg);
+    if (itr != end) {
+        return true;
+    }
+    return false;
+}
+
+int main(int argc, char* argv[]) {
+    const int iter_max = get_argval<int>(argv, argv + argc, "-niter", 1000);
+    const int nccheck = get_argval<int>(argv, argv + argc, "-nccheck", 1);
+    const int nx = get_argval<int>(argv, argv + argc, "-nx", 16384);
+    const int ny = get_argval<int>(argv, argv + argc, "-ny", 16384);
+    const bool csv = get_arg(argv, argv + argc, "-csv");
+    const bool nop2p = get_arg(argv, argv + argc, "-nop2p");
+
+    real* a[MAX_NUM_DEVICES];
+    real* a_new[MAX_NUM_DEVICES];
+    real* a_ref_h;
+    real* a_h;
+    double runtime_serial = 0.0;
+
+    cudaStream_t compute_stream[MAX_NUM_DEVICES];
+    cudaStream_t push_top_stream[MAX_NUM_DEVICES];
+    cudaStream_t push_bottom_stream[MAX_NUM_DEVICES];
+    cudaEvent_t compute_done[MAX_NUM_DEVICES];
+    cudaEvent_t push_top_done[2][MAX_NUM_DEVICES];
+    cudaEvent_t push_bottom_done[2][MAX_NUM_DEVICES];
+
+    real* l2_norm_d[MAX_NUM_DEVICES];
+    real* l2_norm_h[MAX_NUM_DEVICES];
+
+    int iy_start[MAX_NUM_DEVICES];
+    int iy_end[MAX_NUM_DEVICES];
+
+    int chunk_size[MAX_NUM_DEVICES];
+
+    int num_devices = 0;
+    CUDA_RT_CALL(cudaGetDeviceCount(&num_devices));
+    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
+        CUDA_RT_CALL(cudaSetDevice(dev_id));
+        CUDA_RT_CALL(cudaFree(0));
+
+        if (0 == dev_id) {
+            CUDA_RT_CALL(cudaMallocHost(&a_ref_h, nx * ny * sizeof(real)));
+            CUDA_RT_CALL(cudaMallocHost(&a_h, nx * ny * sizeof(real)));
+            runtime_serial = single_gpu(nx, ny, iter_max, a_ref_h, nccheck, !csv);
+        }
+
+        // ny - 2 rows are distributed amongst the `num_devices` devices in such a way
+        // that each device gets either (ny - 2) / num_devices or (ny - 2) / num_devices + 1
+        // rows. This improves load balancing when (ny - 2) % num_devices != 0.
+        int chunk_size_low = (ny - 2) / num_devices;
+        int chunk_size_high = chunk_size_low + 1;
+        // To calculate the number of devices that need to compute an extra row,
+        // the following formula is derived from this equation:
+        // num_ranks_low * chunk_size_low + (num_devices - num_ranks_low) * (chunk_size_low + 1) = ny - 2
+        int num_ranks_low = num_devices * chunk_size_low + num_devices -
+                            (ny - 2);  // Number of devices with chunk_size = chunk_size_low
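+        // Worked example: ny - 2 = 10 rows on num_devices = 4 gives
+        // chunk_size_low = 2 and num_ranks_low = 4 * 2 + 4 - 10 = 2, so devices
+        // 0 and 1 compute 2 rows each while devices 2 and 3 compute 3 rows each.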
+        if (dev_id < num_ranks_low)
+            chunk_size[dev_id] = chunk_size_low;
+        else
+            chunk_size[dev_id] = chunk_size_high;
+
+        CUDA_RT_CALL(cudaMalloc(a + dev_id, nx * (chunk_size[dev_id] + 2) * sizeof(real)));
+        CUDA_RT_CALL(cudaMalloc(a_new + dev_id, nx * (chunk_size[dev_id] + 2) * sizeof(real)));
+
+        CUDA_RT_CALL(cudaMemset(a[dev_id], 0, nx * (chunk_size[dev_id] + 2) * sizeof(real)));
+        CUDA_RT_CALL(cudaMemset(a_new[dev_id], 0, nx * (chunk_size[dev_id] + 2) * sizeof(real)));
+
+        // Calculate local domain boundaries
+        int iy_start_global;  // My start index in the global array
+        if (dev_id < num_ranks_low) {
+            iy_start_global = dev_id * chunk_size_low + 1;
+        } else {
+            iy_start_global =
+                num_ranks_low * chunk_size_low + (dev_id - num_ranks_low) * chunk_size_high + 1;
+        }
+
+        iy_start[dev_id] = 1;
+        iy_end[dev_id] = iy_start[dev_id] + chunk_size[dev_id];
+
+        // Set Dirichlet boundary conditions on the left and right border
+        initialize_boundaries<<<(ny / num_devices) / 128 + 1, 128>>>(
+            a[dev_id], a_new[dev_id], PI, iy_start_global - 1, nx, (chunk_size[dev_id] + 2), ny);
+        CUDA_RT_CALL(cudaGetLastError());
+        CUDA_RT_CALL(cudaDeviceSynchronize());
+
+        CUDA_RT_CALL(cudaStreamCreate(compute_stream + dev_id));
+        CUDA_RT_CALL(cudaStreamCreate(push_top_stream + dev_id));
+        CUDA_RT_CALL(cudaStreamCreate(push_bottom_stream + dev_id));
+        CUDA_RT_CALL(cudaEventCreateWithFlags(compute_done + dev_id, cudaEventDisableTiming));
+        CUDA_RT_CALL(cudaEventCreateWithFlags(push_top_done[0] + dev_id, cudaEventDisableTiming));
+        CUDA_RT_CALL(
+            cudaEventCreateWithFlags(push_bottom_done[0] + dev_id, cudaEventDisableTiming));
+        CUDA_RT_CALL(cudaEventCreateWithFlags(push_top_done[1] + dev_id, cudaEventDisableTiming));
+        CUDA_RT_CALL(
+            cudaEventCreateWithFlags(push_bottom_done[1] + dev_id, cudaEventDisableTiming));
+
+        CUDA_RT_CALL(cudaMalloc(l2_norm_d + dev_id, sizeof(real)));
+        CUDA_RT_CALL(cudaMallocHost(l2_norm_h + dev_id, sizeof(real)));
+
+        if (!nop2p) {
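+            // Enable peer-to-peer access to the ring neighbors so the
+            // device-to-device halo copies can move data directly between GPUs
+            // (over NVLink or PCIe) rather than staging through host memory.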
+            const int top = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
+            int canAccessPeer = 0;
+            CUDA_RT_CALL(cudaDeviceCanAccessPeer(&canAccessPeer, dev_id, top));
+            if (canAccessPeer) {
+                CUDA_RT_CALL(cudaDeviceEnablePeerAccess(top, 0));
+            }
+            const int bottom = (dev_id + 1) % num_devices;
+            if (top != bottom) {
+                canAccessPeer = 0;
+                CUDA_RT_CALL(cudaDeviceCanAccessPeer(&canAccessPeer, dev_id, bottom));
+                if (canAccessPeer) {
+                    CUDA_RT_CALL(cudaDeviceEnablePeerAccess(bottom, 0));
+                }
+            }
+        }
+        CUDA_RT_CALL(cudaDeviceSynchronize());
+    }
+
+    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
+        CUDA_RT_CALL(cudaSetDevice(dev_id));
+        const int top = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
+        const int bottom = (dev_id + 1) % num_devices;
+        CUDA_RT_CALL(cudaMemcpy(a_new[top] + (iy_end[top] * nx),
+                                a_new[dev_id] + iy_start[dev_id] * nx, nx * sizeof(real),
+                                cudaMemcpyDeviceToDevice));
+        CUDA_RT_CALL(cudaMemcpy(a_new[bottom], a_new[dev_id] + (iy_end[dev_id] - 1) * nx,
+                                nx * sizeof(real), cudaMemcpyDeviceToDevice));
+    }
+
+    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
+        CUDA_RT_CALL(cudaSetDevice(dev_id));
+        CUDA_RT_CALL(cudaDeviceSynchronize());
+    }
+
+    if (!csv)
+        printf(
+            "Jacobi relaxation: %d iterations on %d x %d mesh with norm check "
+            "every %d iterations\n",
+            iter_max, ny, nx, nccheck);
+
+    constexpr int dim_block_x = 32;
+    constexpr int dim_block_y = 32;
+    int iter = 0;
+    bool calculate_norm;
+    real l2_norm = 1.0;
+
+    double start = omp_get_wtime();
+    PUSH_RANGE("Jacobi solve", 0)
+    while (l2_norm > tol && iter < iter_max) {
+        for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
+            const int top = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
+            const int bottom = (dev_id + 1) % num_devices;
+            CUDA_RT_CALL(cudaSetDevice(dev_id));
+
+            CUDA_RT_CALL(
+                cudaMemsetAsync(l2_norm_d[dev_id], 0, sizeof(real), compute_stream[dev_id]));
+
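+            // Halo events are double-buffered by iteration parity: wait on the
+            // neighbors' pushes recorded in the previous iteration (slot iter % 2),
+            // while this iteration's pushes are recorded into slot (iter + 1) % 2 below.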
+            CUDA_RT_CALL(
+                cudaStreamWaitEvent(compute_stream[dev_id], push_top_done[(iter % 2)][bottom], 0));
+            CUDA_RT_CALL(
+                cudaStreamWaitEvent(compute_stream[dev_id], push_bottom_done[(iter % 2)][top], 0));
+
+            calculate_norm = (iter % nccheck) == 0 || (!csv && (iter % 100) == 0);
+            dim3 dim_grid((nx + dim_block_x - 1) / dim_block_x,
+                          (chunk_size[dev_id] + dim_block_y - 1) / dim_block_y, 1);
+
+            jacobi_kernel<dim_block_x, dim_block_y>
+                <<<dim_grid, {dim_block_x, dim_block_y, 1}, 0, compute_stream[dev_id]>>>(
+                    a_new[dev_id], a[dev_id], l2_norm_d[dev_id], iy_start[dev_id], iy_end[dev_id],
+                    nx, calculate_norm);
+            CUDA_RT_CALL(cudaGetLastError());
+            CUDA_RT_CALL(cudaEventRecord(compute_done[dev_id], compute_stream[dev_id]));
+
+            if (calculate_norm) {
+                CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_h[dev_id], l2_norm_d[dev_id], sizeof(real),
+                                             cudaMemcpyDeviceToHost, compute_stream[dev_id]));
+            }
+
+            // Apply periodic boundary conditions
+            CUDA_RT_CALL(cudaStreamWaitEvent(push_top_stream[dev_id], compute_done[dev_id], 0));
+            CUDA_RT_CALL(cudaMemcpyAsync(a_new[top] + (iy_end[top] * nx),
+                                         a_new[dev_id] + iy_start[dev_id] * nx, nx * sizeof(real),
+                                         cudaMemcpyDeviceToDevice, push_top_stream[dev_id]));
+            CUDA_RT_CALL(
+                cudaEventRecord(push_top_done[((iter + 1) % 2)][dev_id], push_top_stream[dev_id]));
+
+            CUDA_RT_CALL(cudaStreamWaitEvent(push_bottom_stream[dev_id], compute_done[dev_id], 0));
+            CUDA_RT_CALL(cudaMemcpyAsync(a_new[bottom], a_new[dev_id] + (iy_end[dev_id] - 1) * nx,
+                                         nx * sizeof(real), cudaMemcpyDeviceToDevice,
+                                         push_bottom_stream[dev_id]));
+            CUDA_RT_CALL(cudaEventRecord(push_bottom_done[((iter + 1) % 2)][dev_id],
+                                         push_bottom_stream[dev_id]));
+        }
+        if (calculate_norm) {
+            l2_norm = 0.0;
+            for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
+                CUDA_RT_CALL(cudaStreamSynchronize(compute_stream[dev_id]));
+                l2_norm += *(l2_norm_h[dev_id]);
+            }
+
+            l2_norm = std::sqrt(l2_norm);
+            if (!csv && (iter % 100) == 0) printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+
+        for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
+            std::swap(a_new[dev_id], a[dev_id]);
+        }
+        iter++;
+    }
+
+    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
+        CUDA_RT_CALL(cudaSetDevice(dev_id));
+        CUDA_RT_CALL(cudaDeviceSynchronize());
+    }
+
+    POP_RANGE
+    double stop = omp_get_wtime();
+
+    int offset = nx;
+    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
+        CUDA_RT_CALL(
+            cudaMemcpy(a_h + offset, a[dev_id] + nx,
+                       std::min((nx * ny) - offset, nx * chunk_size[dev_id]) * sizeof(real),
+                       cudaMemcpyDeviceToHost));
+        offset += std::min(chunk_size[dev_id] * nx, (nx * ny) - offset);
+    }
+
+    bool result_correct = true;
+    for (int iy = 1; result_correct && (iy < (ny - 1)); ++iy) {
+        for (int ix = 1; result_correct && (ix < (nx - 1)); ++ix) {
+            if (std::fabs(a_ref_h[iy * nx + ix] - a_h[iy * nx + ix]) > tol) {
+                fprintf(stderr,
+                        "ERROR: a[%d * %d + %d] = %f does not match %f "
+                        "(reference)\n",
+                        iy, nx, ix, a_h[iy * nx + ix], a_ref_h[iy * nx + ix]);
+                result_correct = false;
+            }
+        }
+    }
+
+    if (result_correct) {
+        if (csv) {
+            printf("single_threaded_copy, %d, %d, %d, %d, %d, %d, %f, %f\n", nx, ny, iter_max,
+                   nccheck, num_devices, nop2p ? 0 : 1, (stop - start), runtime_serial);
+        } else {
+            printf("Num GPUs: %d.\n", num_devices);
+            printf(
+                "%dx%d: 1 GPU: %8.4f s, %d GPUs: %8.4f s, speedup: %8.2f, "
+                "efficiency: %8.2f \n",
+                ny, nx, runtime_serial, num_devices, (stop - start),
+                runtime_serial / (stop - start),
+                runtime_serial / (num_devices * (stop - start)) * 100);
+        }
+    }
+
+    for (int dev_id = (num_devices - 1); dev_id >= 0; --dev_id) {
+        CUDA_RT_CALL(cudaSetDevice(dev_id));
+        CUDA_RT_CALL(cudaEventDestroy(push_bottom_done[1][dev_id]));
+        CUDA_RT_CALL(cudaEventDestroy(push_top_done[1][dev_id]));
+        CUDA_RT_CALL(cudaEventDestroy(push_bottom_done[0][dev_id]));
+        CUDA_RT_CALL(cudaEventDestroy(push_top_done[0][dev_id]));
+        CUDA_RT_CALL(cudaEventDestroy(compute_done[dev_id]));
+        CUDA_RT_CALL(cudaStreamDestroy(push_bottom_stream[dev_id]));
+        CUDA_RT_CALL(cudaStreamDestroy(push_top_stream[dev_id]));
+        CUDA_RT_CALL(cudaStreamDestroy(compute_stream[dev_id]));
+
+        CUDA_RT_CALL(cudaFreeHost(l2_norm_h[dev_id]));
+        CUDA_RT_CALL(cudaFree(l2_norm_d[dev_id]));
+
+        CUDA_RT_CALL(cudaFree(a_new[dev_id]));
+        CUDA_RT_CALL(cudaFree(a[dev_id]));
+        if (0 == dev_id) {
+            CUDA_RT_CALL(cudaFreeHost(a_h));
+            CUDA_RT_CALL(cudaFreeHost(a_ref_h));
+        }
+    }
+
+    return result_correct ? 0 : 1;
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, real* const a_ref_h,
+                  const int nccheck, const bool print) {
+    real* a;
+    real* a_new;
+
+    real* l2_norm_d;
+    real* l2_norm_h;
+
+    int iy_start = 1;
+    int iy_end = (ny - 1);
+
+    CUDA_RT_CALL(cudaMalloc(&a, nx * ny * sizeof(real)));
+    CUDA_RT_CALL(cudaMalloc(&a_new, nx * ny * sizeof(real)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * ny * sizeof(real)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * ny * sizeof(real)));
+
+    // Set Dirichlet boundary conditions on the left and right border
+    initialize_boundaries<<<ny / 128 + 1, 128>>>(a, a_new, PI, 0, nx, ny, ny);
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(real)));
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(real)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (print)
+        printf(
+            "Single GPU jacobi relaxation: %d iterations on %d x %d mesh with "
+            "norm "
+            "check every %d iterations\n",
+            iter_max, ny, nx, nccheck);
+
+    constexpr int dim_block_x = 32;
+    constexpr int dim_block_y = 32;
+    dim3 dim_grid((nx + dim_block_x - 1) / dim_block_x, (ny + dim_block_y - 1) / dim_block_y, 1);
+
+    int iter = 0;
+    bool calculate_norm;
+    real l2_norm = 1.0;
+
+    double start = omp_get_wtime();
+    PUSH_RANGE("Jacobi solve", 0)
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemset(l2_norm_d, 0, sizeof(real)));
+
+        calculate_norm = (iter % nccheck) == 0 || (print && ((iter % 100) == 0));
+        jacobi_kernel<dim_block_x, dim_block_y>
+            <<<dim_grid, {dim_block_x, dim_block_y, 1}, 0, 0>>>(
+                a_new, a, l2_norm_d, iy_start, iy_end, nx, calculate_norm);
+        CUDA_RT_CALL(cudaGetLastError());
+
+        if (calculate_norm) {
+            CUDA_RT_CALL(cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost));
+        }
+
+        // Apply periodic boundary conditions
+
+        CUDA_RT_CALL(cudaMemcpy(a_new, a_new + (iy_end - 1) * nx, nx * sizeof(real),
+                                     cudaMemcpyDeviceToDevice));
+        CUDA_RT_CALL(cudaMemcpy(a_new + iy_end * nx, a_new + iy_start * nx, nx * sizeof(real),
+                                     cudaMemcpyDeviceToDevice));
+
+        if (calculate_norm) {
+            CUDA_RT_CALL(cudaDeviceSynchronize());
+            l2_norm = *l2_norm_h;
+            l2_norm = std::sqrt(l2_norm);
+            if (print && (iter % 100) == 0) printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+
+        std::swap(a_new, a);
+        iter++;
+    }
+    POP_RANGE
+    double stop = omp_get_wtime();
+
+    CUDA_RT_CALL(cudaMemcpy(a_ref_h, a, nx * ny * sizeof(real), cudaMemcpyDeviceToHost));
+
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+    return (stop - start);
+}

+ 13 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/memcpy/mgpm

@@ -0,0 +1,13 @@
+#!/bin/bash
+#SBATCH --nodes=2
+
+NPROCS=16
+NPPERSOC=$(($NPROCS>>2))
+source ~/init.sh
+make clean && make
+rm -rf hpctoolkit*
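+# HPCToolkit workflow sketch: hpcrun samples CPU time and I/O and traces NVIDIA
+# GPU activity, hpcstruct recovers program structure (including GPU control
+# flow with --gpucfg yes), and hpcprof correlates the measurements with the
+# jacobi.cu sources.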
+hpcrun -e CPUTIME -e IO -e gpu=nvidia -t ./jacobi
+hpcstruct --gpucfg yes hpctoolkit*
+hpcstruct jacobi
+hpcprof -S jacobi.hpcstruct -I jacobi.cu -I $CUDA_HOME/+ hpctoolkit*
+echo "===DONE==="

+ 42 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/Makefile

@@ -0,0 +1,42 @@
+# Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
+NP ?= 1
+NVCC=nvcc
+MPICXX=mpicxx
+MPIRUN ?= mpirun
+CUDA_HOME ?= /opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/
+GENCODE_SM30	:= -gencode arch=compute_30,code=sm_30
+GENCODE_SM35	:= -gencode arch=compute_35,code=sm_35
+GENCODE_SM37	:= -gencode arch=compute_37,code=sm_37
+GENCODE_SM50	:= -gencode arch=compute_50,code=sm_50
+GENCODE_SM52	:= -gencode arch=compute_52,code=sm_52
+GENCODE_SM60    := -gencode arch=compute_60,code=sm_60
+GENCODE_SM70    := -gencode arch=compute_70,code=sm_70
+GENCODE_SM80    := -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80
+GENCODE_FLAGS	:= $(GENCODE_SM70) $(GENCODE_SM80)
+ifdef DISABLE_CUB
+        NVCC_FLAGS = -Xptxas --optimize-float-atomics
+else
+        NVCC_FLAGS = -DHAVE_CUB
+endif
+NVCC_FLAGS += -lineinfo $(GENCODE_FLAGS) -std=c++14
+# MPICXX_FLAGS = -DUSE_NVTX -I$(CUDA_HOME)/include -std=c++14
+MPICXX_FLAGS = -DUSE_NVTX -g -I$(CUDA_HOME)/include -std=c++14
+LD_FLAGS = -L$(CUDA_HOME)/lib64 -lcudart -lnvToolsExt
+jacobi: Makefile jacobi.cpp jacobi_kernels.o
+	$(MPICXX) $(MPICXX_FLAGS) jacobi.cpp jacobi_kernels.o $(LD_FLAGS) -o jacobi
+
+jacobi_kernels.o: Makefile jacobi_kernels.cu
+	$(NVCC) $(NVCC_FLAGS) jacobi_kernels.cu -c
+
+.PHONY: clean
+clean:
+	rm -f jacobi jacobi_kernels.o *.qdrep jacobi.*.compute-sanitizer.log
+
+sanitize: jacobi
+	$(MPIRUN) -np $(NP) compute-sanitizer --log-file jacobi.%q{OMPI_COMM_WORLD_RANK}.compute-sanitizer.log ./jacobi -niter 10
+
+run: jacobi
+	$(MPIRUN) -np $(NP) ./jacobi
+
+profile: jacobi
+	$(MPIRUN) -np $(NP) nsys profile --trace=mpi,cuda,nvtx -o jacobi.%q{OMPI_COMM_WORLD_RANK} ./jacobi -niter 10
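
A minimal usage sketch for this Makefile, assuming `mpirun` and the CUDA paths above are available; the rank count is a hypothetical choice:

    # Build, then run with 8 ranks (NP defaults to 1)
    make
    make run NP=8

    # Collect one Nsight Systems report per rank (jacobi.<rank>.qdrep)
    make profile NP=8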

Binary
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi


+ 407 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi.cpp

@@ -0,0 +1,407 @@
+/* Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NVIDIA CORPORATION nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <algorithm>
+#include <cmath>
+#include <cstdio>
+#include <iostream>
+#include <sstream>
+
+#include <mpi.h>
+
+#define MPI_CALL(call)                                                                \
+    {                                                                                 \
+        int mpi_status = call;                                                        \
+        if (0 != mpi_status) {                                                        \
+            char mpi_error_string[MPI_MAX_ERROR_STRING];                              \
+            int mpi_error_string_length = 0;                                          \
+            MPI_Error_string(mpi_status, mpi_error_string, &mpi_error_string_length); \
+            if (NULL != mpi_error_string)                                             \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %s "                                                    \
+                        "(%d).\n",                                                    \
+                        #call, __LINE__, __FILE__, mpi_error_string, mpi_status);     \
+            else                                                                      \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %d.\n",                                                 \
+                        #call, __LINE__, __FILE__, mpi_status);                       \
+        }                                                                             \
+    }
+
+#include <cuda_runtime.h>
+
+#ifdef USE_NVTX
+#include <nvToolsExt.h>
+
+const uint32_t colors[] = {0x0000ff00, 0x000000ff, 0x00ffff00, 0x00ff00ff,
+                           0x0000ffff, 0x00ff0000, 0x00ffffff};
+const int num_colors = sizeof(colors) / sizeof(uint32_t);
+
+#define PUSH_RANGE(name, cid)                              \
+    {                                                      \
+        int color_id = cid;                                \
+        color_id = color_id % num_colors;                  \
+        nvtxEventAttributes_t eventAttrib = {0};           \
+        eventAttrib.version = NVTX_VERSION;                \
+        eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;  \
+        eventAttrib.colorType = NVTX_COLOR_ARGB;           \
+        eventAttrib.color = colors[color_id];              \
+        eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII; \
+        eventAttrib.message.ascii = name;                  \
+        nvtxRangePushEx(&eventAttrib);                     \
+    }
+#define POP_RANGE nvtxRangePop();
+#else
+#define PUSH_RANGE(name, cid)
+#define POP_RANGE
+#endif
+
+#define CUDA_RT_CALL(call)                                                                  \
+    {                                                                                       \
+        cudaError_t cudaStatus = call;                                                      \
+        if (cudaSuccess != cudaStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
+    }
+
+#ifdef USE_DOUBLE
+typedef double real;
+#define MPI_REAL_TYPE MPI_DOUBLE
+#else
+typedef float real;
+#define MPI_REAL_TYPE MPI_FLOAT
+#endif
+
+constexpr real tol = 1.0e-8;
+
+const real PI = 2.0 * std::asin(1.0);
+
+void launch_initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
+                                  const real pi, const int offset, const int nx, const int my_ny,
+                                  const int ny);
+
+void launch_jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
+                          real* __restrict__ const l2_norm, const int iy_start, const int iy_end,
+                          const int nx, const bool calculate_norm);
+
+double single_gpu(const int nx, const int ny, const int iter_max, real* const a_ref_h,
+                  const int nccheck, const bool print);
+
+template <typename T>
+T get_argval(char** begin, char** end, const std::string& arg, const T default_val) {
+    T argval = default_val;
+    char** itr = std::find(begin, end, arg);
+    if (itr != end && ++itr != end) {
+        std::istringstream inbuf(*itr);
+        inbuf >> argval;
+    }
+    return argval;
+}
+
+bool get_arg(char** begin, char** end, const std::string& arg) {
+    char** itr = std::find(begin, end, arg);
+    if (itr != end) {
+        return true;
+    }
+    return false;
+}
+
+int main(int argc, char* argv[]) {
+    MPI_CALL(MPI_Init(&argc, &argv));
+    int rank;
+    MPI_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
+    int size;
+    MPI_CALL(MPI_Comm_size(MPI_COMM_WORLD, &size));
+    int num_devices = 0;
+    CUDA_RT_CALL(cudaGetDeviceCount(&num_devices));
+
+    const int iter_max = get_argval<int>(argv, argv + argc, "-niter", 1000);
+    const int nccheck = get_argval<int>(argv, argv + argc, "-nccheck", 1);
+    const int nx = get_argval<int>(argv, argv + argc, "-nx", 16384);
+    const int ny = get_argval<int>(argv, argv + argc, "-ny", 16384);
+    const bool csv = get_arg(argv, argv + argc, "-csv");
+
+    int local_rank = -1;
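+    // Determine the node-local rank by splitting MPI_COMM_WORLD into per-node
+    // (shared-memory) communicators; the local rank is used below to bind each
+    // rank to one of the node's GPUs.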
+    {
+        MPI_Comm local_comm;
+        MPI_CALL(MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
+                                     &local_comm));
+
+        MPI_CALL(MPI_Comm_rank(local_comm, &local_rank));
+
+        MPI_CALL(MPI_Comm_free(&local_comm));
+    }
+
+    CUDA_RT_CALL(cudaSetDevice(local_rank % num_devices));
+    CUDA_RT_CALL(cudaFree(0));
+
+    real* a_ref_h;
+    CUDA_RT_CALL(cudaMallocHost(&a_ref_h, nx * ny * sizeof(real)));
+    real* a_h;
+    CUDA_RT_CALL(cudaMallocHost(&a_h, nx * ny * sizeof(real)));
+    double runtime_serial = single_gpu(nx, ny, iter_max, a_ref_h, nccheck, !csv && (0 == rank));
+
+    // ny - 2 rows are distributed amongst `size` ranks in such a way
+    // that each rank gets either (ny - 2) / size or (ny - 2) / size + 1 rows.
+    // This optimizes load balancing when (ny - 2) % size != 0
+    int chunk_size;
+    int chunk_size_low = (ny - 2) / size;
+    int chunk_size_high = chunk_size_low + 1;
+    // To calculate the number of ranks that need to compute an extra row,
+    // the following formula is derived from this equation:
+    // num_ranks_low * chunk_size_low + (size - num_ranks_low) * (chunk_size_low + 1) = ny - 2
+    int num_ranks_low = size * chunk_size_low + size -
+                        (ny - 2);  // Number of ranks with chunk_size = chunk_size_low
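+    // Worked example: ny - 2 = 10 rows across size = 4 ranks gives
+    // chunk_size_low = 2 and num_ranks_low = 4 * 2 + 4 - 10 = 2, so ranks 0 and 1
+    // get 2 rows each while ranks 2 and 3 get 3 rows each.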
+    if (rank < num_ranks_low)
+        chunk_size = chunk_size_low;
+    else
+        chunk_size = chunk_size_high;
+
+    real* a;
+    CUDA_RT_CALL(cudaMalloc(&a, nx * (chunk_size + 2) * sizeof(real)));
+    real* a_new;
+    CUDA_RT_CALL(cudaMalloc(&a_new, nx * (chunk_size + 2) * sizeof(real)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * (chunk_size + 2) * sizeof(real)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * (chunk_size + 2) * sizeof(real)));
+
+    // Calculate local domain boundaries
+    int iy_start_global;  // My start index in the global array
+    if (rank < num_ranks_low) {
+        iy_start_global = rank * chunk_size_low + 1;
+    } else {
+        iy_start_global =
+            num_ranks_low * chunk_size_low + (rank - num_ranks_low) * chunk_size_high + 1;
+    }
+    int iy_end_global = iy_start_global + chunk_size - 1;  // My last index in the global array
+
+    int iy_start = 1;
+    int iy_end = iy_start + chunk_size;
+
+    // Set Dirichlet boundary conditions on the left and right border
+    launch_initialize_boundaries(a, a_new, PI, iy_start_global - 1, nx, (chunk_size + 2), ny);
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    real* l2_norm_d;
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(real)));
+    real* l2_norm_h;
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(real)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (!csv && 0 == rank) {
+        printf(
+            "Jacobi relaxation: %d iterations on %d x %d mesh with norm check "
+            "every %d iterations\n",
+            iter_max, ny, nx, nccheck);
+    }
+
+    int iter = 0;
+    real l2_norm = 1.0;
+    bool calculate_norm;  // whether the L2 norm will be calculated in this iteration
+
+    MPI_CALL(MPI_Barrier(MPI_COMM_WORLD));
+    double start = MPI_Wtime();
+    PUSH_RANGE("Jacobi solve", 0)
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemset(l2_norm_d, 0, sizeof(real)));
+
+        calculate_norm = (iter % nccheck) == 0 || (!csv && (iter % 100) == 0);
+
+        launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, iy_end, nx, calculate_norm);
+
+        if (calculate_norm) {
+            CUDA_RT_CALL(cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost));
+        }
+
+        const int top = rank > 0 ? rank - 1 : (size - 1);
+        const int bottom = (rank + 1) % size;
+
+        // Apply periodic boundary conditions
+        CUDA_RT_CALL(cudaDeviceSynchronize());
+        PUSH_RANGE("MPI", 5)
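+        // Each rank sends its first interior row to `top` while receiving its
+        // bottom halo row from `bottom`, then sends its last interior row to
+        // `bottom` while receiving its top halo row from `top`; the ring
+        // wrap-around implements the periodic boundary in y.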
+        MPI_CALL(MPI_Sendrecv(a_new + iy_start * nx, nx, MPI_REAL_TYPE, top, 0,
+                              a_new + (iy_end * nx), nx, MPI_REAL_TYPE, bottom, 0, MPI_COMM_WORLD,
+                              MPI_STATUS_IGNORE));
+        MPI_CALL(MPI_Sendrecv(a_new + (iy_end - 1) * nx, nx, MPI_REAL_TYPE, bottom, 0, a_new, nx,
+                              MPI_REAL_TYPE, top, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE));
+        POP_RANGE
+
+        if (calculate_norm) {
+            MPI_CALL(MPI_Allreduce(l2_norm_h, &l2_norm, 1, MPI_REAL_TYPE, MPI_SUM, MPI_COMM_WORLD));
+            l2_norm = std::sqrt(l2_norm);
+
+            if (!csv && 0 == rank && (iter % 100) == 0) {
+                printf("%5d, %0.6f\n", iter, l2_norm);
+            }
+        }
+
+        std::swap(a_new, a);
+        iter++;
+    }
+    double stop = MPI_Wtime();
+    POP_RANGE
+
+    CUDA_RT_CALL(cudaMemcpy(a_h + iy_start_global * nx, a + nx,
+                            std::min((ny - iy_start_global) * nx, chunk_size * nx) * sizeof(real),
+                            cudaMemcpyDeviceToHost));
+
+    int result_correct = 1;
+    for (int iy = iy_start_global; result_correct && (iy < iy_end_global); ++iy) {
+        for (int ix = 1; result_correct && (ix < (nx - 1)); ++ix) {
+            if (std::fabs(a_ref_h[iy * nx + ix] - a_h[iy * nx + ix]) > tol) {
+                fprintf(stderr,
+                        "ERROR on rank %d: a[%d * %d + %d] = %f does not match %f "
+                        "(reference)\n",
+                        rank, iy, nx, ix, a_h[iy * nx + ix], a_ref_h[iy * nx + ix]);
+                result_correct = 0;
+            }
+        }
+    }
+
+    int global_result_correct = 1;
+    MPI_CALL(MPI_Allreduce(&result_correct, &global_result_correct, 1, MPI_INT, MPI_MIN,
+                           MPI_COMM_WORLD));
+    result_correct = global_result_correct;
+
+    if (rank == 0 && result_correct) {
+        if (csv) {
+            printf("mpi, %d, %d, %d, %d, %d, 1, %f, %f\n", nx, ny, iter_max, nccheck, size,
+                   (stop - start), runtime_serial);
+        } else {
+            printf("Num GPUs: %d.\n", size);
+            printf(
+                "%dx%d: 1 GPU: %8.4f s, %d GPUs: %8.4f s, speedup: %8.2f, "
+                "efficiency: %8.2f \n",
+                ny, nx, runtime_serial, size, (stop - start), runtime_serial / (stop - start),
+                runtime_serial / (size * (stop - start)) * 100);
+        }
+    }
+
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+
+    CUDA_RT_CALL(cudaFreeHost(a_h));
+    CUDA_RT_CALL(cudaFreeHost(a_ref_h));
+
+    MPI_CALL(MPI_Finalize());
+    return (result_correct == 1) ? 0 : 1;
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, real* const a_ref_h,
+                  const int nccheck, const bool print) {
+    real* a;
+    real* a_new;
+
+    real* l2_norm_d;
+    real* l2_norm_h;
+
+    int iy_start = 1;
+    int iy_end = (ny - 1);
+
+    CUDA_RT_CALL(cudaMalloc(&a, nx * ny * sizeof(real)));
+    CUDA_RT_CALL(cudaMalloc(&a_new, nx * ny * sizeof(real)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * ny * sizeof(real)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * ny * sizeof(real)));
+
+    // Set Dirichlet boundary conditions on the left and right border
+    launch_initialize_boundaries(a, a_new, PI, 0, nx, ny, ny);
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(real)));
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(real)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (print)
+        printf(
+            "Single GPU jacobi relaxation: %d iterations on %d x %d mesh with "
+            "norm "
+            "check every %d iterations\n",
+            iter_max, ny, nx, nccheck);
+
+    constexpr int dim_block_x = 32;
+    constexpr int dim_block_y = 32;
+    dim3 dim_grid((nx + dim_block_x - 1) / dim_block_x, (ny + dim_block_y - 1) / dim_block_y, 1);
+
+    int iter = 0;
+    bool calculate_norm;
+    real l2_norm = 1.0;
+
+    double start = MPI_Wtime();
+    PUSH_RANGE("Jacobi solve", 0)
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemset(l2_norm_d, 0, sizeof(real)));
+
+        calculate_norm = (iter % nccheck) == 0 || (print && ((iter % 100) == 0));
+        launch_jacobi_kernel(
+                a_new, a, l2_norm_d, iy_start, iy_end, nx, calculate_norm);
+        CUDA_RT_CALL(cudaGetLastError());
+
+        if (calculate_norm) {
+            CUDA_RT_CALL(cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost));
+        }
+
+        // Apply periodic boundary conditions
+
+        CUDA_RT_CALL(cudaMemcpy(a_new, a_new + (iy_end - 1) * nx, nx * sizeof(real),
+                                     cudaMemcpyDeviceToDevice));
+        CUDA_RT_CALL(cudaMemcpy(a_new + iy_end * nx, a_new + iy_start * nx, nx * sizeof(real),
+                                     cudaMemcpyDeviceToDevice));
+
+        if (calculate_norm) {
+            CUDA_RT_CALL(cudaDeviceSynchronize());
+            l2_norm = *l2_norm_h;
+            l2_norm = std::sqrt(l2_norm);
+            if (print && (iter % 100) == 0) printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+
+        std::swap(a_new, a);
+        iter++;
+    }
+    POP_RANGE
+    double stop = MPI_Wtime();
+
+    CUDA_RT_CALL(cudaMemcpy(a_ref_h, a, nx * ny * sizeof(real), cudaMemcpyDeviceToHost));
+
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+    return (stop - start);
+}

Binary
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi.o


+ 113 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi_kernels.cu

@@ -0,0 +1,113 @@
+/* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NVIDIA CORPORATION nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <cstdio>
+
+#ifdef HAVE_CUB
+#include <cub/block/block_reduce.cuh>
+#endif  // HAVE_CUB
+
+#define CUDA_RT_CALL(call)                                                                  \
+    {                                                                                       \
+        cudaError_t cudaStatus = call;                                                      \
+        if (cudaSuccess != cudaStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
+    }
+
+#ifdef USE_DOUBLE
+typedef double real;
+#define MPI_REAL_TYPE MPI_DOUBLE
+#else
+typedef float real;
+#define MPI_REAL_TYPE MPI_FLOAT
+#endif
+
+__global__ void initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
+                                      const real pi, const int offset, const int nx,
+                                      const int my_ny, const int ny) {
+    for (int iy = blockIdx.x * blockDim.x + threadIdx.x; iy < my_ny; iy += blockDim.x * gridDim.x) {
+        const real y0 = sin(2.0 * pi * (offset + iy) / (ny - 1));
+        a[iy * nx + 0] = y0;
+        a[iy * nx + (nx - 1)] = y0;
+        a_new[iy * nx + 0] = y0;
+        a_new[iy * nx + (nx - 1)] = y0;
+    }
+}
+
+void launch_initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
+                                  const real pi, const int offset, const int nx, const int my_ny,
+                                  const int ny) {
+    initialize_boundaries<<<my_ny / 128 + 1, 128>>>(a_new, a, pi, offset, nx, my_ny, ny);
+    CUDA_RT_CALL(cudaGetLastError());
+}
+
+template <int BLOCK_DIM_X, int BLOCK_DIM_Y>
+__global__ void jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
+                              real* __restrict__ const l2_norm, const int iy_start,
+                              const int iy_end, const int nx, const bool calculate_norm) {
+#ifdef HAVE_CUB
+    typedef cub::BlockReduce<real, BLOCK_DIM_X, cub::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y>
+        BlockReduce;
+    __shared__ typename BlockReduce::TempStorage temp_storage;
+#endif  // HAVE_CUB
+    int iy = blockIdx.y * blockDim.y + threadIdx.y + iy_start;
+    int ix = blockIdx.x * blockDim.x + threadIdx.x + 1;
+    real local_l2_norm = 0.0;
+
+    if (iy < iy_end && ix < (nx - 1)) {
+        const real new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
+                                     a[(iy + 1) * nx + ix] + a[(iy - 1) * nx + ix]);
+        a_new[iy * nx + ix] = new_val;
+        if (calculate_norm) {
+            real residue = new_val - a[iy * nx + ix];
+            local_l2_norm += residue * residue;
+        }
+    }
+    if (calculate_norm) {
+#ifdef HAVE_CUB
+        real block_l2_norm = BlockReduce(temp_storage).Sum(local_l2_norm);
+        if (0 == threadIdx.y && 0 == threadIdx.x) atomicAdd(l2_norm, block_l2_norm);
+#else
+        atomicAdd(l2_norm, local_l2_norm);
+#endif  // HAVE_CUB
+    }
+}
+
+void launch_jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
+                          real* __restrict__ const l2_norm, const int iy_start, const int iy_end,
+                          const int nx, const bool calculate_norm) {
+    constexpr int dim_block_x = 32;
+    constexpr int dim_block_y = 32;
+    dim3 dim_grid((nx + dim_block_x - 1) / dim_block_x,
+                  ((iy_end - iy_start) + dim_block_y - 1) / dim_block_y, 1);
+    jacobi_kernel<dim_block_x, dim_block_y><<<dim_grid, {dim_block_x, dim_block_y, 1}, 0, 0>>>(
+        a_new, a, l2_norm, iy_start, iy_end, nx, calculate_norm);
+    CUDA_RT_CALL(cudaGetLastError());
+}
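
These launch wrappers are meant to be driven from the accompanying MPI host code (jacobi.cpp in this commit, not reproduced in this view). For orientation, here is a minimal sketch of such a driver; it assumes one MPI rank per GPU on a single node, a CUDA-aware MPI build so device pointers can be passed to MPI_Sendrecv directly, and ny divisible by the rank count. Names like `chunk`, `top`, and `bottom` are illustrative, not the lab's exact variable names.

// halo_sketch.cpp -- illustrative host loop for the wrappers above.
#include <mpi.h>
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

typedef float real;
#define MPI_REAL_TYPE MPI_FLOAT

// Implemented in jacobi_kernels.cu above.
void launch_initialize_boundaries(real* a_new, real* a, real pi, int offset,
                                  int nx, int my_ny, int ny);
void launch_jacobi_kernel(real* a_new, const real* a, real* l2_norm,
                          int iy_start, int iy_end, int nx, bool calculate_norm);

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    cudaSetDevice(rank);  // assumes a one-rank-per-GPU mapping on the node

    const int nx = 1024, ny = 1024, iter_max = 1000;
    const int chunk = ny / size;                 // assume ny % size == 0
    const int iy_start = 1, iy_end = chunk + 1;  // rows 0 and chunk+1 are halos
    const int top = rank > 0 ? rank - 1 : size - 1;  // periodic in y
    const int bottom = (rank + 1) % size;

    real *a, *a_new, *l2_norm_d, *l2_norm_h;
    cudaMalloc(&a, nx * (chunk + 2) * sizeof(real));
    cudaMalloc(&a_new, nx * (chunk + 2) * sizeof(real));
    cudaMemset(a, 0, nx * (chunk + 2) * sizeof(real));
    cudaMemset(a_new, 0, nx * (chunk + 2) * sizeof(real));
    cudaMalloc(&l2_norm_d, sizeof(real));
    cudaMallocHost(&l2_norm_h, sizeof(real));

    const real pi = 2.0 * std::asin(1.0);
    launch_initialize_boundaries(a_new, a, pi, rank * chunk, nx, chunk + 2, ny);
    cudaDeviceSynchronize();

    for (int iter = 0; iter < iter_max; ++iter) {
        cudaMemset(l2_norm_d, 0, sizeof(real));
        launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, iy_end, nx, true);
        cudaDeviceSynchronize();  // kernel must finish before MPI reads a_new

        // Halo exchange: send first/last interior row, receive into halo rows.
        MPI_Sendrecv(a_new + iy_start * nx, nx, MPI_REAL_TYPE, top, 0,
                     a_new + iy_end * nx, nx, MPI_REAL_TYPE, bottom, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(a_new + (iy_end - 1) * nx, nx, MPI_REAL_TYPE, bottom, 0,
                     a_new, nx, MPI_REAL_TYPE, top, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Reduce the local residual sums into a global L2 norm.
        cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost);
        real global_norm = 0.0;
        MPI_Allreduce(l2_norm_h, &global_norm, 1, MPI_REAL_TYPE, MPI_SUM,
                      MPI_COMM_WORLD);
        if (0 == rank && 0 == iter % 100)
            printf("%5d, %0.6f\n", iter, std::sqrt(global_norm));

        std::swap(a_new, a);
    }

    cudaFreeHost(l2_norm_h);
    cudaFree(l2_norm_d);
    cudaFree(a_new);
    cudaFree(a);
    MPI_Finalize();
    return 0;
}

Without a CUDA-aware MPI, the same loop would first stage the halo rows through host buffers with cudaMemcpy before each MPI_Sendrecv; that is the cudaMemcpy variant this commit adds alongside the MPI one.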

Binary
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi_kernels.o


Binary
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports.zip


Binary
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report1.qdrep


Binary
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report2.qdrep


Binary
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report3.qdrep


Binary
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report4.qdrep


Binary
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report5.qdrep


Binary
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report6.qdrep


Binary
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report7.qdrep


Binary
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/reports/report8.qdrep


+ 33 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/single_gpu/Makefile

@@ -0,0 +1,33 @@
+# Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
+NVCC=nvcc
+CUDA_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/cuda/11.2/
+GENCODE_SM30	:= -gencode arch=compute_30,code=sm_30
+GENCODE_SM35	:= -gencode arch=compute_35,code=sm_35
+GENCODE_SM37	:= -gencode arch=compute_37,code=sm_37
+GENCODE_SM50	:= -gencode arch=compute_50,code=sm_50
+GENCODE_SM52	:= -gencode arch=compute_52,code=sm_52
+GENCODE_SM60    := -gencode arch=compute_60,code=sm_60
+GENCODE_SM70    := -gencode arch=compute_70,code=sm_70
+GENCODE_SM80    := -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80
+GENCODE_FLAGS	:= $(GENCODE_SM70) $(GENCODE_SM80)
+ifdef DISABLE_CUB
+        NVCC_FLAGS = -Xptxas --optimize-float-atomics
+else
+        NVCC_FLAGS = -DHAVE_CUB
+endif
+NVCC_FLAGS += -Xcompiler -fopenmp -lineinfo -DUSE_NVTX -lnvToolsExt $(GENCODE_FLAGS) -std=c++14
+jacobi: Makefile jacobi.cu
+	$(NVCC) $(NVCC_FLAGS) jacobi.cu -o jacobi
+
+.PHONY: clean
+clean:
+	rm -f jacobi jacobi.qdrep
+
+sanitize: jacobi
+	compute-sanitizer ./jacobi -niter 10
+
+run: jacobi
+	./jacobi
+
+profile: jacobi
+	nsys profile --trace=cuda,nvtx -o jacobi ./jacobi -niter 10

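As a usage note on the targets above: `make run` builds and runs the solver with its defaults (a 16384 x 16384 mesh for up to 1000 iterations, per jacobi.cu below), `make sanitize` runs a short 10-iteration pass under compute-sanitizer, and `make profile` writes jacobi.qdrep via Nsight Systems, which `make clean` removes. Building with `make DISABLE_CUB=1` drops the `-DHAVE_CUB` define, switching the kernel's norm reduction from a CUB block reduction to plain per-thread atomics.
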
Binary
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/single_gpu/jacobi


+ 266 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/single_gpu/jacobi.cu

@@ -0,0 +1,266 @@
+/* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NVIDIA CORPORATION nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <algorithm>
+#include <array>
+#include <climits>
+#include <cmath>
+#include <cstdio>
+#include <iostream>
+#include <iterator>
+#include <sstream>
+
+#include <omp.h>
+
+#ifdef HAVE_CUB
+#include <cub/block/block_reduce.cuh>
+#endif  // HAVE_CUB
+
+#ifdef USE_NVTX
+#include <nvToolsExt.h>
+
+const uint32_t colors[] = {0x0000ff00, 0x000000ff, 0x00ffff00, 0x00ff00ff,
+                           0x0000ffff, 0x00ff0000, 0x00ffffff};
+const int num_colors = sizeof(colors) / sizeof(uint32_t);
+
+#define PUSH_RANGE(name, cid)                              \
+    {                                                      \
+        int color_id = cid;                                \
+        color_id = color_id % num_colors;                  \
+        nvtxEventAttributes_t eventAttrib = {0};           \
+        eventAttrib.version = NVTX_VERSION;                \
+        eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;  \
+        eventAttrib.colorType = NVTX_COLOR_ARGB;           \
+        eventAttrib.color = colors[color_id];              \
+        eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII; \
+        eventAttrib.message.ascii = name;                  \
+        nvtxRangePushEx(&eventAttrib);                     \
+    }
+#define POP_RANGE nvtxRangePop();
+#else
+#define PUSH_RANGE(name, cid)
+#define POP_RANGE
+#endif
+
+#define CUDA_RT_CALL(call)                                                                  \
+    {                                                                                       \
+        cudaError_t cudaStatus = call;                                                      \
+        if (cudaSuccess != cudaStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
+    }
+
+typedef float real;
+constexpr real tol = 1.0e-8;
+
+const real PI = 2.0 * std::asin(1.0);
+
+__global__ void initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
+                                      const real pi, const int offset, const int nx,
+                                      const int my_ny, const int ny) {
+    for (int iy = blockIdx.x * blockDim.x + threadIdx.x; iy < my_ny; iy += blockDim.x * gridDim.x) {
+        const real y0 = sin(2.0 * pi * (offset + iy) / (ny - 1));
+        a[iy * nx + 0] = y0;
+        a[iy * nx + (nx - 1)] = y0;
+        a_new[iy * nx + 0] = y0;
+        a_new[iy * nx + (nx - 1)] = y0;
+    }
+}
+
+template <int BLOCK_DIM_X, int BLOCK_DIM_Y>
+__global__ void jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
+                              real* __restrict__ const l2_norm, const int iy_start,
+                              const int iy_end, const int nx, const bool calculate_norm) {
+#ifdef HAVE_CUB
+    typedef cub::BlockReduce<real, BLOCK_DIM_X, cub::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y>
+        BlockReduce;
+    __shared__ typename BlockReduce::TempStorage temp_storage;
+#endif  // HAVE_CUB
+    int iy = blockIdx.y * blockDim.y + threadIdx.y + iy_start;
+    int ix = blockIdx.x * blockDim.x + threadIdx.x + 1;
+    real local_l2_norm = 0.0;
+
+    if (iy < iy_end && ix < (nx - 1)) {
+        const real new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
+                                     a[(iy + 1) * nx + ix] + a[(iy - 1) * nx + ix]);
+        a_new[iy * nx + ix] = new_val;
+        if (calculate_norm) {
+            real residue = new_val - a[iy * nx + ix];
+            local_l2_norm += residue * residue;
+        }
+    }
+    if (calculate_norm) {
+#ifdef HAVE_CUB
+        real block_l2_norm = BlockReduce(temp_storage).Sum(local_l2_norm);
+        if (0 == threadIdx.y && 0 == threadIdx.x) atomicAdd(l2_norm, block_l2_norm);
+#else
+        atomicAdd(l2_norm, local_l2_norm);
+#endif  // HAVE_CUB
+    }
+}
+
+template <typename T>
+T get_argval(char** begin, char** end, const std::string& arg, const T default_val) {
+    T argval = default_val;
+    char** itr = std::find(begin, end, arg);
+    if (itr != end && ++itr != end) {
+        std::istringstream inbuf(*itr);
+        inbuf >> argval;
+    }
+    return argval;
+}
+
+bool get_arg(char** begin, char** end, const std::string& arg) {
+    char** itr = std::find(begin, end, arg);
+    if (itr != end) {
+        return true;
+    }
+    return false;
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, real* const a_ref_h,
+                  const int nccheck, const bool print);
+
+int main(int argc, char* argv[]) {
+    const int iter_max = get_argval<int>(argv, argv + argc, "-niter", 1000);
+    const int nccheck = get_argval<int>(argv, argv + argc, "-nccheck", 1);
+    const int nx = get_argval<int>(argv, argv + argc, "-nx", 16384);
+    const int ny = get_argval<int>(argv, argv + argc, "-ny", 16384);
+    const bool csv = get_arg(argv, argv + argc, "-csv");
+
+    if (nccheck != 1) {
+        fprintf(stderr, "Only nccheck = 1 is supported\n");
+        return -1;
+    }
+
+    CUDA_RT_CALL(cudaSetDevice(0));
+    CUDA_RT_CALL(cudaFree(0));
+
+    real* a_ref_h;
+    CUDA_RT_CALL(cudaMallocHost(&a_ref_h, nx * ny * sizeof(real)));
+    
+    double runtime_serial = single_gpu(nx, ny, iter_max, a_ref_h, nccheck, !csv);
+
+    if (csv) {
+        printf("single_gpu, %d, %d, %d, %d, %f\n", nx, ny, iter_max, nccheck, runtime_serial);
+    } else {
+        printf("%dx%d: 1 GPU: %8.4f s\n", ny, nx, runtime_serial);
+    }
+
+    return 0;
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, real* const a_ref_h,
+                  const int nccheck, const bool print) {
+    real* a;
+    real* a_new;
+
+    real* l2_norm_d;
+    real* l2_norm_h;
+
+    int iy_start = 1;
+    int iy_end = (ny - 1);
+
+    CUDA_RT_CALL(cudaMalloc(&a, nx * ny * sizeof(real)));
+    CUDA_RT_CALL(cudaMalloc(&a_new, nx * ny * sizeof(real)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * ny * sizeof(real)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * ny * sizeof(real)));
+
+    // Set Dirichlet boundary conditions on left and right border
+    initialize_boundaries<<<ny / 128 + 1, 128>>>(a, a_new, PI, 0, nx, ny, ny);
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(real)));
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(real)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (print)
+        printf(
+            "Single GPU jacobi relaxation: %d iterations on %d x %d mesh with "
+            "norm "
+            "check every %d iterations\n",
+            iter_max, ny, nx, nccheck);
+
+    constexpr int dim_block_x = 32;
+    constexpr int dim_block_y = 32;
+    dim3 dim_grid((nx + dim_block_x - 1) / dim_block_x, (ny + dim_block_y - 1) / dim_block_y, 1);
+
+    int iter = 0;
+    bool calculate_norm;
+    real l2_norm = 1.0;
+
+    double start = omp_get_wtime();
+    PUSH_RANGE("Jacobi solve", 0)
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemset(l2_norm_d, 0, sizeof(real)));
+
+        calculate_norm = (iter % nccheck) == 0 || (print && ((iter % 100) == 0));
+        jacobi_kernel<dim_block_x, dim_block_y>
+            <<<dim_grid, {dim_block_x, dim_block_y, 1}, 0, 0>>>(
+                a_new, a, l2_norm_d, iy_start, iy_end, nx, calculate_norm);
+        CUDA_RT_CALL(cudaGetLastError());
+
+        if (calculate_norm) {
+            CUDA_RT_CALL(cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost));
+        }
+
+        // Apply periodic boundary conditions
+
+        CUDA_RT_CALL(cudaMemcpy(a_new, a_new + (iy_end - 1) * nx, nx * sizeof(real),
+                                     cudaMemcpyDeviceToDevice));
+        CUDA_RT_CALL(cudaMemcpy(a_new + iy_end * nx, a_new + iy_start * nx, nx * sizeof(real),
+                                     cudaMemcpyDeviceToDevice));
+
+        if (calculate_norm) {
+            CUDA_RT_CALL(cudaDeviceSynchronize());
+            l2_norm = *l2_norm_h;
+            l2_norm = std::sqrt(l2_norm);
+            if (print && (iter % 100) == 0) printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+
+        std::swap(a_new, a);
+        iter++;
+    }
+    POP_RANGE
+    double stop = omp_get_wtime();
+
+    CUDA_RT_CALL(cudaMemcpy(a_ref_h, a, nx * ny * sizeof(real), cudaMemcpyDeviceToHost));
+
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+    return (stop - start);
+}
+
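
A sizing note on the defaults above: with nx = ny = 16384 and real = float, each solution array is 16384 * 16384 * 4 B = 1 GiB, so a and a_new together occupy 2 GiB of device memory, plus the 1 GiB pinned host buffer a_ref_h; the 32 x 32 thread blocks then give a 512 x 512 grid. The two device-to-device cudaMemcpy calls inside the loop implement the periodic boundary in y: row 0 is refreshed from the last interior row (iy_end - 1) and row iy_end from the first interior row (iy_start).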

+ 315 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/single_gpu/jacobi.cu.old

@@ -0,0 +1,315 @@
+/* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NVIDIA CORPORATION nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <algorithm>
+#include <array>
+#include <climits>
+#include <cmath>
+#include <cstdio>
+#include <iostream>
+#include <iterator>
+#include <sstream>
+
+#include <omp.h>
+
+#ifdef HAVE_CUB
+#include <cub/block/block_reduce.cuh>
+#endif  // HAVE_CUB
+
+#ifdef USE_NVTX
+#include <nvToolsExt.h>
+
+const uint32_t colors[] = {0x0000ff00, 0x000000ff, 0x00ffff00, 0x00ff00ff,
+                           0x0000ffff, 0x00ff0000, 0x00ffffff};
+const int num_colors = sizeof(colors) / sizeof(uint32_t);
+
+#define PUSH_RANGE(name, cid)                              \
+    {                                                      \
+        int color_id = cid;                                \
+        color_id = color_id % num_colors;                  \
+        nvtxEventAttributes_t eventAttrib = {0};           \
+        eventAttrib.version = NVTX_VERSION;                \
+        eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;  \
+        eventAttrib.colorType = NVTX_COLOR_ARGB;           \
+        eventAttrib.color = colors[color_id];              \
+        eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII; \
+        eventAttrib.message.ascii = name;                  \
+        nvtxRangePushEx(&eventAttrib);                     \
+    }
+#define POP_RANGE nvtxRangePop();
+#else
+#define PUSH_RANGE(name, cid)
+#define POP_RANGE
+#endif
+
+#define CUDA_RT_CALL(call)                                                                  \
+    {                                                                                       \
+        cudaError_t cudaStatus = call;                                                      \
+        if (cudaSuccess != cudaStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
+    }
+
+typedef float real;
+constexpr real tol = 1.0e-8;
+
+const real PI = 2.0 * std::asin(1.0);
+
+__global__ void initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
+                                      const real pi, const int nx, const int ny) {
+    for (int iy = blockIdx.x * blockDim.x + threadIdx.x; iy < ny; iy += blockDim.x * gridDim.x) {
+        const real y0 = sin(2.0 * pi * iy / (ny - 1));
+        a[iy * nx + 0] = y0;
+        a[iy * nx + (nx - 1)] = y0;
+        a_new[iy * nx + 0] = y0;
+        a_new[iy * nx + (nx - 1)] = y0;
+    }
+}
+
+template <int BLOCK_DIM_X, int BLOCK_DIM_Y>
+__global__ void jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
+                              real* __restrict__ const l2_norm, const int iy_start,
+                              const int iy_end, const int nx) {
+#ifdef HAVE_CUB
+    typedef cub::BlockReduce<real, BLOCK_DIM_X, cub::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y>
+        BlockReduce;
+    __shared__ typename BlockReduce::TempStorage temp_storage;
+#endif  // HAVE_CUB
+    const int iy = blockIdx.y * blockDim.y + threadIdx.y + 1;
+    const int ix = blockIdx.x * blockDim.x + threadIdx.x;
+    real local_l2_norm = 0.0;
+
+    if (iy < iy_end) {
+        if (ix >= 1 && ix < (nx - 1)) {
+            const real new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
+                                         a[(iy + 1) * nx + ix] + a[(iy - 1) * nx + ix]);
+            a_new[iy * nx + ix] = new_val;
+
+            // apply boundary conditions
+            if (iy_start == iy) {
+                a_new[iy_end * nx + ix] = new_val;
+            }
+
+            if ((iy_end - 1) == iy) {
+                a_new[(iy_start - 1) * nx + ix] = new_val;
+            }
+
+            real residue = new_val - a[iy * nx + ix];
+            local_l2_norm = residue * residue;
+        }
+    }
+#ifdef HAVE_CUB
+    real block_l2_norm = BlockReduce(temp_storage).Sum(local_l2_norm);
+    if (0 == threadIdx.y && 0 == threadIdx.x) atomicAdd(l2_norm, block_l2_norm);
+#else
+    atomicAdd(l2_norm, local_l2_norm);
+#endif  // HAVE_CUB
+}
+
+double noopt(const int nx, const int ny, const int iter_max, real* const a_ref_h, const int nccheck,
+             const bool print);
+
+template <typename T>
+T get_argval(char** begin, char** end, const std::string& arg, const T default_val) {
+    T argval = default_val;
+    char** itr = std::find(begin, end, arg);
+    if (itr != end && ++itr != end) {
+        std::istringstream inbuf(*itr);
+        inbuf >> argval;
+    }
+    return argval;
+}
+
+bool get_arg(char** begin, char** end, const std::string& arg) {
+    char** itr = std::find(begin, end, arg);
+    if (itr != end) {
+        return true;
+    }
+    return false;
+}
+
+struct l2_norm_buf {
+    cudaEvent_t copy_done;
+    real* d;
+    real* h;
+};
+
+int main(int argc, char* argv[]) {
+    const int iter_max = get_argval<int>(argv, argv + argc, "-niter", 1000);
+    const int nccheck = get_argval<int>(argv, argv + argc, "-nccheck", 1);
+    const int nx = get_argval<int>(argv, argv + argc, "-nx", 16384);
+    const int ny = get_argval<int>(argv, argv + argc, "-ny", 16384);
+    const bool csv = get_arg(argv, argv + argc, "-csv");
+
+    if (nccheck != 1) {
+        fprintf(stderr, "Only nccheck = 1 is supported\n");
+        return -1;
+    }
+
+    real* a;
+    real* a_new;
+
+    cudaStream_t compute_stream;
+    cudaStream_t copy_l2_norm_stream;
+    cudaStream_t reset_l2_norm_stream;
+
+    cudaEvent_t compute_done;
+    cudaEvent_t reset_l2_norm_done[2];
+
+    real l2_norms[2];
+    l2_norm_buf l2_norm_bufs[2];
+
+    int iy_start = 1;
+    int iy_end = (ny - 1);
+
+    CUDA_RT_CALL(cudaSetDevice(0));
+    CUDA_RT_CALL(cudaFree(0));
+
+    CUDA_RT_CALL(cudaMalloc(&a, nx * ny * sizeof(real)));
+    CUDA_RT_CALL(cudaMalloc(&a_new, nx * ny * sizeof(real)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * ny * sizeof(real)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * ny * sizeof(real)));
+
+    // Set Dirichlet boundary conditions on left and right border
+    initialize_boundaries<<<ny / 128 + 1, 128>>>(a, a_new, PI, nx, ny);
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    CUDA_RT_CALL(cudaStreamCreate(&compute_stream));
+    CUDA_RT_CALL(cudaStreamCreate(&copy_l2_norm_stream));
+    CUDA_RT_CALL(cudaStreamCreate(&reset_l2_norm_stream));
+    CUDA_RT_CALL(cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming));
+    CUDA_RT_CALL(cudaEventCreateWithFlags(&reset_l2_norm_done[0], cudaEventDisableTiming));
+    CUDA_RT_CALL(cudaEventCreateWithFlags(&reset_l2_norm_done[1], cudaEventDisableTiming));
+
+    for (int i = 0; i < 2; ++i) {
+        CUDA_RT_CALL(cudaEventCreateWithFlags(&l2_norm_bufs[i].copy_done, cudaEventDisableTiming));
+        CUDA_RT_CALL(cudaMalloc(&l2_norm_bufs[i].d, sizeof(real)));
+        CUDA_RT_CALL(cudaMemset(l2_norm_bufs[i].d, 0, sizeof(real)));
+        CUDA_RT_CALL(cudaMallocHost(&l2_norm_bufs[i].h, sizeof(real)));
+        (*l2_norm_bufs[i].h) = 1.0;
+    }
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (!csv)
+        printf(
+            "Jacobi relaxation: %d iterations on %d x %d mesh with norm check "
+            "every %d iterations\n",
+            iter_max, ny, nx, nccheck);
+
+    constexpr int dim_block_x = 32;
+    constexpr int dim_block_y = 32;
+    dim3 dim_grid((nx + dim_block_x - 1) / dim_block_x, (ny + dim_block_y - 1) / dim_block_y, 1);
+
+    int iter = 0;
+    for (int i = 0; i < 2; ++i) {
+        l2_norms[i] = 0.0;
+    }
+
+    double start = omp_get_wtime();
+
+    PUSH_RANGE("Jacobi solve", 0)
+
+    bool l2_norm_greater_than_tol = true;
+    while (l2_norm_greater_than_tol && iter < iter_max) {
+        // on new iteration: old current vars are now previous vars, old
+        // previous vars are no longer needed
+        int prev = iter % 2;
+        int curr = (iter + 1) % 2;
+
+        // wait for memset from old previous iteration to complete
+        CUDA_RT_CALL(cudaStreamWaitEvent(compute_stream, reset_l2_norm_done[curr], 0));
+
+        jacobi_kernel<dim_block_x, dim_block_y>
+            <<<dim_grid, {dim_block_x, dim_block_y, 1}, 0, compute_stream>>>(
+                a_new, a, l2_norm_bufs[curr].d, iy_start, iy_end, nx);
+        CUDA_RT_CALL(cudaGetLastError());
+        CUDA_RT_CALL(cudaEventRecord(compute_done, compute_stream));
+
+        // perform L2 norm calculation
+        if ((iter % nccheck) == 0 || (!csv && (iter % 100) == 0)) {
+            CUDA_RT_CALL(cudaStreamWaitEvent(copy_l2_norm_stream, compute_done, 0));
+            CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_bufs[curr].h, l2_norm_bufs[curr].d, sizeof(real),
+                                         cudaMemcpyDeviceToHost, copy_l2_norm_stream));
+            CUDA_RT_CALL(cudaEventRecord(l2_norm_bufs[curr].copy_done, copy_l2_norm_stream));
+
+            // make sure D2H copy is complete before using the data for
+            // calculation
+            CUDA_RT_CALL(cudaEventSynchronize(l2_norm_bufs[prev].copy_done));
+
+            l2_norms[prev] = *(l2_norm_bufs[prev].h);
+            l2_norms[prev] = std::sqrt(l2_norms[prev]);
+            l2_norm_greater_than_tol = (l2_norms[prev] > tol);
+
+            if (!csv && (iter % 100) == 0) {
+                printf("%5d, %0.6f\n", iter, l2_norms[prev]);
+            }
+
+            // reset everything for next iteration
+            l2_norms[prev] = 0.0;
+            *(l2_norm_bufs[prev].h) = 0.0;
+            CUDA_RT_CALL(
+                cudaMemsetAsync(l2_norm_bufs[prev].d, 0, sizeof(real), reset_l2_norm_stream));
+            CUDA_RT_CALL(cudaEventRecord(reset_l2_norm_done[prev], reset_l2_norm_stream));
+        }
+
+        std::swap(a_new, a);
+        iter++;
+    }
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    POP_RANGE
+    double stop = omp_get_wtime();
+
+    if (csv) {
+        printf("single_gpu, %d, %d, %d, %d, %f\n", nx, ny, iter_max, nccheck, (stop - start));
+    } else {
+        printf("%dx%d: 1 GPU: %8.4f s\n", ny, nx, (stop - start));
+    }
+
+    for (int i = 0; i < 2; ++i) {
+        CUDA_RT_CALL(cudaFreeHost(l2_norm_bufs[i].h));
+        CUDA_RT_CALL(cudaFree(l2_norm_bufs[i].d));
+        CUDA_RT_CALL(cudaEventDestroy(l2_norm_bufs[i].copy_done));
+    }
+
+    CUDA_RT_CALL(cudaEventDestroy(reset_l2_norm_done[1]));
+    CUDA_RT_CALL(cudaEventDestroy(reset_l2_norm_done[0]));
+    CUDA_RT_CALL(cudaEventDestroy(compute_done));
+
+    CUDA_RT_CALL(cudaStreamDestroy(reset_l2_norm_stream));
+    CUDA_RT_CALL(cudaStreamDestroy(copy_l2_norm_stream));
+    CUDA_RT_CALL(cudaStreamDestroy(compute_stream));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+
+    return 0;
+}

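The .old variant above pipelines the norm check with compute using three streams, two device/host buffer pairs, and CUDA events; the pattern is easier to see stripped to its synchronization skeleton. Below is a minimal sketch of that double-buffered pipeline, with a hypothetical `work` kernel standing in for the Jacobi step and error checking omitted for brevity:

// overlap_sketch.cu -- skeleton of the double-buffered norm pipeline.
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the Jacobi step: accumulates something into the norm buffer.
__global__ void work(float* acc) { atomicAdd(acc, 1.0f); }

int main() {
    cudaStream_t compute, copy, reset;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);
    cudaStreamCreate(&reset);

    cudaEvent_t compute_done, copy_done[2], reset_done[2];
    cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming);

    float* d[2];
    float* h[2];
    for (int i = 0; i < 2; ++i) {
        cudaEventCreateWithFlags(&copy_done[i], cudaEventDisableTiming);
        cudaEventCreateWithFlags(&reset_done[i], cudaEventDisableTiming);
        cudaMalloc(&d[i], sizeof(float));
        cudaMemset(d[i], 0, sizeof(float));
        cudaMallocHost(&h[i], sizeof(float));
        *h[i] = 0.0f;
    }

    for (int iter = 0; iter < 10; ++iter) {
        const int prev = iter % 2;
        const int curr = (iter + 1) % 2;

        // Do not write into curr's buffer until its async reset has landed
        // (a no-op on the first pass, when the event was never recorded).
        cudaStreamWaitEvent(compute, reset_done[curr], 0);
        work<<<1, 1, 0, compute>>>(d[curr]);
        cudaEventRecord(compute_done, compute);

        // Copy curr's result on a separate stream, ordered after the kernel.
        cudaStreamWaitEvent(copy, compute_done, 0);
        cudaMemcpyAsync(h[curr], d[curr], sizeof(float),
                        cudaMemcpyDeviceToHost, copy);
        cudaEventRecord(copy_done[curr], copy);

        // Consume the *previous* iteration's value: the host only ever waits
        // on a copy issued one iteration ago, so compute and copy overlap.
        cudaEventSynchronize(copy_done[prev]);
        printf("iter %2d consumed value %f\n", iter, *h[prev]);

        // Reset prev's buffers asynchronously for reuse two iterations later.
        *h[prev] = 0.0f;
        cudaMemsetAsync(d[prev], 0, sizeof(float), reset);
        cudaEventRecord(reset_done[prev], reset);
    }
    cudaDeviceSynchronize();
    return 0;
}

The cost of this design, visible in the .old convergence check, is that the loop always acts on a norm that is one iteration stale; the simplified jacobi.cu above trades the overlap away for a synchronous copy and an up-to-date norm.
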
+ 4 - 10
hpc/multi_gpu_nways/labs/CFD/English/mgp_01_begin.ipynb

@@ -40,10 +40,11 @@
     "\n",
     "1. Overview of Jacobi Solver application\n",
     "    * Review of single-GPU code\n",
-    "    * Parallelizing to multiple GPUs using normal MPI\n",
+    "    * Parallelizing to multiple GPUs using cudaMemcpy\n",
     "2. Profiling with NVTX and Nsight Systems\n",
-    "    * Profiling multi-GPU normal MPI code\n",
-    "    * Using single-node CUDA-aware MPI with compute-copy overlap\n",
+    "    * Profiling multi-GPU cudaMemcpy code\n",
+    "    * Using single-node CUDA-aware MPI\n",
+    "    * Optimizing CUDA-aware MPI with compute-copy overlap\n",
     "3. Communication topology\n",
     "    * Overview of intra-node and inter-node communication architecture\n",
     "    * Benchmarking communication networks\n",
@@ -58,13 +59,6 @@
     "\n",
     "This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
    ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
   }
  ],
  "metadata": {

+ 10 - 0
hpc/multi_gpu_nways/mgpm

@@ -0,0 +1,10 @@
+#!/bin/bash
+#SBATCH -N 1               # number of nodes
+#SBATCH -t 08:00:00        # wall time
+#SBATCH --exclusive           # exclusive node access
+#SBATCH --ntasks-per-node=8    # n tasks per machine (one task per gpu)
+#SBATCH --cores-per-socket=20    # 20 cores on each socket
+
+module purge
+module load Singularity/3.6.1
+singularity run --nv multi_gpu.simg  jupyter notebook --notebook-dir=~/multi_gpu_labs/ --port=8000 --ip=0.0.0.0 --no-browser --NotebookApp.token=""

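Assuming the Singularity image has been built as multi_gpu.simg and the lab material extracted to ~/multi_gpu_labs/, this script is submitted with `sbatch mgpm`. Since the notebook is started with authentication disabled (empty token), the usual workflow is to forward port 8000 from the allocated node over SSH (for example `ssh -L 8000:<node>:8000 <login-host>`, where `<node>` is the host named in the job output) and browse to http://localhost:8000. The slurm-*.out file below records a session started this way.
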
+ 57 - 0
hpc/multi_gpu_nways/slurm-165592.out

@@ -0,0 +1,57 @@
+[I 21:04:51.121 NotebookApp] Authentication of /metrics is OFF, since other authentication is disabled.
+[W 21:04:51.412 NotebookApp] All authentication is disabled.  Anyone who can connect to this server will be able to run code.
+[I 21:04:51.414 NotebookApp] Serving notebooks from local directory: /home/anisaxena/multi_gpu_labs
+[I 21:04:51.414 NotebookApp] Jupyter Notebook 6.4.0 is running at:
+[I 21:04:51.414 NotebookApp] http://prm-dgx-05:8000/
+[I 21:04:51.414 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
+[I 21:05:36.430 NotebookApp] 302 GET / (10.33.12.57) 0.550000ms
+[W 21:05:53.513 NotebookApp] Notebook CFD/English/C/jupyter_notebook/jacobi/overview.ipynb is not trusted
+[I 21:05:54.029 NotebookApp] Kernel started: b7548e3a-38bd-48ee-9154-5e04bbfe6191, name: python3
+[I 21:05:56.940 NotebookApp] Kernel started: 205d247e-f2c4-466c-aa39-18b8184a3efa, name: python3
+[I 21:33:57.393 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 21:35:57.453 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 22:43:58.717 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 22:47:58.737 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 23:03:59.255 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 23:05:59.271 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 23:07:59.302 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 23:09:59.293 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 23:11:59.307 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 23:13:59.319 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 23:15:59.664 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 23:17:59.243 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 23:27:59.313 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 23:29:59.517 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 23:35:59.968 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 23:56:00.824 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 23:58:00.708 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 00:06:00.988 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 00:08:00.914 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 00:10:00.922 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 00:20:01.491 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 00:20:47.948 NotebookApp] New terminal with automatic name: 1
+[I 00:20:59.004 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 00:26:01.841 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 00:28:02.156 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 00:34:02.906 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 00:36:02.787 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 00:38:02.942 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 00:40:03.096 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 00:44:03.083 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 00:46:03.095 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 02:14:05.872 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 02:32:06.461 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 02:34:06.464 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 02:36:06.368 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 02:38:06.487 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 02:46:07.347 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 02:50:07.076 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 02:51:11.220 NotebookApp] Kernel interrupted: b7548e3a-38bd-48ee-9154-5e04bbfe6191
+[I 02:52:06.780 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 03:00:07.351 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 03:02:07.293 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 03:04:07.372 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 03:08:07.298 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 03:12:07.779 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 03:14:08.093 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb
+[I 03:16:07.857 NotebookApp] Saving file at /CFD/English/C/jupyter_notebook/jacobi/overview.ipynb