
Added NCCL and NVSHMEM notebooks

Anish Saxena 2 years ago
parent commit 60cac07f0f
30 changed files with 2885 additions and 227 deletions
  1. +3 -0      .gitignore
  2. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nccl_architecture.png
  3. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nccl_dgx1_topology.png
  4. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nccl_profiler_output.png
  5. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_left_shift_output.png
  6. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_memory_model.png
  7. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_mpi_comparison.png
  8. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_profiler_report.png
  9. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_thread_level_comm.png
  10. +2 -0     hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/cuda/memcpy.ipynb
  11. +4 -0     hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/cuda/streams.ipynb
  12. +3 -1     hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/cuda_aware.ipynb
  13. +1 -1     hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/memcpy.ipynb
  14. +1 -1     hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/multi_node_intro.ipynb
  15. +276 -61  hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/nccl/nccl.ipynb
  16. +461 -0   hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/nvshmem/nvshmem.ipynb
  17. +16 -23   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/cuda/jacobi_memcpy.cu
  18. +10 -14   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/cuda/jacobi_streams.cu
  19. +16 -20   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/cuda/jacobi_streams_events.cu
  20. +11 -8    hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi_cuda_aware_mpi.cpp
  21. +11 -14   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi_memcpy_mpi.cpp
  22. +8 -26    hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/Makefile
  23. +38 -53   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/jacobi_kernels.cu
  24. +406 -0   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/jacobi_nccl.cpp
  25. +407 -0   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/solution/jacobi_nccl.cpp
  26. +29 -0    hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/Makefile
  27. +567 -0   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/jacobi_nvshmem.cu
  28. +55 -0    hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/left_shift.cu
  29. +555 -0   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/solution/jacobi_nvshmem.cu
  30. +5 -5     hpc/multi_gpu_nways/labs/CFD/English/introduction.ipynb

+ 3 - 0
.gitignore

@@ -6,3 +6,6 @@ alk.traj.dcd
 *.a
 *.la
 mgpm
+*.o
+*.out
+final_remarks.ipynb

BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nccl_architecture.png


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nccl_dgx1_topology.png


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nccl_profiler_output.png


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_left_shift_output.png


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_memory_model.png


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_mpi_comparison.png


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_profiler_report.png


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_thread_level_comm.png


+ 2 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/cuda/memcpy.ipynb

@@ -144,6 +144,8 @@
     "\n",
     "Now, let's parallelize our code across multiple GPUs by using `cudaSetDevice` and `cudaMemcpyAsync` operations. Open the [jacobi_memcpy.cu](../../source_code/cuda/jacobi_memcpy.cu) file.\n",
     "\n",
+    "Alternatively, you can navigate to `CFD/English/C/source_code/cuda/` directory in Jupyter's file browser in the left pane. Then, click to open the `jacobi_memcpy.cu` file. \n",
+    "\n",
     "Understand the flow of the program from within the `main` function. Review the following pre-Jacobi-computation steps:\n",
     "\n",
     "1. Computation of the memory chunk size to be allocated on each GPU stored in the `chunk_size` integer array.\n",

+ 4 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/cuda/streams.ipynb

@@ -108,6 +108,8 @@
     "\n",
     "Now, let's implement CUDA streams in our application. Open the [jacobi_streams.cu](../../source_code/cuda/jacobi_streams.cu) file.\n",
     "\n",
+    "Alternatively, you can navigate to `CFD/English/C/source_code/cuda/` directory in Jupyter's file browser in the left pane. Then, click to open the `jacobi_streams.cu` file.\n",
+    "\n",
     "Note that we create 3 streams- `compute_stream`, `push_top_stream`, and `push_bottom_stream` for each GPU. We will compute the Jacobi iteration and perform GPU-local L2 norm copy operation on the `compute_stream`. Each GPU will perform its top and bottom halo copy operation to its neighbours using the `push_top_stream` and `push_bottom_stream` streams, respectively. \n",
     "\n",
     "Now, within the iterative Jacobi loop (the `while` loop), implement the following marked as `TODO: Part 3-`:\n",
@@ -248,6 +250,8 @@
     "\n",
     "Let's implement CUDA Events with Streams in our application. Open the [jacobi_streams_events.cu](../../source_code/cuda/jacobi_streams_events.cu) file.\n",
     "\n",
+    "Alternatively, you can navigate to `CFD/English/C/source_code/cuda/` directory in Jupyter's file browser in the left pane. Then, click to open the `jacobi_streams_events.cu` file.\n",
+    "\n",
     "Note that we create 5 events for each device, `compute_done`, `push_top_done[0]`, `push_top_done[1]`, `push_bottom_done[0]`, and `push_bottom_done[1]`. We need 2 events for each halo on every device:\n",
     "\n",
     "1. To synchronize \"top\" and \"bottom\" neighbour's `push_bottom_stream` and `push_top_stream` copy operations of $(i-1)^{th}$ iteration, respectively, before computing $i^{th}$ Jacobi iteration in `compute_stream`.\n",

+ 3 - 1
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/cuda_aware.ipynb

@@ -87,7 +87,9 @@
     "\n",
     "## Implementation Exercise: Part 2\n",
     "\n",
-    "Open the [jacobi_cuda_aware_mpi.cpp](../../source_code/mpi/jacobi_cuda_aware_mpi.cpp) and [jacobi_kernels.cu](../../source_code/mpi/jacobi_kernels.cu) files in the `source_code/mpi` directory. The `jacobi_kernels.cu` file is same as in previous lab. Also open the [Makefile](../../source_code/mpi/Makefile) and note how the compilation and linking is also same as in previous lab.\n",
+    "Open the [jacobi_cuda_aware_mpi.cpp](../../source_code/mpi/jacobi_cuda_aware_mpi.cpp) and [jacobi_kernels.cu](../../source_code/mpi/jacobi_kernels.cu) files. Alternatively, you can navigate to `CFD/English/C/source_code/mpi/` directory in Jupyter's file browser in the left pane. Then, click to open the `jacobi_cuda_aware_mpi.cpp` and `jacobi_kernels.cu` files. The `jacobi_kernels.cu` file is same as in previous lab. \n",
+    "\n",
+    "Also open the [Makefile](../../source_code/mpi/Makefile) and note how the compilation and linking is also same as in previous lab.\n",
     "\n",
     "Understand the flow of the `jacobi_cuda_aware_mpi.cpp` program and observe the following:\n",
     "\n",

+ 1 - 1
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/memcpy.ipynb

@@ -135,7 +135,7 @@
     "\n",
     "### Code Structure\n",
     "\n",
-    "Open the [jacobi_memcpy_mpi.cpp](../../source_code/mpi/jacobi_memcpy_mpi.cpp) file and the [jacobi_kernels.cu](../../source_code/mpi/jacobi_kernels.cu) files from the `source_code/mpi` directory.\n",
+    "Open the [jacobi_memcpy_mpi.cpp](../../source_code/mpi/jacobi_memcpy_mpi.cpp) file and the [jacobi_kernels.cu](../../source_code/mpi/jacobi_kernels.cu) files. Alternatively, you can navigate to `CFD/English/C/source_code/mpi/` directory in Jupyter's file browser in the left pane. Then, click to open the `jacobi_memcpy_mpi.cpp` and `jacobi_kernels.cu` files.\n",
     "\n",
     "We separate the device kernels from other CUDA and MPI functions as `nvc++` compiler is required to compile CUDA C++ which may not be installed on some platforms Note that NVIDIA's HPC SDK includes the `nvc++` compiler.\n",
     "\n",

+ 1 - 1
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/multi_node_intro.ipynb

@@ -76,7 +76,7 @@
     "}\n",
     "```\n",
     "\n",
-    "To access the program, open the [hello_world.c](../../source_code/mpi/hello_world.c) file in `C/source_code/mpi/` directory.\n",
+    "To access the program, open the [hello_world.c](../../source_code/mpi/hello_world.c) file. Alternatively, you can navigate to `CFD/English/C/source_code/mpi/` directory in Jupyter's file browser in the left pane. Then, click to open the `hello_world.c` file.\n",
     "\n",
     "The MPI environment is initialized with `MPI_Init` through which all of MPI’s global and internal variables are constructed. A \"communicator\" is created between all processes that are spawned, and unique ranks are assigned to each process. \n",
     "\n",

File diff suppressed because it is too large
+ 276 - 61
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/nccl/nccl.ipynb


File diff suppressed because it is too large
+ 461 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/nvshmem/nvshmem.ipynb


+ 16 - 23
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/cuda/jacobi_memcpy.cu

@@ -224,21 +224,20 @@ int main(int argc, char* argv[]) {
         if (p2p == true) {
             const int top = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
             int canAccessPeer = 0;
-	    // TODO: Part 2- Check whether GPU "devices[dev_id]" can access peer "devices[top]"
-            CUDA_RT_CALL(cudaDeviceCanAccessPeer(&canAccessPeer, devices[dev_id], devices[top]));
+            // TODO: Part 2- Check whether GPU "devices[dev_id]" can access peer "devices[top]"
+            CUDA_RT_CALL(cudaDeviceCanAccessPeer(&canAccessPeer, /*Fill me*/, /*Fill me*/));
             if (canAccessPeer) {
-		// TODO: Part 2- Enable peer access from GPU "devices[dev_id]" to "devices[top]"
-                CUDA_RT_CALL(cudaDeviceEnablePeerAccess(devices[top], 0));
+            // TODO: Part 2- Enable peer access from GPU "devices[dev_id]" to "devices[top]"
+                CUDA_RT_CALL(cudaDeviceEnablePeerAccess(/*Fill me*/, 0));
             }
             const int bottom = (dev_id + 1) % num_devices;
             if (top != bottom) {
                 canAccessPeer = 0;
-		// TODO: Part 2- Check and enable peer access from GPU "devices[dev_id]" to
-		// "devices[bottom]", whenever possible
-                CUDA_RT_CALL(cudaDeviceCanAccessPeer(&canAccessPeer, 
-					devices[dev_id], devices[bottom]));
+                // TODO: Part 2- Check and enable peer access from GPU "devices[dev_id]" to
+                // "devices[bottom]", whenever possible
+                CUDA_RT_CALL(cudaDeviceCanAccessPeer(&canAccessPeer, /*Fill me*/, /*Fill me*/));
                 if (canAccessPeer) {
-                    CUDA_RT_CALL(cudaDeviceEnablePeerAccess(devices[bottom], 0));
+                    CUDA_RT_CALL(cudaDeviceEnablePeerAccess(/*Fill me*/, 0));
                 }
             }
         }
@@ -274,7 +273,7 @@ int main(int argc, char* argv[]) {
 	    // Launch device kernel on each GPU
         for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
             // TODO: Part 1- Set current GPU to be "devices[dev_id]"
-            CUDA_RT_CALL(cudaSetDevice(devices[dev_id]));
+            CUDA_RT_CALL(cudaSetDevice(/*Fill me*/));
 
             CUDA_RT_CALL(cudaMemsetAsync(l2_norm_d[dev_id], 0, sizeof(float)));
             dim3 dim_grid((nx + BLOCK_DIM_X - 1) / BLOCK_DIM_X,
@@ -283,13 +282,10 @@ int main(int argc, char* argv[]) {
             // TODO: Part 1- Call Jacobi kernel with "dim_grid" blocks in grid and "dim_block"
             // blocks per thread. "dev_id" variable points to corresponding memory allocated 
             // for the current GPU.
-            jacobi_kernel<<<dim_grid, dim_block>>>(
-                    a_new[dev_id], a[dev_id], l2_norm_d[dev_id], iy_start[dev_id], iy_end[dev_id],
-                    nx);
+            jacobi_kernel<<</*Fill me*/, /*Fill me*/>>>(/*Fill me*/);
 
             // TODO: Part 1- Copy GPU-local L2 norm "l2_norm_d" back to CPU "l2_norm_h"
-            CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_h[dev_id], l2_norm_d[dev_id], sizeof(float),
-                     cudaMemcpyDeviceToHost));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, sizeof(float), /*Fill me*/));
 	}
     // Launch async memory copy operations for halo exchange and 
 	// for copying local-grid L2 norm from each GPU to host
@@ -298,23 +294,20 @@ int main(int argc, char* argv[]) {
             const int bottom = (dev_id + 1) % num_devices;
             
             // TODO: Part 1- Set current GPU
-            CUDA_RT_CALL(cudaSetDevice(devices[dev_id]));
+            CUDA_RT_CALL(cudaSetDevice(/*Fill me*/));
 
             // TODO: Part 1- Implement halo exchange with top neighbour "top"
-            CUDA_RT_CALL(cudaMemcpyAsync(a_new[top] + (iy_end[top] * nx),
-                                         a_new[dev_id] + iy_start[dev_id] * nx, nx * sizeof(float),
-                                         cudaMemcpyDeviceToDevice));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, nx * sizeof(float), /*Fill me*/));
 	    
             // TODO: Part 1- Implement halo exchange with bottom neighbour "bottom"
-            CUDA_RT_CALL(cudaMemcpyAsync(a_new[bottom], a_new[dev_id] + (iy_end[dev_id] - 1) * nx,
-                                         nx * sizeof(float), cudaMemcpyDeviceToDevice));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, nx * sizeof(float), /*Fill me*/));
         }
         l2_norm = 0.0;
         // Synchronize devices and compute global L2 norm
         for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
             // TODO: part 1- Set current GPU and call cudaDeviceSynchronize()
-	        CUDA_RT_CALL(cudaSetDevice(devices[dev_id]));
-            CUDA_RT_CALL(cudaDeviceSynchronize());
+	        CUDA_RT_CALL(cudaSetDevice(/*Fill me*/));
+            CUDA_RT_CALL(/*Fill me*/);
 
             l2_norm += *(l2_norm_h[dev_id]);
         }
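
As a reference for the peer-access TODOs above, a minimal standalone sketch of the query-then-enable pattern, using devices 0 and 1 for illustration (not part of this commit):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int num_devices = 0;
        cudaGetDeviceCount(&num_devices);
        if (num_devices < 2) { printf("Need at least 2 GPUs.\n"); return 0; }

        int can_01 = 0, can_10 = 0;
        cudaDeviceCanAccessPeer(&can_01, 0, 1);   // can device 0 access device 1's memory?
        cudaDeviceCanAccessPeer(&can_10, 1, 0);   // can device 1 access device 0's memory?

        if (can_01) {
            cudaSetDevice(0);                     // peer access is enabled from the current device
            cudaDeviceEnablePeerAccess(1, 0);
        }
        if (can_10) {
            cudaSetDevice(1);
            cudaDeviceEnablePeerAccess(0, 0);
        }
        printf("peer access 0->1: %d, 1->0: %d\n", can_01, can_10);
        return 0;
    }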

+ 10 - 14
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/cuda/jacobi_streams.cu

@@ -263,22 +263,20 @@ int main(int argc, char* argv[]) {
             // from the previous iteration by synchronizing "push_top_stream" and
             // "push_bottom_stream" streams. Be careful with which neighbour's top stream and
             // which neighbour's bottom stream needs to be synchronized.
-            CUDA_RT_CALL(cudaStreamSynchronize(push_top_stream[bottom]));
-            CUDA_RT_CALL(cudaStreamSynchronize(push_bottom_stream[top]));
+            CUDA_RT_CALL(cudaStreamSynchronize(/*Fill me*/));
+            CUDA_RT_CALL(cudaStreamSynchronize(/*Fill me*/));
 
             dim3 dim_grid((nx + BLOCK_DIM_X - 1) / BLOCK_DIM_X,
                           (chunk_size[dev_id] + BLOCK_DIM_Y - 1) / BLOCK_DIM_Y, 1);
 
             // TODO: Part 3- Launch Jacobi kernel on "compute_stream[dev_id]" and all other
             // functional arguments
-            jacobi_kernel<<<dim_grid, dim_block, 0, compute_stream[dev_id]>>>(
-                    a_new[dev_id], a[dev_id], l2_norm_d[dev_id], iy_start[dev_id], iy_end[dev_id],
-                    nx);
+            jacobi_kernel<<</*Fill me*/, /*Fill me*/, 0, /*Fill me*/>>>(/*Fill me*/);
 
             // TODO: Part 3- Copy GPU-local L2 norm "l2_norm_d" back to CPU "l2_norm_h" on
             // "compute_stream[dev_id]"
-            CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_h[dev_id], l2_norm_d[dev_id], sizeof(float),
-                     cudaMemcpyDeviceToHost, compute_stream[dev_id]));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, sizeof(float),
+                                            /*Fill me*/, /*Fill me*/));
         }    
         for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
             const int top = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
@@ -287,20 +285,18 @@ int main(int argc, char* argv[]) {
 
             // TODO: Part 3- Before copying the updated halos to neighbours, ensure the 
             // computation is complete by synchronizing "compute_stream[dev_id]" stream
-            CUDA_RT_CALL(cudaStreamSynchronize(compute_stream[dev_id]));
+            CUDA_RT_CALL(cudaStreamSynchronize(/*Fill me*/));
 
             // Apply periodic boundary conditions
             // TODO: Part 3- Implement halo exchange with top neighbour on current device's 
             // "push_top_stream"
-            CUDA_RT_CALL(cudaMemcpyAsync(a_new[top] + (iy_end[top] * nx),
-                                         a_new[dev_id] + iy_start[dev_id] * nx, nx * sizeof(float),
-                                         cudaMemcpyDeviceToDevice, push_top_stream[dev_id]));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, nx * sizeof(float),
+                                         /*Fill me*/, /*Fill me*/));
 
             // TODO: Part 3- Implement halo exchange with "bottom" neighbour on current device's 
             // "push_bottom_stream"
-            CUDA_RT_CALL(cudaMemcpyAsync(a_new[bottom], a_new[dev_id] + (iy_end[dev_id] - 1) * nx,
-                                         nx * sizeof(float), cudaMemcpyDeviceToDevice,
-                                         push_bottom_stream[dev_id]));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, nx * sizeof(float),
+                                         /*Fill me*/, /*Fill me*/));
         }
         l2_norm = 0.0;
         for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
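
The TODOs above queue a kernel and an asynchronous device-to-host copy on the same per-GPU stream and later synchronize only that stream. A minimal standalone sketch of this stream-ordering pattern (the kernel and sizes are illustrative):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float* x, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    int main() {
        const int n = 1 << 20;
        float *d = nullptr, *h = nullptr;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        cudaMallocHost(&h, n * sizeof(float));     // pinned host memory so the copy can be truly async

        cudaStream_t compute_stream;
        cudaStreamCreate(&compute_stream);

        // Both operations are issued on compute_stream, so they execute in order
        // without blocking work queued on any other stream.
        scale<<<(n + 255) / 256, 256, 0, compute_stream>>>(d, 2.0f, n);
        cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, compute_stream);

        cudaStreamSynchronize(compute_stream);     // wait only for this stream's work
        printf("h[0] = %f\n", h[0]);

        cudaStreamDestroy(compute_stream);
        cudaFreeHost(h);
        cudaFree(d);
        return 0;
    }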

+ 16 - 20
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/cuda/jacobi_streams_events.cu

@@ -265,11 +265,8 @@ int main(int argc, char* argv[]) {
             // neighbours are not copied to "dev_id". The "push_top_done" and "push_bottom_done" 
             // events are to monitored for "bottom" and "top" neighbours, respectively for the 
             // previous iteration denoted by "iter % 2".
-            // Note that there should be 2 distinct cudaStreamWaitEvent calls.
-            CUDA_RT_CALL(
-                cudaStreamWaitEvent(compute_stream[dev_id], push_top_done[(iter % 2)][bottom], 0));
-            CUDA_RT_CALL(
-                cudaStreamWaitEvent(compute_stream[dev_id], push_bottom_done[(iter % 2)][top], 0));
+            CUDA_RT_CALL(cudaStreamWaitEvent(/*Fill me*/, /*Fill me*/, 0));
+            CUDA_RT_CALL(cudaStreamWaitEvent(/*Fill me*/, /*Fill me*/, 0));
 
             dim3 dim_grid((nx + BLOCK_DIM_X - 1) / BLOCK_DIM_X,
                           (chunk_size[dev_id] + BLOCK_DIM_Y - 1) / BLOCK_DIM_Y, 1);
@@ -280,10 +277,10 @@ int main(int argc, char* argv[]) {
 
             // TODO: Part 4- Record that Jacobi computation on "compute_stream" is done by using
             // cudaEventRecord for "compute_done" event for "dev_id"
-            CUDA_RT_CALL(cudaEventRecord(compute_done[dev_id], compute_stream[dev_id]));
+            CUDA_RT_CALL(cudaEventRecord(/*Fill me*/, /*Fill me*/));
 
-            CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_h[dev_id], l2_norm_d[dev_id], sizeof(float),
-                     cudaMemcpyDeviceToHost, compute_stream[dev_id]));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, sizeof(float),
+                                            /*Fill me*/, /*Fill me*/));
         }    
         for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
             const int top = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
@@ -294,28 +291,27 @@ int main(int argc, char* argv[]) {
             // TODO: Part 4- Wait for the Jacobi computation of "dev_id" to complete by using the
             // "compute_done" event on "push_top_stream" so that the top halo isn't copied to the
             // neighbour before computation is done
-            CUDA_RT_CALL(cudaStreamWaitEvent(push_top_stream[dev_id], compute_done[dev_id], 0));
-            CUDA_RT_CALL(cudaMemcpyAsync(a_new[top] + (iy_end[top] * nx),
-                                         a_new[dev_id] + iy_start[dev_id] * nx, nx * sizeof(float),
-                                         cudaMemcpyDeviceToDevice, push_top_stream[dev_id]));
+            CUDA_RT_CALL(cudaStreamWaitEvent(/*Fill me*/, /*Fill me*/, 0));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, nx * sizeof(float),
+                                         /*Fill me*/, /*Fill me*/));
+
             // TODO: Part 4- Record completion of top halo copy from "dev_id" to its neighbour
             // to be used in next iteration. Record the event for "push_top_done" stream of 
             // "dev_id" for next iteration which is "(iter+1) % 2"
-            CUDA_RT_CALL(
-                cudaEventRecord(push_top_done[((iter + 1) % 2)][dev_id], push_top_stream[dev_id]));
+            CUDA_RT_CALL(cudaEventRecord(/*Fill me*/, /*Fill me*/));
 
             // TODO: Part 4- Wait for the Jacobi computation of "dev_id" to complete by using the
             // "compute_done" event on "push_bottom_stream" so that the bottom halo isn't copied to
             // the neighbour before computation is done
-            CUDA_RT_CALL(cudaStreamWaitEvent(push_bottom_stream[dev_id], compute_done[dev_id], 0));
-            CUDA_RT_CALL(cudaMemcpyAsync(a_new[bottom], a_new[dev_id] + (iy_end[dev_id] - 1) * nx,
-                                         nx * sizeof(float), cudaMemcpyDeviceToDevice,
-                                         push_bottom_stream[dev_id]));
+            CUDA_RT_CALL(cudaStreamWaitEvent(/*Fill me*/, /*Fill me*/, 0));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, nx * sizeof(float),
+                                         /*Fill me*/, /*Fill me*/));
+                                         
             // TODO: Part 4- Record completion of bottom halo copy from "dev_id" to its neighbour
             // to be used in next iteration. Record the event for "push_bottom_done" stream of 
             // "dev_id" for next iteration which is "(iter+1) % 2"
-            CUDA_RT_CALL(cudaEventRecord(push_bottom_done[((iter + 1) % 2)][dev_id],
-                                         push_bottom_stream[dev_id]));
+            CUDA_RT_CALL(cudaEventRecord(/*Fill me*/, /*Fill me*/));
         }
         l2_norm = 0.0;
         for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
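
The event-based TODOs above follow one pattern: record an event on the producing stream, then have the consuming stream wait on that event instead of synchronizing the whole device. A minimal sketch of the pattern with two streams (names are illustrative, not the lab code):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(float* x) { x[0] += 1.0f; }

    int main() {
        float* d = nullptr;
        cudaMalloc(&d, sizeof(float));
        cudaMemset(d, 0, sizeof(float));

        cudaStream_t compute_stream, copy_stream;
        cudaStreamCreate(&compute_stream);
        cudaStreamCreate(&copy_stream);

        cudaEvent_t compute_done;
        cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming);

        work<<<1, 1, 0, compute_stream>>>(d);
        cudaEventRecord(compute_done, compute_stream);      // mark the point the copy must wait for
        cudaStreamWaitEvent(copy_stream, compute_done, 0);   // copy_stream stalls until the event fires

        float h = 0.0f;
        cudaMemcpyAsync(&h, d, sizeof(float), cudaMemcpyDeviceToHost, copy_stream);
        cudaStreamSynchronize(copy_stream);
        printf("h = %f\n", h);

        cudaEventDestroy(compute_done);
        cudaStreamDestroy(copy_stream);
        cudaStreamDestroy(compute_stream);
        cudaFree(d);
        return 0;
    }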

+ 11 - 8
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi_cuda_aware_mpi.cpp

@@ -207,20 +207,23 @@ int main(int argc, char* argv[]) {
 	    CUDA_RT_CALL(cudaDeviceSynchronize());
 
         nvtxRangePush("Halo exchange CUDA-aware MPI");
-        // TODO: Part 2- 
-        MPI_CALL(MPI_Sendrecv(a_new + iy_start * nx, nx, MPI_FLOAT, top, 0,
-                              a_new + (iy_end * nx), nx, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD,
+        // TODO: Part 2- Implement top halo exchange. Use only GPU buffers in the MPI call's 
+        // function arguments.
+        MPI_CALL(MPI_Sendrecv(/*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0,
+                              /*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0, MPI_COMM_WORLD,
                               MPI_STATUS_IGNORE));
         nvtxRangePop(); 
 
         nvtxRangePush("Halo exchange CUDA-aware MPI");
-        // TODO: Part 2- 
-        MPI_CALL(MPI_Sendrecv(a_new + (iy_end - 1) * nx, nx, MPI_FLOAT, bottom, 0, a_new, nx,
-                              MPI_FLOAT, top, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE));
+        // TODO: Part 2- Implement bottom halo exchange. Use only GPU buffers in the MPI call's 
+        // function arguments.
+        MPI_CALL(MPI_Sendrecv(/*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0, 
+                                /*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0, MPI_COMM_WORLD, 
+                                MPI_STATUS_IGNORE));
         nvtxRangePop(); 
 
-        // TODO: Part 2- 
-        MPI_CALL(MPI_Allreduce(l2_norm_h, &l2_norm, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD));
+        // TODO: Part 2- Reduce the rank-local L2 Norm to a global L2 norm
+        MPI_CALL(MPI_Allreduce(/*Fill me*/, /*Fill me*/, 1, MPI_FLOAT, /*Fill me*/, MPI_COMM_WORLD));
         l2_norm = std::sqrt(l2_norm);
         
         iter++;
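
The TODOs above assume a CUDA-aware MPI build, so device pointers are passed directly to `MPI_Sendrecv` and `MPI_Allreduce` with no staging through host buffers. A minimal sketch of that usage (one GPU per rank and the buffer names are assumptions; this is not the exercise solution):

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <mpi.h>

    int main(int argc, char* argv[]) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int nx = 1024;
        const int top = rank > 0 ? rank - 1 : size - 1;   // periodic neighbours
        const int bottom = (rank + 1) % size;

        float* d_rows = nullptr;                          // [0, nx) = send row, [nx, 2*nx) = recv row
        cudaSetDevice(0);
        cudaMalloc(&d_rows, 2 * nx * sizeof(float));
        cudaMemset(d_rows, 0, 2 * nx * sizeof(float));

        // Halo exchange straight from device memory: CUDA-aware MPI detects the GPU pointers.
        MPI_Sendrecv(d_rows, nx, MPI_FLOAT, top, 0,
                     d_rows + nx, nx, MPI_FLOAT, bottom, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Reduce a rank-local scalar into a global sum available on every rank.
        float local_norm = 1.0f, global_norm = 0.0f;
        MPI_Allreduce(&local_norm, &global_norm, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0) printf("global sum = %f\n", global_norm);

        cudaFree(d_rows);
        MPI_Finalize();
        return 0;
    }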

+ 11 - 14
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi_memcpy_mpi.cpp

@@ -115,14 +115,13 @@ int main(int argc, char* argv[]) {
 
     int local_rank = -1;
     // TODO: Part 1- Obtain the node-level local rank by splitting the global communicator
-    // Make sure ot free the local communicator after its use
+    // Free the local communicator after its use
     MPI_Comm local_comm;
-    MPI_CALL(MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
-                                    &local_comm));
+    MPI_CALL(MPI_Comm_split_type(/*Fill me*/));
 
-    MPI_CALL(MPI_Comm_rank(local_comm, &local_rank));
+    MPI_CALL(MPI_Comm_rank(/*Fill me*/));
 
-    MPI_CALL(MPI_Comm_free(&local_comm));
+    MPI_CALL(MPI_Comm_free(/*Fill me*/));
 
     CUDA_RT_CALL(cudaSetDevice(local_rank % num_devices));
     CUDA_RT_CALL(cudaFree(0));
@@ -218,8 +217,8 @@ int main(int argc, char* argv[]) {
                                 cudaMemcpyDeviceToHost));
         // TODO: Part 1- Implement the first set of halo exchanges using MPI_SendRecv explained 
         // in the Jupyter Notebook. Observe the Memcpy operations above and below this comment
-        MPI_CALL(MPI_Sendrecv(top_halo_buf, nx, MPI_FLOAT, top, 0,
-                              bot_halo_buf, nx, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD,
+        MPI_CALL(MPI_Sendrecv(/*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0,
+                              /*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0, MPI_COMM_WORLD,
                               MPI_STATUS_IGNORE));
         CUDA_RT_CALL(cudaMemcpy(a_new + (iy_end * nx), bot_halo_buf, nx * sizeof(float), 
                                 cudaMemcpyHostToDevice));
@@ -229,17 +228,15 @@ int main(int argc, char* argv[]) {
         // Second set of halo exchanges
         // TODO: Part 1- Implement the Memcpy operations and MPI calls for the second set of
         // halo exchanges
-        CUDA_RT_CALL(cudaMemcpy(bot_halo_buf, a_new + (iy_end - 1) * nx, nx * sizeof(float), 
-                                cudaMemcpyDeviceToHost));
-        MPI_CALL(MPI_Sendrecv(bot_halo_buf, nx, MPI_FLOAT, bottom, 0, 
-                                top_halo_buf, nx, MPI_FLOAT, top, 0, MPI_COMM_WORLD, 
+        CUDA_RT_CALL(cudaMemcpy(/*Fill me*/, /*Fill me*/, nx * sizeof(float), /*Fill me*/));
+        MPI_CALL(MPI_Sendrecv(/*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0, 
+                                /*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0, MPI_COMM_WORLD, 
                                 MPI_STATUS_IGNORE));
-        CUDA_RT_CALL(cudaMemcpy(a_new, top_halo_buf, nx * sizeof(float), 
-                                cudaMemcpyHostToDevice));
+        CUDA_RT_CALL(cudaMemcpy(/*Fill me*/, /*Fill me*/, nx * sizeof(float), /*Fill me*/));
         nvtxRangePop();                        
 
         // TODO: Part 1- Reduce the rank-local L2 Norm to a global L2 norm using MPI_Allreduce
-        MPI_CALL(MPI_Allreduce(l2_norm_h, &l2_norm, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD));
+        MPI_CALL(MPI_Allreduce(/*Fill me*/, /*Fill me*/, 1, MPI_FLOAT, /*Fill me*/, MPI_COMM_WORLD));
         
         l2_norm = std::sqrt(l2_norm);
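
The first TODO in this file derives a node-local rank by splitting `MPI_COMM_WORLD` into per-node shared-memory communicators, which is then used to select a GPU. A minimal standalone sketch of that call sequence (illustrative, not the exercise solution):

    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char* argv[]) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Group ranks that can share memory, i.e. ranks running on the same node.
        MPI_Comm local_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                            MPI_INFO_NULL, &local_comm);

        int local_rank = -1;
        MPI_Comm_rank(local_comm, &local_rank);   // rank within this node only
        MPI_Comm_free(&local_comm);               // free the communicator after use

        printf("global rank %d -> node-local rank %d\n", rank, local_rank);
        // A typical next step is cudaSetDevice(local_rank % num_devices_on_node).

        MPI_Finalize();
        return 0;
    }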
 

+ 8 - 26
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/Makefile

@@ -1,42 +1,24 @@
 # Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
-NP ?= 1
 NVCC=nvcc
 MPICXX=mpicxx
 MPIRUN ?= mpirun
 #CUDA_HOME ?= /usr/local/cuda
-#NCCL_HOME ?= /usr
-GENCODE_SM30	:= -gencode arch=compute_30,code=sm_30
-GENCODE_SM35	:= -gencode arch=compute_35,code=sm_35
-GENCODE_SM37	:= -gencode arch=compute_37,code=sm_37
-GENCODE_SM50	:= -gencode arch=compute_50,code=sm_50
-GENCODE_SM52	:= -gencode arch=compute_52,code=sm_52
-GENCODE_SM60    := -gencode arch=compute_60,code=sm_60
+#NCCL_HOME ?= /usr/nccl/
 GENCODE_SM70    := -gencode arch=compute_70,code=sm_70
-GENCODE_SM80    := -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80
+GENCODE_SM80    := -gencode arch=compute_80,code=sm_80
 GENCODE_FLAGS	:= $(GENCODE_SM70) $(GENCODE_SM80)
-ifdef DISABLE_CUB
-        NVCC_FLAGS = -Xptxas --optimize-float-atomics
-else
-        NVCC_FLAGS = -DHAVE_CUB
-endif
+
 NVCC_FLAGS += -lineinfo $(GENCODE_FLAGS) -std=c++14
-MPICXX_FLAGS = -DUSE_NVTX -I$(CUDA_HOME)/include -I$(NCCL_HOME)/include -std=c++14
+MPICXX_FLAGS = -DUSE_NVTX -I$(CUDA_HOME)/include -I$(NCCL_HOME)/include -fopenmp -std=c++14
 LD_FLAGS = -L$(CUDA_HOME)/lib64 -lcudart -lnvToolsExt -lnccl
-jacobi: Makefile jacobi.cpp jacobi_kernels.o
-	$(MPICXX) $(MPICXX_FLAGS) jacobi.cpp jacobi_kernels.o $(LD_FLAGS) -o jacobi
+
+jacobi_nccl: Makefile jacobi_nccl.cpp jacobi_kernels.o
+	$(MPICXX) $(MPICXX_FLAGS) jacobi_nccl.cpp jacobi_kernels.o $(LD_FLAGS) -o jacobi_nccl
 
 jacobi_kernels.o: Makefile jacobi_kernels.cu
 	$(NVCC) $(NVCC_FLAGS) jacobi_kernels.cu -c
 
 .PHONY.: clean
 clean:
-	rm -f jacobi jacobi_kernels.o *.qdrep jacobi.*.compute-sanitizer.log
-
-sanitize: jacobi
-	$(MPIRUN) -np $(NP) compute-sanitizer --log-file jacobi.%q{OMPI_COMM_WORLD_RANK}.compute-sanitizer.log ./jacobi -niter 10
-
-run: jacobi
-	$(MPIRUN) -np $(NP) ./jacobi
+	rm -rf jacobi_nccl jacobi_kernels.o *.qdrep *.sqlite
 
-profile: jacobi
-	$(MPIRUN) -np $(NP) nsys profile --trace=mpi,cuda,nvtx -o jacobi.%q{OMPI_COMM_WORLD_RANK} ./jacobi -niter 10

+ 38 - 53
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/jacobi_kernels.cu

@@ -26,9 +26,8 @@
  */
 #include <cstdio>
 
-#ifdef HAVE_CUB
-#include <cub/block/block_reduce.cuh>
-#endif  // HAVE_CUB
+#define BLOCK_DIM_X 32
+#define BLOCK_DIM_Y 32
 
 #define CUDA_RT_CALL(call)                                                                  \
     {                                                                                       \
@@ -41,19 +40,10 @@
                     #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
     }
 
-#ifdef USE_DOUBLE
-typedef double real;
-#define MPI_REAL_TYPE MPI_DOUBLE
-#else
-typedef float real;
-#define MPI_REAL_TYPE MPI_FLOAT
-#endif
-
-__global__ void initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
-                                      const real pi, const int offset, const int nx,
-                                      const int my_ny, const int ny) {
+__global__ void initialize_boundaries(float*  a_new, float*  a, const float pi, const int offset, 
+                    const int nx, const int my_ny, const int ny) {
     for (int iy = blockIdx.x * blockDim.x + threadIdx.x; iy < my_ny; iy += blockDim.x * gridDim.x) {
-        const real y0 = sin(2.0 * pi * (offset + iy) / (ny - 1));
+        const float y0 = sin(2.0 * pi * (offset + iy) / (ny - 1));
         a[iy * nx + 0] = y0;
         a[iy * nx + (nx - 1)] = y0;
         a_new[iy * nx + 0] = y0;
@@ -61,53 +51,48 @@ __global__ void initialize_boundaries(real* __restrict__ const a_new, real* __re
     }
 }
 
-void launch_initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
-                                  const real pi, const int offset, const int nx, const int my_ny,
-                                  const int ny) {
-    initialize_boundaries<<<my_ny / 128 + 1, 128>>>(a_new, a, pi, offset, nx, my_ny, ny);
-    CUDA_RT_CALL(cudaGetLastError());
-}
-
-template <int BLOCK_DIM_X, int BLOCK_DIM_Y>
-__global__ void jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
-                              real* __restrict__ const l2_norm, const int iy_start,
-                              const int iy_end, const int nx, const bool calculate_norm) {
-#ifdef HAVE_CUB
-    typedef cub::BlockReduce<real, BLOCK_DIM_X, cub::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y>
-        BlockReduce;
-    __shared__ typename BlockReduce::TempStorage temp_storage;
-#endif  // HAVE_CUB
+__global__ void jacobi_kernel(float*  a_new, const float*  a, float*  l2_norm, const int iy_start,
+                              const int iy_end, const int nx) {
     int iy = blockIdx.y * blockDim.y + threadIdx.y + iy_start;
     int ix = blockIdx.x * blockDim.x + threadIdx.x + 1;
-    real local_l2_norm = 0.0;
+    __shared__ float block_l2_sum[BLOCK_DIM_X*BLOCK_DIM_Y];
+    unsigned thread_index = threadIdx.y*BLOCK_DIM_X + threadIdx.x;
 
     if (iy < iy_end && ix < (nx - 1)) {
-        const real new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
+        // Update grid point
+        const float new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
                                      a[(iy + 1) * nx + ix] + a[(iy - 1) * nx + ix]);
         a_new[iy * nx + ix] = new_val;
-        if (calculate_norm) {
-            real residue = new_val - a[iy * nx + ix];
-            local_l2_norm += residue * residue;
+        float residue = new_val - a[iy * nx + ix];
+        // Set block-level L2 norm value for this grid point
+        block_l2_sum[thread_index] = residue * residue;
+    }
+    else {
+        block_l2_sum[thread_index] = 0;
+    }
+    // Reduce L2 norm for the block in parallel
+    for (unsigned stride = 1; stride < BLOCK_DIM_X*BLOCK_DIM_Y; stride *= 2) {
+        __syncthreads();
+        if ((thread_index) % (2*stride) == 0) {
+            block_l2_sum[thread_index] += block_l2_sum[thread_index + stride];
         }
     }
-    if (calculate_norm) {
-#ifdef HAVE_CUB
-        real block_l2_norm = BlockReduce(temp_storage).Sum(local_l2_norm);
-        if (0 == threadIdx.y && 0 == threadIdx.x) atomicAdd(l2_norm, block_l2_norm);
-#else
-        atomicAdd(l2_norm, local_l2_norm);
-#endif  // HAVE_CUB
+    // Atomically update global L2 norm with block-reduced L2 norm
+    if (thread_index == 0) {
+        atomicAdd(l2_norm, block_l2_sum[0]);
     }
 }
 
-void launch_jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
-                          real* __restrict__ const l2_norm, const int iy_start, const int iy_end,
-                          const int nx, const bool calculate_norm, cudaStream_t stream) {
-    constexpr int dim_block_x = 32;
-    constexpr int dim_block_y = 32;
-    dim3 dim_grid((nx + dim_block_x - 1) / dim_block_x,
-                  ((iy_end - iy_start) + dim_block_y - 1) / dim_block_y, 1);
-    jacobi_kernel<dim_block_x, dim_block_y><<<dim_grid, {dim_block_x, dim_block_y, 1}, 0, stream>>>(
-        a_new, a, l2_norm, iy_start, iy_end, nx, calculate_norm);
-    CUDA_RT_CALL(cudaGetLastError());
+void launch_initialize_boundaries(float*  a_new, float*  a, const float pi, const int offset, 
+                                    const int nx, const int my_ny, const int ny) {
+    initialize_boundaries<<<my_ny / 128 + 1, 128>>>(a_new, a, pi, offset, nx, my_ny, ny);
 }
+
+void launch_jacobi_kernel(float*  a_new, const float*  a, float*  l2_norm, const int iy_start,
+                              const int iy_end, const int nx, cudaStream_t stream) {
+    dim3 dim_block(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
+    dim3 dim_grid((nx + BLOCK_DIM_X - 1) / BLOCK_DIM_X,
+                  ((iy_end - iy_start) + BLOCK_DIM_Y - 1) / BLOCK_DIM_Y, 1);
+    jacobi_kernel<<<dim_grid, dim_block, 0, stream>>>(a_new, a, l2_norm, iy_start, iy_end, nx);
+}
+

+ 406 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/jacobi_nccl.cpp

@@ -0,0 +1,406 @@
+/* Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NVIDIA CORPORATION nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <algorithm>
+#include <cmath>
+#include <cstdio>
+#include <iostream>
+#include <sstream>
+#include <cuda_runtime.h>
+#include <nvToolsExt.h>
+#include <mpi.h>
+#include <omp.h>
+#include <nccl.h>
+
+#define MPI_CALL(call)                                                                \
+    {                                                                                 \
+        int mpi_status = call;                                                        \
+        if (0 != mpi_status) {                                                        \
+            char mpi_error_string[MPI_MAX_ERROR_STRING];                              \
+            int mpi_error_string_length = 0;                                          \
+            MPI_Error_string(mpi_status, mpi_error_string, &mpi_error_string_length); \
+            if (NULL != mpi_error_string)                                             \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %s "                                                    \
+                        "(%d).\n",                                                    \
+                        #call, __LINE__, __FILE__, mpi_error_string, mpi_status);     \
+            else                                                                      \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %d.\n",                                                 \
+                        #call, __LINE__, __FILE__, mpi_status);                       \
+        }                                                                             \
+    }
+
+#define CUDA_RT_CALL(call)                                                                  \
+    {                                                                                       \
+        cudaError_t cudaStatus = call;                                                      \
+        if (cudaSuccess != cudaStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
+    }
+
+#define NCCL_CALL(call)                                                                     \
+    {                                                                                       \
+        ncclResult_t  ncclStatus = call;                                                    \
+        if (ncclSuccess != ncclStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: NCCL call \"%s\" in line %d of file %s failed "                 \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, ncclGetErrorString(ncclStatus), ncclStatus); \
+    }
+
+constexpr float tol = 1.0e-8;
+
+const float PI = 2.0 * std::asin(1.0);
+
+void launch_jacobi_kernel(float*  a_new, const float*  a, float*  l2_norm, const int iy_start,
+                              const int iy_end, const int nx, cudaStream_t stream);
+
+void launch_initialize_boundaries(float*  a_new, float*  a, const float pi, const int offset, 
+                                    const int nx, const int my_ny, const int ny);
+
+double single_gpu(const int nx, const int ny, const int iter_max, 
+                    float* const a_ref_h, bool print);
+
+int get_argval(char** begin, char** end, const std::string& arg, const int default_val) {
+    int argval = default_val;
+    char** itr = std::find(begin, end, arg);
+    if (itr != end && ++itr != end) {
+        std::istringstream inbuf(*itr);
+        inbuf >> argval;
+    }
+    return argval;
+}
+
+bool get_arg(char** begin, char** end, const std::string& arg) {
+    char** itr = std::find(begin, end, arg);
+    if (itr != end) {
+        return true;
+    }
+    return false;
+}
+
+int main(int argc, char* argv[]) {
+    MPI_CALL(MPI_Init(&argc, &argv));
+    int rank;
+    MPI_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
+    int size;
+    MPI_CALL(MPI_Comm_size(MPI_COMM_WORLD, &size));
+
+    ncclUniqueId nccl_uid;
+    if (rank == 0) NCCL_CALL(ncclGetUniqueId(&nccl_uid));
+    MPI_CALL(MPI_Bcast(&nccl_uid, sizeof(ncclUniqueId), MPI_BYTE, 0, MPI_COMM_WORLD));
+
+    const int iter_max = get_argval(argv, argv + argc, "-niter", 1000);
+    const int nx = get_argval(argv, argv + argc, "-nx", 16384);
+    const int ny = get_argval(argv, argv + argc, "-ny", 16384);
+    const bool skip_single_gpu = get_arg(argv, argv + argc, "-skip_single_gpu");
+
+    int local_rank = -1;
+    {
+        MPI_Comm local_comm;
+        MPI_CALL(MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
+                                     &local_comm));
+
+        MPI_CALL(MPI_Comm_rank(local_comm, &local_rank));
+
+        MPI_CALL(MPI_Comm_free(&local_comm));
+    }
+
+    CUDA_RT_CALL(cudaSetDevice(local_rank));
+    CUDA_RT_CALL(cudaFree(0));
+
+    ncclComm_t nccl_comm;
+    NCCL_CALL(ncclCommInitRank(&nccl_comm, size, nccl_uid, rank));
+
+
+    int nccl_version = 0;
+    NCCL_CALL(ncclGetVersion(&nccl_version));
+    if ( nccl_version < 2800 ) {
+        fprintf(stderr,"ERROR NCCL 2.8 or newer is required.\n");
+        NCCL_CALL(ncclCommDestroy(nccl_comm));
+        MPI_CALL(MPI_Finalize());
+        return 1;
+    }
+
+    float* a_ref_h;
+    CUDA_RT_CALL(cudaMallocHost(&a_ref_h, nx * ny * sizeof(float)));
+    float* a_h;
+    CUDA_RT_CALL(cudaMallocHost(&a_h, nx * ny * sizeof(float)));
+    
+    double runtime_serial = 1;
+    if (!skip_single_gpu){
+        runtime_serial = single_gpu(nx, ny, iter_max, a_ref_h, rank == 0);
+    }
+    // ny - 2 rows are distributed amongst `size` ranks in such a way
+    // that each rank gets either (ny - 2) / size or (ny - 2) / size + 1 rows.
+    // This optimizes load balancing when (ny - 2) % size != 0
+    int chunk_size;
+    int chunk_size_low = (ny - 2) / size;
+    int chunk_size_high = chunk_size_low + 1;
+    // To calculate the number of ranks that need to compute an extra row,
+    // the following formula is derived from this equation:
+    // num_ranks_low * chunk_size_low + (size - num_ranks_low) * (chunk_size_low + 1) = ny - 2
+    int num_ranks_low = size * chunk_size_low + size -
+                        (ny - 2);  // Number of ranks with chunk_size = chunk_size_low
+    if (rank < num_ranks_low)
+        chunk_size = chunk_size_low;
+    else
+        chunk_size = chunk_size_high;
+
+    float* a;
+    CUDA_RT_CALL(cudaMalloc(&a, nx * (chunk_size + 2) * sizeof(float)));
+    float* a_new;
+    CUDA_RT_CALL(cudaMalloc(&a_new, nx * (chunk_size + 2) * sizeof(float)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * (chunk_size + 2) * sizeof(float)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * (chunk_size + 2) * sizeof(float)));
+
+    // Calculate local domain boundaries
+    int iy_start_global;  // My start index in the global array
+    if (rank < num_ranks_low) {
+        iy_start_global = rank * chunk_size_low + 1;
+    } else {
+        iy_start_global =
+            num_ranks_low * chunk_size_low + (rank - num_ranks_low) * chunk_size_high + 1;
+    }
+    int iy_end_global = iy_start_global + chunk_size - 1;  // My last index in the global array
+
+    int iy_start = 1;
+    int iy_end = iy_start + chunk_size;
+
+    // Set Dirichlet boundary conditions on left and right border
+    launch_initialize_boundaries(a, a_new, PI, iy_start_global - 1, nx, (chunk_size + 2), ny);
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    float* l2_norm_d;
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(float)));
+    float* l2_global_norm_d;
+    CUDA_RT_CALL(cudaMalloc(&l2_global_norm_d, sizeof(float)));
+    float* l2_norm_h;
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(float)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (0 == rank) {
+        printf("Jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+    }
+
+    int iter = 0;
+    float l2_norm = 1.0;
+
+    MPI_CALL(MPI_Barrier(MPI_COMM_WORLD));
+    double start = MPI_Wtime();
+    nvtxRangePush("Jacobi Solve Multi-GPU");
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemsetAsync(l2_norm_d, 0, sizeof(float)));
+
+        launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, iy_end, nx, 0);
+
+        const int top = rank > 0 ? rank - 1 : (size - 1);
+        const int bottom = (rank + 1) % size;
+
+        // TODO: Reduce the device-local L2 norm, "l2_norm_d" to the global L2 norm on each device,
+        // "l2_global_norm_d", using ncclAllReduce() function. Use "ncclSum" as the reduction operation.
+        // Make sure to encapsulate this function call within NCCL group calls.
+        // Use "0" in the stream parameter function argument.
+        NCCL_CALL(/*Fill me*/);
+        NCCL_CALL(ncclAllReduce(/*Fill me*/, /*Fill me*/, 1, ncclFloat, /*Fill me*/, nccl_comm, 0));
+        NCCL_CALL(/*Fill me*/);
+
+        // TODO: Transfer the global L2 norm from each device to the host using cudaMemcpyAsync
+        CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, sizeof(float), /*Fill me*/));
+
+        // Apply periodic boundary conditions
+        NCCL_CALL(ncclGroupStart());
+        
+        //TODO: Perform the first set of halo exchanges by:
+        // 1. Receiving the top halo from the "top" neighbour into the "a_new" device memory array location. 
+        // 2. Sending current device's bottom halo to "bottom" neighbour from the "a_new + (iy_end - 1) * nx"
+        //    device memory array location.
+        // Use "0" in the stream parameter function argument.
+        NCCL_CALL(ncclRecv(/*Fill me*/, nx, ncclFloat, /*Fill me*/, nccl_comm, 0));
+        NCCL_CALL(ncclSend(/*Fill me*/, nx, ncclFloat, /*Fill me*/, nccl_comm, 0));
+
+        //TODO: Perform the second set of halo exchanges by:
+        // 1. Receiving the bottom halo from the "bottom" neighbour into the "a_new + (iy_end * nx)" 
+        //    device memory array location. 
+        // 2. Sending current device's top halo to "top" neighbour from the "a_new + iy_start * nx"
+        //    device memory array location.
+        // Use "0" in the stream parameter function argument.
+        NCCL_CALL(ncclRecv(/*Fill me*/, nx, ncclFloat, /*Fill me*/, nccl_comm, 0));
+        NCCL_CALL(ncclSend(/*Fill me*/, nx, ncclFloat, /*Fill me*/, nccl_comm, 0));
+
+        NCCL_CALL(ncclGroupEnd());
+
+        // TODO: Synchronize the device before computing the global L2 norm on host for printing
+        CUDA_RT_CALL(/*Fill me*/);
+
+        l2_norm = *l2_norm_h;
+        l2_norm = std::sqrt(l2_norm);
+
+        iter++;
+        if (0 == rank && (iter % 100) == 0) {
+            printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+
+        std::swap(a_new, a);
+    }
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    double stop = MPI_Wtime();
+    nvtxRangePop();
+
+    CUDA_RT_CALL(cudaMemcpy(a_h + iy_start_global * nx, a + nx,
+                            std::min((ny - iy_start_global) * nx, chunk_size * nx) * sizeof(float),
+                            cudaMemcpyDeviceToHost));
+
+    int result_correct = 1;
+    if (!skip_single_gpu) {
+        for (int iy = iy_start_global; result_correct && (iy < iy_end_global); ++iy) {
+            for (int ix = 1; result_correct && (ix < (nx - 1)); ++ix) {
+                if (std::fabs(a_ref_h[iy * nx + ix] - a_h[iy * nx + ix]) > tol) {
+                    fprintf(stderr,
+                            "ERROR on rank %d: a[%d * %d + %d] = %f does not match %f "
+                            "(reference)\n",
+                            rank, iy, nx, ix, a_h[iy * nx + ix], a_ref_h[iy * nx + ix]);
+                    result_correct = 0;
+                }
+            }
+        }
+        int global_result_correct = 1;
+        MPI_CALL(MPI_Allreduce(&result_correct, &global_result_correct, 1, MPI_INT, MPI_MIN,
+                            MPI_COMM_WORLD));
+        result_correct = global_result_correct;
+    }
+
+    if (rank == 0 && result_correct) {
+        printf("Num GPUs: %d.\n", size);
+        if (!skip_single_gpu) {
+            printf(
+                "%dx%d: 1 GPU: %8.4f s, %d GPUs: %8.4f s, speedup: %8.2f, "
+                "efficiency: %8.2f \n",
+                nx, ny, runtime_serial, size, (stop - start), runtime_serial / (stop - start),
+                runtime_serial / (size * (stop - start)) * 100);
+        }
+        else {
+            printf("%dx%d: %d GPUs: %8.4f s \n", nx, ny, size, (stop - start)); 
+        }
+    }
+
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+
+    CUDA_RT_CALL(cudaFreeHost(a_h));
+    CUDA_RT_CALL(cudaFreeHost(a_ref_h));
+
+    NCCL_CALL(ncclCommDestroy(nccl_comm));
+
+    MPI_CALL(MPI_Finalize());
+    return (result_correct == 1) ? 0 : 1;
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, float* const a_ref_h, bool print) {
+    float* a;
+    float* a_new;
+
+    float* l2_norm_d;
+    float* l2_norm_h;
+
+    int iy_start = 1;
+    int iy_end = (ny - 1);
+
+    CUDA_RT_CALL(cudaMalloc(&a, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMalloc(&a_new, nx * ny * sizeof(float)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * ny * sizeof(float)));
+
+    // Set Dirichlet boundary conditions on left and right border
+    nvtxRangePush("Init boundaries");
+    launch_initialize_boundaries(a, a_new, PI, 0, nx, ny, ny);
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    nvtxRangePop();
+
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(float)));
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(float)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (print) {
+        printf("Single GPU jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+    }
+
+    int iter = 0;
+    float l2_norm = 1.0;
+
+    double start = omp_get_wtime();
+    nvtxRangePush("Jacobi Solve Single GPU");
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemset(l2_norm_d, 0, sizeof(float)));
+
+        // Compute grid points for this iteration
+        launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, iy_end, nx, 0);
+        CUDA_RT_CALL(cudaGetLastError());
+        CUDA_RT_CALL(cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(float), cudaMemcpyDeviceToHost));
+
+        // Apply periodic boundary conditions
+        CUDA_RT_CALL(cudaMemcpy(a_new, a_new + (iy_end - 1) * nx, nx * sizeof(float),
+                                     cudaMemcpyDeviceToDevice));
+        CUDA_RT_CALL(cudaMemcpy(a_new + iy_end * nx, a_new + iy_start * nx, nx * sizeof(float),
+                                     cudaMemcpyDeviceToDevice));
+
+        CUDA_RT_CALL(cudaDeviceSynchronize());
+        l2_norm = *l2_norm_h;
+        l2_norm = std::sqrt(l2_norm);
+
+        iter++;
+        if ((iter % 100) == 0 && print) {
+            printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+        std::swap(a_new, a);
+    }
+    nvtxRangePop();
+    double stop = omp_get_wtime();
+
+    CUDA_RT_CALL(cudaMemcpy(a_ref_h, a, nx * ny * sizeof(float), cudaMemcpyDeviceToHost));
+
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+    return (stop - start);
+}
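
As a reference for the TODOs in this file, a minimal sketch of the NCCL calls it exercises: bootstrapping the communicator over MPI, an AllReduce, and a grouped send/recv pair, all on the default stream (buffer names and sizes are illustrative; this is not the solution file):

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <mpi.h>
    #include <nccl.h>

    int main(int argc, char* argv[]) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // One GPU per rank is assumed; rank 0 creates the unique id and broadcasts it.
        cudaSetDevice(0);
        ncclUniqueId uid;
        if (rank == 0) ncclGetUniqueId(&uid);
        MPI_Bcast(&uid, sizeof(ncclUniqueId), MPI_BYTE, 0, MPI_COMM_WORLD);

        ncclComm_t comm;
        ncclCommInitRank(&comm, size, uid, rank);

        const int nx = 1024;
        float *local_norm_d, *global_norm_d, *halo_d;
        cudaMalloc(&local_norm_d, sizeof(float));
        cudaMalloc(&global_norm_d, sizeof(float));
        cudaMalloc(&halo_d, 2 * nx * sizeof(float));
        cudaMemset(local_norm_d, 0, sizeof(float));
        cudaMemset(halo_d, 0, 2 * nx * sizeof(float));

        const int top = rank > 0 ? rank - 1 : size - 1;
        const int bottom = (rank + 1) % size;

        // Reduction: every rank contributes its local value and receives the global sum.
        ncclGroupStart();
        ncclAllReduce(local_norm_d, global_norm_d, 1, ncclFloat, ncclSum, comm, 0);
        ncclGroupEnd();

        // Point-to-point ring exchange: fusing the send and receive into one group
        // lets NCCL schedule them together without deadlocking.
        ncclGroupStart();
        ncclRecv(halo_d, nx, ncclFloat, top, comm, 0);
        ncclSend(halo_d + nx, nx, ncclFloat, bottom, comm, 0);
        ncclGroupEnd();

        cudaDeviceSynchronize();   // NCCL calls are asynchronous with respect to the host
        if (rank == 0) printf("NCCL group operations completed on %d ranks\n", size);

        cudaFree(halo_d);
        cudaFree(global_norm_d);
        cudaFree(local_norm_d);
        ncclCommDestroy(comm);
        MPI_Finalize();
        return 0;
    }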

+ 407 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/solution/jacobi_nccl.cpp

@@ -0,0 +1,407 @@
+/* Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NVIDIA CORPORATION nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <algorithm>
+#include <cmath>
+#include <cstdio>
+#include <iostream>
+#include <sstream>
+#include <cuda_runtime.h>
+#include <nvToolsExt.h>
+#include <mpi.h>
+#include <omp.h>
+#include <nccl.h>
+
+#define MPI_CALL(call)                                                                \
+    {                                                                                 \
+        int mpi_status = call;                                                        \
+        if (0 != mpi_status) {                                                        \
+            char mpi_error_string[MPI_MAX_ERROR_STRING];                              \
+            int mpi_error_string_length = 0;                                          \
+            MPI_Error_string(mpi_status, mpi_error_string, &mpi_error_string_length); \
+            if (NULL != mpi_error_string)                                             \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %s "                                                    \
+                        "(%d).\n",                                                    \
+                        #call, __LINE__, __FILE__, mpi_error_string, mpi_status);     \
+            else                                                                      \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %d.\n",                                                 \
+                        #call, __LINE__, __FILE__, mpi_status);                       \
+        }                                                                             \
+    }
+
+#define CUDA_RT_CALL(call)                                                                  \
+    {                                                                                       \
+        cudaError_t cudaStatus = call;                                                      \
+        if (cudaSuccess != cudaStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
+    }
+
+#define NCCL_CALL(call)                                                                     \
+    {                                                                                       \
+        ncclResult_t  ncclStatus = call;                                                    \
+        if (ncclSuccess != ncclStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: NCCL call \"%s\" in line %d of file %s failed "                 \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, ncclGetErrorString(ncclStatus), ncclStatus); \
+    }
+
+constexpr float tol = 1.0e-8;
+
+const float PI = 2.0 * std::asin(1.0);
+
+void launch_jacobi_kernel(float*  a_new, const float*  a, float*  l2_norm, const int iy_start,
+                              const int iy_end, const int nx, cudaStream_t stream);
+
+void launch_initialize_boundaries(float*  a_new, float*  a, const float pi, const int offset, 
+                                    const int nx, const int my_ny, const int ny);
+
+double single_gpu(const int nx, const int ny, const int iter_max, 
+                    float* const a_ref_h, bool print);
+
+int get_argval(char** begin, char** end, const std::string& arg, const int default_val) {
+    int argval = default_val;
+    char** itr = std::find(begin, end, arg);
+    if (itr != end && ++itr != end) {
+        std::istringstream inbuf(*itr);
+        inbuf >> argval;
+    }
+    return argval;
+}
+
+bool get_arg(char** begin, char** end, const std::string& arg) {
+    char** itr = std::find(begin, end, arg);
+    if (itr != end) {
+        return true;
+    }
+    return false;
+}
+
+int main(int argc, char* argv[]) {
+    MPI_CALL(MPI_Init(&argc, &argv));
+    int rank;
+    MPI_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
+    int size;
+    MPI_CALL(MPI_Comm_size(MPI_COMM_WORLD, &size));
+
+    ncclUniqueId nccl_uid;
+    if (rank == 0) NCCL_CALL(ncclGetUniqueId(&nccl_uid));
+    MPI_CALL(MPI_Bcast(&nccl_uid, sizeof(ncclUniqueId), MPI_BYTE, 0, MPI_COMM_WORLD));
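+    // The unique ID is created once on rank 0 and broadcast over MPI so that every rank
+    // can join the same NCCL communicator in ncclCommInitRank() below.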
+
+    const int iter_max = get_argval(argv, argv + argc, "-niter", 1000);
+    const int nx = get_argval(argv, argv + argc, "-nx", 16384);
+    const int ny = get_argval(argv, argv + argc, "-ny", 16384);
+    const bool skip_single_gpu = get_arg(argv, argv + argc, "-skip_single_gpu");
+
+    int local_rank = -1;
+    {
+        MPI_Comm local_comm;
+        MPI_CALL(MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
+                                     &local_comm));
+
+        MPI_CALL(MPI_Comm_rank(local_comm, &local_rank));
+
+        MPI_CALL(MPI_Comm_free(&local_comm));
+    }
+
+    CUDA_RT_CALL(cudaSetDevice(local_rank));
+    CUDA_RT_CALL(cudaFree(0));
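+    // The no-op cudaFree(0) call above forces eager creation of the CUDA context on the
+    // selected device before NCCL initialization.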
+
+    ncclComm_t nccl_comm;
+    NCCL_CALL(ncclCommInitRank(&nccl_comm, size, nccl_uid, rank));
+
+
+    int nccl_version = 0;
+    NCCL_CALL(ncclGetVersion(&nccl_version));
+    if ( nccl_version < 2800 ) {
+        fprintf(stderr, "ERROR: NCCL 2.8 or newer is required.\n");
+        NCCL_CALL(ncclCommDestroy(nccl_comm));
+        MPI_CALL(MPI_Finalize());
+        return 1;
+    }
+
+    float* a_ref_h;
+    CUDA_RT_CALL(cudaMallocHost(&a_ref_h, nx * ny * sizeof(float)));
+    float* a_h;
+    CUDA_RT_CALL(cudaMallocHost(&a_h, nx * ny * sizeof(float)));
+    
+    double runtime_serial = 1;
+    if (!skip_single_gpu){
+        runtime_serial = single_gpu(nx, ny, iter_max, a_ref_h, rank == 0);
+    }
+    // ny - 2 rows are distributed amongst `size` ranks in such a way
+    // that each rank gets either (ny - 2) / size or (ny - 2) / size + 1 rows.
+    // This optimizes load balancing when (ny - 2) % size != 0
+    int chunk_size;
+    int chunk_size_low = (ny - 2) / size;
+    int chunk_size_high = chunk_size_low + 1;
+    // To calculate the number of ranks that need to compute an extra row,
+    // the following formula is derived from this equation:
+    // num_ranks_low * chunk_size_low + (size - num_ranks_low) * (chunk_size_low + 1) = ny - 2
+    int num_ranks_low = size * chunk_size_low + size -
+                        (ny - 2);  // Number of ranks with chunk_size = chunk_size_low
+    if (rank < num_ranks_low)
+        chunk_size = chunk_size_low;
+    else
+        chunk_size = chunk_size_high;
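+    // Illustrative example (not part of the lab code): with ny - 2 = 14 interior rows and
+    // size = 4, chunk_size_low = 3 and num_ranks_low = 4 * 3 + 4 - 14 = 2, so ranks 0-1
+    // compute 3 rows each and ranks 2-3 compute 4 rows each (2 * 3 + 2 * 4 = 14).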
+
+    float* a;
+    CUDA_RT_CALL(cudaMalloc(&a, nx * (chunk_size + 2) * sizeof(float)));
+    float* a_new;
+    CUDA_RT_CALL(cudaMalloc(&a_new, nx * (chunk_size + 2) * sizeof(float)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * (chunk_size + 2) * sizeof(float)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * (chunk_size + 2) * sizeof(float)));
+
+    // Calculate local domain boundaries
+    int iy_start_global;  // My start index in the global array
+    if (rank < num_ranks_low) {
+        iy_start_global = rank * chunk_size_low + 1;
+    } else {
+        iy_start_global =
+            num_ranks_low * chunk_size_low + (rank - num_ranks_low) * chunk_size_high + 1;
+    }
+    int iy_end_global = iy_start_global + chunk_size - 1;  // My last index in the global array
+
+    int iy_start = 1;
+    int iy_end = iy_start + chunk_size;
+
+    // Set Dirichlet boundary conditions on left and right border
+    launch_initialize_boundaries(a, a_new, PI, iy_start_global - 1, nx, (chunk_size + 2), ny);
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    float* l2_norm_d;
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(float)));
+    float* l2_global_norm_d;
+    CUDA_RT_CALL(cudaMalloc(&l2_global_norm_d, sizeof(float)));
+    float* l2_norm_h;
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(float)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (0 == rank) {
+        printf("Jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+    }
+
+    int iter = 0;
+    float l2_norm = 1.0;
+
+    MPI_CALL(MPI_Barrier(MPI_COMM_WORLD));
+    double start = MPI_Wtime();
+    nvtxRangePush("Jacobi Solve Multi-GPU");
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemsetAsync(l2_norm_d, 0, sizeof(float)));
+
+        launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, iy_end, nx, 0);
+
+        const int top = rank > 0 ? rank - 1 : (size - 1);
+        const int bottom = (rank + 1) % size;
+
+        // TODO: Reduce the device-local L2 norm, "l2_norm_d" to the global L2 norm on each device,
+        // "l2_global_norm_d", using ncclAllReduce() function. Use "ncclSum" as the reduction operation.
+        // Make sure to encapsulate this function call within NCCL group calls.
+        // Use "0" in the stream parameter function argument.
+        NCCL_CALL(ncclGroupStart());
+        NCCL_CALL(ncclAllReduce(l2_norm_d, l2_global_norm_d, 1, ncclFloat, ncclSum, nccl_comm, 
+                                    0));
+        NCCL_CALL(ncclGroupEnd());
+
+        // TODO: Transfer the global L2 norm from each device to the host using cudaMemcpyAsync
+        CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_h, l2_global_norm_d, sizeof(float), cudaMemcpyDeviceToHost));
+
+        // Apply periodic boundary conditions
+        NCCL_CALL(ncclGroupStart());
+        
+        //TODO: Perform the first set of halo exchanges by:
+        // 1. Receiving the top halo from the "top" neighbour into the "a_new" device memory array location. 
+        // 2. Sending current device's bottom halo to "bottom" neighbour from the "a_new + (iy_end - 1) * nx"
+        //    device memory array location.
+        // Use "0" in the stream parameter function argument.
+        NCCL_CALL(ncclRecv(a_new,                     nx, ncclFloat, top,    nccl_comm, 0));
+        NCCL_CALL(ncclSend(a_new + (iy_end - 1) * nx, nx, ncclFloat, bottom, nccl_comm, 0));
+
+        //TODO: Perform the second set of halo exchanges by:
+        // 1. Receiving the bottom halo from the "bottom" neighbour into the "a_new + (iy_end * nx)" 
+        //    device memory array location. 
+        // 2. Sending current device's top halo to "top" neighbour from the "a_new + iy_start * nx"
+        //    device memory array location.
+        // Use "0" in the stream parameter function argument.
+        NCCL_CALL(ncclRecv(a_new + (iy_end * nx),     nx, ncclFloat, bottom, nccl_comm, 0));
+        NCCL_CALL(ncclSend(a_new + iy_start * nx,     nx, ncclFloat, top,    nccl_comm, 0));
+
+        NCCL_CALL(ncclGroupEnd());
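+        // Grouping the ncclRecv()/ncclSend() pairs between ncclGroupStart()/ncclGroupEnd()
+        // lets NCCL progress both exchanges together, avoiding deadlock when every rank
+        // both sends and receives in the same step.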
+
+        // TODO: Synchronize the device before computing the global L2 norm on host for printing
+        CUDA_RT_CALL(cudaDeviceSynchronize());
+
+        l2_norm = *l2_norm_h;
+        l2_norm = std::sqrt(l2_norm);
+
+        iter++;
+        if (0 == rank && (iter % 100) == 0) {
+            printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+
+        std::swap(a_new, a);
+    }
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    double stop = MPI_Wtime();
+    nvtxRangePop();
+
+    CUDA_RT_CALL(cudaMemcpy(a_h + iy_start_global * nx, a + nx,
+                            std::min((ny - iy_start_global) * nx, chunk_size * nx) * sizeof(float),
+                            cudaMemcpyDeviceToHost));
+
+    int result_correct = 1;
+    if (!skip_single_gpu) {
+        for (int iy = iy_start_global; result_correct && (iy < iy_end_global); ++iy) {
+            for (int ix = 1; result_correct && (ix < (nx - 1)); ++ix) {
+                if (std::fabs(a_ref_h[iy * nx + ix] - a_h[iy * nx + ix]) > tol) {
+                    fprintf(stderr,
+                            "ERROR on rank %d: a[%d * %d + %d] = %f does not match %f "
+                            "(reference)\n",
+                            rank, iy, nx, ix, a_h[iy * nx + ix], a_ref_h[iy * nx + ix]);
+                    result_correct = 0;
+                }
+            }
+        }
+        int global_result_correct = 1;
+        MPI_CALL(MPI_Allreduce(&result_correct, &global_result_correct, 1, MPI_INT, MPI_MIN,
+                            MPI_COMM_WORLD));
+        result_correct = global_result_correct;
+    }
+
+    if (rank == 0 && result_correct) {
+        printf("Num GPUs: %d.\n", size);
+        if (!skip_single_gpu) {
+            printf(
+                "%dx%d: 1 GPU: %8.4f s, %d GPUs: %8.4f s, speedup: %8.2f, "
+                "efficiency: %8.2f \n",
+                nx, ny, runtime_serial, size, (stop - start), runtime_serial / (stop - start),
+                runtime_serial / (size * (stop - start)) * 100);
+        }
+        else {
+            printf("%dx%d: %d GPUs: %8.4f s \n", nx, ny, size, (stop - start)); 
+        }
+    }
+
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+
+    CUDA_RT_CALL(cudaFreeHost(a_h));
+    CUDA_RT_CALL(cudaFreeHost(a_ref_h));
+
+    NCCL_CALL(ncclCommDestroy(nccl_comm));
+
+    MPI_CALL(MPI_Finalize());
+    return (result_correct == 1) ? 0 : 1;
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, float* const a_ref_h, bool print) {
+    float* a;
+    float* a_new;
+
+    float* l2_norm_d;
+    float* l2_norm_h;
+
+    int iy_start = 1;
+    int iy_end = (ny - 1);
+
+    CUDA_RT_CALL(cudaMalloc(&a, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMalloc(&a_new, nx * ny * sizeof(float)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * ny * sizeof(float)));
+
+    // Set Dirichlet boundary conditions on left and right border
+    nvtxRangePush("Init boundaries");
+    launch_initialize_boundaries(a, a_new, PI, 0, nx, ny, ny);
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    nvtxRangePop();
+
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(float)));
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(float)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (print) {
+        printf("Single GPU jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+    }
+
+    int iter = 0;
+    float l2_norm = 1.0;
+
+    double start = omp_get_wtime();
+    nvtxRangePush("Jacobi Solve Single GPU");
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemset(l2_norm_d, 0, sizeof(float)));
+
+        // Compute grid points for this iteration
+        launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, iy_end, nx, 0);
+        CUDA_RT_CALL(cudaGetLastError());
+        CUDA_RT_CALL(cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(float), cudaMemcpyDeviceToHost));
+
+        // Apply periodic boundary conditions
+        CUDA_RT_CALL(cudaMemcpy(a_new, a_new + (iy_end - 1) * nx, nx * sizeof(float),
+                                     cudaMemcpyDeviceToDevice));
+        CUDA_RT_CALL(cudaMemcpy(a_new + iy_end * nx, a_new + iy_start * nx, nx * sizeof(float),
+                                     cudaMemcpyDeviceToDevice));
+
+        CUDA_RT_CALL(cudaDeviceSynchronize());
+        l2_norm = *l2_norm_h;
+        l2_norm = std::sqrt(l2_norm);
+
+        iter++;
+        if ((iter % 100) == 0 && print) {
+            printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+        std::swap(a_new, a);
+    }
+    nvtxRangePop();
+    double stop = omp_get_wtime();
+
+    CUDA_RT_CALL(cudaMemcpy(a_ref_h, a, nx * ny * sizeof(float), cudaMemcpyDeviceToHost));
+
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+    return (stop - start);
+}

+ 29 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/Makefile

@@ -0,0 +1,29 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+NP ?= 1
+NVCC=nvcc
+MPIRUN ?= mpirun
+CUDA_HOME ?= /usr/local/cuda
+ifndef NVSHMEM_HOME
+$(error NVSHMEM_HOME is not set)
+endif
+ifndef MPI_HOME
+$(error MPI_HOME is not set)
+endif
+GENCODE_SM70    := -gencode arch=compute_70,code=sm_70
+GENCODE_SM80    := -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80
+GENCODE_FLAGS	:= $(GENCODE_SM70) $(GENCODE_SM80)
+
+NVCC_FLAGS += -dc -Xcompiler -fopenmp -lineinfo -lnvToolsExt $(GENCODE_FLAGS) -std=c++14 -I$(NVSHMEM_HOME)/include -I$(MPI_HOME)/include
+NVCC_LDFLAGS = -ccbin=mpic++ -L$(NVSHMEM_HOME)/lib -lnvshmem -L$(MPI_HOME)/lib -lmpi -L$(CUDA_HOME)/lib64 -lcuda -lcudart -lnvToolsExt
+
+left_shift: Makefile left_shift.cu
+	$(NVCC) $(NVCC_FLAGS) left_shift.cu -c -o left_shift.o
+	$(NVCC) $(GENCODE_FLAGS) left_shift.o -o left_shift $(NVCC_LDFLAGS)
+
+jacobi_nvshmem: Makefile jacobi_nvshmem.cu
+	$(NVCC) $(NVCC_FLAGS) jacobi_nvshmem.cu -c -o jacobi_nvshmem.o
+	$(NVCC) $(GENCODE_FLAGS) jacobi_nvshmem.o -o jacobi_nvshmem $(NVCC_LDFLAGS)
+
+.PHONY: clean
+clean:
+	rm -rf jacobi_nvshmem left_shift *.o *.qdrep *.sqlite
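+
+# Illustrative convenience target (not required by the lab): assumes mpirun can launch
+# NVSHMEM programs that are bootstrapped through MPI on this system.
+run_left_shift: left_shift
+	$(MPIRUN) -np $(NP) ./left_shift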

+ 567 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/jacobi_nvshmem.cu

@@ -0,0 +1,567 @@
+/* Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NVIDIA CORPORATION nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <mpi.h>
+#include <nvshmem.h>
+#include <nvshmemx.h>
+#include <algorithm>
+#include <cassert>
+#include <cmath>
+#include <cstdio>
+#include <iostream>
+#include <sstream>
+#include <cub/block/block_reduce.cuh>
+#include <nvToolsExt.h>
+
+#define BLOCK_DIM_X 1024
+#define BLOCK_DIM_Y 1
+
+#define MPI_CALL(call)                                                                \
+    {                                                                                 \
+        int mpi_status = call;                                                        \
+        if (0 != mpi_status) {                                                        \
+            char mpi_error_string[MPI_MAX_ERROR_STRING];                              \
+            int mpi_error_string_length = 0;                                          \
+            MPI_Error_string(mpi_status, mpi_error_string, &mpi_error_string_length); \
+            if (NULL != mpi_error_string)                                             \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %s "                                                    \
+                        "(%d).\n",                                                    \
+                        #call, __LINE__, __FILE__, mpi_error_string, mpi_status);     \
+            else                                                                      \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %d.\n",                                                 \
+                        #call, __LINE__, __FILE__, mpi_status);                       \
+        }                                                                             \
+    }
+
+#define CUDA_RT_CALL(call)                                                                  \
+    {                                                                                       \
+        cudaError_t cudaStatus = call;                                                      \
+        if (cudaSuccess != cudaStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
+    }
+
+// convert NVSHMEM_SYMMETRIC_SIZE string to long long unsigned int
+long long unsigned int parse_nvshmem_symmetric_size(char *value) {
+    long long unsigned int units, size;
+
+    assert(value != NULL);
+
+    if (strchr(value, 'G') != NULL) {
+        units=1e9;
+    } else if (strchr(value, 'M') != NULL) {
+        units=1e6;
+    } else if (strchr(value, 'K') != NULL) {
+        units=1e3;
+    } else {
+        units=1;
+    }
+
+    assert(atof(value) >= 0);
+    size = (long long unsigned int) atof(value) * units;
+
+    return size;
+}
+
+constexpr float tol = 1.0e-8;
+const float PI = 2.0 * std::asin(1.0);
+
+__global__ void initialize_boundaries(float* __restrict__ const a_new, float* __restrict__ const a,
+                                      const float pi, const int offset, const int nx,
+                                      const int my_ny, int ny) {
+    for (int iy = blockIdx.x * blockDim.x + threadIdx.x; iy < my_ny; iy += blockDim.x * gridDim.x) {
+        const float y0 = sin(2.0 * pi * (offset + iy) / (ny - 1));
+        a[(iy + 1) * nx + 0] = y0;
+        a[(iy + 1) * nx + (nx - 1)] = y0;
+        a_new[(iy + 1) * nx + 0] = y0;
+        a_new[(iy + 1) * nx + (nx - 1)] = y0;
+    }
+}
+
+__global__ void jacobi_kernel(float* __restrict__ const a_new, const float* __restrict__ const a,
+                              float* __restrict__ const l2_norm, const int iy_start,
+                              const int iy_end, const int nx, const int top_pe, const int top_iy,
+                              const int bottom_pe, const int bottom_iy) {
+    int iy = blockIdx.y * blockDim.y + threadIdx.y + iy_start;
+    int ix = blockIdx.x * blockDim.x + threadIdx.x + 1;
+    __shared__ float block_l2_sum[BLOCK_DIM_X*BLOCK_DIM_Y];
+    unsigned thread_index = threadIdx.y*BLOCK_DIM_X + threadIdx.x;
+
+    if (iy < iy_end && ix < (nx - 1)) {
+        // Update grid point
+        const float new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
+                                     a[(iy + 1) * nx + ix] + a[(iy - 1) * nx + ix]);
+        a_new[iy * nx + ix] = new_val;
+        float residue = new_val - a[iy * nx + ix];
+        // Set block-level L2 norm value for this grid point
+        block_l2_sum[thread_index] = residue * residue;
+    }
+    else {
+        block_l2_sum[thread_index] = 0;
+    }
+
+    /* starting (x, y) coordinate of the block */
+    int block_iy = iy - threadIdx.y; /* That is, block_iy = blockIdx.y * blockDim.y + iy_start */
+    int block_ix = ix - threadIdx.x; /* That is, block_ix = blockIdx.x * blockDim.x + 1 */
+
+    /* Communicate the boundaries */
+    // TODO: Use block-level NVSHMEM put communication API to transfer the halo pointed to by 
+    // "a_new + iy_start * nx + block_ix" to "a_new + top_iy * nx + block_ix" in the "top_pe"
+    if ((block_iy <= iy_start) && (iy_start < block_iy + blockDim.y)) {
+        nvshmemx_/*Fill me*/(/*Fill me*/, /*Fill me*/,
+                                   min(blockDim.x, nx - 1 - block_ix), /*Fill me*/);
+    }
+    // TODO: Use block-level NVSHMEM put communication API to transfer the halo pointed to by 
+    // "a_new + (iy_end - 1) * nx + block_ix" to "a_new + bottom_iy * nx + block_ix" in the "bottom_pe"
+    if ((block_iy < iy_end) && (iy_end <= block_iy + blockDim.y)) {
+        nvshmemx_/*Fill me*/(/*Fill me*/, /*Fill me*/,
+                                   min(blockDim.x, nx - 1 - block_ix), /*Fill me*/);
+    }
+    // Reduce L2 norm for the block in parallel
+    for (unsigned stride = 1; stride < BLOCK_DIM_X*BLOCK_DIM_Y; stride *= 2) {
+        __syncthreads();
+        if ((thread_index) % (2*stride) == 0) {
+            block_l2_sum[thread_index] += block_l2_sum[thread_index + stride];
+        }
+    }
+    // Atomically update global L2 norm with block-reduced L2 norm
+    if (thread_index == 0) {
+        atomicAdd(l2_norm, block_l2_sum[0]);
+    }
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, float* const a_ref_h,
+                  const bool print, int mype);
+
+int get_argval(char** begin, char** end, const std::string& arg, const int default_val) {
+    int argval = default_val;
+    char** itr = std::find(begin, end, arg);
+    if (itr != end && ++itr != end) {
+        std::istringstream inbuf(*itr);
+        inbuf >> argval;
+    }
+    return argval;
+}
+
+bool get_arg(char** begin, char** end, const std::string& arg) {
+    char** itr = std::find(begin, end, arg);
+    if (itr != end) {
+        return true;
+    }
+    return false;
+}
+
+struct l2_norm_buf {
+    cudaEvent_t copy_done;
+    float* d;
+    float* h;
+};
+
+int main(int argc, char* argv[]) {
+    const int iter_max = get_argval(argv, argv + argc, "-niter", 1000);
+    const int nx = get_argval(argv, argv + argc, "-nx", 16384);
+    const int ny = get_argval(argv, argv + argc, "-ny", 16384);
+    const bool skip_single_gpu = get_arg(argv, argv + argc, "-skip_single_gpu");
+
+    float* a_new;
+
+    float* a_ref_h;
+    float* a_h;
+    double runtime_serial = 0.0;
+
+    float l2_norms[2];
+
+    int rank = 0, size = 1;
+    MPI_CALL(MPI_Init(&argc, &argv));
+    MPI_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
+    MPI_CALL(MPI_Comm_size(MPI_COMM_WORLD, &size));
+
+    int num_devices;
+    CUDA_RT_CALL(cudaGetDeviceCount(&num_devices));
+
+    int local_rank = -1, local_size = 1;
+    {
+        MPI_Comm local_comm;
+        MPI_Info info;
+        MPI_CALL(MPI_Info_create(&info));
+        MPI_CALL(
+            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, info, &local_comm));
+
+        MPI_CALL(MPI_Comm_rank(local_comm, &local_rank));
+        MPI_CALL(MPI_Comm_size(local_comm, &local_size));
+        if (num_devices < local_size) {
+            fprintf(stderr,
+                    "ERROR: Number of devices is less numer of PEs \
+                    on the node!\n");
+            MPI_CALL(MPI_Comm_free(&local_comm));
+            MPI_CALL(MPI_Info_free(&info));
+            MPI_CALL(MPI_Finalize());
+            return -1;
+        }
+
+        MPI_CALL(MPI_Comm_free(&local_comm));
+        MPI_CALL(MPI_Info_free(&info));
+    }
+    CUDA_RT_CALL(cudaSetDevice(local_rank));
+    CUDA_RT_CALL(cudaFree(0));
+
+    MPI_Comm mpi_comm;
+    nvshmemx_init_attr_t attr;
+
+    mpi_comm = MPI_COMM_WORLD;
+    attr.mpi_comm = &mpi_comm;
+    // Set symmetric heap size for nvshmem based on problem size
+    // Its default value in nvshmem is 1 GB which is not sufficient
+    // for large mesh sizes
+    long long unsigned int mesh_size_per_rank = nx * (((ny - 2) + size - 1) / size + 2);
+    long long unsigned int required_symmetric_heap_size =
+        2 * mesh_size_per_rank * sizeof(float) *
+        1.1;  // Factor 2 is because 2 arrays are allocated - a and a_new
+              // 1.1 factor is just for alignment or other usage
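+    // For example (illustrative only): with the default 16384 x 16384 mesh on a single PE this
+    // works out to roughly 2 * 16384 * 16384 * 4 B * 1.1, i.e. about 2.4 GB, which is why the
+    // 1 GB default heap is not enough.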
+
+    char * value = getenv("NVSHMEM_SYMMETRIC_SIZE");
+    if (value) { /* env variable is set */
+        long long unsigned int size_env = parse_nvshmem_symmetric_size(value);
+        if (size_env < required_symmetric_heap_size) {
+            fprintf(stderr, "ERROR: Minimum NVSHMEM_SYMMETRIC_SIZE = %lluB, Current NVSHMEM_SYMMETRIC_SIZE = %s\n", required_symmetric_heap_size, value);
+            MPI_CALL(MPI_Finalize());
+            return -1;
+        }
+    } else {
+        char symmetric_heap_size_str[100];
+        sprintf(symmetric_heap_size_str, "%llu", required_symmetric_heap_size);
+        if (!rank)
+            printf("Setting environment variable NVSHMEM_SYMMETRIC_SIZE = %llu\n", required_symmetric_heap_size);
+        setenv("NVSHMEM_SYMMETRIC_SIZE", symmetric_heap_size_str, 1);
+    }
+    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
+
+    int npes = nvshmem_n_pes();
+    int mype = nvshmem_my_pe();
+
+    nvshmem_barrier_all();
+
+    bool result_correct = true;
+    float* a;
+
+    cudaStream_t compute_stream;
+    cudaStream_t reset_l2_norm_stream;
+    cudaEvent_t compute_done[2];
+    cudaEvent_t reset_l2_norm_done[2];
+
+    l2_norm_buf l2_norm_bufs[2];
+
+    CUDA_RT_CALL(cudaMallocHost(&a_ref_h, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMallocHost(&a_h, nx * ny * sizeof(float)));
+    if (!skip_single_gpu){
+        runtime_serial = single_gpu(nx, ny, iter_max, a_ref_h, (0 == mype), mype);
+    }
+    
+    nvshmem_barrier_all();
+    // ny - 2 rows are distributed amongst `size` ranks in such a way
+    // that each rank gets either (ny - 2) / size or (ny - 2) / size + 1 rows.
+    // This optimizes load balancing when (ny - 2) % size != 0
+    int chunk_size;
+    int chunk_size_low = (ny - 2) / npes;
+    int chunk_size_high = chunk_size_low + 1;
+    // To calculate the number of ranks that need to compute an extra row,
+    // the following formula is derived from this equation:
+    // num_ranks_low * chunk_size_low + (size - num_ranks_low) * (chunk_size_low + 1) = ny - 2
+    int num_ranks_low = npes * chunk_size_low + npes -
+                        (ny - 2);  // Number of ranks with chunk_size = chunk_size_low
+    if (mype < num_ranks_low)
+        chunk_size = chunk_size_low;
+    else
+        chunk_size = chunk_size_high;
+
+    a = (float*)nvshmem_malloc(
+        nx * (chunk_size_high + 2) *
+        sizeof(float));  // Using chunk_size_high so that it is same across all PEs
+    a_new = (float*)nvshmem_malloc(nx * (chunk_size_high + 2) * sizeof(float));
+
+    cudaMemset(a, 0, nx * (chunk_size + 2) * sizeof(float));
+    cudaMemset(a_new, 0, nx * (chunk_size + 2) * sizeof(float));
+
+    // Calculate local domain boundaries
+    int iy_start_global;  // My start index in the global array
+    if (mype < num_ranks_low) {
+        iy_start_global = mype * chunk_size_low + 1;
+    } else {
+        iy_start_global =
+            num_ranks_low * chunk_size_low + (mype - num_ranks_low) * chunk_size_high + 1;
+    }
+    int iy_end_global = iy_start_global + chunk_size - 1;  // My last index in the global array
+    // do not process boundaries
+    iy_end_global = std::min(iy_end_global, ny - 4);
+
+    int iy_start = 1;
+    int iy_end = (iy_end_global - iy_start_global + 1) + iy_start;
+
+    // calculate boundary indices for top and bottom boundaries
+    int top_pe = mype > 0 ? mype - 1 : (npes - 1);
+    int bottom_pe = (mype + 1) % npes;
+
+    int iy_end_top = (top_pe < num_ranks_low) ? chunk_size_low + 1 : chunk_size_high + 1;
+    int iy_start_bottom = 0;
+
+    // Set Dirichlet boundary conditions on left and right boundary
+    initialize_boundaries<<<(ny / npes) / 128 + 1, 128>>>(a, a_new, PI, iy_start_global - 1, nx,
+                                                          chunk_size, ny - 2);
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    CUDA_RT_CALL(cudaStreamCreateWithFlags(&compute_stream, cudaStreamNonBlocking));
+    CUDA_RT_CALL(cudaStreamCreate(&reset_l2_norm_stream));
+    CUDA_RT_CALL(cudaEventCreate(&compute_done[0]));
+    CUDA_RT_CALL(cudaEventCreate(&compute_done[1]));
+    CUDA_RT_CALL(cudaEventCreate(&reset_l2_norm_done[0]));
+    CUDA_RT_CALL(cudaEventCreate(&reset_l2_norm_done[1]));
+
+    for (int i = 0; i < 2; ++i) {
+        CUDA_RT_CALL(cudaEventCreate(&l2_norm_bufs[i].copy_done));
+        CUDA_RT_CALL(cudaMalloc(&l2_norm_bufs[i].d, sizeof(float)));
+        CUDA_RT_CALL(cudaMemset(l2_norm_bufs[i].d, 0, sizeof(float)));
+        CUDA_RT_CALL(cudaMallocHost(&l2_norm_bufs[i].h, sizeof(float)));
+        *(l2_norm_bufs[i].h) = 1.0;
+    }
+
+    nvshmemx_barrier_all_on_stream(compute_stream);
+    MPI_CALL(MPI_Allreduce(l2_norm_bufs[0].h, &l2_norms[0], 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD));
+    MPI_CALL(MPI_Allreduce(l2_norm_bufs[1].h, &l2_norms[1], 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD));
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (!mype) {
+        printf("Jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+    }
+
+    dim3 dim_grid((nx + BLOCK_DIM_X-1) / BLOCK_DIM_X, (chunk_size + BLOCK_DIM_Y-1) / BLOCK_DIM_Y, 1);
+    dim3 dim_block(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
+
+    int iter = 0;
+    if (!mype) {
+        for (int i = 0; i < 2; ++i) {
+            l2_norms[i] = 1.0;
+        }
+    }
+
+    nvshmem_barrier_all();
+
+    double start = MPI_Wtime();
+    bool l2_norm_greater_than_tol = true;
+
+    nvtxRangePush("Jacobi Solve Multi-GPU");
+    while (l2_norm_greater_than_tol && iter < iter_max) {
+        // on new iteration: old current vars are now previous vars, old
+        // previous vars are no longer needed
+        int prev = iter % 2;
+        int curr = (iter + 1) % 2;
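+        // The norm written into the "curr" buffer this iteration is copied back and reduced
+        // only in the next iteration (as "prev"), so the D2H copy and MPI_Allreduce overlap
+        // with kernel execution instead of stalling it every step.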
+
+        // TODO: Use "cudaStreamWaitEvent" on "compute_stream" to wait for "reset_l2_norm_done"
+        // event to complete for "curr" iteration
+        CUDA_RT_CALL(cudaStreamWaitEvent(/*Fill me*/, /*Fill me*/, 0));
+        jacobi_kernel<<<dim_grid, dim_block, 0, compute_stream>>>(
+                a_new, a, l2_norm_bufs[curr].d, iy_start, iy_end, nx, top_pe, iy_end_top, bottom_pe,
+                iy_start_bottom);
+
+        // TODO: Put a barrier at the "compute_stream" stream level
+        nvshmemx_/*Fill me*/(/*Fill me*/);
+
+        // perform L2 norm calculation
+        // as soon as computation is complete -> D2H-copy L2 norm
+        CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_bufs[curr].h, l2_norm_bufs[curr].d, sizeof(float),
+                                        cudaMemcpyDeviceToHost, compute_stream));
+        // TODO: Record the event "l2_norm_bufs[curr].copy_done" for "compute_stream"
+        CUDA_RT_CALL(cudaEventRecord(/*Fill me*/, /*Fill me*/));
+
+        // ensure previous D2H-copy is completed before using the data for
+        // calculation
+        CUDA_RT_CALL(cudaEventSynchronize(l2_norm_bufs[prev].copy_done));
+
+        MPI_CALL(MPI_Allreduce(l2_norm_bufs[prev].h, &l2_norms[prev], 1, MPI_FLOAT, MPI_SUM,
+                                MPI_COMM_WORLD));
+
+        l2_norms[prev] = std::sqrt(l2_norms[prev]);
+        l2_norm_greater_than_tol = (l2_norms[prev] > tol);
+
+        iter++;
+        if ((iter % 100) == 0) {
+            if (!mype) printf("%5d, %0.6f\n", iter, l2_norms[prev]);
+        }
+
+        // reset everything for next iteration
+        l2_norms[prev] = 0.0;
+        *(l2_norm_bufs[prev].h) = 0.0;
+        CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_bufs[prev].d, l2_norm_bufs[prev].h, sizeof(float),
+                                        cudaMemcpyHostToDevice, reset_l2_norm_stream));
+        // TODO: Record the L2 norm reset in "reset_l2_norm_done[prev]" for "reset_l2_norm_stream"
+        CUDA_RT_CALL(cudaEventRecord(/*Fill me*/, /*Fill me*/));
+
+        std::swap(a_new, a);
+    }
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    nvshmem_barrier_all();
+    double stop = MPI_Wtime();
+    nvtxRangePop();
+
+    nvshmem_barrier_all();
+
+    CUDA_RT_CALL(cudaMemcpy(a_h + iy_start_global * nx, a + nx,
+                            std::min(ny - 2 - iy_start_global, chunk_size) * nx * sizeof(float),
+                            cudaMemcpyDeviceToHost));
+
+    result_correct = true;
+    if (!skip_single_gpu) {
+        for (int iy = iy_start_global; result_correct && (iy < iy_end_global); ++iy) {
+            for (int ix = 1; result_correct && (ix < (nx - 1)); ++ix) {
+                if (std::fabs(a_ref_h[iy * nx + ix] - a_h[iy * nx + ix]) > tol) {
+                    fprintf(stderr,
+                            "ERROR on rank %d: a[%d * %d + %d] = %f does not match %f "
+                            "(reference)\n",
+                            rank, iy, nx, ix, a_h[iy * nx + ix], a_ref_h[iy * nx + ix]);
+                    result_correct = false;
+                }
+            }
+        }
+        int local_result_correct = result_correct ? 1 : 0;
+        int global_result_correct = 1;
+        MPI_CALL(MPI_Allreduce(&local_result_correct, &global_result_correct, 1, MPI_INT, MPI_MIN,
+                                MPI_COMM_WORLD));
+        result_correct = (global_result_correct == 1);
+    }
+
+    if (!mype && result_correct) {
+        printf("Num GPUs: %d.\n", size);
+        if (!skip_single_gpu) {
+            printf(
+                "%dx%d: 1 GPU: %8.4f s, %d GPUs: %8.4f s, speedup: %8.2f, "
+                "efficiency: %8.2f \n",
+                nx, ny, runtime_serial, size, (stop - start), runtime_serial / (stop - start),
+                runtime_serial / (size * (stop - start)) * 100);
+        }
+        else {
+            printf("%dx%d: %d GPUs: %8.4f s \n", nx, ny, size, (stop - start)); 
+        }
+    }
+
+    for (int i = 0; i < 2; ++i) {
+        CUDA_RT_CALL(cudaFreeHost(l2_norm_bufs[i].h));
+        CUDA_RT_CALL(cudaFree(l2_norm_bufs[i].d));
+        CUDA_RT_CALL(cudaEventDestroy(l2_norm_bufs[i].copy_done));
+    }
+
+    nvshmem_free(a);
+    nvshmem_free(a_new);
+
+    CUDA_RT_CALL(cudaEventDestroy(reset_l2_norm_done[1]));
+    CUDA_RT_CALL(cudaEventDestroy(reset_l2_norm_done[0]));
+    CUDA_RT_CALL(cudaEventDestroy(compute_done[1]));
+    CUDA_RT_CALL(cudaEventDestroy(compute_done[0]));
+    CUDA_RT_CALL(cudaStreamDestroy(reset_l2_norm_stream));
+    CUDA_RT_CALL(cudaStreamDestroy(compute_stream));
+
+    CUDA_RT_CALL(cudaFreeHost(a_h));
+    CUDA_RT_CALL(cudaFreeHost(a_ref_h));
+
+    nvshmem_finalize();
+    MPI_CALL(MPI_Finalize());
+
+    return (result_correct == 1) ? 0 : 1;
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, float* const a_ref_h,
+                  const bool print, int mype) {
+    float* a;
+    float* a_new;
+    float* l2_norm_d;
+    float* l2_norm_h;
+
+    int iy_start = 1;
+    int iy_end = ny - 3;
+
+    CUDA_RT_CALL(cudaMalloc((void**)&a, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMalloc((void**)&a_new, nx * ny * sizeof(float)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * ny * sizeof(float)));
+
+    // Set Dirichlet boundary conditions on left and right border
+    initialize_boundaries<<<ny / 128 + 1, 128>>>(a, a_new, PI, 0, nx, ny - 2, ny - 2);
+
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(float)));
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(float)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (print)
+        printf("Single GPU jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+
+    dim3 dim_block(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
+    dim3 dim_grid((nx + BLOCK_DIM_X-1) / BLOCK_DIM_X, ((ny - 2) + BLOCK_DIM_Y-1) / BLOCK_DIM_Y, 1);
+
+    int iter = 0;
+    float l2_norm = 1.0;
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    double start = MPI_Wtime();
+    nvtxRangePush("Jacobi Solve Single GPU");
+
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemsetAsync(l2_norm_d, 0, sizeof(float)));
+
+        jacobi_kernel<<<dim_grid, dim_block>>>(a_new, a, l2_norm_d, iy_start, iy_end, nx, 
+                                                mype, iy_end + 1, mype, (iy_start - 1));
+        
+        iter++;
+        if (print && ((iter % 100) == 0)) {
+            CUDA_RT_CALL(cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(float), cudaMemcpyDeviceToHost));
+            l2_norm = std::sqrt(*l2_norm_h);
+            printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+
+        std::swap(a_new, a);
+    }
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    nvtxRangePop();
+    double stop = MPI_Wtime();
+
+    CUDA_RT_CALL(cudaMemcpy(a_ref_h, a, nx * ny * sizeof(float), cudaMemcpyDeviceToHost));
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+    return (stop - start);
+}

+ 55 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/left_shift.cu

@@ -0,0 +1,55 @@
+#include <stdio.h>
+#include "mpi.h"
+#include "nvshmem.h"
+#include "nvshmemx.h"
+
+#define CUDA_CHECK(stmt)                                  \
+do {                                                      \
+    cudaError_t result = (stmt);                          \
+    if (cudaSuccess != result) {                          \
+        fprintf(stderr, "[%s:%d] CUDA failed with %s \n", \
+         __FILE__, __LINE__, cudaGetErrorString(result)); \
+        exit(-1);                                         \
+    }                                                     \
+} while (0)
+
+__global__ void simple_shift(int *destination) {
+    int mype = nvshmem_my_pe();
+    int npes = nvshmem_n_pes();
+    int peer = (mype + 1) % npes;
+
+    nvshmem_int_p(destination, mype, peer);
+}
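+
+// With npes PEs, every PE writes its own ID into `destination` on its right neighbour
+// ((mype + 1) % npes), so after the barrier PE i prints the ID of its left neighbour,
+// i.e. (i - 1 + npes) % npes.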
+
+int main (int argc, char *argv[]) {
+    int mype_node, msg;
+    cudaStream_t stream;
+    int rank, nranks;
+    MPI_Comm mpi_comm = MPI_COMM_WORLD;
+    nvshmemx_init_attr_t attr;
+
+    MPI_Init(&argc, &argv);
+    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
+
+    attr.mpi_comm = &mpi_comm;
+    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
+    mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
+
+    CUDA_CHECK(cudaSetDevice(mype_node));
+    CUDA_CHECK(cudaStreamCreate(&stream));
+    int *destination = (int *) nvshmem_malloc (sizeof(int));
+
+    simple_shift<<<1, 1, 0, stream>>>(destination);
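+    // Synchronize all PEs on the stream so that every PE's put has completed before the
+    // result is copied back to the host.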
+    nvshmemx_barrier_all_on_stream(stream);
+    CUDA_CHECK(cudaMemcpyAsync(&msg, destination, sizeof(int),
+                cudaMemcpyDeviceToHost, stream));
+
+    CUDA_CHECK(cudaStreamSynchronize(stream));
+    printf("%d: received message %d\n", nvshmem_my_pe(), msg);
+
+    nvshmem_free(destination);
+    nvshmem_finalize();
+    MPI_Finalize();
+    return 0;
+}

+ 555 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/solution/jacobi_nvshmem.cu

@@ -0,0 +1,555 @@
+/* Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NVIDIA CORPORATION nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <mpi.h>
+#include <nvshmem.h>
+#include <nvshmemx.h>
+#include <algorithm>
+#include <cassert>
+#include <cmath>
+#include <cstdio>
+#include <iostream>
+#include <sstream>
+#include <cub/block/block_reduce.cuh>
+#include <nvToolsExt.h>
+
+#define BLOCK_DIM_X 1024
+#define BLOCK_DIM_Y 1
+
+#define MPI_CALL(call)                                                                \
+    {                                                                                 \
+        int mpi_status = call;                                                        \
+        if (0 != mpi_status) {                                                        \
+            char mpi_error_string[MPI_MAX_ERROR_STRING];                              \
+            int mpi_error_string_length = 0;                                          \
+            MPI_Error_string(mpi_status, mpi_error_string, &mpi_error_string_length); \
+            if (NULL != mpi_error_string)                                             \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %s "                                                    \
+                        "(%d).\n",                                                    \
+                        #call, __LINE__, __FILE__, mpi_error_string, mpi_status);     \
+            else                                                                      \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %d.\n",                                                 \
+                        #call, __LINE__, __FILE__, mpi_status);                       \
+        }                                                                             \
+    }
+
+#define CUDA_RT_CALL(call)                                                                  \
+    {                                                                                       \
+        cudaError_t cudaStatus = call;                                                      \
+        if (cudaSuccess != cudaStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
+    }
+
+// convert NVSHMEM_SYMMETRIC_SIZE string to long long unsigned int
+long long unsigned int parse_nvshmem_symmetric_size(char *value) {
+    long long unsigned int units, size;
+
+    assert(value != NULL);
+
+    if (strchr(value, 'G') != NULL) {
+        units=1e9;
+    } else if (strchr(value, 'M') != NULL) {
+        units=1e6;
+    } else if (strchr(value, 'K') != NULL) {
+        units=1e3;
+    } else {
+        units=1;
+    }
+
+    assert(atof(value) >= 0);
+    size = (long long unsigned int) atof(value) * units;
+
+    return size;
+}
+
+constexpr float tol = 1.0e-8;
+const float PI = 2.0 * std::asin(1.0);
+
+__global__ void initialize_boundaries(float* __restrict__ const a_new, float* __restrict__ const a,
+                                      const float pi, const int offset, const int nx,
+                                      const int my_ny, int ny) {
+    for (int iy = blockIdx.x * blockDim.x + threadIdx.x; iy < my_ny; iy += blockDim.x * gridDim.x) {
+        const float y0 = sin(2.0 * pi * (offset + iy) / (ny - 1));
+        a[(iy + 1) * nx + 0] = y0;
+        a[(iy + 1) * nx + (nx - 1)] = y0;
+        a_new[(iy + 1) * nx + 0] = y0;
+        a_new[(iy + 1) * nx + (nx - 1)] = y0;
+    }
+}
+
+__global__ void jacobi_kernel(float* __restrict__ const a_new, const float* __restrict__ const a,
+                              float* __restrict__ const l2_norm, const int iy_start,
+                              const int iy_end, const int nx, const int top_pe, const int top_iy,
+                              const int bottom_pe, const int bottom_iy) {
+    int iy = blockIdx.y * blockDim.y + threadIdx.y + iy_start;
+    int ix = blockIdx.x * blockDim.x + threadIdx.x + 1;
+    __shared__ float block_l2_sum[BLOCK_DIM_X*BLOCK_DIM_Y];
+    unsigned thread_index = threadIdx.y*BLOCK_DIM_X + threadIdx.x;
+
+    if (iy < iy_end && ix < (nx - 1)) {
+        // Update grid point
+        const float new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
+                                     a[(iy + 1) * nx + ix] + a[(iy - 1) * nx + ix]);
+        a_new[iy * nx + ix] = new_val;
+        float residue = new_val - a[iy * nx + ix];
+        // Set block-level L2 norm value for this grid point
+        block_l2_sum[thread_index] = residue * residue;
+    }
+    else {
+        block_l2_sum[thread_index] = 0;
+    }
+
+    /* starting (x, y) coordinate of the block */
+    int block_iy =
+        iy - threadIdx.y; /* Alternatively, block_iy = blockIdx.y * blockDim.y + iy_start */
+    int block_ix = ix - threadIdx.x; /* Alternatively, block_ix = blockIdx.x * blockDim.x + 1 */
+
+    /* Communicate the boundaries */
+    if ((block_iy <= iy_start) && (iy_start < block_iy + blockDim.y)) {
+        nvshmemx_float_put_nbi_block(a_new + top_iy * nx + block_ix, a_new + iy_start * nx + block_ix,
+                                   min(blockDim.x, nx - 1 - block_ix), top_pe);
+    }
+    if ((block_iy < iy_end) && (iy_end <= block_iy + blockDim.y)) {
+        nvshmemx_float_put_nbi_block(a_new + bottom_iy * nx + block_ix,
+                                   a_new + (iy_end - 1) * nx + block_ix,
+                                   min(blockDim.x, nx - 1 - block_ix), bottom_pe);
+    }
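+    /* The *_put_nbi_block calls above are non-blocking puts issued collectively by the whole
+       thread block; their completion on the remote PEs is only guaranteed after the barrier
+       executed on the stream between iterations. */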
+    // Reduce L2 norm for the block in parallel
+    for (unsigned stride = 1; stride < BLOCK_DIM_X*BLOCK_DIM_Y; stride *= 2) {
+        __syncthreads();
+        if ((thread_index) % (2*stride) == 0) {
+            block_l2_sum[thread_index] += block_l2_sum[thread_index + stride];
+        }
+    }
+    // Atomically update global L2 norm with block-reduced L2 norm
+    if (thread_index == 0) {
+        atomicAdd(l2_norm, block_l2_sum[0]);
+    }
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, float* const a_ref_h,
+                  const bool print, int mype);
+
+int get_argval(char** begin, char** end, const std::string& arg, const int default_val) {
+    int argval = default_val;
+    char** itr = std::find(begin, end, arg);
+    if (itr != end && ++itr != end) {
+        std::istringstream inbuf(*itr);
+        inbuf >> argval;
+    }
+    return argval;
+}
+
+bool get_arg(char** begin, char** end, const std::string& arg) {
+    char** itr = std::find(begin, end, arg);
+    if (itr != end) {
+        return true;
+    }
+    return false;
+}
+
+struct l2_norm_buf {
+    cudaEvent_t copy_done;
+    float* d;
+    float* h;
+};
+
+int main(int argc, char* argv[]) {
+    const int iter_max = get_argval(argv, argv + argc, "-niter", 1000);
+    const int nx = get_argval(argv, argv + argc, "-nx", 16384);
+    const int ny = get_argval(argv, argv + argc, "-ny", 16384);
+
+    float* a_new;
+
+    float* a_ref_h;
+    float* a_h;
+    double runtime_serial = 0.0;
+
+    float l2_norms[2];
+
+    int rank = 0, size = 1;
+    MPI_CALL(MPI_Init(&argc, &argv));
+    MPI_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
+    MPI_CALL(MPI_Comm_size(MPI_COMM_WORLD, &size));
+
+    int num_devices;
+    CUDA_RT_CALL(cudaGetDeviceCount(&num_devices));
+
+    int local_rank = -1, local_size = 1;
+    {
+        MPI_Comm local_comm;
+        MPI_Info info;
+        MPI_CALL(MPI_Info_create(&info));
+        MPI_CALL(
+            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, info, &local_comm));
+
+        MPI_CALL(MPI_Comm_rank(local_comm, &local_rank));
+        MPI_CALL(MPI_Comm_size(local_comm, &local_size));
+        if (num_devices < local_size) {
+            fprintf(stderr,
+                    "ERROR: Number of devices is less numer of PEs \
+                    on the node!\n");
+            MPI_CALL(MPI_Comm_free(&local_comm));
+            MPI_CALL(MPI_Info_free(&info));
+            MPI_CALL(MPI_Finalize());
+            return -1;
+        }
+
+        MPI_CALL(MPI_Comm_free(&local_comm));
+        MPI_CALL(MPI_Info_free(&info));
+    }
+    CUDA_RT_CALL(cudaSetDevice(local_rank));
+    CUDA_RT_CALL(cudaFree(0));
+
+    MPI_Comm mpi_comm;
+    nvshmemx_init_attr_t attr;
+
+    mpi_comm = MPI_COMM_WORLD;
+    attr.mpi_comm = &mpi_comm;
+    // Set symmetric heap size for nvshmem based on problem size
+    // Its default value in nvshmem is 1 GB which is not sufficient
+    // for large mesh sizes
+    long long unsigned int mesh_size_per_rank = nx * (((ny - 2) + size - 1) / size + 2);
+    long long unsigned int required_symmetric_heap_size =
+        2 * mesh_size_per_rank * sizeof(float) *
+        1.1;  // Factor 2 is because 2 arrays are allocated - a and a_new
+              // 1.1 factor is just for alignment or other usage
+
+    char * value = getenv("NVSHMEM_SYMMETRIC_SIZE");
+    if (value) { /* env variable is set */
+        long long unsigned int size_env = parse_nvshmem_symmetric_size(value);
+        if (size_env < required_symmetric_heap_size) {
+            fprintf(stderr, "ERROR: Minimum NVSHMEM_SYMMETRIC_SIZE = %lluB, Current NVSHMEM_SYMMETRIC_SIZE = %s\n", required_symmetric_heap_size, value);
+            MPI_CALL(MPI_Finalize());
+            return -1;
+        }
+    } else {
+        char symmetric_heap_size_str[100];
+        sprintf(symmetric_heap_size_str, "%llu", required_symmetric_heap_size);
+        if (!rank)
+            printf("Setting environment variable NVSHMEM_SYMMETRIC_SIZE = %llu\n", required_symmetric_heap_size);
+        setenv("NVSHMEM_SYMMETRIC_SIZE", symmetric_heap_size_str, 1);
+    }
+    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
+
+    int npes = nvshmem_n_pes();
+    int mype = nvshmem_my_pe();
+
+    nvshmem_barrier_all();
+
+    bool result_correct = true;
+    float* a;
+
+    cudaStream_t compute_stream;
+    cudaStream_t reset_l2_norm_stream;
+    cudaEvent_t compute_done[2];
+    cudaEvent_t reset_l2_norm_done[2];
+
+    l2_norm_buf l2_norm_bufs[2];
+
+    CUDA_RT_CALL(cudaMallocHost(&a_ref_h, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMallocHost(&a_h, nx * ny * sizeof(float)));
+    runtime_serial = single_gpu(nx, ny, iter_max, a_ref_h, (0 == mype), mype);
+
+    nvshmem_barrier_all();
+    // ny - 2 rows are distributed amongst `size` ranks in such a way
+    // that each rank gets either (ny - 2) / size or (ny - 2) / size + 1 rows.
+    // This optimizes load balancing when (ny - 2) % size != 0
+    int chunk_size;
+    int chunk_size_low = (ny - 2) / npes;
+    int chunk_size_high = chunk_size_low + 1;
+    // To calculate the number of ranks that need to compute an extra row,
+    // the following formula is derived from this equation:
+    // num_ranks_low * chunk_size_low + (size - num_ranks_low) * (chunk_size_low + 1) = ny - 2
+    int num_ranks_low = npes * chunk_size_low + npes -
+                        (ny - 2);  // Number of ranks with chunk_size = chunk_size_low
+    if (mype < num_ranks_low)
+        chunk_size = chunk_size_low;
+    else
+        chunk_size = chunk_size_high;
+
+    a = (float*)nvshmem_malloc(
+        nx * (chunk_size_high + 2) *
+        sizeof(float));  // Using chunk_size_high so that it is same across all PEs
+    a_new = (float*)nvshmem_malloc(nx * (chunk_size_high + 2) * sizeof(float));
+
+    cudaMemset(a, 0, nx * (chunk_size + 2) * sizeof(float));
+    cudaMemset(a_new, 0, nx * (chunk_size + 2) * sizeof(float));
+
+    // Calculate local domain boundaries
+    int iy_start_global;  // My start index in the global array
+    if (mype < num_ranks_low) {
+        iy_start_global = mype * chunk_size_low + 1;
+    } else {
+        iy_start_global =
+            num_ranks_low * chunk_size_low + (mype - num_ranks_low) * chunk_size_high + 1;
+    }
+    int iy_end_global = iy_start_global + chunk_size - 1;  // My last index in the global array
+    // do not process boundaries
+    iy_end_global = std::min(iy_end_global, ny - 4);
+
+    int iy_start = 1;
+    int iy_end = (iy_end_global - iy_start_global + 1) + iy_start;
+
+    // calculate boundary indices for top and bottom boundaries
+    int top_pe = mype > 0 ? mype - 1 : (npes - 1);
+    int bottom_pe = (mype + 1) % npes;
+
+    int iy_end_top = (top_pe < num_ranks_low) ? chunk_size_low + 1 : chunk_size_high + 1;
+    int iy_start_bottom = 0;
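+    // With the wrap-around neighbor indices above, each PE exchanges halo rows with
+    // top_pe and bottom_pe; iy_end_top and iy_start_bottom are the row offsets in the
+    // neighbors' local arrays that the Jacobi kernel is expected to write this PE's
+    // boundary rows into during the halo exchange.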
+
+    // Set Dirichlet boundary conditions on the left and right boundaries
+    initialize_boundaries<<<(ny / npes) / 128 + 1, 128>>>(a, a_new, PI, iy_start_global - 1, nx,
+                                                          chunk_size, ny - 2);
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    CUDA_RT_CALL(cudaStreamCreateWithFlags(&compute_stream, cudaStreamNonBlocking));
+    CUDA_RT_CALL(cudaStreamCreate(&reset_l2_norm_stream));
+    CUDA_RT_CALL(cudaEventCreate(&compute_done[0]));
+    CUDA_RT_CALL(cudaEventCreate(&compute_done[1]));
+    CUDA_RT_CALL(cudaEventCreate(&reset_l2_norm_done[0]));
+    CUDA_RT_CALL(cudaEventCreate(&reset_l2_norm_done[1]));
+
+    for (int i = 0; i < 2; ++i) {
+        CUDA_RT_CALL(cudaEventCreate(&l2_norm_bufs[i].copy_done));
+        CUDA_RT_CALL(cudaMalloc(&l2_norm_bufs[i].d, sizeof(float)));
+        CUDA_RT_CALL(cudaMemset(l2_norm_bufs[i].d, 0, sizeof(float)));
+        CUDA_RT_CALL(cudaMallocHost(&l2_norm_bufs[i].h, sizeof(float)));
+        *(l2_norm_bufs[i].h) = 1.0;
+    }
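+    // Two L2-norm buffers are used (indexed by prev/curr in the solver loop below) so
+    // that the device-to-host copy and MPI reduction of the previous iteration's norm
+    // can overlap with the Jacobi kernel of the current iteration.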
+
+    nvshmemx_barrier_all_on_stream(compute_stream);
+    MPI_CALL(MPI_Allreduce(l2_norm_bufs[0].h, &l2_norms[0], 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD));
+    MPI_CALL(MPI_Allreduce(l2_norm_bufs[1].h, &l2_norms[1], 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD));
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (!mype) {
+        printf("Jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+    }
+
+    dim3 dim_grid((nx + BLOCK_DIM_X-1) / BLOCK_DIM_X, (chunk_size + BLOCK_DIM_Y-1) / BLOCK_DIM_Y, 1);
+    dim3 dim_block(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
+
+    int iter = 0;
+    if (!mype) {
+        for (int i = 0; i < 2; ++i) {
+            l2_norms[i] = 1.0;
+        }
+    }
+
+    nvshmem_barrier_all();
+
+    double start = MPI_Wtime();
+    bool l2_norm_greater_than_tol = true;
+
+    nvtxRangePush("Jacobi Solve Multi-GPU");
+    while (l2_norm_greater_than_tol && iter < iter_max) {
+        // on new iteration: old current vars are now previous vars, old
+        // previous vars are no longer needed
+        int prev = iter % 2;
+        int curr = (iter + 1) % 2;
+
+        CUDA_RT_CALL(cudaStreamWaitEvent(compute_stream, reset_l2_norm_done[curr], 0));
+        jacobi_kernel<<<dim_grid, dim_block, 0, compute_stream>>>(
+                a_new, a, l2_norm_bufs[curr].d, iy_start, iy_end, nx, top_pe, iy_end_top, bottom_pe,
+                iy_start_bottom);
+
+        /* A global barrier on the compute stream ensures that the halo exchanges issued
+           by jacobi_kernel are complete before the next iteration starts. A custom
+           barrier that synchronizes only with the neighboring PEs (the PEs this PE
+           actually communicates with) would avoid the redundant synchronization a
+           global barrier performs and is a possible further optimization. */
+        nvshmemx_barrier_all_on_stream(compute_stream);
+
+        // As soon as the computation for this iteration has been enqueued,
+        // asynchronously copy its L2 norm from device to host on the same stream
+        CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_bufs[curr].h, l2_norm_bufs[curr].d, sizeof(float),
+                                        cudaMemcpyDeviceToHost, compute_stream));
+        CUDA_RT_CALL(cudaEventRecord(l2_norm_bufs[curr].copy_done, compute_stream));
+
+        // ensure previous D2H-copy is completed before using the data for
+        // calculation
+        CUDA_RT_CALL(cudaEventSynchronize(l2_norm_bufs[prev].copy_done));
+
+        MPI_CALL(MPI_Allreduce(l2_norm_bufs[prev].h, &l2_norms[prev], 1, MPI_FLOAT, MPI_SUM,
+                                MPI_COMM_WORLD));
+
+        l2_norms[prev] = std::sqrt(l2_norms[prev]);
+        l2_norm_greater_than_tol = (l2_norms[prev] > tol);
+
+        iter++;
+        if ((iter % 100) == 0) {
+            if (!mype) printf("%5d, %0.6f\n", iter, l2_norms[prev]);
+        }
+
+        // reset everything for next iteration
+        l2_norms[prev] = 0.0;
+        *(l2_norm_bufs[prev].h) = 0.0;
+        CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_bufs[prev].d, l2_norm_bufs[prev].h, sizeof(float),
+                                        cudaMemcpyHostToDevice, reset_l2_norm_stream));
+        CUDA_RT_CALL(cudaEventRecord(reset_l2_norm_done[prev], reset_l2_norm_stream));
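+        // The reset runs on its own stream; in the next iteration this buffer becomes
+        // `curr`, and the compute stream waits on the event recorded above before the
+        // kernel accumulates into it again.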
+
+        std::swap(a_new, a);
+    }
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    nvshmem_barrier_all();
+    double stop = MPI_Wtime();
+    nvtxRangePop();
+
+    nvshmem_barrier_all();
+
+    CUDA_RT_CALL(cudaMemcpy(a_h + iy_start_global * nx, a + nx,
+                            std::min(ny - 2 - iy_start_global, chunk_size) * nx * sizeof(float),
+                            cudaMemcpyDeviceToHost));
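+    // Each PE copies only its own interior rows (skipping the halo row at local offset 0)
+    // back into its slice of the full host mesh a_h, which is compared element-wise
+    // against the single-GPU reference a_ref_h below.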
+
+    result_correct = true;
+    for (int iy = iy_start_global; result_correct && (iy < iy_end_global); ++iy) {
+        for (int ix = 1; result_correct && (ix < (nx - 1)); ++ix) {
+            if (std::fabs(a_ref_h[iy * nx + ix] - a_h[iy * nx + ix]) > tol) {
+                fprintf(stderr,
+                        "ERROR on rank %d: a[%d * %d + %d] = %f does not match %f "
+                        "(reference)\n",
+                        rank, iy, nx, ix, a_h[iy * nx + ix], a_ref_h[iy * nx + ix]);
+                result_correct = false;
+            }
+        }
+    }
+
+    // result_correct is a bool, so reduce an int copy of it: passing a bool* where
+    // MPI_INT expects a 4-byte integer is undefined behavior
+    int local_result_correct = result_correct ? 1 : 0;
+    int global_result_correct = 1;
+    MPI_CALL(MPI_Allreduce(&local_result_correct, &global_result_correct, 1, MPI_INT, MPI_MIN,
+                           MPI_COMM_WORLD));
+    result_correct = (global_result_correct == 1);
+
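+    // Report the parallel speedup (serial runtime / multi-GPU runtime) and efficiency
+    // (speedup / npes * 100, i.e., the percentage of ideal linear scaling achieved).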
+    if (!mype && result_correct) {
+            printf("Num GPUs: %d.\n", npes);
+            printf(
+                "%dx%d: 1 GPU: %8.4f s, %d GPUs: %8.4f s, speedup: %8.2f, "
+                "efficiency: %8.2f \n",
+                ny, nx, runtime_serial, npes, (stop - start), runtime_serial / (stop - start),
+                runtime_serial / (npes * (stop - start)) * 100);
+    }
+
+    for (int i = 0; i < 2; ++i) {
+        CUDA_RT_CALL(cudaFreeHost(l2_norm_bufs[i].h));
+        CUDA_RT_CALL(cudaFree(l2_norm_bufs[i].d));
+        CUDA_RT_CALL(cudaEventDestroy(l2_norm_bufs[i].copy_done));
+    }
+
+    nvshmem_free(a);
+    nvshmem_free(a_new);
+
+    CUDA_RT_CALL(cudaEventDestroy(reset_l2_norm_done[1]));
+    CUDA_RT_CALL(cudaEventDestroy(reset_l2_norm_done[0]));
+    CUDA_RT_CALL(cudaEventDestroy(compute_done[1]));
+    CUDA_RT_CALL(cudaEventDestroy(compute_done[0]));
+    CUDA_RT_CALL(cudaStreamDestroy(reset_l2_norm_stream));
+    CUDA_RT_CALL(cudaStreamDestroy(compute_stream));
+
+    CUDA_RT_CALL(cudaFreeHost(a_h));
+    CUDA_RT_CALL(cudaFreeHost(a_ref_h));
+
+    nvshmem_finalize();
+    MPI_CALL(MPI_Finalize());
+
+    return (result_correct == 1) ? 0 : 1;
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, float* const a_ref_h,
+                  const bool print, int mype) {
+    float* a;
+    float* a_new;
+    float* l2_norm_d;
+    float* l2_norm_h;
+
+    int iy_start = 1;
+    int iy_end = ny - 3;
+
+    CUDA_RT_CALL(cudaMalloc((void**)&a, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMalloc((void**)&a_new, nx * ny * sizeof(float)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * ny * sizeof(float)));
+
+    // Set Dirichlet boundary conditions on the left and right borders
+    initialize_boundaries<<<ny / 128 + 1, 128>>>(a, a_new, PI, 0, nx, ny - 2, ny - 2);
+
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(float)));
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(float)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (print)
+        printf("Single GPU jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+
+    dim3 dim_block(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
+    dim3 dim_grid((nx + BLOCK_DIM_X-1) / BLOCK_DIM_X, ((ny - 2) + BLOCK_DIM_Y-1) / BLOCK_DIM_Y, 1);
+
+    int iter = 0;
+    float l2_norm = 1.0;
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    double start = MPI_Wtime();
+    nvtxRangePush("Jacobi Solve Single GPU");
+
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemsetAsync(l2_norm_d, 0, sizeof(float)));
+
+        jacobi_kernel<<<dim_grid, dim_block>>>(a_new, a, l2_norm_d, iy_start, iy_end, nx, 
+                                                mype, iy_end + 1, mype, (iy_start - 1));
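+        // The top and bottom neighbor arguments are set to mype itself, so any halo
+        // writes the kernel issues land back in this PE's own array; this lets the
+        // multi-GPU kernel double as the single-GPU reference implementation.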
+        
+        iter++;
+        if (print && ((iter % 100) == 0)) {
+            CUDA_RT_CALL(cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(float), cudaMemcpyDeviceToHost));
+            l2_norm = std::sqrt(*l2_norm_h);
+            printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+
+        std::swap(a_new, a);
+    }
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    nvtxRangePop();
+    double stop = MPI_Wtime();
+
+    CUDA_RT_CALL(cudaMemcpy(a_ref_h, a, nx * ny * sizeof(float), cudaMemcpyDeviceToHost));
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+    return (stop - start);
+}

+ 5 - 5
hpc/multi_gpu_nways/labs/CFD/English/introduction.ipynb

@@ -26,7 +26,7 @@
     "Intermediate, Advanced\n",
     "\n",
     "### Target Audience and Prerequisites\n",
-    "The target audience for this lab are researchers, graduate students, and developers who are interested in scaling their scientific applications to multiple nodes using multi-GPU implmenetations.\n",
+    "The target audience for this lab are researchers, graduate students, and developers who are interested in scaling their scientific applications to multiple nodes using multi-GPU implementations.\n",
     "\n",
     "Experience in C/ C++ and basic CUDA programming is required. Experience with parallel programming frameworks like OpenMP or MPI is not required but a basic understanding of MPI is highly recommended.\n",
     "\n",
@@ -44,14 +44,14 @@
     "    * [MPI with CUDA Memcpy](C/jupyter_notebook/mpi/memcpy.ipynb)\n",
     "    * [CUDA-aware MPI](C/jupyter_notebook/mpi/cuda_aware.ipynb)\n",
     "    * [Supplemental: Configuring MPI in a containerized environment](C/jupyter_notebook/mpi/containers_and_mpi.ipynb)\n",
-    "4. [NVIDIA Collectives Communications Library (NCCL)](C/jupyter_notebook/nccl/nccl/ipynb)\n",
-    "5. NVHSMEM Library\n",
-    "6. Final remarks\n",
+    "4. [NVIDIA Collectives Communications Library (NCCL)](C/jupyter_notebook/nccl/nccl.ipynb)\n",
+    "5. [NVHSMEM Library](C/jupyter_notebook/nvshmem/nvshmem.ipynb)\n",
+    "\n",
     "--- \n",
     "\n",
     "## Licensing \n",
     "\n",
-    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
+    "This material is released) by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
    ]
   }
  ],