
Added NCCL and NVSHMEM notebooks

Anish Saxena 2 years ago
parent commit 60cac07f0f
30 changed files with 2885 additions and 227 deletions
  1. +3 -0      .gitignore
  2. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nccl_architecture.png
  3. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nccl_dgx1_topology.png
  4. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nccl_profiler_output.png
  5. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_left_shift_output.png
  6. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_memory_model.png
  7. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_mpi_comparison.png
  8. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_profiler_report.png
  9. BIN        hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_thread_level_comm.png
  10. +2 -0     hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/cuda/memcpy.ipynb
  11. +4 -0     hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/cuda/streams.ipynb
  12. +3 -1     hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/cuda_aware.ipynb
  13. +1 -1     hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/memcpy.ipynb
  14. +1 -1     hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/multi_node_intro.ipynb
  15. +276 -61  hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/nccl/nccl.ipynb
  16. +461 -0   hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/nvshmem/nvshmem.ipynb
  17. +16 -23   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/cuda/jacobi_memcpy.cu
  18. +10 -14   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/cuda/jacobi_streams.cu
  19. +16 -20   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/cuda/jacobi_streams_events.cu
  20. +11 -8    hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi_cuda_aware_mpi.cpp
  21. +11 -14   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi_memcpy_mpi.cpp
  22. +8 -26    hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/Makefile
  23. +38 -53   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/jacobi_kernels.cu
  24. +406 -0   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/jacobi_nccl.cpp
  25. +407 -0   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/solution/jacobi_nccl.cpp
  26. +29 -0    hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/Makefile
  27. +567 -0   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/jacobi_nvshmem.cu
  28. +55 -0    hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/left_shift.cu
  29. +555 -0   hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/solution/jacobi_nvshmem.cu
  30. +5 -5     hpc/multi_gpu_nways/labs/CFD/English/introduction.ipynb

+ 3 - 0
.gitignore

@@ -6,3 +6,6 @@ alk.traj.dcd
 *.a
 *.la
 mgpm
+*.o
+*.out
+final_remarks.ipynb

BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nccl_architecture.png


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nccl_dgx1_topology.png


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nccl_profiler_output.png


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_left_shift_output.png


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_memory_model.png


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_mpi_comparison.png


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_profiler_report.png


BIN
hpc/multi_gpu_nways/labs/CFD/English/C/images/nvshmem_thread_level_comm.png


+ 2 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/cuda/memcpy.ipynb

@@ -144,6 +144,8 @@
     "\n",
     "Now, let's parallelize our code across multiple GPUs by using `cudaSetDevice` and `cudaMemcpyAsync` operations. Open the [jacobi_memcpy.cu](../../source_code/cuda/jacobi_memcpy.cu) file.\n",
     "\n",
+    "Alternatively, you can navigate to `CFD/English/C/source_code/cuda/` directory in Jupyter's file browser in the left pane. Then, click to open the `jacobi_memcpy.cu` file. \n",
+    "\n",
     "Understand the flow of the program from within the `main` function. Review the following pre-Jacobi-computation steps:\n",
     "\n",
     "1. Computation of the memory chunk size to be allocated on each GPU stored in the `chunk_size` integer array.\n",

+ 4 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/cuda/streams.ipynb

@@ -108,6 +108,8 @@
     "\n",
     "Now, let's implement CUDA streams in our application. Open the [jacobi_streams.cu](../../source_code/cuda/jacobi_streams.cu) file.\n",
     "\n",
+    "Alternatively, you can navigate to `CFD/English/C/source_code/cuda/` directory in Jupyter's file browser in the left pane. Then, click to open the `jacobi_streams.cu` file.\n",
+    "\n",
     "Note that we create 3 streams- `compute_stream`, `push_top_stream`, and `push_bottom_stream` for each GPU. We will compute the Jacobi iteration and perform GPU-local L2 norm copy operation on the `compute_stream`. Each GPU will perform its top and bottom halo copy operation to its neighbours using the `push_top_stream` and `push_bottom_stream` streams, respectively. \n",
     "\n",
     "Now, within the iterative Jacobi loop (the `while` loop), implement the following marked as `TODO: Part 3-`:\n",
@@ -248,6 +250,8 @@
     "\n",
     "Let's implement CUDA Events with Streams in our application. Open the [jacobi_streams_events.cu](../../source_code/cuda/jacobi_streams_events.cu) file.\n",
     "\n",
+    "Alternatively, you can navigate to `CFD/English/C/source_code/cuda/` directory in Jupyter's file browser in the left pane. Then, click to open the `jacobi_streams_events.cu` file.\n",
+    "\n",
     "Note that we create 5 events for each device, `compute_done`, `push_top_done[0]`, `push_top_done[1]`, `push_bottom_done[0]`, and `push_bottom_done[1]`. We need 2 events for each halo on every device:\n",
     "\n",
     "1. To synchronize \"top\" and \"bottom\" neighbour's `push_bottom_stream` and `push_top_stream` copy operations of $(i-1)^{th}$ iteration, respectively, before computing $i^{th}$ Jacobi iteration in `compute_stream`.\n",

+ 3 - 1
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/cuda_aware.ipynb

@@ -87,7 +87,9 @@
     "\n",
     "## Implementation Exercise: Part 2\n",
     "\n",
-    "Open the [jacobi_cuda_aware_mpi.cpp](../../source_code/mpi/jacobi_cuda_aware_mpi.cpp) and [jacobi_kernels.cu](../../source_code/mpi/jacobi_kernels.cu) files in the `source_code/mpi` directory. The `jacobi_kernels.cu` file is same as in previous lab. Also open the [Makefile](../../source_code/mpi/Makefile) and note how the compilation and linking is also same as in previous lab.\n",
+    "Open the [jacobi_cuda_aware_mpi.cpp](../../source_code/mpi/jacobi_cuda_aware_mpi.cpp) and [jacobi_kernels.cu](../../source_code/mpi/jacobi_kernels.cu) files. Alternatively, you can navigate to `CFD/English/C/source_code/mpi/` directory in Jupyter's file browser in the left pane. Then, click to open the `jacobi_cuda_aware_mpi.cpp` and `jacobi_kernels.cu` files. The `jacobi_kernels.cu` file is same as in previous lab. \n",
+    "\n",
+    "Also open the [Makefile](../../source_code/mpi/Makefile) and note how the compilation and linking is also same as in previous lab.\n",
     "\n",
     "Understand the flow of the `jacobi_cuda_aware_mpi.cpp` program and observe the following:\n",
     "\n",

+ 1 - 1
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/memcpy.ipynb

@@ -135,7 +135,7 @@
     "\n",
     "### Code Structure\n",
     "\n",
-    "Open the [jacobi_memcpy_mpi.cpp](../../source_code/mpi/jacobi_memcpy_mpi.cpp) file and the [jacobi_kernels.cu](../../source_code/mpi/jacobi_kernels.cu) files from the `source_code/mpi` directory.\n",
+    "Open the [jacobi_memcpy_mpi.cpp](../../source_code/mpi/jacobi_memcpy_mpi.cpp) file and the [jacobi_kernels.cu](../../source_code/mpi/jacobi_kernels.cu) files. Alternatively, you can navigate to `CFD/English/C/source_code/mpi/` directory in Jupyter's file browser in the left pane. Then, click to open the `jacobi_memcpy_mpi.cpp` and `jacobi_kernels.cu` files.\n",
     "\n",
     "We separate the device kernels from other CUDA and MPI functions as `nvc++` compiler is required to compile CUDA C++ which may not be installed on some platforms Note that NVIDIA's HPC SDK includes the `nvc++` compiler.\n",
     "\n",

+ 1 - 1
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/mpi/multi_node_intro.ipynb

@@ -76,7 +76,7 @@
     "}\n",
     "```\n",
     "\n",
-    "To access the program, open the [hello_world.c](../../source_code/mpi/hello_world.c) file in `C/source_code/mpi/` directory.\n",
+    "To access the program, open the [hello_world.c](../../source_code/mpi/hello_world.c) file. Alternatively, you can navigate to `CFD/English/C/source_code/mpi/` directory in Jupyter's file browser in the left pane. Then, click to open the `hello_world.c` file.\n",
     "\n",
     "The MPI environment is initialized with `MPI_Init` through which all of MPI’s global and internal variables are constructed. A \"communicator\" is created between all processes that are spawned, and unique ranks are assigned to each process. \n",
     "\n",

File diff suppressed because it is too large
+ 276 - 61
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/nccl/nccl.ipynb


File diff suppressed because it is too large
+ 461 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/jupyter_notebook/nvshmem/nvshmem.ipynb


+ 16 - 23
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/cuda/jacobi_memcpy.cu

@@ -224,21 +224,20 @@ int main(int argc, char* argv[]) {
         if (p2p == true) {
             const int top = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
             int canAccessPeer = 0;
-	    // TODO: Part 2- Check whether GPU "devices[dev_id]" can access peer "devices[top]"
-            CUDA_RT_CALL(cudaDeviceCanAccessPeer(&canAccessPeer, devices[dev_id], devices[top]));
+            // TODO: Part 2- Check whether GPU "devices[dev_id]" can access peer "devices[top]"
+            CUDA_RT_CALL(cudaDeviceCanAccessPeer(&canAccessPeer, /*Fill me*/, /*Fill me*/));
             if (canAccessPeer) {
-		// TODO: Part 2- Enable peer access from GPU "devices[dev_id]" to "devices[top]"
-                CUDA_RT_CALL(cudaDeviceEnablePeerAccess(devices[top], 0));
+            // TODO: Part 2- Enable peer access from GPU "devices[dev_id]" to "devices[top]"
+                CUDA_RT_CALL(cudaDeviceEnablePeerAccess(/*Fill me*/, 0));
             }
             const int bottom = (dev_id + 1) % num_devices;
             if (top != bottom) {
                 canAccessPeer = 0;
-		// TODO: Part 2- Check and enable peer access from GPU "devices[dev_id]" to
-		// "devices[bottom]", whenever possible
-                CUDA_RT_CALL(cudaDeviceCanAccessPeer(&canAccessPeer, 
-					devices[dev_id], devices[bottom]));
+                // TODO: Part 2- Check and enable peer access from GPU "devices[dev_id]" to
+                // "devices[bottom]", whenever possible
+                CUDA_RT_CALL(cudaDeviceCanAccessPeer(&canAccessPeer, /*Fill me*/, /*Fill me*/));
                 if (canAccessPeer) {
-                    CUDA_RT_CALL(cudaDeviceEnablePeerAccess(devices[bottom], 0));
+                    CUDA_RT_CALL(cudaDeviceEnablePeerAccess(/*Fill me*/, 0));
                 }
             }
         }
@@ -274,7 +273,7 @@ int main(int argc, char* argv[]) {
 	    // Launch device kernel on each GPU
         for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
             // TODO: Part 1- Set current GPU to be "devices[dev_id]"
-            CUDA_RT_CALL(cudaSetDevice(devices[dev_id]));
+            CUDA_RT_CALL(cudaSetDevice(/*Fill me*/));
 
             CUDA_RT_CALL(cudaMemsetAsync(l2_norm_d[dev_id], 0, sizeof(float)));
             dim3 dim_grid((nx + BLOCK_DIM_X - 1) / BLOCK_DIM_X,
@@ -283,13 +282,10 @@ int main(int argc, char* argv[]) {
             // TODO: Part 1- Call Jacobi kernel with "dim_grid" blocks in grid and "dim_block"
             // blocks per thread. "dev_id" variable points to corresponding memory allocated 
             // for the current GPU.
-            jacobi_kernel<<<dim_grid, dim_block>>>(
-                    a_new[dev_id], a[dev_id], l2_norm_d[dev_id], iy_start[dev_id], iy_end[dev_id],
-                    nx);
+            jacobi_kernel<<</*Fill me*/, /*Fill me*/>>>(/*Fill me*/);
 
             // TODO: Part 1- Copy GPU-local L2 norm "l2_norm_d" back to CPU "l2_norm_h"
-            CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_h[dev_id], l2_norm_d[dev_id], sizeof(float),
-                     cudaMemcpyDeviceToHost));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, sizeof(float), /*Fill me*/));
 	}
     // Launch async memory copy operations for halo exchange and 
 	// for copying local-grid L2 norm from each GPU to host
@@ -298,23 +294,20 @@ int main(int argc, char* argv[]) {
             const int bottom = (dev_id + 1) % num_devices;
             
             // TODO: Part 1- Set current GPU
-            CUDA_RT_CALL(cudaSetDevice(devices[dev_id]));
+            CUDA_RT_CALL(cudaSetDevice(/*Fill me*/));
 
             // TODO: Part 1- Implement halo exchange with top neighbour "top"
-            CUDA_RT_CALL(cudaMemcpyAsync(a_new[top] + (iy_end[top] * nx),
-                                         a_new[dev_id] + iy_start[dev_id] * nx, nx * sizeof(float),
-                                         cudaMemcpyDeviceToDevice));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, nx * sizeof(float), /*Fill me*/));
 	    
             // TODO: Part 1- Implement halo exchange with bottom neighbour "bottom"
-            CUDA_RT_CALL(cudaMemcpyAsync(a_new[bottom], a_new[dev_id] + (iy_end[dev_id] - 1) * nx,
-                                         nx * sizeof(float), cudaMemcpyDeviceToDevice));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, nx * sizeof(float), /*Fill me*/));
         }
         l2_norm = 0.0;
         // Synchronize devices and compute global L2 norm
         for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
             // TODO: part 1- Set current GPU and call cudaDeviceSynchronize()
-	        CUDA_RT_CALL(cudaSetDevice(devices[dev_id]));
-            CUDA_RT_CALL(cudaDeviceSynchronize());
+	        CUDA_RT_CALL(cudaSetDevice(/*Fill me*/));
+            CUDA_RT_CALL(/*Fill me*/);
 
             l2_norm += *(l2_norm_h[dev_id]);
         }
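
As a reference for the peer-access TODOs above, a minimal standalone sketch of the query-then-enable pattern, using devices 0 and 1 for illustration (not part of this commit):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int num_devices = 0;
        cudaGetDeviceCount(&num_devices);
        if (num_devices < 2) { printf("Need at least 2 GPUs.\n"); return 0; }

        int can_01 = 0, can_10 = 0;
        cudaDeviceCanAccessPeer(&can_01, 0, 1);   // can device 0 access device 1's memory?
        cudaDeviceCanAccessPeer(&can_10, 1, 0);   // can device 1 access device 0's memory?

        if (can_01) {
            cudaSetDevice(0);                     // peer access is enabled from the current device
            cudaDeviceEnablePeerAccess(1, 0);
        }
        if (can_10) {
            cudaSetDevice(1);
            cudaDeviceEnablePeerAccess(0, 0);
        }
        printf("peer access 0->1: %d, 1->0: %d\n", can_01, can_10);
        return 0;
    }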

+ 10 - 14
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/cuda/jacobi_streams.cu

@@ -263,22 +263,20 @@ int main(int argc, char* argv[]) {
             // from the previous iteration by synchronizing "push_top_stream" and
             // "push_bottom_stream" streams. Be careful with which neighbour's top stream and
             // which neighbour's bottom stream needs to be synchronized.
-            CUDA_RT_CALL(cudaStreamSynchronize(push_top_stream[bottom]));
-            CUDA_RT_CALL(cudaStreamSynchronize(push_bottom_stream[top]));
+            CUDA_RT_CALL(cudaStreamSynchronize(/*Fill me*/));
+            CUDA_RT_CALL(cudaStreamSynchronize(/*Fill me*/));
 
             dim3 dim_grid((nx + BLOCK_DIM_X - 1) / BLOCK_DIM_X,
                           (chunk_size[dev_id] + BLOCK_DIM_Y - 1) / BLOCK_DIM_Y, 1);
 
             // TODO: Part 3- Launch Jacobi kernel on "compute_stream[dev_id]" and all other
             // functional arguments
-            jacobi_kernel<<<dim_grid, dim_block, 0, compute_stream[dev_id]>>>(
-                    a_new[dev_id], a[dev_id], l2_norm_d[dev_id], iy_start[dev_id], iy_end[dev_id],
-                    nx);
+            jacobi_kernel<<</*Fill me*/, /*Fill me*/, 0, /*Fill me*/>>>(/*Fill me*/);
 
             // TODO: Part 3- Copy GPU-local L2 norm "l2_norm_d" back to CPU "l2_norm_h" on
             // "compute_stream[dev_id]"
-            CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_h[dev_id], l2_norm_d[dev_id], sizeof(float),
-                     cudaMemcpyDeviceToHost, compute_stream[dev_id]));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, sizeof(float),
+                                            /*Fill me*/, /*Fill me*/));
         }    
         for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
             const int top = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
@@ -287,20 +285,18 @@ int main(int argc, char* argv[]) {
 
             // TODO: Part 3- Before copying the updated halos to neighbours, ensure the 
             // computation is complete by synchronizing "compute_stream[dev_id]" stream
-            CUDA_RT_CALL(cudaStreamSynchronize(compute_stream[dev_id]));
+            CUDA_RT_CALL(cudaStreamSynchronize(/*Fill me*/));
 
             // Apply periodic boundary conditions
             // TODO: Part 3- Implement halo exchange with top neighbour on current device's 
             // "push_top_stream"
-            CUDA_RT_CALL(cudaMemcpyAsync(a_new[top] + (iy_end[top] * nx),
-                                         a_new[dev_id] + iy_start[dev_id] * nx, nx * sizeof(float),
-                                         cudaMemcpyDeviceToDevice, push_top_stream[dev_id]));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, nx * sizeof(float),
+                                         /*Fill me*/, /*Fill me*/));
 
             // TODO: Part 3- Implement halo exchange with "bottom" neighbour on current device's 
             // "push_bottom_stream"
-            CUDA_RT_CALL(cudaMemcpyAsync(a_new[bottom], a_new[dev_id] + (iy_end[dev_id] - 1) * nx,
-                                         nx * sizeof(float), cudaMemcpyDeviceToDevice,
-                                         push_bottom_stream[dev_id]));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, nx * sizeof(float),
+                                         /*Fill me*/, /*Fill me*/));
         }
         l2_norm = 0.0;
         for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
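
The TODOs above queue a kernel and an asynchronous device-to-host copy on the same per-GPU stream and later synchronize only that stream. A minimal standalone sketch of this stream-ordering pattern (the kernel and sizes are illustrative):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float* x, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    int main() {
        const int n = 1 << 20;
        float *d = nullptr, *h = nullptr;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        cudaMallocHost(&h, n * sizeof(float));     // pinned host memory so the copy can be truly async

        cudaStream_t compute_stream;
        cudaStreamCreate(&compute_stream);

        // Both operations are issued on compute_stream, so they execute in order
        // without blocking work queued on any other stream.
        scale<<<(n + 255) / 256, 256, 0, compute_stream>>>(d, 2.0f, n);
        cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, compute_stream);

        cudaStreamSynchronize(compute_stream);     // wait only for this stream's work
        printf("h[0] = %f\n", h[0]);

        cudaStreamDestroy(compute_stream);
        cudaFreeHost(h);
        cudaFree(d);
        return 0;
    }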

+ 16 - 20
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/cuda/jacobi_streams_events.cu

@@ -265,11 +265,8 @@ int main(int argc, char* argv[]) {
             // neighbours are not copied to "dev_id". The "push_top_done" and "push_bottom_done" 
             // events are to monitored for "bottom" and "top" neighbours, respectively for the 
             // previous iteration denoted by "iter % 2".
-            // Note that there should be 2 distinct cudaStreamWaitEvent calls.
-            CUDA_RT_CALL(
-                cudaStreamWaitEvent(compute_stream[dev_id], push_top_done[(iter % 2)][bottom], 0));
-            CUDA_RT_CALL(
-                cudaStreamWaitEvent(compute_stream[dev_id], push_bottom_done[(iter % 2)][top], 0));
+            CUDA_RT_CALL(cudaStreamWaitEvent(/*Fill me*/, /*Fill me*/, 0));
+            CUDA_RT_CALL(cudaStreamWaitEvent(/*Fill me*/, /*Fill me*/, 0));
 
             dim3 dim_grid((nx + BLOCK_DIM_X - 1) / BLOCK_DIM_X,
                           (chunk_size[dev_id] + BLOCK_DIM_Y - 1) / BLOCK_DIM_Y, 1);
@@ -280,10 +277,10 @@ int main(int argc, char* argv[]) {
 
             // TODO: Part 4- Record that Jacobi computation on "compute_stream" is done by using
             // cudaEventRecord for "compute_done" event for "dev_id"
-            CUDA_RT_CALL(cudaEventRecord(compute_done[dev_id], compute_stream[dev_id]));
+            CUDA_RT_CALL(cudaEventRecord(/*Fill me*/, /*Fill me*/));
 
-            CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_h[dev_id], l2_norm_d[dev_id], sizeof(float),
-                     cudaMemcpyDeviceToHost, compute_stream[dev_id]));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, sizeof(float),
+                                            /*Fill me*/, /*Fill me*/));
         }    
         for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
             const int top = dev_id > 0 ? dev_id - 1 : (num_devices - 1);
@@ -294,28 +291,27 @@ int main(int argc, char* argv[]) {
             // TODO: Part 4- Wait for the Jacobi computation of "dev_id" to complete by using the
             // "compute_done" event on "push_top_stream" so that the top halo isn't copied to the
             // neighbour before computation is done
-            CUDA_RT_CALL(cudaStreamWaitEvent(push_top_stream[dev_id], compute_done[dev_id], 0));
-            CUDA_RT_CALL(cudaMemcpyAsync(a_new[top] + (iy_end[top] * nx),
-                                         a_new[dev_id] + iy_start[dev_id] * nx, nx * sizeof(float),
-                                         cudaMemcpyDeviceToDevice, push_top_stream[dev_id]));
+            CUDA_RT_CALL(cudaStreamWaitEvent(/*Fill me*/, /*Fill me*/, 0));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, nx * sizeof(float),
+                                         /*Fill me*/, /*Fill me*/));
+
             // TODO: Part 4- Record completion of top halo copy from "dev_id" to its neighbour
             // to be used in next iteration. Record the event for "push_top_done" stream of 
             // "dev_id" for next iteration which is "(iter+1) % 2"
-            CUDA_RT_CALL(
-                cudaEventRecord(push_top_done[((iter + 1) % 2)][dev_id], push_top_stream[dev_id]));
+            CUDA_RT_CALL(cudaEventRecord(/*Fill me*/, /*Fill me*/));
 
             // TODO: Part 4- Wait for the Jacobi computation of "dev_id" to complete by using the
             // "compute_done" event on "push_bottom_stream" so that the bottom halo isn't copied to
             // the neighbour before computation is done
-            CUDA_RT_CALL(cudaStreamWaitEvent(push_bottom_stream[dev_id], compute_done[dev_id], 0));
-            CUDA_RT_CALL(cudaMemcpyAsync(a_new[bottom], a_new[dev_id] + (iy_end[dev_id] - 1) * nx,
-                                         nx * sizeof(float), cudaMemcpyDeviceToDevice,
-                                         push_bottom_stream[dev_id]));
+            CUDA_RT_CALL(cudaStreamWaitEvent(/*Fill me*/, /*Fill me*/, 0));
+            CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, nx * sizeof(float),
+                                         /*Fill me*/, /*Fill me*/));
+                                         
             // TODO: Part 4- Record completion of bottom halo copy from "dev_id" to its neighbour
             // to be used in next iteration. Record the event for "push_bottom_done" stream of 
             // "dev_id" for next iteration which is "(iter+1) % 2"
-            CUDA_RT_CALL(cudaEventRecord(push_bottom_done[((iter + 1) % 2)][dev_id],
-                                         push_bottom_stream[dev_id]));
+            CUDA_RT_CALL(cudaEventRecord(/*Fill me*/, /*Fill me*/));
         }
         l2_norm = 0.0;
         for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
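
The event-based TODOs above follow one pattern: record an event on the producing stream, then have the consuming stream wait on that event instead of synchronizing the whole device. A minimal sketch of the pattern with two streams (names are illustrative, not the lab code):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(float* x) { x[0] += 1.0f; }

    int main() {
        float* d = nullptr;
        cudaMalloc(&d, sizeof(float));
        cudaMemset(d, 0, sizeof(float));

        cudaStream_t compute_stream, copy_stream;
        cudaStreamCreate(&compute_stream);
        cudaStreamCreate(&copy_stream);

        cudaEvent_t compute_done;
        cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming);

        work<<<1, 1, 0, compute_stream>>>(d);
        cudaEventRecord(compute_done, compute_stream);      // mark the point the copy must wait for
        cudaStreamWaitEvent(copy_stream, compute_done, 0);   // copy_stream stalls until the event fires

        float h = 0.0f;
        cudaMemcpyAsync(&h, d, sizeof(float), cudaMemcpyDeviceToHost, copy_stream);
        cudaStreamSynchronize(copy_stream);
        printf("h = %f\n", h);

        cudaEventDestroy(compute_done);
        cudaStreamDestroy(copy_stream);
        cudaStreamDestroy(compute_stream);
        cudaFree(d);
        return 0;
    }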

+ 11 - 8
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi_cuda_aware_mpi.cpp

@@ -207,20 +207,23 @@ int main(int argc, char* argv[]) {
 	    CUDA_RT_CALL(cudaDeviceSynchronize());
 
         nvtxRangePush("Halo exchange CUDA-aware MPI");
-        // TODO: Part 2- 
-        MPI_CALL(MPI_Sendrecv(a_new + iy_start * nx, nx, MPI_FLOAT, top, 0,
-                              a_new + (iy_end * nx), nx, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD,
+        // TODO: Part 2- Implement top halo exchange. Use only GPU buffers in the MPI call's 
+        // function arguments.
+        MPI_CALL(MPI_Sendrecv(/*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0,
+                              /*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0, MPI_COMM_WORLD,
                               MPI_STATUS_IGNORE));
         nvtxRangePop(); 
 
         nvtxRangePush("Halo exchange CUDA-aware MPI");
-        // TODO: Part 2- 
-        MPI_CALL(MPI_Sendrecv(a_new + (iy_end - 1) * nx, nx, MPI_FLOAT, bottom, 0, a_new, nx,
-                              MPI_FLOAT, top, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE));
+        // TODO: Part 2- Implement bottom halo exchange. Use only GPU buffers in the MPI call's 
+        // function arguments.
+        MPI_CALL(MPI_Sendrecv(/*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0, 
+                                /*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0, MPI_COMM_WORLD, 
+                                MPI_STATUS_IGNORE));
         nvtxRangePop(); 
 
-        // TODO: Part 2- 
-        MPI_CALL(MPI_Allreduce(l2_norm_h, &l2_norm, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD));
+        // TODO: Part 2- Reduce the rank-local L2 Norm to a global L2 norm
+        MPI_CALL(MPI_Allreduce(/*Fill me*/, /*Fill me*/, 1, MPI_FLOAT, /*Fill me*/, MPI_COMM_WORLD));
         l2_norm = std::sqrt(l2_norm);
         
         iter++;
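
The TODOs above assume a CUDA-aware MPI build, so device pointers are passed directly to `MPI_Sendrecv` and `MPI_Allreduce` with no staging through host buffers. A minimal sketch of that usage (one GPU per rank and the buffer names are assumptions; this is not the exercise solution):

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <mpi.h>

    int main(int argc, char* argv[]) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int nx = 1024;
        const int top = rank > 0 ? rank - 1 : size - 1;   // periodic neighbours
        const int bottom = (rank + 1) % size;

        float* d_rows = nullptr;                          // [0, nx) = send row, [nx, 2*nx) = recv row
        cudaSetDevice(0);
        cudaMalloc(&d_rows, 2 * nx * sizeof(float));
        cudaMemset(d_rows, 0, 2 * nx * sizeof(float));

        // Halo exchange straight from device memory: CUDA-aware MPI detects the GPU pointers.
        MPI_Sendrecv(d_rows, nx, MPI_FLOAT, top, 0,
                     d_rows + nx, nx, MPI_FLOAT, bottom, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Reduce a rank-local scalar into a global sum available on every rank.
        float local_norm = 1.0f, global_norm = 0.0f;
        MPI_Allreduce(&local_norm, &global_norm, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0) printf("global sum = %f\n", global_norm);

        cudaFree(d_rows);
        MPI_Finalize();
        return 0;
    }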

+ 11 - 14
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/mpi/jacobi_memcpy_mpi.cpp

@@ -115,14 +115,13 @@ int main(int argc, char* argv[]) {
 
     int local_rank = -1;
     // TODO: Part 1- Obtain the node-level local rank by splitting the global communicator
-    // Make sure ot free the local communicator after its use
+    // Free the local communicator after its use
     MPI_Comm local_comm;
-    MPI_CALL(MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
-                                    &local_comm));
+    MPI_CALL(MPI_Comm_split_type(/*Fill me*/));
 
-    MPI_CALL(MPI_Comm_rank(local_comm, &local_rank));
+    MPI_CALL(MPI_Comm_rank(/*Fill me*/));
 
-    MPI_CALL(MPI_Comm_free(&local_comm));
+    MPI_CALL(MPI_Comm_free(/*Fill me*/));
 
     CUDA_RT_CALL(cudaSetDevice(local_rank % num_devices));
     CUDA_RT_CALL(cudaFree(0));
@@ -218,8 +217,8 @@ int main(int argc, char* argv[]) {
                                 cudaMemcpyDeviceToHost));
         // TODO: Part 1- Implement the first set of halo exchanges using MPI_SendRecv explained 
         // in the Jupyter Notebook. Observe the Memcpy operations above and below this comment
-        MPI_CALL(MPI_Sendrecv(top_halo_buf, nx, MPI_FLOAT, top, 0,
-                              bot_halo_buf, nx, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD,
+        MPI_CALL(MPI_Sendrecv(/*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0,
+                              /*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0, MPI_COMM_WORLD,
                               MPI_STATUS_IGNORE));
         CUDA_RT_CALL(cudaMemcpy(a_new + (iy_end * nx), bot_halo_buf, nx * sizeof(float), 
                                 cudaMemcpyHostToDevice));
@@ -229,17 +228,15 @@ int main(int argc, char* argv[]) {
         // Second set of halo exchanges
         // TODO: Part 1- Implement the Memcpy operations and MPI calls for the second set of
         // halo exchanges
-        CUDA_RT_CALL(cudaMemcpy(bot_halo_buf, a_new + (iy_end - 1) * nx, nx * sizeof(float), 
-                                cudaMemcpyDeviceToHost));
-        MPI_CALL(MPI_Sendrecv(bot_halo_buf, nx, MPI_FLOAT, bottom, 0, 
-                                top_halo_buf, nx, MPI_FLOAT, top, 0, MPI_COMM_WORLD, 
+        CUDA_RT_CALL(cudaMemcpy(/*Fill me*/, /*Fill me*/, nx * sizeof(float), /*Fill me*/));
+        MPI_CALL(MPI_Sendrecv(/*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0, 
+                                /*Fill me*/, nx, MPI_FLOAT, /*Fill me*/, 0, MPI_COMM_WORLD, 
                                 MPI_STATUS_IGNORE));
-        CUDA_RT_CALL(cudaMemcpy(a_new, top_halo_buf, nx * sizeof(float), 
-                                cudaMemcpyHostToDevice));
+        CUDA_RT_CALL(cudaMemcpy(/*Fill me*/, /*Fill me*/, nx * sizeof(float), /*Fill me*/));
         nvtxRangePop();                        
 
         // TODO: Part 1- Reduce the rank-local L2 Norm to a global L2 norm using MPI_Allreduce
-        MPI_CALL(MPI_Allreduce(l2_norm_h, &l2_norm, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD));
+        MPI_CALL(MPI_Allreduce(/*Fill me*/, /*Fill me*/, 1, MPI_FLOAT, /*Fill me*/, MPI_COMM_WORLD));
         
         l2_norm = std::sqrt(l2_norm);
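
The first TODO in this file derives a node-local rank by splitting `MPI_COMM_WORLD` into per-node shared-memory communicators, which is then used to select a GPU. A minimal standalone sketch of that call sequence (illustrative, not the exercise solution):

    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char* argv[]) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Group ranks that can share memory, i.e. ranks running on the same node.
        MPI_Comm local_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                            MPI_INFO_NULL, &local_comm);

        int local_rank = -1;
        MPI_Comm_rank(local_comm, &local_rank);   // rank within this node only
        MPI_Comm_free(&local_comm);               // free the communicator after use

        printf("global rank %d -> node-local rank %d\n", rank, local_rank);
        // A typical next step is cudaSetDevice(local_rank % num_devices_on_node).

        MPI_Finalize();
        return 0;
    }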
 

+ 8 - 26
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/Makefile

@@ -1,42 +1,24 @@
 # Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
-NP ?= 1
 NVCC=nvcc
 MPICXX=mpicxx
 MPIRUN ?= mpirun
 #CUDA_HOME ?= /usr/local/cuda
-#NCCL_HOME ?= /usr
-GENCODE_SM30	:= -gencode arch=compute_30,code=sm_30
-GENCODE_SM35	:= -gencode arch=compute_35,code=sm_35
-GENCODE_SM37	:= -gencode arch=compute_37,code=sm_37
-GENCODE_SM50	:= -gencode arch=compute_50,code=sm_50
-GENCODE_SM52	:= -gencode arch=compute_52,code=sm_52
-GENCODE_SM60    := -gencode arch=compute_60,code=sm_60
+#NCCL_HOME ?= /usr/nccl/
 GENCODE_SM70    := -gencode arch=compute_70,code=sm_70
-GENCODE_SM80    := -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80
+GENCODE_SM80    := -gencode arch=compute_80,code=sm_80
 GENCODE_FLAGS	:= $(GENCODE_SM70) $(GENCODE_SM80)
-ifdef DISABLE_CUB
-        NVCC_FLAGS = -Xptxas --optimize-float-atomics
-else
-        NVCC_FLAGS = -DHAVE_CUB
-endif
+
 NVCC_FLAGS += -lineinfo $(GENCODE_FLAGS) -std=c++14
-MPICXX_FLAGS = -DUSE_NVTX -I$(CUDA_HOME)/include -I$(NCCL_HOME)/include -std=c++14
+MPICXX_FLAGS = -DUSE_NVTX -I$(CUDA_HOME)/include -I$(NCCL_HOME)/include -fopenmp -std=c++14
 LD_FLAGS = -L$(CUDA_HOME)/lib64 -lcudart -lnvToolsExt -lnccl
-jacobi: Makefile jacobi.cpp jacobi_kernels.o
-	$(MPICXX) $(MPICXX_FLAGS) jacobi.cpp jacobi_kernels.o $(LD_FLAGS) -o jacobi
+
+jacobi_nccl: Makefile jacobi_nccl.cpp jacobi_kernels.o
+	$(MPICXX) $(MPICXX_FLAGS) jacobi_nccl.cpp jacobi_kernels.o $(LD_FLAGS) -o jacobi_nccl
 
 jacobi_kernels.o: Makefile jacobi_kernels.cu
 	$(NVCC) $(NVCC_FLAGS) jacobi_kernels.cu -c
 
 .PHONY.: clean
 clean:
-	rm -f jacobi jacobi_kernels.o *.qdrep jacobi.*.compute-sanitizer.log
-
-sanitize: jacobi
-	$(MPIRUN) -np $(NP) compute-sanitizer --log-file jacobi.%q{OMPI_COMM_WORLD_RANK}.compute-sanitizer.log ./jacobi -niter 10
-
-run: jacobi
-	$(MPIRUN) -np $(NP) ./jacobi
+	rm -rf jacobi_nccl jacobi_kernels.o *.qdrep *.sqlite
 
-profile: jacobi
-	$(MPIRUN) -np $(NP) nsys profile --trace=mpi,cuda,nvtx -o jacobi.%q{OMPI_COMM_WORLD_RANK} ./jacobi -niter 10

+ 38 - 53
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/jacobi_kernels.cu

@@ -26,9 +26,8 @@
  */
 #include <cstdio>
 
-#ifdef HAVE_CUB
-#include <cub/block/block_reduce.cuh>
-#endif  // HAVE_CUB
+#define BLOCK_DIM_X 32
+#define BLOCK_DIM_Y 32
 
 #define CUDA_RT_CALL(call)                                                                  \
     {                                                                                       \
@@ -41,19 +40,10 @@
                     #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
     }
 
-#ifdef USE_DOUBLE
-typedef double real;
-#define MPI_REAL_TYPE MPI_DOUBLE
-#else
-typedef float real;
-#define MPI_REAL_TYPE MPI_FLOAT
-#endif
-
-__global__ void initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
-                                      const real pi, const int offset, const int nx,
-                                      const int my_ny, const int ny) {
+__global__ void initialize_boundaries(float*  a_new, float*  a, const float pi, const int offset, 
+                    const int nx, const int my_ny, const int ny) {
     for (int iy = blockIdx.x * blockDim.x + threadIdx.x; iy < my_ny; iy += blockDim.x * gridDim.x) {
-        const real y0 = sin(2.0 * pi * (offset + iy) / (ny - 1));
+        const float y0 = sin(2.0 * pi * (offset + iy) / (ny - 1));
         a[iy * nx + 0] = y0;
         a[iy * nx + (nx - 1)] = y0;
         a_new[iy * nx + 0] = y0;
@@ -61,53 +51,48 @@ __global__ void initialize_boundaries(real* __restrict__ const a_new, real* __re
     }
 }
 
-void launch_initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
-                                  const real pi, const int offset, const int nx, const int my_ny,
-                                  const int ny) {
-    initialize_boundaries<<<my_ny / 128 + 1, 128>>>(a_new, a, pi, offset, nx, my_ny, ny);
-    CUDA_RT_CALL(cudaGetLastError());
-}
-
-template <int BLOCK_DIM_X, int BLOCK_DIM_Y>
-__global__ void jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
-                              real* __restrict__ const l2_norm, const int iy_start,
-                              const int iy_end, const int nx, const bool calculate_norm) {
-#ifdef HAVE_CUB
-    typedef cub::BlockReduce<real, BLOCK_DIM_X, cub::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y>
-        BlockReduce;
-    __shared__ typename BlockReduce::TempStorage temp_storage;
-#endif  // HAVE_CUB
+__global__ void jacobi_kernel(float*  a_new, const float*  a, float*  l2_norm, const int iy_start,
+                              const int iy_end, const int nx) {
     int iy = blockIdx.y * blockDim.y + threadIdx.y + iy_start;
     int ix = blockIdx.x * blockDim.x + threadIdx.x + 1;
-    real local_l2_norm = 0.0;
+    __shared__ float block_l2_sum[BLOCK_DIM_X*BLOCK_DIM_Y];
+    unsigned thread_index = threadIdx.y*BLOCK_DIM_X + threadIdx.x;
 
     if (iy < iy_end && ix < (nx - 1)) {
-        const real new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
+        // Update grid point
+        const float new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
                                      a[(iy + 1) * nx + ix] + a[(iy - 1) * nx + ix]);
         a_new[iy * nx + ix] = new_val;
-        if (calculate_norm) {
-            real residue = new_val - a[iy * nx + ix];
-            local_l2_norm += residue * residue;
+        float residue = new_val - a[iy * nx + ix];
+        // Set block-level L2 norm value for this grid point
+        block_l2_sum[thread_index] = residue * residue;
+    }
+    else {
+        block_l2_sum[thread_index] = 0;
+    }
+    // Reduce L2 norm for the block in parallel
+    for (unsigned stride = 1; stride < BLOCK_DIM_X*BLOCK_DIM_Y; stride *= 2) {
+        __syncthreads();
+        if ((thread_index) % (2*stride) == 0) {
+            block_l2_sum[thread_index] += block_l2_sum[thread_index + stride];
         }
     }
-    if (calculate_norm) {
-#ifdef HAVE_CUB
-        real block_l2_norm = BlockReduce(temp_storage).Sum(local_l2_norm);
-        if (0 == threadIdx.y && 0 == threadIdx.x) atomicAdd(l2_norm, block_l2_norm);
-#else
-        atomicAdd(l2_norm, local_l2_norm);
-#endif  // HAVE_CUB
+    // Atomically update global L2 norm with block-reduced L2 norm
+    if (thread_index == 0) {
+        atomicAdd(l2_norm, block_l2_sum[0]);
     }
 }
 
-void launch_jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
-                          real* __restrict__ const l2_norm, const int iy_start, const int iy_end,
-                          const int nx, const bool calculate_norm, cudaStream_t stream) {
-    constexpr int dim_block_x = 32;
-    constexpr int dim_block_y = 32;
-    dim3 dim_grid((nx + dim_block_x - 1) / dim_block_x,
-                  ((iy_end - iy_start) + dim_block_y - 1) / dim_block_y, 1);
-    jacobi_kernel<dim_block_x, dim_block_y><<<dim_grid, {dim_block_x, dim_block_y, 1}, 0, stream>>>(
-        a_new, a, l2_norm, iy_start, iy_end, nx, calculate_norm);
-    CUDA_RT_CALL(cudaGetLastError());
+void launch_initialize_boundaries(float*  a_new, float*  a, const float pi, const int offset, 
+                                    const int nx, const int my_ny, const int ny) {
+    initialize_boundaries<<<my_ny / 128 + 1, 128>>>(a_new, a, pi, offset, nx, my_ny, ny);
 }
+
+void launch_jacobi_kernel(float*  a_new, const float*  a, float*  l2_norm, const int iy_start,
+                              const int iy_end, const int nx, cudaStream_t stream) {
+    dim3 dim_block(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
+    dim3 dim_grid((nx + BLOCK_DIM_X - 1) / BLOCK_DIM_X,
+                  ((iy_end - iy_start) + BLOCK_DIM_Y - 1) / BLOCK_DIM_Y, 1);
+    jacobi_kernel<<<dim_grid, dim_block, 0, stream>>>(a_new, a, l2_norm, iy_start, iy_end, nx);
+}
+

+ 406 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/jacobi_nccl.cpp

@@ -0,0 +1,406 @@
+/* Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NVIDIA CORPORATION nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <algorithm>
+#include <cmath>
+#include <cstdio>
+#include <iostream>
+#include <sstream>
+#include <cuda_runtime.h>
+#include <nvToolsExt.h>
+#include <mpi.h>
+#include <omp.h>
+#include <nccl.h>
+
+#define MPI_CALL(call)                                                                \
+    {                                                                                 \
+        int mpi_status = call;                                                        \
+        if (0 != mpi_status) {                                                        \
+            char mpi_error_string[MPI_MAX_ERROR_STRING];                              \
+            int mpi_error_string_length = 0;                                          \
+            MPI_Error_string(mpi_status, mpi_error_string, &mpi_error_string_length); \
+            if (NULL != mpi_error_string)                                             \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %s "                                                    \
+                        "(%d).\n",                                                    \
+                        #call, __LINE__, __FILE__, mpi_error_string, mpi_status);     \
+            else                                                                      \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %d.\n",                                                 \
+                        #call, __LINE__, __FILE__, mpi_status);                       \
+        }                                                                             \
+    }
+
+#define CUDA_RT_CALL(call)                                                                  \
+    {                                                                                       \
+        cudaError_t cudaStatus = call;                                                      \
+        if (cudaSuccess != cudaStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
+    }
+
+#define NCCL_CALL(call)                                                                     \
+    {                                                                                       \
+        ncclResult_t  ncclStatus = call;                                                    \
+        if (ncclSuccess != ncclStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: NCCL call \"%s\" in line %d of file %s failed "                 \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, ncclGetErrorString(ncclStatus), ncclStatus); \
+    }
+
+constexpr float tol = 1.0e-8;
+
+const float PI = 2.0 * std::asin(1.0);
+
+void launch_jacobi_kernel(float*  a_new, const float*  a, float*  l2_norm, const int iy_start,
+                              const int iy_end, const int nx, cudaStream_t stream);
+
+void launch_initialize_boundaries(float*  a_new, float*  a, const float pi, const int offset, 
+                                    const int nx, const int my_ny, const int ny);
+
+double single_gpu(const int nx, const int ny, const int iter_max, 
+                    float* const a_ref_h, bool print);
+
+int get_argval(char** begin, char** end, const std::string& arg, const int default_val) {
+    int argval = default_val;
+    char** itr = std::find(begin, end, arg);
+    if (itr != end && ++itr != end) {
+        std::istringstream inbuf(*itr);
+        inbuf >> argval;
+    }
+    return argval;
+}
+
+bool get_arg(char** begin, char** end, const std::string& arg) {
+    char** itr = std::find(begin, end, arg);
+    if (itr != end) {
+        return true;
+    }
+    return false;
+}
+
+int main(int argc, char* argv[]) {
+    MPI_CALL(MPI_Init(&argc, &argv));
+    int rank;
+    MPI_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
+    int size;
+    MPI_CALL(MPI_Comm_size(MPI_COMM_WORLD, &size));
+
+    ncclUniqueId nccl_uid;
+    if (rank == 0) NCCL_CALL(ncclGetUniqueId(&nccl_uid));
+    MPI_CALL(MPI_Bcast(&nccl_uid, sizeof(ncclUniqueId), MPI_BYTE, 0, MPI_COMM_WORLD));
+
+    const int iter_max = get_argval(argv, argv + argc, "-niter", 1000);
+    const int nx = get_argval(argv, argv + argc, "-nx", 16384);
+    const int ny = get_argval(argv, argv + argc, "-ny", 16384);
+    const bool skip_single_gpu = get_arg(argv, argv + argc, "-skip_single_gpu");
+
+    int local_rank = -1;
+    {
+        MPI_Comm local_comm;
+        MPI_CALL(MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
+                                     &local_comm));
+
+        MPI_CALL(MPI_Comm_rank(local_comm, &local_rank));
+
+        MPI_CALL(MPI_Comm_free(&local_comm));
+    }
+
+    CUDA_RT_CALL(cudaSetDevice(local_rank));
+    CUDA_RT_CALL(cudaFree(0));
+
+    ncclComm_t nccl_comm;
+    NCCL_CALL(ncclCommInitRank(&nccl_comm, size, nccl_uid, rank));
+
+
+    int nccl_version = 0;
+    NCCL_CALL(ncclGetVersion(&nccl_version));
+    if ( nccl_version < 2800 ) {
+        fprintf(stderr,"ERROR NCCL 2.8 or newer is required.\n");
+        NCCL_CALL(ncclCommDestroy(nccl_comm));
+        MPI_CALL(MPI_Finalize());
+        return 1;
+    }
+
+    float* a_ref_h;
+    CUDA_RT_CALL(cudaMallocHost(&a_ref_h, nx * ny * sizeof(float)));
+    float* a_h;
+    CUDA_RT_CALL(cudaMallocHost(&a_h, nx * ny * sizeof(float)));
+    
+    double runtime_serial = 1;
+    if (!skip_single_gpu){
+        runtime_serial = single_gpu(nx, ny, iter_max, a_ref_h, rank == 0);
+    }
+    // ny - 2 rows are distributed amongst `size` ranks in such a way
+    // that each rank gets either (ny - 2) / size or (ny - 2) / size + 1 rows.
+    // This optimizes load balancing when (ny - 2) % size != 0
+    int chunk_size;
+    int chunk_size_low = (ny - 2) / size;
+    int chunk_size_high = chunk_size_low + 1;
+    // To calculate the number of ranks that need to compute an extra row,
+    // the following formula is derived from this equation:
+    // num_ranks_low * chunk_size_low + (size - num_ranks_low) * (chunk_size_low + 1) = ny - 2
+    int num_ranks_low = size * chunk_size_low + size -
+                        (ny - 2);  // Number of ranks with chunk_size = chunk_size_low
+    if (rank < num_ranks_low)
+        chunk_size = chunk_size_low;
+    else
+        chunk_size = chunk_size_high;
+
+    float* a;
+    CUDA_RT_CALL(cudaMalloc(&a, nx * (chunk_size + 2) * sizeof(float)));
+    float* a_new;
+    CUDA_RT_CALL(cudaMalloc(&a_new, nx * (chunk_size + 2) * sizeof(float)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * (chunk_size + 2) * sizeof(float)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * (chunk_size + 2) * sizeof(float)));
+
+    // Calculate local domain boundaries
+    int iy_start_global;  // My start index in the global array
+    if (rank < num_ranks_low) {
+        iy_start_global = rank * chunk_size_low + 1;
+    } else {
+        iy_start_global =
+            num_ranks_low * chunk_size_low + (rank - num_ranks_low) * chunk_size_high + 1;
+    }
+    int iy_end_global = iy_start_global + chunk_size - 1;  // My last index in the global array
+
+    int iy_start = 1;
+    int iy_end = iy_start + chunk_size;
+
+    // Set Dirichlet boundary conditions on left and right border
+    launch_initialize_boundaries(a, a_new, PI, iy_start_global - 1, nx, (chunk_size + 2), ny);
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    float* l2_norm_d;
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(float)));
+    float* l2_global_norm_d;
+    CUDA_RT_CALL(cudaMalloc(&l2_global_norm_d, sizeof(float)));
+    float* l2_norm_h;
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(float)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (0 == rank) {
+        printf("Jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+    }
+
+    int iter = 0;
+    float l2_norm = 1.0;
+
+    MPI_CALL(MPI_Barrier(MPI_COMM_WORLD));
+    double start = MPI_Wtime();
+    nvtxRangePush("Jacobi Solve Multi-GPU");
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemsetAsync(l2_norm_d, 0, sizeof(float)));
+
+        launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, iy_end, nx, 0);
+
+        const int top = rank > 0 ? rank - 1 : (size - 1);
+        const int bottom = (rank + 1) % size;
+
+        // TODO: Reduce the device-local L2 norm, "l2_norm_d" to the global L2 norm on each device,
+        // "l2_global_norm_d", using ncclAllReduce() function. Use "ncclSum" as the reduction operation.
+        // Make sure to encapsulate this function call within NCCL group calls.
+        // Use "0" in the stream parameter function argument.
+        NCCL_CALL(/*Fill me*/);
+        NCCL_CALL(ncclAllReduce(/*Fill me*/, /*Fill me*/, 1, ncclFloat, /*Fill me*/, nccl_comm, 0));
+        NCCL_CALL(/*Fill me*/);
+
+        // TODO: Transfer the global L2 norm from each device to the host using cudaMemcpyAsync
+        CUDA_RT_CALL(cudaMemcpyAsync(/*Fill me*/, /*Fill me*/, sizeof(float), /*Fill me*/));
+
+        // Apply periodic boundary conditions
+        NCCL_CALL(ncclGroupStart());
+        
+        //TODO: Perform the first set of halo exchanges by:
+        // 1. Receiving the top halo from the "top" neighbour into the "a_new" device memory array location. 
+        // 2. Sending current device's bottom halo to "bottom" neighbour from the "a_new + (iy_end - 1) * nx"
+        //    device memory array location.
+        // Use "0" in the stream parameter function argument.
+        NCCL_CALL(ncclRecv(/*Fill me*/, nx, ncclFloat, /*Fill me*/, nccl_comm, 0));
+        NCCL_CALL(ncclSend(/*Fill me*/, nx, ncclFloat, /*Fill me*/, nccl_comm, 0));
+
+        //TODO: Perform the second set of halo exchanges by:
+        // 1. Receiving the bottom halo from the "bottom" neighbour into the "a_new + (iy_end * nx)" 
+        //    device memory array location. 
+        // 2. Sending current device's top halo to "top" neighbour from the "a_new + iy_start * nx"
+        //    device memory array location.
+        // Use "0" in the stream parameter function argument.
+        NCCL_CALL(ncclRecv(/*Fill me*/, nx, ncclFloat, /*Fill me*/, nccl_comm, 0));
+        NCCL_CALL(ncclSend(/*Fill me*/, nx, ncclFloat, /*Fill me*/, nccl_comm, 0));
+
+        NCCL_CALL(ncclGroupEnd());
+
+        // TODO: Synchronize the device before computing the global L2 norm on host for printing
+        CUDA_RT_CALL(/*Fill me*/);
+
+        l2_norm = *l2_norm_h;
+        l2_norm = std::sqrt(l2_norm);
+
+        iter++;
+        if (0 == rank && (iter % 100) == 0) {
+            printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+
+        std::swap(a_new, a);
+    }
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    double stop = MPI_Wtime();
+    nvtxRangePop();
+
+    CUDA_RT_CALL(cudaMemcpy(a_h + iy_start_global * nx, a + nx,
+                            std::min((ny - iy_start_global) * nx, chunk_size * nx) * sizeof(float),
+                            cudaMemcpyDeviceToHost));
+
+    int result_correct = 1;
+    if (!skip_single_gpu) {
+        for (int iy = iy_start_global; result_correct && (iy < iy_end_global); ++iy) {
+            for (int ix = 1; result_correct && (ix < (nx - 1)); ++ix) {
+                if (std::fabs(a_ref_h[iy * nx + ix] - a_h[iy * nx + ix]) > tol) {
+                    fprintf(stderr,
+                            "ERROR on rank %d: a[%d * %d + %d] = %f does not match %f "
+                            "(reference)\n",
+                            rank, iy, nx, ix, a_h[iy * nx + ix], a_ref_h[iy * nx + ix]);
+                    result_correct = 0;
+                }
+            }
+        }
+        int global_result_correct = 1;
+        MPI_CALL(MPI_Allreduce(&result_correct, &global_result_correct, 1, MPI_INT, MPI_MIN,
+                            MPI_COMM_WORLD));
+        result_correct = global_result_correct;
+    }
+
+    if (rank == 0 && result_correct) {
+        printf("Num GPUs: %d.\n", size);
+        if (!skip_single_gpu) {
+            printf(
+                "%dx%d: 1 GPU: %8.4f s, %d GPUs: %8.4f s, speedup: %8.2f, "
+                "efficiency: %8.2f \n",
+                nx, ny, runtime_serial, size, (stop - start), runtime_serial / (stop - start),
+                runtime_serial / (size * (stop - start)) * 100);
+        }
+        else {
+            printf("%dx%d: %d GPUs: %8.4f s \n", nx, ny, size, (stop - start)); 
+        }
+    }
+
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+
+    CUDA_RT_CALL(cudaFreeHost(a_h));
+    CUDA_RT_CALL(cudaFreeHost(a_ref_h));
+
+    NCCL_CALL(ncclCommDestroy(nccl_comm));
+
+    MPI_CALL(MPI_Finalize());
+    return (result_correct == 1) ? 0 : 1;
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, float* const a_ref_h, bool print) {
+    float* a;
+    float* a_new;
+
+    float* l2_norm_d;
+    float* l2_norm_h;
+
+    int iy_start = 1;
+    int iy_end = (ny - 1);
+
+    CUDA_RT_CALL(cudaMalloc(&a, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMalloc(&a_new, nx * ny * sizeof(float)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * ny * sizeof(float)));
+
+    // Set Dirichlet boundary conditions on left and right border
+    nvtxRangePush("Init boundaries");
+    launch_initialize_boundaries(a, a_new, PI, 0, nx, ny, ny);
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    nvtxRangePop();
+
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(float)));
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(float)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (print) {
+        printf("Single GPU jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+    }
+
+    int iter = 0;
+    float l2_norm = 1.0;
+
+    double start = omp_get_wtime();
+    nvtxRangePush("Jacobi Solve Single GPU");
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemset(l2_norm_d, 0, sizeof(float)));
+
+        // Compute grid points for this iteration
+        launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, iy_end, nx, 0);
+        CUDA_RT_CALL(cudaGetLastError());
+        CUDA_RT_CALL(cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(float), cudaMemcpyDeviceToHost));
+
+        // Apply periodic boundary conditions
+        CUDA_RT_CALL(cudaMemcpy(a_new, a_new + (iy_end - 1) * nx, nx * sizeof(float),
+                                     cudaMemcpyDeviceToDevice));
+        CUDA_RT_CALL(cudaMemcpy(a_new + iy_end * nx, a_new + iy_start * nx, nx * sizeof(float),
+                                     cudaMemcpyDeviceToDevice));
+
+        CUDA_RT_CALL(cudaDeviceSynchronize());
+        l2_norm = *l2_norm_h;
+        l2_norm = std::sqrt(l2_norm);
+
+        iter++;
+        if ((iter % 100) == 0 && print) {
+            printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+        std::swap(a_new, a);
+    }
+    nvtxRangePop();
+    double stop = omp_get_wtime();
+
+    CUDA_RT_CALL(cudaMemcpy(a_ref_h, a, nx * ny * sizeof(float), cudaMemcpyDeviceToHost));
+
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+    return (stop - start);
+}
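
As a reference for the TODOs in this file, a minimal sketch of the NCCL calls it exercises: bootstrapping the communicator over MPI, an AllReduce, and a grouped send/recv pair, all on the default stream (buffer names and sizes are illustrative; this is not the solution file):

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <mpi.h>
    #include <nccl.h>

    int main(int argc, char* argv[]) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // One GPU per rank is assumed; rank 0 creates the unique id and broadcasts it.
        cudaSetDevice(0);
        ncclUniqueId uid;
        if (rank == 0) ncclGetUniqueId(&uid);
        MPI_Bcast(&uid, sizeof(ncclUniqueId), MPI_BYTE, 0, MPI_COMM_WORLD);

        ncclComm_t comm;
        ncclCommInitRank(&comm, size, uid, rank);

        const int nx = 1024;
        float *local_norm_d, *global_norm_d, *halo_d;
        cudaMalloc(&local_norm_d, sizeof(float));
        cudaMalloc(&global_norm_d, sizeof(float));
        cudaMalloc(&halo_d, 2 * nx * sizeof(float));
        cudaMemset(local_norm_d, 0, sizeof(float));
        cudaMemset(halo_d, 0, 2 * nx * sizeof(float));

        const int top = rank > 0 ? rank - 1 : size - 1;
        const int bottom = (rank + 1) % size;

        // Reduction: every rank contributes its local value and receives the global sum.
        ncclGroupStart();
        ncclAllReduce(local_norm_d, global_norm_d, 1, ncclFloat, ncclSum, comm, 0);
        ncclGroupEnd();

        // Point-to-point ring exchange: fusing the send and receive into one group
        // lets NCCL schedule them together without deadlocking.
        ncclGroupStart();
        ncclRecv(halo_d, nx, ncclFloat, top, comm, 0);
        ncclSend(halo_d + nx, nx, ncclFloat, bottom, comm, 0);
        ncclGroupEnd();

        cudaDeviceSynchronize();   // NCCL calls are asynchronous with respect to the host
        if (rank == 0) printf("NCCL group operations completed on %d ranks\n", size);

        cudaFree(halo_d);
        cudaFree(global_norm_d);
        cudaFree(local_norm_d);
        ncclCommDestroy(comm);
        MPI_Finalize();
        return 0;
    }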

+ 407 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nccl/solution/jacobi_nccl.cpp

@@ -0,0 +1,407 @@
+/* Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NVIDIA CORPORATION nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <algorithm>
+#include <cmath>
+#include <cstdio>
+#include <iostream>
+#include <sstream>
+#include <cuda_runtime.h>
+#include <nvToolsExt.h>
+#include <mpi.h>
+#include <omp.h>
+#include <nccl.h>
+
+#define MPI_CALL(call)                                                                \
+    {                                                                                 \
+        int mpi_status = call;                                                        \
+        if (0 != mpi_status) {                                                        \
+            char mpi_error_string[MPI_MAX_ERROR_STRING];                              \
+            int mpi_error_string_length = 0;                                          \
+            MPI_Error_string(mpi_status, mpi_error_string, &mpi_error_string_length); \
+            if (NULL != mpi_error_string)                                             \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %s "                                                    \
+                        "(%d).\n",                                                    \
+                        #call, __LINE__, __FILE__, mpi_error_string, mpi_status);     \
+            else                                                                      \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %d.\n",                                                 \
+                        #call, __LINE__, __FILE__, mpi_status);                       \
+        }                                                                             \
+    }
+
+#define CUDA_RT_CALL(call)                                                                  \
+    {                                                                                       \
+        cudaError_t cudaStatus = call;                                                      \
+        if (cudaSuccess != cudaStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
+    }
+
+#define NCCL_CALL(call)                                                                     \
+    {                                                                                       \
+        ncclResult_t  ncclStatus = call;                                                    \
+        if (ncclSuccess != ncclStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: NCCL call \"%s\" in line %d of file %s failed "                 \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, ncclGetErrorString(ncclStatus), ncclStatus); \
+    }
+
+constexpr float tol = 1.0e-8;
+
+const float PI = 2.0 * std::asin(1.0);
+
+void launch_jacobi_kernel(float*  a_new, const float*  a, float*  l2_norm, const int iy_start,
+                              const int iy_end, const int nx, cudaStream_t stream);
+
+void launch_initialize_boundaries(float*  a_new, float*  a, const float pi, const int offset, 
+                                    const int nx, const int my_ny, const int ny);
+
+double single_gpu(const int nx, const int ny, const int iter_max, 
+                    float* const a_ref_h, bool print);
+
+int get_argval(char** begin, char** end, const std::string& arg, const int default_val) {
+    int argval = default_val;
+    char** itr = std::find(begin, end, arg);
+    if (itr != end && ++itr != end) {
+        std::istringstream inbuf(*itr);
+        inbuf >> argval;
+    }
+    return argval;
+}
+
+bool get_arg(char** begin, char** end, const std::string& arg) {
+    char** itr = std::find(begin, end, arg);
+    if (itr != end) {
+        return true;
+    }
+    return false;
+}
+
+int main(int argc, char* argv[]) {
+    MPI_CALL(MPI_Init(&argc, &argv));
+    int rank;
+    MPI_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
+    int size;
+    MPI_CALL(MPI_Comm_size(MPI_COMM_WORLD, &size));
+
+    ncclUniqueId nccl_uid;
+    if (rank == 0) NCCL_CALL(ncclGetUniqueId(&nccl_uid));
+    MPI_CALL(MPI_Bcast(&nccl_uid, sizeof(ncclUniqueId), MPI_BYTE, 0, MPI_COMM_WORLD));
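+    // The unique ID is created once on rank 0 and broadcast over MPI so that every rank
+    // can join the same NCCL communicator in ncclCommInitRank() below.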
+
+    const int iter_max = get_argval(argv, argv + argc, "-niter", 1000);
+    const int nx = get_argval(argv, argv + argc, "-nx", 16384);
+    const int ny = get_argval(argv, argv + argc, "-ny", 16384);
+    const bool skip_single_gpu = get_arg(argv, argv + argc, "-skip_single_gpu");
+
+    int local_rank = -1;
+    {
+        MPI_Comm local_comm;
+        MPI_CALL(MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
+                                     &local_comm));
+
+        MPI_CALL(MPI_Comm_rank(local_comm, &local_rank));
+
+        MPI_CALL(MPI_Comm_free(&local_comm));
+    }
+
+    CUDA_RT_CALL(cudaSetDevice(local_rank));
+    CUDA_RT_CALL(cudaFree(0));
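+    // The no-op cudaFree(0) call above forces eager creation of the CUDA context on the
+    // selected device before NCCL initialization.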
+
+    ncclComm_t nccl_comm;
+    NCCL_CALL(ncclCommInitRank(&nccl_comm, size, nccl_uid, rank));
+
+
+    int nccl_version = 0;
+    NCCL_CALL(ncclGetVersion(&nccl_version));
+    if ( nccl_version < 2800 ) {
+        fprintf(stderr, "ERROR: NCCL 2.8 or newer is required.\n");
+        NCCL_CALL(ncclCommDestroy(nccl_comm));
+        MPI_CALL(MPI_Finalize());
+        return 1;
+    }
+
+    float* a_ref_h;
+    CUDA_RT_CALL(cudaMallocHost(&a_ref_h, nx * ny * sizeof(float)));
+    float* a_h;
+    CUDA_RT_CALL(cudaMallocHost(&a_h, nx * ny * sizeof(float)));
+    
+    double runtime_serial = 1;
+    if (!skip_single_gpu){
+        runtime_serial = single_gpu(nx, ny, iter_max, a_ref_h, rank == 0);
+    }
+    // ny - 2 rows are distributed amongst `size` ranks in such a way
+    // that each rank gets either (ny - 2) / size or (ny - 2) / size + 1 rows.
+    // This optimizes load balancing when (ny - 2) % size != 0
+    int chunk_size;
+    int chunk_size_low = (ny - 2) / size;
+    int chunk_size_high = chunk_size_low + 1;
+    // To calculate the number of ranks that need to compute an extra row,
+    // the following formula is derived from this equation:
+    // num_ranks_low * chunk_size_low + (size - num_ranks_low) * (chunk_size_low + 1) = ny - 2
+    int num_ranks_low = size * chunk_size_low + size -
+                        (ny - 2);  // Number of ranks with chunk_size = chunk_size_low
+    if (rank < num_ranks_low)
+        chunk_size = chunk_size_low;
+    else
+        chunk_size = chunk_size_high;
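+    // Illustrative example (not part of the lab code): with ny - 2 = 14 interior rows and
+    // size = 4, chunk_size_low = 3 and num_ranks_low = 4 * 3 + 4 - 14 = 2, so ranks 0-1
+    // compute 3 rows each and ranks 2-3 compute 4 rows each (2 * 3 + 2 * 4 = 14).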
+
+    float* a;
+    CUDA_RT_CALL(cudaMalloc(&a, nx * (chunk_size + 2) * sizeof(float)));
+    float* a_new;
+    CUDA_RT_CALL(cudaMalloc(&a_new, nx * (chunk_size + 2) * sizeof(float)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * (chunk_size + 2) * sizeof(float)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * (chunk_size + 2) * sizeof(float)));
+
+    // Calculate local domain boundaries
+    int iy_start_global;  // My start index in the global array
+    if (rank < num_ranks_low) {
+        iy_start_global = rank * chunk_size_low + 1;
+    } else {
+        iy_start_global =
+            num_ranks_low * chunk_size_low + (rank - num_ranks_low) * chunk_size_high + 1;
+    }
+    int iy_end_global = iy_start_global + chunk_size - 1;  // My last index in the global array
+
+    int iy_start = 1;
+    int iy_end = iy_start + chunk_size;
+
+    // Set Dirichlet boundary conditions on left and right border
+    launch_initialize_boundaries(a, a_new, PI, iy_start_global - 1, nx, (chunk_size + 2), ny);
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    float* l2_norm_d;
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(float)));
+    float* l2_global_norm_d;
+    CUDA_RT_CALL(cudaMalloc(&l2_global_norm_d, sizeof(float)));
+    float* l2_norm_h;
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(float)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (0 == rank) {
+        printf("Jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+    }
+
+    int iter = 0;
+    float l2_norm = 1.0;
+
+    MPI_CALL(MPI_Barrier(MPI_COMM_WORLD));
+    double start = MPI_Wtime();
+    nvtxRangePush("Jacobi Solve Multi-GPU");
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemsetAsync(l2_norm_d, 0, sizeof(float)));
+
+        launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, iy_end, nx, 0);
+
+        const int top = rank > 0 ? rank - 1 : (size - 1);
+        const int bottom = (rank + 1) % size;
+
+        // TODO: Reduce the device-local L2 norm, "l2_norm_d" to the global L2 norm on each device,
+        // "l2_global_norm_d", using ncclAllReduce() function. Use "ncclSum" as the reduction operation.
+        // Make sure to encapsulate this function call within NCCL group calls.
+        // Use "0" in the stream parameter function argument.
+        NCCL_CALL(ncclGroupStart());
+        NCCL_CALL(ncclAllReduce(l2_norm_d, l2_global_norm_d, 1, ncclFloat, ncclSum, nccl_comm, 
+                                    0));
+        NCCL_CALL(ncclGroupEnd());
+
+        // TODO: Transfer the global L2 norm from each device to the host using cudaMemcpyAsync
+        CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_h, l2_global_norm_d, sizeof(float), cudaMemcpyDeviceToHost));
+
+        // Apply periodic boundary conditions
+        NCCL_CALL(ncclGroupStart());
+        
+        //TODO: Perform the first set of halo exchanges by:
+        // 1. Receiving the top halo from the "top" neighbour into the "a_new" device memory array location. 
+        // 2. Sending current device's bottom halo to "bottom" neighbour from the "a_new + (iy_end - 1) * nx"
+        //    device memory array location.
+        // Use "0" in the stream parameter function argument.
+        NCCL_CALL(ncclRecv(a_new,                     nx, ncclFloat, top,    nccl_comm, 0));
+        NCCL_CALL(ncclSend(a_new + (iy_end - 1) * nx, nx, ncclFloat, bottom, nccl_comm, 0));
+
+        //TODO: Perform the second set of halo exchanges by:
+        // 1. Receiving the bottom halo from the "bottom" neighbour into the "a_new + (iy_end * nx)" 
+        //    device memory array location. 
+        // 2. Sending current device's top halo to "top" neighbour from the "a_new + iy_start * nx"
+        //    device memory array location.
+        // Use "0" in the stream parameter function argument.
+        NCCL_CALL(ncclRecv(a_new + (iy_end * nx),     nx, ncclFloat, bottom, nccl_comm, 0));
+        NCCL_CALL(ncclSend(a_new + iy_start * nx,     nx, ncclFloat, top,    nccl_comm, 0));
+
+        NCCL_CALL(ncclGroupEnd());
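+        // Grouping the ncclRecv()/ncclSend() pairs between ncclGroupStart()/ncclGroupEnd()
+        // lets NCCL progress both exchanges together, avoiding deadlock when every rank
+        // both sends and receives in the same step.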
+
+        // TODO: Synchronize the device before computing the global L2 norm on host for printing
+        CUDA_RT_CALL(cudaDeviceSynchronize());
+
+        l2_norm = *l2_norm_h;
+        l2_norm = std::sqrt(l2_norm);
+
+        iter++;
+        if (0 == rank && (iter % 100) == 0) {
+            printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+
+        std::swap(a_new, a);
+    }
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    double stop = MPI_Wtime();
+    nvtxRangePop();
+
+    CUDA_RT_CALL(cudaMemcpy(a_h + iy_start_global * nx, a + nx,
+                            std::min((ny - iy_start_global) * nx, chunk_size * nx) * sizeof(float),
+                            cudaMemcpyDeviceToHost));
+
+    int result_correct = 1;
+    if (!skip_single_gpu) {
+        for (int iy = iy_start_global; result_correct && (iy < iy_end_global); ++iy) {
+            for (int ix = 1; result_correct && (ix < (nx - 1)); ++ix) {
+                if (std::fabs(a_ref_h[iy * nx + ix] - a_h[iy * nx + ix]) > tol) {
+                    fprintf(stderr,
+                            "ERROR on rank %d: a[%d * %d + %d] = %f does not match %f "
+                            "(reference)\n",
+                            rank, iy, nx, ix, a_h[iy * nx + ix], a_ref_h[iy * nx + ix]);
+                    result_correct = 0;
+                }
+            }
+        }
+        int global_result_correct = 1;
+        MPI_CALL(MPI_Allreduce(&result_correct, &global_result_correct, 1, MPI_INT, MPI_MIN,
+                            MPI_COMM_WORLD));
+        result_correct = global_result_correct;
+    }
+
+    if (rank == 0 && result_correct) {
+        printf("Num GPUs: %d.\n", size);
+        if (!skip_single_gpu) {
+            printf(
+                "%dx%d: 1 GPU: %8.4f s, %d GPUs: %8.4f s, speedup: %8.2f, "
+                "efficiency: %8.2f \n",
+                nx, ny, runtime_serial, size, (stop - start), runtime_serial / (stop - start),
+                runtime_serial / (size * (stop - start)) * 100);
+        }
+        else {
+            printf("%dx%d: %d GPUs: %8.4f s \n", nx, ny, size, (stop - start)); 
+        }
+    }
+
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+
+    CUDA_RT_CALL(cudaFreeHost(a_h));
+    CUDA_RT_CALL(cudaFreeHost(a_ref_h));
+
+    NCCL_CALL(ncclCommDestroy(nccl_comm));
+
+    MPI_CALL(MPI_Finalize());
+    return (result_correct == 1) ? 0 : 1;
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, float* const a_ref_h, bool print) {
+    float* a;
+    float* a_new;
+
+    float* l2_norm_d;
+    float* l2_norm_h;
+
+    int iy_start = 1;
+    int iy_end = (ny - 1);
+
+    CUDA_RT_CALL(cudaMalloc(&a, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMalloc(&a_new, nx * ny * sizeof(float)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * ny * sizeof(float)));
+
+    // Set Dirichlet boundary conditions on left and right border
+    nvtxRangePush("Init boundaries");
+    launch_initialize_boundaries(a, a_new, PI, 0, nx, ny, ny);
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    nvtxRangePop();
+
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(float)));
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(float)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (print) {
+        printf("Single GPU jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+    }
+
+    int iter = 0;
+    float l2_norm = 1.0;
+
+    double start = omp_get_wtime();
+    nvtxRangePush("Jacobi Solve Single GPU");
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemset(l2_norm_d, 0, sizeof(float)));
+
+        // Compute grid points for this iteration
+        launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, iy_end, nx, 0);
+        CUDA_RT_CALL(cudaGetLastError());
+        CUDA_RT_CALL(cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(float), cudaMemcpyDeviceToHost));
+
+        // Apply periodic boundary conditions
+        CUDA_RT_CALL(cudaMemcpy(a_new, a_new + (iy_end - 1) * nx, nx * sizeof(float),
+                                     cudaMemcpyDeviceToDevice));
+        CUDA_RT_CALL(cudaMemcpy(a_new + iy_end * nx, a_new + iy_start * nx, nx * sizeof(float),
+                                     cudaMemcpyDeviceToDevice));
+
+        CUDA_RT_CALL(cudaDeviceSynchronize());
+        l2_norm = *l2_norm_h;
+        l2_norm = std::sqrt(l2_norm);
+
+        iter++;
+        if ((iter % 100) == 0 && print) {
+            printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+        std::swap(a_new, a);
+    }
+    nvtxRangePop();
+    double stop = omp_get_wtime();
+
+    CUDA_RT_CALL(cudaMemcpy(a_ref_h, a, nx * ny * sizeof(float), cudaMemcpyDeviceToHost));
+
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+    return (stop - start);
+}

+ 29 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/Makefile

@@ -0,0 +1,29 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+NP ?= 1
+NVCC=nvcc
+MPIRUN ?= mpirun
+CUDA_HOME ?= /usr/local/cuda
+ifndef NVSHMEM_HOME
+$(error NVSHMEM_HOME is not set)
+endif
+ifndef MPI_HOME
+$(error MPI_HOME is not set)
+endif
+GENCODE_SM70    := -gencode arch=compute_70,code=sm_70
+GENCODE_SM80    := -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80
+GENCODE_FLAGS	:= $(GENCODE_SM70) $(GENCODE_SM80)
+
+NVCC_FLAGS += -dc -Xcompiler -fopenmp -lineinfo -lnvToolsExt $(GENCODE_FLAGS) -std=c++14 -I$(NVSHMEM_HOME)/include -I$(MPI_HOME)/include
+NVCC_LDFLAGS = -ccbin=mpic++ -L$(NVSHMEM_HOME)/lib -lnvshmem -L$(MPI_HOME)/lib -lmpi -L$(CUDA_HOME)/lib64 -lcuda -lcudart -lnvToolsExt
+
+left_shift: Makefile left_shift.cu
+	$(NVCC) $(NVCC_FLAGS) left_shift.cu -c -o left_shift.o
+	$(NVCC) $(GENCODE_FLAGS) left_shift.o -o left_shift $(NVCC_LDFLAGS)
+
+jacobi_nvshmem: Makefile jacobi_nvshmem.cu
+	$(NVCC) $(NVCC_FLAGS) jacobi_nvshmem.cu -c -o jacobi_nvshmem.o
+	$(NVCC) $(GENCODE_FLAGS) jacobi_nvshmem.o -o jacobi_nvshmem $(NVCC_LDFLAGS)
+
+.PHONY: clean
+clean:
+	rm -rf jacobi_nvshmem left_shift *.o *.qdrep *.sqlite
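+
+# Illustrative convenience target (not required by the lab): assumes mpirun can launch
+# NVSHMEM programs that are bootstrapped through MPI on this system.
+run_left_shift: left_shift
+	$(MPIRUN) -np $(NP) ./left_shift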

+ 567 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/jacobi_nvshmem.cu

@@ -0,0 +1,567 @@
+/* Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NVIDIA CORPORATION nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <mpi.h>
+#include <nvshmem.h>
+#include <nvshmemx.h>
+#include <algorithm>
+#include <cassert>
+#include <cmath>
+#include <cstdio>
+#include <iostream>
+#include <sstream>
+#include <cub/block/block_reduce.cuh>
+#include <nvToolsExt.h>
+
+#define BLOCK_DIM_X 1024
+#define BLOCK_DIM_Y 1
+
+#define MPI_CALL(call)                                                                \
+    {                                                                                 \
+        int mpi_status = call;                                                        \
+        if (0 != mpi_status) {                                                        \
+            char mpi_error_string[MPI_MAX_ERROR_STRING];                              \
+            int mpi_error_string_length = 0;                                          \
+            MPI_Error_string(mpi_status, mpi_error_string, &mpi_error_string_length); \
+            if (NULL != mpi_error_string)                                             \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %s "                                                    \
+                        "(%d).\n",                                                    \
+                        #call, __LINE__, __FILE__, mpi_error_string, mpi_status);     \
+            else                                                                      \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %d.\n",                                                 \
+                        #call, __LINE__, __FILE__, mpi_status);                       \
+        }                                                                             \
+    }
+
+#define CUDA_RT_CALL(call)                                                                  \
+    {                                                                                       \
+        cudaError_t cudaStatus = call;                                                      \
+        if (cudaSuccess != cudaStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
+    }
+
+// convert NVSHMEM_SYMMETRIC_SIZE string to long long unsigned int
+long long unsigned int parse_nvshmem_symmetric_size(char *value) {
+    long long unsigned int units, size;
+
+    assert(value != NULL);
+
+    if (strchr(value, 'G') != NULL) {
+        units=1e9;
+    } else if (strchr(value, 'M') != NULL) {
+        units=1e6;
+    } else if (strchr(value, 'K') != NULL) {
+        units=1e3;
+    } else {
+        units=1;
+    }
+
+    assert(atof(value) >= 0);
+    size = (long long unsigned int) atof(value) * units;
+
+    return size;
+}
+
+constexpr float tol = 1.0e-8;
+const float PI = 2.0 * std::asin(1.0);
+
+__global__ void initialize_boundaries(float* __restrict__ const a_new, float* __restrict__ const a,
+                                      const float pi, const int offset, const int nx,
+                                      const int my_ny, int ny) {
+    for (int iy = blockIdx.x * blockDim.x + threadIdx.x; iy < my_ny; iy += blockDim.x * gridDim.x) {
+        const float y0 = sin(2.0 * pi * (offset + iy) / (ny - 1));
+        a[(iy + 1) * nx + 0] = y0;
+        a[(iy + 1) * nx + (nx - 1)] = y0;
+        a_new[(iy + 1) * nx + 0] = y0;
+        a_new[(iy + 1) * nx + (nx - 1)] = y0;
+    }
+}
+
+__global__ void jacobi_kernel(float* __restrict__ const a_new, const float* __restrict__ const a,
+                              float* __restrict__ const l2_norm, const int iy_start,
+                              const int iy_end, const int nx, const int top_pe, const int top_iy,
+                              const int bottom_pe, const int bottom_iy) {
+    int iy = blockIdx.y * blockDim.y + threadIdx.y + iy_start;
+    int ix = blockIdx.x * blockDim.x + threadIdx.x + 1;
+    __shared__ float block_l2_sum[BLOCK_DIM_X*BLOCK_DIM_Y];
+    unsigned thread_index = threadIdx.y*BLOCK_DIM_X + threadIdx.x;
+
+    if (iy < iy_end && ix < (nx - 1)) {
+        // Update grid point
+        const float new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
+                                     a[(iy + 1) * nx + ix] + a[(iy - 1) * nx + ix]);
+        a_new[iy * nx + ix] = new_val;
+        float residue = new_val - a[iy * nx + ix];
+        // Set block-level L2 norm value for this grid point
+        block_l2_sum[thread_index] = residue * residue;
+    }
+    else {
+        block_l2_sum[thread_index] = 0;
+    }
+
+    /* starting (x, y) coordinate of the block */
+    int block_iy = iy - threadIdx.y; /* That is, block_iy = blockIdx.y * blockDim.y + iy_start */
+    int block_ix = ix - threadIdx.x; /* That is, block_ix = blockIdx.x * blockDim.x + 1 */
+
+    /* Communicate the boundaries */
+    // TODO: Use block-level NVSHMEM put communication API to transfer the halo pointed to by 
+    // "a_new + iy_start * nx + block_ix" to "a_new + top_iy * nx + block_ix" in the "top_pe"
+    if ((block_iy <= iy_start) && (iy_start < block_iy + blockDim.y)) {
+        nvshmemx_/*Fill me*/(/*Fill me*/, /*Fill me*/,
+                                   min(blockDim.x, nx - 1 - block_ix), /*Fill me*/);
+    }
+    // TODO: Use block-level NVSHMEM put communication API to transfer the halo pointed to by 
+    // "a_new + (iy_end - 1) * nx + block_ix" to "a_new + bottom_iy * nx + block_ix" in the "bottom_pe"
+    if ((block_iy < iy_end) && (iy_end <= block_iy + blockDim.y)) {
+        nvshmemx_/*Fill me*/(/*Fill me*/, /*Fill me*/,
+                                   min(blockDim.x, nx - 1 - block_ix), /*Fill me*/);
+    }
+    // Reduce L2 norm for the block in parallel
+    for (unsigned stride = 1; stride < BLOCK_DIM_X*BLOCK_DIM_Y; stride *= 2) {
+        __syncthreads();
+        if ((thread_index) % (2*stride) == 0) {
+            block_l2_sum[thread_index] += block_l2_sum[thread_index + stride];
+        }
+    }
+    // Atomically update global L2 norm with block-reduced L2 norm
+    if (thread_index == 0) {
+        atomicAdd(l2_norm, block_l2_sum[0]);
+    }
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, float* const a_ref_h,
+                  const bool print, int mype);
+
+int get_argval(char** begin, char** end, const std::string& arg, const int default_val) {
+    int argval = default_val;
+    char** itr = std::find(begin, end, arg);
+    if (itr != end && ++itr != end) {
+        std::istringstream inbuf(*itr);
+        inbuf >> argval;
+    }
+    return argval;
+}
+
+bool get_arg(char** begin, char** end, const std::string& arg) {
+    char** itr = std::find(begin, end, arg);
+    if (itr != end) {
+        return true;
+    }
+    return false;
+}
+
+struct l2_norm_buf {
+    cudaEvent_t copy_done;
+    float* d;
+    float* h;
+};
+
+int main(int argc, char* argv[]) {
+    const int iter_max = get_argval(argv, argv + argc, "-niter", 1000);
+    const int nx = get_argval(argv, argv + argc, "-nx", 16384);
+    const int ny = get_argval(argv, argv + argc, "-ny", 16384);
+    const bool skip_single_gpu = get_arg(argv, argv + argc, "-skip_single_gpu");
+
+    float* a_new;
+
+    float* a_ref_h;
+    float* a_h;
+    double runtime_serial = 0.0;
+
+    float l2_norms[2];
+
+    int rank = 0, size = 1;
+    MPI_CALL(MPI_Init(&argc, &argv));
+    MPI_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
+    MPI_CALL(MPI_Comm_size(MPI_COMM_WORLD, &size));
+
+    int num_devices;
+    CUDA_RT_CALL(cudaGetDeviceCount(&num_devices));
+
+    int local_rank = -1, local_size = 1;
+    {
+        MPI_Comm local_comm;
+        MPI_Info info;
+        MPI_CALL(MPI_Info_create(&info));
+        MPI_CALL(
+            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, info, &local_comm));
+
+        MPI_CALL(MPI_Comm_rank(local_comm, &local_rank));
+        MPI_CALL(MPI_Comm_size(local_comm, &local_size));
+        if (num_devices < local_size) {
+            fprintf(stderr,
+                    "ERROR: Number of devices is less numer of PEs \
+                    on the node!\n");
+            MPI_CALL(MPI_Comm_free(&local_comm));
+            MPI_CALL(MPI_Info_free(&info));
+            MPI_CALL(MPI_Finalize());
+            return -1;
+        }
+
+        MPI_CALL(MPI_Comm_free(&local_comm));
+        MPI_CALL(MPI_Info_free(&info));
+    }
+    CUDA_RT_CALL(cudaSetDevice(local_rank));
+    CUDA_RT_CALL(cudaFree(0));
+
+    MPI_Comm mpi_comm;
+    nvshmemx_init_attr_t attr;
+
+    mpi_comm = MPI_COMM_WORLD;
+    attr.mpi_comm = &mpi_comm;
+    // Set symmetric heap size for nvshmem based on problem size
+    // Its default value in nvshmem is 1 GB which is not sufficient
+    // for large mesh sizes
+    long long unsigned int mesh_size_per_rank = nx * (((ny - 2) + size - 1) / size + 2);
+    long long unsigned int required_symmetric_heap_size =
+        2 * mesh_size_per_rank * sizeof(float) *
+        1.1;  // Factor 2 is because 2 arrays are allocated - a and a_new
+              // 1.1 factor is just for alignment or other usage
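+    // For example (illustrative only): with the default 16384 x 16384 mesh on a single PE this
+    // works out to roughly 2 * 16384 * 16384 * 4 B * 1.1, i.e. about 2.4 GB, which is why the
+    // 1 GB default heap is not enough.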
+
+    char * value = getenv("NVSHMEM_SYMMETRIC_SIZE");
+    if (value) { /* env variable is set */
+        long long unsigned int size_env = parse_nvshmem_symmetric_size(value);
+        if (size_env < required_symmetric_heap_size) {
+            fprintf(stderr, "ERROR: Minimum NVSHMEM_SYMMETRIC_SIZE = %lluB, Current NVSHMEM_SYMMETRIC_SIZE = %s\n", required_symmetric_heap_size, value);
+            MPI_CALL(MPI_Finalize());
+            return -1;
+        }
+    } else {
+        char symmetric_heap_size_str[100];
+        sprintf(symmetric_heap_size_str, "%llu", required_symmetric_heap_size);
+        if (!rank)
+            printf("Setting environment variable NVSHMEM_SYMMETRIC_SIZE = %llu\n", required_symmetric_heap_size);
+        setenv("NVSHMEM_SYMMETRIC_SIZE", symmetric_heap_size_str, 1);
+    }
+    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
+
+    int npes = nvshmem_n_pes();
+    int mype = nvshmem_my_pe();
+
+    nvshmem_barrier_all();
+
+    bool result_correct = true;
+    float* a;
+
+    cudaStream_t compute_stream;
+    cudaStream_t reset_l2_norm_stream;
+    cudaEvent_t compute_done[2];
+    cudaEvent_t reset_l2_norm_done[2];
+
+    l2_norm_buf l2_norm_bufs[2];
+
+    CUDA_RT_CALL(cudaMallocHost(&a_ref_h, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMallocHost(&a_h, nx * ny * sizeof(float)));
+    if (!skip_single_gpu){
+        runtime_serial = single_gpu(nx, ny, iter_max, a_ref_h, (0 == mype), mype);
+    }
+    
+    nvshmem_barrier_all();
+    // ny - 2 rows are distributed amongst `size` ranks in such a way
+    // that each rank gets either (ny - 2) / size or (ny - 2) / size + 1 rows.
+    // This optimizes load balancing when (ny - 2) % size != 0
+    int chunk_size;
+    int chunk_size_low = (ny - 2) / npes;
+    int chunk_size_high = chunk_size_low + 1;
+    // To calculate the number of ranks that need to compute an extra row,
+    // the following formula is derived from this equation:
+    // num_ranks_low * chunk_size_low + (size - num_ranks_low) * (chunk_size_low + 1) = ny - 2
+    int num_ranks_low = npes * chunk_size_low + npes -
+                        (ny - 2);  // Number of ranks with chunk_size = chunk_size_low
+    if (mype < num_ranks_low)
+        chunk_size = chunk_size_low;
+    else
+        chunk_size = chunk_size_high;
+
+    a = (float*)nvshmem_malloc(
+        nx * (chunk_size_high + 2) *
+        sizeof(float));  // Using chunk_size_high so that it is same across all PEs
+    a_new = (float*)nvshmem_malloc(nx * (chunk_size_high + 2) * sizeof(float));
+
+    cudaMemset(a, 0, nx * (chunk_size + 2) * sizeof(float));
+    cudaMemset(a_new, 0, nx * (chunk_size + 2) * sizeof(float));
+
+    // Calculate local domain boundaries
+    int iy_start_global;  // My start index in the global array
+    if (mype < num_ranks_low) {
+        iy_start_global = mype * chunk_size_low + 1;
+    } else {
+        iy_start_global =
+            num_ranks_low * chunk_size_low + (mype - num_ranks_low) * chunk_size_high + 1;
+    }
+    int iy_end_global = iy_start_global + chunk_size - 1;  // My last index in the global array
+    // do not process boundaries
+    iy_end_global = std::min(iy_end_global, ny - 4);
+
+    int iy_start = 1;
+    int iy_end = (iy_end_global - iy_start_global + 1) + iy_start;
+
+    // calculate boundary indices for top and bottom boundaries
+    int top_pe = mype > 0 ? mype - 1 : (npes - 1);
+    int bottom_pe = (mype + 1) % npes;
+
+    int iy_end_top = (top_pe < num_ranks_low) ? chunk_size_low + 1 : chunk_size_high + 1;
+    int iy_start_bottom = 0;
+
+    // Set Dirichlet boundary conditions on left and right boundary
+    initialize_boundaries<<<(ny / npes) / 128 + 1, 128>>>(a, a_new, PI, iy_start_global - 1, nx,
+                                                          chunk_size, ny - 2);
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    CUDA_RT_CALL(cudaStreamCreateWithFlags(&compute_stream, cudaStreamNonBlocking));
+    CUDA_RT_CALL(cudaStreamCreate(&reset_l2_norm_stream));
+    CUDA_RT_CALL(cudaEventCreate(&compute_done[0]));
+    CUDA_RT_CALL(cudaEventCreate(&compute_done[1]));
+    CUDA_RT_CALL(cudaEventCreate(&reset_l2_norm_done[0]));
+    CUDA_RT_CALL(cudaEventCreate(&reset_l2_norm_done[1]));
+
+    for (int i = 0; i < 2; ++i) {
+        CUDA_RT_CALL(cudaEventCreate(&l2_norm_bufs[i].copy_done));
+        CUDA_RT_CALL(cudaMalloc(&l2_norm_bufs[i].d, sizeof(float)));
+        CUDA_RT_CALL(cudaMemset(l2_norm_bufs[i].d, 0, sizeof(float)));
+        CUDA_RT_CALL(cudaMallocHost(&l2_norm_bufs[i].h, sizeof(float)));
+        *(l2_norm_bufs[i].h) = 1.0;
+    }
+
+    nvshmemx_barrier_all_on_stream(compute_stream);
+    MPI_CALL(MPI_Allreduce(l2_norm_bufs[0].h, &l2_norms[0], 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD));
+    MPI_CALL(MPI_Allreduce(l2_norm_bufs[1].h, &l2_norms[1], 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD));
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (!mype) {
+        printf("Jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+    }
+
+    dim3 dim_grid((nx + BLOCK_DIM_X-1) / BLOCK_DIM_X, (chunk_size + BLOCK_DIM_Y-1) / BLOCK_DIM_Y, 1);
+    dim3 dim_block(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
+
+    int iter = 0;
+    if (!mype) {
+        for (int i = 0; i < 2; ++i) {
+            l2_norms[i] = 1.0;
+        }
+    }
+
+    nvshmem_barrier_all();
+
+    double start = MPI_Wtime();
+    bool l2_norm_greater_than_tol = true;
+
+    nvtxRangePush("Jacobi Solve Multi-GPU");
+    while (l2_norm_greater_than_tol && iter < iter_max) {
+        // on new iteration: old current vars are now previous vars, old
+        // previous vars are no longer needed
+        int prev = iter % 2;
+        int curr = (iter + 1) % 2;
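+        // The norm written into the "curr" buffer this iteration is copied back and reduced
+        // only in the next iteration (as "prev"), so the D2H copy and MPI_Allreduce overlap
+        // with kernel execution instead of stalling it every step.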
+
+        // TODO: Use "cudaStreamWaitEvent" on "compute_stream" to wait for "reset_l2_norm_done"
+        // event to complete for "curr" iteration
+        CUDA_RT_CALL(cudaStreamWaitEvent(/*Fill me*/, /*Fill me*/, 0));
+        jacobi_kernel<<<dim_grid, dim_block, 0, compute_stream>>>(
+                a_new, a, l2_norm_bufs[curr].d, iy_start, iy_end, nx, top_pe, iy_end_top, bottom_pe,
+                iy_start_bottom);
+
+        // TODO: Put a barrier at the "compute_stream" stream level
+        nvshmemx_/*Fill me*/(/*Fill me*/);
+
+        // perform L2 norm calculation
+        // as soon as computation is complete -> D2H-copy L2 norm
+        CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_bufs[curr].h, l2_norm_bufs[curr].d, sizeof(float),
+                                        cudaMemcpyDeviceToHost, compute_stream));
+        // TODO: Record the event "l2_norm_bufs[curr].copy_done" for "compute_stream"
+        CUDA_RT_CALL(cudaEventRecord(/*Fill me*/, /*Fill me*/));
+
+        // ensure previous D2H-copy is completed before using the data for
+        // calculation
+        CUDA_RT_CALL(cudaEventSynchronize(l2_norm_bufs[prev].copy_done));
+
+        MPI_CALL(MPI_Allreduce(l2_norm_bufs[prev].h, &l2_norms[prev], 1, MPI_FLOAT, MPI_SUM,
+                                MPI_COMM_WORLD));
+
+        l2_norms[prev] = std::sqrt(l2_norms[prev]);
+        l2_norm_greater_than_tol = (l2_norms[prev] > tol);
+
+        iter++;
+        if ((iter % 100) == 0) {
+            if (!mype) printf("%5d, %0.6f\n", iter, l2_norms[prev]);
+        }
+
+        // reset everything for next iteration
+        l2_norms[prev] = 0.0;
+        *(l2_norm_bufs[prev].h) = 0.0;
+        CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_bufs[prev].d, l2_norm_bufs[prev].h, sizeof(float),
+                                        cudaMemcpyHostToDevice, reset_l2_norm_stream));
+        // TODO: Record the L2 norm reset in "reset_l2_norm_done[prev]" for "reset_l2_norm_stream"
+        CUDA_RT_CALL(cudaEventRecord(/*Fill me*/, /*Fill me*/));
+
+        std::swap(a_new, a);
+    }
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    nvshmem_barrier_all();
+    double stop = MPI_Wtime();
+    nvtxRangePop();
+
+    nvshmem_barrier_all();
+
+    CUDA_RT_CALL(cudaMemcpy(a_h + iy_start_global * nx, a + nx,
+                            std::min(ny - 2 - iy_start_global, chunk_size) * nx * sizeof(float),
+                            cudaMemcpyDeviceToHost));
+
+    result_correct = true;
+    if (!skip_single_gpu) {
+        for (int iy = iy_start_global; result_correct && (iy < iy_end_global); ++iy) {
+            for (int ix = 1; result_correct && (ix < (nx - 1)); ++ix) {
+                if (std::fabs(a_ref_h[iy * nx + ix] - a_h[iy * nx + ix]) > tol) {
+                    fprintf(stderr,
+                            "ERROR on rank %d: a[%d * %d + %d] = %f does not match %f "
+                            "(reference)\n",
+                            rank, iy, nx, ix, a_h[iy * nx + ix], a_ref_h[iy * nx + ix]);
+                    result_correct = false;
+                }
+            }
+        }
+        int local_result_correct = result_correct ? 1 : 0;
+        int global_result_correct = 1;
+        MPI_CALL(MPI_Allreduce(&local_result_correct, &global_result_correct, 1, MPI_INT, MPI_MIN,
+                                MPI_COMM_WORLD));
+        result_correct = (global_result_correct == 1);
+    }
+
+    if (!mype && result_correct) {
+        printf("Num GPUs: %d.\n", size);
+        if (!skip_single_gpu) {
+            printf(
+                "%dx%d: 1 GPU: %8.4f s, %d GPUs: %8.4f s, speedup: %8.2f, "
+                "efficiency: %8.2f \n",
+                nx, ny, runtime_serial, size, (stop - start), runtime_serial / (stop - start),
+                runtime_serial / (size * (stop - start)) * 100);
+        }
+        else {
+            printf("%dx%d: %d GPUs: %8.4f s \n", nx, ny, size, (stop - start)); 
+        }
+    }
+
+    for (int i = 0; i < 2; ++i) {
+        CUDA_RT_CALL(cudaFreeHost(l2_norm_bufs[i].h));
+        CUDA_RT_CALL(cudaFree(l2_norm_bufs[i].d));
+        CUDA_RT_CALL(cudaEventDestroy(l2_norm_bufs[i].copy_done));
+    }
+
+    nvshmem_free(a);
+    nvshmem_free(a_new);
+
+    CUDA_RT_CALL(cudaEventDestroy(reset_l2_norm_done[1]));
+    CUDA_RT_CALL(cudaEventDestroy(reset_l2_norm_done[0]));
+    CUDA_RT_CALL(cudaEventDestroy(compute_done[1]));
+    CUDA_RT_CALL(cudaEventDestroy(compute_done[0]));
+    CUDA_RT_CALL(cudaStreamDestroy(reset_l2_norm_stream));
+    CUDA_RT_CALL(cudaStreamDestroy(compute_stream));
+
+    CUDA_RT_CALL(cudaFreeHost(a_h));
+    CUDA_RT_CALL(cudaFreeHost(a_ref_h));
+
+    nvshmem_finalize();
+    MPI_CALL(MPI_Finalize());
+
+    return (result_correct == 1) ? 0 : 1;
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, float* const a_ref_h,
+                  const bool print, int mype) {
+    float* a;
+    float* a_new;
+    float* l2_norm_d;
+    float* l2_norm_h;
+
+    int iy_start = 1;
+    int iy_end = ny - 3;
+
+    CUDA_RT_CALL(cudaMalloc((void**)&a, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMalloc((void**)&a_new, nx * ny * sizeof(float)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * ny * sizeof(float)));
+
+    // Set Dirichlet boundary conditions on left and right border
+    initialize_boundaries<<<ny / 128 + 1, 128>>>(a, a_new, PI, 0, nx, ny - 2, ny - 2);
+
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(float)));
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(float)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (print)
+        printf("Single GPU jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+
+    dim3 dim_block(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
+    dim3 dim_grid((nx + BLOCK_DIM_X-1) / BLOCK_DIM_X, ((ny - 2) + BLOCK_DIM_Y-1) / BLOCK_DIM_Y, 1);
+
+    int iter = 0;
+    float l2_norm = 1.0;
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    double start = MPI_Wtime();
+    nvtxRangePush("Jacobi Solve Single GPU");
+
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemsetAsync(l2_norm_d, 0, sizeof(float)));
+
+        jacobi_kernel<<<dim_grid, dim_block>>>(a_new, a, l2_norm_d, iy_start, iy_end, nx, 
+                                                mype, iy_end + 1, mype, (iy_start - 1));
+        
+        iter++;
+        if (print && ((iter % 100) == 0)) {
+            CUDA_RT_CALL(cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(float), cudaMemcpyDeviceToHost));
+            l2_norm = std::sqrt(*l2_norm_h);
+            printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+
+        std::swap(a_new, a);
+    }
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    nvtxRangePop();
+    double stop = MPI_Wtime();
+
+    CUDA_RT_CALL(cudaMemcpy(a_ref_h, a, nx * ny * sizeof(float), cudaMemcpyDeviceToHost));
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+    return (stop - start);
+}

+ 55 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/left_shift.cu

@@ -0,0 +1,55 @@
+#include <stdio.h>
+#include "mpi.h"
+#include "nvshmem.h"
+#include "nvshmemx.h"
+
+#define CUDA_CHECK(stmt)                                  \
+do {                                                      \
+    cudaError_t result = (stmt);                          \
+    if (cudaSuccess != result) {                          \
+        fprintf(stderr, "[%s:%d] CUDA failed with %s \n", \
+         __FILE__, __LINE__, cudaGetErrorString(result)); \
+        exit(-1);                                         \
+    }                                                     \
+} while (0)
+
+__global__ void simple_shift(int *destination) {
+    int mype = nvshmem_my_pe();
+    int npes = nvshmem_n_pes();
+    int peer = (mype + 1) % npes;
+
+    nvshmem_int_p(destination, mype, peer);
+}
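+
+// With npes PEs, every PE writes its own ID into `destination` on its right neighbour
+// ((mype + 1) % npes), so after the barrier PE i prints the ID of its left neighbour,
+// i.e. (i - 1 + npes) % npes.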
+
+int main (int argc, char *argv[]) {
+    int mype_node, msg;
+    cudaStream_t stream;
+    int rank, nranks;
+    MPI_Comm mpi_comm = MPI_COMM_WORLD;
+    nvshmemx_init_attr_t attr;
+
+    MPI_Init(&argc, &argv);
+    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
+
+    attr.mpi_comm = &mpi_comm;
+    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
+    mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
+
+    CUDA_CHECK(cudaSetDevice(mype_node));
+    CUDA_CHECK(cudaStreamCreate(&stream));
+    int *destination = (int *) nvshmem_malloc (sizeof(int));
+
+    simple_shift<<<1, 1, 0, stream>>>(destination);
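+    // Synchronize all PEs on the stream so that every PE's put has completed before the
+    // result is copied back to the host.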
+    nvshmemx_barrier_all_on_stream(stream);
+    CUDA_CHECK(cudaMemcpyAsync(&msg, destination, sizeof(int),
+                cudaMemcpyDeviceToHost, stream));
+
+    CUDA_CHECK(cudaStreamSynchronize(stream));
+    printf("%d: received message %d\n", nvshmem_my_pe(), msg);
+
+    nvshmem_free(destination);
+    nvshmem_finalize();
+    MPI_Finalize();
+    return 0;
+}

+ 555 - 0
hpc/multi_gpu_nways/labs/CFD/English/C/source_code/nvshmem/solution/jacobi_nvshmem.cu

@@ -0,0 +1,555 @@
+/* Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *  * Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  * Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *  * Neither the name of NVIDIA CORPORATION nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include <mpi.h>
+#include <nvshmem.h>
+#include <nvshmemx.h>
+#include <algorithm>
+#include <cassert>
+#include <cmath>
+#include <cstdio>
+#include <iostream>
+#include <sstream>
+#include <cub/block/block_reduce.cuh>
+#include <nvToolsExt.h>
+
+#define BLOCK_DIM_X 1024
+#define BLOCK_DIM_Y 1
+
+#define MPI_CALL(call)                                                                \
+    {                                                                                 \
+        int mpi_status = call;                                                        \
+        if (0 != mpi_status) {                                                        \
+            char mpi_error_string[MPI_MAX_ERROR_STRING];                              \
+            int mpi_error_string_length = 0;                                          \
+            MPI_Error_string(mpi_status, mpi_error_string, &mpi_error_string_length); \
+            if (NULL != mpi_error_string)                                             \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %s "                                                    \
+                        "(%d).\n",                                                    \
+                        #call, __LINE__, __FILE__, mpi_error_string, mpi_status);     \
+            else                                                                      \
+                fprintf(stderr,                                                       \
+                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
+                        "with %d.\n",                                                 \
+                        #call, __LINE__, __FILE__, mpi_status);                       \
+        }                                                                             \
+    }
+
+#define CUDA_RT_CALL(call)                                                                  \
+    {                                                                                       \
+        cudaError_t cudaStatus = call;                                                      \
+        if (cudaSuccess != cudaStatus)                                                      \
+            fprintf(stderr,                                                                 \
+                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
+                    "with "                                                                 \
+                    "%s (%d).\n",                                                           \
+                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
+    }
+
+// convert NVSHMEM_SYMMETRIC_SIZE string to long long unsigned int
+long long unsigned int parse_nvshmem_symmetric_size(char *value) {
+    long long unsigned int units, size;
+
+    assert(value != NULL);
+
+    if (strchr(value, 'G') != NULL) {
+        units=1e9;
+    } else if (strchr(value, 'M') != NULL) {
+        units=1e6;
+    } else if (strchr(value, 'K') != NULL) {
+        units=1e3;
+    } else {
+        units=1;
+    }
+
+    assert(atof(value) >= 0);
+    size = (long long unsigned int) atof(value) * units;
+
+    return size;
+}
+
+constexpr float tol = 1.0e-8;
+const float PI = 2.0 * std::asin(1.0);
+
+__global__ void initialize_boundaries(float* __restrict__ const a_new, float* __restrict__ const a,
+                                      const float pi, const int offset, const int nx,
+                                      const int my_ny, int ny) {
+    for (int iy = blockIdx.x * blockDim.x + threadIdx.x; iy < my_ny; iy += blockDim.x * gridDim.x) {
+        const float y0 = sin(2.0 * pi * (offset + iy) / (ny - 1));
+        a[(iy + 1) * nx + 0] = y0;
+        a[(iy + 1) * nx + (nx - 1)] = y0;
+        a_new[(iy + 1) * nx + 0] = y0;
+        a_new[(iy + 1) * nx + (nx - 1)] = y0;
+    }
+}
+
+__global__ void jacobi_kernel(float* __restrict__ const a_new, const float* __restrict__ const a,
+                              float* __restrict__ const l2_norm, const int iy_start,
+                              const int iy_end, const int nx, const int top_pe, const int top_iy,
+                              const int bottom_pe, const int bottom_iy) {
+    int iy = blockIdx.y * blockDim.y + threadIdx.y + iy_start;
+    int ix = blockIdx.x * blockDim.x + threadIdx.x + 1;
+    __shared__ float block_l2_sum[BLOCK_DIM_X*BLOCK_DIM_Y];
+    unsigned thread_index = threadIdx.y*BLOCK_DIM_X + threadIdx.x;
+
+    if (iy < iy_end && ix < (nx - 1)) {
+        // Update grid point
+        const float new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
+                                     a[(iy + 1) * nx + ix] + a[(iy - 1) * nx + ix]);
+        a_new[iy * nx + ix] = new_val;
+        float residue = new_val - a[iy * nx + ix];
+        // Set block-level L2 norm value for this grid point
+        block_l2_sum[thread_index] = residue * residue;
+    }
+    else {
+        block_l2_sum[thread_index] = 0;
+    }
+
+    /* starting (x, y) coordinate of the block */
+    int block_iy =
+        iy - threadIdx.y; /* Alternatively, block_iy = blockIdx.y * blockDim.y + iy_start */
+    int block_ix = ix - threadIdx.x; /* Alternatively, block_ix = blockIdx.x * blockDim.x + 1 */
+
+    /* Communicate the boundaries */
+    if ((block_iy <= iy_start) && (iy_start < block_iy + blockDim.y)) {
+        nvshmemx_float_put_nbi_block(a_new + top_iy * nx + block_ix, a_new + iy_start * nx + block_ix,
+                                   min(blockDim.x, nx - 1 - block_ix), top_pe);
+    }
+    if ((block_iy < iy_end) && (iy_end <= block_iy + blockDim.y)) {
+        nvshmemx_float_put_nbi_block(a_new + bottom_iy * nx + block_ix,
+                                   a_new + (iy_end - 1) * nx + block_ix,
+                                   min(blockDim.x, nx - 1 - block_ix), bottom_pe);
+    }
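+    /* The *_put_nbi_block calls above are non-blocking puts issued collectively by the whole
+       thread block; their completion on the remote PEs is only guaranteed after the barrier
+       executed on the stream between iterations. */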
+    // Reduce L2 norm for the block in parallel
+    for (unsigned stride = 1; stride < BLOCK_DIM_X*BLOCK_DIM_Y; stride *= 2) {
+        __syncthreads();
+        if ((thread_index) % (2*stride) == 0) {
+            block_l2_sum[thread_index] += block_l2_sum[thread_index + stride];
+        }
+    }
+    // Atomically update global L2 norm with block-reduced L2 norm
+    if (thread_index == 0) {
+        atomicAdd(l2_norm, block_l2_sum[0]);
+    }
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, float* const a_ref_h,
+                  const bool print, int mype);
+
+int get_argval(char** begin, char** end, const std::string& arg, const int default_val) {
+    int argval = default_val;
+    char** itr = std::find(begin, end, arg);
+    if (itr != end && ++itr != end) {
+        std::istringstream inbuf(*itr);
+        inbuf >> argval;
+    }
+    return argval;
+}
+
+bool get_arg(char** begin, char** end, const std::string& arg) {
+    char** itr = std::find(begin, end, arg);
+    if (itr != end) {
+        return true;
+    }
+    return false;
+}
+
+struct l2_norm_buf {
+    cudaEvent_t copy_done;
+    float* d;
+    float* h;
+};
+
+int main(int argc, char* argv[]) {
+    const int iter_max = get_argval(argv, argv + argc, "-niter", 1000);
+    const int nx = get_argval(argv, argv + argc, "-nx", 16384);
+    const int ny = get_argval(argv, argv + argc, "-ny", 16384);
+
+    float* a_new;
+
+    float* a_ref_h;
+    float* a_h;
+    double runtime_serial = 0.0;
+
+    float l2_norms[2];
+
+    int rank = 0, size = 1;
+    MPI_CALL(MPI_Init(&argc, &argv));
+    MPI_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
+    MPI_CALL(MPI_Comm_size(MPI_COMM_WORLD, &size));
+
+    int num_devices;
+    CUDA_RT_CALL(cudaGetDeviceCount(&num_devices));
+
+    int local_rank = -1, local_size = 1;
+    {
+        MPI_Comm local_comm;
+        MPI_Info info;
+        MPI_CALL(MPI_Info_create(&info));
+        MPI_CALL(
+            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, info, &local_comm));
+
+        MPI_CALL(MPI_Comm_rank(local_comm, &local_rank));
+        MPI_CALL(MPI_Comm_size(local_comm, &local_size));
+        if (num_devices < local_size) {
+            fprintf(stderr,
+                    "ERROR: Number of devices is less numer of PEs \
+                    on the node!\n");
+            MPI_CALL(MPI_Comm_free(&local_comm));
+            MPI_CALL(MPI_Info_free(&info));
+            MPI_CALL(MPI_Finalize());
+            return -1;
+        }
+
+        MPI_CALL(MPI_Comm_free(&local_comm));
+        MPI_CALL(MPI_Info_free(&info));
+    }
+    CUDA_RT_CALL(cudaSetDevice(local_rank));
+    CUDA_RT_CALL(cudaFree(0));
+
+    MPI_Comm mpi_comm;
+    nvshmemx_init_attr_t attr;
+
+    mpi_comm = MPI_COMM_WORLD;
+    attr.mpi_comm = &mpi_comm;
+    // Set symmetric heap size for nvshmem based on problem size
+    // Its default value in nvshmem is 1 GB which is not sufficient
+    // for large mesh sizes
+    long long unsigned int mesh_size_per_rank = nx * (((ny - 2) + size - 1) / size + 2);
+    long long unsigned int required_symmetric_heap_size =
+        2 * mesh_size_per_rank * sizeof(float) *
+        1.1;  // Factor 2 is because 2 arrays are allocated - a and a_new
+              // 1.1 factor is just for alignment or other usage
+
+    char * value = getenv("NVSHMEM_SYMMETRIC_SIZE");
+    if (value) { /* env variable is set */
+        long long unsigned int size_env = parse_nvshmem_symmetric_size(value);
+        if (size_env < required_symmetric_heap_size) {
+            fprintf(stderr, "ERROR: Minimum NVSHMEM_SYMMETRIC_SIZE = %lluB, Current NVSHMEM_SYMMETRIC_SIZE = %s\n", required_symmetric_heap_size, value);
+            MPI_CALL(MPI_Finalize());
+            return -1;
+        }
+    } else {
+        char symmetric_heap_size_str[100];
+        sprintf(symmetric_heap_size_str, "%llu", required_symmetric_heap_size);
+        if (!rank)
+            printf("Setting environment variable NVSHMEM_SYMMETRIC_SIZE = %llu\n", required_symmetric_heap_size);
+        setenv("NVSHMEM_SYMMETRIC_SIZE", symmetric_heap_size_str, 1);
+    }
+    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
+
+    int npes = nvshmem_n_pes();
+    int mype = nvshmem_my_pe();
+
+    nvshmem_barrier_all();
+
+    bool result_correct = true;
+    float* a;
+
+    cudaStream_t compute_stream;
+    cudaStream_t reset_l2_norm_stream;
+    cudaEvent_t compute_done[2];
+    cudaEvent_t reset_l2_norm_done[2];
+
+    l2_norm_buf l2_norm_bufs[2];
+
+    CUDA_RT_CALL(cudaMallocHost(&a_ref_h, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMallocHost(&a_h, nx * ny * sizeof(float)));
+    runtime_serial = single_gpu(nx, ny, iter_max, a_ref_h, (0 == mype), mype);
+
+    nvshmem_barrier_all();
+    // ny - 2 rows are distributed amongst `size` ranks in such a way
+    // that each rank gets either (ny - 2) / size or (ny - 2) / size + 1 rows.
+    // This optimizes load balancing when (ny - 2) % size != 0
+    int chunk_size;
+    int chunk_size_low = (ny - 2) / npes;
+    int chunk_size_high = chunk_size_low + 1;
+    // To calculate the number of ranks that need to compute an extra row,
+    // the following formula is derived from this equation:
+    // num_ranks_low * chunk_size_low + (size - num_ranks_low) * (chunk_size_low + 1) = ny - 2
+    int num_ranks_low = npes * chunk_size_low + npes -
+                        (ny - 2);  // Number of ranks with chunk_size = chunk_size_low
+    if (mype < num_ranks_low)
+        chunk_size = chunk_size_low;
+    else
+        chunk_size = chunk_size_high;
+
+    a = (float*)nvshmem_malloc(
+        nx * (chunk_size_high + 2) *
+        sizeof(float));  // Using chunk_size_high so that it is same across all PEs
+    a_new = (float*)nvshmem_malloc(nx * (chunk_size_high + 2) * sizeof(float));
+
+    cudaMemset(a, 0, nx * (chunk_size + 2) * sizeof(float));
+    cudaMemset(a_new, 0, nx * (chunk_size + 2) * sizeof(float));
+
+    // Calculate local domain boundaries
+    int iy_start_global;  // My start index in the global array
+    if (mype < num_ranks_low) {
+        iy_start_global = mype * chunk_size_low + 1;
+    } else {
+        iy_start_global =
+            num_ranks_low * chunk_size_low + (mype - num_ranks_low) * chunk_size_high + 1;
+    }
+    int iy_end_global = iy_start_global + chunk_size - 1;  // My last index in the global array
+    // do not process boundaries
+    iy_end_global = std::min(iy_end_global, ny - 4);
+
+    int iy_start = 1;
+    int iy_end = (iy_end_global - iy_start_global + 1) + iy_start;
+
+    // calculate boundary indices for top and bottom boundaries
+    int top_pe = mype > 0 ? mype - 1 : (npes - 1);
+    int bottom_pe = (mype + 1) % npes;
+
+    int iy_end_top = (top_pe < num_ranks_low) ? chunk_size_low + 1 : chunk_size_high + 1;
+    int iy_start_bottom = 0;
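+    // With the wrap-around neighbor indices above, each PE exchanges halo rows with
+    // top_pe and bottom_pe; iy_end_top and iy_start_bottom are the row offsets in the
+    // neighbors' local arrays that the Jacobi kernel is expected to write this PE's
+    // boundary rows into during the halo exchange.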
+
+    // Set Dirichlet boundary conditions on the left and right boundaries
+    initialize_boundaries<<<(ny / npes) / 128 + 1, 128>>>(a, a_new, PI, iy_start_global - 1, nx,
+                                                          chunk_size, ny - 2);
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    CUDA_RT_CALL(cudaStreamCreateWithFlags(&compute_stream, cudaStreamNonBlocking));
+    CUDA_RT_CALL(cudaStreamCreate(&reset_l2_norm_stream));
+    CUDA_RT_CALL(cudaEventCreate(&compute_done[0]));
+    CUDA_RT_CALL(cudaEventCreate(&compute_done[1]));
+    CUDA_RT_CALL(cudaEventCreate(&reset_l2_norm_done[0]));
+    CUDA_RT_CALL(cudaEventCreate(&reset_l2_norm_done[1]));
+
+    for (int i = 0; i < 2; ++i) {
+        CUDA_RT_CALL(cudaEventCreate(&l2_norm_bufs[i].copy_done));
+        CUDA_RT_CALL(cudaMalloc(&l2_norm_bufs[i].d, sizeof(float)));
+        CUDA_RT_CALL(cudaMemset(l2_norm_bufs[i].d, 0, sizeof(float)));
+        CUDA_RT_CALL(cudaMallocHost(&l2_norm_bufs[i].h, sizeof(float)));
+        *(l2_norm_bufs[i].h) = 1.0;
+    }
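+    // Two L2-norm buffers are used (indexed by prev/curr in the solver loop below) so
+    // that the device-to-host copy and MPI reduction of the previous iteration's norm
+    // can overlap with the Jacobi kernel of the current iteration.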
+
+    nvshmemx_barrier_all_on_stream(compute_stream);
+    MPI_CALL(MPI_Allreduce(l2_norm_bufs[0].h, &l2_norms[0], 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD));
+    MPI_CALL(MPI_Allreduce(l2_norm_bufs[1].h, &l2_norms[1], 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD));
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (!mype) {
+        printf("Jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+    }
+
+    dim3 dim_grid((nx + BLOCK_DIM_X-1) / BLOCK_DIM_X, (chunk_size + BLOCK_DIM_Y-1) / BLOCK_DIM_Y, 1);
+    dim3 dim_block(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
+
+    int iter = 0;
+    if (!mype) {
+        for (int i = 0; i < 2; ++i) {
+            l2_norms[i] = 1.0;
+        }
+    }
+
+    nvshmem_barrier_all();
+
+    double start = MPI_Wtime();
+    bool l2_norm_greater_than_tol = true;
+
+    nvtxRangePush("Jacobi Solve Multi-GPU");
+    while (l2_norm_greater_than_tol && iter < iter_max) {
+        // on new iteration: old current vars are now previous vars, old
+        // previous vars are no longer needed
+        int prev = iter % 2;
+        int curr = (iter + 1) % 2;
+
+        CUDA_RT_CALL(cudaStreamWaitEvent(compute_stream, reset_l2_norm_done[curr], 0));
+        jacobi_kernel<<<dim_grid, dim_block, 0, compute_stream>>>(
+                a_new, a, l2_norm_bufs[curr].d, iy_start, iy_end, nx, top_pe, iy_end_top, bottom_pe,
+                iy_start_bottom);
+
+        /* A global barrier on the compute stream ensures that the halo exchanges issued
+           by jacobi_kernel are complete before the next iteration starts. A custom
+           barrier that synchronizes only with the neighboring PEs (the PEs this PE
+           actually communicates with) would avoid the redundant synchronization a
+           global barrier performs and is a possible further optimization. */
+        nvshmemx_barrier_all_on_stream(compute_stream);
+
+        // As soon as the computation for this iteration has been enqueued,
+        // asynchronously copy its L2 norm from device to host on the same stream
+        CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_bufs[curr].h, l2_norm_bufs[curr].d, sizeof(float),
+                                        cudaMemcpyDeviceToHost, compute_stream));
+        CUDA_RT_CALL(cudaEventRecord(l2_norm_bufs[curr].copy_done, compute_stream));
+
+        // ensure previous D2H-copy is completed before using the data for
+        // calculation
+        CUDA_RT_CALL(cudaEventSynchronize(l2_norm_bufs[prev].copy_done));
+
+        MPI_CALL(MPI_Allreduce(l2_norm_bufs[prev].h, &l2_norms[prev], 1, MPI_FLOAT, MPI_SUM,
+                                MPI_COMM_WORLD));
+
+        l2_norms[prev] = std::sqrt(l2_norms[prev]);
+        l2_norm_greater_than_tol = (l2_norms[prev] > tol);
+
+        iter++;
+        if ((iter % 100) == 0) {
+            if (!mype) printf("%5d, %0.6f\n", iter, l2_norms[prev]);
+        }
+
+        // reset everything for next iteration
+        l2_norms[prev] = 0.0;
+        *(l2_norm_bufs[prev].h) = 0.0;
+        CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_bufs[prev].d, l2_norm_bufs[prev].h, sizeof(float),
+                                        cudaMemcpyHostToDevice, reset_l2_norm_stream));
+        CUDA_RT_CALL(cudaEventRecord(reset_l2_norm_done[prev], reset_l2_norm_stream));
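+        // The reset runs on its own stream; in the next iteration this buffer becomes
+        // `curr`, and the compute stream waits on the event recorded above before the
+        // kernel accumulates into it again.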
+
+        std::swap(a_new, a);
+    }
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    nvshmem_barrier_all();
+    double stop = MPI_Wtime();
+    nvtxRangePop();
+
+    nvshmem_barrier_all();
+
+    CUDA_RT_CALL(cudaMemcpy(a_h + iy_start_global * nx, a + nx,
+                            std::min(ny - 2 - iy_start_global, chunk_size) * nx * sizeof(float),
+                            cudaMemcpyDeviceToHost));
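+    // Each PE copies only its own interior rows (skipping the halo row at local offset 0)
+    // back into its slice of the full host mesh a_h, which is compared element-wise
+    // against the single-GPU reference a_ref_h below.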
+
+    result_correct = true;
+    for (int iy = iy_start_global; result_correct && (iy < iy_end_global); ++iy) {
+        for (int ix = 1; result_correct && (ix < (nx - 1)); ++ix) {
+            if (std::fabs(a_ref_h[iy * nx + ix] - a_h[iy * nx + ix]) > tol) {
+                fprintf(stderr,
+                        "ERROR on rank %d: a[%d * %d + %d] = %f does not match %f "
+                        "(reference)\n",
+                        rank, iy, nx, ix, a_h[iy * nx + ix], a_ref_h[iy * nx + ix]);
+                result_correct = false;
+            }
+        }
+    }
+
+    // result_correct is a bool, so reduce an int copy of it: passing a bool* where
+    // MPI_INT expects a 4-byte integer is undefined behavior
+    int local_result_correct = result_correct ? 1 : 0;
+    int global_result_correct = 1;
+    MPI_CALL(MPI_Allreduce(&local_result_correct, &global_result_correct, 1, MPI_INT, MPI_MIN,
+                           MPI_COMM_WORLD));
+    result_correct = (global_result_correct == 1);
+
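+    // Report the parallel speedup (serial runtime / multi-GPU runtime) and efficiency
+    // (speedup / npes * 100, i.e., the percentage of ideal linear scaling achieved).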
+    if (!mype && result_correct) {
+            printf("Num GPUs: %d.\n", npes);
+            printf(
+                "%dx%d: 1 GPU: %8.4f s, %d GPUs: %8.4f s, speedup: %8.2f, "
+                "efficiency: %8.2f \n",
+                ny, nx, runtime_serial, npes, (stop - start), runtime_serial / (stop - start),
+                runtime_serial / (npes * (stop - start)) * 100);
+    }
+
+    for (int i = 0; i < 2; ++i) {
+        CUDA_RT_CALL(cudaFreeHost(l2_norm_bufs[i].h));
+        CUDA_RT_CALL(cudaFree(l2_norm_bufs[i].d));
+        CUDA_RT_CALL(cudaEventDestroy(l2_norm_bufs[i].copy_done));
+    }
+
+    nvshmem_free(a);
+    nvshmem_free(a_new);
+
+    CUDA_RT_CALL(cudaEventDestroy(reset_l2_norm_done[1]));
+    CUDA_RT_CALL(cudaEventDestroy(reset_l2_norm_done[0]));
+    CUDA_RT_CALL(cudaEventDestroy(compute_done[1]));
+    CUDA_RT_CALL(cudaEventDestroy(compute_done[0]));
+    CUDA_RT_CALL(cudaStreamDestroy(reset_l2_norm_stream));
+    CUDA_RT_CALL(cudaStreamDestroy(compute_stream));
+
+    CUDA_RT_CALL(cudaFreeHost(a_h));
+    CUDA_RT_CALL(cudaFreeHost(a_ref_h));
+
+    nvshmem_finalize();
+    MPI_CALL(MPI_Finalize());
+
+    return (result_correct == 1) ? 0 : 1;
+}
+
+double single_gpu(const int nx, const int ny, const int iter_max, float* const a_ref_h,
+                  const bool print, int mype) {
+    float* a;
+    float* a_new;
+    float* l2_norm_d;
+    float* l2_norm_h;
+
+    int iy_start = 1;
+    int iy_end = ny - 3;
+
+    CUDA_RT_CALL(cudaMalloc((void**)&a, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMalloc((void**)&a_new, nx * ny * sizeof(float)));
+
+    CUDA_RT_CALL(cudaMemset(a, 0, nx * ny * sizeof(float)));
+    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * ny * sizeof(float)));
+
+    // Set Dirichlet boundary conditions on the left and right borders
+    initialize_boundaries<<<ny / 128 + 1, 128>>>(a, a_new, PI, 0, nx, ny - 2, ny - 2);
+
+    CUDA_RT_CALL(cudaGetLastError());
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(float)));
+    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(float)));
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+
+    if (print)
+        printf("Single GPU jacobi relaxation: %d iterations on %d x %d mesh\n", iter_max, nx, ny);
+
+    dim3 dim_block(BLOCK_DIM_X, BLOCK_DIM_Y, 1);
+    dim3 dim_grid((nx + BLOCK_DIM_X-1) / BLOCK_DIM_X, ((ny - 2) + BLOCK_DIM_Y-1) / BLOCK_DIM_Y, 1);
+
+    int iter = 0;
+    float l2_norm = 1.0;
+
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    double start = MPI_Wtime();
+    nvtxRangePush("Jacobi Solve Single GPU");
+
+    while (l2_norm > tol && iter < iter_max) {
+        CUDA_RT_CALL(cudaMemsetAsync(l2_norm_d, 0, sizeof(float)));
+
+        jacobi_kernel<<<dim_grid, dim_block>>>(a_new, a, l2_norm_d, iy_start, iy_end, nx, 
+                                                mype, iy_end + 1, mype, (iy_start - 1));
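+        // The top and bottom neighbor arguments are set to mype itself, so any halo
+        // writes the kernel issues land back in this PE's own array; this lets the
+        // multi-GPU kernel double as the single-GPU reference implementation.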
+        
+        iter++;
+        if (print && ((iter % 100) == 0)) {
+            CUDA_RT_CALL(cudaMemcpy(l2_norm_h, l2_norm_d, sizeof(float), cudaMemcpyDeviceToHost));
+            l2_norm = std::sqrt(*l2_norm_h);
+            printf("%5d, %0.6f\n", iter, l2_norm);
+        }
+
+        std::swap(a_new, a);
+    }
+    CUDA_RT_CALL(cudaDeviceSynchronize());
+    nvtxRangePop();
+    double stop = MPI_Wtime();
+
+    CUDA_RT_CALL(cudaMemcpy(a_ref_h, a, nx * ny * sizeof(float), cudaMemcpyDeviceToHost));
+    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
+    CUDA_RT_CALL(cudaFree(l2_norm_d));
+
+    CUDA_RT_CALL(cudaFree(a_new));
+    CUDA_RT_CALL(cudaFree(a));
+    return (stop - start);
+}

+ 5 - 5
hpc/multi_gpu_nways/labs/CFD/English/introduction.ipynb

@@ -26,7 +26,7 @@
     "Intermediate, Advanced\n",
     "\n",
     "### Target Audience and Prerequisites\n",
-    "The target audience for this lab are researchers, graduate students, and developers who are interested in scaling their scientific applications to multiple nodes using multi-GPU implmenetations.\n",
+    "The target audience for this lab are researchers, graduate students, and developers who are interested in scaling their scientific applications to multiple nodes using multi-GPU implementations.\n",
     "\n",
     "Experience in C/ C++ and basic CUDA programming is required. Experience with parallel programming frameworks like OpenMP or MPI is not required but a basic understanding of MPI is highly recommended.\n",
     "\n",
@@ -44,14 +44,14 @@
     "    * [MPI with CUDA Memcpy](C/jupyter_notebook/mpi/memcpy.ipynb)\n",
     "    * [CUDA-aware MPI](C/jupyter_notebook/mpi/cuda_aware.ipynb)\n",
     "    * [Supplemental: Configuring MPI in a containerized environment](C/jupyter_notebook/mpi/containers_and_mpi.ipynb)\n",
-    "4. [NVIDIA Collectives Communications Library (NCCL)](C/jupyter_notebook/nccl/nccl/ipynb)\n",
-    "5. NVHSMEM Library\n",
-    "6. Final remarks\n",
+    "4. [NVIDIA Collectives Communications Library (NCCL)](C/jupyter_notebook/nccl/nccl.ipynb)\n",
+    "5. [NVHSMEM Library](C/jupyter_notebook/nvshmem/nvshmem.ipynb)\n",
+    "\n",
     "--- \n",
     "\n",
     "## Licensing \n",
     "\n",
-    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
+    "This material is released) by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
    ]
   }
  ],