## Multi-GPU Programming and Performance Analysis

## Learning objectives

Scaling an application to multiple GPUs across multiple nodes requires proficiency not only in programming models and optimization techniques, but also in root-cause analysis that uses in-depth profiling to identify and minimize bottlenecks. In this bootcamp, participants will improve the performance of an application step by step, taking cues from profilers along the way. An understanding of the underlying technologies and communication topology will also help participants use high-performance NVIDIA libraries to extract more performance from the system.

By the end of this bootcamp session, participants will be adept at:
* Reviewing communication architecture and topology
* Developing CUDA-aware multi-node multi-GPU MPI applications
* Profiling the application using Nsight Systems and HPCToolkit
* Applying optimizations like CUDA streams and overlapping compute and communication
* Understanding GPUDirect technologies like P2P and RDMA
* Utilizing NVIDIA NCCL and NVSHMEM libraries

### Tutorial Duration
The lab will take 8 hours to complete. A link to download all materials will be available at the end of the lab.

### Content Level
Intermediate, Advanced

### Target Audience and Prerequisites
The target audience for this lab is researchers, graduate students, and developers who are interested in scaling their scientific applications to multiple nodes using multi-GPU implementations.

Experience in C/C++ and basic CUDA programming is required. Experience with parallel programming frameworks like OpenMP or MPI is not required, but a basic understanding of MPI is highly recommended.

### Bootcamp Outline

In this tutorial, we will work with the Jacobi solver, an iterative technique for solving a system of linear equations. To begin, click on the first link below:

1. Overview of Jacobi Solver application
   * Review of single-GPU code
   * Parallelizing to multiple GPUs using normal MPI
2. Profiling with NVTX and Nsight Systems
   * Profiling multi-GPU normal MPI code
   * Using single-node CUDA-aware MPI with compute-copy overlap
3. Communication topology
   * Overview of intra-node and inter-node communication architecture
   * Benchmarking communication networks
4. Profiling with HPCToolkit
   * Analysis of GPUDirect P2P with single-node MPI
   * Analysis of GPUDirect RDMA with multi-node MPI
5. NCCL Library 
6. NVSHMEM Library
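
For orientation before starting, the Jacobi iteration named at the top of this outline can be sketched in plain serial C. This is a minimal illustration, not the bootcamp's code: the function names `jacobi_sweep` and `jacobi_solve`, the dense row-major matrix layout, and the stopping rule are all assumptions chosen for brevity, and the lab's own solver may be organized differently. Each sweep computes every new component only from the previous iterate, which is the property that makes the method straightforward to parallelize.

```c
#include <assert.h>
#include <math.h>

/* One Jacobi sweep on a dense n x n system A x = b (A stored row-major).
 * x_new[i] = (b[i] - sum over j != i of A[i][j] * x_old[j]) / A[i][i]
 * Returns the largest absolute change in any component. */
double jacobi_sweep(const double *A, const double *b,
                    const double *x_old, double *x_new, int n)
{
    double max_diff = 0.0;
    for (int i = 0; i < n; ++i) {
        double diag = A[i * n + i];
        assert(diag != 0.0);               /* Jacobi needs a nonzero diagonal */
        double sigma = 0.0;
        for (int j = 0; j < n; ++j) {
            if (j != i) {
                sigma += A[i * n + j] * x_old[j];
            }
        }
        x_new[i] = (b[i] - sigma) / diag;  /* uses only the previous iterate */
        double diff = fabs(x_new[i] - x_old[i]);
        if (diff > max_diff) {
            max_diff = diff;
        }
    }
    return max_diff;
}

/* Repeats sweeps until the largest change falls below tol or max_iters
 * sweeps have run. x holds the initial guess on entry and the latest
 * iterate on exit; scratch is caller-provided storage of length n.
 * Returns the number of sweeps performed. */
int jacobi_solve(const double *A, const double *b, double *x,
                 double *scratch, int n, double tol, int max_iters)
{
    int iters = 0;
    while (iters < max_iters) {
        double diff = jacobi_sweep(A, b, x, scratch, n);
        for (int i = 0; i < n; ++i) {
            x[i] = scratch[i];             /* new iterate becomes the old one */
        }
        ++iters;
        if (diff < tol) {
            break;
        }
    }
    return iters;
}
```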
--- 

## Licensing 

This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0). 