## Multi-GPU Programming and Performance Analysis

## Learning Objectives

Scaling applications to multiple GPUs across multiple nodes requires proficiency not only in programming models and optimization techniques, but also in root-cause analysis using in-depth profiling to identify and minimize bottlenecks. In this bootcamp, participants will learn to improve the performance of an application step by step, taking cues from profilers along the way. An understanding of the underlying technologies and communication topology will also help participants use high-performance NVIDIA libraries to extract more performance from the system.

By the end of this bootcamp session, participants will be able to:
* Review communication architecture and topology
* Develop CUDA-aware multi-node multi-GPU MPI applications
* Profile the application using NVIDIA Nsight Systems
* Apply optimizations such as CUDA streams, events, and overlap of computation and communication
* Understand GPUDirect technologies such as P2P and RDMA
* Use the NVIDIA NCCL and NVSHMEM libraries

### Bootcamp Duration

The bootcamp takes 8 hours to complete. A link to download all materials will be available at the end of the lab.

### Content Level
Intermediate, Advanced

### Target Audience and Prerequisites
The target audience for this lab is researchers, graduate students, and developers who are interested in scaling their scientific applications to multiple nodes using multi-GPU implementations.

Experience in C/C++ and basic CUDA programming is required. Experience with parallel programming frameworks such as OpenMP or MPI is not required, but a basic understanding of MPI is highly recommended.

### Bootcamp Outline

This tutorial uses the Jacobi solver, an iterative technique for solving a system of linear equations, as its running example. To begin, click on the first link below:

1. [Overview of single-GPU code and Nsight Systems Profiler](C/jupyter_notebook/single_gpu/single_gpu_overview.ipynb)
2. Single Node Multi-GPU:
    * [CUDA Memcpy and Peer-to-Peer Memory Access](C/jupyter_notebook/cuda/memcpy.ipynb)
    * [Intra-node topology](C/jupyter_notebook/advanced_concepts/single_node_topology.ipynb)
    * [CUDA Streams and Events](C/jupyter_notebook/cuda/streams.ipynb)
3. Multi-Node Multi-GPU:
    * [Introduction to MPI and Multi-Node execution overview](C/jupyter_notebook/mpi/multi_node_intro.ipynb)
    * [MPI with CUDA Memcpy](C/jupyter_notebook/mpi/memcpy.ipynb)
    * [CUDA-aware MPI](C/jupyter_notebook/mpi/cuda_aware.ipynb)
    * [Supplemental: Configuring MPI in a containerized environment](C/jupyter_notebook/mpi/containers_and_mpi.ipynb)
4. [NVIDIA Collective Communications Library (NCCL)](C/jupyter_notebook/nccl/nccl.ipynb)
5. [NVSHMEM Library](C/jupyter_notebook/nvshmem/nvshmem.ipynb)
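
For orientation, the Jacobi iteration named in this outline can be sketched on the CPU in a few lines of C. This is a minimal illustration under stated assumptions, not the code used in the notebooks: the function name `jacobi_solve`, its parameters, and the dense row-major storage of the matrix are all hypothetical choices made for this sketch, and the notebooks may formulate the problem differently.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not taken from the bootcamp notebooks).
 * Solves A x = b with Jacobi iteration for a dense n-by-n matrix stored
 * row-major in a. x holds the initial guess on entry and the result on
 * exit; x_new is caller-provided scratch space of length n.
 * Returns the number of sweeps performed, or -1 if max_iters is reached
 * before the largest per-entry update drops below tol. */
int jacobi_solve(size_t n, const double *a, const double *b,
                 double *x, double *x_new, double tol, int max_iters)
{
    for (int iter = 1; iter <= max_iters; ++iter) {
        double max_change = 0.0;

        for (size_t i = 0; i < n; ++i) {
            /* The update divides by the diagonal entry. */
            assert(a[i * n + i] != 0.0);

            /* Use only values from the previous iterate x. */
            double sum = b[i];
            for (size_t j = 0; j < n; ++j) {
                if (j != i) {
                    sum -= a[i * n + j] * x[j];
                }
            }
            x_new[i] = sum / a[i * n + i];

            double change = fabs(x_new[i] - x[i]);
            if (change > max_change) {
                max_change = change;
            }
        }

        /* Publish the new iterate only after the full sweep. */
        for (size_t i = 0; i < n; ++i) {
            x[i] = x_new[i];
        }

        if (max_change < tol) {
            return iter;
        }
    }
    return -1;
}
```

Each entry of the new iterate depends only on the previous iterate, so all entries in a sweep can be computed independently. Convergence is guaranteed for strictly diagonally dominant matrices; for other matrices the iteration may not converge.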

--- 

## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.