{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multi-GPU Programming and Performance Analysis\n",
    "\n",
    "## Learning objectives\n",
    "\n",
    "Scaling applications to multiple GPUs across multiple nodes requires proficiency not only in programming models and optimization techniques, but also in root-cause analysis: using in-depth profiling to identify and minimize bottlenecks. In this bootcamp, participants will learn to improve the performance of an application step by step, taking cues from profilers along the way. Moreover, an understanding of the underlying technologies and communication topology will help participants use high-performance NVIDIA libraries to extract more performance out of the system.\n",
    "\n",
    "By the end of this bootcamp session, participants will be adept at:\n",
    "* Reviewing communication architecture and topology\n",
    "* Developing CUDA-aware multi-node multi-GPU MPI applications\n",
    "* Profiling the application using NVIDIA Nsight Systems\n",
    "* Applying optimizations like CUDA streams, events, and overlapping compute and communication\n",
    "* Understanding GPUDirect technologies like P2P and RDMA\n",
    "* Using the NVIDIA NCCL and NVSHMEM libraries\n",
    "\n",
    "### Bootcamp Duration\n",
    "\n",
    "The bootcamp takes approximately 8 hours to complete. A link to download all materials will be available at the end of the lab.\n",
    "\n",
    "### Content Level\n",
    "\n",
    "Intermediate, Advanced\n",
    "\n",
    "### Target Audience and Prerequisites\n",
    "\n",
    "The target audience for this lab is researchers, graduate students, and developers who are interested in scaling their scientific applications to multiple nodes using multi-GPU implementations.\n",
    "\n",
    "Experience with C/C++ and basic CUDA programming is required. Experience with parallel programming frameworks like OpenMP or MPI is not required, but a basic understanding of MPI is highly recommended.\n",
    "\n",
    "### Bootcamp Outline\n",
    "\n",
    "In this tutorial, we will work with the Jacobi solver, an iterative method for solving systems of linear equations. To begin, click on the first link below:\n",
    "\n",
    "1. [Overview of single-GPU code and Nsight Systems Profiler](C/jupyter_notebook/single_gpu/single_gpu_overview.ipynb)\n",
    "2. Single Node Multi-GPU:\n",
    "    * [CUDA Memcpy and Peer-to-Peer Memory Access](C/jupyter_notebook/cuda/memcpy.ipynb)\n",
    "    * [Intra-node topology](C/jupyter_notebook/advanced_concepts/single_node_topology.ipynb)\n",
    "    * [CUDA Streams and Events](C/jupyter_notebook/cuda/streams.ipynb)\n",
    "3. Multi-Node Multi-GPU:\n",
    "    * [Introduction to MPI and Multi-Node execution overview](C/jupyter_notebook/mpi/multi_node_intro.ipynb)\n",
    "    * [MPI with CUDA Memcpy](C/jupyter_notebook/mpi/memcpy.ipynb)\n",
    "    * [CUDA-aware MPI](C/jupyter_notebook/mpi/cuda_aware.ipynb)\n",
    "    * [Supplemental: Configuring MPI in a containerized environment](C/jupyter_notebook/mpi/containers_and_mpi.ipynb)\n",
    "4. [NVIDIA Collective Communications Library (NCCL)](C/jupyter_notebook/nccl/nccl.ipynb)\n",
    "5. [NVSHMEM Library](C/jupyter_notebook/nvshmem/nvshmem.ipynb)\n",
    "\n",
    "---\n",
    "\n",
    "## Licensing\n",
    "\n",
    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}