{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [
"## Multi-GPU Programming and Performance Analysis\n",
"\n",
"## Learning objectives\n",
"\n",
"Scaling an application to multiple GPUs across multiple nodes requires proficiency not only in the programming models and optimization techniques, but also in root-cause analysis through in-depth profiling to identify and minimize bottlenecks. In this bootcamp, participants will improve the performance of an application step by step, taking cues from profilers along the way. In addition, an understanding of the underlying technologies and communication topology will help participants use high-performance NVIDIA libraries to extract more performance out of the system.\n",
"\n",
"By the end of this bootcamp session, participants will be adept at:\n",
"* Reviewing communication architecture and topology\n",
"* Developing CUDA-aware multi-node, multi-GPU MPI applications\n",
"* Profiling the application using Nsight Systems and HPCToolkit\n",
"* Applying optimizations such as CUDA streams and compute-communication overlap\n",
"* Understanding GPUDirect technologies like P2P and RDMA\n",
"* Utilizing the NVIDIA NCCL and NVSHMEM libraries" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"### Tutorial Duration\n",
"The lab will take 8 hours to complete. A link to download all materials will be provided at the end of the lab.\n",
"\n",
"### Content Level\n",
"Intermediate, Advanced\n",
"\n",
"### Target Audience and Prerequisites\n",
"The target audience for this lab is researchers, graduate students, and developers who are interested in scaling their scientific applications to multiple nodes using multi-GPU implementations.\n",
"\n",
"Experience with C/C++ and basic CUDA programming is required. Experience with parallel programming frameworks such as OpenMP or MPI is not required, but a basic understanding of MPI is highly recommended.\n",
"\n",
"### Bootcamp Outline\n",
"\n",
"In this tutorial we will work with the Jacobi solver, an iterative technique for solving systems of linear equations; a minimal CUDA-aware MPI sketch at the end of this notebook previews the style of code developed in the lab. To begin, click on the first link below:\n",
"\n",
"1. Overview of the Jacobi solver application\n",
"    * Review of the single-GPU code\n",
"    * Parallelizing to multiple GPUs using standard (non-CUDA-aware) MPI\n",
"2. Profiling with NVTX and Nsight Systems\n",
"    * Profiling the multi-GPU standard MPI code\n",
"    * Using single-node CUDA-aware MPI with compute-copy overlap\n",
"3. Communication topology\n",
"    * Overview of intra-node and inter-node communication architecture\n",
"    * Benchmarking communication networks\n",
"4. Profiling with HPCToolkit\n",
"    * Analysis of GPUDirect P2P with single-node MPI\n",
"    * Analysis of GPUDirect RDMA with multi-node MPI\n",
"5. NCCL Library\n",
"6. NVSHMEM Library\n",
"\n",
"---\n",
"\n",
"## Licensing\n",
"\n",
"This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0) license." ] },
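{ "cell_type": "markdown", "metadata": {}, "source": [
"### A Minimal CUDA-aware MPI Sketch\n",
"\n",
"As a quick preview of the programming model covered in this bootcamp, the sketch below shows the core idea behind CUDA-aware MPI: device pointers are passed directly to MPI calls, so halo data can be exchanged without explicit staging through host buffers. This is an illustrative sketch rather than the lab's actual code; the buffer names, the row width, and the periodic neighbor layout are assumptions made for brevity, and running it requires an MPI build with CUDA-awareness enabled.\n",
"\n",
"```c\n",
"/* Illustrative sketch only: exchange one halo row of a 1D-decomposed grid\n",
"   between neighboring ranks using CUDA-aware MPI. */\n",
"#include <mpi.h>\n",
"#include <cuda_runtime.h>\n",
"\n",
"int main(int argc, char **argv) {\n",
"    MPI_Init(&argc, &argv);\n",
"    int rank, size;\n",
"    MPI_Comm_rank(MPI_COMM_WORLD, &rank);\n",
"    MPI_Comm_size(MPI_COMM_WORLD, &size);\n",
"\n",
"    const int nx = 1024;                 /* assumed width of one halo row */\n",
"    double *d_send, *d_recv;             /* halo buffers in device memory */\n",
"    cudaMalloc((void **)&d_send, nx * sizeof(double));\n",
"    cudaMalloc((void **)&d_recv, nx * sizeof(double));\n",
"    cudaMemset(d_send, 0, nx * sizeof(double));\n",
"\n",
"    int up   = (rank + size - 1) % size; /* periodic neighbors for brevity */\n",
"    int down = (rank + 1) % size;\n",
"\n",
"    /* With a CUDA-aware MPI build, device pointers go straight into MPI calls. */\n",
"    MPI_Sendrecv(d_send, nx, MPI_DOUBLE, up,   0,\n",
"                 d_recv, nx, MPI_DOUBLE, down, 0,\n",
"                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);\n",
"\n",
"    cudaFree(d_send);\n",
"    cudaFree(d_recv);\n",
"    MPI_Finalize();\n",
"    return 0;\n",
"}\n",
"```\n",
"\n",
"In the lab, this kind of exchange is combined with the Jacobi stencil update, CUDA streams for compute-copy overlap, and NVTX instrumentation, as listed in the outline above. A typical build command would resemble `mpicc halo_sketch.c -o halo_sketch -lcudart` (the file name here is hypothetical), with the CUDA include and library paths added as needed for the local installation." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }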