{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [
"## Multi-GPU Programming and Performance Analysis\n",
"\n",
"## Learning objectives\n",
"\n",
"Scaling an application to multiple GPUs across multiple nodes requires proficiency not only in the programming models and optimization techniques, but also in root-cause analysis through in-depth profiling to identify and minimize bottlenecks. In this bootcamp, participants will improve the performance of an application step by step, taking cues from profilers along the way. In addition, an understanding of the underlying technologies and communication topology will help participants use high-performance NVIDIA libraries to extract more performance out of the system.\n",
"\n",
"By the end of this bootcamp session, participants will be adept at:\n",
"* Reviewing communication architecture and topology\n",
"* Developing CUDA-aware multi-node, multi-GPU MPI applications\n",
"* Profiling the application using Nsight Systems and HPCToolkit\n",
"* Applying optimizations such as CUDA streams and compute-communication overlap\n",
"* Understanding GPUDirect technologies like P2P and RDMA\n",
"* Utilizing the NVIDIA NCCL and NVSHMEM libraries" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"### Tutorial Duration\n",
"The lab will take 8 hours to complete. A link to download all materials will be provided at the end of the lab.\n",
"\n",
"### Content Level\n",
"Intermediate, Advanced\n",
"\n",
"### Target Audience and Prerequisites\n",
"The target audience for this lab is researchers, graduate students, and developers who are interested in scaling their scientific applications to multiple nodes using multi-GPU implementations.\n",
"\n",
"Experience with C/C++ and basic CUDA programming is required. Experience with parallel programming frameworks such as OpenMP or MPI is not required, but a basic understanding of MPI is highly recommended.\n",
"\n",
"### Bootcamp Outline\n",
"\n",
"In this tutorial we will work with the Jacobi solver, an iterative technique for solving systems of linear equations; a minimal CUDA-aware MPI sketch at the end of this notebook previews the style of code developed in the lab. To begin, click on the first link below:\n",
"\n",
"1. Overview of the Jacobi solver application\n",
"    * Review of the single-GPU code\n",
"    * Parallelizing to multiple GPUs using standard (non-CUDA-aware) MPI\n",
"2. Profiling with NVTX and Nsight Systems\n",
"    * Profiling the multi-GPU standard MPI code\n",
"    * Using single-node CUDA-aware MPI with compute-copy overlap\n",
"3. Communication topology\n",
"    * Overview of intra-node and inter-node communication architecture\n",
"    * Benchmarking communication networks\n",
"4. Profiling with HPCToolkit\n",
"    * Analysis of GPUDirect P2P with single-node MPI\n",
"    * Analysis of GPUDirect RDMA with multi-node MPI\n",
"5. NCCL Library\n",
"6. NVSHMEM Library\n",
"\n",
"---\n",
"\n",
"## Licensing\n",
"\n",
"This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0) license." ] },
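{ "cell_type": "markdown", "metadata": {}, "source": [
"### A Minimal CUDA-aware MPI Sketch\n",
"\n",
"As a quick preview of the programming model covered in this bootcamp, the sketch below shows the core idea behind CUDA-aware MPI: device pointers are passed directly to MPI calls, so halo data can be exchanged without explicit staging through host buffers. This is an illustrative sketch rather than the lab's actual code; the buffer names, the row width, and the periodic neighbor layout are assumptions made for brevity, and running it requires an MPI build with CUDA-awareness enabled.\n",
"\n",
"```c\n",
"/* Illustrative sketch only: exchange one halo row of a 1D-decomposed grid\n",
"   between neighboring ranks using CUDA-aware MPI. */\n",
"#include <mpi.h>\n",
"#include <cuda_runtime.h>\n",
"\n",
"int main(int argc, char **argv) {\n",
"    MPI_Init(&argc, &argv);\n",
"    int rank, size;\n",
"    MPI_Comm_rank(MPI_COMM_WORLD, &rank);\n",
"    MPI_Comm_size(MPI_COMM_WORLD, &size);\n",
"\n",
"    const int nx = 1024;                 /* assumed width of one halo row */\n",
"    double *d_send, *d_recv;             /* halo buffers in device memory */\n",
"    cudaMalloc((void **)&d_send, nx * sizeof(double));\n",
"    cudaMalloc((void **)&d_recv, nx * sizeof(double));\n",
"    cudaMemset(d_send, 0, nx * sizeof(double));\n",
"\n",
"    int up   = (rank + size - 1) % size; /* periodic neighbors for brevity */\n",
"    int down = (rank + 1) % size;\n",
"\n",
"    /* With a CUDA-aware MPI build, device pointers go straight into MPI calls. */\n",
"    MPI_Sendrecv(d_send, nx, MPI_DOUBLE, up,   0,\n",
"                 d_recv, nx, MPI_DOUBLE, down, 0,\n",
"                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);\n",
"\n",
"    cudaFree(d_send);\n",
"    cudaFree(d_recv);\n",
"    MPI_Finalize();\n",
"    return 0;\n",
"}\n",
"```\n",
"\n",
"In the lab, this kind of exchange is combined with the Jacobi stencil update, CUDA streams for compute-copy overlap, and NVTX instrumentation, as listed in the outline above. A typical build command would resemble `mpicc halo_sketch.c -o halo_sketch -lcudart` (the file name here is hypothetical), with the CUDA include and library paths added as needed for the local installation." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }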