{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multi-GPU Programming and Performance Analysis\n",
    "\n",
    "## Learning objectives\n",
    "\n",
    "Scaling applications to multiple GPUs across multiple nodes requires proficiency not only in programming models and optimization techniques, but also in root-cause analysis: using in-depth profiling to identify and minimize bottlenecks. In this bootcamp, participants will learn to improve the performance of an application step by step, taking cues from profilers along the way. Moreover, an understanding of the underlying technologies and communication topology will help participants use high-performance NVIDIA libraries to extract more performance out of the system.\n",
    "\n",
    "By the end of this bootcamp session, participants will be adept at:\n",
    "* Reviewing communication architecture and topology\n",
    "* Developing CUDA-aware multi-node multi-GPU MPI applications\n",
    "* Profiling the application using NVIDIA Nsight Systems\n",
    "* Applying optimizations like CUDA streams, events, and overlapping compute and communication\n",
    "* Understanding GPUDirect technologies like P2P and RDMA\n",
    "* Using the NVIDIA NCCL and NVSHMEM libraries\n",
    "\n",
    "### Bootcamp Duration\n",
    "\n",
    "The bootcamp takes approximately 8 hours to complete. A link to download all materials will be available at the end of the lab.\n",
    "\n",
    "### Content Level\n",
    "\n",
    "Intermediate, Advanced\n",
    "\n",
    "### Target Audience and Prerequisites\n",
    "\n",
    "The target audience for this lab is researchers, graduate students, and developers who are interested in scaling their scientific applications to multiple nodes using multi-GPU implementations.\n",
    "\n",
    "Experience with C/C++ and basic CUDA programming is required. Experience with parallel programming frameworks like OpenMP or MPI is not required, but a basic understanding of MPI is highly recommended.\n",
    "\n",
    "### Bootcamp Outline\n",
    "\n",
    "In this tutorial, we will work with the Jacobi solver, an iterative method for solving systems of linear equations. To begin, click on the first link below:\n",
    "\n",
    "1. [Overview of single-GPU code and Nsight Systems Profiler](C/jupyter_notebook/single_gpu/single_gpu_overview.ipynb)\n",
    "2. Single Node Multi-GPU:\n",
    "    * [CUDA Memcpy and Peer-to-Peer Memory Access](C/jupyter_notebook/cuda/memcpy.ipynb)\n",
    "    * [Intra-node topology](C/jupyter_notebook/advanced_concepts/single_node_topology.ipynb)\n",
    "    * [CUDA Streams and Events](C/jupyter_notebook/cuda/streams.ipynb)\n",
    "3. Multi-Node Multi-GPU:\n",
    "    * [Introduction to MPI and Multi-Node execution overview](C/jupyter_notebook/mpi/multi_node_intro.ipynb)\n",
    "    * [MPI with CUDA Memcpy](C/jupyter_notebook/mpi/memcpy.ipynb)\n",
    "    * [CUDA-aware MPI](C/jupyter_notebook/mpi/cuda_aware.ipynb)\n",
    "    * [Supplemental: Configuring MPI in a containerized environment](C/jupyter_notebook/mpi/containers_and_mpi.ipynb)\n",
    "4. [NVIDIA Collective Communications Library (NCCL)](C/jupyter_notebook/nccl/nccl.ipynb)\n",
    "5. [NVSHMEM Library](C/jupyter_notebook/nvshmem/nvshmem.ipynb)\n",
    "\n",
    "---\n",
    "\n",
    "## Licensing\n",
    "\n",
    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}