"
]
},
{
"cell_type": "markdown",
"id": "ecea172b",
"metadata": {},
"source": [
"# Introduction to Distributed Deep Learning - Part 2\n",
"\n",
"**Contents of this notebook:**\n",
"\n",
"- [Understanding System Topology](#Understanding-System-Topology)\n",
" - [Communication concepts](#Communication-concepts)\n",
"- [Intra-Node Communication Topology](#Intra-Node-communication-Topology)\n",
" - [Performance variation due to system topology](#Performance-variation-due-to-system-topology)\n",
" - [Profiling using DLProf](#Profiling-using-DLProf)\n",
"- [NCCL](#NCCL)\n",
" - [NCCL_P2P_LEVEL=0 or P2P Disabled](#NCCL_P2P_LEVEL=0-or-P2P-Disabled)\n",
" - [NCCL_P2P_LEVEL=1 or P2P via PCIe](#NCCL_P2P_LEVEL=1-or-P2P-via-PCIe)\n",
"- [Benchmarking the system topology](#Benchmarking-the-system-topology)\n",
"\n",
"**By the End of this Notebook you will:**\n",
"\n",
"- Understand how system topolgy plays a role in Distributed training.\n",
"- Understand intra-node topology and underlying technologies like P2P and their implication on program performance"
]
},
{
"cell_type": "markdown",
"id": "2b543f7a",
"metadata": {},
"source": [
"# Understanding System Topology\n",
"\n",
"In our previous notebook when we calculated the Throughput of deep learning training with different parameters , we saw a slight dip when we scaled from 4 to 8 GPUs , let us try to reason it by understanding the underlying the system.\n",
"\n",
"Before we begin, let us define two important terms:\n",
"\n",
"* **Latency:** The amount of time it takes to take a unit of data from point A to point B. For example, if 4B of data can be transferred from point A to B in 4 $\\mu$s, that is the latency of transfer.\n",
"* **Bandwidth:** The amount of data that can be transferred from point A to point B in a unit of time. For example, if the width of the bus is 64KiB and latency of transfer between point A and B is 4 $\\mu$s, the bandwidth is 64KiB * (1/4$\\mu$s) = 1.6 GiB/s.\n",
"\n",
"### Setting up the GPU\n",
"\n",
"To verify that our system has multiple GPUs in each node, run the command below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "98a2c07a",
"metadata": {},
"outputs": [],
"source": [
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"id": "0fa346e5",
"metadata": {},
"source": [
"If the output is unclear, you can launch a Terminal session by clicking on `File` $\\rightarrow$ `New` $\\rightarrow$ `Terminal` or by following the steps as shown:\n",
"\n",
"![open_terminal_session](images/open_terminal.png)\n",
"\n",
"## Communication concepts\n",
"\n",
"There are many ways in which GPUs can transfer data between one another , let us look at two of the most used copy operations. Understanding these will help us with the further sections of the notebook when we benchmark and toggle different options available to us.\n",
"\n",
"#### Host Staging of Copy Operations\n",
"\n",
"The path taken by the data in both the cases is denoted by the red arrow as follows:\n",
"\n",
"
\n",
"\n",
"That is, in the above GPU-to-GPU memory copy, the data traverses from GPU 0 the PCIe bus to the CPU, where it is staged in a buffer before being copied to GPU 1. This is called \"host staging\" and it decreases the bandwidth while increasing the latency of the operation. If we eliminate host staging, we can usually improve the performance of our application.\n",
"\n",
"#### Peer-to-Peer Memory Access\n",
"\n",
"P2P allows devices to address each other's memory from within device kernels and eliminates host staging by transferring data either through the PCIe switch or through NVLink as denoted by the red arrow below. \n",
"\n",
"
\n",
"\n",
"Peer-to-Peer (P2P) memory access requires GPUs to share a Unified Virtual Address Space (UVA). UVA means that a single address space is used for the host and all modern NVIDIA GPU devices (specifically, those with compute capibility of 2.0 or higher).\n",
"\n",
"Let us now try to understand the Intra-node topology."
]
},
{
"cell_type": "markdown",
"id": "c33313ee",
"metadata": {},
"source": [
"## Intra-Node communication Topology\n",
"\n",
"Run the command below to display your node's GPU and NIC communication topology:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c770dd5a",
"metadata": {},
"outputs": [],
"source": [
"!nvidia-smi topo -m"
]
},
{
"cell_type": "markdown",
"id": "0c33e5e5",
"metadata": {},
"source": [
"Output of running the command on DGX-1 : \n",
"\n",
"![nvidia_smi_topo_output](images/nvidia_smi_topo_output.png)\n",
"\n",
"Focus one a particular row, say GPU 0. The output states that GPUs 1 through 4 are connected to it via NVLink (in addition to PCIe) and GPUs 5 through 7 are connected to it via PCIe as well as an \"SMP\" interconnect. We have a dual-socket system and the CPUs in these sockets are connected by an interconnect known as SMP interconnect.\n",
"\n",
"Thus, GPU 0 to GPU 5 communication happens via not just PCIe, but also over the inter-socket interconnect within the same node. Clearly, this is a longer path than say the one between GPU 0 and GPU 1, which are connected via NVLink directly.\n",
"\n",
"Even within the GPUs connected via NVLink, we see different annotations such as `NV1` and `NV2` that affect the communication bandwidth and hence the performance. In this section, we will explore the nuances associated with a diverse intra-node GPU communication topology like in the output above. Specifically, in our system, the communication topology is as follows:\n",
"\n",
"![dgx1_8x_tesla_v100_topo](images/dgx1_8x_tesla_v100_topo.png)\n",
"\n",
"Qualitatively, the bandwidth and latency vary with the topology as follows:\n",
"\n",
"
\n",
"\n",
"Host staging implies traversing through the CPU and the travel path taken is one of PHB, NODE, and SYS. In contrast, if the path taken is either NV1, NV2, or PIX, then P2P is available. PXB implies that the GPUs belong to different PCIe hubs and P2P is usually not supported in this case.\n",
"\n",
"A double NVLink connection provides twice the bandwidth compared to a single NVLink. \n",
"\n",
"For a pair of 2 GPUs, the peak bidirectional bandwidth are as follows:\n",
"* PCIe: Using PIX topology, 15.75GB/s for PCIe Gen 3.0 and 31.5GB/s for PCIe Gen 4.0.\n",
"* NVLink: Using NV# topology, 50GB/s per connection. So a double NVLink connection has 100GB/s peak bidirectional bandwidth.\n",
"\n",
"Let us understand what difference the underlying communication topology can make to the application performance in the following sub-section.\n",
"\n",
"**Note:** If your command output doesn't show any NVLink connection or if there's no difference in connection type (PIX, PXB, PHB, NODE, SYS, NV#) between any 2 pair of GPUs, then the communication bandwidth and latency will likely be the same between any pair and the following sub-sections will not display any performance difference."
]
},
{
"cell_type": "markdown",
"id": "0ed4ed8c",
"metadata": {},
"source": [
"### Performance variation due to system topology\n",
"\n",
"So far, we have run the application specifying the number of GPUs to use. To specify which GPU to use, we can supply the `CUDA_VISIBLE_DEVICES` environment variable to the executable to run our code on specific GPUs. If we want to run on only 2 GPUs, namely GPU 0 and GPU 3, we use set the environment variable `CUDA_VISIBLE_DEVICES=\"0,3\"` while executing the command. Let us also include the `NCCL_DEBUG=INFO` variable to understand how the GPUs are connected. We will take a closer look into the NCCL library in the upcoming section.\n",
"\n",
"Let us now run the command with two GPUs and compare the throughput achieved in both cases.\n",
"\n",
"\n",
"**Experiment 1** : Try to find the GPU pair with **highest bandwidth and lowest latency** available as per the table above and replace `0,3` with those GPUs, and then run the command below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fb355c35",
"metadata": {},
"outputs": [],
"source": [
"!TF_CPP_MIN_LOG_LEVEL=3 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=\"0,3\" horovodrun -np 2 --mpi-args=\"--oversubscribe\" python3 ../source_code/N2/cnn_fmnist.py --batch-size=512 2> /dev/null"
]
},
{
"cell_type": "markdown",
"id": "0bea64af",
"metadata": {},
"source": [
"**Experiment 2** : Let us run the command below with the **highest latency and lowest bandwidth** ( in our case GPU `1,7` )."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "66861f56",
"metadata": {},
"outputs": [],
"source": [
"!TF_CPP_MIN_LOG_LEVEL=3 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=\"1,7\" horovodrun -np 2 --mpi-args=\"--oversubscribe\" python3 ../source_code/N2/cnn_fmnist.py --batch-size=512 2> /dev/null"
]
},
{
"cell_type": "markdown",
"id": "08e92959",
"metadata": {},
"source": [
"Now with the results obtained, we can now compare them :\n",
"\n",
"The scaling efficiency would likely be higher for the set of GPUs having **low latency and high bandwidth**.\n",
"\n",
"Output of running the command on DGX-1 : \n",
"\n",
"\n",
"```bash\n",
"NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff\n",
"NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0\n",
"NCCL INFO Setting affinity for GPU 3 to 0fffff00,000fffff\n",
"NCCL INFO Channel 00 : 1[b000] -> 0[6000] via P2P/IPC\n",
"NCCL INFO Channel 00 : 0[6000] -> 1[b000] via P2P/IPC\n",
"NCCL INFO Channel 01 : 1[b000] -> 0[6000] via P2P/IPC\n",
"NCCL INFO Channel 01 : 0[6000] -> 1[b000] via P2P/IPC\n",
"NCCL INFO Channel 02 : 1[b000] -> 0[6000] via P2P/IPC\n",
"NCCL INFO Channel 02 : 0[6000] -> 1[b000] via P2P/IPC\n",
"NCCL INFO Channel 03 : 1[b000] -> 0[6000] via P2P/IPC\n",
"NCCL INFO Channel 03 : 0[6000] -> 1[b000] via P2P/IPC\n",
"NCCL INFO Connected all rings\n",
"NCCL INFO Connected all trees\n",
"NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512\n",
"NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer\n",
"\n",
"\n",
"Epoch 4/8\n",
"Images/sec: 100702.49\n",
"Epoch 5/8\n",
"Images/sec: 101486.84\n",
"Epoch 6/8\n",
"Images/sec: 101490.28\n",
"Epoch 7/8\n",
"Images/sec: 99128.98\n",
"Epoch 8/8\n",
"Images/sec: 101215.77\n",
"```\n",
"\n",
"Now, run the binary a pair of GPUs that have the lowest available bandwidth. In our case, we use GPU 1 and GPU 7. \n",
"\n",
"\n",
"Output of running the command on DGX-1 : \n",
"\n",
"```bash\n",
"NCCL INFO Setting affinity for GPU 7 to ffff,f00000ff,fff00000\n",
"NCCL INFO Channel 00/02 : 0 1\n",
"NCCL INFO Channel 01/02 : 0 1\n",
"NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1\n",
"NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff\n",
"NCCL INFO Channel 00 : 1[8a000] -> 0[7000] via direct shared memory\n",
"NCCL INFO Channel 00 : 0[7000] -> 1[8a000] via direct shared memory\n",
"NCCL INFO Channel 01 : 1[8a000] -> 0[7000] via direct shared memory\n",
"NCCL INFO Channel 01 : 0[7000] -> 1[8a000] via direct shared memory\n",
"NCCL INFO Connected all rings\n",
"NCCL INFO Connected all trees\n",
"NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512\n",
"NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer\n",
"\n",
"Epoch 4/8\n",
"Images/sec: 98996.51\n",
"Epoch 5/8\n",
"Images/sec: 98135.64\n",
"Epoch 6/8\n",
"Images/sec: 97798.09\n",
"Epoch 7/8\n",
"Images/sec: 96672.95\n",
"Epoch 8/8\n",
"Images/sec: 95782.78\n",
"```\n",
"\n",
"Let us now try to understand the time taken for communication using DLProf profiling."
]
},
{
"cell_type": "markdown",
"id": "788c6703",
"metadata": {},
"source": [
"### Profiling using DLProf\n",
"\n",
"**Important : If you are new to DLProf , kindly go through the DLProf Introduction notebook [here](2.2.DLProf.ipynb)**\n",
"\n",
"\n",
"The difference is subtle when we directly compare their throughput, but when we profile them using `dlprof` we can notice the difference in average time taken per NCCL call below. "
]
},
{
"cell_type": "markdown",
"id": "c75ec37a",
"metadata": {},
"source": [
"Let us now profile both cases from above and understand the communication time taken.\n",
"\n",
"**Profiling case 1** : Try to find the GPU pair with highest bandwidth and lowest latency available as per the table above and replace 0,3 with those GPUs, and then run the command below:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "007724ed",
"metadata": {},
"outputs": [],
"source": [
"!TF_CPP_MIN_LOG_LEVEL=3 CUDA_VISIBLE_DEVICES=\"0,3\" dlprof --output_path=\"Profile/N2_1\" horovodrun -np 2 --mpi-args=\"--oversubscribe\" python3 ../source_code/N2/cnn_fmnist.py --batch-size=512"
]
},
{
"cell_type": "markdown",
"id": "15166485",
"metadata": {},
"source": [
"**Profiling case 2** : Let us run the command below with the highest latency and lowest bandwidth.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3620a8bf",
"metadata": {},
"outputs": [],
"source": [
"!TF_CPP_MIN_LOG_LEVEL=3 CUDA_VISIBLE_DEVICES=\"1,7\" dlprof --output_path=\"Profile/N2_2\" horovodrun -np 2 --mpi-args=\"--oversubscribe\" python3 ../source_code/N2/cnn_fmnist.py --batch-size=512"
]
},
{
"cell_type": "markdown",
"id": "edd572cd",
"metadata": {},
"source": [
"Let us now view the profile using `dlprofviewer`\n",
"\n",
"To run the `dlprofviewer` server , open a new terminal and run the following command. Replace the port `6666` to the port that you want to view the `dlprofviewer` server to run on.\n",
"\n",
"```bash\n",
"dlprofviewer -b 0.0.0.0 --port 6666 Profile/N2_1/dlprof_dldb.sqlite\n",
"```\n",
"\n",
"Let us now open the Operatations table and view the operations summary."
]
},
{
"cell_type": "markdown",
"id": "d1a9953e",
"metadata": {},
"source": [
"For the DGX-1 cluster ,we get the following results from profiling.\n",
"\n",
"\n",
"Using GPU 0 & 3 \n",
"![p2p_2_gpu_memcpy_nsys](images/0_3.png)\n",
"\n",
"Using GPU 1 & 7 \n",
"![p2p_2_gpu_memcpy_nsys](images/2_7.png)\n",
"\n",
"We can notice the difference in average time taken by the NCCL calls has increase significantly in **case 2** , this is a much better respresentation to understand the time taken for the communication calls.\n",
"\n",
"Let us now try to understand how NCCL works and other options present in NCCL."
]
},
{
"cell_type": "markdown",
"id": "9d9808b3",
"metadata": {},
"source": [
"# NCCL\n",
"\n",
"The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking. NCCL provides routines such as `all-gather`, `all-reduce`, `broadcast`, `reduce`, `reduce-scatter` as well as point-to-point send and receive that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox Network across nodes.\n",
"\n",
"The Horovod framework also uses NCCL Collective communications to keep the all the GPUs in sync , we can then toggle P2P levels using Environment variables to manually switch between different communication protocols available. The complete list of Environment variables can be found \n",
"[here](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-p2p-disable)\n",
"\n",
"Let us now toggle Peer-to-peer levels using the `NCCL_P2P_LEVEL` environment variable. \n",
"\n",
"```text\n",
"NCCL_P2P_LEVEL\n",
"(since 2.3.4)\n",
"\n",
"The NCCL_P2P_LEVEL variable allows the user to finely control when to use the peer to peer (P2P) transport between GPUs. The level defines the maximum distance between GPUs where NCCL will use the P2P transport.\n",
"\n",
"Values accepted\n",
"LOC or 0 : Never use P2P (always disabled)\n",
"\n",
"NVL : Use P2P when GPUs are connected through NVLink\n",
"\n",
"PIX or 1 : Use P2P when GPUs are on the same PCI switch.\n",
"\n",
"PXB or 2 : Use P2P when GPUs are connected through PCI switches (potentially multiple hops).\n",
"\n",
"PHB or 3, or 4 : Use P2P when GPUs are on the same NUMA node. Traffic will go through the CPU.\n",
"\n",
"SYS or 5 : Use P2P betweem NUMA nodes, potentially crossing the SMP interconnect (e.g. QPI/UPI).\n",
"```\n",
"\n",
"We have benchmarked for the case where we use NVLink and verified it through `NCCL_DEBUG` environment variable, let us now try two different settings and compare their throughputs.\n",
"\n",
"### NCCL_P2P_LEVEL=0 or P2P Disabled"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9571de9a",
"metadata": {},
"outputs": [],
"source": [
"!NCCL_P2P_LEVEL=0 TF_CPP_MIN_LOG_LEVEL=3 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=\"0,3\" horovodrun -np 2 --mpi-args=\"--oversubscribe\" python3 ../source_code/N2/cnn_fmnist.py --batch-size=512 2> /dev/null"
]
},
{
"cell_type": "markdown",
"id": "dc07df4a",
"metadata": {},
"source": [
"Output of running the command on DGX-1 : \n",
"\n",
"```bash\n",
"NCCL INFO NCCL_P2P_LEVEL set by environment to LOC\n",
"NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0\n",
"NCCL INFO Channel 00/02 : 0 1\n",
"NCCL INFO Setting affinity for GPU 3 to 0fffff00,000fffff\n",
"NCCL INFO Channel 01/02 : 0 1\n",
"NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1\n",
"NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff\n",
"NCCL INFO Channel 00 : 1[b000] -> 0[6000] via direct shared memory\n",
"NCCL INFO Channel 00 : 0[6000] -> 1[b000] via direct shared memory\n",
"NCCL INFO Channel 01 : 1[b000] -> 0[6000] via direct shared memory\n",
"NCCL INFO Channel 01 : 0[6000] -> 1[b000] via direct shared memory\n",
"NCCL INFO Connected all rings\n",
"NCCL INFO Connected all trees\n",
"NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512\n",
"NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer\n",
"\n",
"Epoch 4/8\n",
"Images/sec: 95033.4\n",
"Epoch 5/8\n",
"Images/sec: 94848.44\n",
"Epoch 6/8\n",
"Images/sec: 94289.97\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "13ac3702",
"metadata": {},
"source": [
"### NCCL_P2P_LEVEL=1 or P2P via PCIe"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a17a92f",
"metadata": {},
"outputs": [],
"source": [
"!NCCL_P2P_LEVEL=1 TF_CPP_MIN_LOG_LEVEL=3 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=\"0,3\" horovodrun -np 2 --mpi-args=\"--oversubscribe\" python3 ../source_code/N2/cnn_fmnist.py --batch-size=512 2> /dev/null"
]
},
{
"cell_type": "markdown",
"id": "893cb1f6",
"metadata": {},
"source": [
"Output of running the command on DGX-1 : \n",
"\n",
"```bash\n",
"NCCL INFO NCCL_P2P_LEVEL set by environment to PIX\n",
"NCCL INFO NCCL_P2P_LEVEL set by environment to PIX\n",
"NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0\n",
"NCCL INFO Channel 00/04 : 0 1\n",
"NCCL INFO Channel 01/04 : 0 1\n",
"NCCL INFO Channel 02/04 : 0 1\n",
"NCCL INFO Setting affinity for GPU 3 to 0fffff00,000fffff\n",
"NCCL INFO Channel 03/04 : 0 1\n",
"NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1\n",
"NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff\n",
"NCCL INFO Channel 00 : 1[b000] -> 0[6000] via P2P/IPC\n",
"NCCL INFO Channel 00 : 0[6000] -> 1[b000] via P2P/IPC\n",
"NCCL INFO Channel 01 : 1[b000] -> 0[6000] via P2P/IPC\n",
"NCCL INFO Channel 01 : 0[6000] -> 1[b000] via P2P/IPC\n",
"NCCL INFO Channel 02 : 1[b000] -> 0[6000] via P2P/IPC\n",
"NCCL INFO Channel 02 : 0[6000] -> 1[b000] via P2P/IPC\n",
"NCCL INFO Channel 03 : 1[b000] -> 0[6000] via P2P/IPC\n",
"NCCL INFO Channel 03 : 0[6000] -> 1[b000] via P2P/IPC\n",
"NCCL INFO Connected all rings\n",
"NCCL INFO Connected all trees\n",
"Epoch 4/8\n",
"Images/sec: 96529.63\n",
"Epoch 5/8\n",
"Images/sec: 97288.7\n",
"Epoch 6/8\n",
"Images/sec: 97230.33\n",
"Epoch 7/8\n",
"Images/sec: 97701.72\n",
"Epoch 8/8\n",
"Images/sec: 97075.39\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "63824c90",
"metadata": {},
"source": [
"We can summarise the results using the following table. \n",
"\n",
"|GPUs|Condition|Throughput|\n",
"|-|-|-|\n",
"|0,3|P2P via NVLink|~100000|\n",
"|0,3|P2P via PCIe|~97000|\n",
"|0,3|P2P Disabled|~95000|\n",
"\n",
"We now understand the role of communication and hardware configuration for training. In our case, we used a smaller model for quicker runtimes, this decrease in throughput due to communication is more pronounced when the data transfer size increases for larger models that typically require multi-node training, and in those cases, NVLink helps reduce the scaling efficiency gap as we scale further."
]
},
{
"cell_type": "markdown",
"id": "67065b6f",
"metadata": {},
"source": [
"### Benchmarking the system topology\n",
"\n",
"The above application is not very memory intensive as we mentioned earlier, this can also be verified using `dlprof` that most of the time in GPU is spent on computation. Therefore, to get a quantitative measure of latency and bandwidth impact due to topology, we run a micro-benchmark.\n",
"\n",
"**The p2pBandwidthLatencyTest micro-benchmark**\n",
"\n",
"p2pBandwidthLatencyTest is a part of [CUDA Samples GitHub repository](https://github.com/NVIDIA/cuda-samples) available to help CUDA developers. \n",
"\n",
"As the name suggests, this test measures the bandwidth and latency impact of P2P and underlying communication topology. Let's compile the benchmark:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d7255e1",
"metadata": {},
"outputs": [],
"source": [
"!cd ../source_code/N2/Samples/p2pBandwidthLatencyTest/ && make clean && make"
]
},
{
"cell_type": "markdown",
"id": "f7bb509a",
"metadata": {},
"source": [
"Now, let's run the benchmark:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4bc69bc1",
"metadata": {},
"outputs": [],
"source": [
"!cd ../source_code/N2/Samples/p2pBandwidthLatencyTest/ && ./p2pBandwidthLatencyTest"
]
},
{
"cell_type": "markdown",
"id": "b6cec4a3",
"metadata": {},
"source": [
"The first part of the benchmark gives device information and P2P access available from each GPU (similar to `nvidia-smi topo -m` command). Next, the benchmark measures the unidirectional and bidirectional bandwidth and latency with P2P disabled and enabled.\n",
"\n",
"We share partial results obtained on running the command on DGX-1 : :\n",
"\n",
"```bash\n",
"Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)\n",
" D\\D 0 1 2 3 4 5 6 7 \n",
" 0 783.95 9.56 14.43 14.46 14.47 14.24 14.51 14.43 \n",
"\n",
"Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)\n",
" D\\D 0 1 2 3 4 5 6 7 \n",
" 0 784.87 48.49 48.49 96.85 96.90 14.25 14.54 14.49 \n",
" \n",
"P2P=Disabled Latency Matrix (us)\n",
" GPU 0 1 2 3 4 5 6 7 \n",
" 0 1.78 17.52 16.41 16.43 17.35 16.88 17.34 16.85 \n",
" \n",
"P2P=Enabled Latency (P2P Writes) Matrix (us)\n",
" GPU 0 1 2 3 4 5 6 7 \n",
" 0 1.76 1.62 1.61 2.01 2.02 18.44 19.15 19.34\n",
"```\n",
"\n",
"Our system is based on PCIe gen 3.0 with a peak maximum GPU-GPU PCIe banwidth of 15.75 GB/s. Let us analyze and understand these results:\n",
"\n",
"* GPU 0 and GPU 1/2: Connected by a single NVLink connection. By enabling P2P-\n",
" - Bandwidth reaches close to the maximum peak of 50 GB/s.\n",
" - Latency decreases by an order of magnitude.\n",
"* GPU 0 and GPU 3/4: Connected by a double NVLink connection. By enabling P2P-\n",
" - Bandwidth reaches close to the maximum peak of 100 GB/s.\n",
" - Latency decreases by an order of magnitude.\n",
"* GPU 0 and GPU 5/6/7: Connected by PCIe and SMP interconnect. By enabling P2P- \n",
" - Bandwidth is unchanged.\n",
" - Latency increases a marginally.\n",
" \n",
"Correlate these results with the communication topology that can be displayed by usng `nvidia-smi topo -m` command and the qualtitative table in the previous section. They should be consistent with one another.\n",
"\n",
"In general, we should try to set the GPUs in an application such that a GPU can share data with its neighbours using a high-bandwidth, low-latency communication topology. Enabling P2P, when possible, usually improves the performance by eliminating host staging."
]
},
{
"cell_type": "markdown",
"id": "6305bd4c",
"metadata": {},
"source": [
"**Now that we understand the role of system topology in distributed deep learning , let us now get hands-on with refractoring and scaling Deep learning models in the upcoming notebook.**"
]
},
{
"cell_type": "markdown",
"id": "95e57415",
"metadata": {},
"source": [
"***\n",
"\n",
"## Licensing\n",
"\n",
"This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
]
},
{
"cell_type": "markdown",
"id": "200bbfb9-9036-46e2-8f5d-c83be119ba68",
"metadata": {},
"source": [
"