\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to Distributed Deep Learning\n",
"\n",
"**Contents of this notebook:**\n",
"\n",
"- [The need for Distributed Deep Learning](#The-need-for-Distributed-Deep-Learning)\n",
"- [Differnet types of Distributed Deep learning and it's applications](#Differnet-types-of-Distributed-Deep-learning-and-it's-applications)\n",
" - [Training and Inference](#Training-and-Inference)\n",
" - [Data and Model Parallelism](#Data-and-Model-Parallelism)\n",
" - [Framework and NVIDIA NGC Support - Optional](#Framework-and-NVIDIA-NGC-Support---Optional)\n",
"- [Demo - Scalability across multiple GPUs](#Demo---Scalability-across-multiple-GPUs) \n",
"\n",
"\n",
"**By the End of this Notebook you will:**\n",
"\n",
"- Understand the need for distributed Deep Learning.\n",
"- Understand the ecosystem of distributed Deep Learning."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# The need for Distributed Deep Learning\n",
"\n",
"Artificial Intelligence has witnessed tremendous progress in the past decade, making its impact in almost every potential field. Still, the concepts for the backbone of Neural networks such as Perceptron, backpropagation were published in 1958 and 1974 respectively. They were not widespread because of the unavailability of computing power and lack of data transfer and storage systems. The introduction of the Internet and the recent improvements in compute have both been a key component of AI progress in recent times, with [NVIDIA](https://www.nvidia.com/) being the leader in AI progress in the modern world.\n",
"\n",
"With the recent advancements in computing power, modern Deep neural networks are capable of processing a wide variety of data. They can do a wide range of tasks, but these Deep learning workloads have also significantly grown in size. This makes deep learning applications time-intensive for larger workloads.\n",
"\n",
"The below chart plots the amount of computing required by various modern networks starting from AlexNet to AlphaGo Zero.\n",
"\n",
"
\n",
"\n",
"\n",
"This chart points out that AI training runs has been increasing exponentially with a 3.4-month doubling time, by comparison to [Moore's Law](https://en.wikipedia.org/wiki/Moore%27s_law) which has a doubling period of 2-years, this difference might make it practically hard to train larger networks, so to enable researchers and data scientists to train bigger networks with higher compute power, [NVIDIA](https://www.nvidia.com/) constantly innovates in both software and hardware forefronts bringing out new technology such as [AMP](https://developer.nvidia.com/automatic-mixed-precision) and [NVIDIA Tensor Cores](https://www.nvidia.com/en-in/data-center/tensor-cores/).\n",
"\n",
"- [AMP](https://developer.nvidia.com/automatic-mixed-precision): Deep Neural Network training has traditionally relied on IEEE single-precision format. However, you can train with mixed precision with half-precision while maintaining the network accuracy achieved with single precision. This technique of using both single- and half-precision representations is referred to as the mixed-precision technique. The benefits are : \n",
" - Speeds up math-intensive operations, such as linear and convolution layers, using [NVIDIA Tensor Cores](https://www.nvidia.com/en-in/data-center/tensor-cores/).\n",
" - Speeds up memory-limited operations by accessing half the bytes compared to single-precision.\n",
" - Reduces memory requirements for training models, enabling larger models or larger mini-batches.\n",
" \n",
"- [NVIDIA Tensor Cores](https://www.nvidia.com/en-in/data-center/tensor-cores/): Tensor Cores enable mixed-precision computing, dynamically adapting calculations to accelerate throughput while preserving accuracy. The latest generation expands these speedups to a full range of workloads. From 10X speedups in AI training with Tensor Float 32 (TF32), a revolutionary new precision, to 2.5X boosts for high-performance computing with floating-point 64 (FP64)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"GPUs are the choice of AI researchers and data scientists for their ability to perform massive parallelism and high throughput. \n",
"\n",
"Before going further, let us define what we mean by **throughput**. Throughput refers to the number of data units processed per unit of time. The data unit varies according to our application. For example, if we have a Computer Vision application, we calculate the images/sec processed through the deep learning network. By having a higher throughput, we process more data units, which leads us to faster convergence of our system.\n",
"\n",
"\n",
"Let us now do an experiment and calculate the throughput of training a Natural language processing with different batch sizes in a single GPU."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# GPT - Wikitext -2 : 32 ,64 , 128 , 256 ,512 , 1024\n",
"!TF_CPP_MIN_LOG_LEVEL=3 python3 ../source_code/N1/GPT.py --batch-size=32"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us now tabulate the results that we obtained in the above experiment. Ignore the first iteration as it includes graph building time and one-time operations in them.\n",
"\n",
"Output of running the command on DGX-1 : \n",
"\n",
"|Batch Size| Throughput |\n",
"|-|-|\n",
"|32|1050|\n",
"|64|1498|\n",
"|128|1906|\n",
"|256|2192|\n",
"|512|2364|\n",
"|1024|2452|\n",
"|1280|2456|\n",
"\n",
"We can notice that the throughput increases as we initially increase the batch size and reaches a ceiling at which we have completely utilised the compute or memory throughput available to us."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A straightforward method to increase training throughput is to use multiple GPU devices to increase parallelism further; the below chart demonstrates the performance improvements in an increase in throughput of images processed by the two different deep neural networks. \n",
"\n",
"
\n",
"\n",
"Now that we understand the need for more computing power for modern networks and how multiple GPUs can bridge this gap with parallelism, let us now understand the types of Distributed Deep learning and their respective applications."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Differnet types of Distributed Deep learning and it's applications"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Training and Inference\n",
"\n",
"- **Training**: This is the widely used case for distributed deep learning called distributed Training. When the computing power needed for model convergence gets higher, multiples GPUs are then used to increase parallelism and thus reduce the training time. \n",
"\n",
"- **Inference**: Deep learning inference is the process of using a trained DNN model to make predictions against previously unseen data. Distributed inferencing is used in applications that require low latency and high throughput. One such example is running Inference on multiple video streams on an Intelligent Video Analytics application built with [NVIDIA DeepStream SDK](https://developer.nvidia.com/deepstream-sdk). \n",
"\n",
"\n",
"***\n",
"\n",
"### Data and Model Parallelism\n",
"\n",
"- **Model Parallelism**: Model parallelism is the process of splitting a model up between multiple devices or nodes and creating an efficient pipeline to train the model across these devices to maximize GPU utilization. An example representation of model parallelism can be as follows.\n",
"\n",
"
\n",
"\n",
"\n",
"- **Data Parallelism**: In modern deep learning, when the dataset is too big to be fit into the memory, we could only do stochastic gradient descent for batches. The shortcoming of stochastic gradient descent is that the estimate of the gradients might not accurately represent the true gradients of using the full dataset. Therefore, it may take much longer to converge. A natural way to have a more accurate estimate of the gradients is to use larger batch sizes or even use the full dataset. To allow this, the gradients of small batches are calculated on each GPU. The final estimate of the gradients is the weighted average of the gradients calculated from all the small batches. \n",
"\n",
" - **Synchronous data parallelism** : In synchronous data parallelism, all workers train over different slices of input data in sync and aggregate gradients at each step.\n",
" - **Asynchronous data parallelism** : In synchronous data parallelism, all workers are independently training over the input data and updating variables asynchronously. \n",
" \n",
"Optional: Typically `sync` training is supported via all-reduce and `async` through parameter server architecture\n",
"\n",
"\n",
"Example representation of Synchronous and Asynchronous data parallelism :\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
"\n",
"\n",
"- **Hybrid Parallelism** : Hybrid parallelism is used when we would like to make use of both Data and Model parallelism. An example would be when we need to train a large model that cannot fit into one GPU but can fit into a node, we could use Model parallelism inside a node and using Data parallelism across nodes.\n",
"\n",
"***\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Framework and NVIDIA NGC Support - Optional\n",
"\n",
"Let us look into some frameworks that support Distributed Deep learning. \n",
"\n",
"#### Frameworks :\n",
"\n",
"- **Tensorflow & Keras** : We can distribute deep learning training using minimal code changes using the `tf.distribute` API to distribute training across multiple GPUs, multiple machines or TPUs. We will look into some strategies that `tf.distribute` API offers in the upcoming notebooks \n",
"- **PyTorch** : PyTorch enables users to distrubute their training using `torch.nn.DataParallel` and `torch.nn.parallel.DistributedDataParallel` for Data parallelism. \n",
"- **MXNet** : MXNet uses `KV Store` server for distributed training. It has 4 modes that determine how the weights are updated and determine where the model is stored.\n",
"- **Horovod** : Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. Horovod was originally developed by Uber to make distributed deep learning fast and easy to use. With Horovod, an existing training script can be scaled up to run on hundreds of GPUs in just a few lines of Python code.\n",
"\n",
"#### NVIDIA NGC Support \n",
"\n",
"The NVIDIA NGC Catalog is a curated set of GPU-optimized software for AI, HPC and Visualization. The content provided by NVIDIA and third-party ISVs simplify building, customizing, and integrating GPU-optimized software into workflows, accelerating the time to solutions for users. The NGC Catalog consists of containers, pre-trained models, Helm charts for Kubernetes deployments and industry-specific AI toolkits with software development kits (SDKs). NGC Catalog contains containers that have Multi-node support built-in and can be readily deployed. \n",
"\n",
"\n",
"We will be primarily looking into using Distributed Deep learning training using Data parallelism in the upcoming notebooks. We will use both Horovod and Tensorflow for this, so the reader can choose whichever framework they would like to follow through. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Demo - Scalability across multiple GPUs"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Now that we've have seen the basics around the Distributed deep learning system , let us now try running a demo and see how well it scales with multiple GPUs , this is done by calculating a term called as scaling efficiency. \n",
"\n",
"Scaling efficiency can be defined as follows :\n",
"\n",
"$$\n",
"Scaling\\ efficiency = \\frac{ \\frac{Total\\ samples\\ processed\\ per\\ unit\\ time}{Number\\ of\\ gpus} }{ \\ \\ Samples\\ processed\\ per\\ unit\\ time\\ per\\ gpu }\n",
"$$\n",
"\n",
"**Note : Scaling efficiency is usually less than 1 because of the additional communication that has to be performed to keep the system in sync.**\n",
"\n",
"Let us now try scaling a simple CNN clasifier with the FMNIST dataset using Synchronous training with Horovod and calcualte it's scaling efficiency run across 1,2,4 and 8 GPUs respectively."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# 1 GPU \n",
"!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 1 --mpi-args=\"--oversubscribe\" python3 ../source_code/N1/cnn_fmnist.py --batch-size=2048 2> /dev/null"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# 2 GPUs \n",
"!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 2 --mpi-args=\"--oversubscribe\" python3 ../source_code/N1/cnn_fmnist.py --batch-size=2048 2> /dev/null"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# 4 GPUs\n",
"!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 4 --mpi-args=\"--oversubscribe\" python3 ../source_code/N1/cnn_fmnist.py --batch-size=2048 2> /dev/null"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# 8 GPUs\n",
"!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 8 --mpi-args=\"--oversubscribe\" python3 ../source_code/N1/cnn_fmnist.py --batch-size=2048 2> /dev/null"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Let us now aggregate the data into the below table below and calculate the scaling efficiency. Kindly fill in the data from the results obtained in your system.\n",
"\n",
"Please ignore the first couple of iterations as they have graph building time and one-time operations in them.\n",
"\n",
"\n",
"The table below contains output of running the command on DGX-1 : \n",
"\n",
"|#GPUs |Samples/sec|Scaling efficiency|\n",
"|-|-|-|\n",
"|1| 71149| NA|\n",
"|2| 125206| ~88% |\n",
"|4| 253439| ~88%|\n",
"|8| 489929| ~86%|\n",
"\n",
"\n",
"Now that we've run a demo and calculated the scaling efficiency of the model, we notice a slight dip in the scaling efficiency training the application using 8 GPUs. The reason for this lies in the way our GPUs are connected. Let us now get in-depth to understand our hardware environment and how it has the potential to affect the performance of our model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n",
"\n",
"## Licensing\n",
"\n",
"This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"