\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to Distributed Deep Learning - Part 4\n",
"\n",
"**Contents of this notebook:**\n",
"\n",
"- [**Challenges with convergence**](#Challenges-with-convergence)\n",
" - [Concepts](#Concepts)\n",
" - [Impact of Batch size](#Impact-of-Batch-size)\n",
" - [Impact on test and validation accuracy](#Impact-on-test-and-validation-accuracy)\n",
" - [Techniques for faster convergence](#Techniques-for-faster-convergence)\n",
" - [Batch norm](#Batch-norm)\n",
" - [Learning rate scaling](#Learning-rate-scaling)\n",
" - [Learning rate warmup](#Learning-rate-warmup)\n",
" - [Using Optimizers built for Exascale Deep learning](#Using-Optimizers-built-for-Exascale-Deep-learning)\n",
"\n",
"**By the End of this Notebook you will:**\n",
"\n",
"- Understand the reasons behind slower convergence with increase in Batch size. \n",
"- Learn techniques to make convergence faster."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Challenges with convergence\n",
"\n",
"We noticed in the previous notebook that the Distributed Training with 2 or more GPUs had higher throughput than the single GPU training, but when we focused on the convergence of our model, we found it to converge slower than the former.\n",
"*Quoted from the previous notebook* : If we take a closer look at the results, we can find even after 8 epochs in both cases, the run with a single GPU at the end of 8 epochs has a **loss of 0.0487 and accuracy of 0.9851** , comparing that with 8 GPU case, we find at the end of 8 epochs we have a **loss of 0.3582 and accuracy of 0.8957**.\n",
"\n",
"We were also able to notice the accuracy took a bigger drop during training, let us now try to understand why this happens and how we can use some techniques to solve them.\n",
"\n",
"## Concepts\n",
"\n",
"\n",
"### Impact of Batch size\n",
"In the paper **Measuring the Effects of Data Parallelism on Neural Network Training** by **Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, George E. Dahl** , we see a relationship between steps taken to convergence and the batch size. \n",
"\n",
"\n",
"
\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To understand this phenomenon,a term called as Critical batch size was coined in the paper **An Empirical Model of Large-Batch Training** by **Sam McCandlish, Jared Kaplan, Dario Amodei, OpenAI Dota Team**\n",
"\n",
"#### Critical batch size \n",
"\n",
"Critical batch size is when compute efficiency drops below 50% optimal and larger batch sizes yield diminishing returns. \n",
"\n",
"\n",
"It is found that we can approximately predict the maximum useful batch size by measuring gradient noise scaling, which is a simple statistic that quantifies the signal-to-noise ratio of the network gradients. Heuristically, the noise scale measures the variation in the data as seen by the model (at a given stage in training). When the noise scale is small, looking at a lot of data in parallel quickly becomes redundant, whereas when it is large, we can still learn a lot from huge batches of data.\n",
"\n",
"Below is the critical batch size of some popular networks.\n",
"\n",
"
\n",
"\n",
"\n",
"Let us now understand another problem that we face in large-batch training. \n",
"\n",
"### Impact on test and validation accuracy \n",
"\n",
"It is found that test/validation accuracy decreases with an increase in Batch size, this can be noticed in our previous notebook as well from the figure below, which is from the paper **ImageNet Training in Minutes** by **Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, Kurt Keutzer** \n",
"\n",
"
\n",
"\n",
"This lack of generalization ability is because large-batch methods tend to converge to sharp minimizers of the training function. These minimizers are characterized by a significant number of large positive eigenvalues in $\\nabla^2 f(x)$ , and tend to generalize less well. In contrast, small-batch methods converge to flat minimizers characterized by having numerous small eigenvalues of $\\nabla^2 f(x)$. We have observed that the loss function landscape of deep neural networks is such that large-batch methods are attracted to regions with sharp minimizers. Unlike small-batch methods, are unable to escape basins of attraction of these minimizers.\n",
"\n",
"
\n",
"\n",
"This convergences to sharp minima and the difference between the training and test function lead to a higher validation loss. \n",
"\n",
"Now that we've understood the challenges with regards to convergence, let us now see some techniques to improve the convergence rate."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Before that let us now use a different dataset , We will be using CIFAR-10, which is a slightly more complex compared to FMNIST as it has 3 channel per image ( RGB ) while FMNIST only has 1.**\n",
"\n",
"Let us start with benchmarking CIFAR-10 with single and 8 GPUs and compare the convergence using train accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 1 python3 ../source_code/N4/cifar_base.py --batch-size=512 2> /dev/null"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 8 --mpi-args=\"--oversubscribe\" python3 ../source_code/N4/cifar_base.py --batch-size=512 2> /dev/null"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Results\n",
"\n",
"Output of running the command on DGX-1 : \n",
"\n",
"\n",
"Single GPU : \n",
"\n",
"```bash\n",
"[1,0]:Epoch 11/12\n",
"[1,0]:97/97 [==============================] - 1s 14ms/step - loss: 0.8486 - accuracy: 0.7002\n",
"[1,0]:Epoch time : 1.3384358882904053\n",
"[1,0]:Images/sec: 37357.04\n",
"[1,0]:Epoch 12/12\n",
"[1,0]:97/97 [==============================] - 1s 14ms/step - loss: 0.8085 - accuracy: 0.7153\n",
"[1,0]:Epoch time : 1.3324604034423828\n",
"[1,0]:Images/sec: 37524.57\n",
"```\n",
"\n",
"Scaling with 8-GPUs : \n",
"\n",
"```bash\n",
"[1,0]:Epoch 11/12\n",
"[1,0]:12/12 [==============================] - 0s 19ms/step - loss: 1.5277 - accuracy: 0.4408\n",
"[1,0]:Epoch time : 0.23624348640441895\n",
"[1,0]:Images/sec: 211646.05\n",
"[1,0]:Epoch 12/12\n",
"[1,0]:12/12 [==============================] - 0s 18ms/step - loss: 1.5352 - accuracy: 0.4404\n",
"[1,0]:Epoch time : 0.22572755813598633\n",
"[1,0]:Images/sec: 221505.96\n",
"```\n",
"\n",
"Now that we have a baseline , let us now try to improve the convergence rate of the models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Techniques for faster convergence\n",
"\n",
"### Batch normalisation\n",
"\n",
"In the paper **Train longer, generalize better: closing the generalization gap in large batch training of neural networks** by **Elad Hoffer, Itay Hubara, Daniel Soudry** they introduced Ghost Batch Normalization to reduce generalization error.\n",
"\n",
"Batch Normalization is known to accelerate the training, increase the robustness of the neural network to different initialization schemes and improve generalization. Nonetheless, since it uses batch statistics, it is bounded to depend on the chosen batch size. We study this dependency and observe that by acquiring the statistics on small virtual (\"ghost\") batches instead of the real large batch, we can reduce the generalization error. This modification by itself reduces the generalization error substantially.\n",
"\n",
"This can be implemented by the Normalization of a smaller batch present in each GPU instead of the larger batch across all GPUs.\n",
"\n",
"Let us now implement Batch Norm for every GPU and test out that performance.\n",
"\n",
"This can be implemented by adding `BatchNormalisation` layers to the model using the Keras API \n",
"\n",
"```python3\n",
"X = BatchNormalization(axis=3)(X)\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 8 --mpi-args=\"--oversubscribe\" python3 ../source_code/N4/cifar_batch_norm.py --batch-size=512 2> /dev/null"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Result\n",
"\n",
"Output of running the command on DGX-1 : \n",
"\n",
"```bash\n",
"[1,0]:Epoch 11/12\n",
"[1,0]:12/12 [==============================] - 0s 22ms/step - loss: 1.0192 - accuracy: 0.6413\n",
"[1,0]:Epoch time : 0.2735874652862549\n",
"[1,0]:Images/sec: 182756.91\n",
"[1,0]:Epoch 12/12\n",
"[1,0]:12/12 [==============================] - 0s 24ms/step - loss: 1.0059 - accuracy: 0.6416\n",
"[1,0]:Epoch time : 0.2887394428253174\n",
"[1,0]:Images/sec: 173166.5\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Learning rate scaling\n",
"\n",
"In the research paper **Accurate, Large Minibatch SGD:Training ImageNet in 1 Hour** by **Facebook researchers** they introduced the Linear Scaling Rule, which states that **When the minibatch size is multiplied by k, multiply the learning rate by k.**\n",
"\n",
"This can be implemented by Horovod using the following line. \n",
"\n",
"```python \n",
"scaled_lr = 0.001 * hvd.size() # Scale Learning rate to number of GPUs \n",
"opt = tf.optimizers.Adam(scaler_lr)\n",
"opt = hvd.DistributedOptimizer(opt) # Wrap Optimizer with hvd.DistributedOptimizer()\n",
"```\n",
"\n",
"Now, let us try running the cell with a scaled learning rate. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 8 --mpi-args=\"--oversubscribe\" python3 ../source_code/N4/cifar_scalelr.py --batch-size=512 2> /dev/null"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Result \n",
"\n",
"Output of running the command on DGX-1 with 8 GPUs : \n",
"\n",
"\n",
"```bash\n",
"[1,0]:Epoch 11/12\n",
"[1,0]:12/12 [==============================] - 0s 25ms/step - loss: 1.2916 - accuracy: 0.5309\n",
"[1,0]:Epoch time : 0.3019692897796631\n",
"[1,0]:Images/sec: 165579.75\n",
"[1,0]:Epoch 12/12\n",
"[1,0]:12/12 [==============================] - 0s 24ms/step - loss: 1.2914 - accuracy: 0.5295\n",
"[1,0]:Epoch time : 0.2958521842956543\n",
"[1,0]:Images/sec: 169003.32\n",
"```\n",
"\n",
"We might notice that this time training accuracy did not converge , this is because the linear scaling rule breaks down when the network is changing rapidly, which commonly occurs in early stages of training. We find that this issue can be alleviated by a properly designed warmup. \n",
"\n",
"**Note** : You might get lucky and get it might converge , so try running the above cell multiple times to notice the effect of larger batch size."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Learning rate warmup\n",
"\n",
"Learning rate warmup gradually ramps up the learning rate from a small to a large value. This ramp avoids a sudden increase from a small learning rate to a large one, allowing healthy convergence at the start of training. In practice, with a large minibatch of size kn, we start from a learning rate of η and increment it by a constant amount at each iteration such that it reaches ηˆ = kη after 5 epochs. After the warmup phase, we go back to the original learning rate schedule. \n",
"\n",
"Learning rate warmup can be implemented in Horovod using the following callback.\n",
"\n",
"```python\n",
"# Horovod: using `lr = 1.0 * hvd.size()` from the very beginning leads to worse final accuracy.\n",
"# Scale the learning rate `lr = 1.0` ---> `lr = 1.0 * hvd.size()` during the first three epochs.\n",
"hvd.callbacks.LearningRateWarmupCallback(initial_lr=scaled_lr, warmup_epochs=3, verbose=1)\n",
"```\n",
"\n",
"Let us now implement this and run this."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 8 --mpi-args=\"--oversubscribe\" python3 ../source_code/N4/cifar_warmup.py --batch-size=512 2> /dev/null"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Result \n",
"\n",
"Output of running the command on DGX-1 : \n",
"\n",
"\n",
"```bash\n",
"[1,0]:Epoch 11/12\n",
"[1,0]:12/12 [==============================] - 0s 23ms/step - loss: 1.0463 - accuracy: 0.6343\n",
"[1,0]:Epoch time : 0.27834606170654297\n",
"[1,0]:Images/sec: 179632.5\n",
"[1,0]:Epoch 12/12\n",
"[1,0]:12/12 [==============================] - 0s 23ms/step - loss: 1.0108 - accuracy: 0.6450\n",
"[1,0]:Epoch time : 0.28165721893310547\n",
"[1,0]:Images/sec: 177520.75\n",
"```\n",
"\n",
"\n",
"In this specific case , we obtain the results close to Batch Normalisation , but with increase in the batch size to a much higher value ,we can notice the improvements provided by Learning rate scaling and warmup."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using Optimizers built for Exascale Deep learning\n",
"\n",
"#### The LAMB Optimizer\n",
"\n",
"A series of optimizers have been created to address this problem and allow for scaling to very large batch sizes and learning rates. One such optimizer is the LAMB optimizer. LAMB uses a layerwise adaptive large batch optimization to train with very little hyperparameter tuning. This optimizer enables the use of very large batch sizes without any degradation of performance.\n",
"\n",
"We can implement LAMB using the following lines.\n",
"\n",
"```python\n",
"# Import LAMB from Tensorflow Addons\n",
"from tensorflow_addons.optimizers import LAMB\n",
"\n",
"# Replace the Adam optimizer with LAMB:\n",
"opt = LAMB(learning_rate=scaled_lr)\n",
"```\n",
"\n",
"Let us now train using the LAMB optimizer and compare the results. \n",
"\n",
"You can learn more about the LAMB optimizer from its paper [here](https://arxiv.org/pdf/1904.00962.pdf)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!TF_CPP_MIN_LOG_LEVEL=3 horovodrun -np 8 --mpi-args=\"--oversubscribe\" python3 ../source_code/N4/cifar_lamb.py --batch-size=512 2> /dev/null"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Result \n",
"\n",
"Output of running the command on DGX-1 : \n",
"\n",
"```bash\n",
"[1,0]:Epoch 11/12\n",
"[1,0]:12/12 [==============================] - 0s 33ms/step - loss: 0.6223 - accuracy: 0.7829\n",
"[1,0]:Epoch time : 0.4006154537200928\n",
"[1,0]:Images/sec: 124807.97\n",
"[1,0]:Epoch 12/12\n",
"[1,0]:12/12 [==============================] - 0s 34ms/step - loss: 0.5630 - accuracy: 0.8092\n",
"[1,0]:Epoch time : 0.4174768924713135\n",
"[1,0]:Images/sec: 119767.11\n",
"```\n",
"\n",
"Let us now tabulate all the results together to understand the improvements in convergence time.\n",
"\n",
"|# of GPUs|Condition|Accuracy after 12 epochs|\n",
"|-|-|-|\n",
"|1|Single GPU|71%|\n",
"|8|Multi-GPU naive|44%|\n",
"|8|Batch Normalisation|64%|\n",
"|8|Learning rate scaling+warmup|65%|\n",
"|8|LAMB Optimizer|81%|\n",
"\n",
"\n",
"Now that we are aware of the concepts in Distributed deep learning training and techniques to use for faster convergence, Go to the next notebook to run the challenge notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n",
"\n",
"## Licensing\n",
"\n",
"This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"