# Megatron GPT Bootcamp

## Learning objectives

The objective of this bootcamp is twofold. First, it walks you once through the default Megatron workflow so that you become familiar with how Megatron works. After that, we focus on the specific needs of a local language, in this case Swedish. We give recommendations that you can optionally apply to your own workflow, and we include some practical scripts to help you kick-start training your own local-language Megatron GPT2/3 models.


* Standard: Python
* Frameworks: PyTorch + Megatron-LM

The bootcamp requires more than one GPU. We recommend using a [DGX](https://www.nvidia.com/en-in/data-center/dgx-systems/)-like cluster with [NVLink / NVSwitch](https://www.nvidia.com/en-in/data-center/nvlink/) support.

Let's start with testing the GPUs you are running the code on in this bootcamp.

---
## Check how many GPUs you have and their memory capacity

The output should look similar to the following:

            Wed Aug 25 07:03:55 2021       
        +-----------------------------------------------------------------------------+
        | NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.2     |
        |-------------------------------+----------------------+----------------------+
        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
        | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
        |                               |                      |               MIG M. |
        |===============================+======================+======================|
        |   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
        | N/A   34C    P0    57W / 300W |      0MiB / 16160MiB |      0%      Default |
        |                               |                      |                  N/A |
        +-------------------------------+----------------------+----------------------+
        |   1  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
        | N/A   30C    P0    41W / 300W |      0MiB / 16160MiB |      0%      Default |
        |                               |                      |                  N/A |
        +-------------------------------+----------------------+----------------------+
        +-----------------------------------------------------------------------------+
        | Processes:                                                                  |
        |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
        |        ID   ID                                                   Usage      |
        |=============================================================================|
        |  No running processes found                                                 |
        +-----------------------------------------------------------------------------+


Run the following command in a notebook cell to produce this output:

    !nvidia-smi

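If you prefer to check the GPU count and memory capacity programmatically, the sketch below parses the CSV form of the same information. It assumes the query `nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader,nounits`; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical, and the sample string is modelled on the two-GPU output shown above.

```python
import subprocess

# nvidia-smi query that prints one CSV row per GPU: index, name, total memory in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse rows of 'index, name, memory.total' into a list of dicts.

    Assumes the GPU name itself contains no comma.
    """
    gpus = []
    for line in text.strip().splitlines():
        index, name, mem_mib = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name, "memory_mib": int(mem_mib)})
    return gpus


def query_gpus():
    """Run nvidia-smi and return the parsed GPU list (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Sample rows modelled on the two-GPU machine shown above, so the parser can be
# exercised without a GPU present.
sample = "0, Tesla V100-SXM2-16GB, 16160\n1, Tesla V100-SXM2-16GB, 16160\n"
gpus = parse_gpu_csv(sample)
print(len(gpus), "GPUs found; memory (MiB):", [g["memory_mib"] for g in gpus])
```

On a real machine, call `query_gpus()` instead of parsing the sample, and check that it returns more than one entry before continuing.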

---
## Verify that NVLink is active
The output should look similar to the following:

        GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-b29deceb-3745-51d2-2cf3-807ea8ac8e60)
             Link 0: 25.781 GB/s
             Link 1: 25.781 GB/s
             Link 2: 25.781 GB/s
             Link 3: 25.781 GB/s
             Link 4: 25.781 GB/s
             Link 5: 25.781 GB/s
        GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-4de46420-3e95-182f-c0c3-d488dda562d8)
             Link 0: 25.781 GB/s
             Link 1: 25.781 GB/s
             Link 2: 25.781 GB/s
             Link 3: 25.781 GB/s
             Link 4: 25.781 GB/s
             Link 5: 25.781 GB/s

Run the following command to verify the NVLink status:

    !nvidia-smi nvlink --status

Output from a machine with four GPUs:

    GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-b29deceb-3745-51d2-2cf3-807ea8ac8e60)
         Link 0: 25.781 GB/s
         Link 1: 25.781 GB/s
         Link 2: 25.781 GB/s
         Link 3: 25.781 GB/s
         Link 4: 25.781 GB/s
         Link 5: 25.781 GB/s
    GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-4de46420-3e95-182f-c0c3-d488dda562d8)
         Link 0: 25.781 GB/s
         Link 1: 25.781 GB/s
         Link 2: 25.781 GB/s
         Link 3: 25.781 GB/s
         Link 4: 25.781 GB/s
         Link 5: 25.781 GB/s
    GPU 2: Tesla V100-SXM2-16GB (UUID: GPU-8e9b4e82-ac7f-c189-cc17-045a3585def2)
         Link 0: 25.781 GB/s
         Link 1: 25.781 GB/s
         Link 2: 25.781 GB/s
         Link 3: 25.781 GB/s
         Link 4: 25.781 GB/s
         Link 5: 25.781 GB/s
    GPU 3: Tesla V100-SXM2-16GB (UUID: GPU-a3d96d2e-c606-b23f-e9e0-59a3a507fc10)
         Link 0: 25.781 GB/s
         Link 1: 25.781 GB/s
         Link 2: 25.781 GB/s
         Link 3: 25.781 GB/s
         Link 4: 25.781 GB/s
         Link 5: 25.781 GB/s
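To check the link status without reading the listing by eye, a small parser can count the links that report a speed for each GPU. This is a minimal sketch that assumes the text layout shown above; `parse_nvlink_status` is a hypothetical helper name, and lines that do not report a `GB/s` value are simply ignored.

```python
import re


def parse_nvlink_status(text):
    """Map each GPU index to the list of link speeds (GB/s) that are reported."""
    links = {}
    current = None
    for line in text.splitlines():
        gpu = re.match(r"\s*GPU (\d+):", line)
        link = re.match(r"\s*Link \d+: ([\d.]+) GB/s", line)
        if gpu:
            # A new "GPU n:" header starts a fresh list of links.
            current = int(gpu.group(1))
            links[current] = []
        elif link and current is not None:
            links[current].append(float(link.group(1)))
    return links


# Sample text rebuilt from the first two GPUs in the listing above.
sample = "\n".join(
    ["GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-b29deceb-3745-51d2-2cf3-807ea8ac8e60)"]
    + [f"\t Link {i}: 25.781 GB/s" for i in range(6)]
    + ["GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-4de46420-3e95-182f-c0c3-d488dda562d8)"]
    + [f"\t Link {i}: 25.781 GB/s" for i in range(6)]
)

links = parse_nvlink_status(sample)
for gpu_index, speeds in links.items():
    print(f"GPU {gpu_index}: {len(speeds)} links reporting a speed")
```

In a notebook, the text to parse could come from capturing the output of `nvidia-smi nvlink --status`; a GPU with an empty list would indicate that no link reported a speed.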


---
## Verify profiling capability
The output should look similar to the following.
Note that every environment check should pass (report `OK` or `Available`).

            Sampling Environment Check
            Linux Kernel Paranoid Level = 1: OK
            Linux Distribution = Ubuntu
            Linux Kernel Version = 4.15.0-112-generic: OK
            Linux perf_event_open syscall available: OK
            Sampling trigger event available: OK
            Intel(c) Last Branch Record support: Available
            Sampling Environment: OK

Run the following command to verify the profiling capability:

    !nsys status -e


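The pass/fail rule above (every check reports `OK` or `Available`) can be applied automatically to the captured text. The sketch below assumes the line layout shown above; `failed_checks` is a hypothetical helper, and the failing line in `bad_sample` is invented for illustration and is not real `nsys` output.

```python
def failed_checks(text):
    """Return the lines whose status is neither OK nor Available."""
    failed = []
    for line in text.strip().splitlines():
        if ":" not in line:
            # Header or informational lines, such as the distribution name.
            continue
        status = line.rsplit(":", 1)[1].strip()
        if status not in ("OK", "Available"):
            failed.append(line.strip())
    return failed


# Sample copied from the expected output above.
sample = """Sampling Environment Check
Linux Kernel Paranoid Level = 1: OK
Linux Distribution = Ubuntu
Linux Kernel Version = 4.15.0-112-generic: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Available
Sampling Environment: OK
"""

# Hypothetical failing line, only to show what the helper reports.
bad_sample = sample.replace("Paranoid Level = 1: OK", "Paranoid Level = 3: Fail")

print("failed checks:", failed_checks(sample))
print("failed checks:", failed_checks(bad_sample))
```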


---
## Make placeholder folders for the datasets

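    import os

    # Create the placeholder folders for the English (EN) and Swedish (SV) datasets.
    # exist_ok=True makes the cell safe to re-run.
    os.makedirs('./dataset/EN/32k', exist_ok=True)
    os.makedirs('./dataset/EN/50k', exist_ok=True)
    os.makedirs('./dataset/SV/32k', exist_ok=True)
    os.makedirs('./dataset/SV/56k', exist_ok=True)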
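If you later add more languages or subfolders, the same layout can be driven from a small table instead of one call per folder. This is only a sketch: `LAYOUT` and `make_dataset_dirs` are hypothetical names, and the example runs in a temporary directory so that it does not touch your working folder.

```python
import tempfile
from pathlib import Path

# Language code -> subfolder names, mirroring the folders created above.
LAYOUT = {"EN": ["32k", "50k"], "SV": ["32k", "56k"]}


def make_dataset_dirs(root, layout=LAYOUT):
    """Create <root>/dataset/<language>/<subfolder> for every entry in the layout."""
    created = []
    for lang, subfolders in layout.items():
        for sub in subfolders:
            path = Path(root) / "dataset" / lang / sub
            path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
            created.append(path)
    return created


# Demonstrate in a throwaway directory; pass "." to create the real folders.
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_dataset_dirs(tmp)
    all_exist = all(d.is_dir() for d in dirs)
    names = [d.relative_to(tmp).as_posix() for d in dirs]

print(names)
```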

---
## Create your own data: web crawling
Please go through [this notebook](./jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-Website_scrapping.ipynb) to scrape data from the NVIDIA blog.

### Tutorial Outline

The following contents will be covered during the Bootcamp :

- [**Introduction and outline of Day 2**](./jupyter_notebook/Day2_0_intro.ipynb)
    Megatron 101 in half a day
    - [Estimate the hours/days needed to execute one end-to-end run per Megatron configuration](./jupyter_notebook/Day2-1_EstimateComputeDaysNeeded.ipynb)
    - [Understanding the core of Megatron: mpu](./jupyter_notebook/Day2-2_MegatronFundementals.ipynb)
    - [About GPT's tokenizer](./jupyter_notebook/Day2-3_GPT_vocab_merge_files.ipynb)
    - [Jsonify the data and convert it to mmap format](./jupyter_notebook/Day2-4_jsonfy_and_process2mmap.ipynb)
    - [Megatron runs vs. config](./jupyter_notebook/Day2-5_Observe_GPT_runs_vs_performance.ipynb)
    - [Challenge: the best profiler](./jupyter_notebook/Day2-5_Observe_GPT_runs_vs_performance.ipynb#TheChallenge)

- [**Day 3 outline**](./jupyter_notebook/Day3-0_overview.ipynb)
    Getting started on training your own Megatron GPT models!
    - [Fetch and extract Swedish data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-1_acquiring_data.ipynb)
    - [Find sentence boundaries and deduplicate your data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-2_SentenceBoundary_and_Deduplicate.ipynb)
        - [Mini challenge: approaching ground truth](./jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-2_SentenceBoundary_and_Deduplicate.ipynb#TheChallenge)
    - [Train your own GPTBPE tokenizer on your own data](./jupyter_notebook/Day3-3_train_own_GPT2BPETokenizer.ipynb)
    - [Customize the data preprocessing Python script and convert to mmap](./jupyter_notebook/Day3-4_customize_process2mmap.ipynb)
    - [The challenge: go big or go home!](./jupyter_notebook/Day3-5_run_Megatron_with_varying_config.ipynb)



### Tutorial Duration
The lab material will be presented in an 8-hour session. A link to download the material is available at the end of the bootcamp.

### Content Level
Intermediate, Advanced

### Target Audience and Prerequisites
The target audience for this lab is researchers, graduate students, and developers who are interested in scaling their deep learning systems to multiple GPUs to accelerate their scientific applications.

A basic understanding of deep learning is required. If you are new to deep learning, we recommend going through the [Distributed_Deep_Learning bootcamp](https://github.com/gpuhackathons-org/gpubootcamp/tree/master/ai/Distributed_Deep_Learning/English/python) first.
 
**Disclaimer**: All the results mentioned in the notebooks were tested on a *DGX-1 machine equipped with 2, 4, or 8 Tesla V100 GPUs connected via NVLink*. Results will vary on different hardware and also depend on the interconnect bandwidth and the thermal conditions of the machine.

--- 

## Licensing

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).