This folder contains the material for the Practical Guide to Train Megatron-LM GPT Model with Your Own Language. There are two labs, each with a different focus.
Outlines of Lab 1: Megatron 101 in half a day, a walkthrough of Megatron-LM's default workflow.
Outlines of Lab 2: Customizing Megatron-LM's workflow to adjust to local language needs.
Important : This bootcamp is intended to be delivered by NVIDIA certified instructors and TAs. It is NOT meant for self-paced learners.
Note1 : The lecture presentations as well as the solutions to the challenges and mini-challenges will be delivered at the end of each lab.
Note2 : Multi-node Megatron-LM GPT3 training can be added as an additional lab depending on the availability of compute resources.
The two labs will take approximately 12 hours (including solving challenges and mini-challenges) to complete.
Although this bootcamp is designed to run on a computing cluster with NVIDIA SuperPOD Architecture, it is possible to run it in an environment where you have access to 2 x A100 40GB GPUs with NVLink/NVSwitch.
When docker pull and run are allowed and the GPUs are directly accessible to users in the environment:
git clone https://github.com/gpuhackathons-org/gpubootcamp.git &&
cd gpubootcamp
USR_PORT=
GPUS= # you only need two gpus
DIR=
With sudo privilege :
sudo docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=GPUS -p USR_PORT:USR_PORT -it --rm --ulimit memlock=-1 --ulimit stack=67108864 --cap-add=SYS_ADMIN -v DIR:/workspace nvcr.io/nvidia/pytorch:21.03-py3
Without sudo privilege but the user is added to the docker group :
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=GPUS -p USR_PORT:USR_PORT -it --rm --ulimit memlock=-1 --ulimit stack=67108864 -v DIR:/workspace nvcr.io/nvidia/pytorch:21.03-py3
jupyter lab --no-browser --port=USR_PORT --allow-root --NotebookApp.token=''
Then, open the jupyter notebook in browser: localhost:USR_PORT
Navigate to /gpubootcamp/ai/Megatron/English/Python/ and open the Start_Here.ipynb
notebook.
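As a rough illustration of how the placeholders above fit together, the sketch below fills them with hypothetical values (port 8888, GPU indices 0 and 1, the current directory as DIR) and only prints the resulting docker command instead of running it. The helper name print_docker_cmd is made up for this example, and the values are assumptions that should be replaced with your own.

```shell
#!/usr/bin/env bash
# Hypothetical example values; replace them with your own.
USR_PORT=8888   # assumed to be a free port on the host
GPUS=0,1        # assumed indices of the two GPUs to expose
DIR="$PWD"      # assumed to be the cloned gpubootcamp directory

# Dry run: print the non-sudo docker command with the placeholders filled in.
# Nothing is pulled or started by this function.
print_docker_cmd() {
  printf 'docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=%s -p %s:%s -it --rm --ulimit memlock=-1 --ulimit stack=67108864 -v %s:/workspace nvcr.io/nvidia/pytorch:21.03-py3\n' \
    "$GPUS" "$USR_PORT" "$USR_PORT" "$DIR"
}

print_docker_cmd
```

Printing the command first makes it easy to confirm that the same port appears on both sides of -p and that DIR points at the repository before the container is started.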
A User Guide is often provided when one requests access to a computing cluster with NVIDIA SuperPOD Architecture. However, each compute cluster might deviate slightly from the reference architecture on various levels: hardware and/or software, as well as the resource management setup.
The steps below will likely need to be adjusted, in which case the user will need to consult the cluster admin or cluster operator for help in debugging the environment so that the bootcamp materials can run.
git clone https://github.com/gpuhackathons-org/gpubootcamp.git
DIR_to_gpubootcamp=
USR_PORT= # communicate with the cluster admin to know which port is available to you as a user
HOST_PORT=
CLUSTER_NAME=
sudo singularity build pytorch_21.03.sif docker://nvcr.io/nvidia/pytorch:21.03-py3
Note1: If you do not have sudo rights, you might need to either contact the cluster admin or build this in another environment where you have sudo rights.
Note2: You should copy pytorch_21.03.sif to the cluster environment, one level above DIR_to_gpubootcamp.
srun --gres=gpu:2 --pty bash -i
export SINGULARITY_BINDPATH=DIR_to_gpubootcamp
singularity run --nv pytorch_21.03.sif jupyter lab --notebook-dir=DIR_to_gpubootcamp --port=USR_PORT --ip=0.0.0.0 --no-browser --NotebookApp.iopub_data_rate_limit=1.0e15 --NotebookApp.token=""
ssh -L localhost:HOST_PORT:machine_number:USR_PORT CLUSTER_NAME
Then, open the jupyter notebook in browser: localhost:HOST_PORT
Navigate to gpubootcamp/ai/Megatron/English/Python/ and open the Start_Here.ipynb
notebook.
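The port forwarding step above can be sketched with hypothetical values as follows. This example assumes that machine_number stands for the name of the compute node on which Jupyter is running (for instance, what hostname prints inside the srun session); the node name, ports, and login address below are invented for illustration, and the helper print_tunnel_cmd only prints the ssh command rather than opening a connection.

```shell
#!/usr/bin/env bash
# Hypothetical example values; ask your cluster admin for the real ones.
USR_PORT=8888                            # port Jupyter listens on, on the compute node
HOST_PORT=9999                           # port to open on your local machine
CLUSTER_NAME=user@login.cluster.example  # assumed login address of the cluster
NODE=node042                             # assumed compute node name (machine_number)

# Dry run: print the ssh port-forwarding command for a given compute node.
print_tunnel_cmd() {
  printf 'ssh -L localhost:%s:%s:%s %s\n' "$HOST_PORT" "$1" "$USR_PORT" "$CLUSTER_NAME"
}

print_tunnel_cmd "$NODE"
```

With these values, the notebook would be reached at localhost:9999 on the local machine while Jupyter itself listens on port 8888 of the compute node.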
Q. A "ResourceExhaustedError" is observed while running the labs.
A. The batch size and network model are currently set to consume 40GB of GPU memory. To use the labs without any modification, a GPU with at least 40GB of memory is recommended. Otherwise, reduce the batch size to lower the memory footprint. Also ensure that NVLink/NVSwitch is enabled in the environment, and do not enable MIG mode when requesting A100 GPUs as resources.
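Before starting the labs, the GPU requirements mentioned in this answer can be checked with a short script such as the sketch below. It is not part of the bootcamp material: the threshold of 40000 MiB and the helper name has_enough_memory are assumptions made for this example, and the mig.mode.current query field may not be available on older drivers.

```shell
#!/usr/bin/env bash
MIN_MEM_MIB=40000   # assumed threshold standing in for "40GB" of GPU memory

# Succeeds if the given memory size (in MiB) meets the assumed threshold.
has_enough_memory() {
  [ "$1" -ge "$MIN_MEM_MIB" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # One CSV line per GPU: index, name, total memory (MiB), current MIG mode.
  nvidia-smi --query-gpu=index,name,memory.total,mig.mode.current \
             --format=csv,noheader,nounits |
  while IFS=, read -r idx name mem mig; do
    mem="$(printf '%s' "$mem" | tr -d ' ')"
    if has_enough_memory "$mem"; then
      status="memory ok"
    else
      status="below ${MIN_MEM_MIB} MiB, consider a smaller batch size"
    fi
    echo "GPU ${idx} (${name# }): ${mem} MiB, MIG ${mig# }: ${status}"
  done
  # Print the GPU topology matrix; NV# entries indicate NVLink connections.
  nvidia-smi topo -m
else
  echo "nvidia-smi not found; run this inside the GPU environment"
fi
```

If the MIG column shows Enabled, or the topology matrix shows no NV# entries between the two GPUs, the environment probably differs from what the labs assume.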