|
@@ -2,34 +2,33 @@
|
|
"cells": [
|
|
"cells": [
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "effective-university",
|
|
|
|
|
|
+ "id": "rising-software",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"## Scale up model size\n",
|
|
"## Scale up model size\n",
|
|
"---\n",
|
|
"---\n",
|
|
- "In previous notebooks, we downloaded and extracted our own Swedish raw text; practiced filter, clean and deduplicate the raw text data ; trained our own GPTBPETokenizer and fitted to the raw Swedish text ; converted the raw text to mmap format integrating a custom sentence-splitter.\n",
|
|
|
|
|
|
+ "In previous notebooks, we downloaded and extracted our own Swedish raw text with `Lab2-1_acquiring_data.ipynb`; practiced filtering, cleaning and deduplicating the raw text data with `Lab2-2_SentenceBoundary_and_Deduplicate.ipynb`; trained our own GPTBPETokenizer and fitted it to the raw Swedish text with `Lab2-3_train_own_GPT2BPETokenizer.ipynb`; and converted the raw text to mmap format with a custom sentence-splitter in `Lab2-4_customize_process2mmap.ipynb`.\n",
|
|
"\n",
|
|
"\n",
|
|
- "Now that we have learned the components to customize the Megatron-LM's workflow according to specific langauge needs ( in this case, it is Swedish). The next step is to train the Megatron-LM GPT model with the Swedish data. \n",
|
|
|
|
|
|
+ "We have now learned all the essential components needed to customize Megatron-LM's default workflow to accommodate specific language needs (in this case, Swedish). The next step is to train the Megatron-LM GPT model with the processed Swedish data.\n",
|
|
"\n",
|
|
"\n",
|
|
- "However, constraint by how much compute resources you get, i.e the number of GPUs available for the training job, there is an upper limit of how big a model you can train.\n",
|
|
|
|
|
|
+ "However, we are constrained by the amount of compute resources available, that is, the number of GPUs allocated to the training job, which puts an upper limit on how big a model we can train.\n",
|
|
"\n",
|
|
"\n",
|
|
- "Let's test this out by presenting a Challenge. \n",
|
|
|
|
|
|
+ "We will test out how big a model we can train with 2 x A100 40GB GPUs by presenting a Challenge!\n",
|
|
"\n",
|
|
"\n",
|
|
"## **Challenge ** - Go big or go home !\n",
|
|
"## **Challenge ** - Go big or go home !\n",
|
|
"\n",
|
|
"\n",
|
|
"- Constraints : \n",
|
|
"- Constraints : \n",
|
|
" - 2 x A100 GPUs 40G is allocated for this challenge.\n",
|
|
" - 2 x A100 GPUs 40G is allocated for this challenge.\n",
|
|
- " - Only the parameters in the **modifiable blocks** are allowed to be changed.\n",
|
|
|
|
|
|
+ " - Only the parameters inside the **##### Begin/End of modifiable block #####** markers are allowed to be changed.\n",
|
|
" - Avoid OOM !\n",
|
|
" - Avoid OOM !\n",
|
|
" - Training run must be finished and checkpoint must be saved successfully.\n",
|
|
" - Training run must be finished and checkpoint must be saved successfully.\n",
|
|
"\n",
|
|
"\n",
|
|
- "\n",
|
|
|
|
"- Task : \n",
|
|
"- Task : \n",
|
|
- " given the above prerequisites, train as BIG a GPT model as possible.\n",
|
|
|
|
|
|
+ " Given the above constraints, train as BIG a GPT model as possible.\n",
|
|
"\n",
|
|
"\n",
|
|
- "- Winning criteria : the biggest model wins given the above constraints.\n",
|
|
|
|
|
|
+ "- Winning criteria : The biggest model wins given the above constraints.\n",
|
|
"\n",
|
|
"\n",
|
|
- "Note 1: Post the parameters you changed into the **modifiable blocks** on slack channels for verification.\n",
|
|
|
|
|
|
+ "Note 1: Post the parameters you changed inside the **##### Begin/End of modifiable block #####** markers to the bootcamp's Slack channel for verification.\n",
|
|
"\n",
|
|
"\n",
|
|
"Note 2: We purposefully turned-off nsys profiling in this challenge, because calling nsys profiling will introduce a small overhead, which will impact the maximum achievable model size.\n",
|
|
"Note 2: We purposefully turned-off nsys profiling in this challenge, because calling nsys profiling will introduce a small overhead, which will impact the maximum achievable model size.\n",
|
|
"\n",
|
|
"\n",
|
|
@@ -38,19 +37,17 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "incorporate-bidding",
|
|
|
|
|
|
+ "id": "historic-eating",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
- "---\n",
|
|
|
|
- "# Hint :\n",
|
|
|
|
- "### call out a terminal and type in **nvidia-smi** to monitor the GPUs' utils and power consumption \n",
|
|
|
|
- "### remember to fill up the GPU memory\n",
|
|
|
|
- ""
|
|
|
|
|
|
+ "\n",
|
|
|
|
+ "**Hint**:\n",
|
|
|
|
+ "Use the knowledge gained from `Lab1-6_Observe_GPT_runs_vs_performance.ipynb`, especially the section with the video demonstrating how to profile a training run while it is in progress.\n",
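+ "\n",
+ "For example (a minimal sketch, not part of the lab scripts), you could keep an eye on GPU memory and utilization from a terminal while the training job is running:\n",
+ "\n",
+ "```bash\n",
+ "# poll GPU memory and utilization once per second during the training run\n",
+ "nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv -l 1\n",
+ "```"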
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "established-substitute",
|
|
|
|
|
|
+ "id": "cleared-toolbox",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"Modify and rerun the code blocks below to obtain a even bigger GPT model. \n",
|
|
"Modify and rerun the code blocks below to obtain a even bigger GPT model. \n",
|
|
@@ -62,7 +59,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "central-scheduling",
|
|
|
|
|
|
+ "id": "large-buying",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"<a id=\"MODIFY_CELL\"></a>"
|
|
"<a id=\"MODIFY_CELL\"></a>"
|
|
@@ -70,7 +67,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "handy-process",
|
|
|
|
|
|
+ "id": "approved-beatles",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"Always clean the checkpoint folder to ensure trainining start from scratch."
|
|
"Always clean the checkpoint folder to ensure trainining start from scratch."
|
|
@@ -79,7 +76,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"execution_count": 1,
|
|
- "id": "human-privacy",
|
|
|
|
|
|
+ "id": "attended-vault",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -89,7 +86,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"execution_count": 2,
|
|
- "id": "chief-latter",
|
|
|
|
|
|
+ "id": "engaging-ocean",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [
|
|
"outputs": [
|
|
{
|
|
{
|
|
@@ -117,7 +114,7 @@
|
|
"MERGE_FILE='../dataset/SV/56k/merges.txt'\n",
|
|
"MERGE_FILE='../dataset/SV/56k/merges.txt'\n",
|
|
"PROFILE_OUTPUT_PATH='../profiles/SV/nsys_sv_' # modify this to your own profile path\n",
|
|
"PROFILE_OUTPUT_PATH='../profiles/SV/nsys_sv_' # modify this to your own profile path\n",
|
|
"\n",
|
|
"\n",
|
|
- "#### [TODO]--------------- Begin of modifiable block -----------#### \n",
|
|
|
|
|
|
+ "# -------------------- ##### Begin of modifiable block ##### -------------------- \n",
|
|
"\n",
|
|
"\n",
|
|
"TENSOR_MP_SIZE=<FILL_IN>\n",
|
|
"TENSOR_MP_SIZE=<FILL_IN>\n",
|
|
"PIPELINE_MP_SIZE=<FILL_IN>\n",
|
|
"PIPELINE_MP_SIZE=<FILL_IN>\n",
|
|
@@ -129,7 +126,7 @@
|
|
"SEQ_LEN=<FILL_IN>\n",
|
|
"SEQ_LEN=<FILL_IN>\n",
|
|
"MAX_POS_EM=<FILL_IN>\n",
|
|
"MAX_POS_EM=<FILL_IN>\n",
|
|
"\n",
|
|
"\n",
|
|
- "#### -------------------- end of modifiable blocks ------------------------#### \n",
|
|
|
|
|
|
+ "# -------------------- ##### End of modifiable blocks ##### ------------------------ \n",
|
|
"\n",
|
|
"\n",
|
|
"################## DO NOT modify anything below this line ##################\n",
|
|
"################## DO NOT modify anything below this line ##################\n",
|
|
"export OMP_NUM_THREADS=1\n",
|
|
"export OMP_NUM_THREADS=1\n",
|
|
@@ -173,18 +170,22 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "proprietary-elizabeth",
|
|
|
|
|
|
+ "id": "determined-cliff",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"Check how big is your model. By modify the parameters in the [params_cnt.sh](./params_cnt.sh)\n",
|
|
"Check how big is your model. By modify the parameters in the [params_cnt.sh](./params_cnt.sh)\n",
|
|
"\n",
|
|
"\n",
|
|
- "I got 6.6 Billion :) what about you ?"
|
|
|
|
|
|
+ "I got 6.6 billion parameters :) what about you?\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "Modify the [params count](./params_cnt.sh) according to your training configuration.\n",
|
|
|
|
+ "\n",
|
|
|
|
+ "After modification, run the bash script below to obtain the model size.\n",
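+ "\n",
+ "As a rough cross-check of `params_cnt.sh` (a minimal sketch only; the LAYERS, HIDDEN, VOCAB and SEQ values below are placeholders, not the challenge answer), you can also evaluate the approximate GPT parameter-count formula from the Efficient Large-Scale Language Model Training on GPU Clusters paper directly:\n",
+ "\n",
+ "```bash\n",
+ "# approximate GPT parameter count: 12*l*h^2 * (1 + 13/(12h) + (V+s)/(12*l*h))\n",
+ "LAYERS=32 HIDDEN=4096 VOCAB=56000 SEQ=2048   # placeholder values, substitute your own config\n",
+ "awk -v l=$LAYERS -v h=$HIDDEN -v V=$VOCAB -v s=$SEQ \\\n",
+ "    'BEGIN { printf \"%.0f parameters\\n\", 12*l*h^2*(1 + 13/(12*h) + (V+s)/(12*l*h)) }'\n",
+ "```"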
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"execution_count": null,
|
|
- "id": "beginning-homework",
|
|
|
|
|
|
+ "id": "green-magic",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -193,25 +194,25 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "radio-secretariat",
|
|
|
|
|
|
+ "id": "awful-candle",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"Below is an example of expected outputs:\n",
|
|
"Below is an example of expected outputs:\n",
|
|
" \n",
|
|
" \n",
|
|
- " 6\n",
|
|
|
|
- " 6675628032\n"
|
|
|
|
|
|
+ "    6 <-- you may get a different number depending on your training config\n",
|
|
|
|
+ "    6675628032 <-- you may get a different number depending on your training config\n"
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "fuzzy-assault",
|
|
|
|
|
|
+ "id": "great-league",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"Re-run this cell below to get an even bigger GPT model\n",
|
|
"Re-run this cell below to get an even bigger GPT model\n",
|
|
"\n",
|
|
"\n",
|
|
"Remember to modify the [params count](./params_cnt.sh) to check how big is your model.\n",
|
|
"Remember to modify the [params count](./params_cnt.sh) to check how big is your model.\n",
|
|
"\n",
|
|
"\n",
|
|
- "Jump back and mdify the SV_GPT_goingBIG.sh, click here to \n",
|
|
|
|
|
|
+ "Jump back and edit SV_GPT_goingBIG.sh using the link below: \n",
|
|
"<a href=\"./Lab2-5_run_Megatron_with_varying_config.ipynb#MODIFY_CELL\">Jump back to modify and overwrite SV_GPT_goingBIG.sh </a> \n",
|
|
"<a href=\"./Lab2-5_run_Megatron_with_varying_config.ipynb#MODIFY_CELL\">Jump back to modify and overwrite SV_GPT_goingBIG.sh </a> \n",
|
|
"<a id=\"Rerun_Cell\"></a>"
|
|
"<a id=\"Rerun_Cell\"></a>"
|
|
]
|
|
]
|
|
@@ -219,7 +220,7 @@
|
|
{
|
|
{
|
|
"cell_type": "code",
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"execution_count": null,
|
|
- "id": "rental-deputy",
|
|
|
|
|
|
+ "id": "italian-karma",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"outputs": [],
|
|
"source": [
|
|
"source": [
|
|
@@ -228,7 +229,7 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "korean-republic",
|
|
|
|
|
|
+ "id": "outstanding-application",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"Below is an example of expected outputs:\n",
|
|
"Below is an example of expected outputs:\n",
|
|
@@ -251,31 +252,27 @@
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "official-concept",
|
|
|
|
|
|
+ "id": "blessed-grammar",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
- "--- \n",
|
|
|
|
- "\n",
|
|
|
|
- "## Additional Resources\n",
|
|
|
|
- "\n",
|
|
|
|
- "Language Models are Few-Shot Learners : https://arxiv.org/pdf/2005.14165.pdf\n",
|
|
|
|
|
|
+ "---\n",
|
|
"\n",
|
|
"\n",
|
|
- "Efficient Large-Scale Language Model Training on GPU Clusters : https://arxiv.org/pdf/2104.04473.pdf"
|
|
|
|
|
|
+ "## Links and Resources\n",
|
|
|
|
+ "Don't forget to read [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165.pdf) and [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf)."
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "laden-sender",
|
|
|
|
|
|
+ "id": "velvet-nylon",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
- "---\n",
|
|
|
|
- "\n",
|
|
|
|
- "## Congratulations on completing the mission !\n"
|
|
|
|
|
|
+ "-----\n",
|
|
|
|
+ "## <p style=\"text-align:center;border:3px; padding: 1em\"> <a href=../Start_Here.ipynb>HOME</a></p>"
|
|
]
|
|
]
|
|
},
|
|
},
|
|
{
|
|
{
|
|
"cell_type": "markdown",
|
|
"cell_type": "markdown",
|
|
- "id": "premium-treasury",
|
|
|
|
|
|
+ "id": "framed-blood",
|
|
"metadata": {},
|
|
"metadata": {},
|
|
"source": [
|
|
"source": [
|
|
"-----\n",
|
|
"-----\n",
|