Browse Source

enforce coherency between notebooks

zenodia 2 years ago
parent
commit
e8bb0b9265
17 changed files with 460 additions and 3153 deletions
  1. 0 23
      ai/Megatron/Dockerfile
  2. 4 5
      ai/Megatron/English/Python/Start_Here.ipynb
  3. 23 15
      ai/Megatron/English/Python/jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb
  4. 42 43
      ai/Megatron/English/Python/jupyter_notebook/Lab1-3_MegatronFundementals.ipynb
  5. 36 34
      ai/Megatron/English/Python/jupyter_notebook/Lab1-4_GPT_vocab_merge_files.ipynb
  6. 21 18
      ai/Megatron/English/Python/jupyter_notebook/Lab1-5_jsonfy_and_process2mmap.ipynb
  7. 38 37
      ai/Megatron/English/Python/jupyter_notebook/Lab1-6_Observe_GPT_runs_vs_performance.ipynb
  8. 19 19
      ai/Megatron/English/Python/jupyter_notebook/Lab2-3_train_own_GPT2BPETokenizer.ipynb
  9. 40 37
      ai/Megatron/English/Python/jupyter_notebook/Lab2-4_customize_process2mmap.ipynb
  10. 42 45
      ai/Megatron/English/Python/jupyter_notebook/Lab2-5_run_Megatron_with_varying_config.ipynb
  11. 0 0
      ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/HandCrafted_Duplicates.csv
  12. 24 24
      ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb
  13. 22 14
      ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-1_acquiring_data.ipynb
  14. 142 124
      ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-2_SentenceBoundary_and_Deduplicate.ipynb
  15. 0 2711
      ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/custom_english_non_breaking_prefixes.txt
  16. 1 1
      ai/Megatron/English/Python/source_code/Day1-runMegatron-LM_GPT_template.sh
  17. 6 3
      ai/Megatron/README.md

+ 0 - 23
ai/Megatron/Dockerfile

@@ -1,23 +0,0 @@
-# Copyright (c) 2020 NVIDIA Corporation.  All rights reserved.
-
-# To build the docker container, run: $ sudo docker build -t ai-multi-gpu:latest .
-# To run: $ sudo docker run --rm -it --gpus=all -p 8888:8888 -p 8000:8000 ai-multi-gpu:latest
-# Finally, open http://127.0.0.1:8888/
-
-# Select Base Image 
-FROM nvcr.io/nvidia/pytorch:21.03-py3
-# Update the repo
-RUN apt-get update -y
-
-# Install required python packages
-RUN pip3 install tokenizers transformers ipywidgets widgetsnbextension  
-RUN jupyter nbextension enable --py widgetsnbextension
-RUN pip3 install nvtx ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 htmlmin tldextract sentence-splitter
-
-##### TODO - From the Final Repo Changing this 
-
-# TO COPY the data 
-COPY English/ /workspace/
-
-## Uncomment this line to run Jupyter notebook by default
-CMD jupyter-lab --no-browser --allow-root --ip=0.0.0.0 --port=8888 --NotebookApp.token="" --notebook-dir=/workspace/python/

+ 4 - 5
ai/Megatron/English/Python/Start_Here.ipynb

@@ -33,7 +33,7 @@
     "\n",
     "It is required to have more than one GPU for this bootcamp.\n",
     "\n",
-    "This bootcamp is tested on 2 x A100 GPUS with 40G memory. One should also have [NVLink / NVSwitch](https://www.nvidia.com/en-in/data-center/nvlink/)."
+    "This bootcamp is tested on 2 x A100 GPUs with 40G memory. One should also have [NVLink / NVSwitch](https://www.nvidia.com/en-in/data-center/nvlink/)."
    ]
   },
   {
@@ -190,10 +190,10 @@
     "- **Outlines of Lab 1**\n",
     "    Megatron 101 in half a day - Please go through the below notebooks sequentially.\n",
     "    1. [WebCrawling to obtain raw text data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb)\n",
-    "    2. [Estimate hours/days needed to execute one end-to-end run per Megatron-LM configuration](./jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb)\n",
-    "    3. [Understanding the core of Megatron-LM - mpu ](./jupyter_notebook/Lab1-3_MegatronFundementals.ipynb)\n",
+    "    2. [Estimate hours/days needed to execute one end-to-end run per Megatron-LM's configuration](./jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb)\n",
+    "    3. [Understanding the core of Megatron-LM - MPU ](./jupyter_notebook/Lab1-3_MegatronFundementals.ipynb)\n",
     "    4. [About GPT's tokenizer](./jupyter_notebook/Lab1-4_GPT_vocab_merge_files.ipynb)\n",
-    "    5. [jsonfy and convert to mmap format](./jupyter_notebook/Lab1-5_jsonfy_and_process2mmap.ipynb)\n",
+    "    5. [Jsonfy and convert to mmap format](./jupyter_notebook/Lab1-5_jsonfy_and_process2mmap.ipynb)\n",
     "    6. [Megatron runs vs config](./jupyter_notebook/Lab1-6_Observe_GPT_runs_vs_performance.ipynb)\n",
     "\n",
     "\n",
@@ -201,7 +201,6 @@
     "    Getting started on training own language Megatron-LM GPT models -- Please go through the below notebooks sequentially.\n",
     "    1. [Fetch and extract Swedish data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-1_acquiring_data.ipynb)\n",
     "    2. [Find sentence boundary and deduplicate your data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-2_SentenceBoundary_and_Deduplicate.ipynb)\n",
-    "        - [mini challenge - approaching groundtruth](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-2_SentenceBoundary_and_Deduplicate.ipynb#TheChallenge)\n",
     "    3. [Train your own GPTBPE Tokenizer on your own data ](./jupyter_notebook/Lab2-3_train_own_GPT2BPETokenizer.ipynb)\n",
     "    4. [customize preprocess data python script and convert to mmap](./jupyter_notebook/Lab2-4_customize_process2mmap.ipynb)\n",
     "    5. [The Challenge - Go Big or go home!](./jupyter_notebook/Lab2-5_run_Megatron_with_varying_config.ipynb)\n",

+ 23 - 15
ai/Megatron/English/Python/jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "anticipated-allocation",
+   "id": "emerging-victoria",
    "metadata": {},
    "source": [
     "# Estimate Time\n",
@@ -38,7 +38,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "czech-cover",
+   "id": "convinced-wales",
    "metadata": {},
    "source": [
     "---\n",
@@ -46,13 +46,13 @@
     "\n",
     "<left><img src=\"./Megatron-LM/pics/TrainingTimeEstimate.JPG\" width=\"500\"/></left>\n",
     "\n",
-    "Two scenarios were extracted from the above paper (screenshot above) : [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf) \n",
+    "Two scenarios were extracted from the screenshot of the paper : [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf) \n",
     "\n",
     "**Scenario 1** -\n",
     "\n",
-    "T = 300Billion tokens # assumed data size measured in tokens\n",
+    "T = 300 billion tokens # assumed data size measured in tokens\n",
     "\n",
-    "P = 175 Billion GPT3 model\n",
+    "P = 175 billion GPT3 model\n",
     "\n",
     "n = 1024 GPUs\n",
     "\n",
@@ -65,11 +65,11 @@
     "\n",
     "**Scenario 2** - \n",
     "\n",
-    "T =  450 Billion tokens  \n",
+    "T =  450 billion tokens  \n",
     "\n",
-    "P = 1 Trillion parameters GPT 3 model\n",
+    "P = 1 trillion parameters GPT 3 model\n",
     "\n",
-    "n = 3072 \n",
+    "n = 3072 GPUs\n",
     "\n",
     "x = 163 teraFLOP/s per GPU \n",
     "\n",
@@ -79,9 +79,17 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "endangered-colleague",
+   "metadata": {},
+   "source": [
+    "The below code block wrap the two scenarios within a function for automation."
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": null,
-   "id": "attractive-medium",
+   "id": "blocked-turning",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -121,7 +129,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "indirect-insulin",
+   "id": "posted-stanford",
    "metadata": {},
    "source": [
     "Below is an example of expected outputs :\n",
@@ -135,7 +143,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "little-answer",
+   "id": "dedicated-convergence",
    "metadata": {},
    "source": [
     "---\n",
@@ -153,7 +161,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "manufactured-shape",
+   "id": "eight-family",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -167,7 +175,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "cubic-water",
+   "id": "animated-prevention",
    "metadata": {},
    "source": [
     "--- \n",
@@ -178,7 +186,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "beneficial-designer",
+   "id": "rolled-navigation",
    "metadata": {},
    "source": [
     "-----\n",
@@ -187,7 +195,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "raising-temperature",
+   "id": "civic-water",
    "metadata": {},
    "source": [
     "-----\n",

The diff is not shown because the file is too large
+ 42 - 43
ai/Megatron/English/Python/jupyter_notebook/Lab1-3_MegatronFundementals.ipynb


+ 36 - 34
ai/Megatron/English/Python/jupyter_notebook/Lab1-4_GPT_vocab_merge_files.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "mysterious-bride",
+   "id": "sustainable-wrong",
    "metadata": {},
    "source": [
     "## GPT Tokenizer files\n",
@@ -19,7 +19,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "thrown-pittsburgh",
+   "id": "fatal-think",
    "metadata": {},
    "source": [
     "Let's review the source code of [gpt2 tokenizer](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html)\n",
@@ -35,12 +35,12 @@
     "         tokenizer(\" Hello world\")['input_ids']\n",
     "        [18435, 995]\n",
     "\n",
-    "We expect our custom tokenizer, which we will train and obtain custom vocab.json and merges.txt files, when applies tokenization, should result in the same outputs above given the exact same input. \n"
+    "We expect our custom tokenizer, which we will later on train in lab 2,  will exhibit the same behavior of [treating spaces like parts of the tokens](https://huggingface.co/transformers/model_doc/gpt2.html#transformers.GPT2Tokenizer).\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "marine-alberta",
+   "id": "missing-congo",
    "metadata": {},
    "source": [
     "Install necessary python libraries."
@@ -49,7 +49,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "entitled-brass",
+   "id": "private-aurora",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -59,7 +59,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "generous-blake",
+   "id": "frequent-blues",
    "metadata": {},
    "source": [
     "Next, we proceed to fetch pretrained GPT Tokenizer files, namely the vocab and merge files, will ideally looks like. \n",
@@ -70,7 +70,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "different-relief",
+   "id": "conceptual-mason",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -80,7 +80,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "latest-thinking",
+   "id": "specific-pharmaceutical",
    "metadata": {},
    "source": [
     "Examine the vocab and merge files, observe the presence of Ġ character.\n",
@@ -90,7 +90,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "continental-keeping",
+   "id": "pursuant-paradise",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -107,7 +107,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "mathematical-depression",
+   "id": "private-hunter",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -116,22 +116,24 @@
   },
   {
    "cell_type": "markdown",
-   "id": "deluxe-empire",
+   "id": "cellular-standing",
    "metadata": {},
    "source": [
-    "The following code block will load a default GPT2Tokenizer from HuggingFace transformer library, we verify the following :\n",
+    "The following code block will load a default GPT2Tokenizer from HuggingFace's **_transformer_** library, we verify the following :\n",
     "\n",
     "            from transformers import GPT2Tokenizer\n",
     "            tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
     "        \n",
     "            tokenizer(\" Hello world\")['input_ids']\n",
-    "            expected token ids for \" Hello world\" is [18435, 995]"
+    "            expected token ids for \" Hello world\" is [18435, 995]\n",
+    "\n",
+    "Note: The HuggingFace's **_transformer_** library does not have functions to train GPTBPE tokenizer, it can load a pre-trained tokenizer given valid files. For training GPTBPE Tokenizer, we will need to use another library called **_tokenizers_** also from HuggingFace."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "detailed-thirty",
+   "id": "driving-right",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -151,35 +153,37 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ongoing-characterization",
+   "id": "collectible-rehabilitation",
    "metadata": {},
    "source": [
     "Below is the expected outputs :\n",
     "    \n",
     "         Hello world\n",
     "        tokens: ['ĠHello', 'Ġworld']\n",
-    "        ids: [18435, 995]"
+    "        ids: [18435, 995]\n",
+    "Observe the presence of the **Ġ** character. "
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "universal-penalty",
+   "id": "fluid-merit",
    "metadata": {},
    "source": [
-    "Next code block will load tokenizer library from huggingFace, we will observe the difference when setting `use_gpt` to True or False. \n",
+    "In the next code block, we will examine how HuggingFace's **_tokenizers_** library loads a pretrained tokenizer given gpt2-vocab.json and merges.txt files. \n",
+    "We will verify that, the usage of `use_gpt` flag will result in the same tokenization behavior, i.e the presence of the **Ġ** character. We will also double check that the token ids are identical to HuggingFace's **_transformer_** loaded `tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")` when applying tokenization to the exact same text ` Hello world`. \n",
     "\n",
     "Setting `use_gpt` to True will evoke the following : \n",
     "\n",
     "        tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()\n",
     "        tokenizer.decoder = ByteLevelDecoder()\n",
     "        \n",
-    "This is the expected tokenizer behavior for GPT models, namely GPTBPE Tokenizer, this GPTBPE tokenizer will load the vocab.json and merges.txt files and tokenize as expected. Whereas setting `use_gpt` to False, will result in a normal BPE Tokenizer, the tokenization will behave differently."
+    "This is the expected tokenizer behavior for GPTBPE Tokenizer, this GPTBPE tokenizer will load the vocab.json and merges.txt files and tokenize as expected. Whereas setting `use_gpt` to False, will result in a normal BPE Tokenizer, the tokenization will behave differently."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "enhanced-factor",
+   "id": "quarterly-remains",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -221,7 +225,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "polar-context",
+   "id": "incident-positive",
    "metadata": {},
    "source": [
     "Below is the expected outputs :\n",
@@ -237,30 +241,28 @@
   },
   {
    "cell_type": "markdown",
-   "id": "governmental-software",
+   "id": "substantial-spank",
    "metadata": {},
    "source": [
     "What did we observed ? \n",
     "\n",
-    "Setting `use_gpt` to True will give us the expected behavor of GPTBPE tokenization. \n",
-    "\n",
-    "It will ensure the presence of Ġ : \n",
+    "We observe that by setting `use_gpt` flag to True in HuggingFace's **_tokenizers_** library when loading the same gpt2-vocab.json and merges.txt will give us the expected behavor of GPTBPE tokenization. \n",
     "\n",
-    "    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()\n",
-    "    tokenizer.decoder = ByteLevelDecoder()\n",
+    "We further verify, by applying tokenization to the exact same text ` Hello world`, the result of the tokenizer, with `use_gpt` flag = True, will match the result of the HuggingFace's  **_transformer_** library loaded gpt2 tokenizer.\n",
     "\n",
+    "Whereas setting `use_gpt` flag = False would result in a different behavior. \n",
     "\n",
     "Therefore, we will enforce having :\n",
     "\n",
     "    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()\n",
     "    tokenizer.decoder = ByteLevelDecoder()\n",
-    "When training our own GPTBPETokenizer with our own raw text data.\n",
-    "    "
+    "\n",
+    "When training our own GPTBPETokenizer with our own raw text data in Lab 2.    "
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "bigger-miami",
+   "id": "solid-aspect",
    "metadata": {},
    "source": [
     "We will now move the gpt-vocab.json and gpt2-merges.txt to the correct data folder as a preparation for the next step."
@@ -269,7 +271,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "killing-advance",
+   "id": "electrical-worcester",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -280,7 +282,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ultimate-girlfriend",
+   "id": "related-saturn",
    "metadata": {},
    "source": [
     "---\n",
@@ -291,7 +293,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "academic-taiwan",
+   "id": "surprised-venue",
    "metadata": {},
    "source": [
     "-----\n",
@@ -300,7 +302,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "hungry-wilderness",
+   "id": "graduate-windsor",
    "metadata": {},
    "source": [
     "-----\n",

+ 21 - 18
ai/Megatron/English/Python/jupyter_notebook/Lab1-5_jsonfy_and_process2mmap.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "acting-adelaide",
+   "id": "structural-documentation",
    "metadata": {},
    "source": [
     "## Jsonfy + convert to mmap\n",
@@ -23,16 +23,19 @@
   },
   {
    "cell_type": "markdown",
-   "id": "aquatic-parcel",
+   "id": "alpha-yahoo",
    "metadata": {},
    "source": [
-    "1. Understand the need of preprocessing data to mmap format."
+    "1. Understand the need of preprocessing data to mmap format.\n",
+    "\n",
+    "The below cell blocks will demonstrate the speed up by using `np.memmap` than `np.load` to load an arbitrary data.\n",
+    "The `np.memmap` is integrated into preprocess_data.py. "
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "according-boston",
+   "id": "useful-fancy",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -44,7 +47,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "armed-smell",
+   "id": "eligible-allen",
    "metadata": {},
    "outputs": [
     {
@@ -63,7 +66,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "dedicated-thirty",
+   "id": "grave-violence",
    "metadata": {},
    "outputs": [
     {
@@ -83,7 +86,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "pressing-boost",
+   "id": "living-cricket",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -93,7 +96,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "parallel-university",
+   "id": "professional-circuit",
    "metadata": {},
    "source": [
     "2. jsonfy the raw text data into loose json format.\n",
@@ -107,7 +110,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "legislative-provision",
+   "id": "thirty-specialist",
    "metadata": {},
    "source": [
     "We will now use the following python script to converting the raw text data into `extractedNVblogs.json` format as a preparation for the next step. \n",
@@ -125,7 +128,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "supported-budget",
+   "id": "postal-conjunction",
    "metadata": {},
    "outputs": [
     {
@@ -142,7 +145,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "split-samuel",
+   "id": "iraqi-scoop",
    "metadata": {},
    "source": [
     "3. Use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training.\n",
@@ -155,7 +158,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "traditional-income",
+   "id": "promotional-pillow",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -169,7 +172,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "residential-honor",
+   "id": "acting-patrick",
    "metadata": {},
    "outputs": [
     {
@@ -232,7 +235,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "viral-shopping",
+   "id": "short-siemens",
    "metadata": {},
    "source": [
     "Below is the expected outputs :\n",
@@ -265,18 +268,18 @@
   },
   {
    "cell_type": "markdown",
-   "id": "virgin-hearts",
+   "id": "informational-willow",
    "metadata": {},
    "source": [
     "---\n",
     "\n",
     "## Links and Resources\n",
-    "Don't forget to [Read More on MMAP](https://docs.python.org/3/library/mmap.html).\n"
+    "Don't forget to [Read More on MMAP](https://docs.python.org/3/library/mmap.html) and examine the [indexed_dataset builder](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/data/indexed_dataset.py#L407).\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "packed-panama",
+   "id": "forbidden-emerald",
    "metadata": {},
    "source": [
     "-----\n",
@@ -285,7 +288,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "lovely-tackle",
+   "id": "dedicated-russell",
    "metadata": {},
    "source": [
     "-----\n",

The diff is not shown because the file is too large
+ 38 - 37
ai/Megatron/English/Python/jupyter_notebook/Lab1-6_Observe_GPT_runs_vs_performance.ipynb


+ 19 - 19
ai/Megatron/English/Python/jupyter_notebook/Lab2-3_train_own_GPT2BPETokenizer.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "naval-commodity",
+   "id": "rapid-arctic",
    "metadata": {},
    "source": [
     "# Train custom GPTBPE  Tokenzer \n",
@@ -10,12 +10,12 @@
     "\n",
     "## Learning Objectives\n",
     "\n",
-    "In order to include the vocabulary of the local language into GPTBPE tokenizer, we need to be able to train GPTBPE Tokenizer on local language raw text data. The trained GPTBPE Tokenizer will produce it's own vocab.json and merges.txt files which is compatible with Megatron-LM's GPTBPE Tokenizer. \n",
+    "In order to include the vocabulary of the local language, in this case it is Swedish, into GPTBPE tokenizer, we need to be able to train GPTBPE Tokenizer on local language raw text data. The trained GPTBPE Tokenizer will produce it's own vocab.json and merges.txt files which will be compatible with Megatron-LM's GPTBPE Tokenizer. \n",
     "\n",
     "Previously in `Lab2-1_acquiring_data.ipynb`, we have acquired our own Swedish raw text data extracted from data source språkbank.\n",
     "Therefore, the goal of this notebook, is to train our own GPTBPE Tokenizer on the Swedish raw text data obtained from `Lab2-1_acquiring_data.ipynb`.\n",
     "\n",
-    "We can either choose to load a previously trained GPTBPE Tokenizer by providing the vocab.json and merges.txt files to the GPTBPE Tokenizer before training further with the raw text data, or we can choose to train a completely new GPTBPE Tokenizer.\n",
+    "We can either choose to load a previously trained GPTBPE Tokenizer by providing the vocab.json and merges.txt files to the GPTBPE Tokenizer before training further with the raw text data, or we can choose to train a completely new GPTBPE Tokenizer from scratch.\n",
     "\n",
     "The two options are covered in this notebook :\n",
     "\n",
@@ -23,7 +23,7 @@
     "    2. option 2 - train a GPT compatible tokenizer from scratch.\n",
     "\n",
     "\n",
-    "We will use HuggingFace's Tokenizer library and the trainer function in order train our own GPTBPE Tokenizer with our own raw text data.\n",
+    "We will use HuggingFace's Tokenizer library and the trainer function in order to train our own GPTBPE Tokenizer with our own raw text data.\n",
     "\n",
     "\n",
     "First, we will install the [HuggingFace Tokenizer library](https://huggingface.co/transformers/installation.html)"
@@ -32,7 +32,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "cathedral-jumping",
+   "id": "suspended-peace",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -42,7 +42,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "designing-occasion",
+   "id": "active-artwork",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -53,7 +53,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "separated-article",
+   "id": "average-boundary",
    "metadata": {},
    "source": [
     "A python script for training custom GPTBPE Tokenizer is provided for your convenience : \n",
@@ -76,7 +76,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "therapeutic-kentucky",
+   "id": "harmful-grounds",
    "metadata": {},
    "source": [
     "1. option 1 - load from pretrained vocab and merge files, then continue training with the new raw text."
@@ -85,7 +85,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "modular-result",
+   "id": "perceived-jerusalem",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -94,7 +94,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "roman-advocate",
+   "id": "weird-index",
    "metadata": {},
    "source": [
     "Below is the expected outputs :\n",
@@ -121,7 +121,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "better-consideration",
+   "id": "emerging-music",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -131,7 +131,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "mysterious-gossip",
+   "id": "filled-blast",
    "metadata": {},
    "source": [
     "2. option 2 - train a GPT compatible tokenizer from scratch."
@@ -140,7 +140,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "unexpected-cowboy",
+   "id": "uniform-complaint",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -151,7 +151,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "atlantic-serbia",
+   "id": "disabled-pencil",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -160,7 +160,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "heated-ranch",
+   "id": "tender-magic",
    "metadata": {},
    "source": [
     "Below is the expected outputs :\n",
@@ -185,7 +185,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "olive-japanese",
+   "id": "criminal-leadership",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -195,7 +195,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "mineral-middle",
+   "id": "orange-alignment",
    "metadata": {},
    "source": [
     "--- \n",
@@ -205,7 +205,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "pregnant-template",
+   "id": "offshore-truck",
    "metadata": {},
    "source": [
     "-----\n",
@@ -214,7 +214,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "limiting-visiting",
+   "id": "clinical-tuition",
    "metadata": {},
    "source": [
     "-----\n",

+ 40 - 37
ai/Megatron/English/Python/jupyter_notebook/Lab2-4_customize_process2mmap.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "otherwise-masters",
+   "id": "crazy-behalf",
    "metadata": {},
    "source": [
     "## Customize preprocess_data.py\n",
@@ -10,16 +10,16 @@
     "\n",
     "## Learning Objectives\n",
     "\n",
-    "We fetched our own Swedish raw text data in `Lab2-1_acquiring_data.ipynb`, we learned how to find sentence boundary with custom functions in `Lab2-2_SentenceBoundary_and_Deduplicate.ipynb` , we also trained a GPTBPETokenizer and fitted it to our raw Swedish text. \n",
+    "We fetched our own Swedish raw text data in `Lab2-1_acquiring_data.ipynb`, we learned how to find sentence boundary with custom functions in `Lab2-2_SentenceBoundary_and_Deduplicate.ipynb` , we also trained a GPTBPETokenizer and fitted it to our raw Swedish text with `Lab2-3_train_own_GPT2BPETokenizer.ipynb`. \n",
     "\n",
-    "We are now ready to incorporate the custom sentence-splitter into preprocess_data.py and covert the raw Swedish text to , first json format, and then mmap format.\n",
+    "We are now ready to incorporate the custom sentence-splitter into preprocess_data.py and covert the raw Swedish text to, first json format, and then mmap format.\n",
     "\n",
-    "Therefore, the goal of this notebook is to integrate all knowledge gained from both Lab 1 as well as the above notebooks, and challenge ourselves to further customize the preprocess_data.py with a <a href=\"./Lab2-4_customize_process2mmap.ipynb#Custom-Sentence-Splitter\">custom sentence-splitter</a>, and in the process, convert the new raw Sweden text to mmap format.\n",
+    "Therefore, the goal of this notebook is to integrate all knowledge gained from both Lab 1 as well as the above notebooks, and challenge ourselves to further customize the preprocess_data.py with a <a href=\"./Lab2-4_customize_process2mmap.ipynb#Custom-Sentence-Splitter\">custom sentence-splitter</a>  function, and in the process, convert the new raw Sweden text to mmap format.\n",
     "\n",
     "More specifically, this notebook will cover the steps to :\n",
     "\n",
-    "1.  Convert the extracted raw Swedish text from webnyheter2013.txt to webnyheter2013.json.\n",
-    "2.  Generate the mmap format files by default preprocess_data.py first to assure the possibility to move on to the next notebook in case time runs out.\n",
+    "1.  Convert the extracted raw Swedish text from `webnyheter2013.txt` to `webnyheter2013.json`.\n",
+    "2.  Generate the mmap format files by default preprocess_data.py as the first step to ensure we have data necessary for the next notebook to run, in case time runs out.\n",
     "\n",
     "\n",
     "Toward the end, there is a Mini-Challenge <a href=\"./Lab2-4_customize_process2mmap.ipynb#Mini-Challenge\">Jump to view Mini-Challenge</a>.\n"
@@ -27,7 +27,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "statutory-thesis",
+   "id": "suburban-coast",
    "metadata": {},
    "source": [
     "1. Convert the extracted raw Swedish text from webnyheter2013.txt to webnyheter2013.json."
@@ -36,7 +36,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "horizontal-cause",
+   "id": "parliamentary-accountability",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -45,7 +45,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "reserved-clear",
+   "id": "pursuant-ghost",
    "metadata": {},
    "source": [
     "Below is the expected outputs :\n",
@@ -58,16 +58,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "fixed-closing",
+   "id": "insured-excitement",
    "metadata": {},
    "source": [
-    "2.  Generate the mmap format files by default preprocess_data.py first to assure the possibility to move on to the next notebook in case time runs out."
+    "2. Generate the mmap format files by default preprocess_data.py as the first step to ensure we have data necessary for the next notebook to run, in case time runs out."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "dried-intro",
+   "id": "palestinian-locking",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -81,7 +81,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "addressed-meeting",
+   "id": "collect-soccer",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -99,7 +99,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "moderate-future",
+   "id": "italian-mount",
    "metadata": {},
    "source": [
     "Below is the expected outputs :\n",
@@ -116,19 +116,22 @@
   },
   {
    "cell_type": "markdown",
-   "id": "electrical-executive",
+   "id": "surrounded-clothing",
    "metadata": {},
    "source": [
-    "Now we get the default mmap files (xxx.bin and xxx.idx ) and therefore guarantee the data needed for the next notebook to run.\n",
-    "We can now move on. We start by copy the old preprocess_data.py and rename it to `MYpreprocess_data.py`\n",
+    "Now we get the default mmap files (xxx.bin and xxx.idx ) and therefore guarantee we have the data needed for the next notebook to run disregard whether we finish the mini-challenge or not. \n",
     "\n",
-    "cp the preprocess_data.py into a new python script called `MYpreprocess_data.py`"
+    "We can now move on. We start by copy the old preprocess_data.py and rename it to `MYpreprocess_data.py`. \n",
+    "\n",
+    "Note: As best practice, one never overwrites original python script existed in the given repo directly, one copies the original python script and rename it to a new python script, then work on the new python script, in case of irreversible failures, one can always refer to the original python script, and start again.\n",
+    "\n",
+    "The below code block will duplicate the preprocess_data.py script and renamed the copied python script into a new python script called `MYpreprocess_data.py`."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "greatest-receptor",
+   "id": "addressed-month",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -137,7 +140,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "south-devil",
+   "id": "growing-restriction",
    "metadata": {},
    "source": [
     "<a id=\"Custom-Sentence-Splitter\"></a>"
@@ -145,16 +148,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "rough-pickup",
+   "id": "chemical-selection",
    "metadata": {},
    "source": [
-    "The below code block is our custom sentence-splitter `cut_sentence_with_quotation_marks`, the custom function is provided for your convenience for integarting to  `MYpreprocess_data.py`"
+    "The custom sentence-splitter `cut_sentence_with_quotation_marks` function is provided below for your convenience, please integrate this custom function into `MYpreprocess_data.py`."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "prostate-profession",
+   "id": "vital-latino",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -189,7 +192,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "large-birthday",
+   "id": "musical-benjamin",
    "metadata": {},
    "source": [
     "<a id=\"Mini-Challenge\"></a>"
@@ -197,7 +200,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "medical-incident",
+   "id": "robust-apparel",
    "metadata": {},
    "source": [
     "---\n",
@@ -219,7 +222,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "selected-depth",
+   "id": "decimal-enlargement",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -432,7 +435,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "raised-victim",
+   "id": "innocent-delight",
    "metadata": {},
    "source": [
     "Below cell block specify all the input parameters in order to run `MYpreprocess_data.py`. \n",
@@ -443,7 +446,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "fluid-dayton",
+   "id": "geographic-convention",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -456,10 +459,10 @@
   },
   {
    "cell_type": "markdown",
-   "id": "concerned-protest",
+   "id": "unavailable-steps",
    "metadata": {},
    "source": [
-    "Below is a ReRun cell block to run `MYpreprocess_data.py` and produce the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files.\n",
+    "Below code block is a ReRun cell to launch `MYpreprocess_data.py` and produce the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files, if the script runs successfully.\n",
     "\n",
     "<a id=\"Rerun_Cell\"></a>\n",
     "\n",
@@ -469,7 +472,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "rolled-welcome",
+   "id": "smoking-memory",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -487,16 +490,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "reduced-court",
+   "id": "strange-maldives",
    "metadata": {},
    "source": [
-    "Check whether these two files : customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files are successfully generated."
+    "Check whether these two files : `customSentenceSplit_text_document.bin` and `customSentenceSplit_text_document.idx` files were successfully generated and is in the correct folder under dataset."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "secondary-stereo",
+   "id": "difficult-library",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -506,7 +509,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "premier-birth",
+   "id": "strategic-confusion",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -516,7 +519,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "eastern-ministry",
+   "id": "temporal-spring",
    "metadata": {},
    "source": [
     "-----\n",
@@ -525,7 +528,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "qualified-admission",
+   "id": "parental-tourism",
    "metadata": {},
    "source": [
     "-----\n",

+ 42 - 45
ai/Megatron/English/Python/jupyter_notebook/Lab2-5_run_Megatron_with_varying_config.ipynb

@@ -2,34 +2,33 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "effective-university",
+   "id": "rising-software",
    "metadata": {},
    "source": [
     "## Scale up model size\n",
     "---\n",
-    "In previous notebooks, we downloaded and extracted our own Swedish raw text; practiced filter, clean and deduplicate the raw text data ; trained our own GPTBPETokenizer and fitted to the raw Swedish text ; converted the raw text to mmap format integrating a custom sentence-splitter.\n",
+    "In previous notebooks, we downloaded and extracted our own Swedish raw text with `Lab2-1_acquiring_data.ipynb`; practiced filter, clean and deduplicate the raw text data with `Lab2-2_SentenceBoundary_and_Deduplicate.ipynb` ; trained our own GPTBPETokenizer and fitted to the raw Swedish text with `Lab2-3_train_own_GPT2BPETokenizer.ipynb`; converted the raw text to mmap format integrating a custom sentence-splitter in `Lab2-4_customize_process2mmap.ipynb`.\n",
     "\n",
-    "Now that we have learned the components to customize the Megatron-LM's workflow according to specific langauge needs ( in this case, it is Swedish). The next step is to train the Megatron-LM GPT model with the Swedish data. \n",
+    "We have learned all the essential components in order to customize Megatron-LM's default workflow in order to accommodate to specific langauge needs ( in this case, it is Swedish ). The obvious next step is to train the Megatron-LM GPT model with the processed Swedish data. \n",
     "\n",
-    "However, constraint by how much compute resources you get, i.e the number of GPUs available for the training job, there is an upper limit of how big a model you can train.\n",
+    "However, constraint by how much compute resources one could get, that is, the number of GPUs available for the training job, there is an upper limit of how big a model you can train.\n",
     "\n",
-    "Let's test this out by presenting a Challenge. \n",
+    "We will test ou thow big a model we could train with 2 X A100 GPUs 40GB, by presenting a Challenge!\n",
     "\n",
     "## **Challenge ** - Go big or go home !\n",
     "\n",
     "- Constraints : \n",
     "    - 2 x A100 GPUs 40G is allocated for this challenge.\n",
-    "    - Only the parameters in the **modifiable blocks** are allowed to be changed.\n",
+    "    - Only the parameters in the **##### Begin/End of modifiable blocks #####** are allowed to be changed.\n",
     "    - Avoid OOM !\n",
     "    - Training run must be finished and checkpoint must be saved successfully.\n",
     "\n",
-    "\n",
     "- Task : \n",
-    "        given the above prerequisites, train as BIG a GPT model as possible.\n",
+    "        Given the above constraints, train as BIG a GPT model as possible.\n",
     "\n",
-    "- Winning criteria : the biggest model wins given the above constraints.\n",
+    "- Winning criteria : The biggest model wins given the above constraints.\n",
     "\n",
-    "Note 1: Post the parameters you changed into the **modifiable blocks** on slack channels for verification.\n",
+    "Note 1: Post the parameters you changed into the **##### Begin/End of modifiable blocks #####**  on bootcamp's slack channels for verification.\n",
     "\n",
     "Note 2: We purposefully turned-off nsys profiling in this challenge, because calling nsys profiling will introduce a small overhead, which will impact the maximum achievable model size.\n",
     "\n",
@@ -38,19 +37,17 @@
   },
   {
    "cell_type": "markdown",
-   "id": "incorporate-bidding",
+   "id": "historic-eating",
    "metadata": {},
    "source": [
-    "---\n",
-    "# Hint :\n",
-    "### call out a terminal and type in **nvidia-smi** to monitor the GPUs' utils and power consumption \n",
-    "### remember to fill up the GPU memory\n",
-    "![call out a terminal ](./Megatron-LM/pics/Alt_callout2terminals.JPG)"
+    "\n",
+    "**Hint** :\n",
+    "Use the knowledge gained from `Lab1-6_Observe_GPT_runs_vs_performance.ipynb`, especially the section with video demonstrating how to do live profiling during a live training run."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "established-substitute",
+   "id": "cleared-toolbox",
    "metadata": {},
    "source": [
     "Modify and rerun the code blocks below to obtain a even bigger GPT model. \n",
@@ -62,7 +59,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "central-scheduling",
+   "id": "large-buying",
    "metadata": {},
    "source": [
     "<a id=\"MODIFY_CELL\"></a>"
@@ -70,7 +67,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "handy-process",
+   "id": "approved-beatles",
    "metadata": {},
    "source": [
     "Always clean the checkpoint folder to ensure trainining start from scratch."
@@ -79,7 +76,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "human-privacy",
+   "id": "attended-vault",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -89,7 +86,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "chief-latter",
+   "id": "engaging-ocean",
    "metadata": {},
    "outputs": [
     {
@@ -117,7 +114,7 @@
     "MERGE_FILE='../dataset/SV/56k/merges.txt'\n",
     "PROFILE_OUTPUT_PATH='../profiles/SV/nsys_sv_' # modify this to your own profile path\n",
     "\n",
-    "#### [TODO]--------------- Begin of modifiable block -----------#### \n",
+    "# -------------------- #####  Begin of modifiable block ##### -------------------- \n",
     "\n",
     "TENSOR_MP_SIZE=<FILL_IN>\n",
     "PIPELINE_MP_SIZE=<FILL_IN>\n",
@@ -129,7 +126,7 @@
     "SEQ_LEN=<FILL_IN>\n",
     "MAX_POS_EM=<FILL_IN>\n",
     "\n",
-    "#### -------------------- end of modifiable blocks ------------------------#### \n",
+    "# -------------------- #####  End of modifiable blocks ##### ------------------------ \n",
     "\n",
     "##################  DO NOT modify anything below this line ##################\n",
     "export OMP_NUM_THREADS=1\n",
@@ -173,18 +170,22 @@
   },
   {
    "cell_type": "markdown",
-   "id": "proprietary-elizabeth",
+   "id": "determined-cliff",
    "metadata": {},
    "source": [
     "Check how big is your model. By modify the parameters in the [params_cnt.sh](./params_cnt.sh)\n",
     "\n",
-    "I got 6.6 Billion :)  what about you ?"
+    "I got 6.6 Billion :)  what about you ?\n",
+    "\n",
+    "Modify the [params count](./params_cnt.sh) accoring to your training configuration.\n",
+    "\n",
+    "After modification, run the below bash script to obtain the model size."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "beginning-homework",
+   "id": "green-magic",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -193,25 +194,25 @@
   },
   {
    "cell_type": "markdown",
-   "id": "radio-secretariat",
+   "id": "awful-candle",
    "metadata": {},
    "source": [
     "Below is an example of expected outputs:\n",
     "    \n",
-    "        6\n",
-    "        6675628032\n"
+    "        6 <-- One could get different number depend on your training config\n",
+    "        6675628032 <-- One could get different number depend on your training config\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "fuzzy-assault",
+   "id": "great-league",
    "metadata": {},
    "source": [
     "Re-run this cell below to get an even bigger GPT model\n",
     "\n",
     "Remember to modify the [params count](./params_cnt.sh) to check how big is your model.\n",
     "\n",
-    "Jump back and mdify the SV_GPT_goingBIG.sh, click here to \n",
+    "Jump back and edit the SV_GPT_goingBIG.sh, click here to \n",
     "<a href=\"./Lab2-5_run_Megatron_with_varying_config.ipynb#MODIFY_CELL\">Jump back to modify and overwrite SV_GPT_goingBIG.sh </a> \n",
     "<a id=\"Rerun_Cell\"></a>"
    ]
@@ -219,7 +220,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "rental-deputy",
+   "id": "italian-karma",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -228,7 +229,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "korean-republic",
+   "id": "outstanding-application",
    "metadata": {},
    "source": [
     "Below is an example of expected outputs:\n",
@@ -251,31 +252,27 @@
   },
   {
    "cell_type": "markdown",
-   "id": "official-concept",
+   "id": "blessed-grammar",
    "metadata": {},
    "source": [
-    "--- \n",
-    "\n",
-    "## Additional Resources\n",
-    "\n",
-    "Language Models are Few-Shot Learners : https://arxiv.org/pdf/2005.14165.pdf\n",
+    "---\n",
     "\n",
-    "Efficient Large-Scale Language Model Training on GPU Clusters : https://arxiv.org/pdf/2104.04473.pdf"
+    "## Links and Resources\n",
+    "Don't forget to read more on [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165.pdf) and [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf)."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "laden-sender",
+   "id": "velvet-nylon",
    "metadata": {},
    "source": [
-    "---\n",
-    "\n",
-    "## Congratulations on completing the mission !\n"
+    "-----\n",
+    "## <p style=\"text-align:center;border:3px; padding: 1em\"> <a href=../Start_Here.ipynb>HOME</a></p>"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "premium-treasury",
+   "id": "framed-blood",
    "metadata": {},
    "source": [
     "-----\n",

ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/df2.csv → ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/HandCrafted_Duplicates.csv


+ 24 - 24
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "quality-channel",
+   "id": "micro-village",
    "metadata": {},
    "source": [
     "## Website scrapping\n",
@@ -27,12 +27,12 @@
     "    5. Parse the html file and extract raw text and write to disk.\n",
     "    6. Move the `extractedNVblogs.txt` to the correct folder under **dataset**.\n",
     "\n",
-    "This notebook did not intend to cover crawling webpages for other website/webpages.\n"
+    "This notebook did not intend to cover crawling webpages for other websites/webpages.\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "everyday-leonard",
+   "id": "listed-stopping",
    "metadata": {},
    "source": [
     "1. install python libraries and download 2 python scripts which will be used for website crawling."
@@ -41,7 +41,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "exotic-grave",
+   "id": "endangered-vessel",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -57,7 +57,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "tamil-electric",
+   "id": "designed-insight",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -68,7 +68,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "precious-birth",
+   "id": "knowing-andorra",
    "metadata": {},
    "source": [
     "2. Crawl links from a seeded url and write to a text file named `NVdevblog_urls.txt`"
@@ -77,7 +77,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "dietary-beads",
+   "id": "natural-commander",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -87,7 +87,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "potential-regard",
+   "id": "incredible-lunch",
    "metadata": {},
    "source": [
     "3. Remove incompliant links from the text file in order to ensure legal compliancy.\n",
@@ -100,7 +100,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "amazing-nickname",
+   "id": "angry-cattle",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -122,7 +122,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "military-electronics",
+   "id": "defined-interim",
    "metadata": {},
    "source": [
     "4. Fetch the corresponding webpage from each approved url and write it to `XXX.html` format."
@@ -131,7 +131,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "worth-album",
+   "id": "competitive-tower",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -141,7 +141,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "collective-dimension",
+   "id": "guided-certification",
    "metadata": {},
    "source": [
     "Below is an example of expected outputs :\n",
@@ -157,7 +157,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "heard-recovery",
+   "id": "neural-method",
    "metadata": {},
    "source": [
     "5. Parse the html file and extract the raw text data, which will be written to `extractedNVblogs.txt`."
@@ -166,7 +166,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "suspended-degree",
+   "id": "affected-albania",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -202,7 +202,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "continued-voice",
+   "id": "first-research",
    "metadata": {},
    "source": [
     "6. Move the `extractedNVblogs.txt` to the correct folder under **dataset**. This file `extractedNVblogs.txt` will be used in subsequent notebooks in lab1."
@@ -211,7 +211,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "german-shareware",
+   "id": "fourth-certificate",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -220,7 +220,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "willing-charleston",
+   "id": "cultural-retro",
    "metadata": {},
    "source": [
     "**Note:** Please run below cell to free up space."
@@ -229,7 +229,7 @@
   {
    "cell_type": "code",
    "execution_count": 14,
-   "id": "square-montana",
+   "id": "stretch-pattern",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -241,7 +241,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "brave-ranking",
+   "id": "black-courage",
    "metadata": {},
    "source": [
     "Verify `extractedNVblogs.txt` is successfully moved to the correct folder."
@@ -250,7 +250,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "pressed-model",
+   "id": "standing-bridges",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -259,7 +259,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "worse-affairs",
+   "id": "prepared-ballot",
    "metadata": {},
    "source": [
     "Below is an example of expected outputs :\n",
@@ -269,7 +269,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "convenient-treatment",
+   "id": "cellular-termination",
    "metadata": {},
    "source": [
     "--- \n",
@@ -280,7 +280,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "sorted-federation",
+   "id": "complex-valentine",
    "metadata": {},
    "source": [
     "-----\n",
@@ -289,7 +289,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "exclusive-qualification",
+   "id": "flush-bruce",
    "metadata": {},
    "source": [
     "--- \n",

+ 22 - 14
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-1_acquiring_data.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "stopped-graph",
+   "id": "fabulous-yield",
    "metadata": {},
    "source": [
     "## Acquire Swedish data \n",
@@ -10,7 +10,7 @@
     "\n",
     "For data licensing and privacy concerns, we will not providing training data in this bootcamp.\n",
     "\n",
-    "However, we do need data in order to proceed the customization of Megatron-LM's workflow for Swedish, hence, the first thing we need to do, is to acquire Swedish raw text data.\n",
+    "However, we do need data in order to proceed the customization of Megatron-LM's workflow for local language needs ( in this case, it is Swedish ), hence, the first thing we need to do, is to acquire Swedish raw text data.\n",
     "\n",
     "This notebook is therefore provided to assist acquisition of Swedish raw text data from språkbanken.\n",
     "\n",
@@ -18,7 +18,7 @@
     "\n",
     "    1. Download data via wget and download the python script which will be used to extract the Swedish text.\n",
     "    \n",
-    "    2. unzip the data using bunzip and move the data to the correct folder under dataset.\n",
+    "    2. Unzip the data using bunzip and move the data to the correct folder under dataset.\n",
     "    \n",
     "    3. A custom function is provided in order to extract raw txt file from xml file and move the text file to the correct folder under dataset.\n",
     "\n",
@@ -32,7 +32,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "realistic-boating",
+   "id": "wrapped-fields",
    "metadata": {},
    "source": [
     "1. Download data via wget and download the python script which will be used to extract the Swedish text."
@@ -41,7 +41,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "electric-motel",
+   "id": "silent-writer",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -51,7 +51,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "activated-sigma",
+   "id": "sunset-moisture",
    "metadata": {},
    "source": [
     "2. unzip the data using bunzip and move the data to the correct folder under dataset."
@@ -60,7 +60,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "blond-begin",
+   "id": "satisfied-absolute",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -70,7 +70,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "interior-disease",
+   "id": "detected-volleyball",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -80,7 +80,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "severe-table",
+   "id": "expired-compact",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -89,7 +89,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "worse-ceiling",
+   "id": "minimal-charm",
    "metadata": {},
    "source": [
     "3. A custom function is provided in order to extract raw txt file from xml file and move the text file to the correct folder under dataset."
@@ -98,7 +98,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "afraid-eleven",
+   "id": "hungry-fundamental",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -134,9 +134,17 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "thick-consortium",
+   "metadata": {},
+   "source": [
+    "Verify the output `webnyheter2013.txt` exist under `../../../../dataset/SV/`, we need this raw text file to proceed the subsequent notebooks for Lab2."
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": null,
-   "id": "dietary-thinking",
+   "id": "bronze-interpretation",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -145,7 +153,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "false-calgary",
+   "id": "pleased-saskatchewan",
    "metadata": {},
    "source": [
     "-----\n",
@@ -154,7 +162,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "elegant-bandwidth",
+   "id": "local-limit",
    "metadata": {},
    "source": [
     "-----\n",

The diff is not shown because the file is too large
+ 142 - 124
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-2_SentenceBoundary_and_Deduplicate.ipynb


The diff is not shown because the file is too large
+ 0 - 2711
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/custom_english_non_breaking_prefixes.txt


+ 1 - 1
ai/Megatron/English/Python/source_code/Day1-runMegatron-LM_GPT_template.sh

@@ -9,7 +9,7 @@
 ###  -----------------  modify <UserName> and <FILL_IN> in the section below -----------------
 #SBATCH --output=//proj/guest_at_nsc/users/<UserName>/output/multinode_template_%x_%j_$DATETIME.log 
 
-DIR='/proj/guest_at_nsc/users/<UserName>/'
+DIR='/proj/<BootCamp_DIR>/users/<UserName>/'
 DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
 CHECKPOINT_PATH=$DIR/output/sv_gpt3_ckpt/
 VOCAB_FILE=$DIR/dataset/vocab.json

+ 6 - 3
ai/Megatron/README.md

@@ -12,6 +12,7 @@ There are 2 Labs, each with a different focus.
 **Important** : This bootcamp is intended to be delivered by NVIDIA certified instructors and TAs, it is _NOT_ meant for self-paced learners.
 
 Note1 : The lecture presentations as well as the solutions to the challenges and mini-challenges will be delivered at the end of each lab
+Note2 : Multi-node Megatron-LM GPT3 training can be added as an additional lab depending on the availability of compute resources.
 
 ## Labs Duration :
 The two labs will take approximately 12 hours ( including solving challenges and mini-challenges ) to complete.
@@ -21,7 +22,8 @@ The two labs will take approximately 12 hours ( including solving challenges and
 Although this bootcamp is designed to run on a computing cluster with [NVIDIA SuperPOD Architecture](https://resources.nvidia.com/en-us-auto-datacenter/nvpod-superpod-wp-09),
 it is possible to run it in an environment where you have access to 2 X A100 GPUs 40GB with NVLink/NVSwitch.
 
-### Scenario 1 : When docker pull & run is allowed, and the GPUs are directly accessbile to the users in the environment.
+### Scenario 1 : local station with 2 X A100 GPUs 40GB and NVLink 
+When docker pull & run is allowed, and the GPUs are directly accessible to the users in the environment.
 
 #### Step 1 - Clone the gpubootcamp repo to obtain the scripts and notebooks.
 `git clone https://github.com/gpuhackathons-org/gpubootcamp.git &&
@@ -35,14 +37,14 @@ With sudo privilege :
 `sudo docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=GPUS -p USR_PORT:USR_PORT -it --rm --ulimit memlock=-1 --ulimit stack=67108864 --cap-add=SYS_ADMIN -v DIR:/workspace nvcr.io/nvidia/pytorch:21.03-py3 `
 
 Without sudo privilege but the user is added to the docker group : 
-`docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=GPUS -p USR_PORTUSR_PORT -it --rm --ulimit memlock=-1 --ulimit stack=67108864 -v DIR:/workspace nvcr.io/nvidia/pytorch:21.03-py3`
+`docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=GPUS -p USR_PORT:USR_PORT -it --rm --ulimit memlock=-1 --ulimit stack=67108864 -v DIR:/workspace nvcr.io/nvidia/pytorch:21.03-py3`
 
 #### Step 3 - call out jupyter lab 
 ` jupyter lab --no-browser --port=USR_PORT --allow-root --NotebookApp.token='' `
 
 #### Step 4 - in another terminal , call out a browser ( such as firefox )
 Then, open the jupyter notebook in browser: localhost:USR_PORT
-Navigate to gpubootcamp/ai/Megatron/English/Python/ and open the `Start_Here.ipynb` notebook.
+Navigate to /gpubootcamp/ai/Megatron/English/Python/ and open the `Start_Here.ipynb` notebook.
 
 ### Scenario 2 : Accessing the jupyter lab with Singularity + Slurm + SSH port forwarding is allowed
 A User Guide is often provided when one requests access to a computing cluster with [NVIDIA SuperPOD Architecture](https://resources.nvidia.com/en-us-auto-datacenter/nvpod-superpod-wp-09). However, each compute cluster might have slight deviations from the reference architecture on various levels, HW and/or SW, as well as the resource management and control setups. 
@@ -58,6 +60,7 @@ HOST_PORT=<Available_PORT_ON_HOST>
 CLUSTER_NAME=<Obtain_this_from_cluster_admin>
 #### Step 2 - Build the pytorch_21.03.sif file  
 `sudo singularity build pytorch_21.03.sif docker://nvcr.io/pytorch:21.03-py3`
+
 Note1: If you do not have sudo rights, you might need to either contact the cluster admin, or build this in another environment where you have sudo rights.
Note2: You should copy the pytorch_21.03.sif to the cluster environment one level above the DIR_to_gpubootcamp