
implement proof-reading thanks to Millie T

zenodia 2 years ago
parent
commit
4ad921a7fd

+ 12 - 12
ai/Megatron/English/Python/Start_Here.ipynb

@@ -25,7 +25,7 @@
     " \n",
     " - Profiling : core concepts on GPUs performance across multicampus and/or multi-node runs.\n",
     "\n",
-    "In Lab 2, the focus is shifted to the **customization** of [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) workflow. We will walk through and exercise steps for customization of the Megatron-LM's workflow in order to address to local language needs.  \n",
+    "In Lab 2, the focus is shifted to the **customization** of [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) workflow. We will walk through and exercise steps to customise the Megatron-LM's workflow in order to address to local language needs.  \n",
     "\n",
     "\n",
     "* Standard: Python\n",
@@ -40,7 +40,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Start by checking available gpus in the environment using nvidia-smi "
+    "Start by checking available GPUs in the environment using nvidia-smi "
    ]
   },
   {
@@ -56,7 +56,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Verify you have 2 x A100 GPUs, each with 40G memory, below is an example of expected outputs : \n",
+    "Verify you have 2 x A100 GPUs, each with 40G memory. Below is an example of expected outputs: \n",
     "\n",
     "            Wed Sep 15 09:14:15 2021       \n",
     "            +-----------------------------------------------------------------------------+\n",
@@ -98,7 +98,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Verify NVlink is active, below is an example of expected outputs : \n",
+    "Verify NVlink is active. Below is an example of expected outputs: \n",
     "\n",
     "        GPU 0: A100-SXM4-40GB (UUID: GPU-2e4d2105-718d-3b94-6f0f-25c148681e83)\n",
     "             Link 0: 25 GB/s\n",
@@ -141,7 +141,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Verify profiling capability, the expected output should look something simialr to the below\n",
+    "Verify profiling capability, the expected output should look something similar to the below:\n",
     "\n",
     "            Sampling Environment Check\n",
     "            Linux Kernel Paranoid Level = 2: OK\n",
@@ -158,7 +158,7 @@
    "metadata": {},
    "source": [
     "\n",
-    "To start with, we need to create folders as placeholders for dataset. We are going to populate these folders later."
+    "To start with, we need to create folders as placeholders for the dataset. We are going to populate these folders later."
    ]
   },
   {
@@ -189,7 +189,7 @@
     "\n",
     "- **Outlines of Lab 1**\n",
     "    Megatron 101 in half a day - Please go through the below notebooks sequentially.\n",
-    "    1. [WebCrawling to obtain raw text data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb)\n",
+    "    1. [WebCrawling to obtain raw text data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scraping.ipynb)\n",
     "    2. [Estimate hours/days needed to execute one end-to-end run per Megatron-LM's configuration](./jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb)\n",
     "    3. [Understanding the core of Megatron-LM - MPU ](./jupyter_notebook/Lab1-3_MegatronFundementals.ipynb)\n",
     "    4. [About GPT's tokenizer](./jupyter_notebook/Lab1-4_GPT_vocab_merge_files.ipynb)\n",
@@ -198,11 +198,11 @@
     "\n",
     "\n",
     "- **Outlines of Lab 2**\n",
-    "    Getting started on training own language Megatron-LM GPT models -- Please go through the below notebooks sequentially.\n",
+    "    Getting started on training your own language models using Megatron-LM GPT -- Please go through the below notebooks sequentially.\n",
     "    1. [Fetch and extract Swedish data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-1_acquiring_data.ipynb)\n",
     "    2. [Find sentence boundary and deduplicate your data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-2_SentenceBoundary_and_Deduplicate.ipynb)\n",
     "    3. [Train your own GPTBPE Tokenizer on your own data ](./jupyter_notebook/Lab2-3_train_own_GPT2BPETokenizer.ipynb)\n",
-    "    4. [customize preprocess data python script and convert to mmap](./jupyter_notebook/Lab2-4_customize_process2mmap.ipynb)\n",
+    "    4. [Customize preprocess data python script and convert to mmap](./jupyter_notebook/Lab2-4_customize_process2mmap.ipynb)\n",
     "    5. [The Challenge - Go Big or go home!](./jupyter_notebook/Lab2-5_run_Megatron_with_varying_config.ipynb)\n",
     "\n"
    ]
@@ -212,15 +212,15 @@
    "metadata": {},
    "source": [
     "### Tutorial Duration\n",
-    "The lab material will be presented in a 12 hr session. Link to material is available for download at the end of the gpubootcamp. \n",
+    "The lab material will be presented in a 12 hour session. Link to material is available for download at the end of the gpubootcamp. \n",
     "\n",
     "### Content Level\n",
-    "Intermediate , Advanced\n",
+    "Intermediate, Advanced\n",
     "\n",
     "### Target Audience and Prerequisites\n",
     "The target audience for this lab is researchers/graduate students and developers who are interested in learning about training very large language models on a super computing cluster.\n",
     "\n",
-    "Basic understanding on Deep learning and Pytorch is required, if you are new to Deep learning and or new to Pytorch, it is recommended to go through the [Distributed_Deep_Learning bootcamp](https://github.com/gpuhackathons-org/gpubootcamp/tree/master/ai/Distributed_Deep_Learning/English/python) and [Pytorch tutorials](https://pytorch.org/tutorials/) as prior.\n",
+    "Basic understanding on Deep learning and Pytorch is required. If you are new to Deep learning and or new to Pytorch, it is recommended to go through the [Distributed_Deep_Learning bootcamp](https://github.com/gpuhackathons-org/gpubootcamp/tree/master/ai/Distributed_Deep_Learning/English/python) and [Pytorch tutorials](https://pytorch.org/tutorials/) as prior.\n",
     " \n",
     "**Disclaimer** : All the results mentioned in the notebooks were tested on a *DGX-2 machine equipped with 2 x A100 GPUs connected via NVLink*. The results would vary when using different hardware and would also depend on the Interconnect bandwidth and the thermal conditions of the machine."
    ]

File diff suppressed because it is too large
+ 16 - 17
ai/Megatron/English/Python/jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb


File diff suppressed because it is too large
+ 50 - 50
ai/Megatron/English/Python/jupyter_notebook/Lab1-3_MegatronFundementals.ipynb


+ 34 - 34
ai/Megatron/English/Python/jupyter_notebook/Lab1-4_GPT_vocab_merge_files.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "sustainable-wrong",
+   "id": "mental-dating",
    "metadata": {},
    "source": [
     "## GPT Tokenizer files\n",
@@ -14,15 +14,15 @@
     "\n",
     "Later on, we will use the observations from this notebook to train a GPTBPE Tokenizer with our own raw text data.\n",
     "\n",
-    "We will load and verify GPTBPE Tokenizer and make sure the output tokens and token ids are as expected. \n"
+    "We will load and verify the GPTBPE Tokenizer and make sure the output tokens and token ids are as expected. \n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "fatal-think",
+   "id": "indoor-albany",
    "metadata": {},
    "source": [
-    "Let's review the source code of [gpt2 tokenizer](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html)\n",
+    "Let's review the source code of [GPT2 tokenizer](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html)\n",
     "\n",
     "    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will\n",
     "    be encoded differently whether it is at the beginning of the sentence (without space) or not:\n",
@@ -35,12 +35,12 @@
     "         tokenizer(\" Hello world\")['input_ids']\n",
     "        [18435, 995]\n",
     "\n",
-    "We expect our custom tokenizer, which we will later on train in lab 2,  will exhibit the same behavior of [treating spaces like parts of the tokens](https://huggingface.co/transformers/model_doc/gpt2.html#transformers.GPT2Tokenizer).\n"
+    "We expect our custom tokenizer, which we will train later in lab 2,  will exhibit the same behavior of [treating spaces like parts of the tokens](https://huggingface.co/transformers/model_doc/gpt2.html#transformers.GPT2Tokenizer).\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "missing-congo",
+   "id": "composed-amount",
    "metadata": {},
    "source": [
     "Install necessary python libraries."
@@ -49,7 +49,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "private-aurora",
+   "id": "instructional-seller",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -59,10 +59,10 @@
   },
   {
    "cell_type": "markdown",
-   "id": "frequent-blues",
+   "id": "geological-drinking",
    "metadata": {},
    "source": [
-    "Next, we proceed to fetch pretrained GPT Tokenizer files, namely the vocab and merge files, will ideally looks like. \n",
+    "Next, we will proceed to fetch the pretrained GPT Tokenizer files, namely the vocab and merge files, will ideally looks like. \n",
     "\n",
     "We can later on use these observations to validate our custom trained GPTBPE tokenizer and the corresponding vocab.json and merges.txt file, in order to ensure the custom trained GPTBPE tokenizer will tokenze as expected."
    ]
@@ -70,7 +70,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "conceptual-mason",
+   "id": "destroyed-cleaning",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -80,7 +80,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "specific-pharmaceutical",
+   "id": "sustainable-sweet",
    "metadata": {},
    "source": [
     "Examine the vocab and merge files, observe the presence of Ġ character.\n",
@@ -90,7 +90,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "pursuant-paradise",
+   "id": "overhead-freeware",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -107,7 +107,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "private-hunter",
+   "id": "eight-perception",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -116,10 +116,10 @@
   },
   {
    "cell_type": "markdown",
-   "id": "cellular-standing",
+   "id": "knowing-tsunami",
    "metadata": {},
    "source": [
-    "The following code block will load a default GPT2Tokenizer from HuggingFace's **_transformer_** library, we verify the following :\n",
+    "The following code block will load a default GPT2Tokenizer from HuggingFace's **_transformer_** library, verify the following:\n",
     "\n",
     "            from transformers import GPT2Tokenizer\n",
     "            tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
@@ -127,13 +127,13 @@
     "            tokenizer(\" Hello world\")['input_ids']\n",
     "            expected token ids for \" Hello world\" is [18435, 995]\n",
     "\n",
-    "Note: The HuggingFace's **_transformer_** library does not have functions to train GPTBPE tokenizer, it can load a pre-trained tokenizer given valid files. For training GPTBPE Tokenizer, we will need to use another library called **_tokenizers_** also from HuggingFace."
+    "Note: The HuggingFace's **transformer** library does not have functions to train the GPTBPE tokenizer, but it can load a pre-trained tokenizer given valid files. For training the GPTBPE Tokenizer, we will need to use another library called **tokenizers**, also from HuggingFace."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "driving-right",
+   "id": "headed-baseline",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -153,7 +153,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "collectible-rehabilitation",
+   "id": "lined-thousand",
    "metadata": {},
    "source": [
     "Below is the expected outputs :\n",
@@ -166,24 +166,24 @@
   },
   {
    "cell_type": "markdown",
-   "id": "fluid-merit",
+   "id": "animated-durham",
    "metadata": {},
    "source": [
     "In the next code block, we will examine how HuggingFace's **_tokenizers_** library loads a pretrained tokenizer given gpt2-vocab.json and merges.txt files. \n",
-    "We will verify that, the usage of `use_gpt` flag will result in the same tokenization behavior, i.e the presence of the **Ġ** character. We will also double check that the token ids are identical to HuggingFace's **_transformer_** loaded `tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")` when applying tokenization to the exact same text ` Hello world`. \n",
+    "We will verify that, the usage of the `use_gpt` flag will result in the same tokenization behavior, i.e the presence of the **Ġ** character. We will also double check that the token IDs are identical to HuggingFace's **_transformer_** loaded `tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")` when applying tokenization to the exact same text ` Hello world`. \n",
     "\n",
     "Setting `use_gpt` to True will evoke the following : \n",
     "\n",
     "        tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()\n",
     "        tokenizer.decoder = ByteLevelDecoder()\n",
     "        \n",
-    "This is the expected tokenizer behavior for GPTBPE Tokenizer, this GPTBPE tokenizer will load the vocab.json and merges.txt files and tokenize as expected. Whereas setting `use_gpt` to False, will result in a normal BPE Tokenizer, the tokenization will behave differently."
+    "This is the expected tokenizer behavior for GPTBPE Tokenizer. This GPTBPE tokenizer will load the vocab.json and merges.txt files and tokenize as expected. Whereas setting the `use_gpt` flag to False, will result in a normal BPE Tokenizer and the tokenization will behave differently."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "quarterly-remains",
+   "id": "german-decimal",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -225,7 +225,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "incident-positive",
+   "id": "joined-executive",
    "metadata": {},
    "source": [
     "Below is the expected outputs :\n",
@@ -241,16 +241,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "substantial-spank",
+   "id": "other-equality",
    "metadata": {},
    "source": [
-    "What did we observed ? \n",
+    "What did we observed? \n",
     "\n",
-    "We observe that by setting `use_gpt` flag to True in HuggingFace's **_tokenizers_** library when loading the same gpt2-vocab.json and merges.txt will give us the expected behavor of GPTBPE tokenization. \n",
+    "By setting the `use_gpt` flag to True in HuggingFace's **_tokenizers_** library when loading the same gpt2-vocab.json and merges.txt gives us the expected behavior of GPTBPE tokenization. \n",
     "\n",
-    "We further verify, by applying tokenization to the exact same text ` Hello world`, the result of the tokenizer, with `use_gpt` flag = True, will match the result of the HuggingFace's  **_transformer_** library loaded gpt2 tokenizer.\n",
+    "We can further verify that by applying tokenization to the exact same text ` Hello world`, the result of the tokenizer, with the`use_gpt` flag = True, will match the result of the HuggingFace's  **_transformer_** library loaded gpt2 tokenizer.\n",
     "\n",
-    "Whereas setting `use_gpt` flag = False would result in a different behavior. \n",
+    "Whereas setting the `use_gpt` flag = False would result in a different behavior. \n",
     "\n",
     "Therefore, we will enforce having :\n",
     "\n",
@@ -262,7 +262,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "solid-aspect",
+   "id": "certain-indie",
    "metadata": {},
    "source": [
     "We will now move the gpt-vocab.json and gpt2-merges.txt to the correct data folder as a preparation for the next step."
@@ -271,7 +271,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "electrical-worcester",
+   "id": "interstate-center",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -282,18 +282,18 @@
   },
   {
    "cell_type": "markdown",
-   "id": "related-saturn",
+   "id": "developing-finland",
    "metadata": {},
    "source": [
     "---\n",
     "\n",
     "## Links and Resources\n",
-    "Don't forget to check out additional resources such as [HuggingFace Tokenizer Documentation](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) and [Train GPT-2 in your own langauge](https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171).\n"
+    "Don't forget to check out additional resources such as [HuggingFace Tokenizer Documentation](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) and [Train GPT-2 in your own language](https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171).\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "surprised-venue",
+   "id": "appreciated-technology",
    "metadata": {},
    "source": [
     "-----\n",
@@ -302,7 +302,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "graduate-windsor",
+   "id": "parallel-preliminary",
    "metadata": {},
    "source": [
     "-----\n",

+ 24 - 26
ai/Megatron/English/Python/jupyter_notebook/Lab1-5_jsonfy_and_process2mmap.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "structural-documentation",
+   "id": "aggressive-madness",
    "metadata": {},
    "source": [
     "## Jsonfy + convert to mmap\n",
@@ -14,28 +14,26 @@
     "\n",
     "In particular, we will cover the following steps :\n",
     "\n",
-    "    1. Understand the need of preprocessing data to mmap format.\n",
+    "    1. Understand the need for preprocessing data to mmap format.\n",
     "    2. Convert the raw text data into loose json format.\n",
-    "    3. Use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training.\n",
-    "\n",
-    "\n"
+    "    3. Use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training.\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "alpha-yahoo",
+   "id": "alternative-supply",
    "metadata": {},
    "source": [
-    "1. Understand the need of preprocessing data to mmap format.\n",
+    "1. Understand the need for preprocessing data to mmap format.\n",
     "\n",
-    "The below cell blocks will demonstrate the speed up by using `np.memmap` than `np.load` to load an arbitrary data.\n",
+    "The below cell blocks will demonstrate the increased speed-up of using `np.memmap` instead of `np.load` to load arbitrary data.\n",
     "The `np.memmap` is integrated into preprocess_data.py. "
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "useful-fancy",
+   "id": "instructional-tunnel",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -47,7 +45,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "eligible-allen",
+   "id": "brave-barcelona",
    "metadata": {},
    "outputs": [
     {
@@ -66,7 +64,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "grave-violence",
+   "id": "warming-hardwood",
    "metadata": {},
    "outputs": [
     {
@@ -86,7 +84,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "living-cricket",
+   "id": "documentary-pointer",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -96,24 +94,24 @@
   },
   {
    "cell_type": "markdown",
-   "id": "professional-circuit",
+   "id": "institutional-shape",
    "metadata": {},
    "source": [
     "2. jsonfy the raw text data into loose json format.\n",
     "\n",
     "The preprocess_data.py is expecting to receive json format data. Hence we need to convert the raw text data to json format first.\n",
-    "It is assumed that the json format data, will have one element per document, and the 'text' field in the json data, it's value will be extracted in preprocess_data.py. Other fields can also be specified for extraction. \n",
-    "An example of how the json data should look like, is given by the following : \n",
+    "It is assumed that the json format data, will have one element per document, and the value of the 'text' field in the json data will be extracted in preprocess_data.py. Other fields can also be specified for extraction. \n",
+    "An example of how the json data should look is showing in the following: \n",
     "\n",
     "    {\"src\": \"The Internet\", \"text\": \"jumps over the lazy dog\", \"type\": \"Eng\", \"id\": \"42\", \"title\": \"Second Part\"}\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "thirty-specialist",
+   "id": "frozen-vinyl",
    "metadata": {},
    "source": [
-    "We will now use the following python script to converting the raw text data into `extractedNVblogs.json` format as a preparation for the next step. \n",
+    "We will now use the following python script to convert the raw text data into `extractedNVblogs.json` format as a preparation for the next step. \n",
     "\n",
     "\n",
     "    python create_loose_json.py --help\n",
@@ -128,7 +126,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "postal-conjunction",
+   "id": "surrounded-permit",
    "metadata": {},
    "outputs": [
     {
@@ -145,12 +143,12 @@
   },
   {
    "cell_type": "markdown",
-   "id": "iraqi-scoop",
+   "id": "limiting-window",
    "metadata": {},
    "source": [
     "3. Use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training.\n",
     "\n",
-    "We are now ready to feed `extractedNVblogs.json`  data to Megatron-LM's preprocess_data.py in order to further convert the data to mmap format.\n",
+    "We are now ready to feed `extractedNVblogs.json`  data to Megatron-LM's preprocess_data.py in order to convert the data to mmap format.\n",
     "\n",
     "The following two code blocks will convert the `extractedNVblogs.json` to `NVblog_text_document.bin` and `NVblog_text_document.idx`"
    ]
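For orientation, an invocation of preprocess_data.py along the lines of the Megatron-LM README of this era is sketched below; flag names can differ between Megatron-LM versions and the paths are illustrative, so check `python tools/preprocess_data.py --help` against your checkout before relying on it.

```python
# Hedged sketch of calling Megatron-LM's preprocess_data.py (paths and flags are assumptions).
import subprocess

cmd = [
    "python", "tools/preprocess_data.py",
    "--input", "extractedNVblogs.json",
    "--output-prefix", "NVblog",
    "--vocab", "gpt2-vocab.json",
    "--merge-file", "gpt2-merges.txt",
    "--tokenizer-type", "GPT2BPETokenizer",
    "--dataset-impl", "mmap",
    "--append-eod",
    "--workers", "8",
]
subprocess.run(cmd, check=True)   # produces <output-prefix>_text_document.bin / .idx
```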
@@ -158,7 +156,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "promotional-pillow",
+   "id": "numerical-tuner",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -172,7 +170,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "acting-patrick",
+   "id": "inner-match",
    "metadata": {},
    "outputs": [
     {
@@ -235,7 +233,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "short-siemens",
+   "id": "grand-demand",
    "metadata": {},
    "source": [
     "Below is the expected outputs :\n",
@@ -268,7 +266,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "informational-willow",
+   "id": "aware-insert",
    "metadata": {},
    "source": [
     "---\n",
@@ -279,7 +277,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "forbidden-emerald",
+   "id": "educational-ministry",
    "metadata": {},
    "source": [
     "-----\n",
@@ -288,7 +286,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "dedicated-russell",
+   "id": "genetic-gamma",
    "metadata": {},
    "source": [
     "-----\n",

File diff suppressed because it is too large
+ 46 - 47
ai/Megatron/English/Python/jupyter_notebook/Lab1-6_Observe_GPT_runs_vs_performance.ipynb


+ 24 - 24
ai/Megatron/English/Python/jupyter_notebook/Lab2-3_train_own_GPT2BPETokenizer.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "rapid-arctic",
+   "id": "agreed-commercial",
    "metadata": {},
    "source": [
     "# Train custom GPTBPE  Tokenzer \n",
@@ -10,7 +10,7 @@
     "\n",
     "## Learning Objectives\n",
     "\n",
-    "In order to include the vocabulary of the local language, in this case it is Swedish, into GPTBPE tokenizer, we need to be able to train GPTBPE Tokenizer on local language raw text data. The trained GPTBPE Tokenizer will produce it's own vocab.json and merges.txt files which will be compatible with Megatron-LM's GPTBPE Tokenizer. \n",
+    "In order to include the vocabulary of the local language (in this case it is Swedish) into the GPTBPE tokenizer, we need to be able to train GPTBPE Tokenizer on local language with raw text data. The trained GPTBPE Tokenizer will produce it's own vocab.json and merges.txt files which will be compatible with Megatron-LM's GPTBPE Tokenizer. \n",
     "\n",
     "Previously in `Lab2-1_acquiring_data.ipynb`, we have acquired our own Swedish raw text data extracted from data source språkbank.\n",
     "Therefore, the goal of this notebook, is to train our own GPTBPE Tokenizer on the Swedish raw text data obtained from `Lab2-1_acquiring_data.ipynb`.\n",
@@ -19,8 +19,8 @@
     "\n",
     "The two options are covered in this notebook :\n",
     "\n",
-    "    1. option 1 - load from pretrained vocab and merge files, then continue training with the new raw text.\n",
-    "    2. option 2 - train a GPT compatible tokenizer from scratch.\n",
+    "    1. Option 1 - load from pretrained vocab and merge files, then continue training with the new raw text.\n",
+    "    2. Option 2 - train a GPT compatible tokenizer from scratch.\n",
     "\n",
     "\n",
     "We will use HuggingFace's Tokenizer library and the trainer function in order to train our own GPTBPE Tokenizer with our own raw text data.\n",
@@ -32,7 +32,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "suspended-peace",
+   "id": "military-paintball",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -42,7 +42,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "active-artwork",
+   "id": "quarterly-candidate",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -53,7 +53,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "average-boundary",
+   "id": "seasonal-minister",
    "metadata": {},
    "source": [
     "A python script for training custom GPTBPE Tokenizer is provided for your convenience : \n",
@@ -76,16 +76,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "harmful-grounds",
+   "id": "julian-phoenix",
    "metadata": {},
    "source": [
-    "1. option 1 - load from pretrained vocab and merge files, then continue training with the new raw text."
+    "1. Option 1 - load from pretrained vocab and merge files, then continue training with the new raw text."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "perceived-jerusalem",
+   "id": "severe-technical",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -94,10 +94,10 @@
   },
   {
    "cell_type": "markdown",
-   "id": "weird-index",
+   "id": "spectacular-swing",
    "metadata": {},
    "source": [
-    "Below is the expected outputs :\n",
+    "Below is the expected output:\n",
     "        \n",
     "        [00:00:14] Compute merges                           ███████░ 51520    /    56000\n",
     "        [00:00:14] Compute merges                           ███████░ 52640    /    56000\n",
@@ -121,7 +121,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "emerging-music",
+   "id": "empty-cabin",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -131,16 +131,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "filled-blast",
+   "id": "dramatic-punishment",
    "metadata": {},
    "source": [
-    "2. option 2 - train a GPT compatible tokenizer from scratch."
+    "2. Option 2 - train a GPT compatible tokenizer from scratch."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "uniform-complaint",
+   "id": "imported-supervision",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -151,7 +151,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "disabled-pencil",
+   "id": "eleven-disclaimer",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -160,10 +160,10 @@
   },
   {
    "cell_type": "markdown",
-   "id": "tender-magic",
+   "id": "normal-adobe",
    "metadata": {},
    "source": [
-    "Below is the expected outputs :\n",
+    "Below is the expected output:\n",
     "    \n",
     "        [00:00:11] Compute merges                           ███████░ 30720    /    32000\n",
     "        [00:00:11] Compute merges                           ███████░ 31360    /    32000\n",
@@ -185,7 +185,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "criminal-leadership",
+   "id": "straight-sleeping",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -195,17 +195,17 @@
   },
   {
    "cell_type": "markdown",
-   "id": "orange-alignment",
+   "id": "disabled-accuracy",
    "metadata": {},
    "source": [
     "--- \n",
     "## Links and Resources\n",
-    "Don't forget to check out additional resources such as [HuggingFace Tokenizer Documentation](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) and [Train GPTBPE Tokenizer in your own langauge](https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171)."
+    "Don't forget to check out additional resources such as [HuggingFace Tokenizer Documentation](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) and [Train GPTBPE Tokenizer in your own language](https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171)."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "offshore-truck",
+   "id": "utility-duncan",
    "metadata": {},
    "source": [
     "-----\n",
@@ -214,7 +214,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "clinical-tuition",
+   "id": "invisible-reynolds",
    "metadata": {},
    "source": [
     "-----\n",

+ 40 - 40
ai/Megatron/English/Python/jupyter_notebook/Lab2-4_customize_process2mmap.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "fixed-species",
+   "id": "amateur-threat",
    "metadata": {},
    "source": [
     "## Customize preprocess_data.py\n",
@@ -10,11 +10,11 @@
     "\n",
     "## Learning Objectives\n",
     "\n",
-    "We fetched our own Swedish raw text data in `Lab2-1_acquiring_data.ipynb`, we learned how to find sentence boundary with custom functions in `Lab2-2_SentenceBoundary_and_Deduplicate.ipynb` , we also trained a GPTBPETokenizer and fitted it to our raw Swedish text with `Lab2-3_train_own_GPT2BPETokenizer.ipynb`. \n",
+    "We fetched our own Swedish raw text data in `Lab2-1_acquiring_data.ipynb`, we learned how to find sentence boundary with custom functions in `Lab2-2_SentenceBoundary_and_Deduplicate.ipynb`, and we trained a GPTBPETokenizer and fitted it to our raw Swedish text with `Lab2-3_train_own_GPT2BPETokenizer.ipynb`. \n",
     "\n",
-    "We are now ready to incorporate the custom sentence-splitter into preprocess_data.py and covert the raw Swedish text to, first json format, and then mmap format.\n",
+    "We are now ready to incorporate the custom sentence-splitter into preprocess_data.py and convert the raw Swedish text first to, json format, and then mmap format.\n",
     "\n",
-    "Therefore, the goal of this notebook is to integrate all knowledge gained from both Lab 1 as well as the above notebooks, and challenge ourselves to further customize the preprocess_data.py with a <a href=\"./Lab2-4_customize_process2mmap.ipynb#Custom-Sentence-Splitter\">custom sentence-splitter</a>  function, and in the process, convert the new raw Sweden text to mmap format.\n",
+    "Therefore, the goal of this notebook is to integrate all of the knowledge gained from both Lab 1 as well as the above notebooks, and challenge ourselves to further customize the preprocess_data.py with a <a href=\"./Lab2-4_customize_process2mmap.ipynb#Custom-Sentence-Splitter\">custom sentence-splitter</a>  function. In the process, we'll convert the new raw Sweden text to mmap format.\n",
     "\n",
     "More specifically, this notebook will cover the steps to :\n",
     "\n",
@@ -27,7 +27,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "comparative-render",
+   "id": "pleasant-brake",
    "metadata": {},
    "source": [
     "1. Convert the extracted raw Swedish text from webnyheter2013.txt to webnyheter2013.json."
@@ -36,7 +36,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "alien-spanking",
+   "id": "diverse-winning",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -45,10 +45,10 @@
   },
   {
    "cell_type": "markdown",
-   "id": "quiet-innocent",
+   "id": "independent-houston",
    "metadata": {},
    "source": [
-    "Below is the expected outputs :\n",
+    "Below is the expected output:\n",
     "\n",
     "        process 1000000 documents so far ...\n",
     "        example:  – Vi har en bra generation som spelat tillsammans ett tag .\n",
@@ -58,16 +58,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "relative-execution",
+   "id": "every-equilibrium",
    "metadata": {},
    "source": [
-    "2. Generate the mmap format files by default preprocess_data.py as the first step to ensure we have data necessary for the next notebook to run, in case time runs out."
+    "2. Generate the mmap format files by default preprocess_data.py as the first step to ensure we have the necessary data for the next notebook to run, in case time runs out."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "known-illness",
+   "id": "black-schedule",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -81,7 +81,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "least-platform",
+   "id": "cloudy-brighton",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -99,10 +99,10 @@
   },
   {
    "cell_type": "markdown",
-   "id": "lined-literacy",
+   "id": "driven-terminal",
    "metadata": {},
    "source": [
-    "Below is the expected outputs :\n",
+    "Below is the expected output:\n",
     "\n",
     "    Processed 1248300 documents (52998.601302473544 docs/s, 5.869853647730749 MB/s).\n",
     "    Processed 1248400 documents (53001.39142986273 docs/s, 5.870136451906283 MB/s).\n",
@@ -116,14 +116,14 @@
   },
   {
    "cell_type": "markdown",
-   "id": "periodic-treaty",
+   "id": "superior-stuff",
    "metadata": {},
    "source": [
-    "Now we get the default mmap files (xxx.bin and xxx.idx ) and therefore guarantee we have the data needed for the next notebook to run disregard whether we finish the mini-challenge or not. \n",
+    "Now we get the default mmap files (xxx.bin and xxx.idx ) and therefore guarantee we have the data needed for the next notebook to run regardless of whether we finish the mini-challenge or not. \n",
     "\n",
-    "We can now move on. We start by copy the old preprocess_data.py and rename it to `MYpreprocess_data.py`. \n",
+    "We can now move on. We start by copying the old preprocess_data.py and rename it to `MYpreprocess_data.py`. \n",
     "\n",
-    "Note: As best practice, one never overwrites original python script existed in the given repo directly, one copies the original python script and rename it to a new python script, then work on the new python script, in case of irreversible failures, one can always refer to the original python script, and start again.\n",
+    "Note: As best practice, one never overwrites an original python script that exist in the given repo directly. You should copy the original python script and rename it to a new python script, then work on the new python script. In case of irreversible failures, you can always refer to the original python script, and start again.\n",
     "\n",
     "The below code block will duplicate the preprocess_data.py script and renamed the copied python script into a new python script called `MYpreprocess_data.py`."
    ]
@@ -131,7 +131,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "norman-accreditation",
+   "id": "protective-topic",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -140,7 +140,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "maritime-bunny",
+   "id": "restricted-holiday",
    "metadata": {},
    "source": [
     "<a id=\"Custom-Sentence-Splitter\"></a>"
@@ -148,7 +148,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "foreign-advocacy",
+   "id": "funny-evaluation",
    "metadata": {},
    "source": [
     "The custom sentence-splitter `cut_sentence_with_quotation_marks` function is provided below for your convenience, please integrate this custom function into `MYpreprocess_data.py`."
@@ -157,7 +157,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "celtic-latter",
+   "id": "federal-midwest",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -192,7 +192,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "bacterial-consequence",
+   "id": "reported-silver",
    "metadata": {},
    "source": [
     "<a id=\"Mini-Challenge\"></a>"
@@ -200,11 +200,11 @@
   },
   {
    "cell_type": "markdown",
-   "id": "separated-occupation",
+   "id": "dress-container",
    "metadata": {},
    "source": [
     "---\n",
-    "## **Mini-Challenge ** - integrate the custom sentence splitter into MYpreprocess_data.py\n",
+    "## **Mini-Challenge** - integrate the custom sentence splitter into MYpreprocess_data.py\n",
     "\n",
     "Task : Modify and overwrite `MYpreprocess_data.py` below to incoporate the custom `cut_sentence_with_quotation_marks`\n",
     "\n",
@@ -222,7 +222,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "unknown-seven",
+   "id": "modern-bunny",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -435,18 +435,18 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ruled-service",
+   "id": "continuing-digest",
    "metadata": {},
    "source": [
-    "Below cell block specify all the input parameters in order to run `MYpreprocess_data.py`. \n",
+    "The below cell block specifies all the input parameters in order to run `MYpreprocess_data.py`. \n",
     "\n",
-    "Please do **NOT** modify anything in below cell."
+    "Please do **NOT** modify anything in the below cell."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "simplified-antarctica",
+   "id": "changed-indiana",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -459,20 +459,20 @@
   },
   {
    "cell_type": "markdown",
-   "id": "understanding-things",
+   "id": "unauthorized-manor",
    "metadata": {},
    "source": [
-    "Below code block is a ReRun cell to launch `MYpreprocess_data.py` and produce the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files, if the script runs successfully.\n",
+    "The below code block is a ReRun cell to launch `MYpreprocess_data.py` and produce the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files, if the script runs successfully.\n",
     "\n",
     "<a id=\"Rerun_Cell\"></a>\n",
     "\n",
-    "Go back and modify `MYpreprocess_data.py`, click on this shortcut link to <a href=\"./Lab2-4_customize_process2mmap.ipynb#MODIFY_CELL\">Jump to Modify MYpreprocess_data.py</a> "
+    "Go back and modify `MYpreprocess_data.py`. Click on this shortcut link to <a href=\"./Lab2-4_customize_process2mmap.ipynb#MODIFY_CELL\">Jump to Modify MYpreprocess_data.py</a> "
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "exclusive-region",
+   "id": "specific-presence",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -490,16 +490,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "armed-german",
+   "id": "compound-photographer",
    "metadata": {},
    "source": [
-    "Check whether these two files : `customSentenceSplit_text_document.bin` and `customSentenceSplit_text_document.idx` files were successfully generated and is in the correct folder under dataset."
+    "Check whether these two files : `customSentenceSplit_text_document.bin` and `customSentenceSplit_text_document.idx` files were successfully generated and are in the correct folder under dataset."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "fantastic-harmony",
+   "id": "quarterly-mediterranean",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -509,7 +509,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "final-stomach",
+   "id": "distinguished-latitude",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -519,7 +519,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "still-movement",
+   "id": "nervous-farming",
    "metadata": {},
    "source": [
     "-----\n",
@@ -528,7 +528,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "organized-mother",
+   "id": "neural-motor",
    "metadata": {},
    "source": [
     "-----\n",

+ 30 - 31
ai/Megatron/English/Python/jupyter_notebook/Lab2-5_run_Megatron_with_varying_config.ipynb

@@ -2,18 +2,17 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "alike-prisoner",
+   "id": "passing-intersection",
    "metadata": {},
    "source": [
     "## Scale up model size\n",
     "---\n",
-    "In previous notebooks, we downloaded and extracted our own Swedish raw text with `Lab2-1_acquiring_data.ipynb`; practiced filter, clean and deduplicate the raw text data with `Lab2-2_SentenceBoundary_and_Deduplicate.ipynb` ; trained our own GPTBPETokenizer and fitted to the raw Swedish text with `Lab2-3_train_own_GPT2BPETokenizer.ipynb`; converted the raw text to mmap format integrating a custom sentence-splitter in `Lab2-4_customize_process2mmap.ipynb`.\n",
+    "In previous notebooks, we downloaded and extracted our own Swedish raw text with `Lab2-1_acquiring_data.ipynb`; practiced filtering, cleaning and deduplicating the raw text data with `Lab2-2_SentenceBoundary_and_Deduplicate.ipynb` ; trained our own GPTBPETokenizer using the raw Swedish text with `Lab2-3_train_own_GPT2BPETokenizer.ipynb`; and converted the raw text to mmap format integrating a custom sentence-splitter in `Lab2-4_customize_process2mmap.ipynb`.\n",
     "\n",
-    "We have learned all the essential components in order to customize Megatron-LM's default workflow in order to accommodate to specific langauge needs ( in this case, it is Swedish ). The obvious next step is to train the Megatron-LM GPT model with the processed Swedish data. \n",
+    "We have learned all the essential components in order to customize Megatron-LM's default workflow  accommodating to specific language needs (in this case, Swedish). The obvious next step is to train the Megatron-LM GPT model with the processed Swedish data.\n",
+    "However, constrainedt by how much compute resources one could get, i.e. the number of GPUs available for the training job, there is an upper limit of how big a model you can train.\n",
     "\n",
-    "However, constraint by how much compute resources one could get, that is, the number of GPUs available for the training job, there is an upper limit of how big a model you can train.\n",
-    "\n",
-    "We will test ou thow big a model we could train with 2 X A100 GPUs 40GB, by presenting a Challenge!\n",
+    "We will test ou t what size model we could train with 2 X A100 GPUs 40GB, by presenting a challenge!\n",
     "\n",
     "## **Challenge ** - Go big or go home !\n",
     "\n",
@@ -21,23 +20,23 @@
     "    - 2 x A100 GPUs 40G is allocated for this challenge.\n",
     "    - Only the parameters in the **##### Begin/End of modifiable blocks #####** are allowed to be changed.\n",
     "    - Avoid OOM !\n",
-    "    - Training run must be finished and checkpoint must be saved successfully.\n",
+    "    - Training run must be finished and the checkpoint must be saved successfully.\n",
     "\n",
     "- Task : \n",
-    "        Given the above constraints, train as BIG a GPT model as possible.\n",
+    "        Given the above constraints, train the biggest GPT model as possible.\n",
     "\n",
-    "- Winning criteria : The biggest model wins given the above constraints.\n",
+    "- Winning criteria : The biggest model wins!\n",
     "\n",
     "Note 1: Post the parameters you changed into the **##### Begin/End of modifiable blocks #####**  on bootcamp's slack channels for verification.\n",
     "\n",
     "Note 2: We purposefully turned-off nsys profiling in this challenge, because calling nsys profiling will introduce a small overhead, which will impact the maximum achievable model size.\n",
     "\n",
-    "Go directly to the code block and modify training configuration, click here to <a href=\"./Lab2-5_run_Megatron_with_varying_config.ipynb#MODIFY_CELL\">Jump to Code Cell and Modify Training Config</a> "
+    "Go directly to the code block and modify the training configuration, click here to <a href=\"./Lab2-5_run_Megatron_with_varying_config.ipynb#MODIFY_CELL\">Jump to Code Cell and Modify Training Config</a> "
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "material-finland",
+   "id": "dated-lithuania",
    "metadata": {},
    "source": [
     "\n",
@@ -47,10 +46,10 @@
   },
   {
    "cell_type": "markdown",
-   "id": "driven-drawing",
+   "id": "narrative-population",
    "metadata": {},
    "source": [
-    "Modify and rerun the code blocks below to obtain a even bigger GPT model. \n",
+    "Modify and rerun the code blocks below to obtain an even bigger GPT model. \n",
     "\n",
     "\n",
     "<a id=\"MODIFY_CELL\"></a>\n",
@@ -59,7 +58,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "proprietary-marketing",
+   "id": "increased-difference",
    "metadata": {},
    "source": [
     "<a id=\"MODIFY_CELL\"></a>"
@@ -67,16 +66,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "adjustable-engineer",
+   "id": "opposed-population",
    "metadata": {},
    "source": [
-    "Always clean the checkpoint folder to ensure trainining start from scratch."
+    "Always clean the checkpoint folder to ensure training starts from scratch."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "other-parts",
+   "id": "unsigned-banks",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -87,7 +86,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "invisible-pepper",
+   "id": "relative-count",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -163,14 +162,14 @@
   },
   {
    "cell_type": "markdown",
-   "id": "formal-turner",
+   "id": "personalized-permit",
    "metadata": {},
    "source": [
-    "Check how big is your model. By modify the parameters in the [params_cnt.sh](./params_cnt.sh) to match the training parames above.\n",
+    "Check how big your model is by modify the parameters in the [params_cnt.sh](./params_cnt.sh) to match the training parames above.\n",
     "\n",
-    "I got 1.6 Billion :)  what about you ?\n",
+    "I got 1.6 illion ! :)  what about you ?\n",
     "\n",
-    "Modify the [params count](./params_cnt.sh) accoring to your training configuration.\n",
+    "Modify the [params count](./params_cnt.sh) according to your training configuration.\n",
     "\n",
     "After modification, run the below bash script to obtain the model size."
    ]
@@ -178,7 +177,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "welcome-donor",
+   "id": "stuffed-possibility",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -187,7 +186,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "noticed-trinity",
+   "id": "nominated-chuck",
    "metadata": {},
    "source": [
     "Below is an example of expected outputs:\n",
@@ -198,12 +197,12 @@
   },
   {
    "cell_type": "markdown",
-   "id": "convenient-ontario",
+   "id": "primary-praise",
    "metadata": {},
    "source": [
     "Re-run this cell below to get an even bigger GPT model\n",
     "\n",
-    "Remember to modify the [params count](./params_cnt.sh) to check how big is your model.\n",
+    "Remember to modify the [params count](./params_cnt.sh) to check how big your model is.\n",
     "\n",
     "Jump back and edit the SV_GPT_goingBIG.sh, click here to \n",
     "<a href=\"./Lab2-5_run_Megatron_with_varying_config.ipynb#MODIFY_CELL\">Jump back to modify and overwrite SV_GPT_goingBIG.sh </a> \n",
@@ -213,7 +212,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "representative-kentucky",
+   "id": "divine-sierra",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -222,7 +221,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "unnecessary-african",
+   "id": "perfect-fisher",
    "metadata": {},
    "source": [
     "Below is an example of expected outputs:\n",
@@ -245,7 +244,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "pretty-handle",
+   "id": "chief-costume",
    "metadata": {},
    "source": [
     "---\n",
@@ -256,7 +255,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "caroline-induction",
+   "id": "collectible-turkey",
    "metadata": {},
    "source": [
     "-----\n",
@@ -265,7 +264,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ranking-pillow",
+   "id": "legendary-forestry",
    "metadata": {},
    "source": [
     "-----\n",

+ 32 - 32
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb

@@ -2,27 +2,27 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "micro-village",
+   "id": "general-milwaukee",
    "metadata": {},
    "source": [
-    "## Website scrapping\n",
+    "## Website scraping\n",
     "\n",
-    "It is strongly recommanded to consult with local legal department for compliance before proceeding to scrap content from websites/webpages with permission.\n",
+    "It is strongly recommended to consult with your local legal department for compliance before proceeding to scrape content from websites/webpages with permission.\n",
     "\n",
-    "There is no one-fits-all website scrapping solution, when applying the following to other websites/webpages, please modify accoridngly.\n",
+    "There is no one-fits-all website scrapping solution, when applying the following to other websites/webpages, please modify accordingly.\n",
     "\n",
     "---\n",
     "\n",
     "## Learning Objectives\n",
     "The goal of this lab is to obtain raw text data via webscrapping.\n",
     "\n",
-    "To run through Megatron-LM default workflow in order to train a GPT model, we will need to obtain data first. The outcome of this notebook is the raw text data which will be used for subsequent tasks in Lab1.\n",
+    "To run through Megatron-LM's default workflow in order to train a GPT model, we will need to obtain data first. The outcome of this notebook is the raw text data which will be used for subsequent tasks in Lab1.\n",
     "\n",
     "This notebook covers the below steps : \n",
     "\n",
     "    1. Install necessary python libraries and download 2 python scripts which will be used for website crawling.\n",
     "    2. Crawl links from a seeded url and write to a text file.\n",
-    "    3. Remove incompliant links from the text file in order to ensure legal compliancy.\n",
+    "    3. Remove incompliant links from the text file in order to ensure legal compliance.\n",
     "    4. Fetch the corresponding webpage from each approved url and write it to html format.\n",
     "    5. Parse the html file and extract raw text and write to disk.\n",
     "    6. Move the `extractedNVblogs.txt` to the correct folder under **dataset**.\n",
@@ -32,16 +32,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "listed-stopping",
+   "id": "requested-committee",
    "metadata": {},
    "source": [
-    "1. install python libraries and download 2 python scripts which will be used for website crawling."
+    "1. Install python libraries and download 2 python scripts which will be used for website crawling."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "endangered-vessel",
+   "id": "material-inspection",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -57,7 +57,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "designed-insight",
+   "id": "stainless-parent",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -68,7 +68,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "knowing-andorra",
+   "id": "fifth-scanner",
    "metadata": {},
    "source": [
     "2. Crawl links from a seeded url and write to a text file named `NVdevblog_urls.txt`"
@@ -77,7 +77,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "natural-commander",
+   "id": "adopted-alignment",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -87,10 +87,10 @@
   },
   {
    "cell_type": "markdown",
-   "id": "incredible-lunch",
+   "id": "excellent-disorder",
    "metadata": {},
    "source": [
-    "3. Remove incompliant links from the text file in order to ensure legal compliancy.\n",
+    "3. Remove incompliant links from the text file in order to ensure legal compliance.\n",
     "\n",
     "    Normally, one should check with the legal and remove each incompliant link.\n",
     "\n",
@@ -100,7 +100,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "angry-cattle",
+   "id": "european-arkansas",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -122,7 +122,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "defined-interim",
+   "id": "basic-leader",
    "metadata": {},
    "source": [
     "4. Fetch the corresponding webpage from each approved url and write it to `XXX.html` format."
@@ -131,7 +131,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "competitive-tower",
+   "id": "valid-driver",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -141,7 +141,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "guided-certification",
+   "id": "pretty-contrast",
    "metadata": {},
    "source": [
     "Below is an example of expected outputs :\n",
@@ -157,7 +157,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "neural-method",
+   "id": "bridal-export",
    "metadata": {},
    "source": [
     "5. Parse the html file and extract the raw text data, which will be written to `extractedNVblogs.txt`."
@@ -166,7 +166,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "affected-albania",
+   "id": "oriental-wound",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -202,16 +202,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "first-research",
+   "id": "adverse-republic",
    "metadata": {},
    "source": [
-    "6. Move the `extractedNVblogs.txt` to the correct folder under **dataset**. This file `extractedNVblogs.txt` will be used in subsequent notebooks in lab1."
+    "6. Move the `extractedNVblogs.txt` to the correct folder under the **dataset** folder. This file `extractedNVblogs.txt` will be used in subsequent notebooks in lab1."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "fourth-certificate",
+   "id": "fifty-scratch",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -220,16 +220,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "cultural-retro",
+   "id": "infrared-jonathan",
    "metadata": {},
    "source": [
-    "**Note:** Please run below cell to free up space."
+    "**Note:** Please run the below cell to free up space."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 14,
-   "id": "stretch-pattern",
+   "id": "federal-detroit",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -241,7 +241,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "black-courage",
+   "id": "boolean-arrow",
    "metadata": {},
    "source": [
     "Verify `extractedNVblogs.txt` is successfully moved to the correct folder."
@@ -250,7 +250,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "standing-bridges",
+   "id": "smart-equivalent",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -259,7 +259,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "prepared-ballot",
+   "id": "partial-roller",
    "metadata": {},
    "source": [
     "Below is an example of expected outputs :\n",
@@ -269,7 +269,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "cellular-termination",
+   "id": "liberal-scheme",
    "metadata": {},
    "source": [
     "--- \n",
@@ -280,7 +280,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "complex-valentine",
+   "id": "settled-insider",
    "metadata": {},
    "source": [
     "-----\n",
@@ -289,7 +289,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "flush-bruce",
+   "id": "infinite-kingston",
    "metadata": {},
    "source": [
     "--- \n",

+ 21 - 21
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-1_acquiring_data.ipynb

@@ -2,37 +2,37 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "fabulous-yield",
+   "id": "exotic-testing",
    "metadata": {},
    "source": [
     "## Acquire Swedish data \n",
     "---\n",
     "\n",
-    "For data licensing and privacy concerns, we will not providing training data in this bootcamp.\n",
+    "For data licensing and privacy concerns, we will not provide training data in this bootcamp.\n",
     "\n",
-    "However, we do need data in order to proceed the customization of Megatron-LM's workflow for local language needs ( in this case, it is Swedish ), hence, the first thing we need to do, is to acquire Swedish raw text data.\n",
+    "However, we do need data in order to proceed the customization of Megatron-LM's workflow for local language needs ( in this case, it is Swedish ), hence, the first thing we need to do, is acquiring some Swedish raw text data.\n",
     "\n",
     "This notebook is therefore provided to assist acquisition of Swedish raw text data from språkbanken.\n",
     "\n",
-    "following the steps given below -\n",
+    "Gollowing the steps given below:\n",
     "\n",
     "    1. Download data via wget and download the python script which will be used to extract the Swedish text.\n",
     "    \n",
     "    2. Unzip the data using bunzip and move the data to the correct folder under dataset.\n",
     "    \n",
-    "    3. A custom function is provided in order to extract raw txt file from xml file and move the text file to the correct folder under dataset.\n",
+    "    3. A custom function is provided in order to extract the raw txt file from the xml file and move the text file to the correct folder under dataset.\n",
     "\n",
     "\n",
     "\n",
     "**About the data source : språkbank**  :\n",
     "\n",
     "This data belongs to Språkbanken, Språkbanken Text is a research unit and part of the National Language Bank, a national e-infrastructure to support research based on linguistic data.\n",
-    "[read more about språkbank](https://spraakbanken.gu.se/om)\n"
+    "[Learn more about språkbank here](https://spraakbanken.gu.se/om)\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "wrapped-fields",
+   "id": "disciplinary-sheet",
    "metadata": {},
    "source": [
     "1. Download data via wget and download the python script which will be used to extract the Swedish text."
@@ -41,7 +41,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "silent-writer",
+   "id": "adverse-degree",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -51,16 +51,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "sunset-moisture",
+   "id": "antique-extra",
    "metadata": {},
    "source": [
-    "2. unzip the data using bunzip and move the data to the correct folder under dataset."
+    "2. Unzip the data using bunzip and move the data to the correct folder under dataset."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "satisfied-absolute",
+   "id": "rolled-banking",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -70,7 +70,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "detected-volleyball",
+   "id": "isolated-mauritius",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -80,7 +80,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "expired-compact",
+   "id": "driving-wagon",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -89,16 +89,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "minimal-charm",
+   "id": "weighted-hygiene",
    "metadata": {},
    "source": [
-    "3. A custom function is provided in order to extract raw txt file from xml file and move the text file to the correct folder under dataset."
+    "3. A custom function is provided in order to extract the raw txt file from the xml file and move the text file to the correct folder under dataset."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "hungry-fundamental",
+   "id": "encouraging-business",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -135,16 +135,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "thick-consortium",
+   "id": "built-subcommittee",
    "metadata": {},
    "source": [
-    "Verify the output `webnyheter2013.txt` exist under `../../../../dataset/SV/`, we need this raw text file to proceed the subsequent notebooks for Lab2."
+    "Verify the output `webnyheter2013.txt` exists under `../../../../dataset/SV/`. We need this raw text file to proceed with the subsequent notebooks for Lab2."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "bronze-interpretation",
+   "id": "conservative-andorra",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -153,7 +153,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "pleased-saskatchewan",
+   "id": "modern-handbook",
    "metadata": {},
    "source": [
     "-----\n",
@@ -162,7 +162,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "local-limit",
+   "id": "portable-excuse",
    "metadata": {},
    "source": [
     "-----\n",

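The notebook's code cells for steps 1–3 are elided in this diff. As a rough sketch of their general shape (not the notebook's actual code), downloading the corpus archive, decompressing it, extracting raw text from the XML, and placing `webnyheter2013.txt` under `dataset/SV/` could look like the snippet below; the corpus URL and the XML element layout are assumptions for illustration.

```python
# Illustrative sketch only -- not the notebook's actual cells.
# Download the corpus, decompress the .bz2 archive, extract raw text
# from the XML, and place the result under dataset/SV.
import bz2
import shutil
import urllib.request
import xml.etree.ElementTree as ET

corpus_url = "https://spraakbanken.gu.se/path/to/webbnyheter2013.xml.bz2"  # hypothetical URL
urllib.request.urlretrieve(corpus_url, "webnyheter2013.xml.bz2")           # step 1: download

with bz2.open("webnyheter2013.xml.bz2", "rb") as src, open("webnyheter2013.xml", "wb") as dst:
    shutil.copyfileobj(src, dst)                                            # step 2: unzip

root = ET.parse("webnyheter2013.xml").getroot()                             # step 3: extract text
with open("webnyheter2013.txt", "w", encoding="utf-8") as out:
    for sentence in root.iter("sentence"):         # assumes one <sentence> element per sentence
        words = [w.text for w in sentence.iter("w") if w.text]
        out.write(" ".join(words) + "\n")

shutil.move("webnyheter2013.txt", "../../../../dataset/SV/webnyheter2013.txt")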
File diff suppressed because it is too large
+ 88 - 87
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-2_SentenceBoundary_and_Deduplicate.ipynb


+ 3 - 3
ai/Megatron/README.md

@@ -23,7 +23,7 @@ Although this boot camp is designed to run on a computing cluster with [NVIDIA S
 It is possible to run it in an environment where you have access to 2 X A100 GPUs 40 GB with NVLink/Switch.
 
 ### Scenario 1 : local station with 2 X A100 GPU 40 GB and NVLINK 
-When docker pull & run is allowed, and the GPUs are directly accessbile to the users in the environment.
+When docker pull & run is allowed, and the GPUs are directly accessible to the users in the environment.
 
 #### Step 1 - Clone the gpubootcamp repo to obtain the scripts and notebooks.
 `git clone https://github.com/gpuhackathons-org/gpubootcamp.git &&
@@ -49,7 +49,7 @@ Navigate to /gpubootcamp/ai/Megatron/English/Python/ and open the `Start_Here.ip
 ### Scenario 2 : Accessing the Jupiter lab with Singularity + Slurm + SSH port forwarding is allowed
 A User Guide is often provided when one requests for access to a computing cluster with [NVIDIA SuperPOD Architecture](https://resources.nvidia.com/en-us-auto-datacenter/nvpod-superpod-wp-09). However, each compute cluster might have slight deviations to the reference architecture on various levels, HW and/or SW as well as the resource management control setups. 
 
-It is likely the below steps will need to be adjusted, in which case, the user will need to consult the cluster admin or cluster operator to get help in debugging environment preparation in order to prepare for the boot camp materiel to run.
+It is likely that the steps below will need to be adjusted; in that case, the user will need to consult the cluster admin or cluster operator for help preparing the environment so that the boot camp material can run.
 
 #### Step 1 - Clone the gpubootcamp repo to obtain the scripts and notebooks.
 ` clone https://github.com/gpuhackathons-org/gpubootcamp.git`
@@ -84,7 +84,7 @@ Navigate to gpubootcamp/ai/Megatron/English/Python/ and open the `Start_Here.ipy
 ## Known Issues
 
 Q. "ResourceExhaustedError" error is observed while running the labs
-A. Currently the batch size and network model is set to consume 40 GB GPU memory. In order to use the labs without any modifications it is recommended to have GPU with minimum 40 GB GPU memory. Else the users can play with batch size to reduce the memory footprint, also ensure you have NVLINK/Switch enabled in the environment.Do not enable MIG mode when requesting A100 GPUs as resources.
+A. Currently the batch size and network model are set to consume 40 GB of GPU memory. In order to use the labs without any modifications, it is recommended to use a GPU with a minimum of 40 GB of GPU memory. Otherwise, users can reduce the batch size to lower the memory footprint; also ensure NVLINK/Switch is enabled in the environment. Do not enable MIG mode when requesting A100 GPUs as resources.
 
 - Please go through the list of existing bugs/issues or file a new issue at [Github](https://github.com/gpuhackathons-org/gpubootcamp/issues).
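A quick way to confirm the environment meets these requirements before launching the labs is to query `nvidia-smi`. The short sketch below is not part of the bootcamp scripts; it simply checks total GPU memory and NVLink status using standard `nvidia-smi` queries.

```python
# Environment sanity check (not part of the bootcamp scripts):
# confirm each GPU exposes roughly 40 GB of memory and that NVLink is active.
import subprocess

mem = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
print(mem)  # expect two A100 entries showing roughly 40 GB each

nvlink = subprocess.run(
    ["nvidia-smi", "nvlink", "--status"],
    capture_output=True, text=True, check=True,
).stdout
print(nvlink)  # expect active links (e.g. "25 GB/s") for each GPU
```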