
re-wording and re-formatting

zenodia 3 years ago
parent
commit
6f0af5cf87

+ 105 - 177
ai/Megatron/English/Python/Start_Here.ipynb

@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Megatron GPT Bootcamp\n",
+    "## Megatron GPT Bootcamp\n",
     "\n",
     "## Learning objectives"
    ]
@@ -13,94 +13,41 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This objective of this bootcamp is designed to onborad you with NVIDIA [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) in a step-wised manner. We will give you the necessary tools and knoweldge to kick-start training your own language model. \n",
+    "The objective of this bootcamp is designed for training very large language models with NVIDIA [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) in a step-wised manner. \n",
     "\n",
-    "More specifically, In Day 2, We will learn the default [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)'s workflow, highlighting :\n",
+    "There are two labs, each with a focus point. \n",
     "\n",
-    "   - Given a fixed dataset ( measured by # of tokens ) calculate compute needs in order to plan training runs and request resources.\n",
+    "In Lab 1, we will learn the default [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) workflow, highlighting :\n",
+    "\n",
+    "   - How to calculate time-to-compute needs for resource planning.\n",
     "    \n",
-    "   - Understanding Megatron-LM's core engine - Model Parallel Unit, this is the key which enable the possibility to train model with up to 1 trillion parameters on a superPOD.\n",
+    "   - Understanding Megatron-LM's core engine - Model Parallel Unit(MPU)\n",
     "    \n",
-    "   - Profiling : as we scale, it is important to maintain the performance of GPUs utilization across multi-gpus or multi-node runs.\n",
+    "   - Profiling : core concepts on GPUs performance across multi-gpus and/or multi-node runs.\n",
+    "\n",
+    "In Lab 2, the focus is shifted to the **customization** of [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) workflow. We will walk through and exercise steps for customization of the Megatron-LM's workflow in order to address to local langauge needs.  \n",
     "\n",
-    "In Day 3, we will shift our focus on all the customization we need to incoporate into [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)'s workflow, in order to cater for local langauge needs, in this case Swedish. We will give recommandations which can be optionally applied to your workflow and include some practical, useful scripts to help you kick-start your own journey in training local langauge Megatron GPT2/3 models. \n",
     "\n",
     "* Standard: Python\n",
     "* Frameworks: Pytorch + Megatron-LM \n",
     "\n",
-    "It is required to have more than one GPU for the bootcamp and we recommend using a [DGX](https://www.nvidia.com/en-in/data-center/dgx-systems/) like cluster with [NVLink / NVSwitch](https://www.nvidia.com/en-in/data-center/nvlink/) support.\n",
+    "It is required to have more than one GPU for this bootcamp.\n",
     "\n",
-    "Let's start with testing the GPUs you are running the code on in this bootcamp."
+    "This bootcamp is tested on 2 x A100 GPUS with 40G memory. One should also have [NVLink / NVSwitch](https://www.nvidia.com/en-in/data-center/nvlink/)."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "---\n",
-    "### Check # of GPUs you have and GPU memory capacity \n",
-    "\n",
-    "            Wed Aug 25 07:03:55 2021       \n",
-    "        +-----------------------------------------------------------------------------+\n",
-    "        | NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.2     |\n",
-    "        |-------------------------------+----------------------+----------------------+\n",
-    "        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |\n",
-    "        | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |\n",
-    "        |                               |                      |               MIG M. |\n",
-    "        |===============================+======================+======================|\n",
-    "        |   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |\n",
-    "        | N/A   34C    P0    57W / 300W |      0MiB / 16160MiB |      0%      Default |\n",
-    "        |                               |                      |                  N/A |\n",
-    "        +-------------------------------+----------------------+----------------------+\n",
-    "        |   1  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |\n",
-    "        | N/A   30C    P0    41W / 300W |      0MiB / 16160MiB |      0%      Default |\n",
-    "        |                               |                      |                  N/A |\n",
-    "        +-------------------------------+----------------------+----------------------+\n",
-    "        +-----------------------------------------------------------------------------+\n",
-    "        | Processes:                                                                  |\n",
-    "        |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |\n",
-    "        |        ID   ID                                                   Usage      |\n",
-    "        |=============================================================================|\n",
-    "        |  No running processes found                                                 |\n",
-    "        +-----------------------------------------------------------------------------+\n"
+    "Start by checking available gpus in the environment using nvidia-smi "
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Wed Sep 15 09:14:15 2021       \n",
-      "+-----------------------------------------------------------------------------+\n",
-      "| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |\n",
-      "|-------------------------------+----------------------+----------------------+\n",
-      "| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |\n",
-      "| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |\n",
-      "|                               |                      |               MIG M. |\n",
-      "|===============================+======================+======================|\n",
-      "|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |\n",
-      "| N/A   24C    P0    57W / 400W |      0MiB / 40536MiB |      4%      Default |\n",
-      "|                               |                      |             Disabled |\n",
-      "+-------------------------------+----------------------+----------------------+\n",
-      "|   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                    0 |\n",
-      "| N/A   24C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |\n",
-      "|                               |                      |             Disabled |\n",
-      "+-------------------------------+----------------------+----------------------+\n",
-      "                                                                               \n",
-      "+-----------------------------------------------------------------------------+\n",
-      "| Processes:                                                                  |\n",
-      "|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |\n",
-      "|        ID   ID                                                   Usage      |\n",
-      "|=============================================================================|\n",
-      "|  No running processes found                                                 |\n",
-      "+-----------------------------------------------------------------------------+\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "!nvidia-smi"
    ]
@@ -109,66 +56,41 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "---\n",
-    "### Verify NVlink is active \n",
-    "OUTPUT should look simialr to the below -\n",
+    "Verify you have 2 x A100 GPUs, each with 40G memory, below is an example of expected outputs : \n",
     "\n",
-    "        GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-b29deceb-3745-51d2-2cf3-807ea8ac8e60)\n",
-    "             Link 0: 25.781 GB/s\n",
-    "             Link 1: 25.781 GB/s\n",
-    "             Link 2: 25.781 GB/s\n",
-    "             Link 3: 25.781 GB/s\n",
-    "             Link 4: 25.781 GB/s\n",
-    "             Link 5: 25.781 GB/s\n",
-    "        GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-4de46420-3e95-182f-c0c3-d488dda562d8)\n",
-    "             Link 0: 25.781 GB/s\n",
-    "             Link 1: 25.781 GB/s\n",
-    "             Link 2: 25.781 GB/s\n",
-    "             Link 3: 25.781 GB/s\n",
-    "             Link 4: 25.781 GB/s\n",
-    "             Link 5: 25.781 GB/s"
+    "            Wed Sep 15 09:14:15 2021       \n",
+    "            +-----------------------------------------------------------------------------+\n",
+    "            | NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |\n",
+    "            |-------------------------------+----------------------+----------------------+\n",
+    "            | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |\n",
+    "            | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |\n",
+    "            |                               |                      |               MIG M. |\n",
+    "            |===============================+======================+======================|\n",
+    "            |   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |\n",
+    "            | N/A   24C    P0    57W / 400W |      0MiB / 40536MiB |      4%      Default |\n",
+    "            |                               |                      |             Disabled |\n",
+    "            +-------------------------------+----------------------+----------------------+\n",
+    "            |   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                    0 |\n",
+    "            | N/A   24C    P0    53W / 400W |      0MiB / 40536MiB |      0%      Default |\n",
+    "            |                               |                      |             Disabled |\n",
+    "            +-------------------------------+----------------------+----------------------+\n",
+    "\n",
+    "            +-----------------------------------------------------------------------------+\n",
+    "            | Processes:                                                                  |\n",
+    "            |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |\n",
+    "            |        ID   ID                                                   Usage      |\n",
+    "            |=============================================================================|\n",
+    "            |  No running processes found                                                 |\n",
+    "            +-----------------------------------------------------------------------------+\n"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "GPU 0: A100-SXM4-40GB (UUID: GPU-2e4d2105-718d-3b94-6f0f-25c148681e83)\n",
-      "\t Link 0: 25 GB/s\n",
-      "\t Link 1: 25 GB/s\n",
-      "\t Link 2: 25 GB/s\n",
-      "\t Link 3: 25 GB/s\n",
-      "\t Link 4: 25 GB/s\n",
-      "\t Link 5: 25 GB/s\n",
-      "\t Link 6: 25 GB/s\n",
-      "\t Link 7: 25 GB/s\n",
-      "\t Link 8: 25 GB/s\n",
-      "\t Link 9: 25 GB/s\n",
-      "\t Link 10: 25 GB/s\n",
-      "\t Link 11: 25 GB/s\n",
-      "GPU 1: A100-SXM4-40GB (UUID: GPU-49615223-919e-6f9f-ad79-69d86bc1a13b)\n",
-      "\t Link 0: 25 GB/s\n",
-      "\t Link 1: 25 GB/s\n",
-      "\t Link 2: 25 GB/s\n",
-      "\t Link 3: 25 GB/s\n",
-      "\t Link 4: 25 GB/s\n",
-      "\t Link 5: 25 GB/s\n",
-      "\t Link 6: 25 GB/s\n",
-      "\t Link 7: 25 GB/s\n",
-      "\t Link 8: 25 GB/s\n",
-      "\t Link 9: 25 GB/s\n",
-      "\t Link 10: 25 GB/s\n",
-      "\t Link 11: 25 GB/s\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
-    "### verify nvlink status\n",
+    "# verify nvlink status\n",
     "!nvidia-smi nvlink --status"
    ]
   },
@@ -176,42 +98,41 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "---\n",
-    "### Verify Profiling Capability \n",
-    "OUTPUT should look something simialr to the below\n",
-    "note that we want all environment check pass ( = OK or available )\n",
+    "Verify NVlink is active, below is an example of expected outputs : \n",
     "\n",
-    "            Sampling Environment Check\n",
-    "            Linux Kernel Paranoid Level = 1: OK\n",
-    "            Linux Distribution = Ubuntu\n",
-    "            Linux Kernel Version = 4.15.0-112-generic: OK\n",
-    "            Linux perf_event_open syscall available: OK\n",
-    "            Sampling trigger event available: OK\n",
-    "            Intel(c) Last Branch Record support: Available\n",
-    "            Sampling Environment: OK"
+    "        GPU 0: A100-SXM4-40GB (UUID: GPU-2e4d2105-718d-3b94-6f0f-25c148681e83)\n",
+    "             Link 0: 25 GB/s\n",
+    "             Link 1: 25 GB/s\n",
+    "             Link 2: 25 GB/s\n",
+    "             Link 3: 25 GB/s\n",
+    "             Link 4: 25 GB/s\n",
+    "             Link 5: 25 GB/s\n",
+    "             Link 6: 25 GB/s\n",
+    "             Link 7: 25 GB/s\n",
+    "             Link 8: 25 GB/s\n",
+    "             Link 9: 25 GB/s\n",
+    "             Link 10: 25 GB/s\n",
+    "             Link 11: 25 GB/s\n",
+    "        GPU 1: A100-SXM4-40GB (UUID: GPU-49615223-919e-6f9f-ad79-69d86bc1a13b)\n",
+    "             Link 0: 25 GB/s\n",
+    "             Link 1: 25 GB/s\n",
+    "             Link 2: 25 GB/s\n",
+    "             Link 3: 25 GB/s\n",
+    "             Link 4: 25 GB/s\n",
+    "             Link 5: 25 GB/s\n",
+    "             Link 6: 25 GB/s\n",
+    "             Link 7: 25 GB/s\n",
+    "             Link 8: 25 GB/s\n",
+    "             Link 9: 25 GB/s\n",
+    "             Link 10: 25 GB/s\n",
+    "             Link 11: 25 GB/s"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "\n",
-      "Sampling Environment Check\n",
-      "Linux Kernel Paranoid Level = 2: OK\n",
-      "Linux Distribution = Ubuntu\n",
-      "Linux Kernel Version = 4.18.0-305.12.1.el8_4.x86_64: OK\n",
-      "Linux perf_event_open syscall available: OK\n",
-      "Sampling trigger event available: OK\n",
-      "Intel(c) Last Branch Record support: Not Available\n",
-      "Sampling Environment: OK\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "# verify profiling capacility \n",
     "!nsys status -e"
@@ -222,7 +143,24 @@
    "metadata": {},
    "source": [
     "---\n",
-    "### Making Placeholder folders for dataset"
+    "Verify Profiling Capability, the expected output should look something simialr to the below\n",
+    "\n",
+    "            Sampling Environment Check\n",
+    "            Linux Kernel Paranoid Level = 2: OK\n",
+    "            Linux Distribution = Ubuntu\n",
+    "            Linux Kernel Version = 4.18.0-305.12.1.el8_4.x86_64: OK\n",
+    "            Linux perf_event_open syscall available: OK\n",
+    "            Sampling trigger event available: OK\n",
+    "            Intel(c) Last Branch Record support: Not Available\n",
+    "            Sampling Environment: OK"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "To start with, we need to create placeholder for dataset. We are going to populate these folders later."
    ]
   },
   {
@@ -247,37 +185,27 @@
    "metadata": {},
    "source": [
     "---\n",
-    "## Create Your Own Data - Web Crawling \n",
-    "It is mandatory to fetch your own data via web crawling NVIDIA blogs webpages, extracting raw text from the webpage. \n",
-    "Please make sure you go through the notebook **[link here](./jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-Website_scrapping.ipynb)** to scrape raw text from NVIDIA blogs' webpages. "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "---\n",
     "### Tutorial Outline\n",
     "\n",
     "The following contents will be covered during the Bootcamp :\n",
     "\n",
-    "- **Outlines of Day 2**\n",
-    "    Megatron 101 in half a day \n",
-    "    - [Estimate hours/days needed to execute one end-to-end run per Megatron configuration](./jupyter_notebook/Day2-1_EstimateComputeDaysNeeded.ipynb)\n",
-    "    - [Understanding the core of Megatron - mpu ](./jupyter_notebook/Day2-2_MegatronFundementals.ipynb)\n",
-    "    - [About GPT's tokenizer](./jupyter_notebook/Day2-3_GPT_vocab_merge_files.ipynb)\n",
-    "    - [jsonfy and convert to mmap format](./jupyter_notebook/Day2-4_jsonfy_and_process2mmap.ipynb)\n",
-    "    - [Megatron runs vs config](./jupyter_notebook/Day2-5_Observe_GPT_runs_vs_performance.ipynb)\n",
-    "    - [challenge - the best profiler](./jupyter_notebook/Day2-5_Observe_GPT_runs_vs_performance.ipynb#TheChallenge)\n",
+    "- **Outlines of Lab 1**\n",
+    "    Megatron 101 in half a day - Please go through the below notebooks sequentially.\n",
+    "    1. [WebCrawling](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb)\n",
+    "    2. [Estimate hours/days needed to execute one end-to-end run per Megatron configuration](./jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb)\n",
+    "    3. [Understanding the core of Megatron - mpu ](./jupyter_notebook/Lab1-3_MegatronFundementals.ipynb)\n",
+    "    4. [About GPT's tokenizer](./jupyter_notebook/Lab1-4_GPT_vocab_merge_files.ipynb)\n",
+    "    5. [jsonfy and convert to mmap format](./jupyter_notebook/Lab1-5_jsonfy_and_process2mmap.ipynb)\n",
+    "    6. [Megatron runs vs config](./jupyter_notebook/Lab1-6_Observe_GPT_runs_vs_performance.ipynb)\n",
     "\n",
     "- **Outlines of Day 3**\n",
-    "    Getting started on training your own Megatron GPT models !\n",
-    "    - [Fetch and extract Swedish data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-1_acquiring_data.ipynb)\n",
-    "    - [Find sentence boundary and deduplicate your data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-2_SentenceBoundary_and_Deduplicate.ipynb)\n",
-    "        - [mini challenge - approaching groundtruth](./jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-2_SentenceBoundary_and_Deduplicate.ipynb#TheChallenge)\n",
-    "    - [Train your own GPTBPE Tokenizer on your own data ](./jupyter_notebook/Day3-3_train_own_GPT2BPETokenizer.ipynb)\n",
-    "    - [customize preprocess data python script and convert to mmap](./jupyter_notebook/Day3-4_customize_process2mmap.ipynb)\n",
-    "    - [The Challenge - Go Big or go home!](./jupyter_notebook/Day3-5_run_Megatron_with_varying_config.ipynb)\n",
+    "    Getting started on training own language Megatron GPT models -- Please go through the below notebooks sequentially.\n",
+    "    1. [Fetch and extract Swedish data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-1_acquiring_data.ipynb)\n",
+    "    2. [Find sentence boundary and deduplicate your data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-2_SentenceBoundary_and_Deduplicate.ipynb)\n",
+    "        - [mini challenge - approaching groundtruth](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-2_SentenceBoundary_and_Deduplicate.ipynb#TheChallenge)\n",
+    "    3. [Train your own GPTBPE Tokenizer on your own data ](./jupyter_notebook/Lab2-3_train_own_GPT2BPETokenizer.ipynb)\n",
+    "    4. [customize preprocess data python script and convert to mmap](./jupyter_notebook/Lab2-4_customize_process2mmap.ipynb)\n",
+    "    5. [The Challenge - Go Big or go home!](./jupyter_notebook/Lab2-5_run_Megatron_with_varying_config.ipynb)\n",
     "\n"
    ]
   },

+ 219 - 0
ai/Megatron/English/Python/jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb

@@ -0,0 +1,219 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "special-singer",
+   "metadata": {},
+   "source": [
+    "# Estimate Time\n",
+    "---\n",
+    "\n",
+    "## Learning Objectives\n",
+    "The goal of this lab is to estimate compute time needed for an end to end training run.\n",
+    "\n",
+    "**Motivation**: In order to request for computing resources for a training job on a cluster, one must provide information such as, the number of nodes/gpus and the estimated time of the training job run.\n",
+    "\n",
+    "Training time (in seconds) is approximated with this equation : 8*T*P/n*X\n",
+    "\n",
+    "- T = dataset size measured in numbers of tokens in the dataset\n",
+    "- P = model parameters for GPT3 varients\n",
+    "- n = number of GPUs in the compute cluster\n",
+    "- x = achieved teraflops per GPU \n",
+    "\n",
+    "\n",
+    "The above equation was extracted from this paper : [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf)\n",
+    "\n",
+    "---------------------------------------------------------------------------------------------------\n",
+    "\n",
+    "Assets provided below for you convenience : \n",
+    "\n",
+    "<center><img src=\"./Megatron-LM/pics/GPT3_all.png\" width=\"700\"/></center>\n",
+    "\n",
+    "<center><img src=\"./Megatron-LM/pics/achieved_teraflops_per_gpu.JPG\" width=\"700\"/></center>\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "complicated-reproduction",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## Sanity check - \n",
+    "\n",
+    "<left><img src=\"./Megatron-LM/pics/TrainingTimeEstimate.JPG\" width=\"500\"/></left>\n",
+    "\n",
+    "Two scenarios were extracted from the above paper (screenshot above) : [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf) \n",
+    "\n",
+    "**Scenario 1** -\n",
+    "\n",
+    "T = 300Billion tokens # assumed data size measured in tokens\n",
+    "\n",
+    "P = 175 Billion GPT3 model\n",
+    "\n",
+    "n = 1024 GPUs\n",
+    "\n",
+    "x = 140 teraFLOP/s per GPU\n",
+    "\n",
+    "Question : How many hours/ days will you need given the scenaio above for you to compute an end to end training job ?\n",
+    "\n",
+    "Answer : We should observe around **34 days** for an end to end training run\n",
+    "\n",
+    "\n",
+    "**Scenario 2** - \n",
+    "\n",
+    "T =  450 Billion tokens  \n",
+    "\n",
+    "P = 1 Trillion parameters GPT 3 model\n",
+    "\n",
+    "n = 3072 \n",
+    "\n",
+    "x = 163 teraFLOP/s per GPU \n",
+    "\n",
+    "Question: How many hours/ days will you need given this scenaio above for you to compute an end to end training job ?\n",
+    "\n",
+    "Answer: We should observe around **84 days** for an end to end training run\n"
+   ]
+  },
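To make the arithmetic behind Scenario 1 concrete, here is a minimal scratch calculation in plain Python (independent of the helper function defined in the next cell):

```python
# Scenario 1 from the paper: T = 300B tokens, P = 175B parameters,
# n = 1024 GPUs, x = 140 teraFLOP/s achieved per GPU
T = 300e9     # tokens
P = 175e9     # parameters
n = 1024      # GPUs
x = 140e12    # achieved FLOP/s per GPU

compute_seconds = 8 * T * P / (n * x)          # ~2.93e6 seconds
compute_days = compute_seconds / (3600 * 24)
print(round(compute_days, 1))                  # ~33.9 days, matching the ~34 days above
```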
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "hundred-array",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# The following code block contain automatic functions which assist calculating time-to-compute for an end to end training run.\n",
+    "import numpy as np\n",
+    "# T = dataset size measured in numbers of tokens in the dataset\n",
+    "# P = model parameters for GPT3 varients\n",
+    "# n = number of GPUs in the compute cluster\n",
+    "# x = achieved teraflops per GPU \n",
+    "\n",
+    "def calculate_days_needed(T , P , n ,x):\n",
+    "    if x is None:\n",
+    "        return 'not a good SuperPOD use case, let us try a bigger model :)'\n",
+    "    else:        \n",
+    "        tot=8*T*P\n",
+    "        div=n*x\n",
+    "        compute_sec=tot/div\n",
+    "        #convert compute seconds to days\n",
+    "        to_days=round(compute_sec/(3600*24),1)\n",
+    "        return to_days\n",
+    "## sanity check against the two scenarios above \n",
+    "T=[300*1e+9, 450*1e+9]\n",
+    "n=[1024,3072]\n",
+    "GPT3_models_labels=[  'gpt3_175B','gpt3_1Trillion']\n",
+    "GPT3_model_params=[ 175*1e+9,1*1e+12 ]\n",
+    "GPT3_model_params_str=['175 Billion','1Trillion']\n",
+    "#according to the table above\n",
+    "GPT3_X=[140*1e+12,163*1e+12]\n",
+    "print(\"all below are measured with dataset size **300 billion** measured in tokens \\n\")\n",
+    "scene=1\n",
+    "for gpt3_name, gpt3_params, gpt3_param_str, x, n_,t in zip(GPT3_models_labels,GPT3_model_params,GPT3_model_params_str, GPT3_X ,n,T):\n",
+    "    days_needed=calculate_days_needed(t,gpt3_params,n_,x)\n",
+    "    print(\" ----------------------------scenario {}-----------------------------------\".format(scene))\n",
+    "    print(\" language model :{} with {} number of parameters , it will need {} days to compute \\n\".format(gpt3_name, gpt3_param_str, str(days_needed)))\n",
+    "    scene+=1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "noted-sense",
+   "metadata": {},
+   "source": [
+    "Below is an example of expected outputs :\n",
+    "\n",
+    "     ----------------------------scenario 1-----------------------------------\n",
+    "     language model :gpt3_175B with 175 Billion number of parameters , it will need 33.9 days to compute \n",
+    "\n",
+    "     ----------------------------scenario 2-----------------------------------\n",
+    "     language model :gpt3_1Trillion with 1Trillion number of parameters , it will need 83.2 days to compute\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "indie-schema",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "**Exercise** -\n",
+    "\n",
+    "For a GPT3 model size of 70B parameters with approximatedly 300 Billion tokens in an existing dataset\n",
+    "You have requested 1/4 of the total number of gpus available in [BerzeLiUs](https://www.nsc.liu.se/support/systems/berzelius-getting-started/).\n",
+    "\n",
+    "\n",
+    "Question -\n",
+    "\n",
+    "How many hours/days would you need to do an end to end training run ? \n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cosmetic-gregory",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "T=<FILL_IN> \n",
+    "p=<FILL_IN> \n",
+    "n=<FILL_IN> \n",
+    "x=<FILL_IN> \n",
+    "gpt3_params=<FILL_IN> \n",
+    "calculate_days_needed(T,gpt3_params,n,x)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "viral-upper",
+   "metadata": {},
+   "source": [
+    "--- \n",
+    "\n",
+    "## Links and Resources\n",
+    "Don't forget to check out additional resources such as [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf ), [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165.pdf) and [Scaling Laws for Neural Language Models](https://arxiv.org/pdf/2001.08361.pdf)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "spiritual-dancing",
+   "metadata": {},
+   "source": [
+    "-----\n",
+    "## <p style=\"text-align:center;border:3px; padding: 1em\"> <a href=../Start_Here.ipynb>HOME</a>&nbsp; &nbsp; &nbsp; <a href=./Lab1-3_MegatronFundementals.ipynb>NEXT</a></p>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "silent-kruger",
+   "metadata": {},
+   "source": [
+    "-----\n",
+    "\n",
+    "\n",
+    "## Licensing \n",
+    "\n",
+    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

+ 788 - 0
ai/Megatron/English/Python/jupyter_notebook/Lab1-3_MegatronFundementals.ipynb

@@ -0,0 +1,788 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "wooden-perth",
+   "metadata": {},
+   "source": [
+    "##  Megatron-LM Model Parallel Unit (MPU)\n",
+    "---\n",
+    "\n",
+    "NVIDIA's Megatron-LM makes training very large langauge models ( up to one trillion parameters ) a reality. Megatron-LM's core, Model Paralleism Unit ( MPU ), is the backbones for [DeepSpeed](https://www.deepspeed.ai/features/#model-parallelism) and Facebook [FairScale](https://github.com/facebookresearch/fairscale), both of them integrate Megatron-LM's MPU heavily in the backend.\n",
+    "\n",
+    "## Learning Objectives\n",
+    "\n",
+    "The goal of this lab is to understand how Megatro-LM's Model Parallel Unit (MPU) works, more specifically, we will cover :\n",
+    "\n",
+    "    - GPUs grouping affinity per training configuration.\n",
+    "    - Tensor Parallism : \n",
+    "        - Column Parallel\n",
+    "        - Row Parallel\n",
+    "\n",
+    "Pipeline parallism will be covered in the lecture instead. \n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "legislative-dividend",
+   "metadata": {},
+   "source": [
+    "Before going through GPUs grouping affinity, we need to specify the following parameters :\n",
+    "\n",
+    "- p = Pipeline Model Parallel  \n",
+    "- t = Tensor Model Parallel\n",
+    "- d = Data Parallal \n",
+    "- n = Total number of GPUs used in the training\n",
+    "\n",
+    "Note that Megatron-LM requires p * t * d = n\n",
+    "\n",
+    "\n",
+    "Let's first go through a default example given by Megatron-LM [initializer.py](https://github.com/NVIDIA/Megatron-LM/blob/90e0a0dd08159e1c95f4f9d99bb8687f327d36c3/megatron/mpu/initialize.py).\n",
+    "\n",
+    "The following training configuration is assumed : \n",
+    "\n",
+    "    tensor_model_parallel_size_= 2 \n",
+    "\n",
+    "    pipeline_model_parallel_size_= 4\n",
+    "\n",
+    "Let's say we have a total of 16 GPUs denoted by g0 ... g15, hence \n",
+    "\n",
+    "    world_size = 16  \n",
+    "\n",
+    "Accoridng to Megatron-LM [initializer.py](https://github.com/NVIDIA/Megatron-LM/blob/90e0a0dd08159e1c95f4f9d99bb8687f327d36c3/megatron/mpu/initialize.py) we should see the following ...\n",
+    "\n",
+    "    8 data_parallel groups:\n",
+    "        [g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]\n",
+    "    8 tensor model-parallel groups:\n",
+    "        [g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]\n",
+    "    4 pipeline model-parallel groups:\n",
+    "        [g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]\n",
+    "\n",
+    "**Note** that for efficiency, the caller should make sure adjacent ranks are on the same DGX box.\n",
+    "For example if we are using 2 DGX-1 boxes\n",
+    "with a total of 16 GPUs, rank 0 to 7 belong to the first box and\n",
+    "ranks 8 to 15 belong to the second box."
+   ]
+  },
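Before reading the grouping code, a quick sanity check of the arithmetic behind these groups may help (a minimal sketch; the variable names are ours, not Megatron-LM's):

```python
# Assumed example configuration from the text above
t = 2    # tensor model parallel size
p = 4    # pipeline model parallel size
n = 16   # world size (total number of GPUs)

d = n // (t * p)        # data parallel size -> 2
assert p * t * d == n   # Megatron-LM requires p * t * d = n

num_tensor_groups = n // t    # 8 tensor model-parallel groups
num_pipeline_groups = n // p  # 4 pipeline model-parallel groups
num_data_groups = n // d      # 8 data-parallel groups
print(d, num_tensor_groups, num_pipeline_groups, num_data_groups)  # 2 8 4 8
```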
+  {
+   "cell_type": "markdown",
+   "id": "faced-detroit",
+   "metadata": {},
+   "source": [
+    "The code block below is modified from Megatron-LM [initializer.py](https://github.com/NVIDIA/Megatron-LM/blob/90e0a0dd08159e1c95f4f9d99bb8687f327d36c3/megatron/mpu/initialize.py) in order to avoid the need of having actual 16 physical GPUs to run this notebook, in another word, this notebook can be run without have any physical GPUs present."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cooperative-contractor",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import itertools\n",
+    "def ensure_divisibility(numerator, denominator):\n",
+    "    \"\"\"Ensure that numerator is divisible by the denominator.\"\"\"\n",
+    "    assert numerator % denominator == 0, '{} is not divisible by {}'.format(\n",
+    "        numerator, denominator)\n",
+    "def initialize_model_parallel(tensor_model_parallel_size_=2,\n",
+    "                              pipeline_model_parallel_size_= 4,\n",
+    "                              world_size=16):\n",
+    "    print(' ---------- world size is set to : {} ---------- '.format(world_size))\n",
+    "    print('> initializing tensor model parallel with size {}'.format(tensor_model_parallel_size_))\n",
+    "    print('> initializing pipeline model parallel with size {}'.format(pipeline_model_parallel_size_))\n",
+    "    \n",
+    "    tensor_model_parallel_size = min(tensor_model_parallel_size_, world_size)\n",
+    "    pipeline_model_parallel_size = min(pipeline_model_parallel_size_, world_size)\n",
+    "                                       \n",
+    "    # make sure world_size is divisible by t * p    \n",
+    "    ensure_divisibility(world_size,tensor_model_parallel_size * pipeline_model_parallel_size)\n",
+    "                        \n",
+    "    data_parallel_size = world_size // (tensor_model_parallel_size * pipeline_model_parallel_size)\n",
+    "    print(\"> data parallel size is set to : \", data_parallel_size)\n",
+    "    num_tensor_model_parallel_groups = world_size // tensor_model_parallel_size\n",
+    "    num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size\n",
+    "    num_data_parallel_groups = world_size // data_parallel_size\n",
+    "    print(\"---------- parallel groups ----------\")\n",
+    "    print(\"num_tensor_model_parallel_groups : \", num_tensor_model_parallel_groups)\n",
+    "    print(\"num_pipeline_model_parallel_groups : \", num_pipeline_model_parallel_groups)\n",
+    "    print(\"num_data_parallel_groups : \",num_data_parallel_groups )\n",
+    "    # Build the data-parallel groups.\n",
+    "    _DATA_PARALLEL_GROUP = []\n",
+    "    _MODEL_PARALLEL_GROUP = []\n",
+    "    _TENSOR_MODEL_PARALLEL_GROUP = []\n",
+    "    _PIPE_MODEL_PARALLEL_GROUP=[]   \n",
+    "    _MODEL_PARALLEL_GROUP=[]\n",
+    "    all_data_parallel_group_ranks = []\n",
+    "    for i in range(pipeline_model_parallel_size):\n",
+    "        start_rank = i * num_pipeline_model_parallel_groups\n",
+    "        end_rank = (i + 1) * num_pipeline_model_parallel_groups\n",
+    "        #print(\"start rank : {} | end rank :{}\".format(start_rank, end_rank))\n",
+    "        temp=[]\n",
+    "        for j in range(tensor_model_parallel_size):\n",
+    "            ranks = range(start_rank + j, end_rank,\n",
+    "                          tensor_model_parallel_size)\n",
+    "            temp.append(list(ranks))\n",
+    "            all_data_parallel_group_ranks.append(list(ranks))\n",
+    "    _DATA_PARALLEL_GROUP=all_data_parallel_group_ranks\n",
+    "\n",
+    "    for i in range(num_pipeline_model_parallel_groups):\n",
+    "        ranks = range(i, world_size,\n",
+    "                      num_pipeline_model_parallel_groups)        \n",
+    "        _PIPE_MODEL_PARALLEL_GROUP.append(list(ranks))\n",
+    "    \n",
+    "    \n",
+    "    for i in range(data_parallel_size):\n",
+    "        ranks = [data_parallel_group_ranks[i]\n",
+    "                 for data_parallel_group_ranks in all_data_parallel_group_ranks]\n",
+    "        _MODEL_PARALLEL_GROUP.append(ranks)\n",
+    "    \n",
+    "    for i in range(num_tensor_model_parallel_groups):\n",
+    "        ranks = range(i * tensor_model_parallel_size,\n",
+    "                      (i + 1) * tensor_model_parallel_size)\n",
+    "        _TENSOR_MODEL_PARALLEL_GROUP.append(list(ranks))\n",
+    "    print(\"-----\"*20)\n",
+    "    print(\"_DATA_PARALLEL_GROUP \\n :\", _DATA_PARALLEL_GROUP)\n",
+    "    print(\"-----\"*20)\n",
+    "    print(\"_TENSOR_MODEL_PARALLEL_GROUP \\n :\", _TENSOR_MODEL_PARALLEL_GROUP)\n",
+    "    print(\"-----\"*20)\n",
+    "    print(\"_PIPE_MODEL_PARALLEL_GROUP \\n :\", _PIPE_MODEL_PARALLEL_GROUP)\n",
+    "    print(\"-----\"*20)\n",
+    "    print(\"Total :{} full models being partitioned into :{} GPUs \".format(len(_MODEL_PARALLEL_GROUP),world_size))\n",
+    "    for idx, m in zip(range(len(_MODEL_PARALLEL_GROUP)),_MODEL_PARALLEL_GROUP):\n",
+    "        m=[str(l) for l in m]\n",
+    "        print(\"model {} : is partitioned into gpus :{}\".format(str(idx),','.join(m)))   \n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "floppy-princeton",
+   "metadata": {},
+   "source": [
+    "\n",
+    "Sanity check, verify, after run the below code cell, the result will match the comment inside of [megatron/mpu/initializer.py](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/mpu/initialize.py#L63).\n",
+    "Since we have a total of 16 GPUs denoted by g0 ... g15, we expected to see the below result :\n",
+    "\n",
+    "            8 data_parallel groups:\n",
+    "                [g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]\n",
+    "            8 tensor model-parallel groups:\n",
+    "                [g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]\n",
+    "            4 pipeline model-parallel groups:\n",
+    "                [g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "postal-hotel",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "initialize_model_parallel(tensor_model_parallel_size_=2,\n",
+    "                              pipeline_model_parallel_size_= 4,\n",
+    "                              world_size=16)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "electric-differential",
+   "metadata": {},
+   "source": [
+    "Below is the expected outputs :\n",
+    "\n",
+    "         ---------- world size is set to : 16 ---------- \n",
+    "        > initializing tensor model parallel with size 2\n",
+    "        > initializing pipeline model parallel with size 4\n",
+    "        > data parallel size is set to :  2\n",
+    "        ---------- parallel groups ----------\n",
+    "        num_tensor_model_parallel_groups :  8\n",
+    "        num_pipeline_model_parallel_groups :  4\n",
+    "        num_data_parallel_groups :  8\n",
+    "        ----------------------------------------------------------------------------------------------------\n",
+    "        _DATA_PARALLEL_GROUP \n",
+    "         : [[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11], [12, 14], [13, 15]]\n",
+    "        ----------------------------------------------------------------------------------------------------\n",
+    "        _TENSOR_MODEL_PARALLEL_GROUP \n",
+    "         : [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]]\n",
+    "        ----------------------------------------------------------------------------------------------------\n",
+    "        _PIPE_MODEL_PARALLEL_GROUP \n",
+    "         : [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]\n",
+    "        ----------------------------------------------------------------------------------------------------\n",
+    "        Total :2 full models being partitioned into :16 GPUs \n",
+    "        model 0 : is partitioned into gpus :0,1,4,5,8,9,12,13\n",
+    "        model 1 : is partitioned into gpus :2,3,6,7,10,11,14,15"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "prime-puppy",
+   "metadata": {},
+   "source": [
+    "Try a different training configuration, what did you get ? "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "external-freedom",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "## assuming the world size is 16, that is, you have 16 GPUs\n",
+    "tensor_model_parallel_size= <FILL_IN>  # try a different tensor_model_parallel_size_\n",
+    "pipeline_model_parallel_size= <FILL_IN>  # try a different pipeline_model_parallel_size_\n",
+    "world_size=16\n",
+    "assert world_size%(tensor_model_parallel_size * pipeline_model_parallel_size)==0,'please make sure world_size is divisible by tensor_model_parallel_size * pipeline_model_parallel_size' \n",
+    "initialize_model_parallel(tensor_model_parallel_size_=tensor_model_parallel_size,pipeline_model_parallel_size_= pipeline_model_parallel_size,world_size=world_size)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "every-cotton",
+   "metadata": {},
+   "source": [
+    "----------------------------------------------------------------------\n",
+    "Column Parallel is part of Megatron-LM's Tensor Parallelism.  \n",
+    "[ColumnParallel reference](https://github.com/NVIDIA/Megatron-LM/blob/90e0a0dd08159e1c95f4f9d99bb8687f327d36c3/megatron/mpu/layers.py#L201)\n",
+    "![ColumnParallel](./Megatron-LM/pics/ColumnParallel.JPG)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "turkish-capability",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "## Below class is modified from the original Megatron repo in order to skip environment variable initialization\n",
+    "\n",
+    "import sys\n",
+    "sys.path.append(\"./Megatron-LM\")\n",
+    "from megatron.mpu import layers\n",
+    "from torch.nn.parameter import Parameter\n",
+    "import torch.nn.init as init\n",
+    "import torch\n",
+    "import random\n",
+    "from megatron import *\n",
+    "from megatron.mpu.tests import *\n",
+    "from megatron.mpu.utils import *\n",
+    "global world_size \n",
+    "world_size = 16\n",
+    "class myColumnParallelLinear(torch.nn.Module):\n",
+    "    \"\"\"Linear layer with column parallelism.\n",
+    "    The linear layer is defined as Y = XA + b. A is parallelized along\n",
+    "    its second dimension as A = [A_1, ..., A_p].\n",
+    "    Arguments:\n",
+    "        input_size: first dimension of matrix A.\n",
+    "        output_size: second dimension of matrix A.\n",
+    "        bias: If true, add bias\n",
+    "        gather_output: If true, call all-gether on output and make Y avaiable\n",
+    "                       to all GPUs, otherwise, every GPU will have its output\n",
+    "                       which is Y_i = XA_i\n",
+    "        init_method: method to initialize weights. Note that bias is always set\n",
+    "                     to zero.\n",
+    "        stride: For the strided linear layers.\n",
+    "        keep_master_weight_for_test: This was added for testing and should be\n",
+    "                                     set to False. It returns the master weights\n",
+    "                                     used for initialization.\n",
+    "        skip_bias_add: This was added to enable performance optimations where bias\n",
+    "                       can be fused with other elementwise operations. we skip \n",
+    "                       adding bias but instead return it.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    def __init__(self, input_size, output_size, bias=True, gather_output=True,\n",
+    "                 init_method=init.xavier_normal_, stride=1,\n",
+    "                 keep_master_weight_for_test=False,\n",
+    "                 skip_bias_add=False):\n",
+    "        super(myColumnParallelLinear, self).__init__()\n",
+    "\n",
+    "        # Keep input parameters\n",
+    "        self.input_size = input_size\n",
+    "        self.output_size = output_size\n",
+    "        self.gather_output = gather_output\n",
+    "        # Divide the weight matrix along the last dimension.\n",
+    "        \n",
+    "        self.output_size_per_partition = divide(output_size, world_size)\n",
+    "        self.skip_bias_add = skip_bias_add\n",
+    "\n",
+    "        # Parameters.\n",
+    "        # Note: torch.nn.functional.linear performs XA^T + b and as a result\n",
+    "        # we allocate the transpose.\n",
+    "        # Initialize weight.        \n",
+    "        use_cpu_initialization=True # hard coded to use cpu\n",
+    "        params_dtype = torch.float # skipping need of args\n",
+    "        \n",
+    "        if use_cpu_initialization:\n",
+    "            self.weight = Parameter(torch.empty(self.output_size_per_partition,\n",
+    "                                                self.input_size,\n",
+    "                                                dtype=params_dtype))\n",
+    "            \n",
+    "            self.master_weight = m_initialize_affine_weight_cpu(\n",
+    "                self.weight, self.output_size, self.input_size,\n",
+    "                self.output_size_per_partition, 0, init_method,\n",
+    "                stride=stride, return_master_weight=keep_master_weight_for_test)\n",
+    "            \n",
+    "        else:\n",
+    "            self.weight = Parameter(torch.empty(\n",
+    "                self.output_size_per_partition, self.input_size,\n",
+    "                device=torch.cuda.current_device(), dtype=params_dtype))\n",
+    "            _initialize_affine_weight_gpu(self.weight, init_method,\n",
+    "                                          partition_dim=0, stride=stride)\n",
+    "            \n",
+    "        if bias:\n",
+    "            if use_cpu_initialization:\n",
+    "                self.bias = Parameter(torch.empty(\n",
+    "                    self.output_size_per_partition, dtype=params_dtype))\n",
+    "            else:\n",
+    "                self.bias = Parameter(torch.empty(\n",
+    "                    self.output_size_per_partition,\n",
+    "                    device=torch.cuda.current_device(),\n",
+    "                    dtype=params_dtype))\n",
+    "            # Always initialize bias to zero.\n",
+    "            with torch.no_grad():\n",
+    "                self.bias.zero_()\n",
+    "        else:\n",
+    "            self.register_parameter('bias', None)\n",
+    "\n",
+    "    def forward(self, input_):\n",
+    "        # Set up backprop all-reduce.\n",
+    "        print(\"in Column parallel forward\")\n",
+    "        input_parallel = copy_to_tensor_model_parallel_region(input_)\n",
+    "        # Matrix multiply.\n",
+    "\n",
+    "        bias = self.bias if not self.skip_bias_add else None\n",
+    "        output_parallel = F.linear(input_parallel, self.weight, bias)\n",
+    "        if self.gather_output:\n",
+    "            # All-gather across the partitions.\n",
+    "            output = gather_from_tensor_model_parallel_region(output_parallel)\n",
+    "        else:\n",
+    "            output = output_parallel \n",
+    "        output_bias = self.bias if self.skip_bias_add else None\n",
+    "        return output, output_bias"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "experimental-ocean",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def get_weight_list(master_weight,tensor_model_parallel_gp):\n",
+    "    my_weight_list=[]\n",
+    "    a,b=master_weight.size()\n",
+    "    print(\"A = [\")\n",
+    "    tensor_model_parallel_gp=list(itertools.chain(*tensor_model_parallel_gp))\n",
+    "    cnt=0\n",
+    "    for gp in tensor_model_parallel_gp :\n",
+    "        if which_model_parallel=='col': \n",
+    "            temp=master_weight[gp::world_size].T\n",
+    "            if cnt < world_size -1 : \n",
+    "                print(\"A{}=\".format(str(cnt)), temp.size(), end = ',')\n",
+    "            else:\n",
+    "                print(\"A{}=\".format(str(cnt)), temp.size())\n",
+    "        elif which_model_parallel =='row':\n",
+    "            temp=master_weight.T\n",
+    "            temp=temp[gp::world_size]\n",
+    "            if cnt < world_size -1 :\n",
+    "                print(\"A{}=\".format(str(cnt)), temp.size(),',')\n",
+    "            else:\n",
+    "                print(\"A{}=\".format(str(cnt)), temp.size())\n",
+    "\n",
+    "        else:\n",
+    "            print(\"set which_model_parallel to **col** or **row**\")\n",
+    "        cnt+=1    \n",
+    "        my_weight_list.append(temp)\n",
+    "            \n",
+    "    print(\" ]\")\n",
+    "    print(len(my_weight_list))\n",
+    "    return my_weight_list\n",
+    "def m_initialize_affine_weight_cpu(weight, output_size, input_size,\n",
+    "                                  per_partition_size, partition_dim,\n",
+    "                                  init_method, stride=1,\n",
+    "                                  return_master_weight=False):\n",
+    "    \"\"\"Initialize affine weight for model parallel.\n",
+    "    Build the master weight on all processes and scatter\n",
+    "    the relevant chunk.\"\"\"\n",
+    "    params_dtype = torch.float\n",
+    "    # Initialize master weight\n",
+    "    master_weight = torch.empty(output_size, input_size,\n",
+    "                                dtype=torch.float,\n",
+    "                                requires_grad=False)    \n",
+    "    \n",
+    "    master_weight = master_weight.to(dtype=params_dtype)\n",
+    "    # Split and copy\n",
+    "    per_partition_per_stride_size = divide(per_partition_size, stride)\n",
+    "    print(\"per_partition_per_stride_size \",per_partition_per_stride_size)\n",
+    "    weight_list = torch.split(master_weight, per_partition_per_stride_size,\n",
+    "                              dim=partition_dim)\n",
+    "    ########  tensor_model_parallel_gp below is hard-coded for tensor_model_parallel_size= 2 , pipeline_model_parallel_size= 4 ########\n",
+    "    ########    if you use other model parallel configuration , please copy and replace it in tensor_model_parallel_gp     ########\n",
+    "    tensor_model_parallel_gp=[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]] \n",
+    "    my_weight_list = get_weight_list(master_weight,tensor_model_parallel_gp)\n",
+    "    \n",
+    "    with torch.no_grad():\n",
+    "        torch.cat(my_weight_list, dim=partition_dim, out=weight)\n",
+    "    if return_master_weight:\n",
+    "        return master_weight\n",
+    "    return None"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "wired-graham",
+   "metadata": {},
+   "source": [
+    "Peek inside Column Parallel Class, indeed Column Parallel can partition input matrix A into [A0, A1, A2 ...An] each of the partition matrix Ai, where i= 0, 1, 2 ...n, is sliced **column-wised**. Each column-wised partitioned matrix Ai will then be sent to the corresponding gpu based on the groupped gpu affinity."
+   ]
+  },
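For intuition, here is a minimal NumPy sketch (not Megatron-LM code) of the column-parallel idea: split the weight matrix A column-wise, let each shard produce its partial output independently, then concatenate the partial outputs, which is the role of the all-gather:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(4, 8)   # input, shape (batch, input_size)
A = np.random.randn(8, 6)   # weight, shape (input_size, output_size)
n_parts = 2                 # pretend we have 2 tensor-parallel ranks

# Column-wise split of A: each "GPU" holds one A_i of shape (8, 3)
A_shards = np.split(A, n_parts, axis=1)

# Each rank computes its partial output Y_i = X @ A_i independently
Y_parts = [X @ A_i for A_i in A_shards]

# All-gather: concatenating the partial outputs recovers the full Y = X @ A
Y = np.concatenate(Y_parts, axis=1)
assert np.allclose(Y, X @ A)
```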
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "circular-hearing",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tensor_model_parallel_size= 2 \n",
+    "pipeline_model_parallel_size= 4  \n",
+    "input_size = 1024 # 1024 rows\n",
+    "output_size = 512 # 256 columns\n",
+    "which_model_parallel='col'\n",
+    "print(\"this is how A is sliced column-wised ...\\n\")\n",
+    "testCol=myColumnParallelLinear(input_size, output_size, bias=True, gather_output=True,\n",
+    "                 init_method=init.xavier_normal_, stride=1,\n",
+    "                 keep_master_weight_for_test=False,\n",
+    "                 skip_bias_add=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "instrumental-nickel",
+   "metadata": {},
+   "source": [
+    "Below is the expected outputs :\n",
+    "\n",
+    "        this is how A is sliced column-wised ...\n",
+    "        per_partition_per_stride_size  32\n",
+    "        A = [\n",
+    "        A0= torch.Size([1024, 32]),A1= torch.Size([1024, 32]),A2= torch.Size([1024, 32]),A3= torch.Size([1024, 32]),A4= torch.Size([1024, 32]),A5= torch.Size([1024, 32]),A6= torch.Size([1024, 32]),A7= torch.Size([1024, 32]),A8= torch.Size([1024, 32]),A9= torch.Size([1024, 32]),A10= torch.Size([1024, 32]),A11= torch.Size([1024, 32]),A12= torch.Size([1024, 32]),A13= torch.Size([1024, 32]),A14= torch.Size([1024, 32]),A15= torch.Size([1024, 32])\n",
+    "         ]\n",
+    "        16"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "saved-quantity",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "per_partition_per_stride_size=32\n",
+    "assert 16* per_partition_per_stride_size == 512 "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "complete-camel",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "type(testCol)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "divided-mexican",
+   "metadata": {},
+   "source": [
+    "Below is the expected outputs :\n",
+    "    \n",
+    "        __main__.myColumnParallelLinear"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "corporate-recipient",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "testCol.input_size, testCol.output_size"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "sound-haven",
+   "metadata": {},
+   "source": [
+    "Below is the expected outputs :\n",
+    "    \n",
+    "        (1024, 512)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "coupled-stevens",
+   "metadata": {},
+   "source": [
+    "----------------------------------------------------------------------\n",
+    "## Megatron-LM's Row Parallel \n",
+    "[RowParallel reference](https://github.com/NVIDIA/Megatron-LM/blob/90e0a0dd08159e1c95f4f9d99bb8687f327d36c3/megatron/mpu/layers.py#L294)\n",
+    "![RowParallel](./Megatron-LM/pics/RowParallel.JPG)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "prerequisite-simulation",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "## Below class is modified from the original Megatron repo in order to skip environment variable initialization\n",
+    "\n",
+    "class myRowParallelLinear(torch.nn.Module):\n",
+    "    \"\"\"Linear layer with row parallelism.\n",
+    "    The linear layer is defined as Y = XA + b. A is parallelized along\n",
+    "    its first dimension and X along its second dimension as:\n",
+    "               -   -\n",
+    "              | A_1 |\n",
+    "              | .   |\n",
+    "          A = | .   |        X = [X_1, ..., X_p]\n",
+    "              | .   |\n",
+    "              | A_p |\n",
+    "               -   -\n",
+    "    Arguments:\n",
+    "        input_size: first dimension of matrix A.\n",
+    "        output_size: second dimension of matrix A.\n",
+    "        bias: If true, add bias. Note that bias is not parallelized.\n",
+    "        input_is_parallel: If true, we assume that the input is already\n",
+    "                           split across the GPUs and we do not split\n",
+    "                           again.\n",
+    "        init_method: method to initialize weights. Note that bias is always set\n",
+    "                     to zero.\n",
+    "        stride: For the strided linear layers.\n",
+    "        keep_master_weight_for_test: This was added for testing and should be\n",
+    "                                     set to False. It returns the master weights\n",
+    "                                     used for initialization.\n",
+    "        skip_bias_add: This was added to enable performance optimations where bias\n",
+    "                       can be fused with other elementwise operations. we skip \n",
+    "                       adding bias but instead return it.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    def __init__(self, input_size, output_size, bias=True,\n",
+    "                 input_is_parallel=False,\n",
+    "                 init_method=init.xavier_normal_, stride=1,\n",
+    "                 keep_master_weight_for_test=False,\n",
+    "                 skip_bias_add=False):\n",
+    "        super(myRowParallelLinear, self).__init__()\n",
+    "\n",
+    "        # Keep input parameters\n",
+    "        self.input_size = input_size\n",
+    "        self.output_size = output_size\n",
+    "        self.input_is_parallel = input_is_parallel\n",
+    "        # Divide the weight matrix along the last dimension.\n",
+    "        self.input_size_per_partition = divide(input_size, world_size)\n",
+    "        self.skip_bias_add = skip_bias_add\n",
+    "        print(\"input_size_per_partition \", self.input_size_per_partition)\n",
+    "        \n",
+    "\n",
+    "        # Parameters.\n",
+    "        # Note: torch.nn.functional.linear performs XA^T + b and as a result\n",
+    "        # we allocate the transpose.\n",
+    "        # Initialize weight.\n",
+    "        use_cpu_initialization=True # hard coded to use cpu\n",
+    "        params_dtype = torch.float # skipping need of args\n",
+    "        if use_cpu_initialization:\n",
+    "            self.weight = Parameter(torch.empty(self.output_size,\n",
+    "                                                self.input_size_per_partition,\n",
+    "                                                dtype=params_dtype))\n",
+    "            self.master_weight = m_initialize_affine_weight_cpu(\n",
+    "                self.weight, self.output_size, self.input_size,\n",
+    "                self.input_size_per_partition, 1, init_method,\n",
+    "                stride=stride, return_master_weight=keep_master_weight_for_test)\n",
+    "        else:\n",
+    "            self.weight = Parameter(torch.empty(\n",
+    "                self.output_size, self.input_size_per_partition,\n",
+    "                device=torch.cuda.current_device(), dtype=params_dtype))\n",
+    "            _initialize_affine_weight_gpu(self.weight, init_method,\n",
+    "                                          partition_dim=1, stride=stride)\n",
+    "        if bias:\n",
+    "            if use_cpu_initialization:\n",
+    "                self.bias = Parameter(torch.empty(self.output_size,\n",
+    "                                                  dtype=params_dtype))\n",
+    "            else:\n",
+    "                self.bias = Parameter(torch.empty(\n",
+    "                    self.output_size, device=torch.cuda.current_device(),\n",
+    "                    dtype=params_dtype))\n",
+    "            # Always initialize bias to zero.\n",
+    "            with torch.no_grad():\n",
+    "                self.bias.zero_()\n",
+    "        else:\n",
+    "            self.register_parameter('bias', None)\n",
+    "\n",
+    "\n",
+    "\n",
+    "    def forward(self, input_):\n",
+    "        # Set up backprop all-reduce.\n",
+    "        if self.input_is_parallel:\n",
+    "            input_parallel = input_\n",
+    "        else:\n",
+    "            input_parallel = scatter_to_tensor_model_parallel_region(input_)\n",
+    "        # Matrix multiply.\n",
+    "        output_parallel = F.linear(input_parallel, self.weight)\n",
+    "        # All-reduce across all the partitions.\n",
+    "        output_ = reduce_from_tensor_model_parallel_region(output_parallel)\n",
+    "        if not self.skip_bias_add:\n",
+    "            output = output_ + self.bias if self.bias is not None else output_\n",
+    "            output_bias = None\n",
+    "        else:\n",
+    "            output = output_\n",
+    "            output_bias = self.bias\n",
+    "        return output, output_bias"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "satisfactory-irish",
+   "metadata": {},
+   "source": [
+    "Peek inside Row Parallel Class, indeed Row Parallel can partition input matrix A into [A0, A1, A2 ...An] each of the partition matrix Ai, where i= 0, 1, 2 ...n, Ai was chopped in a **row-wised** manner. Each row-wised partitioned matrix Ai will then be sent to the corresponding gpu based on the groupped gpu affinity."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "green-compiler",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "input_size = 1024 # first dimension of the matrix\n",
+    "output_size = 512 # 2nd dimension of the matrix\n",
+    "print(\"this is how A is sliced Row-wised ...\\n\")\n",
+    "which_model_parallel='row'\n",
+    "testRow=myRowParallelLinear(input_size,output_size, bias=True,\n",
+    "                 input_is_parallel=False,\n",
+    "                 init_method=init.xavier_normal_, stride=1,\n",
+    "                 keep_master_weight_for_test=False,\n",
+    "                 skip_bias_add=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "growing-standing",
+   "metadata": {},
+   "source": [
+    "Below is the expected outputs :\n",
+    "\n",
+    "        this is how A is sliced Row-wised ...\n",
+    "\n",
+    "        input_size_per_partition  64\n",
+    "        per_partition_per_stride_size  64\n",
+    "        A = [\n",
+    "        A0= torch.Size([64, 512]) ,\n",
+    "        A1= torch.Size([64, 512]) ,\n",
+    "        A2= torch.Size([64, 512]) ,\n",
+    "        A3= torch.Size([64, 512]) ,\n",
+    "        A4= torch.Size([64, 512]) ,\n",
+    "        A5= torch.Size([64, 512]) ,\n",
+    "        A6= torch.Size([64, 512]) ,\n",
+    "        A7= torch.Size([64, 512]) ,\n",
+    "        A8= torch.Size([64, 512]) ,\n",
+    "        A9= torch.Size([64, 512]) ,\n",
+    "        A10= torch.Size([64, 512]) ,\n",
+    "        A11= torch.Size([64, 512]) ,\n",
+    "        A12= torch.Size([64, 512]) ,\n",
+    "        A13= torch.Size([64, 512]) ,\n",
+    "        A14= torch.Size([64, 512]) ,\n",
+    "        A15= torch.Size([64, 512])\n",
+    "         ]\n",
+    "        16"
+   ]
+  },
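+  {
+   "cell_type": "markdown",
+   "id": "added-row-chunk-note",
+   "metadata": {},
+   "source": [
+    "To make the row-wise slicing concrete, the short sketch below (an added illustration that is independent of the Megatron classes above, assuming `world_size = 16` as in the printout) reproduces the partition shapes with `torch.chunk`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "added-row-chunk",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "# Assumption: world_size = 16, matching the printout above.\n",
+    "world_size = 16\n",
+    "A = torch.empty(1024, 512)                      # full weight matrix\n",
+    "partitions = torch.chunk(A, world_size, dim=0)  # row-wise slices, one per partition\n",
+    "print(len(partitions), partitions[0].shape)     # expected: 16 torch.Size([64, 512])"
+   ]
+  },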
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "handmade-judge",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "per_partition_per_stride_size=64\n",
+    "assert 16* per_partition_per_stride_size == 1024 "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "found-investing",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "testRow.input_size, testRow.output_size"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "prime-combine",
+   "metadata": {},
+   "source": [
+    "Below is the expected outputs :\n",
+    "    \n",
+    "    (1024, 512)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "disciplinary-western",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "\n",
+    "## Links and Resources\n",
+    "Don't forget to check out additional resources such as [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf ) and [Pushing Forward the Frontiers of Natural Language Processing](https://blogs.nvidia.com/blog/2021/09/16/nlp-frontiers-ai-hardware-summit/)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "blessed-crystal",
+   "metadata": {},
+   "source": [
+    "-----\n",
+    "## <p style=\"text-align:center;border:3px; padding: 1em\"> <a href=../Start_Here.ipynb>HOME</a>&nbsp; &nbsp; &nbsp; <a href=./Lab1-4_GPT_vocab_merge_files.ipynb>NEXT</a></p>\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "lucky-average",
+   "metadata": {},
+   "source": [
+    "-----\n",
+    "\n",
+    "\n",
+    "## Licensing \n",
+    "\n",
+    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

+ 332 - 0
ai/Megatron/English/Python/jupyter_notebook/Lab1-4_GPT_vocab_merge_files.ipynb

@@ -0,0 +1,332 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "noticed-neighborhood",
+   "metadata": {},
+   "source": [
+    "## GPT Tokenizer files\n",
+    "---\n",
+    "\n",
+    "## Learning Objectives\n",
+    "\n",
+    "The goal of this lab is to examine the difference between BPE and GPTBPE Tokenizer.\n",
+    "\n",
+    "Later on, we will use the observations from this notebook to train a GPT Tokenizer with our own raw text data.\n",
+    "\n",
+    "We will load and verify GPTBPE Tokenizer and make sure the output tokens and token ids are as expected. \n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "comic-architecture",
+   "metadata": {},
+   "source": [
+    "Let's review the source code of [gpt2 tokenizer](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html)\n",
+    "\n",
+    "    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will\n",
+    "    be encoded differently whether it is at the beginning of the sentence (without space) or not:\n",
+    "\n",
+    "    \n",
+    "\n",
+    "         from transformers import GPT2Tokenizer\n",
+    "         tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
+    "        \n",
+    "         tokenizer(\" Hello world\")['input_ids']\n",
+    "        [18435, 995]\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "portuguese-protocol",
+   "metadata": {},
+   "source": [
+    "Install necessary python libraries."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "latter-owner",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install tokenizers transformers ipywidgets\n",
+    "!jupyter nbextension enable --py widgetsnbextension"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "trained-glenn",
+   "metadata": {},
+   "source": [
+    "Next, we proceed to fetch pretrained GPT Tokenizer files, namely the vocab and merge files."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "textile-trance",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json\n",
+    "!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "veterinary-plenty",
+   "metadata": {},
+   "source": [
+    "Examine the vocab and merge files, noted the presence of Ġ character.\n",
+    "Ġ = space + 256 , this character is used as a control letter."
+   ]
+  },
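+  {
+   "cell_type": "markdown",
+   "id": "added-gtoken-note",
+   "metadata": {},
+   "source": [
+    "As a quick sanity check of the claim above (an added illustration), the cell below shows that Ġ is indeed the space character shifted by 256."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "added-gtoken-check",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Verify the claim above: the control letter Ġ is the space character shifted by 256\n",
+    "print(chr(ord(' ') + 256))            # prints Ġ\n",
+    "print(ord('Ġ') == ord(' ') + 256)     # True"
+   ]
+  },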
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "blessed-carbon",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import random\n",
+    "with open('gpt2-vocab.json') as ip_file:\n",
+    "    o = json.load(ip_file)\n",
+    "    take=20\n",
+    "    rn=random.randint(0,len(o)-1)\n",
+    "    print(\"noted that the Ġ = space + 256 is the control letter\")\n",
+    "    print(list(o.keys())[rn:rn+take])            "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "enhanced-stack",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!tail -n 5 gpt2-merges.txt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "orange-baker",
+   "metadata": {},
+   "source": [
+    "The following code block will load GPT2Tokenizer from HuggingFace transformer library, we verify the following :\n",
+    "\n",
+    "            from transformers import GPT2Tokenizer\n",
+    "            tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
+    "        \n",
+    "            tokenizer(\" Hello world\")['input_ids']\n",
+    "            expected token ids for \" Hello world\" is [18435, 995]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cardiac-burner",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import GPT2Tokenizer\n",
+    "tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
+    "\n",
+    "print('\\n notice the **SPACE** in front of ** Hello world** \\n')\n",
+    "sample_text=\" Hello world\"\n",
+    "print(sample_text)\n",
+    "out=tokenizer.tokenize(sample_text)\n",
+    "print(\"tokens:\",out)\n",
+    "ids=tokenizer(sample_text)['input_ids']\n",
+    "print(\"ids:\",ids)\n",
+    "## expected output :\n",
+    "## [18435, 995]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "alien-harris",
+   "metadata": {},
+   "source": [
+    "Below is the expected outputs :\n",
+    "    \n",
+    "         Hello world\n",
+    "        tokens: ['ĠHello', 'Ġworld']\n",
+    "        ids: [18435, 995]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "amazing-brick",
+   "metadata": {},
+   "source": [
+    "Next code block will load tokenizer library from huggingFace, we will observe the difference when setting `use_gpt` to True or False. \n",
+    "\n",
+    "Setting `use_gpt` to True will evoke the following : \n",
+    "\n",
+    "        tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()\n",
+    "        tokenizer.decoder = ByteLevelDecoder()\n",
+    "        \n",
+    "This is the expected tokenizer behavior for GPT models, namely GPTBPE Tokenizer, this GPTBPE tokenizer will load the vocab.json and merges.txt files and tokenize as expected. Whereas setting `use_gpt` to False, will result in a normal BPE Tokenizer, the tokenization will behave differently."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "substantial-strike",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from tokenizers import Tokenizer, models, pre_tokenizers, trainers\n",
+    "from tokenizers.decoders import ByteLevel as ByteLevelDecoder\n",
+    "from tokenizers.models import BPE\n",
+    "import json\n",
+    "\n",
+    "def load_tokenizer(vocab_file,merge_file, use_gpt):\n",
+    "    tokenizer = Tokenizer(BPE())\n",
+    "    tokenizer.model = BPE.from_file(vocab_file, merge_file)\n",
+    "    with open(vocab_file, 'r') as f2:\n",
+    "        vocab = json.loads(f2.read())  \n",
+    "    if use_gpt:\n",
+    "        tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()\n",
+    "        tokenizer.decoder = ByteLevelDecoder()\n",
+    "    return tokenizer , vocab\n",
+    "vocab_file='./gpt2-vocab.json'\n",
+    "merge_file='./gpt2-merges.txt'\n",
+    "tokenizers_gpt,_=load_tokenizer(vocab_file,merge_file,True)\n",
+    "sample_text=' Hello world' \n",
+    "output=tokenizers_gpt.encode(sample_text)\n",
+    "ids=output.ids\n",
+    "tokens=output.tokens\n",
+    "#print(tokens ,'\\n')\n",
+    "print(\"tokens: \",tokens)\n",
+    "print(\"ids:\",ids)\n",
+    "\n",
+    "tokenizers_bpe,_=load_tokenizer(vocab_file,merge_file, False)\n",
+    "sample_text=' Hello world'\n",
+    "output=tokenizers_bpe.encode(sample_text)\n",
+    "ids=output.ids\n",
+    "tokens=output.tokens\n",
+    "print(\"---\"*10)\n",
+    "print('\\nnotice the difference when using BPE as tokenizer instead of GPT2BPE tokenizer')\n",
+    "print(\"tokens: \",tokens)\n",
+    "print(\"ids:\",ids)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "blessed-prize",
+   "metadata": {},
+   "source": [
+    "Below is the expected outputs :\n",
+    "\n",
+    "        tokens:  ['ĠHello', 'Ġworld']\n",
+    "        ids: [18435, 995]\n",
+    "        ------------------------------\n",
+    "\n",
+    "        notice the difference when using BPE as tokenizer instead of GPT2BPE tokenizer\n",
+    "        tokens:  ['H', 'ellow', 'orld']\n",
+    "        ids: [39, 5037, 1764]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "vocal-conflict",
+   "metadata": {},
+   "source": [
+    "What did we observed ? \n",
+    "\n",
+    "Setting `use_gpt` to True will give us the expected behavor of GPTBPE tokenization. \n",
+    "\n",
+    "It will ensure the presence of Ġ : \n",
+    "\n",
+    "    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()\n",
+    "    tokenizer.decoder = ByteLevelDecoder()\n",
+    "\n",
+    "\n",
+    "Therefore, we will enforce having :\n",
+    "\n",
+    "    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()\n",
+    "    tokenizer.decoder = ByteLevelDecoder()\n",
+    "When training our own GPTBPETokenizer with our own raw text data.\n",
+    "    "
+   ]
+  },
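+  {
+   "cell_type": "markdown",
+   "id": "added-train-tokenizer-note",
+   "metadata": {},
+   "source": [
+    "As a preview, the sketch below shows what enforcing this could look like when training a byte-level BPE tokenizer with the HuggingFace `tokenizers` library. It is a minimal illustration rather than a script used in this lab; the corpus path `my_corpus.txt`, the vocabulary size and the special token are placeholders."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "added-train-tokenizer-sketch",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from tokenizers import Tokenizer, pre_tokenizers, trainers\n",
+    "from tokenizers.decoders import ByteLevel as ByteLevelDecoder\n",
+    "from tokenizers.models import BPE\n",
+    "\n",
+    "# Minimal sketch: train a GPT-style (byte-level) BPE tokenizer on a placeholder corpus.\n",
+    "# Assumptions: my_corpus.txt, the vocab size and the special token are placeholders.\n",
+    "tokenizer = Tokenizer(BPE())\n",
+    "tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()   # enforce byte-level (Ġ) behavior\n",
+    "tokenizer.decoder = ByteLevelDecoder()\n",
+    "trainer = trainers.BpeTrainer(vocab_size=32000,\n",
+    "                              special_tokens=[\"<|endoftext|>\"],\n",
+    "                              initial_alphabet=pre_tokenizers.ByteLevel.alphabet())\n",
+    "tokenizer.train(files=[\"my_corpus.txt\"], trainer=trainer)\n",
+    "tokenizer.save(\"my_gpt_tokenizer.json\")"
+   ]
+  },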
+  {
+   "cell_type": "markdown",
+   "id": "contemporary-lancaster",
+   "metadata": {},
+   "source": [
+    "We will now move the gpt-vocab.json and gpt2-merges.txt to the correct data folder as a preparation for the next step."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "flush-amazon",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!mv gpt2-vocab.json ../dataset/EN/50k/\n",
+    "!mv gpt2-merges.txt ../dataset/EN/50k/\n",
+    "!ls ../dataset/EN/50k/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "sapphire-horse",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "\n",
+    "## Links and Resources\n",
+    "Don't forget to check out additional resources such as [HuggingFace Tokenizer Documentation](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) and [Train GPT-2 in your own langauge](https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171).\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "creative-dressing",
+   "metadata": {},
+   "source": [
+    "-----\n",
+    "## <p style=\"text-align:center;border:3px; padding: 1em\"> <a href=../Start_Here.ipynb>HOME</a>&nbsp; &nbsp; &nbsp; <a href=./Lab1-5_jsonfy_and_process2mmap.ipynb>NEXT</a></p>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "offshore-greece",
+   "metadata": {},
+   "source": [
+    "-----\n",
+    "\n",
+    "\n",
+    "## Licensing \n",
+    "\n",
+    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

+ 321 - 0
ai/Megatron/English/Python/jupyter_notebook/Lab1-5_jsonfy_and_process2mmap.ipynb

@@ -0,0 +1,321 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "acting-adelaide",
+   "metadata": {},
+   "source": [
+    "## Jsonfy + convert to mmap\n",
+    "---\n",
+    "\n",
+    "## Learning Objectives\n",
+    "\n",
+    "The goal of this lab is to convert the raw data to Megatron-LM's raw text data to mmap format.\n",
+    "\n",
+    "In particular, we will cover the following steps :\n",
+    "\n",
+    "    1. Understand the need of preprocessing data to mmap format.\n",
+    "    2. Convert the raw text data into loose json format.\n",
+    "    3. Use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training.\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aquatic-parcel",
+   "metadata": {},
+   "source": [
+    "1. Understand the need of preprocessing data to mmap format."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "according-boston",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "out=np.random.random((1024,2048))\n",
+    "np.save('myarr',out)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "armed-smell",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "3.84 ms ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit \n",
+    "out=np.load('myarr.npy')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "dedicated-thirty",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "43 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "array = np.memmap(\"myarr.npy\", mode=\"r\",\n",
+    "                  dtype=np.int16, shape=(1024, 1024))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "pressing-boost",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "## clean up\n",
+    "!rm myarr.npy"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "parallel-university",
+   "metadata": {},
+   "source": [
+    "2. jsonfy the raw text data into loose json format.\n",
+    "\n",
+    "The preprocess_data.py is expecting to receive json format data. Hence we need to convert the raw text data to json format first.\n",
+    "It is assumed that the json format data, will have one element per document, and the 'text' field in the json data, it's value will be extracted in preprocess_data.py. Other fields can also be specified for extraction. \n",
+    "An example of how the json data should look like, is given by the following : \n",
+    "\n",
+    "    {\"src\": \"The Internet\", \"text\": \"jumps over the lazy dog\", \"type\": \"Eng\", \"id\": \"42\", \"title\": \"Second Part\"}\n"
+   ]
+  },
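+  {
+   "cell_type": "markdown",
+   "id": "added-loose-json-note",
+   "metadata": {},
+   "source": [
+    "The sketch below illustrates such a conversion. It is a minimal example rather than the bundled `create_loose_json.py` script used in the next cells, and it assumes one document per input line with only the `text` field filled in."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "added-loose-json-sketch",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "\n",
+    "# Minimal sketch of a raw-text -> loose-json conversion (assumption: one document per line).\n",
+    "def to_loose_json(infile, outfile):\n",
+    "    n = 0\n",
+    "    with open(infile) as fin, open(outfile, 'w') as fout:\n",
+    "        for line in fin:\n",
+    "            line = line.strip()\n",
+    "            if not line:\n",
+    "                continue\n",
+    "            fout.write(json.dumps({\"text\": line}) + \"\\n\")\n",
+    "            n += 1\n",
+    "    print(\"finished processing\", n, \"lines to loose json format\")"
+   ]
+  },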
+  {
+   "cell_type": "markdown",
+   "id": "legislative-provision",
+   "metadata": {},
+   "source": [
+    "We will now use the following python script to converting the raw text data into `extractedNVblogs.json` format as a preparation for the next step. \n",
+    "\n",
+    "\n",
+    "    python create_loose_json.py --help\n",
+    "        usage: create_loose_json.py [-h] [--infile INFILE] [--outfile OUTFILE]\n",
+    "\n",
+    "        optional arguments:\n",
+    "          -h, --help         show this help message and exit\n",
+    "          --infile INFILE    input file path\n",
+    "          --outfile OUTFILE  output file path"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "supported-budget",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "finished processing 71 lines to loose json format\n"
+     ]
+    }
+   ],
+   "source": [
+    "!python create_loose_json.py --infile ../dataset/EN/extractedNVblogs.txt --outfile ../dataset/EN/extractedNVblogs.json"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "split-samuel",
+   "metadata": {},
+   "source": [
+    "3. Use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training.\n",
+    "\n",
+    "We are now ready to feed `extractedNVblogs.json`  data to Megatron-LM's preprocess_data.py in order to further convert the data to mmap format.\n",
+    "\n",
+    "The following two code blocks will convert the `extractedNVblogs.json` to `NVblog_text_document.bin` and `NVblog_text_document.idx`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "traditional-income",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "INPUT_JSON_FILE='../dataset/EN/extractedNVblogs.json'\n",
+    "OUTPUT_PATH='../dataset/EN/NVblog'\n",
+    "VOCAB_FILE='../dataset/EN/50k/gpt2-vocab.json'\n",
+    "MERGE_FILE='../dataset/EN/50k/gpt2-merges.txt'\n",
+    "NUM_CPUS=16"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "residential-honor",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Opening ../dataset/EN/extractedNVblogs.json\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "> building GPT2BPETokenizer tokenizer ...\n",
+      "Vocab size: 50257\n",
+      "Output prefix: ../dataset/EN/NVblog\n",
+      "Time to startup: 0.1618051528930664\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n"
+     ]
+    }
+   ],
+   "source": [
+    "!python ./Megatron-LM/tools/preprocess_data.py \\\n",
+    "                       --input $INPUT_JSON_FILE \\\n",
+    "                       --output-prefix $OUTPUT_PATH \\\n",
+    "                       --json-keys text \\\n",
+    "                       --vocab-file $VOCAB_FILE \\\n",
+    "                       --merge-file $MERGE_FILE \\\n",
+    "                       --dataset-impl mmap \\\n",
+    "                       --tokenizer-type GPT2BPETokenizer \\\n",
+    "                       --workers $NUM_CPUS \\\n",
+    "                       --append-eod"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "viral-shopping",
+   "metadata": {},
+   "source": [
+    "Below is the expected outputs :\n",
+    "\n",
+    "                    Opening ./Megatron-LM/dataset/EN/extractedNVblogs.json\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    > building GPT2BPETokenizer tokenizer ...\n",
+    "                    Vocab size: 50257\n",
+    "                    Output prefix: ./Megatron-LM/dataset/EN/NVblogs\n",
+    "                    Time to startup: 0.5460700988769531\n",
+    "                     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
+    "                     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)"
+   ]
+  },
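+  {
+   "cell_type": "markdown",
+   "id": "added-output-check-note",
+   "metadata": {},
+   "source": [
+    "As an added sanity check (not part of the original run), the cell below lists the generated files; the file names follow from the output prefix and `--json-keys text` above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "added-output-check",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!ls -lh ../dataset/EN/NVblog_text_document.*"
+   ]
+  },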
+  {
+   "cell_type": "markdown",
+   "id": "virgin-hearts",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "\n",
+    "## Links and Resources\n",
+    "Don't forget to [Read More on MMAP](https://docs.python.org/3/library/mmap.html).\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "packed-panama",
+   "metadata": {},
+   "source": [
+    "-----\n",
+    "## <p style=\"text-align:center;border:3px; padding: 1em\"> <a href=../Start_Here.ipynb>HOME</a>&nbsp; &nbsp; &nbsp; <a href=./Lab1-6_Observe_GPT_runs_vs_performance.ipynb>NEXT</a></p>\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "lovely-tackle",
+   "metadata": {},
+   "source": [
+    "-----\n",
+    "\n",
+    "\n",
+    "## Licensing \n",
+    "\n",
+    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

File diff suppressed because it is too large
+ 324 - 0
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb