
implement feedback from reviewers

zenodia 3 years ago
parent
commit
1367c593aa

+ 25 - 10
ai/Megatron/English/Python/Start_Here.ipynb

@@ -6,10 +6,24 @@
    "source": [
     "# Megatron GPT Bootcamp\n",
     "\n",
-    "## Learning objectives\n",
+    "## Learning objectives"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This objective of this bootcamp is designed to onborad you with NVIDIA [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) in a step-wised manner. We will give you the necessary tools and knoweldge to kick-start training your own language model. \n",
+    "\n",
+    "More specifically, In Day 2, We will learn the default [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)'s workflow, highlighting :\n",
     "\n",
-    "This objective of the bootcamp is to first, help you quickly go through one time the default Magatron workflow to let you familiarize on how Megatron works, thereafter we will be focus on catering to the specifics of local langauge needs, in this case Swedish. We will give recommandations/advices which can be optionally applied to your workflow and include some practical, useful scripts to help you kick-start your own journey in training local langauge Megatron GPT2/3 models. \n",
+    "   - Given a fixed dataset ( measured by # of tokens ) calculate compute needs in order to plan training runs and request resources.\n",
+    "    \n",
+    "   - Understanding Megatron-LM's core engine - Model Parallel Unit, this is the key which enable the possibility to train model with up to 1 trillion parameters on a superPOD.\n",
+    "    \n",
+    "   - Profiling : as we scale, it is important to maintain the performance of GPUs utilization across multi-gpus or multi-node runs.\n",
     "\n",
+    "In Day 3, we will shift our focus on all the customization we need to incoporate into [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)'s workflow, in order to cater for local langauge needs, in this case Swedish. We will give recommandations which can be optionally applied to your workflow and include some practical, useful scripts to help you kick-start your own journey in training local langauge Megatron GPT2/3 models. \n",
     "\n",
     "* Standard: Python\n",
     "* Frameworks: Pytorch + Megatron-LM \n",
@@ -24,7 +38,7 @@
    "metadata": {},
    "source": [
     "---\n",
-    "## check how many GPUs you have and GPU Mem capacity \n",
+    "### Check # of GPUs you have and GPU memory capacity \n",
     "\n",
     "            Wed Aug 25 07:03:55 2021       \n",
     "        +-----------------------------------------------------------------------------+\n",
@@ -96,8 +110,8 @@
    "metadata": {},
    "source": [
     "---\n",
-    "## verify nvlink active \n",
-    "OUTPUT should look something simialr to the below -\n",
+    "### Verify NVlink is active \n",
+    "OUTPUT should look simialr to the below -\n",
     "\n",
     "        GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-b29deceb-3745-51d2-2cf3-807ea8ac8e60)\n",
     "             Link 0: 25.781 GB/s\n",
@@ -154,7 +168,7 @@
     }
    ],
    "source": [
-    "# verify nvlink status\n",
+    "### verify nvlink status\n",
     "!nvidia-smi nvlink --status"
    ]
   },
@@ -163,7 +177,7 @@
    "metadata": {},
    "source": [
     "---\n",
-    "## verify profiling capability \n",
+    "### Verify Profiling Capability \n",
     "OUTPUT should look something simialr to the below\n",
     "note that we want all environment check pass ( = OK or available )\n",
     "\n",
@@ -208,7 +222,7 @@
    "metadata": {},
    "source": [
     "---\n",
-    "## making placeholder folders for dataset"
+    "### Making Placeholder folders for dataset"
    ]
   },
   {
@@ -233,8 +247,9 @@
    "metadata": {},
    "source": [
     "---\n",
-    "# create your own data - web crawling \n",
-    "please go through the notebook [link here](./jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-Website_scrapping.ipynb) to scrape NVIDIA blog's data "
+    "## Create Your Own Data - Web Crawling \n",
+    "It is mandatory to fetch your own data via web crawling NVIDIA blogs webpages, extracting raw text from the webpage. \n",
+    "Please make sure you go through the notebook **[link here](./jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-Website_scrapping.ipynb)** to scrape raw text from NVIDIA blogs' webpages. "
    ]
   },
   {

+ 57 - 24
ai/Megatron/English/Python/jupyter_notebook/Day2-1_EstimateComputeDaysNeeded.ipynb

@@ -2,21 +2,22 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "proud-packet",
+   "id": "strong-match",
    "metadata": {},
    "source": [
-    "# \n",
-    "\n",
-    "# 1_Estimate compute hours/days needed to execute one end-to-end run\n",
+    "# Estimate compute hours/days needed to execute one end-to-end run\n",
     "---\n",
     "\n",
     "## Learning Objectives\n",
-    "- **The goal of this lab is to:**\n",
-    "Understand how reserve compute resource per given data volume + model configuration for a training run. This is important not only for cluster capacity planning as well as for strategic research planning ( how many end to end experiments one can run given the compute capacity and duration )\n",
+    "The goal of this lab is size the problem :\n",
+    "Understanding how to calculate hours/days needed in order to reserve compute resources for the training job per given existing data volume and desired model size. \n",
+    "It is important for both the admin in the compute cluster to do capacity forecasting and for researchers to plan their experiments strategically.\n",
+    "\n",
+    "- Extracting the formular from the paper [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf), per given [GPT3 variants](https://arxiv.org/pdf/2005.14165.pdf) based on assumed [Teraflops reference table](https://arxiv.org/pdf/2104.04473.pdf)\n",
+    "\n",
+    "- Understanding how to estimate compute resource needed per dataset volume ( measured in # of tokens ) and a chosen model size\n",
     "\n",
-    "    - Extracting the formular in from the paper [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf), per given [GPT3 variants](https://arxiv.org/pdf/2005.14165.pdf) based on assumed [Teraflops reference table](https://arxiv.org/pdf/2104.04473.pdf)\n",
-    "    - Understanding how to estimate compute needed per dataset volume ( measured in number of tokens ) \n",
-    "    - Apply to your own/ imagenary data volume and and compute cluster set-ups\n",
+    "- Apply to your own imagenary data volume and a figurative compute cluster set-ups\n",
     "---------------------------------------------------------------------------------------------------\n",
     "\n",
     "- assuming the following information \n",
@@ -25,6 +26,7 @@
     "- n = number of GPUs in the compute cluster\n",
     "- x = achieved teraflops per GPU \n",
     "\n",
+    "Training time (in seconds) is approximated with this equation : 8*T*P/n*X\n",
     "you will need the following tables from the above papers for the estimation \n",
     "\n",
     "<center><img src=\"./Megatron-LM/pics/GPT3_all.png\" width=\"700\"/></center>\n",
@@ -36,22 +38,33 @@
   },
   {
    "cell_type": "markdown",
-   "id": "political-oriental",
+   "id": "pregnant-basket",
    "metadata": {},
    "source": [
     "---\n",
-    "## let's do a sanity check \n",
-    "scenario 1 - Given 300Billion tokens , 1024 GPUs, with 175 Billion model parmeters , assuming 140 teraFLOP/s per GPU \n",
-    "we should observe around **34 days** for an end to end training run\n",
+    "## let's do a sanity check - \n",
+    "\n",
+    "**Assumption** : you have an existing dataset, you know the volume of the dataset ( measure in # of tokens )\n",
+    "\n",
+    "**Scenario 1** - Given 300Billion tokens, you want to train 175 Billion GPT3 model, you have access to 1024 GPUs, look up on the table above to fetch 140 teraFLOP/s per GPU\n",
+    "\n",
+    "Question : How many hours/ days will you need given the scenaio above for you to compute an end to end training job ?\n",
     "\n",
-    "scenario 2 - Given 450Billion tokens , 3072 GPUs, with 1 Trillion model parmeters , assuming 163 teraFLOP/s per GPU \n",
-    "we should observe around **84 days** for an end to end training run\n"
+    "Answer : We should observe around **34 days** for an end to end training run\n",
+    "\n",
+    "--\n",
+    "\n",
+    "**scenario 2** - You increase the data volume to 450 Billion tokens, you want to train a big model, say 1 Trillion parameters, you have access to 3072 GPUs, and fetching the 163 teraFLOP/s per GPU from the table above\n",
+    "\n",
+    "Question: How many hours/ days will you need given this scenaio above for you to compute an end to end training job ?\n",
+    "\n",
+    "Answer: We should observe around **84 days** for an end to end training run\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 16,
-   "id": "linear-collector",
+   "id": "played-broadcast",
    "metadata": {},
    "outputs": [
     {
@@ -71,12 +84,15 @@
    ],
    "source": [
     "import numpy as np\n",
+    "# T = dataset size measured in numbers of tokens in the dataset\n",
+    "# P = model parameters for GPT3 varients\n",
+    "# n = number of GPUs in the compute cluster\n",
+    "# x = achieved teraflops per GPU \n",
     "\n",
     "def calculate_days_needed(T , P , n ,x):\n",
     "    if x is None:\n",
     "        return 'not a good SuperPOD use case, let us try a bigger model :)'\n",
-    "    else:\n",
-    "        #x=140*1e+12 # TeraFlop/s per GPU\n",
+    "    else:        \n",
     "        tot=8*T*P\n",
     "        div=n*x\n",
     "        compute_sec=tot/div\n",
@@ -100,7 +116,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "fancy-hollywood",
+   "id": "ruled-score",
    "metadata": {},
    "source": [
     "---\n",
@@ -116,12 +132,11 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "interior-technology",
+   "id": "circular-northwest",
    "metadata": {
     "collapsed": true,
     "jupyter": {
-     "outputs_hidden": true,
-     "source_hidden": true
+     "outputs_hidden": true
     }
    },
    "outputs": [
@@ -146,7 +161,25 @@
   },
   {
    "cell_type": "markdown",
-   "id": "distinguished-electricity",
+   "id": "peaceful-colorado",
+   "metadata": {},
+   "source": [
+    "--- \n",
+    "\n",
+    "## Additional Resources\n",
+    "\n",
+    "Efficient Large-Scale Language Model Training on GPU Clusters : https://arxiv.org/pdf/2104.04473.pdf \n",
+    "\n",
+    "Language Models are Few-Shot Learners : https://arxiv.org/pdf/2005.14165.pdf\n",
+    "\n",
+    "Scaling Laws for Neural Language Models : https://arxiv.org/pdf/2001.08361.pdf\n",
+    "\n",
+    "<left><img src=\"./Megatron-LM/pics/data_loss_model_size_compute.JPG\" width=\"700\"/></center>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "northern-fellowship",
    "metadata": {},
    "source": [
     "---\n",
@@ -160,7 +193,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "brilliant-delta",
+   "id": "linear-culture",
    "metadata": {},
    "source": [
     "-----\n",

+ 52 - 59
ai/Megatron/English/Python/jupyter_notebook/Day2-2_MegatronFundementals.ipynb

@@ -2,16 +2,14 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "vocational-baseball",
+   "id": "prerequisite-disaster",
    "metadata": {},
    "source": [
-    "# \n",
-    "\n",
-    "#  2_Understanding Megatron core - mpu\n",
+    "#  Understanding Megatron-LM's core - MPU\n",
     "---\n",
-    "NVIDIA's Megatron-LM makes training very large langauge models ( up to a trillion params model) for own language a reality, Megatron-LM's core mpu ( model paralleism unit ) is the basis for all subsequence branching efforts on training very large models, such as [DeepSpeed](https://www.deepspeed.ai/features/#model-parallelism).\n",
+    "NVIDIA's Megatron-LM makes training very large langauge models ( up to one trillion parameters ) a reality, Megatron-LM's core MPU ( Model Paralleism Unit ) forms the base for all subsequence optimization efforts on training very large models, such as [DeepSpeed](https://www.deepspeed.ai/features/#model-parallelism).\n",
     "\n",
-    "However, the common side effect is that one could easily suffer from low GPUs utilization when configure Megatron GPT training runs with insufficient configuraitons.( see below an example traiing run with insufficient configurations that result in low or inconsistent gpu utils in training !\n",
+    "However, the common side effect is that, with a bad training configuration, one could easily suffer from low GPUs utilization( screenshots below as an example of a bad training configuration which resulted in low or inconsistent gpus utilization).\n",
     "\n",
     "Therefore, we will be taking our very first step of understanding how the core of Megatron-LM's mpu works and thereafter\n",
     "taking a closer look on  Megatron-LM's training runs performance. \n",
@@ -19,19 +17,20 @@
     "![a training run with low or inconsistent gpus utils example](./Megatron-LM/pics/naive_run.JPG)\n",
     "\n",
     "## Learning Objectives\n",
-    "- **The goal of this lab is to understand the core of Megatron's mpu = model parallelism unit works **\n",
-    "    - How Megatron group GPU-affinity per model parallel configuration ( pipeline parallel | tensor parallel )\n",
-    "    - Tensor Parallel : Column Parallel\n",
-    "    - Tensor Parallel : Row Parallel\n"
+    "The goal of this lab is to understand the core of Megatron-LM's mpu = model parallelism unit works\n",
+    "\n",
+    "- How Megatron-LM is groupping GPU-affinity per model parallel configuration ( pipeline parallel | tensor parallel )\n",
+    "- Tensor Parallel : Column Parallel\n",
+    "- Tensor Parallel : Row Parallel\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "democratic-disclosure",
+   "id": "peaceful-article",
    "metadata": {},
    "source": [
     "---------------------------------------------------------------------------\n",
-    "## How Megatron group GPU-affinity per model-data parallel configuration\n",
+    "## How Megatron-LM is groupping GPU-affinity per model-data parallel configuration\n",
     "\n",
     "Parallelism : Model & Data \n",
     "- p = Pipeline Model Parallel  \n",
@@ -39,7 +38,7 @@
     "- d = Data Parallal \n",
     "- n = Total number of GPUs used in the training\n",
     "\n",
-    "#### Megatron requires p * t * d = n\n",
+    "**Note : Megatron-LM requires p * t * d = n**\n",
     "\n",
     "Parallel group - grouping for torch distributed  (NCCL)\n",
     "- num_tensor_model_parallel_groups = n / t\n",
@@ -56,7 +55,7 @@
     "\n",
     "therefore, world_size=16  \n",
     "\n",
-    "then accoridng to Megatron [initializer.py](https://github.com/NVIDIA/Megatron-LM/blob/90e0a0dd08159e1c95f4f9d99bb8687f327d36c3/megatron/mpu/initialize.py) we should see the following ...\n",
+    "then accoridng to Megatron-LM [initializer.py](https://github.com/NVIDIA/Megatron-LM/blob/90e0a0dd08159e1c95f4f9d99bb8687f327d36c3/megatron/mpu/initialize.py) we should see the following ...\n",
     "\n",
     "    8 data_parallel groups:\n",
     "        [g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]\n",
@@ -65,8 +64,8 @@
     "    4 pipeline model-parallel groups:\n",
     "        [g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]\n",
     "\n",
-    "Note that for efficiency, the caller should make sure adjacent ranks\n",
-    "are on the same DGX box. \n",
+    "**Note that for efficiency, the caller should make sure adjacent ranks are on the same DGX box.**\n",
+    "\n",
     "For example if we are using 2 DGX-1 boxes\n",
     "with a total of 16 GPUs, rank 0 to 7 belong to the first box and\n",
     "ranks 8 to 15 belong to the second box."
@@ -75,7 +74,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "passive-dependence",
+   "id": "reflected-israeli",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -150,13 +149,12 @@
     "    print(\"Total :{} full models being partitioned into :{} GPUs \".format(len(_MODEL_PARALLEL_GROUP),world_size))\n",
     "    for idx, m in zip(range(len(_MODEL_PARALLEL_GROUP)),_MODEL_PARALLEL_GROUP):\n",
     "        m=[str(l) for l in m]\n",
-    "        print(\"model {} : is partitioned into gpus :{}\".format(str(idx),','.join(m)))   \n",
-    "\n"
+    "        print(\"model {} : is partitioned into gpus :{}\".format(str(idx),','.join(m)))   \n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "natural-genetics",
+   "id": "moving-strip",
    "metadata": {},
    "source": [
     "---\n",
@@ -177,7 +175,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "residential-action",
+   "id": "promotional-stack",
    "metadata": {},
    "outputs": [
     {
@@ -217,7 +215,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "likely-visiting",
+   "id": "loose-haven",
    "metadata": {},
    "outputs": [
     {
@@ -259,11 +257,11 @@
   },
   {
    "cell_type": "markdown",
-   "id": "bright-companion",
+   "id": "nearby-immunology",
    "metadata": {},
    "source": [
     "----------------------------------------------------------------------\n",
-    "## Megatron Column Parallel \n",
+    "## Megatron-LM's Column Parallel \n",
     "[ColumnParallel reference](https://github.com/NVIDIA/Megatron-LM/blob/90e0a0dd08159e1c95f4f9d99bb8687f327d36c3/megatron/mpu/layers.py#L201)\n",
     "![ColumnParallel](./Megatron-LM/pics/ColumnParallel.JPG)"
    ]
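The idea behind column parallelism can be verified numerically: split the weight matrix A column-wise across ranks, let every rank compute X @ A_i on the full input, and concatenate the partial outputs. A small numpy sketch (illustrative only; the real implementation is the ColumnParallelLinear class referenced above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))    # input activations
A = rng.standard_normal((8, 6))    # full weight matrix

# "2 GPUs": each rank holds half of A's columns and computes a partial output
A_parts = np.split(A, 2, axis=1)
Y_parts = [X @ A_i for A_i in A_parts]

# gathering (concatenating) the partial outputs reproduces the full X @ A
Y = np.concatenate(Y_parts, axis=1)
assert np.allclose(Y, X @ A)
```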
@@ -271,7 +269,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "closing-louisville",
+   "id": "consolidated-operation",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -291,7 +289,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "bizarre-remedy",
+   "id": "expressed-builder",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -394,7 +392,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "recent-cradle",
+   "id": "elect-detail",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -461,7 +459,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "divine-morris",
+   "id": "dress-proportion",
    "metadata": {},
    "source": [
     "## Peek inside Column Parallel Class"
@@ -470,7 +468,7 @@
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "sound-privacy",
+   "id": "german-method",
    "metadata": {},
    "outputs": [
     {
@@ -503,7 +501,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "continental-quarter",
+   "id": "ethical-secondary",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -514,7 +512,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "integral-cause",
+   "id": "republican-saint",
    "metadata": {},
    "outputs": [
     {
@@ -535,7 +533,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "impressive-watts",
+   "id": "sealed-eclipse",
    "metadata": {},
    "outputs": [
     {
@@ -555,11 +553,11 @@
   },
   {
    "cell_type": "markdown",
-   "id": "explicit-mother",
+   "id": "played-romantic",
    "metadata": {},
    "source": [
     "----------------------------------------------------------------------\n",
-    "## Megatron Row Parallel \n",
+    "## Megatron-LM's Row Parallel \n",
     "[RowParallel reference](https://github.com/NVIDIA/Megatron-LM/blob/90e0a0dd08159e1c95f4f9d99bb8687f327d36c3/megatron/mpu/layers.py#L294)\n",
     "![RowParallel](./Megatron-LM/pics/RowParallel.JPG)"
    ]
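Row parallelism is the complementary split: A is partitioned row-wise, the input X is partitioned column-wise, and the partial products are summed (an all-reduce in the real RowParallelLinear class). The same kind of numpy sketch, under the same illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))    # input activations
A = rng.standard_normal((8, 6))    # full weight matrix

# "2 GPUs": rank i holds a row block of A and the matching column block of X
A_parts = np.split(A, 2, axis=0)
X_parts = np.split(X, 2, axis=1)

# summing the partial products (the all-reduce step) reproduces the full X @ A
Y = sum(X_i @ A_i for X_i, A_i in zip(X_parts, A_parts))
assert np.allclose(Y, X @ A)
```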
@@ -567,7 +565,7 @@
   {
    "cell_type": "code",
    "execution_count": 53,
-   "id": "owned-pixel",
+   "id": "logical-union",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -674,27 +672,8 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 39,
-   "id": "latin-persian",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "16.0"
-      ]
-     },
-     "execution_count": 39,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": []
-  },
-  {
-   "cell_type": "code",
    "execution_count": 74,
-   "id": "sudden-tokyo",
+   "id": "annual-commonwealth",
    "metadata": {},
    "outputs": [
     {
@@ -744,7 +723,7 @@
   {
    "cell_type": "code",
    "execution_count": 72,
-   "id": "prompt-victoria",
+   "id": "complex-ultimate",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -755,7 +734,7 @@
   {
    "cell_type": "code",
    "execution_count": 58,
-   "id": "mysterious-bishop",
+   "id": "neither-johnson",
    "metadata": {},
    "outputs": [
     {
@@ -775,7 +754,21 @@
   },
   {
    "cell_type": "markdown",
-   "id": "effective-premises",
+   "id": "wired-contents",
+   "metadata": {},
+   "source": [
+    "--- \n",
+    "\n",
+    "## Additional Resources\n",
+    "\n",
+    "Efficient Large-Scale Language Model Training on GPU Clusters : https://arxiv.org/pdf/2104.04473.pdf \n",
+    "\n",
+    "Pushing Forward the Frontiers of Natural Language Processing : https://blogs.nvidia.com/blog/2021/09/16/nlp-frontiers-ai-hardware-summit/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "patent-caution",
    "metadata": {},
    "source": [
     "---\n",
@@ -787,7 +780,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "surprising-utilization",
+   "id": "together-paragraph",
    "metadata": {},
    "source": [
     "-----\n",

+ 35 - 20
ai/Megatron/English/Python/jupyter_notebook/Day2-3_GPT_vocab_merge_files.ipynb

@@ -2,21 +2,22 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "maritime-macro",
+   "id": "strong-deadline",
    "metadata": {},
    "source": [
-    "# \n",
+    "<img src=http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png style=\"width: 90px; float: right;\">\n",
     "\n",
-    "# 3_About GPT vocab and merge files\n",
+    "# About GPT vocab and merge files\n",
     "---\n",
     "\n",
     "## Learning Objectives\n",
-    "- **The goal of this lab is to:**\n",
-    "    - the difference between BPE and GPTBPE Tokenizer\n",
-    "    - load and verify GPTBPE Tokenizer can do tokenization as expected \n",
+    "The goal of this lab is to:\n",
     "\n",
+    "- the difference between BPE and GPTBPE Tokenizer\n",
+    "- load and verify GPTBPE Tokenizer can do tokenization as expected \n",
     "\n",
-    "Download the GPT vocab and merge files \n",
+    "\n",
+    "Note: Please proceed to download the GPT vocab and merge files \n",
     "\n",
     "Download vocab file [English_vocab](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json)\n",
     "\n",
@@ -25,7 +26,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "compliant-champion",
+   "id": "level-trigger",
    "metadata": {},
    "source": [
     "#### let's review the source code of [gpt2 tokenizer](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html)\n",
@@ -47,7 +48,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "thrown-aurora",
+   "id": "mysterious-favorite",
    "metadata": {},
    "outputs": [
     {
@@ -133,7 +134,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "critical-apparel",
+   "id": "shaped-mercury",
    "metadata": {},
    "outputs": [
     {
@@ -197,7 +198,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "circular-covering",
+   "id": "traditional-triangle",
    "metadata": {},
    "outputs": [
     {
@@ -215,7 +216,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "liked-reach",
+   "id": "disciplinary-journalist",
    "metadata": {},
    "source": [
     "## examine the vocab and merge files"
@@ -224,7 +225,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "bored-standing",
+   "id": "accomplished-builder",
    "metadata": {},
    "outputs": [
     {
@@ -250,7 +251,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "driven-coaching",
+   "id": "dependent-bridge",
    "metadata": {},
    "outputs": [
     {
@@ -271,7 +272,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "celtic-cheese",
+   "id": "imperial-setting",
    "metadata": {},
    "source": [
     "## sanity check load from transformer GPT2Tokenizer "
@@ -280,7 +281,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "handled-cooper",
+   "id": "corresponding-yugoslavia",
    "metadata": {},
    "outputs": [
     {
@@ -314,7 +315,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "electrical-performance",
+   "id": "overhead-philip",
    "metadata": {},
    "outputs": [
     {
@@ -373,7 +374,7 @@
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "funny-scheduling",
+   "id": "interpreted-termination",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -384,7 +385,21 @@
   },
   {
    "cell_type": "markdown",
-   "id": "placed-necessity",
+   "id": "vulnerable-outside",
+   "metadata": {},
+   "source": [
+    "--- \n",
+    "\n",
+    "## Additional Resources\n",
+    "\n",
+    "HuggingFace Tokenizer Documentation : https://huggingface.co/docs/tokenizers/python/latest/quicktour.html\n",
+    "\n",
+    "Train GPT-2 in your own langauge : https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aquatic-blanket",
    "metadata": {},
    "source": [
     "---\n",
@@ -398,7 +413,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "respected-class",
+   "id": "injured-pursuit",
    "metadata": {},
    "source": [
     "-----\n",

+ 29 - 19
ai/Megatron/English/Python/jupyter_notebook/Day2-4_jsonfy_and_process2mmap.ipynb

@@ -2,12 +2,10 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "dependent-chemistry",
+   "id": "fifteen-channel",
    "metadata": {},
    "source": [
-    "# \n",
-    "\n",
-    "# 4_Jsonfy and preprocess to mmap format for optimizing data loading\n",
+    "# Jsonfy and preprocess to mmap format for optimizing data loading\n",
     "---\n",
     "\n",
     "## Learning Objectives\n",
@@ -24,7 +22,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "square-louisville",
+   "id": "governing-country",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -36,7 +34,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "human-appliance",
+   "id": "mysterious-hazard",
    "metadata": {},
    "outputs": [
     {
@@ -55,7 +53,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "heard-baseball",
+   "id": "designed-swaziland",
    "metadata": {},
    "outputs": [
     {
@@ -75,7 +73,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "dynamic-nudist",
+   "id": "sublime-beaver",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -85,7 +83,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "soviet-jumping",
+   "id": "aggressive-victim",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -97,7 +95,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "acting-covering",
+   "id": "specific-scheduling",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -114,7 +112,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "organic-malaysia",
+   "id": "powered-cooking",
    "metadata": {},
    "outputs": [
     {
@@ -131,7 +129,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "rubber-absolute",
+   "id": "marine-packing",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -166,7 +164,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "lined-transfer",
+   "id": "complete-rebate",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -203,7 +201,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "marked-midnight",
+   "id": "resident-convert",
    "metadata": {},
    "outputs": [
     {
@@ -223,7 +221,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "adjustable-hammer",
+   "id": "integrated-bridges",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -236,7 +234,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "hidden-patrick",
+   "id": "supported-celtic",
    "metadata": {},
    "source": [
     "---\n",
@@ -271,7 +269,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "professional-lawyer",
+   "id": "familiar-victorian",
    "metadata": {},
    "outputs": [
     {
@@ -334,7 +332,19 @@
   },
   {
    "cell_type": "markdown",
-   "id": "valuable-equilibrium",
+   "id": "tropical-gathering",
+   "metadata": {},
+   "source": [
+    "--- \n",
+    "\n",
+    "## Additional Resources\n",
+    "\n",
+    "Read More on MMAP  : https://docs.python.org/3/library/mmap.html\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "solid-reminder",
    "metadata": {},
    "source": [
     "---\n",
@@ -348,7 +358,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "accurate-drinking",
+   "id": "occupational-ranking",
    "metadata": {},
    "source": [
     "-----\n",

+ 49 - 36
ai/Megatron/English/Python/jupyter_notebook/Day2-5_Observe_GPT_runs_vs_performance.ipynb

@@ -2,16 +2,14 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "crude-morning",
+   "id": "worth-conversation",
    "metadata": {},
    "source": [
-    "# \n",
-    "\n",
-    "# 5 Monitor GPT training performance with varying config\n",
+    "# Monitor GPT training performance with varying config\n",
     "---\n",
     "\n",
     "## Learning Objectives\n",
-    "- **The goal of this lab is to monitor the performance of your training runs with different GPT training configurations **\n",
+    "The goal of this lab is to monitor the performance of your training runs with different GPT training configurations \n",
     "    - motivation : why should we care ? \n",
     "    \n",
     "    Answer : bad config result in very low / inconsistent gpus utilizations which in turn, slow down training and therefore longer experiments per run, it's a lose-lose-lose situation on all sides.\n",
@@ -25,12 +23,12 @@
     "        - starts with multiGPUs \n",
     "    - challenge : beat the record !\n",
     "\n",
-    "it is possible to obtain more than 80% GPU utilizations overall with high tensorcore ops sustained throughout during **training** for all gpus \n"
+    "Note: it is possible to obtain more than 80% GPU utilizations overall with high tensorcore ops sustained throughout during **training** for all gpus.\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "linear-broad",
+   "id": "rough-penny",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -42,7 +40,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "olympic-usage",
+   "id": "dominican-candle",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -54,7 +52,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "bacterial-sunset",
+   "id": "therapeutic-pizza",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -64,9 +62,9 @@
     "<center><img src=\"./Megatron-LM/pics/Alt_callout2terminals.JPG\" width=\"600\"/></center>\n",
     "\n",
     "\n",
-    "         ----- launch profiling sessions to record: visualize on Nsight( please use Nsight Systems version >=2021.3.1 ) ----\n",
+    "         ----- launch profiling sessions to record: visualize on Nsight( please use Nsight Systems version >=2021.4.1 ) ----\n",
     "         \n",
-    "[Installing Nsight](https://developer.nvidia.com/gameworksdownload#?dn=nsight-systems-2021-3-1-54)\n",
+    "[Installing Nsight](https://developer.nvidia.com/gameworksdownload#?dn=nsight-systems-2021-4-1)\n",
     "\n",
     "[User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html)\n",
     "\n",
@@ -76,7 +74,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "brave-armstrong",
+   "id": "immune-chambers",
    "metadata": {},
    "source": [
     "---\n",
@@ -86,7 +84,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "russian-hydrogen",
+   "id": "false-imagination",
    "metadata": {},
    "outputs": [
     {
@@ -108,7 +106,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "joint-magnitude",
+   "id": "magnetic-uncertainty",
    "metadata": {},
    "source": [
     "---\n",
@@ -132,7 +130,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "aerial-tunnel",
+   "id": "utility-narrative",
    "metadata": {},
    "source": [
     "---\n",
@@ -143,7 +141,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "previous-gamma",
+   "id": "large-saint",
    "metadata": {},
    "outputs": [
     {
@@ -434,7 +432,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "static-shelf",
+   "id": "cognitive-harrison",
    "metadata": {},
    "outputs": [
     {
@@ -451,7 +449,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "handled-reference",
+   "id": "subject-genesis",
    "metadata": {},
    "source": [
     "---\n",
@@ -462,7 +460,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "incident-bridal",
+   "id": "delayed-renewal",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -472,7 +470,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "touched-broadcasting",
+   "id": "thorough-credits",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -504,7 +502,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "suffering-designer",
+   "id": "concrete-acrobat",
    "metadata": {},
    "source": [
     "---\n",
@@ -515,7 +513,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "complicated-wales",
+   "id": "serious-finance",
    "metadata": {},
    "outputs": [
     {
@@ -808,19 +806,18 @@
   },
   {
    "cell_type": "markdown",
-   "id": "checked-mobile",
+   "id": "piano-martial",
    "metadata": {},
    "source": [
     "--------------------------------------------------\n",
     "-----\n",
     "visualizing the profiles via nsight should look similar to the following \n",
-    "<center><img src=\"./Megatron-LM/pics/GPUs_utils_naive.JPG\" width=\"1000\"/></center>\n",
-    "<center><img src=\"./Megatron-LM/pics/multigpu_naive_run.jpg\" width=\"1000\"/></center>\n"
+    "<center><img src=\"./Megatron-LM/pics/GPUs_utils_naive.JPG\" width=\"1000\"/></center>\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "eight-consciousness",
+   "id": "satisfied-private",
    "metadata": {},
    "source": [
     "---\n",
@@ -848,7 +845,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "musical-bowling",
+   "id": "architectural-forum",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -857,7 +854,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "plain-behavior",
+   "id": "phantom-astronomy",
    "metadata": {},
    "source": [
     "---\n",
@@ -868,7 +865,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "induced-weekend",
+   "id": "varied-breakdown",
    "metadata": {},
    "outputs": [
     {
@@ -1152,7 +1149,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "breeding-favor",
+   "id": "neural-leisure",
    "metadata": {},
    "source": [
     "--------------------------------------------------\n",
@@ -1163,7 +1160,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "first-dublin",
+   "id": "muslim-channel",
    "metadata": {},
    "source": [
     "<a id=\"TheChallenge\"></a>"
@@ -1171,7 +1168,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "durable-annual",
+   "id": "toxic-consideration",
    "metadata": {},
    "source": [
     "----------------\n",
@@ -1189,8 +1186,8 @@
     " \n",
     "\n",
     "\n",
-    "task: modify the [profiling bash script](./Megatron-LM/profile_2nd_run.sh) and rerun \n",
-    "<a href=\"./Day2-5_Observe_GPT_runs_vs_performance.ipynb#Rerun_Cell\">Jump to ReRun Cell</a> \n",
+    "task: modify this --> [profiling bash script](./Megatron-LM/profile_2nd_run.sh) and rerun \n",
+    "<a href=\"./Day2-5_Observe_GPT_runs_vs_performance.ipynb#Rerun_Cell\">GO to ReRun Cell</a> \n",
     "monitor the training runs to get an overall >80% gpu utils in **training** runs \n",
     "\n",
     "```\n",
@@ -1210,7 +1207,23 @@
   },
   {
    "cell_type": "markdown",
-   "id": "multiple-wonder",
+   "id": "chemical-overview",
+   "metadata": {},
+   "source": [
+    "--- \n",
+    "\n",
+    "## Additional Resources\n",
+    "\n",
+    "NVIDIA Nsight Systems : https://docs.nvidia.com/nsight-systems/index.html\n",
+    "\n",
+    "NVTX Tutorial : https://developer.nvidia.com/blog/nvidia-tools-extension-api-nvtx-annotation-tool-for-profiling-code-in-python-and-c-c/\n",
+    "\n",
+    "Nsight Systems : https://developer.nvidia.com/blog/transitioning-nsight-systems-nvidia-visual-profiler-nvprof/\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "precise-baking",
    "metadata": {},
    "source": [
     "---\n",
@@ -1220,7 +1233,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "comprehensive-spirituality",
+   "id": "complete-birmingham",
    "metadata": {},
    "source": [
     "-----\n",

+ 39 - 27
ai/Megatron/English/Python/jupyter_notebook/Day3-3_train_own_GPT2BPETokenizer.ipynb

@@ -2,33 +2,31 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "coated-analyst",
+   "id": "respected-vintage",
    "metadata": {},
    "source": [
-    "# \n",
-    "\n",
     "# Train your own GPT compatible Tokenzer and obtain vocab.json & merges.txt\n",
     "---\n",
     "\n",
     "## Learning Objectives\n",
-    "- **The goal of this lab is to show you how to train your own GPTBPE tokenizer on your own raw text data **\n",
-    "    - train your own GPT compatible tokenizer given own text data in own langauge\n",
-    "        1. option 1 - load from pretrained vocab and merge files, and fit to the new corpus \n",
-    "        2. option 2 - train a GPT compatible tokenizer from scratch\n",
+    "The goal of this lab is to demonstrate how to train your own GPTBPE tokenizer on your own raw text data \n",
+    "\n",
+    "- train your own GPT compatible tokenizer given own text data in own langauge\n",
+    "    1. option 1 - load from pretrained vocab and merge files, and fit to the new corpus \n",
+    "    2. option 2 - train a GPT compatible tokenizer from scratch\n",
     "\n",
     "we will elaborate how to train your own GPT compatible tokenizer and obtain vocab and merge files\n",
+    "\n",
     "we will be using HuggingFace's ByteLevel BPE Tokenizer and trainer to complete this task\n",
     "\n",
     "--------------------------------------------------------------------------------------------------------------------\n",
-    "we need to install the [HuggingFace Tokenizer library](https://huggingface.co/transformers/installation.html)\n",
-    "\n",
-    "!pip install tokenizers"
+    "First of all, we need to install the [HuggingFace Tokenizer library](https://huggingface.co/transformers/installation.html)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "dental-drama",
+   "id": "transsexual-republican",
    "metadata": {},
    "outputs": [
     {
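For reference, option 2 (training from scratch) boils down to a few lines with HuggingFace's ByteLevelBPETokenizer; the notebook itself drives this through trainGPTTokenizer.py, so the corpus path, vocab size and output directory below are illustrative assumptions:

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["my_corpus.txt"],            # path to your raw text corpus (illustrative)
    vocab_size=32000,                   # illustrative vocab size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],   # minimal special token for GPT-style models
)
# writes vocab.json and merges.txt into the given directory
tokenizer.save_model("./my_tokenizer")
```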
@@ -47,7 +45,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "alive-spirituality",
+   "id": "golden-retailer",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -58,11 +56,11 @@
   },
   {
    "cell_type": "markdown",
-   "id": "communist-shooting",
+   "id": "accepting-simon",
    "metadata": {},
    "source": [
     "-------------------------------------------------------------------------------\n",
-    "## how to use the python script below -       \n",
+    "## How to use the python script below -       \n",
     "  trainGPTTokenizer.py [-h] \n",
     "\n",
     "        optional arguments:\n",
@@ -79,12 +77,12 @@
   },
   {
    "cell_type": "markdown",
-   "id": "broken-texas",
+   "id": "latin-netscape",
    "metadata": {},
    "source": [
     "---\n",
-    "## load_pretrained vocab and merge files into the trainer and then train on new txt\n",
-    "#### OUTPUT should be similar to the below ---\n",
+    "## Load pretrained vocab and merge files into the trainer and then train on new txt\n",
+    "#### OUTPUT should look similar to the below ---\n",
     "        \n",
     "        loading gpt2bpe english vocab and merge \n",
     "        include minimal special token end of text \n",
@@ -106,7 +104,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "vulnerable-flour",
+   "id": "visible-setup",
    "metadata": {},
    "outputs": [
     {
@@ -279,7 +277,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "broad-bradley",
+   "id": "analyzed-pacific",
    "metadata": {},
    "outputs": [
     {
@@ -297,12 +295,12 @@
   },
   {
    "cell_type": "markdown",
-   "id": "small-billion",
+   "id": "brown-pickup",
    "metadata": {},
    "source": [
     "---\n",
-    "## train completely from scratch with the raw txt to obtain vocab.json and merges.txt files\n",
-    "#### OUTPUT should be similar to the below ---\n",
+    "## Train completely from scratch with the raw txt to obtain vocab.json and merges.txt files\n",
+    "#### OUTPUT should look similar to the below ---\n",
     "    include minimal special token end of text \n",
     "    [00:00:00] Pre-processing files (914 Mo)            ░░░░░░░░                  0%\n",
     "    [00:00:02] Pre-processing files (914 Mo)            ░░░░░░░░                  1%\n",
@@ -321,7 +319,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "civil-waste",
+   "id": "pointed-toner",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -332,7 +330,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "civilian-addition",
+   "id": "north-reality",
    "metadata": {},
    "outputs": [
     {
@@ -485,7 +483,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "sunrise-sound",
+   "id": "wireless-galaxy",
    "metadata": {},
    "outputs": [
     {
@@ -503,7 +501,21 @@
   },
   {
    "cell_type": "markdown",
-   "id": "innocent-helping",
+   "id": "requested-sphere",
+   "metadata": {},
+   "source": [
+    "--- \n",
+    "\n",
+    "## Additional Resources\n",
+    "\n",
+    "HuggingFace Tokenizer Documentation : https://huggingface.co/docs/tokenizers/python/latest/quicktour.html\n",
+    "\n",
+    "Train GPT-2 in your own langauge : https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "contrary-quantum",
    "metadata": {},
    "source": [
     "---\n",
@@ -517,7 +529,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "normal-tomato",
+   "id": "sized-google",
    "metadata": {},
    "source": [
     "-----\n",

+ 67 - 75
ai/Megatron/English/Python/jupyter_notebook/Day3-4_customize_process2mmap.ipynb

@@ -2,71 +2,70 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "cooked-laser",
+   "id": "increased-rehabilitation",
    "metadata": {},
    "source": [
-    "# \n",
-    "\n",
-    "# 3_Jsonfy and preprocess to mmap format for optimizing data loading\n",
+    "# Jsonfy and preprocess to mmap format for optimizing data loading\n",
     "---\n",
     "\n",
     "## Learning Objectives\n",
-    "- **The goal of this lab is to re-use the knowledge we gain from Day 2 and utilize that knowledge to jsonfy our cleaned text data and convert to mmap format**\n",
-    "    -  jsonfy the extract webnyheter2013.txt --> webnyheter2013.json\n",
-    "    -  mini-challenge - \n",
-    "        - task 1 : integrate a customized sentence splitter into the preprocess_data.py, let's call it MYpreprocess_data.py  \n",
-    "                        \n",
-    "                        ---------the customized sentence splitter is this function below ------------\n",
-    "        \n",
-    "                            import re\n",
-    "                            import nltk\n",
-    "                            from nltk.tokenize import sent_tokenize\n",
-    "                            def normal_cut_sentence(temp):\n",
-    "                                return sent_tokenize(temp)\n",
+    "The goal of this lab is to re-use the knowledge we gain from Day 2 and utilize that knowledge to jsonfy our cleaned text data and convert to mmap format\n",
+    "\n",
+    "-  convert the extract webnyheter2013.txt --> webnyheter2013.json\n",
+    "-  mini-challenge - \n",
+    "    - task 1 : integrate a customized sentence splitter into the preprocess_data.py, let's call it MYpreprocess_data.py  \n",
+    "\n",
+    "                    ---------the customized sentence splitter is this function below ------------\n",
+    "\n",
+    "                        import re\n",
+    "                        import nltk\n",
+    "                        from nltk.tokenize import sent_tokenize\n",
+    "                        def normal_cut_sentence(temp):\n",
+    "                            return sent_tokenize(temp)\n",
+    "\n",
+    "                        def cut_sentence_with_quotation_marks(text):\n",
+    "                            p = re.compile(\"“.*?”\")\n",
+    "                            list = []\n",
+    "                            index = 0\n",
+    "                            length = len(text)\n",
+    "                            for i in p.finditer(text):\n",
+    "                                temp = ''\n",
+    "                                start = i.start()\n",
+    "                                end = i.end()\n",
+    "                                for j in range(index, start):\n",
+    "                                    temp += text[j]\n",
+    "                                if temp != '':\n",
+    "                                    temp_list = normal_cut_sentence(temp)\n",
+    "                                    list += temp_list\n",
+    "                                temp = ''\n",
+    "                                for k in range(start, end):\n",
+    "                                    temp += text[k]\n",
+    "                                if temp != ' ':\n",
+    "                                    list.append(temp)\n",
+    "                                index = end\n",
+    "                            return list\n",
     "\n",
-    "                            def cut_sentence_with_quotation_marks(text):\n",
-    "                                p = re.compile(\"“.*?”\")\n",
-    "                                list = []\n",
-    "                                index = 0\n",
-    "                                length = len(text)\n",
-    "                                for i in p.finditer(text):\n",
-    "                                    temp = ''\n",
-    "                                    start = i.start()\n",
-    "                                    end = i.end()\n",
-    "                                    for j in range(index, start):\n",
-    "                                        temp += text[j]\n",
-    "                                    if temp != '':\n",
-    "                                        temp_list = normal_cut_sentence(temp)\n",
-    "                                        list += temp_list\n",
-    "                                    temp = ''\n",
-    "                                    for k in range(start, end):\n",
-    "                                        temp += text[k]\n",
-    "                                    if temp != ' ':\n",
-    "                                        list.append(temp)\n",
-    "                                    index = end\n",
-    "                                return list\n",
-    "        \n",
-    "        task 2 : succesfully run the new MYpreprocess_data.py on example dataset **./Megatron-LM/dataset/EN/extractedNVblogs.json** \n",
+    "    task 2 : succesfully run the new MYpreprocess_data.py on example dataset **./Megatron-LM/dataset/EN/extractedNVblogs.json** \n",
     "\n",
-    "            hint : open ./Megatron-LM/tools/MYpreprocess_data.py and modify the class between L55 -L91 in \n",
-    "             \n",
-    "        ![modify the class below](./Megatron-LM/pics/customize_preprocess_data_script.JPG)\n",
-    "            \n"
+    "        hint : open ./Megatron-LM/tools/MYpreprocess_data.py and modify the class between L55 -L91 in \n",
+    "\n",
+    "    ![modify the class below](./Megatron-LM/pics/customize_preprocess_data_script.JPG)\n",
+    "\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "incorporate-episode",
+   "id": "better-palmer",
    "metadata": {},
    "source": [
     "---\n",
-    "## jsonfy the extract webnyheter2013.txt --> webnyheter2013.json"
+    "## Jsonfy the extract webnyheter2013.txt --> webnyheter2013.json"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "whole-compensation",
+   "id": "flush-monitoring",
    "metadata": {},
    "outputs": [
     {
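The jsonfy step itself amounts to writing one JSON object per line with a "text" field, which is the loose-JSON layout preprocess_data.py reads by default; a minimal sketch with illustrative paths:

```python
import json

# input/output paths are illustrative; adjust to where webnyheter2013.txt actually lives
with open("webnyheter2013.txt", "r", encoding="utf-8") as src, \
     open("webnyheter2013.json", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.strip()
        if line:
            # loose JSON: one {"text": ...} document per line
            dst.write(json.dumps({"text": line}, ensure_ascii=False) + "\n")
```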
@@ -86,17 +85,17 @@
   },
   {
    "cell_type": "markdown",
-   "id": "valued-insulation",
+   "id": "retired-wellington",
    "metadata": {},
    "source": [
     "---\n",
-    "# run through the default preprocess_data.py to obtain the xxx.idx and xxx.bin files"
+    "## Run through the default preprocess_data.py to obtain the xxx.idx and xxx.bin files"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "sonic-router",
+   "id": "entitled-burner",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -110,7 +109,7 @@
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "accurate-democracy",
+   "id": "measured-protest",
    "metadata": {},
    "outputs": [
     {
@@ -12663,7 +12662,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "completed-aquarium",
+   "id": "rational-charlotte",
    "metadata": {},
    "source": [
     "---\n",
@@ -12673,7 +12672,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "wrong-delivery",
+   "id": "direct-combining",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -12683,7 +12682,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "innocent-contributor",
+   "id": "labeled-vegetation",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -12718,7 +12717,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "acquired-examination",
+   "id": "civic-training",
    "metadata": {},
    "source": [
     "----------------\n",
@@ -12737,7 +12736,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "wired-adult",
+   "id": "aerial-incentive",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -12951,7 +12950,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "interpreted-london",
+   "id": "numerical-jason",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -12964,21 +12963,22 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ready-happening",
+   "id": "fifth-foster",
    "metadata": {},
    "source": [
     "---\n",
-    "## below is a ReRun cell overwrite MYpreprocess_data.py\n",
+    "## Below is a ReRun cell to overwrite MYpreprocess_data.py\n",
+    "\n",
     "Run the below MYpreprocess_data.py successfully to obtain CustomSentenceSplitter_text_document.idx and CustomSentenceSplitter_text_document.idx file\n",
     "<a id=\"Rerun_Cell\"></a>\n",
     "\n",
-    "Go to cell above to modify MYpreprocess_data.py <a href=\"./Day3-4_customize_process2mmap.ipynb#MODIFY_CELL\">Jump to Modify MYpreprocess_data.py</a> "
+    "Go back and modify MYpreprocess_data.py, shortcut link here--> <a href=\"./Day3-4_customize_process2mmap.ipynb#MODIFY_CELL\">Jump to Modify MYpreprocess_data.py</a> "
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "identical-rugby",
+   "id": "japanese-stations",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -12997,7 +12997,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "hearing-algorithm",
+   "id": "broad-legend",
    "metadata": {
     "collapsed": true,
     "jupyter": {
@@ -25557,7 +25557,7 @@
   {
    "cell_type": "code",
    "execution_count": 13,
-   "id": "happy-violation",
+   "id": "above-scholarship",
    "metadata": {
     "collapsed": true,
     "jupyter": {
@@ -25584,18 +25584,10 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
-   "id": "incorrect-apache",
+   "execution_count": null,
+   "id": "patient-haiti",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "rm: cannot remove './Megatron-LM/tools/MYpreprocess_data.py': No such file or directory\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "## clean up\n",
     "!rm ./Megatron-LM/tools/MYpreprocess_data.py"
@@ -25603,7 +25595,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "according-muscle",
+   "id": "embedded-excerpt",
    "metadata": {},
    "source": [
     "---\n",
@@ -25617,7 +25609,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "reverse-subcommittee",
+   "id": "smart-snake",
    "metadata": {},
    "source": [
     "-----\n",

+ 51 - 36
ai/Megatron/English/Python/jupyter_notebook/Day3-5_run_Megatron_with_varying_config.ipynb

@@ -2,17 +2,15 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "charged-allen",
+   "id": "theoretical-diary",
    "metadata": {},
    "source": [
-    "# \n",
-    "\n",
-    "# 5 Monitor GPT training performance with varying config\n",
+    "# Monitor GPT training performance with varying config\n",
     "---\n",
     "\n",
     "## **Challenge ** - Go big or go home !\n",
     "- prerequisites : \n",
-    "    - use your current given # of gpus \n",
+    "    - use your current given # of gpus\n",
     "    - do NOT changing the following parameters **--train-samples 100 **\n",
     "    - you cannot go OOM \n",
     "    - you must sustain >60% GPUs utilization in the **training** phase \n",
@@ -20,29 +18,29 @@
     "\n",
     "\n",
     "- task : \n",
-    "        given the above constraints, train as BIG GPT model as possible\n",
+    "        given the above prerequisites, train as BIG a GPT model as possible\n",
     "\n",
     "\n",
     "\n",
     "- winning criteria : the biggest model wins given the above constraints(=prerequisites).\n",
     "\n",
-    "    <a href=\"./Day3-5_run_Megatron_with_varying_config.ipynb#Rerun_Cell\">Jump to ReRun Cell</a> \n",
+    "    Go directly to the challenge , link here --> <a href=\"./Day3-5_run_Megatron_with_varying_config.ipynb#Rerun_Cell\">Jump to ReRun Cell</a> \n",
     "\n",
     "```\n",
-    "                                #### the follow params are allowed to change \n",
-    "                                WORLD_SIZE=8 # <--- remember to change the number of GPUs you actually have in your system\n",
-    "                                GPUS_PER_NODE=8 # <--- remember to change the number of GPUs you actually have in your system\n",
+    "                        #### the follow params are allowed to change \n",
+    "                        WORLD_SIZE=8 # <--- remember to change the number of GPUs you actually have in your system\n",
+    "                        GPUS_PER_NODE=8 # <--- remember to change the number of GPUs you actually have in your system\n",
     "\n",
-    "                                TENSOR_MP_SIZE=8\n",
-    "                                PIPELINE_MP_SIZE=1\n",
-    "                                LYS=32\n",
-    "                                HIDDEN_SZ=2048\n",
-    "                                NUM_ATTN_HEADS=32\n",
-    "                                MICRO_BZ=\n",
-    "                                GLOBAL_BZ=\n",
-    "                                SEQ_LEN=\n",
-    "                                MAX_POS_EM=\n",
-    "                                #### ---------------------------#### \n",
+    "                        TENSOR_MP_SIZE=8\n",
+    "                        PIPELINE_MP_SIZE=1\n",
+    "                        LYS=32\n",
+    "                        HIDDEN_SZ=2048\n",
+    "                        NUM_ATTN_HEADS=32\n",
+    "                        MICRO_BZ=\n",
+    "                        GLOBAL_BZ=\n",
+    "                        SEQ_LEN=\n",
+    "                        MAX_POS_EM=\n",
+    "                        #### ---------------------------#### \n",
     "``` \n",
     "                                ----------------------------For your reference --------------------------\n",
     "<center><img src=\"./Megatron-LM/pics/GPT3_all.png\" width=\"700\"/></center>"
@@ -50,7 +48,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "continuing-passport",
+   "id": "gross-macro",
    "metadata": {},
    "source": [
     "---\n",
@@ -62,7 +60,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "corrected-bacteria",
+   "id": "human-classics",
    "metadata": {},
    "source": [
     "---\n",
@@ -74,7 +72,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "dramatic-opinion",
+   "id": "finnish-think",
    "metadata": {},
    "source": [
     "<a id=\"Rerun_Cell\"></a>"
@@ -83,7 +81,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "massive-industry",
+   "id": "saving-locator",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -93,7 +91,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "understood-swimming",
+   "id": "adverse-genetics",
    "metadata": {},
    "outputs": [
     {
@@ -175,11 +173,11 @@
   },
   {
    "cell_type": "markdown",
-   "id": "monetary-trial",
+   "id": "ecological-delaware",
    "metadata": {},
    "source": [
     "---\n",
-    "## check how big is your model - \n",
+    "## Check how big is your model - \n",
     "modify the parameters in the [params_cnt.sh](./params_cnt.sh)\n",
     "I got 6 Billion :)  what about you ?"
    ]
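
params_cnt.sh itself is not shown in this diff; as a rough cross-check, the count it reports can be approximated with the standard GPT parameter formula (roughly 12·L·h² plus embedding terms, along the lines of the approximation in the Megatron-LM scaling paper). The values below are placeholders, not the notebook's actual settings:

```python
# Rough GPT parameter-count estimate:
#   P ~= 12 * L * h^2 * (1 + 13/(12h) + (V + S)/(12 * L * h))
# All values are placeholders - mirror whatever you set in params_cnt.sh.
LYS       = 32      # number of transformer layers
HIDDEN_SZ = 4096    # hidden size
VOCAB_SZ  = 56000   # (padded) vocabulary size, assumed for illustration
SEQ_LEN   = 512     # sequence length / positional embeddings

params = 12 * LYS * HIDDEN_SZ**2 * (1 + 13 / (12 * HIDDEN_SZ)
                                      + (VOCAB_SZ + SEQ_LEN) / (12 * LYS * HIDDEN_SZ))
print(f"approx. {params / 1e9:.2f} billion parameters")
```
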
@@ -187,7 +185,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "afraid-promise",
+   "id": "latin-granny",
    "metadata": {},
    "outputs": [
     {
@@ -205,7 +203,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "portuguese-freedom",
+   "id": "widespread-sunset",
    "metadata": {},
    "source": [
     "---\n",
@@ -229,13 +227,15 @@
   },
   {
    "cell_type": "markdown",
-   "id": "competent-romania",
+   "id": "standing-change",
    "metadata": {},
    "source": [
     "---\n",
     "# Re-run this cell below to get an even bigger GPT model\n",
-    "## remember to modify the [params count](./params_cnt.sh) to check how big is your model\n",
-    "## click the below to go back to Modify the profile_SVGPT_BIG.sh \n",
+    "\n",
+    "Remember to modify the [params count](./params_cnt.sh) to check how big is your model\n",
+    "\n",
+    "Jump back and mdify the profile_SVGPT_BIG.sh, click here --> \n",
     "<a href=\"./Day3-5_run_Megatron_with_varying_config.ipynb#MODIFY_CELL\">Jump back to modify and overwrite profile_SVGPT_BIG.sh </a> \n",
     "<a id=\"Rerun_Cell\"></a>"
    ]
@@ -243,7 +243,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "injured-pasta",
+   "id": "novel-campbell",
    "metadata": {},
    "outputs": [
     {
@@ -550,16 +550,31 @@
   },
   {
    "cell_type": "markdown",
-   "id": "entertaining-transparency",
+   "id": "center-microphone",
+   "metadata": {},
+   "source": [
+    "--- \n",
+    "\n",
+    "## Additional Resources\n",
+    "\n",
+    "Language Models are Few-Shot Learners : https://arxiv.org/pdf/2005.14165.pdf\n",
+    "\n",
+    "Efficient Large-Scale Language Model Training on GPU Clusters : https://arxiv.org/pdf/2104.04473.pdf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "basic-horror",
    "metadata": {},
    "source": [
-    "## Remember to copy and paste your output on Slack or Zoom\n",
+    "---\n",
+    "\n",
     "## Congratulations on completing the mission !\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "hidden-minister",
+   "id": "recovered-pleasure",
    "metadata": {},
    "source": [
     "-----\n",

binary
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/pics/data_loss_model_size_compute.JPG


+ 17 - 16
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-1_acquiring_data.ipynb

@@ -2,29 +2,30 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "ahead-surrey",
+   "id": "above-newark",
    "metadata": {},
    "source": [
-    "# \n",
-    "\n",
     "# Fetch Swedish data on your own \n",
     "---\n",
     "\n",
     "## Due to GDPR , we are not allowed to provide data to attendees in the bootcamp !\n",
-    "- **Therefore, we kindly ask you to fetch and preprocess the data on your own following the steps given below-  **\n",
+    "Therefore, we kindly ask you to fetch and preprocess this publically accessible dataset on your own,\n",
+    "following the steps given below -\n",
     "    - wget språkbank [webnyheter2013](http://spraakbanken.gu.se/lb/resurser/meningsmangder/webbnyheter2013.xml.bz2)\n",
     "    - wget språkbank [provided script](https://raw.githubusercontent.com/spraakbanken/sb-nltk-tools/master/sb_corpus_reader.py) to extract the data\n",
     "    - use the function below to extract the xml file into raw txt file\n",
     "    - (additional advice) discuss filtering on number of sentences per document and number of tokens per sentence \n",
     "        - example for English GPT training, it is recommand to check on the stats of your raw data and come-up with a good rule-of-thumb to proceed filtering, it is,however,recommanded to look into language-specific cleaning and follow up robust clearning procedure to obtain quality corpus.\n",
     "\n",
-    "in this notebook, we will embrace the method provided by the [Megatron-LM repo](https://github.com/NVIDIA/Megatron-LM/tree/main/tools/openwebtext) as well as introduce other complimenting methods that might be of interest for data cleaning.\n",
-    "\n"
+    "---\n",
+    "## About the data source -\n",
+    "This data belongs to Språkbanken, Språkbanken Text is a research unit and part of the National Language Bank, a national e-infrastructure to support research based on linguistic data.\n",
+    "[read more about språkbank](https://spraakbanken.gu.se/om)\n"
    ]
   },
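
The extraction helper referred to above lives in code cells this hunk does not show. For orientation, a minimal sketch of turning the downloaded Språkbanken XML into a raw txt file could look like the following; it assumes the corpus nests `<w>` word tokens under `<sentence>` elements (an assumption about the XML layout), whereas the notebook itself relies on the downloaded sb_corpus_reader.py:

```python
# Minimal sketch: extract raw text from a Språkbanken-style XML corpus.
# Assumption: tokens live in <w> elements nested under <sentence> elements.
import bz2
import xml.etree.ElementTree as ET

def xml_to_txt(xml_path: str, out_path: str) -> None:
    # the corpus ships as .bz2; decompress on the fly if needed
    opener = bz2.open if xml_path.endswith(".bz2") else open
    with opener(xml_path, "rb") as f, open(out_path, "w", encoding="utf-8") as out:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == "sentence":
                words = [w.text for w in elem.iter("w") if w.text]
                if words:
                    out.write(" ".join(words) + "\n")
                elem.clear()  # keep memory use flat on large corpora

# xml_to_txt("webbnyheter2013.xml.bz2", "webnyheter2013.txt")  # hypothetical file names
```
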
   {
    "cell_type": "markdown",
-   "id": "exterior-avatar",
+   "id": "prepared-shield",
    "metadata": {},
    "source": [
     "--------------------------------------------------------------------------------------------------------------------\n",
@@ -34,7 +35,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "yellow-happening",
+   "id": "painted-broadway",
    "metadata": {},
    "outputs": [
     {
@@ -66,7 +67,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "happy-spectrum",
+   "id": "unavailable-munich",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -76,7 +77,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "modular-helmet",
+   "id": "fantastic-throat",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -86,7 +87,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "roman-strap",
+   "id": "saved-potter",
    "metadata": {},
    "outputs": [
     {
@@ -104,7 +105,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "moderate-newfoundland",
+   "id": "pleasant-estimate",
    "metadata": {},
    "outputs": [
     {
@@ -132,7 +133,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "annoying-topic",
+   "id": "psychological-measure",
    "metadata": {},
    "outputs": [
     {
@@ -181,7 +182,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "exterior-episode",
+   "id": "conservative-yacht",
    "metadata": {},
    "outputs": [
     {
@@ -198,7 +199,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "impaired-sierra",
+   "id": "thousand-volleyball",
    "metadata": {},
    "source": [
     "---\n",
@@ -212,7 +213,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "junior-washington",
+   "id": "similar-battery",
    "metadata": {},
    "source": [
     "-----\n",

+ 117 - 97
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-2_SentenceBoundary_and_Deduplicate.ipynb

@@ -2,39 +2,38 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "cognitive-explanation",
+   "id": "guided-latin",
    "metadata": {},
    "source": [
-    "# \n",
-    "\n",
     "# About data cleaning for own language \n",
     "---\n",
     "\n",
     "## Learning Objectives\n",
-    "- **The goal of this lab is to go through some basic data cleansing methods which should be cautiously applied to own langauge dataset  **\n",
-    "    - langauge detection and filtering \n",
-    "    - finding sentence boundary, and give some examples :\n",
-    "    it is importance to be able to find good sentence boundary per given document, see [Megatron-LM/tools/preprocess_data.py](https://github.com/NVIDIA/Megatron-LM/blob/main/tools/preprocess_data.py#L84)\n",
-    "        1. [sentence-splitter](https://github.com/mediacloud/sentence-splitter)\n",
-    "        2. [NLTK](https://github.com/nltk/nltk)\n",
-    "        3. write your own sentence splitter, a home-made example\n",
-    "    - deduplicate documents based on similarity score\n",
-    "    - (additional advice) discuss filtering on number of sentences per document and number of tokens per sentence \n",
-    "        - example for English GPT training, it is recommand to check on the stats of your raw data and come-up with a good rule-of-thumb to proceed filtering, it is,however,recommanded to look into language-specific cleaning and follow up robust clearning procedure to obtain quality corpus.\n",
+    "The goal of this lab is to go through some basic data cleansing methods which should be cautiously applied to own langauge dataset\n",
+    "\n",
+    "- langauge detection  \n",
+    "- finding sentence boundary, and give some examples :\n",
+    "it is importance to be able to find good sentence boundary per given document, see [Megatron-LM/tools/preprocess_data.py](https://github.com/NVIDIA/Megatron-LM/blob/main/tools/preprocess_data.py#L84)\n",
+    "    1. [sentence-splitter](https://github.com/mediacloud/sentence-splitter)\n",
+    "    2. [NLTK](https://github.com/nltk/nltk)\n",
+    "    3. write your own sentence splitter, a home-made example\n",
+    "- deduplicate documents based on similarity score\n",
+    "- (additional advice) discuss filtering on number of sentences per document and number of tokens per sentence \n",
+    "    - example for English GPT training, it is recommand to check on the stats of your raw data and come-up with a good rule-of-thumb to proceed filtering, it is,however,recommanded to look into language-specific cleaning and follow up robust clearning procedure to obtain quality corpus.\n",
     "\n",
     "in this notebook, we will embrace the method provided by the [Megatron-LM repo](https://github.com/NVIDIA/Megatron-LM/tree/main/tools/openwebtext) as well as introduce other complimenting methods that might be of interest for data cleaning.\n",
     "\n",
-    "---------------\n",
-    "### At the end, there's a **mini challenge** for hands-on practice identifying number of duplicates approach groudtruth number!"
+    "\n",
+    "**Note** :At the end, there's a **mini challenge** for hands-on practice identifying number of duplicates approach groudtruth number!"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "injured-appraisal",
+   "id": "structured-superior",
    "metadata": {},
    "source": [
     "--------------------------------------------------------------------------------------------------------------------\n",
-    "#### start by install necessary libraries -\n",
+    "### Install necessary libraries -\n",
     "\n",
     "    install LSH - \n",
     "\n",
@@ -59,7 +58,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "elect-chair",
+   "id": "sensitive-violation",
    "metadata": {},
    "outputs": [
     {
@@ -138,17 +137,17 @@
   },
   {
    "cell_type": "markdown",
-   "id": "excessive-madison",
+   "id": "conscious-provision",
    "metadata": {},
    "source": [
     "-------------------------------------------------------------------------------\n",
-    "## detect and filter undesired langauge in the raw text corpus"
+    "## Language detection"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "choice-nicholas",
+   "id": "entitled-company",
    "metadata": {},
    "outputs": [
     {
@@ -171,7 +170,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "nonprofit-statistics",
+   "id": "fancy-translator",
    "metadata": {},
    "outputs": [
     {
@@ -193,7 +192,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "adverse-robertson",
+   "id": "experimental-burns",
    "metadata": {},
    "outputs": [
     {
@@ -214,24 +213,24 @@
   },
   {
    "cell_type": "markdown",
-   "id": "interpreted-links",
+   "id": "sporting-involvement",
    "metadata": {},
    "source": [
     "-----------------------------------------------------------\n",
-    "## finding sentence boundary - NLTK"
+    "## Finding sentence boundary (altnertive 1) - NLTK "
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
-   "id": "typical-accused",
+   "execution_count": 1,
+   "id": "prerequisite-london",
    "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "[nltk_data] Downloading package punkt to /home/x_zench/nltk_data...\n",
+      "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
       "[nltk_data]   Unzipping tokenizers/punkt.zip.\n"
      ]
     },
@@ -241,7 +240,7 @@
        "True"
       ]
      },
-     "execution_count": 5,
+     "execution_count": 1,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -253,8 +252,8 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
-   "id": "ranking-semester",
+   "execution_count": 20,
+   "id": "worldwide-fruit",
    "metadata": {},
    "outputs": [
     {
@@ -262,16 +261,20 @@
      "output_type": "stream",
      "text": [
       "original doc is :\n",
-      "  På sommarhalvåret längst vattnet och i skogarna på Lidingö eller på Djurgården – på vintern blir det löpbandet på gymmet .Händelsen skapade mycket oro inom klubben , eftersom skridskoåkning äger rum på sjöar och vattendrag där det givetvis inte finns någon gatuadress att uppge vid olycka .\n",
+      "  Detta är ett stycke. Den innehåller flera meningar. \"Men varför\", frågar du?\n",
       "------- sentence 0 -------\n",
-      "På sommarhalvåret längst vattnet och i skogarna på Lidingö eller på Djurgården – på vintern blir det löpbandet på gymmet .Händelsen skapade mycket oro inom klubben , eftersom skridskoåkning äger rum på sjöar och vattendrag där det givetvis inte finns någon gatuadress att uppge vid olycka .\n"
+      "Detta är ett stycke.\n",
+      "------- sentence 1 -------\n",
+      "Den innehåller flera meningar.\n",
+      "------- sentence 2 -------\n",
+      "\"Men varför\", frågar du?\n"
      ]
     }
    ],
    "source": [
     "import nltk\n",
     "from nltk.tokenize import sent_tokenize\n",
-    "text='På sommarhalvåret längst vattnet och i skogarna på Lidingö eller på Djurgården – på vintern blir det löpbandet på gymmet .Händelsen skapade mycket oro inom klubben , eftersom skridskoåkning äger rum på sjöar och vattendrag där det givetvis inte finns någon gatuadress att uppge vid olycka .'\n",
+    "text='Detta är ett stycke. Den innehåller flera meningar. \"Men varför\", frågar du?'\n",
     "print(\"original doc is :\\n \", text)\n",
     "sents=sent_tokenize(text)\n",
     "i=0\n",
@@ -283,33 +286,33 @@
   },
   {
    "cell_type": "markdown",
-   "id": "portable-bumper",
+   "id": "micro-interface",
    "metadata": {},
    "source": [
     "-----------------------------------------------------------\n",
-    "## finding sentence boundary - Sentence-Splitter"
+    "## Finding sentence boundary (altnertive 2) - Sentence-Splitter"
    ]
   },
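
The next cells download the Swedish non-breaking-prefix file (sv.txt) and run the splitter. A condensed sketch of the library call, assuming the standard sentence_splitter API and an illustrative input string:

```python
# Minimal sketch of the Sentence-Splitter alternative (illustrative input text).
# SentenceSplitter also accepts non_breaking_prefix_file= to point at a custom
# prefix list such as the sv.txt downloaded in the next cell.
from sentence_splitter import SentenceSplitter

splitter = SentenceSplitter(language="sv")
text = 'Detta är ett stycke. Den innehåller flera meningar. "Men varför", frågar du?'

for i, sent in enumerate(splitter.split(text)):
    print(f"------- sentence {i} -------")
    print(sent)
```
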
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "environmental-rating",
+   "id": "third-caution",
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "--2021-09-15 11:26:03--  https://github.com/mediacloud/sentence-splitter/blob/develop/sentence_splitter/non_breaking_prefixes/sv.txt\n",
-      "Resolving github.com (github.com)... 140.82.121.4\n",
-      "Connecting to github.com (github.com)|140.82.121.4|:443... connected.\n",
+      "--2021-09-24 18:41:51--  https://github.com/mediacloud/sentence-splitter/blob/develop/sentence_splitter/non_breaking_prefixes/sv.txt\n",
+      "Resolving github.com (github.com)... 140.82.121.3\n",
+      "Connecting to github.com (github.com)|140.82.121.3|:443... connected.\n",
       "HTTP request sent, awaiting response... 200 OK\n",
       "Length: unspecified [text/html]\n",
       "Saving to: ‘sv.txt’\n",
       "\n",
-      "sv.txt                  [ <=>                ] 191.69K  --.-KB/s    in 0.08s   \n",
+      "sv.txt                  [ <=>                ] 190.41K  --.-KB/s    in 0.1s    \n",
       "\n",
-      "2021-09-15 11:26:04 (2.23 MB/s) - ‘sv.txt’ saved [196294]\n",
+      "2021-09-24 18:41:52 (1.76 MB/s) - ‘sv.txt’ saved [194983]\n",
       "\n"
      ]
     }
@@ -321,7 +324,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "collective-medication",
+   "id": "spectacular-eligibility",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -330,8 +333,8 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
-   "id": "secure-encounter",
+   "execution_count": 21,
+   "id": "imported-probability",
    "metadata": {},
    "outputs": [
     {
@@ -339,7 +342,11 @@
      "output_type": "stream",
      "text": [
       "------- sentence 0 -------\n",
-      "På sommarhalvåret längst vattnet och i skogarna på Lidingö eller på Djurgården – på vintern blir det löpbandet på gymmet .Händelsen skapade mycket oro inom klubben , eftersom skridskoåkning äger rum på sjöar och vattendrag där det givetvis inte finns någon gatuadress att uppge vid olycka .\n"
+      "Detta är ett stycke.\n",
+      "------- sentence 1 -------\n",
+      "Den innehåller flera meningar.\n",
+      "------- sentence 2 -------\n",
+      "\"Men varför\", frågar du?\n"
      ]
     }
    ],
@@ -356,17 +363,17 @@
   },
   {
    "cell_type": "markdown",
-   "id": "varying-province",
+   "id": "conventional-medium",
    "metadata": {},
    "source": [
     "-----------------------------------------------------------\n",
-    "## finding sentence boundary - create your own sentence splitter"
+    "## Finding sentence boundary - create your own sentence splitter"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
-   "id": "voluntary-madness",
+   "execution_count": 22,
+   "id": "selective-nelson",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -378,7 +385,7 @@
     "\n",
     "def cut_sentence_with_quotation_marks(text):\n",
     "    p = re.compile(\"“.*?”\")\n",
-    "    list = []\n",
+    "    ls = []\n",
     "    index = 0\n",
     "    length = len(text)\n",
     "    for i in p.finditer(text):\n",
@@ -389,29 +396,29 @@
     "            temp += text[j]\n",
     "        if temp != '':\n",
     "            temp_list = normal_cut_sentence(temp)\n",
-    "            list += temp_list\n",
+    "            ls += temp_list\n",
     "        temp = ''\n",
     "        for k in range(start, end):\n",
     "            temp += text[k]\n",
     "        if temp != ' ':\n",
-    "            list.append(temp)\n",
+    "            ls.append(temp)\n",
     "        index = end\n",
-    "    return list"
+    "    return ls"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
-   "id": "raising-salad",
+   "execution_count": 23,
+   "id": "statistical-croatia",
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "------- sentence 1 -------\n",
+      "------- sentence 3 -------\n",
       "Andersson pekas ut som nästa partiledare:\n",
-      "------- sentence 2 -------\n",
+      "------- sentence 4 -------\n",
       "“Medlemmarna ska säga sitt”\n"
      ]
     }
@@ -427,17 +434,17 @@
   },
   {
    "cell_type": "markdown",
-   "id": "facial-trading",
+   "id": "ultimate-geometry",
    "metadata": {},
    "source": [
     "-----------------------------------------------------------\n",
-    "## deduplicate text based on similarity score  "
+    "## Deduplicate text based on similarity score  "
    ]
   },
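
The deduplication cells themselves follow the Megatron-LM openwebtext tooling and are not reproduced in this hunk. As an illustration of the underlying idea - shingle each document, MinHash the shingles, and bucket near-duplicates with locality sensitive hashing - here is a sketch using the datasketch package; datasketch is not the library the notebook installs, and the threshold and shingle size are arbitrary choices:

```python
# Illustration of near-duplicate detection with MinHash + LSH (datasketch is used
# here only for brevity; the notebook itself follows the Megatron-LM tooling).
from datasketch import MinHash, MinHashLSH

def shingles(text, size=5):
    # word 5-grams as the document representation (arbitrary choice)
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + size]) for i in range(max(1, len(tokens) - size + 1))}

docs = {
    "doc0": "NVIDIA announced a new GPU architecture for AI training today.",
    "doc1": "NVIDIA announced a new GPU architecture for AI training this morning.",
    "doc2": "Stockholm is the capital of Sweden.",
}

lsh = MinHashLSH(threshold=0.5, num_perm=128)   # 0.5 Jaccard threshold, arbitrary
minhashes = {}
for key, text in docs.items():
    m = MinHash(num_perm=128)
    for sh in shingles(text):
        m.update(sh.encode("utf8"))
    minhashes[key] = m
    lsh.insert(key, m)

for key, m in minhashes.items():
    near_dupes = [k for k in lsh.query(m) if k != key]
    print(key, "->", near_dupes)
```
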
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "agricultural-onion",
+   "id": "internal-opposition",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -486,7 +493,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "nonprofit-panama",
+   "id": "provincial-cross",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -500,7 +507,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "empty-while",
+   "id": "successful-consent",
    "metadata": {},
    "outputs": [
     {
@@ -526,16 +533,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "blind-union",
+   "id": "disciplinary-resolution",
    "metadata": {},
    "source": [
-    "## dataset extracted from NVIDIA blog urls "
+    "## Load dataset extracted from NVIDIA blog urls "
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "instant-grade",
+   "id": "based-pizza",
    "metadata": {},
    "outputs": [
     {
@@ -610,16 +617,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "attached-candle",
+   "id": "continuing-palmer",
    "metadata": {},
    "source": [
-    "## create our own groudtruth dataset"
+    "## Create our own groudtruth dataset"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "constant-mouth",
+   "id": "convenient-clear",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -650,7 +657,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "accepting-truck",
+   "id": "vertical-philosophy",
    "metadata": {},
    "outputs": [
     {
@@ -749,7 +756,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "acting-tiffany",
+   "id": "agricultural-bubble",
    "metadata": {},
    "outputs": [
     {
@@ -773,7 +780,7 @@
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "wicked-youth",
+   "id": "incorporated-nepal",
    "metadata": {},
    "outputs": [
     {
@@ -794,7 +801,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "rural-lotus",
+   "id": "future-worry",
    "metadata": {},
    "outputs": [
     {
@@ -888,7 +895,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "dominican-trick",
+   "id": "secure-rapid",
    "metadata": {},
    "source": [
     "---\n",
@@ -898,7 +905,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "married-straight",
+   "id": "equivalent-victor",
    "metadata": {},
    "outputs": [
     {
@@ -998,7 +1005,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "rocky-courage",
+   "id": "suffering-bangkok",
    "metadata": {},
    "outputs": [
     {
@@ -1022,7 +1029,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "boring-piece",
+   "id": "efficient-angola",
    "metadata": {},
    "outputs": [
     {
@@ -1039,7 +1046,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "fresh-norfolk",
+   "id": "featured-packet",
    "metadata": {},
    "source": [
     "---\n",
@@ -1050,7 +1057,7 @@
   {
    "cell_type": "code",
    "execution_count": 13,
-   "id": "starting-arabic",
+   "id": "scheduled-publication",
    "metadata": {},
    "outputs": [
     {
@@ -1145,7 +1152,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "banner-dispute",
+   "id": "judicial-circular",
    "metadata": {},
    "source": [
     "---\n",
@@ -1155,7 +1162,7 @@
   {
    "cell_type": "code",
    "execution_count": 14,
-   "id": "spread-entity",
+   "id": "bronze-salem",
    "metadata": {},
    "outputs": [
     {
@@ -1179,7 +1186,7 @@
   {
    "cell_type": "code",
    "execution_count": 15,
-   "id": "exempt-juice",
+   "id": "engaging-persian",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -1189,7 +1196,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "abandoned-valuation",
+   "id": "forward-attachment",
    "metadata": {},
    "source": [
     "<a id=\"TheChallenge\"></a>"
@@ -1197,13 +1204,13 @@
   },
   {
    "cell_type": "markdown",
-   "id": "thick-external",
+   "id": "supposed-columbia",
    "metadata": {},
    "source": [
     "---\n",
     "# Mini Challenge - approaching the groundtruth !\n",
     "\n",
-    "Task : Aiming to approach the number 31 modifying the below parameters\n",
+    "**Task : Aiming to approach the number 31 modifying the below parameters**\n",
     "rerun cell <a href=\"./Day3-2_SentenceBoundary_and_Deduplicate.ipynb#Rerun_Cell\">Jump to ReRun Cell</a>\n",
     "\n",
     "Consider yourself pass this mini challenge when you approach the number **31 +/- 3** ! \n",
@@ -1222,12 +1229,11 @@
   {
    "cell_type": "code",
    "execution_count": 42,
-   "id": "sophisticated-boating",
+   "id": "automated-liver",
    "metadata": {
     "collapsed": true,
     "jupyter": {
-     "outputs_hidden": true,
-     "source_hidden": true
+     "outputs_hidden": true
     },
     "tags": []
    },
@@ -1257,11 +1263,8 @@
   {
    "cell_type": "code",
    "execution_count": 114,
-   "id": "meaningful-sample",
+   "id": "anticipated-senator",
    "metadata": {
-    "jupyter": {
-     "source_hidden": true
-    },
     "tags": []
    },
    "outputs": [],
@@ -1278,12 +1281,11 @@
   {
    "cell_type": "code",
    "execution_count": 115,
-   "id": "operational-steps",
+   "id": "pregnant-temple",
    "metadata": {
     "collapsed": true,
     "jupyter": {
-     "outputs_hidden": true,
-     "source_hidden": true
+     "outputs_hidden": true
     },
     "tags": []
    },
@@ -1314,7 +1316,25 @@
   },
   {
    "cell_type": "markdown",
-   "id": "revolutionary-framing",
+   "id": "junior-voluntary",
+   "metadata": {},
+   "source": [
+    "--- \n",
+    "\n",
+    "## Additional Resources\n",
+    "\n",
+    "Language Detect : https://github.com/Mimino666/langdetect\n",
+    "\n",
+    "NLTK Sentence Tokenizer  : https://www.nltk.org/api/nltk.tokenize.html\n",
+    "\n",
+    "Sentence Splitter : https://github.com/mediacloud/sentence-splitter\n",
+    "\n",
+    "Local Sensitive Hashing : http://snap.stanford.edu/class/cs246-2012/slides/03-lsh.pdf\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "banned-telling",
    "metadata": {},
    "source": [
     "---\n",
@@ -1328,7 +1348,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "cutting-greeting",
+   "id": "controlling-boulder",
    "metadata": {},
    "source": [
     "-----\n",

+ 48 - 34
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-Website_scrapping.ipynb

@@ -2,35 +2,33 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "grave-judges",
+   "id": "magnetic-filter",
    "metadata": {},
    "source": [
-    "# \n",
+    "# Get Your Own Data via webcrawling the website / webpages you have permisson to use\n",
     "\n",
-    "# (option) : get your own data via scrap the website you have permisson to use\n",
+    "note_1 : strongly recommand to consult with your own legal department for compliance before proceeding this step on websites/webpages you have permission to\n",
     "\n",
-    "note1 : strongly recommand to consult with your own legal department for compliance before proceeding this step on websites/webpages you have permission to\n",
+    "note_2 : modification needed when applying to different website/webpages\n",
     "\n",
-    "note2 : modification needed when applying to different website/webpages\n",
-    "\n",
-    "note3 : use at your own risk !\n",
+    "note_3 : use at your own risk !\n",
     "\n",
     "---\n",
     "\n",
     "## Learning Objectives\n",
-    "- **The goal of this lab is to demonstrate that there are ways to obtain your own data via webscrapping sites which you have permission to**\n",
-    "    - Provide a list of urls in a text file via collecting webpages links\n",
+    "The goal of this lab is to demonstrate the fact that, there are ways to obtain your own data, the example given below is to crawl the webpages, from a seeding url, with the end goal of obtaining the raw texts. Please double check you have permission to extract the content of these webpages.\n",
+    "\n",
+    "- Provide a list of urls in a text file via collecting webpages links \n",
     "    \n",
-    "    the base url used to crawl links from is  https://developer.nvidia.com/blog/\n",
+    "the base url used to crawl links from is  https://developer.nvidia.com/blog/\n",
     "    \n",
-    "    - scrap the webpage ( in html format ) using [scrapy](https://docs.scrapy.org/en/latest/index.html) and obtain desired text, concatenate text per webpage into one raw text file\n",
-    "\n"
+    "- scrap the webpage ( in html format ) using [scrapy](https://docs.scrapy.org/en/latest/index.html) and obtain raw text, then concatenate all the raw text per webpage into one text file\n"
    ]
   },
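
The crawling and extraction cells are only partially visible in this diff. A compact sketch of the two steps - collect blog links from the landing page, then strip a fetched page down to visible text - is shown below with requests and BeautifulSoup (both linked in this notebook's Additional Resources); the link filter, sample size, and output file name are illustrative, and the notebook itself drives scrapy and saves intermediate .html files:

```python
# Minimal sketch: collect links from a landing page, then extract visible text.
# The link filter and output file name are illustrative; adjust per site, and
# make sure you have permission before crawling.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://developer.nvidia.com/blog/"

def collect_links(url):
    # gather absolute links that point back into the blog section
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)
            if "/blog/" in a["href"]}

def page_to_text(url):
    # drop script/style tags and collapse whitespace to keep only visible text
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

links = collect_links(BASE_URL)
with open("nvblog_raw_text.txt", "w", encoding="utf-8") as out:  # hypothetical output file
    for link in sorted(links)[:5]:  # small sample for illustration
        out.write(page_to_text(link) + "\n")
```
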
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "responsible-steering",
+   "id": "contained-collection",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -43,7 +41,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "pretty-brooklyn",
+   "id": "found-people",
    "metadata": {},
    "source": [
     "## crawl NVblog landing page and obtain links to the individual blogs\n",
@@ -53,7 +51,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "interstate-kernel",
+   "id": "humanitarian-democracy",
    "metadata": {},
    "outputs": [
     {
@@ -81,7 +79,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "extraordinary-enforcement",
+   "id": "healthy-condition",
    "metadata": {},
    "outputs": [
     {
@@ -109,7 +107,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "patient-hudson",
+   "id": "structured-curve",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -119,7 +117,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "american-offset",
+   "id": "elect-moldova",
    "metadata": {},
    "outputs": [
     {
@@ -349,7 +347,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "blind-opportunity",
+   "id": "blind-chess",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -361,7 +359,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "parallel-firmware",
+   "id": "agricultural-spiritual",
    "metadata": {},
    "source": [
     "## fetch the urls of interest and convert to html files \n",
@@ -371,7 +369,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "mighty-treat",
+   "id": "animated-europe",
    "metadata": {},
    "outputs": [
     {
@@ -402,7 +400,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "integrated-court",
+   "id": "responsible-killing",
    "metadata": {},
    "source": [
     "## fetch given url and save as .html file"
@@ -411,7 +409,7 @@
   {
    "cell_type": "code",
    "execution_count": 8,
-   "id": "authorized-convertible",
+   "id": "cordless-inside",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -421,7 +419,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "multiple-supervision",
+   "id": "virtual-superior",
    "metadata": {},
    "outputs": [
     {
@@ -509,7 +507,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "offensive-swedish",
+   "id": "suffering-surgery",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -526,7 +524,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "atmospheric-bangladesh",
+   "id": "apparent-cartoon",
    "metadata": {},
    "outputs": [
     {
@@ -563,7 +561,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "medical-cabinet",
+   "id": "yellow-municipality",
    "metadata": {},
    "source": [
     "---\n",
@@ -573,7 +571,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "minus-travel",
+   "id": "smart-opinion",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -582,7 +580,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "demanding-accountability",
+   "id": "vietnamese-constant",
    "metadata": {},
    "source": [
     "---\n",
@@ -592,7 +590,7 @@
   {
    "cell_type": "code",
    "execution_count": 14,
-   "id": "fresh-thirty",
+   "id": "norman-administrator",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -604,7 +602,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "variable-tamil",
+   "id": "designed-marine",
    "metadata": {},
    "source": [
     "--- \n",
@@ -614,7 +612,7 @@
   {
    "cell_type": "code",
    "execution_count": 15,
-   "id": "front-batman",
+   "id": "christian-austria",
    "metadata": {},
    "outputs": [
     {
@@ -631,7 +629,23 @@
   },
   {
    "cell_type": "markdown",
-   "id": "burning-humor",
+   "id": "athletic-james",
+   "metadata": {},
+   "source": [
+    "--- \n",
+    "\n",
+    "## Additional Resources\n",
+    "\n",
+    "scrapy : https://docs.scrapy.org/en/latest/\n",
+    "\n",
+    "beautifulsoup : https://beautiful-soup-4.readthedocs.io/en/latest/\n",
+    "\n",
+    "selenium : https://www.selenium.dev/selenium/docs/api/py/index.html\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "failing-australian",
    "metadata": {},
    "source": [
     "## Back To Start Menu\n",
@@ -640,7 +654,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "guilty-strand",
+   "id": "seven-contrary",
    "metadata": {},
    "source": [
     "--- \n",

Diff data for this file not shown because the file is too large
+ 2711 - 0
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/custom_english_non_breaking_prefixes.txt