4 years ago · 2cadc3a05f
--- a/ai/Megatron/English/Python/jupyter_notebook/Day2-1_EstimateComputeDaysNeeded.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Day2-1_EstimateComputeDaysNeeded.ipynb
@@ -1,229 +0,0 @@
 
				-{
			
 
				- "cells": [
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "strong-match",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "# Estimate compute hours/days needed to execute one end-to-end run\n",
			
 
				-    "---\n",
			
 
				-    "\n",
			
 
				-    "## Learning Objectives\n",
			
 
				-    "The goal of this lab is to size the problem:\n",
			
 
				-    "Understand how to calculate the hours/days needed, in order to reserve compute resources for a training job given an existing data volume and a desired model size. \n",
			
 
				-    "This is important both for compute cluster admins doing capacity forecasting and for researchers planning their experiments strategically.\n",
			
 
				-    "\n",
			
 
				-    "- Extracting the formula from the paper [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf), per given [GPT3 variants](https://arxiv.org/pdf/2005.14165.pdf) based on assumed [Teraflops reference table](https://arxiv.org/pdf/2104.04473.pdf)\n",
			
 
				-    "\n",
			
 
				-    "- Understanding how to estimate compute resource needed per dataset volume ( measured in # of tokens ) and a chosen model size\n",
			
 
				-    "\n",
			
 
				-    "- Apply to your own imaginary data volume and a hypothetical compute cluster setup\n",
			
 
				-    "---------------------------------------------------------------------------------------------------\n",
			
 
				-    "\n",
			
 
				-    "- assuming the following information \n",
			
 
				-    "- T = dataset size measured in numbers of tokens in the dataset\n",
			
 
				-    "- P = model parameters for GPT3 variants\n",
			
 
				-    "- n = number of GPUs in the compute cluster\n",
			
 
				-    "- x = achieved teraflops per GPU \n",
			
 
				-    "\n",
			
 
				-    "Training time (in seconds) is approximated with this equation : 8*T*P/(n*x)\n",
			
 
				-    "You will need the following tables from the above papers for the estimation \n",
			
 
				-    "\n",
			
 
				-    "<center><img src=\"./Megatron-LM/pics/GPT3_all.png\" width=\"700\"/></center>\n",
			
 
				-    "\n",
			
 
				-    "<center><img src=\"./Megatron-LM/pics/achieved_teraflops_per_gpu.JPG\" width=\"700\"/></center>\n",
			
 
				-    "\n",
			
 
				-    "<center><img src=\"./Megatron-LM/pics/TrainingTimeEstimate.JPG\" width=\"500\"/></center>"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "pregnant-basket",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "---\n",
			
 
				-    "## let's do a sanity check - \n",
			
 
				-    "\n",
			
 
				-    "**Assumption** : you have an existing dataset, you know the volume of the dataset ( measured in # of tokens )\n",
			
 
				-    "\n",
			
 
				-    "**Scenario 1** - Given 300 billion tokens, you want to train a 175-billion-parameter GPT3 model and you have access to 1024 GPUs; look up the table above to fetch 140 teraFLOP/s per GPU\n",
			
 
				-    "\n",
			
 
				-    "Question : How many hours/days will you need, given the scenario above, to compute an end-to-end training job?\n",
			
 
				-    "\n",
			
 
				-    "Answer : We should observe around **34 days** for an end to end training run\n",
			
 
				-    "\n",
			
 
				-    "--\n",
			
 
				-    "\n",
			
 
				-    "**Scenario 2** - You increase the data volume to 450 billion tokens, you want to train a big model, say 1 trillion parameters, and you have access to 3072 GPUs; fetch 163 teraFLOP/s per GPU from the table above\n",
			
 
				-    "\n",
			
 
				-    "Question: How many hours/days will you need, given the scenario above, to compute an end-to-end training job?\n",
			
 
				-    "\n",
			
 
				-    "Answer: We should observe around **84 days** for an end to end training run\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 16,
			
 
				-   "id": "played-broadcast",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "all below are measured with dataset size **300 billion** measured in tokens \n",
			
 
				-      "\n",
			
 
				-      " ----------------------------------------------------------------------------------------\n",
			
 
				-      " language model :gpt3_175B with 175 Billion number of parameters , it will need 33.9 days to compute \n",
			
 
				-      "\n",
			
 
				-      " ----------------------------------------------------------------------------------------\n",
			
 
				-      " language model :gpt3_1Trillion with 1Trillion number of parameters , it will need 83.2 days to compute \n",
			
 
				-      "\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "import numpy as np\n",
			
 
				-    "# T = dataset size measured in numbers of tokens in the dataset\n",
			
 
				-    "# P = model parameters for GPT3 variants\n",
			
 
				-    "# n = number of GPUs in the compute cluster\n",
			
 
				-    "# x = achieved teraflops per GPU \n",
			
 
				-    "\n",
			
 
				-    "def calculate_days_needed(T , P , n ,x):\n",
			
 
				-    "    if x is None:\n",
			
 
				-    "        return 'not a good SuperPOD use case, let us try a bigger model :)'\n",
			
 
				-    "    else:        \n",
			
 
				-    "        tot=8*T*P\n",
			
 
				-    "        div=n*x\n",
			
 
				-    "        compute_sec=tot/div\n",
			
 
				-    "        #convert compute seconds to days\n",
			
 
				-    "        to_days=round(compute_sec/(3600*24),1)\n",
			
 
				-    "        return to_days\n",
			
 
				-    "## sanity check against the paper reported figure above \n",
			
 
				-    "T=[300*1e+9, 450*1e+9]\n",
			
 
				-    "n=[1024,3072]\n",
			
 
				-    "GPT3_models_labels=[  'gpt3_175B','gpt3_1Trillion']\n",
			
 
				-    "GPT3_model_params=[ 175*1e+9,1*1e+12 ]\n",
			
 
				-    "GPT3_model_params_str=['175 Billion','1Trillion']\n",
			
 
				-    "#according to the table above\n",
			
 
				-    "GPT3_X=[140*1e+12,163*1e+12]\n",
			
 
				-    "print(\"all below are measured with dataset size **300 billion** measured in tokens \\n\")\n",
			
 
				-    "for gpt3_name, gpt3_params, gpt3_param_str, x, n_,t in zip(GPT3_models_labels,GPT3_model_params,GPT3_model_params_str, GPT3_X ,n,T):\n",
			
 
				-    "    days_needed=calculate_days_needed(t,gpt3_params,n_,x)\n",
			
 
				-    "    print(\" ----------------------------------------------------------------------------------------\")\n",
			
 
				-    "    print(\" language model :{} with {} number of parameters , it will need {} days to compute \\n\".format(gpt3_name, gpt3_param_str, str(days_needed)))\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "ruled-score",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "---\n",
			
 
				-    "## Exercise -\n",
			
 
				-    "Question -\n",
			
 
				-    "For a GPT3 model size of 70B parameters with approximately 300 billion tokens in an existing dataset,\n",
			
 
				-    "given 1/4 of the BerzeLiUs compute availability,   \n",
			
 
				-    "how many hours/days would you need to compute? \n",
			
 
				-    "When you are ready, check against the solution by uncollapsing the cell below \n",
			
 
				-    "**. . .**\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 5,
			
 
				-   "id": "circular-northwest",
			
 
				-   "metadata": {
			
 
				-    "collapsed": true,
			
 
				-    "jupyter": {
			
 
				-     "outputs_hidden": true
			
 
				-    }
			
 
				-   },
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "data": {
			
 
				-      "text/plain": [
			
 
				-       "115.7"
			
 
				-      ]
			
 
				-     },
			
 
				-     "execution_count": 5,
			
 
				-     "metadata": {},
			
 
				-     "output_type": "execute_result"
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "T=300*1e+9 # number of tokens in the dataset\n",
			
 
				-    "n=int(480*0.25) # Berzelius Max 480 GPUs # number of GPUs in the compute cluster\n",
			
 
				-    "x=140*1e+12\n",
			
 
				-    "gpt3_params=70*1e+9\n",
			
 
				-    "calculate_days_needed(T,gpt3_params,n,x)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "peaceful-colorado",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "--- \n",
			
 
				-    "\n",
			
 
				-    "## Additional Resources\n",
			
 
				-    "\n",
			
 
				-    "Efficient Large-Scale Language Model Training on GPU Clusters : https://arxiv.org/pdf/2104.04473.pdf \n",
			
 
				-    "\n",
			
 
				-    "Language Models are Few-Shot Learners : https://arxiv.org/pdf/2005.14165.pdf\n",
			
 
				-    "\n",
			
 
				-    "Scaling Laws for Neural Language Models : https://arxiv.org/pdf/2001.08361.pdf\n",
			
 
				-    "\n",
			
 
				-    "<center><img src=\"./Megatron-LM/pics/data_loss_model_size_compute.JPG\" width=\"700\"/></center>"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "northern-fellowship",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "---\n",
			
 
				-    "## Up Next : \n",
			
 
				-    "\n",
			
 
				-    "[Understanding the core of Megatron - mpu ](./Day2-2_MegatronFundementals.ipynb)\n",
			
 
				-    "\n",
			
 
				-    "## Back To Start Menu\n",
			
 
				-    "[start menu](../Start_Here.ipynb)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "linear-culture",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "-----\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "## Licensing \n",
			
 
				-    "\n",
			
 
				-    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
			
 
				-   ]
			
 
				-  }
			
 
				- ],
			
 
				- "metadata": {
			
 
				-  "kernelspec": {
			
 
				-   "display_name": "Python 3",
			
 
				-   "language": "python",
			
 
				-   "name": "python3"
			
 
				-  },
			
 
				-  "language_info": {
			
 
				-   "codemirror_mode": {
			
 
				-    "name": "ipython",
			
 
				-    "version": 3
			
 
				-   },
			
 
				-   "file_extension": ".py",
			
 
				-   "mimetype": "text/x-python",
			
 
				-   "name": "python",
			
 
				-   "nbconvert_exporter": "python",
			
 
				-   "pygments_lexer": "ipython3",
			
 
				-   "version": "3.8.8"
			
 
				-  }
			
 
				- },
			
 
				- "nbformat": 4,
			
 
				- "nbformat_minor": 5
			
 
				-}
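
For capacity forecasting, the approximation used in the notebook above, training time ≈ 8·T·P/(n·x), can also be inverted to ask how many GPUs a deadline requires. The following is a minimal sketch under that approximation only; `training_days` and `gpus_needed` are hypothetical helper names, not part of Megatron-LM, and the sketch assumes the achieved teraFLOP/s per GPU stays constant as the GPU count changes, which the paper's reference table suggests is only roughly true.

```python
import math

SECONDS_PER_DAY = 24 * 3600


def training_days(tokens, params, n_gpus, flops_per_gpu):
    """Approximate end-to-end training time in days: 8*T*P / (n*x)."""
    seconds = 8 * tokens * params / (n_gpus * flops_per_gpu)
    return seconds / SECONDS_PER_DAY


def gpus_needed(tokens, params, flops_per_gpu, target_days):
    """Smallest GPU count meeting target_days, assuming x does not change with n."""
    total_flops = 8 * tokens * params
    return math.ceil(total_flops / (flops_per_gpu * target_days * SECONDS_PER_DAY))


# Scenario 1 from the notebook: 300B tokens, 175B parameters, 1024 GPUs, 140 teraFLOP/s per GPU.
print(round(training_days(300e9, 175e9, 1024, 140e12), 1))  # 33.9

# GPUs needed to train the exercise model (70B parameters, 300B tokens) in 30 days.
print(gpus_needed(300e9, 70e9, 140e12, target_days=30))  # 463
```

Rounding up with `math.ceil` reflects that GPUs are reserved in whole units; in practice the count would also need to satisfy the parallelism constraints of the training framework.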
			
--- a/ai/Megatron/English/Python/jupyter_notebook/Day2-2_MegatronFundementals.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Day2-2_MegatronFundementals.ipynb
@@ -1,815 +0,0 @@
 
				-{
			
 
				- "cells": [
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "selected-material",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "#  Understanding Megatron-LM's core - MPU\n",
			
 
				-    "---\n",
			
 
				-    "NVIDIA's Megatron-LM makes training very large language models ( up to one trillion parameters ) a reality. Megatron-LM's core MPU ( Model Parallelism Unit ) forms the base for all subsequent optimization efforts on training very large models, such as [DeepSpeed](https://www.deepspeed.ai/features/#model-parallelism).\n",
			
 
				-    "\n",
			
 
				-    "However, a common side effect is that, with a bad training configuration, one can easily suffer from low GPU utilization ( the screenshot below is an example of a bad training configuration that resulted in low or inconsistent GPU utilization ).\n",
			
 
				-    "\n",
			
 
				-    "Therefore, we will be taking our very first step of understanding how the core of Megatron-LM's mpu works and thereafter\n",
			
 
				-    "taking a closer look at Megatron-LM's training run performance. \n",
			
 
				-    "\n",
			
 
				-    "![a training run with low or inconsistent gpus utils example](./Megatron-LM/pics/naive_run.JPG)\n",
			
 
				-    "\n",
			
 
				-    "## Learning Objectives\n",
			
 
				-    "The goal of this lab is to understand how the core of Megatron-LM, the mpu ( model parallelism unit ), works\n",
			
 
				-    "\n",
			
 
				-    "- How Megatron-LM is grouping GPU-affinity per model parallel configuration ( pipeline parallel | tensor parallel )\n",
			
 
				-    "- Tensor Parallel : Column Parallel\n",
			
 
				-    "- Tensor Parallel : Row Parallel\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "renewable-simon",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "---------------------------------------------------------------------------\n",
			
 
				-    "## How Megatron-LM is grouping GPU-affinity per model-data parallel configuration\n",
			
 
				-    "\n",
			
 
				-    "Parallelism : Model & Data \n",
			
 
				-    "- p = Pipeline Model Parallel  \n",
			
 
				-    "- t = Tensor Model Parallel\n",
			
 
				-    "- d = Data Parallel \n",
			
 
				-    "- n = Total number of GPUs used in the training\n",
			
 
				-    "\n",
			
 
				-    "**Note : Megatron-LM requires p * t * d = n**\n",
			
 
				-    "\n",
			
 
				-    "Parallel group - grouping for torch distributed  (NCCL)\n",
			
 
				-    "- num_tensor_model_parallel_groups = n / t\n",
			
 
				-    "- num_pipeline_model_parallel_groups = n / p\n",
			
 
				-    "- num_data_parallel_groups = n / d\n",
			
 
				-    "\n",
			
 
				-    "Assume the following configuration : \n",
			
 
				-    "\n",
			
 
				-    "tensor_model_parallel_size_=2 \n",
			
 
				-    "\n",
			
 
				-    "pipeline_model_parallel_size_= 4\n",
			
 
				-    "\n",
			
 
				-    "let's say we have a total of 16 GPUs denoted by g0 ... g15 \n",
			
 
				-    "\n",
			
 
				-    "therefore, world_size=16  \n",
			
 
				-    "\n",
			
 
				-    "then according to Megatron-LM [initialize.py](https://github.com/NVIDIA/Megatron-LM/blob/90e0a0dd08159e1c95f4f9d99bb8687f327d36c3/megatron/mpu/initialize.py) we should see the following ...\n",
			
 
				-    "\n",
			
 
				-    "    8 data_parallel groups:\n",
			
 
				-    "        [g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]\n",
			
 
				-    "    8 tensor model-parallel groups:\n",
			
 
				-    "        [g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]\n",
			
 
				-    "    4 pipeline model-parallel groups:\n",
			
 
				-    "        [g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]\n",
			
 
				-    "\n",
			
 
				-    "**Note that for efficiency, the caller should make sure adjacent ranks are on the same DGX box.**\n",
			
 
				-    "\n",
			
 
				-    "For example if we are using 2 DGX-1 boxes\n",
			
 
				-    "with a total of 16 GPUs, rank 0 to 7 belong to the first box and\n",
			
 
				-    "ranks 8 to 15 belong to the second box."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 1,
			
 
				-   "id": "greek-simpson",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "import itertools\n",
			
 
				-    "def ensure_divisibility(numerator, denominator):\n",
			
 
				-    "    \"\"\"Ensure that numerator is divisible by the denominator.\"\"\"\n",
			
 
				-    "    assert numerator % denominator == 0, '{} is not divisible by {}'.format(\n",
			
 
				-    "        numerator, denominator)\n",
			
 
				-    "def initialize_model_parallel(tensor_model_parallel_size_=2,\n",
			
 
				-    "                              pipeline_model_parallel_size_= 4,\n",
			
 
				-    "                              world_size=16):\n",
			
 
				-    "    print(' ---------- world size is set to : {} ---------- '.format(world_size))\n",
			
 
				-    "    print('> initializing tensor model parallel with size {}'.format(tensor_model_parallel_size_))\n",
			
 
				-    "    print('> initializing pipeline model parallel with size {}'.format(pipeline_model_parallel_size_))\n",
			
 
				-    "    \n",
			
 
				-    "    tensor_model_parallel_size = min(tensor_model_parallel_size_, world_size)\n",
			
 
				-    "    pipeline_model_parallel_size = min(pipeline_model_parallel_size_, world_size)\n",
			
 
				-    "                                       \n",
			
 
				-    "    # make sure world_size is divisible by t * p    \n",
			
 
				-    "    ensure_divisibility(world_size,tensor_model_parallel_size * pipeline_model_parallel_size)\n",
			
 
				-    "                        \n",
			
 
				-    "    data_parallel_size = world_size // (tensor_model_parallel_size * pipeline_model_parallel_size)\n",
			
 
				-    "    print(\"> data parallel size is set to : \", data_parallel_size)\n",
			
 
				-    "    num_tensor_model_parallel_groups = world_size // tensor_model_parallel_size\n",
			
 
				-    "    num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size\n",
			
 
				-    "    num_data_parallel_groups = world_size // data_parallel_size\n",
			
 
				-    "    print(\"---------- parallel groups ----------\")\n",
			
 
				-    "    print(\"num_tensor_model_parallel_groups : \", num_tensor_model_parallel_groups)\n",
			
 
				-    "    print(\"num_pipeline_model_parallel_groups : \", num_pipeline_model_parallel_groups)\n",
			
 
				-    "    print(\"num_data_parallel_groups : \",num_data_parallel_groups )\n",
			
 
				-    "    # Build the data-parallel groups.\n",
			
 
				-    "    _DATA_PARALLEL_GROUP = []\n",
			
 
				-    "    _MODEL_PARALLEL_GROUP = []\n",
			
 
				-    "    _TENSOR_MODEL_PARALLEL_GROUP = []\n",
			
 
				-    "    _PIPE_MODEL_PARALLEL_GROUP=[]   \n",
			
 
				-    "    _MODEL_PARALLEL_GROUP=[]\n",
			
 
				-    "    all_data_parallel_group_ranks = []\n",
			
 
				-    "    for i in range(pipeline_model_parallel_size):\n",
			
 
				-    "        start_rank = i * num_pipeline_model_parallel_groups\n",
			
 
				-    "        end_rank = (i + 1) * num_pipeline_model_parallel_groups\n",
			
 
				-    "        #print(\"start rank : {} | end rank :{}\".format(start_rank, end_rank))\n",
			
 
				-    "        temp=[]\n",
			
 
				-    "        for j in range(tensor_model_parallel_size):\n",
			
 
				-    "            ranks = range(start_rank + j, end_rank,\n",
			
 
				-    "                          tensor_model_parallel_size)\n",
			
 
				-    "            temp.append(list(ranks))\n",
			
 
				-    "            all_data_parallel_group_ranks.append(list(ranks))\n",
			
 
				-    "    _DATA_PARALLEL_GROUP=all_data_parallel_group_ranks\n",
			
 
				-    "\n",
			
 
				-    "    for i in range(num_pipeline_model_parallel_groups):\n",
			
 
				-    "        ranks = range(i, world_size,\n",
			
 
				-    "                      num_pipeline_model_parallel_groups)        \n",
			
 
				-    "        _PIPE_MODEL_PARALLEL_GROUP.append(list(ranks))\n",
			
 
				-    "    \n",
			
 
				-    "    \n",
			
 
				-    "    for i in range(data_parallel_size):\n",
			
 
				-    "        ranks = [data_parallel_group_ranks[i]\n",
			
 
				-    "                 for data_parallel_group_ranks in all_data_parallel_group_ranks]\n",
			
 
				-    "        _MODEL_PARALLEL_GROUP.append(ranks)\n",
			
 
				-    "    \n",
			
 
				-    "    for i in range(num_tensor_model_parallel_groups):\n",
			
 
				-    "        ranks = range(i * tensor_model_parallel_size,\n",
			
 
				-    "                      (i + 1) * tensor_model_parallel_size)\n",
			
 
				-    "        _TENSOR_MODEL_PARALLEL_GROUP.append(list(ranks))\n",
			
 
				-    "    print(\"-----\"*20)\n",
			
 
				-    "    print(\"_DATA_PARALLEL_GROUP \\n :\", _DATA_PARALLEL_GROUP)\n",
			
 
				-    "    print(\"-----\"*20)\n",
			
 
				-    "    print(\"_TENSOR_MODEL_PARALLEL_GROUP \\n :\", _TENSOR_MODEL_PARALLEL_GROUP)\n",
			
 
				-    "    print(\"-----\"*20)\n",
			
 
				-    "    print(\"_PIPE_MODEL_PARALLEL_GROUP \\n :\", _PIPE_MODEL_PARALLEL_GROUP)\n",
			
 
				-    "    print(\"-----\"*20)\n",
			
 
				-    "    print(\"Total :{} full models being partitioned into :{} GPUs \".format(len(_MODEL_PARALLEL_GROUP),world_size))\n",
			
 
				-    "    for idx, m in zip(range(len(_MODEL_PARALLEL_GROUP)),_MODEL_PARALLEL_GROUP):\n",
			
 
				-    "        m=[str(l) for l in m]\n",
			
 
				-    "        print(\"model {} : is partitioned into gpus :{}\".format(str(idx),','.join(m)))   \n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "transparent-myanmar",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "---\n",
			
 
				-    "## sanity check, verify the below matches [megatron/mpu/initialize.py](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/mpu/initialize.py#L63)\n",
			
 
				-    "        Let's say we have a total of 16 GPUs denoted by g0 ... g15 \n",
			
 
				-    "            8 data_parallel groups:\n",
			
 
				-    "                [g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]\n",
			
 
				-    "            8 tensor model-parallel groups:\n",
			
 
				-    "                [g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]\n",
			
 
				-    "            4 pipeline model-parallel groups:\n",
			
 
				-    "                [g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]\n",
			
 
				-    "        Note that for efficiency, the caller should make sure adjacent ranks\n",
			
 
				-    "        are on the same DGX box. For example if we are using 2 DGX-1 boxes\n",
			
 
				-    "        with a total of 16 GPUs, rank 0 to 7 belong to the first box and\n",
			
 
				-    "        ranks 8 to 15 belong to the second box.\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 2,
			
 
				-   "id": "physical-lightweight",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      " ---------- world size is set to : 16 ---------- \n",
			
 
				-      "> initializing tensor model parallel with size 2\n",
			
 
				-      "> initializing pipeline model parallel with size 4\n",
			
 
				-      "> data parallel size is set to :  2\n",
			
 
				-      "---------- parallel groups ----------\n",
			
 
				-      "num_tensor_model_parallel_groups :  8\n",
			
 
				-      "num_pipeline_model_parallel_groups :  4\n",
			
 
				-      "num_data_parallel_groups :  8\n",
			
 
				-      "----------------------------------------------------------------------------------------------------\n",
			
 
				-      "_DATA_PARALLEL_GROUP \n",
			
 
				-      " : [[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11], [12, 14], [13, 15]]\n",
			
 
				-      "----------------------------------------------------------------------------------------------------\n",
			
 
				-      "_TENSOR_MODEL_PARALLEL_GROUP \n",
			
 
				-      " : [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]]\n",
			
 
				-      "----------------------------------------------------------------------------------------------------\n",
			
 
				-      "_PIPE_MODEL_PARALLEL_GROUP \n",
			
 
				-      " : [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]\n",
			
 
				-      "----------------------------------------------------------------------------------------------------\n",
			
 
				-      "Total :2 full models being partitioned into :16 GPUs \n",
			
 
				-      "model 0 : is partitioned into gpus :0,1,4,5,8,9,12,13\n",
			
 
				-      "model 1 : is partitioned into gpus :2,3,6,7,10,11,14,15\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "initialize_model_parallel(tensor_model_parallel_size_=2,\n",
			
 
				-    "                              pipeline_model_parallel_size_= 4,\n",
			
 
				-    "                              world_size=16)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 3,
			
 
				-   "id": "confidential-mills",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      " ---------- world size is set to : 16 ---------- \n",
			
 
				-      "> initializing tensor model parallel with size 4\n",
			
 
				-      "> initializing pipeline model parallel with size 2\n",
			
 
				-      "> data parallel size is set to :  2\n",
			
 
				-      "---------- parallel groups ----------\n",
			
 
				-      "num_tensor_model_parallel_groups :  4\n",
			
 
				-      "num_pipeline_model_parallel_groups :  8\n",
			
 
				-      "num_data_parallel_groups :  8\n",
			
 
				-      "----------------------------------------------------------------------------------------------------\n",
			
 
				-      "_DATA_PARALLEL_GROUP \n",
			
 
				-      " : [[0, 4], [1, 5], [2, 6], [3, 7], [8, 12], [9, 13], [10, 14], [11, 15]]\n",
			
 
				-      "----------------------------------------------------------------------------------------------------\n",
			
 
				-      "_TENSOR_MODEL_PARALLEL_GROUP \n",
			
 
				-      " : [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]\n",
			
 
				-      "----------------------------------------------------------------------------------------------------\n",
			
 
				-      "_PIPE_MODEL_PARALLEL_GROUP \n",
			
 
				-      " : [[0, 8], [1, 9], [2, 10], [3, 11], [4, 12], [5, 13], [6, 14], [7, 15]]\n",
			
 
				-      "----------------------------------------------------------------------------------------------------\n",
			
 
				-      "Total :2 full models being partitioned into :16 GPUs \n",
			
 
				-      "model 0 : is partitioned into gpus :0,1,2,3,8,9,10,11\n",
			
 
				-      "model 1 : is partitioned into gpus :4,5,6,7,12,13,14,15\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "tensor_model_parallel_size= 4  # try a different tensor_model_parallel_size_\n",
			
 
				-    "pipeline_model_parallel_size= 2  # try a different pipeline_model_parallel_size_\n",
			
 
				-    "world_size=16\n",
			
 
				-    "assert world_size%(tensor_model_parallel_size * pipeline_model_parallel_size)==0,'please make sure world_size is divisible by tensor_model_parallel_size * pipeline_model_parallel_size' \n",
			
 
				-    "\n",
			
 
				-    "initialize_model_parallel(tensor_model_parallel_size_=tensor_model_parallel_size,pipeline_model_parallel_size_= pipeline_model_parallel_size,world_size=world_size)"
			
 
				-   ]
			
 
				-  },
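
One way to read the group listings printed above is that every global rank carries three coordinates: a pipeline stage, a data-parallel replica index, and a tensor shard index. The sketch below is inferred from those listings rather than taken from Megatron-LM's code; `rank_to_coords` and `groups_from_coords` are hypothetical names, and the assumed layout is `rank = pipeline_idx * (t * d) + data_idx * t + tensor_idx`.

```python
def rank_to_coords(rank, t, p, world_size):
    """Map a global rank to (pipeline stage, data-parallel index, tensor index)."""
    d = world_size // (t * p)          # data-parallel size, from p * t * d = n
    tensor_idx = rank % t
    data_idx = (rank // t) % d
    pipeline_idx = rank // (t * d)
    return pipeline_idx, data_idx, tensor_idx


def groups_from_coords(t, p, world_size):
    """Rebuild the three kinds of groups by bucketing ranks on shared coordinates."""
    tensor, data, pipe = {}, {}, {}
    for rank in range(world_size):
        pp, dp, tp = rank_to_coords(rank, t, p, world_size)
        tensor.setdefault((pp, dp), []).append(rank)  # same stage, same replica
        data.setdefault((pp, tp), []).append(rank)    # same stage, same tensor shard
        pipe.setdefault((dp, tp), []).append(rank)    # same replica, same tensor shard
    return sorted(tensor.values()), sorted(data.values()), sorted(pipe.values())


tensor_groups, data_groups, pipe_groups = groups_from_coords(t=2, p=4, world_size=16)
print(data_groups)
print(tensor_groups)
print(pipe_groups)
```

For both configurations run above (t=2, p=4 and t=4, p=2 with 16 GPUs), bucketing on these coordinates reproduces the printed data-parallel, tensor model-parallel, and pipeline model-parallel groups.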
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "designed-guidance",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "----------------------------------------------------------------------\n",
			
 
				-    "## Megatron-LM's Column Parallel \n",
			
 
				-    "[ColumnParallel reference](https://github.com/NVIDIA/Megatron-LM/blob/90e0a0dd08159e1c95f4f9d99bb8687f327d36c3/megatron/mpu/layers.py#L201)\n",
			
 
				-    "![ColumnParallel](./Megatron-LM/pics/ColumnParallel.JPG)"
			
 
				-   ]
			
 
				-  },
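
Column parallelism splits the weight matrix A of a linear layer Y = XA + b along its columns, A = [A_1, ..., A_p], so that each partition computes Y_i = XA_i on its own. The following is a minimal single-process sketch of that arithmetic using NumPy instead of multiple GPUs; the shapes and the partition count are arbitrary illustrative choices, and concatenation stands in for the all-gather that `gather_output=True` would trigger.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # batch of 4 inputs with 8 features
A = rng.standard_normal((8, 6))   # full weight matrix of the layer Y = XA + b
b = rng.standard_normal(6)

num_partitions = 3                # stands in for the tensor-parallel size

# Split the weight along its second dimension: A = [A_1, ..., A_p].
A_shards = np.split(A, num_partitions, axis=1)
b_shards = np.split(b, num_partitions)

# Each partition computes its own output slice Y_i = X A_i + b_i independently.
Y_shards = [X @ A_i + b_i for A_i, b_i in zip(A_shards, b_shards)]

# Concatenating the slices along the last dimension recovers the full output.
Y_gathered = np.concatenate(Y_shards, axis=-1)

print(np.allclose(Y_gathered, X @ A + b))  # True
```

Because every partition needs the full input X but only its own columns of A, no communication is required until the outputs are gathered.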
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 3,
			
 
				-   "id": "organized-orange",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "import sys\n",
			
 
				-    "sys.path.append(\"./Megatron-LM\")\n",
			
 
				-    "from megatron.mpu import layers\n",
			
 
				-    "from torch.nn.parameter import Parameter\n",
			
 
				-    "import torch.nn.init as init\n",
			
 
				-    "import torch\n",
			
 
				-    "import random\n",
			
 
				-    "from megatron import *\n",
			
 
				-    "from megatron.mpu.tests import *\n",
			
 
				-    "\n",
			
 
				-    "from megatron.mpu.utils import *"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 4,
			
 
				-   "id": "compound-morning",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "## the below class is slightly modified from the original Megatron repo to skip environment variable initialization such as world_size\n",
			
 
				-    "global world_size \n",
			
 
				-    "world_size = 16\n",
			
 
				-    "class myColumnParallelLinear(torch.nn.Module):\n",
			
 
				-    "    \"\"\"Linear layer with column parallelism.\n",
			
 
				-    "    The linear layer is defined as Y = XA + b. A is parallelized along\n",
			
 
				-    "    its second dimension as A = [A_1, ..., A_p].\n",
			
 
				-    "    Arguments:\n",
			
 
				-    "        input_size: first dimension of matrix A.\n",
			
 
				-    "        output_size: second dimension of matrix A.\n",
			
 
				-    "        bias: If true, add bias\n",
			
 
				-    "        gather_output: If true, call all-gather on output and make Y available\n",
			
 
				-    "                       to all GPUs, otherwise, every GPU will have its output\n",
			
 
				-    "                       which is Y_i = XA_i\n",
			
 
				-    "        init_method: method to initialize weights. Note that bias is always set\n",
			
 
				-    "                     to zero.\n",
			
 
				-    "        stride: For the strided linear layers.\n",
			
 
				-    "        keep_master_weight_for_test: This was added for testing and should be\n",
			
 
				-    "                                     set to False. It returns the master weights\n",
			
 
				-    "                                     used for initialization.\n",
			
 
				-    "        skip_bias_add: This was added to enable performance optimizations where bias\n",
			
 
				-    "                       can be fused with other elementwise operations. we skip \n",
			
 
				-    "                       adding bias but instead return it.\n",
			
 
				-    "    \"\"\"\n",
			
 
				-    "\n",
			
 
				-    "    def __init__(self, input_size, output_size, bias=True, gather_output=True,\n",
			
 
				-    "                 init_method=init.xavier_normal_, stride=1,\n",
			
 
				-    "                 keep_master_weight_for_test=False,\n",
			
 
				-    "                 skip_bias_add=False):\n",
			
 
				-    "        super(myColumnParallelLinear, self).__init__()\n",
			
 
				-    "\n",
			
 
				-    "        # Keep input parameters\n",
			
 
				-    "        self.input_size = input_size\n",
			
 
				-    "        self.output_size = output_size\n",
			
 
				-    "        self.gather_output = gather_output\n",
			
 
				-    "        # Divide the weight matrix along the last dimension.\n",
			
 
				-    "        \n",
			
 
				-    "        self.output_size_per_partition = divide(output_size, world_size)\n",
			
 
				-    "        self.skip_bias_add = skip_bias_add\n",
			
 
				-    "\n",
			
 
				-    "        # Parameters.\n",
			
 
				-    "        # Note: torch.nn.functional.linear performs XA^T + b and as a result\n",
			
 
				-    "        # we allocate the transpose.\n",
			
 
				-    "        # Initialize weight.        \n",
			
 
				-    "        use_cpu_initialization=True # hard coded to use cpu\n",
			
 
				-    "        params_dtype = torch.float # skipping need of args\n",
			
 
				-    "        \n",
			
 
				-    "        if use_cpu_initialization:\n",
			
 
				-    "            self.weight = Parameter(torch.empty(self.output_size_per_partition,\n",
			
 
				-    "                                                self.input_size,\n",
			
 
				-    "                                                dtype=params_dtype))\n",
			
 
				-    "            \n",
			
 
				-    "            self.master_weight = m_initialize_affine_weight_cpu(\n",
			
 
				-    "                self.weight, self.output_size, self.input_size,\n",
			
 
				-    "                self.output_size_per_partition, 0, init_method,\n",
			
 
				-    "                stride=stride, return_master_weight=keep_master_weight_for_test)\n",
			
 
				-    "            \n",
			
 
				-    "        else:\n",
			
 
				-    "            self.weight = Parameter(torch.empty(\n",
			
 
				-    "                self.output_size_per_partition, self.input_size,\n",
			
 
				-    "                device=torch.cuda.current_device(), dtype=params_dtype))\n",
			
 
				-    "            _initialize_affine_weight_gpu(self.weight, init_method,\n",
			
 
				-    "                                          partition_dim=0, stride=stride)\n",
			
 
				-    "            \n",
			
 
				-    "        if bias:\n",
			
 
				-    "            if use_cpu_initialization:\n",
			
 
				-    "                self.bias = Parameter(torch.empty(\n",
			
 
				-    "                    self.output_size_per_partition, dtype=params_dtype))\n",
			
 
				-    "            else:\n",
			
 
				-    "                self.bias = Parameter(torch.empty(\n",
			
 
				-    "                    self.output_size_per_partition,\n",
			
 
				-    "                    device=torch.cuda.current_device(),\n",
			
 
				-    "                    dtype=params_dtype))\n",
			
 
				-    "            # Always initialize bias to zero.\n",
			
 
				-    "            with torch.no_grad():\n",
			
 
				-    "                self.bias.zero_()\n",
			
 
				-    "        else:\n",
			
 
				-    "            self.register_parameter('bias', None)\n",
			
 
				-    "\n",
			
 
				-    "    def forward(self, input_):\n",
			
 
				-    "        # Set up backprop all-reduce.\n",
			
 
				-    "        print(\"in Column parallel forward\")\n",
			
 
				-    "        input_parallel = copy_to_tensor_model_parallel_region(input_)\n",
			
 
				-    "        # Matrix multiply.\n",
			
 
				-    "\n",
			
 
				-    "        bias = self.bias if not self.skip_bias_add else None\n",
			
 
				-    "        output_parallel = F.linear(input_parallel, self.weight, bias)\n",
			
 
				-    "        if self.gather_output:\n",
			
 
				-    "            # All-gather across the partitions.\n",
			
 
				-    "            output = gather_from_tensor_model_parallel_region(output_parallel)\n",
			
 
				-    "        else:\n",
			
 
				-    "            output = output_parallel \n",
			
 
				-    "        output_bias = self.bias if self.skip_bias_add else None\n",
			
 
				-    "        return output, output_bias"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 5,
			
 
				-   "id": "editorial-refund",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "def get_weight_list(master_weight,tensor_model_parallel_gp):\n",
			
 
				-    "    my_weight_list=[]\n",
			
 
				-    "    a,b=master_weight.size()\n",
			
 
				-    "    print(\"A = [\")\n",
			
 
				-    "    tensor_model_parallel_gp=list(itertools.chain(*tensor_model_parallel_gp))\n",
			
 
				-    "    cnt=0\n",
			
 
				-    "    for gp in tensor_model_parallel_gp :\n",
			
 
				-    "        if which_model_parallel=='col': \n",
			
 
				-    "            temp=master_weight[gp::world_size].T\n",
			
 
				-    "            if cnt < world_size -1 : \n",
			
 
				-    "                print(\"A{}=\".format(str(cnt)), temp.size(), end = ',')\n",
			
 
				-    "            else:\n",
			
 
				-    "                print(\"A{}=\".format(str(cnt)), temp.size())\n",
			
 
				-    "        elif which_model_parallel =='row':\n",
			
 
				-    "            temp=master_weight.T\n",
			
 
				-    "            temp=temp[gp::world_size]\n",
			
 
				-    "            if cnt < world_size -1 :\n",
			
 
				-    "                print(\"A{}=\".format(str(cnt)), temp.size(),',')\n",
			
 
				-    "            else:\n",
			
 
				-    "                print(\"A{}=\".format(str(cnt)), temp.size())\n",
			
 
				-    "\n",
			
 
				-    "        else:\n",
			
 
				-    "            raise ValueError(\"set which_model_parallel to 'col' or 'row'\")\n",
			
 
				-    "        cnt+=1    \n",
			
 
				-    "        my_weight_list.append(temp)\n",
			
 
				-    "            \n",
			
 
				-    "    print(\" ]\")\n",
			
 
				-    "    print(len(my_weight_list))\n",
			
 
				-    "    return my_weight_list\n",
			
 
				-    "def m_initialize_affine_weight_cpu(weight, output_size, input_size,\n",
			
 
				-    "                                  per_partition_size, partition_dim,\n",
			
 
				-    "                                  init_method, stride=1,\n",
			
 
				-    "                                  return_master_weight=False):\n",
			
 
				-    "    \"\"\"Initialize affine weight for model parallel.\n",
			
 
				-    "    Build the master weight on all processes and scatter\n",
			
 
				-    "    the relevant chunk.\"\"\"\n",
			
 
				-    "    params_dtype = torch.float\n",
			
 
				-    "    # Initialize master weight\n",
			
 
				-    "    master_weight = torch.empty(output_size, input_size,\n",
			
 
				-    "                                dtype=torch.float,\n",
			
 
				-    "                                requires_grad=False)    \n",
			
 
				-    "    \n",
			
 
				-    "    master_weight = master_weight.to(dtype=params_dtype)\n",
			
 
				-    "    # Split and copy\n",
			
 
				-    "    per_partition_per_stride_size = divide(per_partition_size, stride)\n",
			
 
				-    "    print(\"per_partition_per_stride_size \",per_partition_per_stride_size)\n",
			
 
				-    "    weight_list = torch.split(master_weight, per_partition_per_stride_size,\n",
			
 
				-    "                              dim=partition_dim)\n",
			
 
				-    "    ########  tensor_model_parallel_gp below is hard-coded for tensor_model_parallel_size= 2 , pipeline_model_parallel_size= 4 ########\n",
			
 
				-    "    ########    if you use another model-parallel configuration, paste its tensor-model-parallel groups below    ########\n",
			
 
				-    "    tensor_model_parallel_gp=[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]] \n",
			
 
				-    "    my_weight_list = get_weight_list(master_weight,tensor_model_parallel_gp)\n",
			
 
				-    "    \n",
			
 
				-    "    with torch.no_grad():\n",
			
 
				-    "        torch.cat(my_weight_list, dim=partition_dim, out=weight)\n",
			
 
				-    "    if return_master_weight:\n",
			
 
				-    "        return master_weight\n",
			
 
				-    "    return None"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "distinguished-rhythm",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## Peek inside Column Parallel Class"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 6,
			
 
				-   "id": "under-secondary",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "this is how A is sliced column-wised ...\n",
			
 
				-      "\n",
			
 
				-      "per_partition_per_stride_size  32\n",
			
 
				-      "A = [\n",
			
 
				-      "A0= torch.Size([1024, 32]),A1= torch.Size([1024, 32]),A2= torch.Size([1024, 32]),A3= torch.Size([1024, 32]),A4= torch.Size([1024, 32]),A5= torch.Size([1024, 32]),A6= torch.Size([1024, 32]),A7= torch.Size([1024, 32]),A8= torch.Size([1024, 32]),A9= torch.Size([1024, 32]),A10= torch.Size([1024, 32]),A11= torch.Size([1024, 32]),A12= torch.Size([1024, 32]),A13= torch.Size([1024, 32]),A14= torch.Size([1024, 32]),A15= torch.Size([1024, 32])\n",
			
 
				-      " ]\n",
			
 
				-      "16\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "tensor_model_parallel_size= 2 \n",
			
 
				-    "pipeline_model_parallel_size= 4  \n",
			
 
				-    "input_size = 1024 # 1024 rows\n",
			
 
				-    "output_size = 512 # 512 columns\n",
			
 
				-    "which_model_parallel='col'\n",
			
 
				-    "print(\"this is how A is sliced column-wised ...\\n\")\n",
			
 
				-    "testCol=myColumnParallelLinear(input_size, output_size, bias=True, gather_output=True,\n",
			
 
				-    "                 init_method=init.xavier_normal_, stride=1,\n",
			
 
				-    "                 keep_master_weight_for_test=False,\n",
			
 
				-    "                 skip_bias_add=False)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 7,
			
 
				-   "id": "selective-snake",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "per_partition_per_stride_size=32\n",
			
 
				-    "assert 16* per_partition_per_stride_size == 512 "
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 8,
			
 
				-   "id": "inside-france",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "data": {
			
 
				-      "text/plain": [
			
 
				-       "__main__.myColumnParallelLinear"
			
 
				-      ]
			
 
				-     },
			
 
				-     "execution_count": 8,
			
 
				-     "metadata": {},
			
 
				-     "output_type": "execute_result"
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "type(testCol)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 9,
			
 
				-   "id": "agricultural-marine",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "data": {
			
 
				-      "text/plain": [
			
 
				-       "(1024, 512)"
			
 
				-      ]
			
 
				-     },
			
 
				-     "execution_count": 9,
			
 
				-     "metadata": {},
			
 
				-     "output_type": "execute_result"
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "testCol.input_size, testCol.output_size"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "experienced-profit",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "----------------------------------------------------------------------\n",
			
 
				-    "## Megatron-LM's Row Parallel \n",
			
 
				-    "[RowParallel reference](https://github.com/NVIDIA/Megatron-LM/blob/90e0a0dd08159e1c95f4f9d99bb8687f327d36c3/megatron/mpu/layers.py#L294)\n",
			
 
				-    "![RowParallel](./Megatron-LM/pics/RowParallel.JPG)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 10,
			
 
				-   "id": "mineral-adapter",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "class myRowParallelLinear(torch.nn.Module):\n",
			
 
				-    "    \"\"\"Linear layer with row parallelism.\n",
			
 
				-    "    The linear layer is defined as Y = XA + b. A is parallelized along\n",
			
 
				-    "    its first dimension and X along its second dimension as:\n",
			
 
				-    "               -   -\n",
			
 
				-    "              | A_1 |\n",
			
 
				-    "              | .   |\n",
			
 
				-    "          A = | .   |        X = [X_1, ..., X_p]\n",
			
 
				-    "              | .   |\n",
			
 
				-    "              | A_p |\n",
			
 
				-    "               -   -\n",
			
 
				-    "    Arguments:\n",
			
 
				-    "        input_size: first dimension of matrix A.\n",
			
 
				-    "        output_size: second dimension of matrix A.\n",
			
 
				-    "        bias: If true, add bias. Note that bias is not parallelized.\n",
			
 
				-    "        input_is_parallel: If true, we assume that the input is already\n",
			
 
				-    "                           split across the GPUs and we do not split\n",
			
 
				-    "                           again.\n",
			
 
				-    "        init_method: method to initialize weights. Note that bias is always set\n",
			
 
				-    "                     to zero.\n",
			
 
				-    "        stride: For the strided linear layers.\n",
			
 
				-    "        keep_master_weight_for_test: This was added for testing and should be\n",
			
 
				-    "                                     set to False. It returns the master weights\n",
			
 
				-    "                                     used for initialization.\n",
			
 
				-    "        skip_bias_add: This was added to enable performance optimizations where bias\n",
			
 
				-    "                       can be fused with other elementwise operations. We skip\n",
			
 
				-    "                       adding bias but instead return it.\n",
			
 
				-    "    \"\"\"\n",
			
 
				-    "\n",
			
 
				-    "    def __init__(self, input_size, output_size, bias=True,\n",
			
 
				-    "                 input_is_parallel=False,\n",
			
 
				-    "                 init_method=init.xavier_normal_, stride=1,\n",
			
 
				-    "                 keep_master_weight_for_test=False,\n",
			
 
				-    "                 skip_bias_add=False):\n",
			
 
				-    "        super(myRowParallelLinear, self).__init__()\n",
			
 
				-    "\n",
			
 
				-    "        # Keep input parameters\n",
			
 
				-    "        self.input_size = input_size\n",
			
 
				-    "        self.output_size = output_size\n",
			
 
				-    "        self.input_is_parallel = input_is_parallel\n",
			
 
				-    "        # Divide the weight matrix along the last dimension.\n",
			
 
				-    "        self.input_size_per_partition = divide(input_size, world_size)\n",
			
 
				-    "        self.skip_bias_add = skip_bias_add\n",
			
 
				-    "        print(\"input_size_per_partition \", self.input_size_per_partition)\n",
			
 
				-    "        \n",
			
 
				-    "\n",
			
 
				-    "        # Parameters.\n",
			
 
				-    "        # Note: torch.nn.functional.linear performs XA^T + b and as a result\n",
			
 
				-    "        # we allocate the transpose.\n",
			
 
				-    "        # Initialize weight.\n",
			
 
				-    "        use_cpu_initialization=True # hard coded to use cpu\n",
			
 
				-    "        params_dtype = torch.float # skipping need of args\n",
			
 
				-    "        if use_cpu_initialization:\n",
			
 
				-    "            self.weight = Parameter(torch.empty(self.output_size,\n",
			
 
				-    "                                                self.input_size_per_partition,\n",
			
 
				-    "                                                dtype=params_dtype))\n",
			
 
				-    "            self.master_weight = m_initialize_affine_weight_cpu(\n",
			
 
				-    "                self.weight, self.output_size, self.input_size,\n",
			
 
				-    "                self.input_size_per_partition, 1, init_method,\n",
			
 
				-    "                stride=stride, return_master_weight=keep_master_weight_for_test)\n",
			
 
				-    "        else:\n",
			
 
				-    "            self.weight = Parameter(torch.empty(\n",
			
 
				-    "                self.output_size, self.input_size_per_partition,\n",
			
 
				-    "                device=torch.cuda.current_device(), dtype=params_dtype))\n",
			
 
				-    "            _initialize_affine_weight_gpu(self.weight, init_method,\n",
			
 
				-    "                                          partition_dim=1, stride=stride)\n",
			
 
				-    "        if bias:\n",
			
 
				-    "            if use_cpu_initialization:\n",
			
 
				-    "                self.bias = Parameter(torch.empty(self.output_size,\n",
			
 
				-    "                                                  dtype=params_dtype))\n",
			
 
				-    "            else:\n",
			
 
				-    "                self.bias = Parameter(torch.empty(\n",
			
 
				-    "                    self.output_size, device=torch.cuda.current_device(),\n",
			
 
				-    "                    dtype=params_dtype))\n",
			
 
				-    "            # Always initialize bias to zero.\n",
			
 
				-    "            with torch.no_grad():\n",
			
 
				-    "                self.bias.zero_()\n",
			
 
				-    "        else:\n",
			
 
				-    "            self.register_parameter('bias', None)\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "    def forward(self, input_):\n",
			
 
				-    "        # Set up backprop all-reduce.\n",
			
 
				-    "        if self.input_is_parallel:\n",
			
 
				-    "            input_parallel = input_\n",
			
 
				-    "        else:\n",
			
 
				-    "            input_parallel = scatter_to_tensor_model_parallel_region(input_)\n",
			
 
				-    "        # Matrix multiply.\n",
			
 
				-    "        output_parallel = F.linear(input_parallel, self.weight)\n",
			
 
				-    "        # All-reduce across all the partitions.\n",
			
 
				-    "        output_ = reduce_from_tensor_model_parallel_region(output_parallel)\n",
			
 
				-    "        if not self.skip_bias_add:\n",
			
 
				-    "            output = output_ + self.bias if self.bias is not None else output_\n",
			
 
				-    "            output_bias = None\n",
			
 
				-    "        else:\n",
			
 
				-    "            output = output_\n",
			
 
				-    "            output_bias = self.bias\n",
			
 
				-    "        return output, output_bias"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 11,
			
 
				-   "id": "nearby-latino",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "this is how A is sliced Row-wised ...\n",
			
 
				-      "\n",
			
 
				-      "input_size_per_partition  64\n",
			
 
				-      "per_partition_per_stride_size  64\n",
			
 
				-      "A = [\n",
			
 
				-      "A0= torch.Size([64, 512]) ,\n",
			
 
				-      "A1= torch.Size([64, 512]) ,\n",
			
 
				-      "A2= torch.Size([64, 512]) ,\n",
			
 
				-      "A3= torch.Size([64, 512]) ,\n",
			
 
				-      "A4= torch.Size([64, 512]) ,\n",
			
 
				-      "A5= torch.Size([64, 512]) ,\n",
			
 
				-      "A6= torch.Size([64, 512]) ,\n",
			
 
				-      "A7= torch.Size([64, 512]) ,\n",
			
 
				-      "A8= torch.Size([64, 512]) ,\n",
			
 
				-      "A9= torch.Size([64, 512]) ,\n",
			
 
				-      "A10= torch.Size([64, 512]) ,\n",
			
 
				-      "A11= torch.Size([64, 512]) ,\n",
			
 
				-      "A12= torch.Size([64, 512]) ,\n",
			
 
				-      "A13= torch.Size([64, 512]) ,\n",
			
 
				-      "A14= torch.Size([64, 512]) ,\n",
			
 
				-      "A15= torch.Size([64, 512])\n",
			
 
				-      " ]\n",
			
 
				-      "16\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "tensor_model_parallel_size= 2 \n",
			
 
				-    "pipeline_model_parallel_size= 4  \n",
			
 
				-    "input_size = 1024 # first dimension of the matrix\n",
			
 
				-    "output_size = 512 # 2nd dimension of the matrix\n",
			
 
				-    "print(\"this is how A is sliced Row-wised ...\\n\")\n",
			
 
				-    "which_model_parallel='row'\n",
			
 
				-    "testRow=myRowParallelLinear(input_size,output_size, bias=True,\n",
			
 
				-    "                 input_is_parallel=False,\n",
			
 
				-    "                 init_method=init.xavier_normal_, stride=1,\n",
			
 
				-    "                 keep_master_weight_for_test=False,\n",
			
 
				-    "                 skip_bias_add=False)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 72,
			
 
				-   "id": "economic-istanbul",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "per_partition_per_stride_size=64\n",
			
 
				-    "assert 16* per_partition_per_stride_size == 1024 "
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 58,
			
 
				-   "id": "pursuant-denial",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "data": {
			
 
				-      "text/plain": [
			
 
				-       "(1024, 512)"
			
 
				-      ]
			
 
				-     },
			
 
				-     "execution_count": 58,
			
 
				-     "metadata": {},
			
 
				-     "output_type": "execute_result"
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "testRow.input_size, testRow.output_size"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "stretch-creature",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "--- \n",
			
 
				-    "\n",
			
 
				-    "## Additional Resources\n",
			
 
				-    "\n",
			
 
				-    "Efficient Large-Scale Language Model Training on GPU Clusters : https://arxiv.org/pdf/2104.04473.pdf \n",
			
 
				-    "\n",
			
 
				-    "Pushing Forward the Frontiers of Natural Language Processing : https://blogs.nvidia.com/blog/2021/09/16/nlp-frontiers-ai-hardware-summit/"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "stopped-software",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "---\n",
			
 
				-    "## Up Next -\n",
			
 
				-    "[About GPT's tokenizer](./Day2-3_GPT_vocab_merge_files.ipynb)\n",
			
 
				-    "## Back To Start Menu\n",
			
 
				-    "[start menu](../Start_Here.ipynb)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "dated-garbage",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "-----\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "## Licensing \n",
			
 
				-    "\n",
			
 
				-    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
			
 
				-   ]
			
 
				-  }
			
 
				- ],
			
 
				- "metadata": {
			
 
				-  "kernelspec": {
			
 
				-   "display_name": "Python 3",
			
 
				-   "language": "python",
			
 
				-   "name": "python3"
			
 
				-  },
			
 
				-  "language_info": {
			
 
				-   "codemirror_mode": {
			
 
				-    "name": "ipython",
			
 
				-    "version": 3
			
 
				-   },
			
 
				-   "file_extension": ".py",
			
 
				-   "mimetype": "text/x-python",
			
 
				-   "name": "python",
			
 
				-   "nbconvert_exporter": "python",
			
 
				-   "pygments_lexer": "ipython3",
			
 
				-   "version": "3.8.8"
			
 
				-  }
			
 
				- },
			
 
				- "nbformat": 4,
			
 
				- "nbformat_minor": 5
			
 
				-}
			
--- a/ai/Megatron/English/Python/jupyter_notebook/Day2-3_GPT_vocab_merge_files.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Day2-3_GPT_vocab_merge_files.ipynb
@@ -1,449 +0,0 @@
 
				-{
			
 
				- "cells": [
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "strong-deadline",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "<img src=http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png style=\"width: 90px; float: right;\">\n",
			
 
				-    "\n",
			
 
				-    "# About GPT vocab and merge files\n",
			
 
				-    "---\n",
			
 
				-    "\n",
			
 
				-    "## Learning Objectives\n",
			
 
				-    "The goal of this lab is to:\n",
			
 
				-    "\n",
			
 
				-    "- understand the difference between the BPE and GPTBPE tokenizers\n",
			
 
				-    "- load the GPTBPE tokenizer and verify that it tokenizes text as expected\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "Note: Download the GPT vocab and merge files from the links below before proceeding.\n",
			
 
				-    "\n",
			
 
				-    "Download vocab file [English_vocab](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json)\n",
			
 
				-    "\n",
			
 
				-    "Download merge file [English_merge](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt)\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "level-trigger",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "#### Let's review the source code of the [GPT-2 tokenizer](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html)\n",
			
 
				-    "\n",
			
 
				-    "Construct a GPT-2 tokenizer. Based on byte-level Byte-Pair-Encoding.\n",
			
 
				-    "\n",
			
 
				-    "    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will\n",
			
 
				-    "    be encoded differently whether it is at the beginning of the sentence (without space) or not:\n",
			
 
				-    "\n",
			
 
				-    "    ::\n",
			
 
				-    "\n",
			
 
				-    "         from transformers import GPT2Tokenizer\n",
			
 
				-    "         tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
			
 
				-    "        \n",
			
 
				-    "         tokenizer(\" Hello world\")['input_ids']\n",
			
 
				-    "        [18435, 995]\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 1,
			
 
				-   "id": "mysterious-favorite",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "Defaulting to user installation because normal site-packages is not writeable\n",
			
 
				-      "Requirement already satisfied: tokenizers in /home/x_zench/.local/lib/python3.8/site-packages (0.10.3)\n",
			
 
				-      "Requirement already satisfied: transformers in /home/x_zench/.local/lib/python3.8/site-packages (4.10.0)\n",
			
 
				-      "Requirement already satisfied: ipywidgets in /home/x_zench/.local/lib/python3.8/site-packages (7.6.4)\n",
			
 
				-      "Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.8/site-packages (from transformers) (5.4.1)\n",
			
 
				-      "Requirement already satisfied: sacremoses in /opt/conda/lib/python3.8/site-packages (from transformers) (0.0.35)\n",
			
 
				-      "Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from transformers) (2.24.0)\n",
			
 
				-      "Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.8/site-packages (from transformers) (2021.3.17)\n",
			
 
				-      "Requirement already satisfied: packaging in /opt/conda/lib/python3.8/site-packages (from transformers) (20.9)\n",
			
 
				-      "Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.8/site-packages (from transformers) (1.19.2)\n",
			
 
				-      "Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.8/site-packages (from transformers) (4.53.0)\n",
			
 
				-      "Requirement already satisfied: huggingface-hub>=0.0.12 in /home/x_zench/.local/lib/python3.8/site-packages (from transformers) (0.0.16)\n",
			
 
				-      "Requirement already satisfied: filelock in /opt/conda/lib/python3.8/site-packages (from transformers) (3.0.12)\n",
			
 
				-      "Requirement already satisfied: jupyterlab-widgets>=1.0.0; python_version >= \"3.6\" in /home/x_zench/.local/lib/python3.8/site-packages (from ipywidgets) (1.0.1)\n",
			
 
				-      "Requirement already satisfied: ipykernel>=4.5.1 in /opt/conda/lib/python3.8/site-packages (from ipywidgets) (5.5.0)\n",
			
 
				-      "Requirement already satisfied: ipython-genutils~=0.2.0 in /opt/conda/lib/python3.8/site-packages (from ipywidgets) (0.2.0)\n",
			
 
				-      "Requirement already satisfied: nbformat>=4.2.0 in /opt/conda/lib/python3.8/site-packages (from ipywidgets) (5.1.2)\n",
			
 
				-      "Requirement already satisfied: traitlets>=4.3.1 in /opt/conda/lib/python3.8/site-packages (from ipywidgets) (5.0.5)\n",
			
 
				-      "Requirement already satisfied: widgetsnbextension~=3.5.0 in /home/x_zench/.local/lib/python3.8/site-packages (from ipywidgets) (3.5.1)\n",
			
 
				-      "Requirement already satisfied: ipython>=4.0.0; python_version >= \"3.3\" in /opt/conda/lib/python3.8/site-packages (from ipywidgets) (7.21.0)\n",
			
 
				-      "Requirement already satisfied: click in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers) (7.1.2)\n",
			
 
				-      "Requirement already satisfied: joblib in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers) (1.0.1)\n",
			
 
				-      "Requirement already satisfied: six in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers) (1.15.0)\n",
			
 
				-      "Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.8/site-packages (from requests->transformers) (3.0.4)\n",
			
 
				-      "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->transformers) (1.25.11)\n",
			
 
				-      "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->transformers) (2020.12.5)\n",
			
 
				-      "Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->transformers) (2.10)\n",
			
 
				-      "Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.8/site-packages (from packaging->transformers) (2.4.7)\n",
			
 
				-      "Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.8/site-packages (from huggingface-hub>=0.0.12->transformers) (3.7.4.3)\n",
			
 
				-      "Requirement already satisfied: tornado>=4.2 in /opt/conda/lib/python3.8/site-packages (from ipykernel>=4.5.1->ipywidgets) (6.1)\n",
			
 
				-      "Requirement already satisfied: jupyter-client in /opt/conda/lib/python3.8/site-packages (from ipykernel>=4.5.1->ipywidgets) (6.1.12)\n",
			
 
				-      "Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /opt/conda/lib/python3.8/site-packages (from nbformat>=4.2.0->ipywidgets) (3.0.2)\n",
			
 
				-      "Requirement already satisfied: jupyter-core in /opt/conda/lib/python3.8/site-packages (from nbformat>=4.2.0->ipywidgets) (4.7.1)\n",
			
 
				-      "Requirement already satisfied: notebook>=4.4.1 in /opt/conda/lib/python3.8/site-packages (from widgetsnbextension~=3.5.0->ipywidgets) (6.2.0)\n",
			
 
				-      "Requirement already satisfied: decorator in /opt/conda/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (4.4.2)\n",
			
 
				-      "Requirement already satisfied: pygments in /opt/conda/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (2.8.1)\n",
			
 
				-      "Requirement already satisfied: pexpect>4.3; sys_platform != \"win32\" in /opt/conda/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (4.8.0)\n",
			
 
				-      "Requirement already satisfied: jedi>=0.16 in /opt/conda/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (0.17.0)\n",
			
 
				-      "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (3.0.8)\n",
			
 
				-      "Requirement already satisfied: pickleshare in /opt/conda/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (0.7.5)\n",
			
 
				-      "Requirement already satisfied: backcall in /opt/conda/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (0.2.0)\n",
			
 
				-      "Requirement already satisfied: setuptools>=18.5 in /opt/conda/lib/python3.8/site-packages (from ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (50.3.1.post20201107)\n",
			
 
				-      "Requirement already satisfied: pyzmq>=13 in /opt/conda/lib/python3.8/site-packages (from jupyter-client->ipykernel>=4.5.1->ipywidgets) (22.0.3)\n",
			
 
				-      "Requirement already satisfied: python-dateutil>=2.1 in /opt/conda/lib/python3.8/site-packages (from jupyter-client->ipykernel>=4.5.1->ipywidgets) (2.8.1)\n",
			
 
				-      "Requirement already satisfied: attrs>=17.4.0 in /opt/conda/lib/python3.8/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets) (20.3.0)\n",
			
 
				-      "Requirement already satisfied: pyrsistent>=0.14.0 in /opt/conda/lib/python3.8/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets) (0.17.3)\n",
			
 
				-      "Requirement already satisfied: argon2-cffi in /opt/conda/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (20.1.0)\n",
			
 
				-      "Requirement already satisfied: nbconvert in /opt/conda/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (6.0.7)\n",
			
 
				-      "Requirement already satisfied: jinja2 in /opt/conda/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (2.11.3)\n",
			
 
				-      "Requirement already satisfied: prometheus-client in /opt/conda/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (0.9.0)\n",
			
 
				-      "Requirement already satisfied: terminado>=0.8.3 in /opt/conda/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (0.9.3)\n",
			
 
				-      "Requirement already satisfied: Send2Trash>=1.5.0 in /opt/conda/lib/python3.8/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (1.5.0)\n",
			
 
				-      "Requirement already satisfied: ptyprocess>=0.5 in /opt/conda/lib/python3.8/site-packages (from pexpect>4.3; sys_platform != \"win32\"->ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (0.7.0)\n",
			
 
				-      "Requirement already satisfied: parso>=0.7.0 in /opt/conda/lib/python3.8/site-packages (from jedi>=0.16->ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (0.8.1)\n",
			
 
				-      "Requirement already satisfied: wcwidth in /opt/conda/lib/python3.8/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=4.0.0; python_version >= \"3.3\"->ipywidgets) (0.2.5)\n",
			
 
				-      "Requirement already satisfied: cffi>=1.0.0 in /opt/conda/lib/python3.8/site-packages (from argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (1.14.3)\n",
			
 
				-      "Requirement already satisfied: defusedxml in /opt/conda/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (0.7.1)\n",
			
 
				-      "Requirement already satisfied: entrypoints>=0.2.2 in /opt/conda/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (0.3)\n",
			
 
				-      "Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (1.4.3)\n",
			
 
				-      "Requirement already satisfied: jupyterlab-pygments in /opt/conda/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (0.1.2)\n",
			
 
				-      "Requirement already satisfied: testpath in /opt/conda/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (0.4.4)\n",
			
 
				-      "Requirement already satisfied: nbclient<0.6.0,>=0.5.0 in /opt/conda/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (0.5.3)\n",
			
 
				-      "Requirement already satisfied: mistune<2,>=0.8.1 in /opt/conda/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (0.8.4)\n",
			
 
				-      "Requirement already satisfied: bleach in /opt/conda/lib/python3.8/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (3.3.0)\n",
			
 
				-      "Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.8/site-packages (from jinja2->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (1.1.1)\n",
			
 
				-      "Requirement already satisfied: pycparser in /opt/conda/lib/python3.8/site-packages (from cffi>=1.0.0->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (2.20)\n",
			
 
				-      "Requirement already satisfied: nest-asyncio in /opt/conda/lib/python3.8/site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (1.5.1)\n",
			
 
				-      "Requirement already satisfied: async-generator in /opt/conda/lib/python3.8/site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (1.10)\n",
			
 
				-      "Requirement already satisfied: webencodings in /opt/conda/lib/python3.8/site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets) (0.5.1)\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "!pip install tokenizers  transformers ipywidgets"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 2,
			
 
				-   "id": "shaped-mercury",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "--2021-09-15 09:29:57--  https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json\n",
			
 
				-      "Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.95.125\n",
			
 
				-      "Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.95.125|:443... connected.\n",
			
 
				-      "HTTP request sent, awaiting response... 200 OK\n",
			
 
				-      "Length: 1042301 (1018K) [application/json]\n",
			
 
				-      "Saving to: ‘gpt2-vocab.json’\n",
			
 
				-      "\n",
			
 
				-      "gpt2-vocab.json     100%[===================>]   1018K  1.53MB/s    in 0.7s    \n",
			
 
				-      "\n",
			
 
				-      "2021-09-15 09:29:58 (1.53 MB/s) - ‘gpt2-vocab.json’ saved [1042301/1042301]\n",
			
 
				-      "\n",
			
 
				-      "--2021-09-15 09:29:58--  https://huggingface.co/openai-gpt/resolve/main/vocab.json\n",
			
 
				-      "Resolving huggingface.co (huggingface.co)... 107.23.77.87, 34.200.164.230, 34.195.144.223, ...\n",
			
 
				-      "Connecting to huggingface.co (huggingface.co)|107.23.77.87|:443... connected.\n",
			
 
				-      "HTTP request sent, awaiting response... 200 OK\n",
			
 
				-      "Length: 815973 (797K) [application/json]\n",
			
 
				-      "Saving to: ‘vocab.json’\n",
			
 
				-      "\n",
			
 
				-      "vocab.json          100%[===================>] 796.85K  1.78MB/s    in 0.4s    \n",
			
 
				-      "\n",
			
 
				-      "2021-09-15 09:29:59 (1.78 MB/s) - ‘vocab.json’ saved [815973/815973]\n",
			
 
				-      "\n",
			
 
				-      "--2021-09-15 09:30:00--  https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt\n",
			
 
				-      "Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.95.125\n",
			
 
				-      "Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.95.125|:443... connected.\n",
			
 
				-      "HTTP request sent, awaiting response... 200 OK\n",
			
 
				-      "Length: 456318 (446K) [text/plain]\n",
			
 
				-      "Saving to: ‘gpt2-merges.txt’\n",
			
 
				-      "\n",
			
 
				-      "gpt2-merges.txt     100%[===================>] 445.62K  1.00MB/s    in 0.4s    \n",
			
 
				-      "\n",
			
 
				-      "2021-09-15 09:30:01 (1.00 MB/s) - ‘gpt2-merges.txt’ saved [456318/456318]\n",
			
 
				-      "\n",
			
 
				-      "--2021-09-15 09:30:01--  https://huggingface.co/openai-gpt/resolve/main/merges.txt\n",
			
 
				-      "Resolving huggingface.co (huggingface.co)... 107.23.77.87, 34.200.164.230, 34.195.144.223, ...\n",
			
 
				-      "Connecting to huggingface.co (huggingface.co)|107.23.77.87|:443... connected.\n",
			
 
				-      "HTTP request sent, awaiting response... 200 OK\n",
			
 
				-      "Length: 458495 (448K) [text/plain]\n",
			
 
				-      "Saving to: ‘merges.txt’\n",
			
 
				-      "\n",
			
 
				-      "merges.txt          100%[===================>] 447.75K  1007KB/s    in 0.4s    \n",
			
 
				-      "\n",
			
 
				-      "2021-09-15 09:30:02 (1007 KB/s) - ‘merges.txt’ saved [458495/458495]\n",
			
 
				-      "\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json\n",
			
 
				-    "!wget https://huggingface.co/openai-gpt/resolve/main/vocab.json\n",
			
 
				-    "!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt\n",
			
 
				-    "!wget https://huggingface.co/openai-gpt/resolve/main/merges.txt"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 3,
			
 
				-   "id": "traditional-triangle",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "Enabling notebook extension jupyter-js-widgets/extension...\n",
			
 
				-      "      - Validating: \u001b[32mOK\u001b[0m\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "!jupyter nbextension enable --py widgetsnbextension"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "disciplinary-journalist",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## Examine the vocab and merge files"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 4,
			
 
				-   "id": "accomplished-builder",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "noted that the Ġ = space + 256 to form that control letter\n",
			
 
				-      "['Ġassorted', 'ĠRevision', 'ĠPiano', 'ĠGideon', 'Ocean', 'Ġsalon', 'Ġbustling', 'ognitive', 'ĠRahman', 'Ġwaiter', 'Ġpresets', 'ĠOsh', 'ĠGHC', 'operator', 'Ġreptiles', 'Ġ413', 'ĠGarr', 'ĠChak', 'Ġhashes', 'Ġfailings']\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "import json\n",
			
 
				-    "import random\n",
			
 
				-    "with open('gpt2-vocab.json') as ip_file:\n",
			
 
				-    "    o = json.load(ip_file)\n",
			
 
				-    "    take=20\n",
			
 
				-    "    rn=random.randint(0,len(o)-1)\n",
			
 
				-    "    print(\"noted that the Ġ = space + 256 to form that control letter\")\n",
			
 
				-    "    print(list(o.keys())[rn:rn+take])            "
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 5,
			
 
				-   "id": "dependent-bridge",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "om inated\n",
			
 
				-      "Ġreg ress\n",
			
 
				-      "ĠColl ider\n",
			
 
				-      "Ġinform ants\n",
			
 
				-      "Ġg azed\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "!tail -n 5 gpt2-merges.txt"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "imperial-setting",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## Sanity check: load GPT2Tokenizer from transformers"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 6,
			
 
				-   "id": "corresponding-yugoslavia",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "\n",
			
 
				-      " notice the **SPACE** in front of ** Hello world** \n",
			
 
				-      "\n",
			
 
				-      " Hello world\n",
			
 
				-      "tokens: ['ĠHello', 'Ġworld']\n",
			
 
				-      "ids: [18435, 995]\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "from transformers import GPT2Tokenizer\n",
			
 
				-    "tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
			
 
				-    "\n",
			
 
				-    "print('\\n notice the **SPACE** in front of ** Hello world** \\n')\n",
			
 
				-    "sample_text=\" Hello world\"\n",
			
 
				-    "print(sample_text)\n",
			
 
				-    "out=tokenizer.tokenize(sample_text)\n",
			
 
				-    "print(\"tokens:\",out)\n",
			
 
				-    "ids=tokenizer(sample_text)['input_ids']\n",
			
 
				-    "print(\"ids:\",ids)\n",
			
 
				-    "## expected output :\n",
			
 
				-    "## [18435, 995]"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 7,
			
 
				-   "id": "overhead-philip",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "tokens:  ['ĠHello', 'Ġworld']\n",
			
 
				-      "ids: [18435, 995]\n",
			
 
				-      "------------------------------\n",
			
 
				-      "\n",
			
 
				-      "notice the difference when using BPE as tokenizer instead of GPT2BPE tokenizer\n",
			
 
				-      "tokens:  ['H', 'ellow', 'orld']\n",
			
 
				-      "ids: [39, 5037, 1764]\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "from tokenizers import Tokenizer, models, pre_tokenizers, trainers\n",
			
 
				-    "from tokenizers.decoders import ByteLevel as ByteLevelDecoder\n",
			
 
				-    "from tokenizers.models import BPE\n",
			
 
				-    "import json\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "def load_tokenizer(vocab_file,merge_file, gpt2):\n",
			
 
				-    "    tokenizer = Tokenizer(BPE())\n",
			
 
				-    "    tokenizer.model = BPE.from_file(vocab_file, merge_file)\n",
			
 
				-    "    with open(vocab_file, 'r') as f2:\n",
			
 
				-    "        vocab = json.loads(f2.read())  \n",
			
 
				-    "    if gpt2:\n",
			
 
				-    "        tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()\n",
			
 
				-    "        tokenizer.decoder = ByteLevelDecoder()\n",
			
 
				-    "    return tokenizer , vocab\n",
			
 
				-    "vocab_file='./gpt2-vocab.json'\n",
			
 
				-    "merge_file='./gpt2-merges.txt'\n",
			
 
				-    "tokenizers_gpt,_=load_tokenizer(vocab_file,merge_file,True)\n",
			
 
				-    "sample_text=' Hello world' \n",
			
 
				-    "output=tokenizers_gpt.encode(sample_text)\n",
			
 
				-    "ids=output.ids\n",
			
 
				-    "tokens=output.tokens\n",
			
 
				-    "#print(tokens ,'\\n')\n",
			
 
				-    "print(\"tokens: \",tokens)\n",
			
 
				-    "print(\"ids:\",ids)\n",
			
 
				-    "\n",
			
 
				-    "tokenizers_bpe,_=load_tokenizer(vocab_file,merge_file, False)\n",
			
 
				-    "sample_text=' Hello world'\n",
			
 
				-    "output=tokenizers_bpe.encode(sample_text)\n",
			
 
				-    "ids=output.ids\n",
			
 
				-    "tokens=output.tokens\n",
			
 
				-    "\n",
			
 
				-    "print(\"---\"*10)\n",
			
 
				-    "print('\\nnotice the difference when using BPE as tokenizer instead of GPT2BPE tokenizer')\n",
			
 
				-    "print(\"tokens: \",tokens)\n",
			
 
				-    "print(\"ids:\",ids)\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 8,
			
 
				-   "id": "interpreted-termination",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "## clean up\n",
			
 
				-    "!rm merges.txt\n",
			
 
				-    "!rm vocab.json"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "vulnerable-outside",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "--- \n",
			
 
				-    "\n",
			
 
				-    "## Additional Resources\n",
			
 
				-    "\n",
			
 
				-    "HuggingFace Tokenizer Documentation : https://huggingface.co/docs/tokenizers/python/latest/quicktour.html\n",
			
 
				-    "\n",
			
 
				-    "Train GPT-2 in your own language : https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "aquatic-blanket",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "---\n",
			
 
				-    "## Up Next : \n",
			
 
				-    "\n",
			
 
				-    "[Jsonfy and convert to mmap](./Day2-4_jsonfy_and_process2mmap.ipynb)\n",
			
 
				-    "\n",
			
 
				-    "## Back To Start Menu\n",
			
 
				-    "[start menu](../Start_Here.ipynb)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "injured-pursuit",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "-----\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "## Licensing \n",
			
 
				-    "\n",
			
 
				-    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
			
 
				-   ]
			
 
				-  }
			
 
				- ],
			
 
				- "metadata": {
			
 
				-  "kernelspec": {
			
 
				-   "display_name": "Python 3",
			
 
				-   "language": "python",
			
 
				-   "name": "python3"
			
 
				-  },
			
 
				-  "language_info": {
			
 
				-   "codemirror_mode": {
			
 
				-    "name": "ipython",
			
 
				-    "version": 3
			
 
				-   },
			
 
				-   "file_extension": ".py",
			
 
				-   "mimetype": "text/x-python",
			
 
				-   "name": "python",
			
 
				-   "nbconvert_exporter": "python",
			
 
				-   "pygments_lexer": "ipython3",
			
 
				-   "version": "3.8.8"
			
 
				-  }
			
 
				- },
			
 
				- "nbformat": 4,
			
 
				- "nbformat_minor": 5
			
 
				-}
			
--- a/ai/Megatron/English/Python/jupyter_notebook/Day2-4_jsonfy_and_process2mmap.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Day2-4_jsonfy_and_process2mmap.ipynb
@@ -1,394 +0,0 @@
 
				-{
			
 
				- "cells": [
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "fifteen-channel",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "# Jsonfy and preprocess to mmap format for optimizing data loading\n",
			
 
				-    "---\n",
			
 
				-    "\n",
			
 
				-    "## Learning Objectives\n",
			
 
				-    "- **The goal of this lab is to:**\n",
			
 
				-    "    - understand the motivation for preprocessing to mmap format\n",
			
 
				-    "    - understand the assumptions about the data\n",
			
 
				-    "    - jsonfy the raw text data into loose json format\n",
			
 
				-    "    - use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training\n",
			
 
				-    "\n",
			
 
				-    "----------------------------------------------------------\n",
			
 
				-    "### Understand the need for preprocessing to mmap format\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 1,
			
 
				-   "id": "governing-country",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "import numpy as np\n",
			
 
				-    "out=np.random.random((1024,2048))\n",
			
 
				-    "np.save('myarr',out)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 2,
			
 
				-   "id": "mysterious-hazard",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "3.84 ms ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "%%timeit \n",
			
 
				-    "out=np.load('myarr.npy')"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 3,
			
 
				-   "id": "designed-swaziland",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "43 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "%%timeit\n",
			
 
				-    "# memory-map the saved array; dtype and shape are read from the .npy header\n",
			
 
				-    "array = np.load(\"myarr.npy\", mmap_mode=\"r\")"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 4,
			
 
				-   "id": "sublime-beaver",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "## clean up\n",
			
 
				-    "!rm myarr.npy"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "aggressive-victim",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "----------------------------------------------------------\n",
			
 
				-    "### Assumptions about the data\n",
			
 
				-    "    one JSON element per document\n",
			
 
				-    "    the text is read from the 'text' field by default; this can be changed to extract other fields\n",
			
 
				-    "    {\"src\": \"The Internet\", \"text\": \"jumps over the lazy dog\", \"type\": \"Eng\", \"id\": \"42\", \"title\": \"Second Part\"}\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "specific-scheduling",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "----------------------------------------------------------\n",
			
 
				-    "### Jsonfy the raw text data into loose json format\n",
			
 
				-    "    python create_loose_json.py --help\n",
			
 
				-    "        usage: create_loose_json.py [-h] [--infile INFILE] [--outfile OUTFILE]\n",
			
 
				-    "\n",
			
 
				-    "        optional arguments:\n",
			
 
				-    "          -h, --help         show this help message and exit\n",
			
 
				-    "          --infile INFILE    input file path\n",
			
 
				-    "          --outfile OUTFILE  output file path"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 5,
			
 
				-   "id": "powered-cooking",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "finished processing 71 lines to loose json format\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "!python create_loose_json.py --infile ../dataset/EN/extractedNVblogs.txt --outfile ../dataset/EN/extractedNVblogs.json"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "marine-packing",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "----------------------------------------------------------\n",
			
 
				-    "### Use preprocess_data.py to convert the cleaned data into mmap format in preparation for training\n",
			
 
				-    "\n",
			
 
				-    "Wrap the following into a bash script:\n",
			
 
				-    "\n",
			
 
				-    "            %%writefile process2mmap.sh\n",
			
 
				-    "\n",
			
 
				-    "            INPUT_JSON_FILE=path_to_the_json_file\n",
			
 
				-    "\n",
			
 
				-    "            OUTPUT_PATH=path_to_save_the_converted_data_to\n",
			
 
				-    "\n",
			
 
				-    "            VOCAB_FILE=path_to_your_own_pretrained_vocab_file\n",
			
 
				-    "\n",
			
 
				-    "            MERGE_FILE=path_to_your_own_pretrained_merge_file\n",
			
 
				-    "\n",
			
 
				-    "            NUM_CPUS=16\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "            python tools/preprocess_data.py \\\n",
			
 
				-    "                       --input $INPUT_JSON_FILE \\\n",
			
 
				-    "                       --output-prefix $OUTPUT_PATH \\\n",
			
 
				-    "                       --json-keys text \\\n",
			
 
				-    "                       --vocab-file $VOCAB_FILE \\\n",
			
 
				-    "                       --merge-file $MERGE_FILE \\\n",
			
 
				-    "                       --dataset-impl mmap \\\n",
			
 
				-    "                       --tokenizer-type GPT2BPETokenizer \\\n",
			
 
				-    "                       --workers $NUM_CPUS \\\n",
			
 
				-    "                       --append-eod\n",
			
 
				-    "\n",
			
 
				-    "Very important: do not miss the --append-eod flag!\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 6,
			
 
				-   "id": "resident-convert",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "gpt2-merges.txt  gpt2-vocab.json\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "!mv gpt2-vocab.json ../dataset/EN/50k/\n",
			
 
				-    "!mv gpt2-merges.txt ../dataset/EN/50k/\n",
			
 
				-    "!ls ../dataset/EN/50k/"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 9,
			
 
				-   "id": "integrated-bridges",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "INPUT_JSON_FILE='../dataset/EN/extractedNVblogs.json'\n",
			
 
				-    "OUTPUT_PATH='../dataset/EN/NVblog'\n",
			
 
				-    "VOCAB_FILE='../dataset/EN/50k/gpt2-vocab.json'\n",
			
 
				-    "MERGE_FILE='../dataset/EN/50k/gpt2-merges.txt'\n",
			
 
				-    "NUM_CPUS=16"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "supported-celtic",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "---\n",
			
 
				-    "## The output should look similar to the following\n",
			
 
				-    "\n",
			
 
				-    "                    Opening ./Megatron-LM/dataset/EN/extractedNVblogs.json\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    > building GPT2BPETokenizer tokenizer ...\n",
			
 
				-    "                    Vocab size: 50257\n",
			
 
				-    "                    Output prefix: ./Megatron-LM/dataset/EN/NVblogs\n",
			
 
				-    "                    Time to startup: 0.5460700988769531\n",
			
 
				-    "                     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-    "                     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 10,
			
 
				-   "id": "familiar-victorian",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "Opening ../dataset/EN/extractedNVblogs.json\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "> building GPT2BPETokenizer tokenizer ...\n",
			
 
				-      "Vocab size: 50257\n",
			
 
				-      "Output prefix: ../dataset/EN/NVblog\n",
			
 
				-      "Time to startup: 0.1618051528930664\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
			
 
				-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "!python ./Megatron-LM/tools/preprocess_data.py \\\n",
			
 
				-    "                       --input $INPUT_JSON_FILE \\\n",
			
 
				-    "                       --output-prefix $OUTPUT_PATH \\\n",
			
 
				-    "                       --json-keys text \\\n",
			
 
				-    "                       --vocab-file $VOCAB_FILE \\\n",
			
 
				-    "                       --merge-file $MERGE_FILE \\\n",
			
 
				-    "                       --dataset-impl mmap \\\n",
			
 
				-    "                       --tokenizer-type GPT2BPETokenizer \\\n",
			
 
				-    "                       --workers $NUM_CPUS \\\n",
			
 
				-    "                       --append-eod"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "tropical-gathering",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "--- \n",
			
 
				-    "\n",
			
 
				-    "## Additional Resources\n",
			
 
				-    "\n",
			
 
				-    "Read More on MMAP  : https://docs.python.org/3/library/mmap.html\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "solid-reminder",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "---\n",
			
 
				-    "## Up Next : \n",
			
 
				-    "\n",
			
 
				-    "[Observe_GPT_runs_vs_performance ](./Day2-5_Observe_GPT_runs_vs_performance.ipynb)\n",
			
 
				-    "\n",
			
 
				-    "## Back To Start Menu\n",
			
 
				-    "[start menu](../Start_Here.ipynb)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "occupational-ranking",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "-----\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "## Licensing \n",
			
 
				-    "\n",
			
 
				-    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
			
 
				-   ]
			
 
				-  }
			
 
				- ],
			
 
				- "metadata": {
			
 
				-  "kernelspec": {
			
 
				-   "display_name": "Python 3",
			
 
				-   "language": "python",
			
 
				-   "name": "python3"
			
 
				-  },
			
 
				-  "language_info": {
			
 
				-   "codemirror_mode": {
			
 
				-    "name": "ipython",
			
 
				-    "version": 3
			
 
				-   },
			
 
				-   "file_extension": ".py",
			
 
				-   "mimetype": "text/x-python",
			
 
				-   "name": "python",
			
 
				-   "nbconvert_exporter": "python",
			
 
				-   "pygments_lexer": "ipython3",
			
 
				-   "version": "3.8.8"
			
 
				-  }
			
 
				- },
			
 
				- "nbformat": 4,
			
 
				- "nbformat_minor": 5
			
 
				-}
			
--- a/ai/Megatron/English/Python/jupyter_notebook/Day2-5_Observe_GPT_runs_vs_performance.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Day2-5_Observe_GPT_runs_vs_performance.ipynb
--- a/ai/Megatron/English/Python/jupyter_notebook/Lab1-3_MegatronFundementals.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Lab1-3_MegatronFundementals.ipynb
--- a/ai/Megatron/English/Python/jupyter_notebook/Lab1-4_GPT_vocab_merge_files.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Lab1-4_GPT_vocab_merge_files.ipynb
@@ -2,7 +2,7 @@
 
				  "cells": [
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "noticed-neighborhood",
			
 
				+   "id": "mysterious-bride",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "## GPT Tokenizer files\n",
			
@@ -12,14 +12,14 @@
 
				     "\n",
			
 
				     "The goal of this lab is to examine the difference between BPE and GPTBPE Tokenizer.\n",
			
 
				     "\n",
			
 
				-    "Later on, we will use the observations from this notebook to train a GPT Tokenizer with our own raw text data.\n",
			
 
				+    "Later on, we will use the observations from this notebook to train a GPTBPE Tokenizer with our own raw text data.\n",
			
 
				     "\n",
			
 
				     "We will load and verify GPTBPE Tokenizer and make sure the output tokens and token ids are as expected. \n"
			
 
				    ]
			
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "comic-architecture",
			
 
				+   "id": "thrown-pittsburgh",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "Let's review the source code of [gpt2 tokenizer](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html)\n",
			
@@ -33,12 +33,14 @@
 
				     "         tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
			
 
				     "        \n",
			
 
				     "         tokenizer(\" Hello world\")['input_ids']\n",
			
 
				-    "        [18435, 995]\n"
			
 
				+    "        [18435, 995]\n",
			
 
				+    "\n",
			
 
				+    "We expect our custom tokenizer, for which we will train and obtain custom vocab.json and merges.txt files, to produce the same output as above when given the exact same input.\n"
			
 
				    ]
			
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "portuguese-protocol",
			
 
				+   "id": "marine-alberta",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "Install necessary python libraries."
			
@@ -47,7 +49,7 @@
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				-   "id": "latter-owner",
			
 
				+   "id": "entitled-brass",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -57,16 +59,18 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "trained-glenn",
			
 
				+   "id": "generous-blake",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				-    "Next, we proceed to fetch pretrained GPT Tokenizer files, namely the vocab and merge files."
			
 
				+    "Next, we fetch the pretrained GPT tokenizer files, namely the vocab and merge files, to see what these files should ideally look like.\n",
			
 
				+    "\n",
			
 
				+    "Later on, we can use these observations to validate our custom-trained GPTBPE tokenizer and the corresponding vocab.json and merges.txt files, in order to ensure that the custom-trained GPTBPE tokenizer tokenizes as expected."
			
 
				    ]
			
 
				   },
			
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				-   "id": "textile-trance",
			
 
				+   "id": "different-relief",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -76,17 +80,17 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "veterinary-plenty",
			
 
				+   "id": "latest-thinking",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				-    "Examine the vocab and merge files, noted the presence of Ġ character.\n",
			
 
				-    "Ġ = space + 256 , this character is used as a control letter."
			
 
				+    "Examine the vocab and merge files and observe the presence of the Ġ character.\n",
			
 
				+    "Ġ is the character whose code point is that of the space character plus 256 (32 + 256 = 288); it is used as a control character that marks a space."
			
 
				    ]
			
 
				   },
			
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				-   "id": "blessed-carbon",
			
 
				+   "id": "continental-keeping",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -103,7 +107,7 @@
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				-   "id": "enhanced-stack",
			
 
				+   "id": "mathematical-depression",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -112,10 +116,10 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "orange-baker",
			
 
				+   "id": "deluxe-empire",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				-    "The following code block will load GPT2Tokenizer from HuggingFace transformer library, we verify the following :\n",
			
 
				+    "The following code block will load a default GPT2Tokenizer from HuggingFace transformer library, we verify the following :\n",
			
 
				     "\n",
			
 
				     "            from transformers import GPT2Tokenizer\n",
			
 
				     "            tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
			
@@ -127,7 +131,7 @@
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				-   "id": "cardiac-burner",
			
 
				+   "id": "detailed-thirty",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -147,7 +151,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "alien-harris",
			
 
				+   "id": "ongoing-characterization",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "Below is the expected outputs :\n",
			
@@ -159,7 +163,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "amazing-brick",
			
 
				+   "id": "universal-penalty",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "Next code block will load tokenizer library from huggingFace, we will observe the difference when setting `use_gpt` to True or False. \n",
			
@@ -175,7 +179,7 @@
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				-   "id": "substantial-strike",
			
 
				+   "id": "enhanced-factor",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -217,7 +221,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "blessed-prize",
			
 
				+   "id": "polar-context",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "Below is the expected outputs :\n",
			
@@ -233,7 +237,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "vocal-conflict",
			
 
				+   "id": "governmental-software",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "What did we observed ? \n",
			
@@ -256,7 +260,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "contemporary-lancaster",
			
 
				+   "id": "bigger-miami",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "We will now move the gpt-vocab.json and gpt2-merges.txt to the correct data folder as a preparation for the next step."
			
@@ -265,7 +269,7 @@
 
				   {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				-   "id": "flush-amazon",
			
 
				+   "id": "killing-advance",
			
 
				    "metadata": {},
			
 
				    "outputs": [],
			
 
				    "source": [
			
@@ -276,7 +280,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "sapphire-horse",
			
 
				+   "id": "ultimate-girlfriend",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "---\n",
			
@@ -287,7 +291,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "creative-dressing",
			
 
				+   "id": "academic-taiwan",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "-----\n",
			
@@ -296,7 +300,7 @@
 
				   },
			
 
				   {
			
 
				    "cell_type": "markdown",
			
 
				-   "id": "offshore-greece",
			
 
				+   "id": "hungry-wilderness",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "-----\n",
			
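The Lab1-4 notebook diff above notes that Ġ equals space + 256. A minimal sketch of that arithmetic follows; `SPACE_MARKER` and `mark_spaces` are hypothetical names introduced only for illustration, and the sketch covers the space character alone rather than the tokenizer's full byte mapping.

```python
# Byte-level BPE, as used by GPT-2, renders the space byte as a visible character.
SPACE_MARKER = chr(ord(" ") + 256)  # code point 32 + 256 = 288, which is "Ġ"


def mark_spaces(text):
    """Hypothetical helper: show where spaces would appear as Ġ in vocab entries."""
    return text.replace(" ", SPACE_MARKER)


print(SPACE_MARKER)
print(mark_spaces(" Hello world"))
```

This is why entries in the vocab and merge files that begin a new word carry a leading Ġ.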
--- a/ai/Megatron/English/Python/jupyter_notebook/Lab1-6_Observe_GPT_runs_vs_performance.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Lab1-6_Observe_GPT_runs_vs_performance.ipynb
@@ -0,0 +1,402 @@
 
				+{
			
 
				+ "cells": [
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "excessive-marathon",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "# Profiling Megatron_LM GPT\n",
			
 
				+    "---\n",
			
 
				+    "\n",
			
 
				+    "## Learning Objectives\n",
			
 
				+    "\n",
			
 
				+    "The goal of this lab is to profile Megatron-LM GPT model training runs under varying training configurations, in order to verify GPU performance across multi-GPU or multi-node workloads.\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "**Motivation** : why should we care about profiling ?\n",
			
 
				+    "  \n",
			
 
				+    "The estimated time-to-compute which we went through in `Lab1-2_EstimateComputeDaysNeeded.ipynb` is based on the assumption that the training run achieves good GPU performance across multi-GPU or multi-node jobs. Bad training configurations could result in low or inconsistent GPU utilization, which in turn might prolong the training run.\n",
			
 
				+    "\n",
			
 
				+    "In this notebook, we will cover the following : \n",
			
 
				+    "\n",
			
 
				+    "    1. Intro to the NVIDIA profiling toolchain\n",
			
 
				+    "    2. Run profiling to record training runs - naive vs. improved runs\n",
			
 
				+    "  \n",
			
 
				+    "A challenge will be presented to you at the end of this notebook: you are tasked with beating the profile of the improved run.\n",
			
 
				+    "\n",
			
 
				+    "Use the knowledge gained from going through `Lab1-2_EstimateComputeDaysNeeded.ipynb` and the profiling lecture presentations. It will help you formulate training configuration strategies in order to obtain a winning profile.\n",
			
 
				+    "\n",
			
 
				+    "Note: TAs and the NVIDIA profiling expert will be available during this session while you go through this notebook. Do reach out to them if you have questions."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "unlikely-edwards",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "---\n",
			
 
				+    "\n",
			
 
				+    "1. Intro to the NVIDIA profiling toolchain:\n",
			
 
				+    "\n",
			
 
				+    "<center><img src=\"./Megatron-LM/pics/NVprofilingToolchain.JPG\" width=\"800\"/></center>\n",
			
 
				+    "\n",
			
 
				+    "Note: We will go through the intro to NVIDIA profiling with an NVIDIA profiling expert in the lecture presentation."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "million-philip",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "The Profiling Workflow :\n",
			
 
				+    "\n",
			
 
				+    "Profiling is an iterative process: we record the profiling run, then visualize and analyze the profile in order to find areas for improvement and act upon them.\n",
			
 
				+    "\n",
			
 
				+    "<center><img src=\"./Megatron-LM/pics/profiling_workflow.JPG\" width=\"700\"/></center>\n",
			
 
				+    "\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "assumed-steps",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "In order to properly analyze the profiles obtained from real training runs, \n",
			
 
				+    "we first need to understand how Megatron-LM launches the training job.\n",
			
 
				+    "\n",
			
 
				+    "            ------------ Call out terminals as illustrated below ------------------------\n",
			
 
				+    "<center><img src=\"./Megatron-LM/pics/Alt_callout2terminals.JPG\" width=\"600\"/></center>\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "This allows live monitoring during the profiling run.\n",
			
 
				+    "\n",
			
 
				+    "Watch the [profiling video](https://youtu.be/bnN8ZohiZSI) below. It demonstrates how to call out and arrange 2 windows within JupyterLab, then launch and monitor the 2 profiling training runs while nvidia-smi monitors the performance of the GPUs live. The saved profile is then visualized using the Nsight UI."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "executive-anxiety",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "from IPython.lib.display import YouTubeVideo\n",
			
 
				+    "YouTubeVideo('bnN8ZohiZSI', width=600, height=1000)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "signed-allah",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Reference documents : \n",
			
 
				+    "\n",
			
 
				+    "[How to install Nsight](https://developer.nvidia.com/gameworksdownload#?dn=nsight-systems-2021-4-1)\n",
			
 
				+    "\n",
			
 
				+    "[Nsight User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html)\n",
			
 
				+    "\n",
			
 
				+    "<center><img src=\"./Megatron-LM/pics/multigpu_naive_run.jpg\" width=\"1000\"/></center>\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "purple-event",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Install the nvtx library. The NVTX tags are already implemented in this repo for your convenience."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "accessory-gentleman",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "!pip install nvtx"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "designed-struggle",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "For the purpose of profiling, we will clean the following folders after each profiling run, in order to ensure that training always starts from scratch."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "premium-resident",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "!rm -fr ../sv_ckpt/*\n",
			
 
				+    "!rm -fr ../dataset/EN/*.npy"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "irish-chinese",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "After the lecture with the NVIDIA profiling champion, we are now ready to profile our very first Megatron-LM training job.\n",
			
 
				+    "\n",
			
 
				+    "We start by profiling a naive run with a default configuration.\n",
			
 
				+    "\n",
			
 
				+    "Note: the following were obtained from previous labs :\n",
			
 
				+    "\n",
			
 
				+    "CHECKPOINT_PATH='../sv_ckpt/' ## path to save the checkpoint of the training run\n",
			
 
				+    "\n",
			
 
				+    "DATA_PATH='../dataset/EN/NVblog_text_document' ## obtained from `Lab1-1` and `Lab1-5`\n",
			
 
				+    "\n",
			
 
				+    "VOCAB_FILE='../dataset/EN/50k/gpt2-vocab.json' ## obtained from `Lab1-4`\n",
			
 
				+    "\n",
			
 
				+    "MERGE_FILE='../dataset/EN/50k/gpt2-merges.txt' ## obtained from `Lab1-4`\n",
			
 
				+    "\n",
			
 
				+    "PROFILE_OUTPUT_PATH='../profiles/naive/nsys_naive' ## path to save the profiles of this training run\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "To invoke a profiling session, call nsys with its options, followed by the normal Megatron-LM training launch script: \n",
			
 
				+    "\n",
			
 
				+    "<center><img src=\"./Megatron-LM/pics/evoke_nsys_profiling.JPG\" width=\"1000\"/></center>\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "virtual-bankruptcy",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "To examine the naive profiling run bash script, click on [open profile_naive_run.sh ](./Megatron-LM/profile_naive_run.sh)\n",
			
 
				+    "\n",
			
 
				+    "The following code block launches the naive profiling training run."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "lonely-newport",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "!bash ./Megatron-LM/profile_naive_run.sh"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "consistent-pendant",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "\n",
			
 
				+    "Below is an example of the output from a successful profiling run:\n",
			
 
				+    "\n",
			
 
				+    "        [after training is done] datetime: 2021-09-15 10:17:46 \n",
			
 
				+    "        ------------------------------------------------------------------------------------------------------------------\n",
			
 
				+    "         validation loss at the end of training for val data | lm loss value: 8.895156E+00 | lm loss PPL: 7.296543E+03 | \n",
			
 
				+    "        ------------------------------------------------------------------------------------------------------------------\n",
			
 
				+    "        saving checkpoint at iteration      12 to ../sv_ckpt/\n",
			
 
				+    "          successfully saved checkpoint at iteration      12 to ../sv_ckpt/\n",
			
 
				+    "        *****************************************\n",
			
 
				+    "        Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. \n",
			
 
				+    "        *****************************************\n",
			
 
				+    "        Processing events...\n",
			
 
				+    "        Capturing symbol files...\n",
			
 
				+    "        Saving temporary \"/tmp/nsys-report-4642-8c23-394b-8c2e.qdstrm\" file to disk...\n",
			
 
				+    "        Creating final output files...\n",
			
 
				+    "\n",
			
 
				+    "        Processing [==============================================================100%]\n",
			
 
				+    "        Saved report file to \"/tmp/nsys-report-4642-8c23-394b-8c2e.qdrep\"\n",
			
 
				+    "        Report file moved to \"/proj/guest_at_nsc/users/zcharpy/gpubootcamp/ai/Megatron/English/Python/jupyter_notebook/../profiles/naive/nsys_naive.qdrep\" \n",
			
 
				+    "        \n",
			
 
				+    " \n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "applied-airfare",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "---\n",
			
 
				+    "\n",
			
 
				+    "Visualizing the profiles via Nsight. The output of the naive profiling run, visualized in the Nsight UI:\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "Observe that GPU utilization is very low during training (the light-blue bar). \n",
			
 
				+    "<center><img src=\"./Megatron-LM/pics/GPUs_naive_run.JPG\" width=\"1000\"/></center>"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "olympic-internet",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Below is a ReRun cell for experimenting with varying training configurations in order to obtain different training profiles.\n",
			
 
				+    "\n",
			
 
				+    "Before each re-run, make sure you clear the checkpoint directory by running the code block below.\n",
			
 
				+    "<a id=\"Rerun_Cell\"></a>"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "operating-material",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "!rm -fr ../sv_ckpt/*"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "perfect-lease",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "View/Modify the profile_2nd_run.sh, click to [open profile_2nd_run.sh](./Megatron-LM/profile_2nd_run.sh).\n",
			
 
				+    "\n",
			
 
				+    "After viewing or modifying it, run the cell below to obtain a new profile."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "extra-asbestos",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "!bash ./Megatron-LM/profile_2nd_run.sh"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "laughing-blocking",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Below is an example of the output from a successful profiling run:\n",
			
 
				+    "\n",
			
 
				+    "        > finished creating GPT datasets ...\n",
			
 
				+    "        [after dataloaders are built] datetime: 2021-09-16 19:19:01 \n",
			
 
				+    "        done with setup ...\n",
			
 
				+    "        time (ms) | model-and-optimizer-setup: 772.93 | train/valid/test-data-iterators-setup: 1032.39\n",
			
 
				+    "        training ...\n",
			
 
				+    "        [after training is done] datetime: 2021-09-16 19:19:01 \n",
			
 
				+    "        ------------------------------------------------------------------------------------------------------------------\n",
			
 
				+    "         validation loss at the end of training for val data | lm loss value: 1.126569E+01 | lm loss PPL: 7.809596E+04 | \n",
			
 
				+    "        ------------------------------------------------------------------------------------------------------------------\n",
			
 
				+    "        Processing events...\n",
			
 
				+    "        Capturing symbol files...\n",
			
 
				+    "        Saving temporary \"/tmp/nsys-report-3aa1-f1a6-09c2-c853.qdstrm\" file to disk...\n",
			
 
				+    "        Creating final output files...\n",
			
 
				+    "\n",
			
 
				+    "        Processing [==============================================================100%]\n",
			
 
				+    "        Saved report file to \"/tmp/nsys-report-3aa1-f1a6-09c2-c853.qdrep\"\n",
			
 
				+    "        Report file moved to \"/proj/guest_at_nsc/users/zcharpy/gpubootcamp/ai/Megatron/English/Python/jupyter_notebook/../profiles/2ndrun/nsys_improved.qdrep\"\n",
			
 
				+    "\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "initial-vertical",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "---\n",
			
 
				+    "\n",
			
 
				+    "The output file of the improved profiling run, visualized in the Nsight UI:\n",
			
 
				+    "\n",
			
 
				+    "<center><img src=\"./Megatron-LM/pics/2ndrun.JPG\" width=\"1000\"/></center>\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "minus-encoding",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "<a id=\"TheChallenge\"></a>"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "exceptional-template",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "----------------\n",
			
 
				+    "\n",
			
 
				+    "## **The Challenge** - get the best-looking profile\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "Constraints : \n",
			
 
				+    "\n",
			
 
				+    "        - Use the given # of gpus ( 2 x A100 GPUs 40GB) \n",
			
 
				+    "        - Only modify the parameters in the **modifiable section**\n",
			
 
				+    "        - Avoid OOM error\n",
			
 
				+    "        - training run must be finished and checkpoint must be saved successfully\n",
			
 
				+    "Task : \n",
			
 
				+    "      Given the above constraints, achieve a good-looking profile. \n",
			
 
				+    "      \n",
			
 
				+    "The winning profile, visualized in the Nsight UI, should look like the following: \n",
			
 
				+    "\n",
			
 
				+    "Observe that GPU utilization is consistently above 90% throughout the **training** phase ( the **dark-blue** bar ) \n",
			
 
				+    "      \n",
			
 
				+    "<center><img src=\"./Megatron-LM/pics/GoodLookingProfile.JPG\" width=\"1000\"/></center>\n",
			
 
				+    "\n",
			
 
				+    "Jump back to modify the [profiling bash script](./Megatron-LM/profile_2nd_run.sh) and rerun \n",
			
 
				+    "<a href=\"./Lab1-6_Observe_GPT_runs_vs_performance.ipynb#Rerun_Cell\">GO to ReRun Cell</a> \n",
			
 
				+    "\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "chicken-discretion",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "--- \n",
			
 
				+    "## Links and Resources\n",
			
 
				+    "Don't forget to check out additional resources such as [NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/index.html), [NVTX Tutorial](https://developer.nvidia.com/blog/nvidia-tools-extension-api-nvtx-annotation-tool-for-profiling-code-in-python-and-c-c/) and [Nsight Systems](https://developer.nvidia.com/blog/transitioning-nsight-systems-nvidia-visual-profiler-nvprof/).\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "numeric-bathroom",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "-----\n",
			
 
				+    "## <p style=\"text-align:center;border:3px; padding: 1em\"> <a href=../Start_Here.ipynb>HOME</a></p>"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "specialized-ceremony",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "-----\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "## Licensing \n",
			
 
				+    "\n",
			
 
				+    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
			
 
				+   ]
			
 
				+  }
			
 
				+ ],
			
 
				+ "metadata": {
			
 
				+  "kernelspec": {
			
 
				+   "display_name": "Python 3",
			
 
				+   "language": "python",
			
 
				+   "name": "python3"
			
 
				+  },
			
 
				+  "language_info": {
			
 
				+   "codemirror_mode": {
			
 
				+    "name": "ipython",
			
 
				+    "version": 3
			
 
				+   },
			
 
				+   "file_extension": ".py",
			
 
				+   "mimetype": "text/x-python",
			
 
				+   "name": "python",
			
 
				+   "nbconvert_exporter": "python",
			
 
				+   "pygments_lexer": "ipython3",
			
 
				+   "version": "3.8.8"
			
 
				+  }
			
 
				+ },
			
 
				+ "nbformat": 4,
			
 
				+ "nbformat_minor": 5
			
 
				+}
			
--- a/ai/Megatron/English/Python/jupyter_notebook/Lab2-3_train_own_GPT2BPETokenizer.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Lab2-3_train_own_GPT2BPETokenizer.ipynb
--- a/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/pics/2ndrun.JPG
+++ b/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/pics/2ndrun.JPG
--- a/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/pics/GPUs_naive_run.JPG
+++ b/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/pics/GPUs_naive_run.JPG
--- a/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/pics/GPUs_utils_naive.JPG
+++ b/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/pics/GPUs_utils_naive.JPG
--- a/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/pics/GoodLookingProfile.JPG
+++ b/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/pics/GoodLookingProfile.JPG
--- a/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/pics/evoke_nsys_profiling.JPG
+++ b/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/pics/evoke_nsys_profiling.JPG
--- a/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/profile_2nd_run.sh
+++ b/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/profile_2nd_run.sh
@@ -6,8 +6,7 @@ MASTER_PORT=6000
 
				 NNODES=1 #<-- currently we are using 1 node multigpus
			
 
				 NODE_RANK=0
			
 
				 WORLD_SIZE=2 # <--- remember to change the number of GPUs you actually have in your system
			
 
				-TENSOR_MP_SIZE=2
			
 
				-PIPELINE_MP_SIZE=1
			
 
				+
			
 
				 ### modify this section to point the file to its own path 
			
 
				 CHECKPOINT_PATH='../sv_ckpt/' ## modify this path if you customize it 
			
 
				 DATA_PATH='../dataset/EN/NVblog_text_document' ## modify this path if you customize it 
			
@@ -15,6 +14,19 @@ VOCAB_FILE='../dataset/EN/50k/gpt2-vocab.json' ## modify this path if you custom
 
				 MERGE_FILE='../dataset/EN/50k/gpt2-merges.txt' ## modify this path if you customize it 
			
 
				 PROFILE_OUTPUT_PATH='../profiles/2ndrun/nsys_improved' # modify this to your own profile path
			
 
				 
			
 
				+################   Beginning of modifiable section    ####################
			
 
				+TENSOR_MP_SIZE=2
			
 
				+PIPELINE_MP_SIZE=1
			
 
				+NUM_LYS=32
			
 
				+HIDDEN_SIZE=2048
			
 
				+NUM_ATTN_HEADS=32
			
 
				+SEQ_LEN=1024
			
 
				+MAX_POS_EM=1024
			
 
				+MICRO_BZ=16
			
 
				+GLOBAL_BZ=128
			
 
				+
			
 
				+##############   end of modifiable section, do NOT modify anything below this line    ####################
			
 
				+
			
 
				 export OMP_NUM_THREADS=1
			
 
				 DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
			
 
				 
			
@@ -22,21 +34,21 @@ DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $
 
				 nsys profile --stats=false --force-overwrite=true --duration=300 --trace=cudnn,cuda,osrt,nvtx -o $PROFILE_OUTPUT_PATH \
			
 
				 python -m torch.distributed.launch $DISTRIBUTED_ARGS \
			
 
				     ./Megatron-LM/Dlprof_pretrain_gpt.py \
			
 
				-       --tensor-model-parallel-size $TENSOR_MP_SIZE \
			
 
				-       --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
			
 
				-       --num-layers 32 \
			
 
				-       --hidden-size 2048 \
			
 
				-       --num-attention-heads 32 \
			
 
				-       --micro-batch-size 16 \
			
 
				-       --global-batch-size 128 \
			
 
				-       --seq-length 1024 \
			
 
				-       --max-position-embeddings 1024 \
			
 
				+       --tensor-model-parallel-size ${TENSOR_MP_SIZE} \
			
 
				+       --pipeline-model-parallel-size ${PIPELINE_MP_SIZE} \
			
 
				+       --num-layers ${NUM_LYS} \
			
 
				+       --hidden-size ${HIDDEN_SIZE} \
			
 
				+       --num-attention-heads ${NUM_ATTN_HEADS} \
			
 
				+       --micro-batch-size ${MICRO_BZ} \
			
 
				+       --global-batch-size ${GLOBAL_BZ} \
			
 
				+       --seq-length ${SEQ_LEN} \
			
 
				+       --max-position-embeddings ${MAX_POS_EM} \
			
 
				        --train-samples 100 \
			
 
				-       --save $CHECKPOINT_PATH \
			
 
				-       --load $CHECKPOINT_PATH \
			
 
				-       --data-path $DATA_PATH \
			
 
				-       --vocab-file $VOCAB_FILE \
			
 
				-       --merge-file $MERGE_FILE \
			
 
				+       --save ${CHECKPOINT_PATH} \
			
 
				+       --load ${CHECKPOINT_PATH} \
			
 
				+       --data-path ${DATA_PATH} \
			
 
				+       --vocab-file ${VOCAB_FILE} \
			
 
				+       --merge-file ${MERGE_FILE} \
			
 
				        --data-impl mmap \
			
 
				        --split 949,50,1 \
			
 
				        --distributed-backend nccl \
			
--- a/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/profile_naive_run.sh
+++ b/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/profile_naive_run.sh
@@ -13,10 +13,12 @@ VOCAB_FILE='../dataset/EN/50k/gpt2-vocab.json' ## modify this path if you custom
 
				 MERGE_FILE='../dataset/EN/50k/gpt2-merges.txt' ## modify this path if you customize it 
			
 
				 PROFILE_OUTPUT_PATH='../profiles/naive/nsys_naive' # modify this to your own profile path
			
 
				 
			
 
				+
			
 
				+
			
 
				 DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
			
 
				 
			
 
				 ## for nsys run
			
 
				-nsys profile --stats=false --force-overwrite=true --duration=300 --trace=cudnn,cuda,osrt,nvtx -o $PROFILE_OUTPUT_PATH \
			
 
				+nsys profile --stats=false --force-overwrite=true --duration=300 --trace=cudnn,cuda,nvtx -o $PROFILE_OUTPUT_PATH \
			
 
				 python -m torch.distributed.launch $DISTRIBUTED_ARGS \
			
 
				     ./Megatron-LM/Dlprof_pretrain_gpt.py \
			
 
				        --num-layers 16 \
			
--- a/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-1_acquiring_data.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-1_acquiring_data.ipynb
@@ -1,249 +0,0 @@
 
				-{
			
 
				- "cells": [
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "above-newark",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "# Fetch Swedish data on your own \n",
			
 
				-    "---\n",
			
 
				-    "\n",
			
 
				-    "## Due to GDPR , we are not allowed to provide data to attendees in the bootcamp !\n",
			
 
				-    "Therefore, we kindly ask you to fetch and preprocess this publically accessible dataset on your own,\n",
			
 
				-    "following the steps given below -\n",
			
 
				-    "    - wget språkbank [webnyheter2013](http://spraakbanken.gu.se/lb/resurser/meningsmangder/webbnyheter2013.xml.bz2)\n",
			
 
				-    "    - wget språkbank [provided script](https://raw.githubusercontent.com/spraakbanken/sb-nltk-tools/master/sb_corpus_reader.py) to extract the data\n",
			
 
				-    "    - use the function below to extract the xml file into raw txt file\n",
			
 
				-    "    - (additional advice) discuss filtering on number of sentences per document and number of tokens per sentence \n",
			
 
				-    "        - example for English GPT training, it is recommand to check on the stats of your raw data and come-up with a good rule-of-thumb to proceed filtering, it is,however,recommanded to look into language-specific cleaning and follow up robust clearning procedure to obtain quality corpus.\n",
			
 
				-    "\n",
			
 
				-    "---\n",
			
 
				-    "## About the data source -\n",
			
 
				-    "This data belongs to Språkbanken, Språkbanken Text is a research unit and part of the National Language Bank, a national e-infrastructure to support research based on linguistic data.\n",
			
 
				-    "[read more about språkbank](https://spraakbanken.gu.se/om)\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "prepared-shield",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "--------------------------------------------------------------------------------------------------------------------\n",
			
 
				-    "#### fetch the webnyheter2013 data "
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 1,
			
 
				-   "id": "painted-broadway",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "--2021-09-15 10:33:55--  http://spraakbanken.gu.se/lb/resurser/meningsmangder/webbnyheter2013.xml.bz2\n",
			
 
				-      "Resolving spraakbanken.gu.se (spraakbanken.gu.se)... 130.241.42.13\n",
			
 
				-      "Connecting to spraakbanken.gu.se (spraakbanken.gu.se)|130.241.42.13|:80... connected.\n",
			
 
				-      "HTTP request sent, awaiting response... 301 Moved Permanently\n",
			
 
				-      "Location: https://spraakbanken.gu.se/lb/resurser/meningsmangder/webbnyheter2013.xml.bz2 [following]\n",
			
 
				-      "--2021-09-15 10:33:55--  https://spraakbanken.gu.se/lb/resurser/meningsmangder/webbnyheter2013.xml.bz2\n",
			
 
				-      "Connecting to spraakbanken.gu.se (spraakbanken.gu.se)|130.241.42.13|:443... connected.\n",
			
 
				-      "HTTP request sent, awaiting response... 200 OK\n",
			
 
				-      "Length: 464382665 (443M) [application/x-bzip2]\n",
			
 
				-      "Saving to: ‘webbnyheter2013.xml.bz2’\n",
			
 
				-      "\n",
			
 
				-      "webbnyheter2013.xml 100%[===================>] 442.87M   110MB/s    in 4.1s    \n",
			
 
				-      "\n",
			
 
				-      "2021-09-15 10:33:59 (109 MB/s) - ‘webbnyheter2013.xml.bz2’ saved [464382665/464382665]\n",
			
 
				-      "\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "!wget http://spraakbanken.gu.se/lb/resurser/meningsmangder/webbnyheter2013.xml.bz2"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 2,
			
 
				-   "id": "unavailable-munich",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "!bunzip2 -d webbnyheter2013.xml.bz2 "
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 3,
			
 
				-   "id": "fantastic-throat",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "!mv ./webbnyheter2013.xml ../../../../dataset/SV/"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 4,
			
 
				-   "id": "saved-potter",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "32k  56k  webbnyheter2013.xml\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "!ls ../../../../dataset/SV/"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 5,
			
 
				-   "id": "pleasant-estimate",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "--2021-09-15 10:38:48--  https://raw.githubusercontent.com/spraakbanken/sb-nltk-tools/master/sb_corpus_reader.py\n",
			
 
				-      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...\n",
			
 
				-      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n",
			
 
				-      "HTTP request sent, awaiting response... 200 OK\n",
			
 
				-      "Length: 3065 (3.0K) [text/plain]\n",
			
 
				-      "Saving to: ‘sb_corpus_reader.py’\n",
			
 
				-      "\n",
			
 
				-      "sb_corpus_reader.py 100%[===================>]   2.99K  --.-KB/s    in 0.001s  \n",
			
 
				-      "\n",
			
 
				-      "2021-09-15 10:38:49 (3.77 MB/s) - ‘sb_corpus_reader.py’ saved [3065/3065]\n",
			
 
				-      "\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "!wget https://raw.githubusercontent.com/spraakbanken/sb-nltk-tools/master/sb_corpus_reader.py"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 6,
			
 
				-   "id": "psychological-measure",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "[['Telekombranschen', 'lyfts', 'av', 'en', 'större', 'europeisk', 'telekomaffär', ',', 'nederländska', 'KPN', 'säljer', 'tysk', 'verksamhet', 'för', 'omkring', 'åtta', 'miljarder', 'euro', ',', 'och', 'en', 'stark', 'rapport', 'från', 'Telenor', '.'], ['Denna', 'upprepade', 'process', 'är', 'död', 'nu', '\"', ',', 'skriver', '\"', 'Shield', '\"', '-', 'skaparen', 'Shawn', 'Ryan', ',', 'som', 'låg', 'bakom', 'idén', ',', 'på', 'Twitter', '.']]\n",
			
 
				-      "write to :  webnyheter2013.txt\n",
			
 
				-      "finish processing  webnyheter2013.txt\n",
			
 
				-      "--------------------------------------------------\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "import json\n",
			
 
				-    "import os, sys\n",
			
 
				-    "import numpy as np\n",
			
 
				-    "import nltk\n",
			
 
				-    "from sb_corpus_reader import SBCorpusReader\n",
			
 
				-    "import random\n",
			
 
				-    "\n",
			
 
				-    "def write2csv(out_path, fname, sents):\n",
			
 
				-    "    f=open(out_path+fname,'a')\n",
			
 
				-    "    for s in sents:\n",
			
 
				-    "        if len(s)>=2:\n",
			
 
				-    "            s_text=' '.join(s)\n",
			
 
				-    "            f.write(s_text+'\\n')\n",
			
 
				-    "    print(\"finish processing \",fname)\n",
			
 
				-    "    f.close()\n",
			
 
				-    "    \n",
			
 
				-    "out_path='../../../../dataset/SV/'\n",
			
 
				-    "xml_f=out_path+'webbnyheter2013.xml'\n",
			
 
				-    "if xml_f.endswith('.xml') :    \n",
			
 
				-    "    corpus = SBCorpusReader(xml_f)\n",
			
 
				-    "    sents=corpus.sents()\n",
			
 
				-    "    print(sents[:2])\n",
			
 
				-    "    #n=len(sents)\n",
			
 
				-    "    #rn=random.randint(0,n-1)\n",
			
 
				-    "    #print(\"a random sample of sentence : \\n\".format(' '.join(sents[rn])))\n",
			
 
				-    "    fname='webnyheter2013.txt'  \n",
			
 
				-    "    print(\"write to : \",fname)\n",
			
 
				-    "    write2csv(out_path,fname,sents)\n",
			
 
				-    "    print('-----'*10)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": 7,
			
 
				-   "id": "conservative-yacht",
			
 
				-   "metadata": {},
			
 
				-   "outputs": [
			
 
				-    {
			
 
				-     "name": "stdout",
			
 
				-     "output_type": "stream",
			
 
				-     "text": [
			
 
				-      "32k  56k  webbnyheter2013.xml  webnyheter2013.txt\n"
			
 
				-     ]
			
 
				-    }
			
 
				-   ],
			
 
				-   "source": [
			
 
				-    "!ls ../../../../dataset/SV/"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "thousand-volleyball",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "---\n",
			
 
				-    "## Up Next : \n",
			
 
				-    "\n",
			
 
				-    "[Find sentence boundary and deduplicate your data](./Day3-2_SentenceBoundary_and_Deduplicate.ipynb)\n",
			
 
				-    "\n",
			
 
				-    "## Back To Start Menu\n",
			
 
				-    "[start menu](../../../../Start_Here.ipynb)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "id": "similar-battery",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "-----\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "## Licensing \n",
			
 
				-    "\n",
			
 
				-    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
			
 
				-   ]
			
 
				-  }
			
 
				- ],
			
 
				- "metadata": {
			
 
				-  "kernelspec": {
			
 
				-   "display_name": "Python 3",
			
 
				-   "language": "python",
			
 
				-   "name": "python3"
			
 
				-  },
			
 
				-  "language_info": {
			
 
				-   "codemirror_mode": {
			
 
				-    "name": "ipython",
			
 
				-    "version": 3
			
 
				-   },
			
 
				-   "file_extension": ".py",
			
 
				-   "mimetype": "text/x-python",
			
 
				-   "name": "python",
			
 
				-   "nbconvert_exporter": "python",
			
 
				-   "pygments_lexer": "ipython3",
			
 
				-   "version": "3.8.8"
			
 
				-  }
			
 
				- },
			
 
				- "nbformat": 4,
			
 
				- "nbformat_minor": 5
			
 
				-}
			
--- a/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-2_SentenceBoundary_and_Deduplicate.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-2_SentenceBoundary_and_Deduplicate.ipynb
--- a/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-Website_scrapping.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-Website_scrapping.ipynb
--- a/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-1_acquiring_data.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-1_acquiring_data.ipynb
@@ -0,0 +1,190 @@
 
				+{
			
 
				+ "cells": [
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "monetary-season",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## Acquire Swedish data \n",
			
 
				+    "---\n",
			
 
				+    "\n",
			
 
				+    "Because of data licensing and privacy concerns, we will not be providing any data in this bootcamp.\n",
			
 
				+    "\n",
			
 
				+    "However, we do need data to customize the Megatron-LM workflow for Swedish, so the first step is to acquire raw Swedish text data.\n",
			
 
				+    "\n",
			
 
				+    "This notebook therefore helps you acquire raw Swedish text data from Språkbanken.\n",
			
 
				+    "\n",
			
 
				+    "Follow the steps given below:\n",
			
 
				+    "\n",
			
 
				+    "    1. Download data via wget and download the python script which will be used to extract the Swedish text.\n",
			
 
				+    "    \n",
			
 
				+    "    2. Decompress the data using bunzip2 and move it to the correct folder under dataset\n",
			
 
				+    "    \n",
			
 
				+    "    3. Write a custom function to extract raw text from the XML file and save the text file to the correct folder under dataset\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "**About the data source: Språkbanken**\n",
			
 
				+    "\n",
			
 
				+    "This data belongs to Språkbanken. Språkbanken Text is a research unit and part of the National Language Bank, a national e-infrastructure to support research based on linguistic data.\n",
			
 
				+    "[Read more about Språkbanken](https://spraakbanken.gu.se/om)\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "leading-lingerie",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "1. Download data via wget and download the python script which will be used to extract the Swedish text."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "distributed-sheet",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "!wget http://spraakbanken.gu.se/lb/resurser/meningsmangder/webbnyheter2013.xml.bz2\n",
			
 
				+    "!wget https://raw.githubusercontent.com/spraakbanken/sb-nltk-tools/master/sb_corpus_reader.py"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "surprised-huntington",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "2. Decompress the data using bunzip2 and move it to the correct folder under dataset"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "another-university",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "!bunzip2 -d webbnyheter2013.xml.bz2 "
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "textile-variance",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "!mv ./webbnyheter2013.xml ../../../../dataset/SV/"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "cardiac-exclusive",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "!ls ../../../../dataset/SV/"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "liquid-marina",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "3. Write a custom function to extract raw text from the XML file and save the text file `webnyheter2013.txt` to the correct folder under dataset"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "enabling-dominant",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "import json\n",
			
 
				+    "import os, sys\n",
			
 
				+    "import numpy as np\n",
			
 
				+    "import nltk\n",
			
 
				+    "from sb_corpus_reader import SBCorpusReader\n",
			
 
				+    "import random\n",
			
 
				+    "\n",
			
 
				+    "def write2csv(out_path, fname, sents):\n",
			
 
				+    "    f=open(out_path+fname,'a')\n",
			
 
				+    "    for s in sents:\n",
			
 
				+    "        if len(s)>=2:\n",
			
 
				+    "            s_text=' '.join(s)\n",
			
 
				+    "            f.write(s_text+'\\n')\n",
			
 
				+    "    print(\"finish processing \",fname)\n",
			
 
				+    "    f.close()\n",
			
 
				+    "    \n",
			
 
				+    "out_path='../../../../dataset/SV/'\n",
			
 
				+    "xml_f=out_path+'webbnyheter2013.xml'\n",
			
 
				+    "if xml_f.endswith('.xml') :    \n",
			
 
				+    "    corpus = SBCorpusReader(xml_f)\n",
			
 
				+    "    sents=corpus.sents()\n",
			
 
				+    "    print(sents[:2])\n",
			
 
				+    "    #n=len(sents)\n",
			
 
				+    "    #rn=random.randint(0,n-1)\n",
			
 
				+    "    #print(\"a random sample sentence : \\n{}\".format(' '.join(sents[rn])))\n",
			
 
				+    "    fname='webnyheter2013.txt'  \n",
			
 
				+    "    print(\"write to : \",fname)\n",
			
 
				+    "    write2csv(out_path,fname,sents)\n",
			
 
				+    "    print('-----'*10)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "radical-workshop",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "!ls ../../../../dataset/SV/"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "found-anaheim",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "-----\n",
			
 
				+    "## <p style=\"text-align:center;border:3px; padding: 1em\"> <a href=../../../../Start_Here.ipynb>HOME</a>&nbsp; &nbsp; &nbsp; <a href=./Lab2-2_SentenceBoundary_and_Deduplicate.ipynb>NEXT</a></p>\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "important-arthritis",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "-----\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "## Licensing \n",
			
 
				+    "\n",
			
 
				+    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
			
 
				+   ]
			
 
				+  }
			
 
				+ ],
			
 
				+ "metadata": {
			
 
				+  "kernelspec": {
			
 
				+   "display_name": "Python 3",
			
 
				+   "language": "python",
			
 
				+   "name": "python3"
			
 
				+  },
			
 
				+  "language_info": {
			
 
				+   "codemirror_mode": {
			
 
				+    "name": "ipython",
			
 
				+    "version": 3
			
 
				+   },
			
 
				+   "file_extension": ".py",
			
 
				+   "mimetype": "text/x-python",
			
 
				+   "name": "python",
			
 
				+   "nbconvert_exporter": "python",
			
 
				+   "pygments_lexer": "ipython3",
			
 
				+   "version": "3.8.8"
			
 
				+  }
			
 
				+ },
			
 
				+ "nbformat": 4,
			
 
				+ "nbformat_minor": 5
			
 
				+}
			
--- a/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-2_SentenceBoundary_and_Deduplicate.ipynb
+++ b/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-2_SentenceBoundary_and_Deduplicate.ipynb
@@ -0,0 +1,801 @@
 
				+{
			
 
				+ "cells": [
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "acute-tunnel",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## Custom Data Cleaning\n",
			
 
				+    "---\n",
			
 
				+    "\n",
			
 
				+    "## Learning Objectives\n",
			
 
				+    "\n",
			
 
				+    "Large language models are sample-efficient, which means they readily pick up whatever is in the training data. If the data fed to the model is full of misspelled words, is vulgar by nature (often the case with uncensored data extracted from web forums and chat rooms), contains a large volume of text in languages other than the target language, or has other undesirable characteristics, we should definitely consider creating a pipeline to clean and filter the data.\n",
			
 
				+    "\n",
			
 
				+    "Since Megatron-LM samples sentences from documents during a training run, we also need a mechanism to find the sentence boundaries in each document.\n",
			
 
				+    "\n",
			
 
				+    "This notebook offers one way (there are many others) to deduplicate the data at the document level, based on a similarity threshold. Why? Take newspapers as an example: a catastrophic event, such as a tsunami in Thailand that claimed many lives, could be reported repeatedly in a great many news articles all over the world. We would want to deduplicate these almost identical articles using a good similarity measure.\n",
			
 
				+    "\n",
			
 
				+    "Similarly, when we blend datasets from many sources to obtain enough data to train a large language model, we want a way to deduplicate the repeated documents present in the collected datasets and keep only the ones we deem worthy.\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "Therefore, the goal of this lab is to provide some basic tools for cleaning and filtering data. They should be applied carefully to custom language datasets, without losing the inherent characteristics of the datasets that you wish to preserve.\n",
			
 
				+    "\n",
			
 
				+    "In particular, this notebook covers the following steps :\n",
			
 
				+    "\n",
			
 
				+    "    1. Language detection.\n",
			
 
				+    "    2. Find sentence boundaries.\n",
			
 
				+    "    3. Deduplicate documents based on similarity score.\n",
			
 
				+    "Note: The method recommended in the [Megatron-LM repo](https://github.com/NVIDIA/Megatron-LM/tree/main/tools/openwebtext), namely LSH (locality-sensitive hashing), will be used for deduplication.\n",
			
 
				+    "\n",
			
 
				+    "What this notebook will NOT cover :\n",
			
 
				+    "\n",
			
 
				+    "    - Constructing a list of blocked words to filter out inappropriate content.\n",
			
 
				+    "    - Cleaning empty lines or empty sentences.\n",
			
 
				+    "    - Spell-checking words and punctuation.\n",
			
 
				+    "    - Cutting out sentences with too few tokens.\n",
			
 
				+    "     and many more customized data cleaning methods that one should consider adding to the data cleaning pipeline.\n",
			
 
				+    "\n",
			
 
				+    "At the end, there is a **mini challenge** for hands-on practice: identify the number of duplicated documents, getting as close as you can to the ground-truth number!"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "innovative-fleece",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "---\n",
			
 
				+    "In case you encounter problems installing LSH, here is the fix:\n",
			
 
				+    "\n",
			
 
				+    "    Install LSH - \n",
			
 
				+    "\n",
			
 
				+    "    Follow the instructions in [Megatron-LM/tools/openwebtext/README](https://github.com/NVIDIA/Megatron-LM/tree/main/tools/openwebtext) in the openwebtext cleaning folder \n",
			
 
				+    "\n",
			
 
				+    "    Note: In a restricted environment where sudo is not allowed, follow the instructions below to modify the installation.\n",
			
 
				+    "            \n",
			
 
				+    "            Open a terminal as illustrated below.             \n",
			
 
				+    "   ![call out a terminal ](../../pics/Alt_callout2terminals.JPG)\n",
			
 
				+    "   \n",
			
 
				+    "            cd gpubootcamp/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/\n",
			
 
				+    "        \n",
			
 
				+    "            git clone https://github.com/mattilyra/LSH.git\n",
			
 
				+    "            cd LSH\n",
			
 
				+    "            pip install -U --user 'cython>=0.24.1'\n",
			
 
				+    "            open setup.py in an editor and modify as below\n",
			
 
				+    "   ![modify setup.py line 6](../../pics/modifyLSH_setuppy.JPG)\n",
			
 
				+    "\n",
			
 
				+    "            python setup.py install --user "
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "premium-judgment",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "!pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 htmlmin tldextract sentence-splitter"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "brazilian-trigger",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "1. Language Detection  "
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "worth-owner",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "from langdetect import detect\n",
			
 
				+    "swe_raw_text='Under fredagsförmiddagen höll polis och räddningstjänst presskonferens tillsammans med en representanter från flygplatsens egna räddningsenhet och Örebro kommun.'\n",
			
 
				+    "detect(swe_raw_text)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "appointed-catering",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "danish_text='1. januar 2021 var folketallet 5.840.045. Ved den første folketælling i 1735 var der 718.000 danskere.'\n",
			
 
				+    "detect(danish_text)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "decent-terrorism",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "finnish_text='Jokaisella on oikeus vapaasti osallistua yhteiskunnan sivistyselämään, nauttia taiteista sekä päästä osalliseksi tieteen edistyksen mukanaan tuomista eduista.'\n",
			
 
				+    "detect(finnish_text)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "demonstrated-radiation",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Once we can identify which language a document belongs to, we can filter out the documents in undesired languages and keep only the selected language(s)."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "periodic-balloon",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "2. Find sentence boundaries - alternative 1 : NLTK"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "quality-pittsburgh",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "import nltk\n",
			
 
				+    "nltk.download('punkt')"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "revolutionary-birmingham",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "import nltk\n",
			
 
				+    "from nltk.tokenize import sent_tokenize\n",
			
 
				+    "text='Detta är ett stycke. Den innehåller flera meningar. \"Men varför\", frågar du? Andersson pekas ut som nästa partiledare: “Medlemmarna ska säga sitt”'\n",
			
 
				+    "print(\"original doc is :\\n \", text)\n",
			
 
				+    "sents=sent_tokenize(text)\n",
			
 
				+    "i=0\n",
			
 
				+    "for sent in sents:\n",
			
 
				+    "    print(\"------- sentence {} -------\".format(str(i)))    \n",
			
 
				+    "    print(sent)\n",
			
 
				+    "    i+=1"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "italic-bunch",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Below is the expected output:\n",
			
 
				+    "\n",
			
 
				+    "Observe how the NLTK tokenizer splits the given document into sentences:\n",
			
 
				+    "\n",
			
 
				+    "    ------- sentence 0 -------\n",
			
 
				+    "    Detta är ett stycke.\n",
			
 
				+    "    ------- sentence 1 -------\n",
			
 
				+    "    Den innehåller flera meningar.\n",
			
 
				+    "    ------- sentence 2 -------\n",
			
 
				+    "    \"Men varför\", frågar du?\n",
			
 
				+    "    ------- sentence 3 -------\n",
			
 
				+    "    Andersson pekas ut som nästa partiledare: “Medlemmarna ska säga sitt”"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "beneficial-parts",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "2. Find sentence boundaries - alternative 2 : NLTK + custom function "
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "western-wings",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "import re\n",
			
 
				+    "import nltk\n",
			
 
				+    "from nltk.tokenize import sent_tokenize\n",
			
 
				+    "def normal_cut_sentence(temp):\n",
			
 
				+    "    return sent_tokenize(temp)\n",
			
 
				+    "\n",
			
 
				+    "def cut_sentence_with_quotation_marks(text):\n",
			
 
				+    "    p = re.compile(\"“.*?”\")\n",
			
 
				+    "    ls = []\n",
			
 
				+    "    index = 0\n",
			
 
				+    "    length = len(text)\n",
			
 
				+    "    for i in p.finditer(text):\n",
			
 
				+    "        temp = ''\n",
			
 
				+    "        start = i.start()\n",
			
 
				+    "        end = i.end()\n",
			
 
				+    "        for j in range(index, start):\n",
			
 
				+    "            temp += text[j]\n",
			
 
				+    "        if temp != '':\n",
			
 
				+    "            temp_list = normal_cut_sentence(temp)\n",
			
 
				+    "            ls += temp_list\n",
			
 
				+    "        temp = ''\n",
			
 
				+    "        for k in range(start, end):\n",
			
 
				+    "            temp += text[k]\n",
			
 
				+    "        if temp != ' ':\n",
			
 
				+    "            ls.append(temp)\n",
			
 
				+    "        index = end\n",
			
 
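				+    "    # also tokenize any text left after the last quoted passage\n",
				+    "    tail = text[index:]\n",
				+    "    if tail.strip():\n",
				+    "        ls += normal_cut_sentence(tail)\n",
				+    "    return ls"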
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "numerical-survivor",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "sents=cut_sentence_with_quotation_marks(text)\n",
			
 
				+    "i=0\n",
			
 
				+    "for sent in sents:\n",
			
 
				+    "    print(\"------- sentence {} -------\".format(str(i)))  \n",
			
 
				+    "    print(sent)\n",
			
 
				+    "    i+=1"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "latin-sterling",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Below is the expected output:\n",
			
 
				+    "\n",
			
 
				+    "Observe how the custom function `cut_sentence_with_quotation_marks` extends NLTK by treating quoted passages as additional sentence boundaries:\n",
			
 
				+    "\n",
			
 
				+    "        ------- sentence 0 -------\n",
			
 
				+    "        Detta är ett stycke.\n",
			
 
				+    "        ------- sentence 1 -------\n",
			
 
				+    "        Den innehåller flera meningar.\n",
			
 
				+    "        ------- sentence 2 -------\n",
			
 
				+    "        \"Men varför\", frågar du?\n",
			
 
				+    "        ------- sentence 3 -------\n",
			
 
				+    "        Andersson pekas ut som nästa partiledare:\n",
			
 
				+    "        ------- sentence 4 -------\n",
			
 
				+    "        “Medlemmarna ska säga sitt”"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "alternative-drill",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "3. Deduplicate documents based on similarity score.\n",
			
 
				+    "\n",
			
 
				+    "[Locality-Sensitive Hashing](http://snap.stanford.edu/class/cs246-2012/slides/03-lsh.pdf)\n",
			
 
				+    "\n",
			
 
				+    "First, we create shingles (character n-grams) from each document. Fingerprints are then computed from the shingles, and Jaccard similarity is used to find the most similar pairs of documents, based on a chosen threshold."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "material-manitoba",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "import itertools\n",
			
 
				+    "from lsh import cache, minhash # https://github.com/mattilyra/lsh\n",
			
 
			
 
				+    "\n",
			
 
				+    "# a pure python shingling function that will be used in comparing\n",
			
 
				+    "# LSH to true Jaccard similarities\n",
			
 
				+    "def shingles(text, char_ngram=5):\n",
			
 
				+    "    return set(text[head:head + char_ngram] for head in range(0, len(text) - char_ngram))\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "def jaccard(set_a, set_b):\n",
			
 
				+    "    intersection = set_a & set_b\n",
			
 
				+    "    union = set_a | set_b\n",
			
 
				+    "    return len(intersection) / len(union)\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "def candidate_duplicates(document_feed, char_ngram=5, seeds=100, bands=5, hashbytes=4):\n",
			
 
				+    "    char_ngram = char_ngram\n",
			
 
				+    "    sims = []\n",
			
 
				+    "    hasher = minhash.MinHasher(seeds=seeds, char_ngram=char_ngram, hashbytes=hashbytes)\n",
			
 
				+    "    if seeds % bands != 0:\n",
			
 
				+    "        raise ValueError('Seeds has to be a multiple of bands. {} % {} != 0'.format(seeds, bands))\n",
			
 
				+    "    \n",
			
 
				+    "    lshcache = cache.Cache(num_bands=bands, hasher=hasher)\n",
			
 
				+    "    for i_line, line in enumerate(document_feed):\n",
			
 
				+    "        line = line.decode('utf8')\n",
			
 
				+    "        docid, headline_text = line.split('\\t', 1)\n",
			
 
				+    "        fingerprint = hasher.fingerprint(headline_text.encode('utf8'))\n",
			
 
				+    "        \n",
			
 
				+    "        # in addition to storing the fingerprint, store the line\n",
			
 
				+    "        # number and document ID to help analysis later on\n",
			
 
				+    "        lshcache.add_fingerprint(fingerprint, doc_id=(i_line, docid))\n",
			
 
				+    "\n",
			
 
				+    "    candidate_pairs = set()\n",
			
 
				+    "    for b in lshcache.bins:\n",
			
 
				+    "        for bucket_id in b:\n",
			
 
				+    "            if len(b[bucket_id]) > 1:\n",
			
 
				+    "                pairs_ = set(itertools.combinations(b[bucket_id], r=2))\n",
			
 
				+    "                candidate_pairs.update(pairs_)\n",
			
 
				+    "    \n",
			
 
				+    "    return candidate_pairs"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "musical-wells",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "To verify how well this algorithm works, we will create duplicated documents from our toy data `extractedNVblogs.txt`, obtained in the web scraping lab. Duplicated documents are flagged with `duplicate=True`; this column serves as our ground truth, and we will use this hand-crafted data to check whether the algorithm works as expected."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "naval-forest",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "import pandas as pd\n",
			
 
				+    "cols=['doc1']\n",
			
 
				+    "df=pd.read_csv('../../../../dataset/EN/extractedNVblogs.txt',sep='\\n', names=cols ,skiprows=1)\n",
			
 
				+    "df.head()"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "threatened-johnson",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "import numpy as np\n",
			
 
				+    "\n",
			
 
				+    "def create_duplicates(df):\n",
			
 
				+    "    doc2=[]\n",
			
 
				+    "    duplicate=[]\n",
			
 
				+    "    n=len(df)\n",
			
 
				+    "    for i in range(n):\n",
			
 
				+    "        other_population=[k for k in range(n) if k!=i]\n",
			
 
				+    "        \n",
			
 
				+    "        other_idx=np.random.choice(other_population)\n",
			
 
				+    "        current_idx=np.random.choice([i,other_idx], p=[0.3,0.7])\n",
			
 
				+    "        if current_idx==i:            \n",
			
 
				+    "            duplicate.append(True)\n",
			
 
				+    "        else:\n",
			
 
				+    "            duplicate.append(False)\n",
			
 
				+    "        doc2.append(df.iloc[current_idx,0])\n",
			
 
				+    "    df['index']=df.index\n",
			
 
				+    "    df['doc2']=doc2\n",
			
 
				+    "    df['duplicate']=duplicate\n",
			
 
				+    "    cols=['index','doc1','doc2','duplicate']\n",
			
 
				+    "    df=df[cols]\n",
			
 
				+    "    return df"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "gothic-coordination",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "df2=create_duplicates(df)\n",
			
 
				+    "df2.tail()"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "theoretical-review",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Below is the expected output:\n",
			
 
				+    "    \n",
			
 
				+    "            \n",
			
 
				+    "            index \tdoc1 \tdoc2 \tduplicate\n",
			
 
				+    "            65 \t65 \tThis post was updated July 20, 2021 to reflect... \tThis post was updated July 20, 2021 to reflect... \tTrue\n",
			
 
				+    "            66 \t66 \tResearchers, developers, and engineers worldwi... \tThis post was originally published in August 2... \tFalse\n",
			
 
				+    "            67 \t67 \tLooking to reveal secrets of days past, histor... \tThe NVIDIA Deep Learning Institute (DLI) exten... \tFalse\n",
			
 
				+    "            68 \t68 \tScientists searching the universe for gravitat... \tRobotics researchers from NVIDIA and Universit... \tFalse\n",
			
 
				+    "            69 \t69 \tAt GTC ’21, experts presented a variety of tec... \tThe NVIDIA Hardware Grant Program helps advanc... \tFalse"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "daily-coast",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "## count the rows flagged duplicate == True; the count varies between runs because create_duplicates() samples at random\n",
			
 
				+    "df2.duplicate.value_counts()"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "decimal-guinea",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Below is an example output (the counts vary between runs, because the duplicates are sampled at random):\n",
			
 
				+    "    \n",
			
 
				+    "    False    45\n",
			
 
				+    "    True     25\n",
			
 
				+    "    Name: duplicate, dtype: int64"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "differential-recovery",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "df2.columns"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "broad-incidence",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "del df\n",
			
 
				+    "keep_cols_to_write=['index','doc1','doc2']\n",
			
 
				+    "df3=df2[keep_cols_to_write]\n",
			
 
				+    "df3.head()"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "embedded-reasoning",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Below is the expected output:\n",
			
 
				+    "\n",
			
 
				+    "            index \tdoc1 \tdoc2\n",
			
 
				+    "        0 \t0 \tDeep learning models have been successfully us... \tDeep learning models have been successfully us...\n",
			
 
				+    "        1 \t1 \tBreast cancer is the most frequently diagnosed... \tIn NVIDIA Clara Train 4.0, we added homomorphi...\n",
			
 
				+    "        2 \t2 \tThe NVIDIA Deep Learning Institute (DLI) exten... \tThe NVIDIA Deep Learning Institute (DLI) exten...\n",
			
 
				+    "        3 \t3 \tEngineers, product developers and designers ar... \tDeep learning research requires working at sca...\n",
			
 
				+    "        4 \t4 \tDespite substantial progress in natural langua... \tNVIDIA announces our newest release of the CUD..."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "desperate-dialogue",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "We could write the above dataframe to a CSV file; however, for the mini challenge, we want deterministic results. \n",
			
 
				+    "\n",
			
 
				+    "To preserve determinism, we load the previously saved `df2.csv` file so that all attendees work with exactly the same data.\n",
			
 
				+    "\n",
			
 
				+    "The `df2.csv` file is provided in this repo.\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "prospective-murray",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "## to do this deterministically, let's read in the saved df2.csv file\n",
			
 
				+    "df2=pd.read_csv('df2.csv', names=['index', 'doc1', 'doc2', 'duplicate'], skiprows=1)\n",
			
 
				+    "df2.head()"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "several-wallpaper",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Below is the expected output; yours should look exactly the same:\n",
			
 
				+    "\n",
			
 
				+    "        index \tdoc1 \tdoc2 \tduplicate\n",
			
 
				+    "        0 \t0 \tToday, NVIDIA announced new pretrained models ... \tAstrophysics researchers have long faced a tra... \tFalse\n",
			
 
				+    "        1 \t1 \tThis post was updated July 20, 2021 to reflect... \tThis post was updated July 20, 2021 to reflect... \tTrue\n",
			
 
				+    "        2 \t2 \tIn part 1 of this series, we introduced new AP... \tEdge computing has been around for a long time... \tFalse\n",
			
 
				+    "        3 \t3 \tThe NVIDIA NGC team is hosting a webinar with ... \tThe NVIDIA NGC team is hosting a webinar with ... \tTrue\n",
			
 
				+    "        4 \t4 \tNVIDIA announces our newest release of the CUD... \tAs an undergraduate student excited about AI f... \tFalse"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "opened-bathroom",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "## this is our groundtruth, count duplicate == True is 31 \n",
			
 
				+    "df2.duplicate.value_counts()"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "reduced-handy",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Below is the expected output; yours should look exactly the same:\n",
			
 
				+    "\n",
			
 
				+    "            False    42\n",
			
 
				+    "            True     31\n",
			
 
				+    "            Name: duplicate, dtype: int64"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "productive-ideal",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "!wc -l groundtruth.txt"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "intellectual-september",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Below is the expected output; yours should look exactly the same:\n",
			
 
				+    "\n",
			
 
				+    "    73 groundtruth.txt\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "designing-rates",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "## Mini-Challenge\n",
			
 
				+    "\n",
			
 
				+    "Task : \n",
			
 
				+    "    Overwrite the parameters below before calling the `candidate_duplicates()` function, then rerun the cell block below.\n",
			
 
				+    "    \n",
			
 
				+    "    char_ngram= < input_value >\n",
			
 
				+    "    seeds=< input_value >\n",
			
 
				+    "    bands=< input_value >\n",
			
 
				+    "    hashbytes=< input_value >\n",
			
 
				+    "\n",
			
 
				+    "Pass : You pass this mini challenge when your number of candidate duplicates is within **31 +/- 3**! \n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "Re-run the cell below to experiment, and try to get as close as possible to the ground truth of 31 duplicates.\n",
			
 
				+    "<a id=\"Rerun_Cell\"></a>"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "organizational-vulnerability",
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "## this is the Re Run Cell \n",
			
 
				+    "import itertools\n",
			
 
				+    "import random\n",
			
 
				+    "lines = []\n",
			
 
				+    "with open('groundtruth.txt', 'rb') as fh:\n",
			
 
				+    "    # read the first 1000 lines into memory so we can compare them\n",
			
 
				+    "    for line in itertools.islice(fh, 1000):\n",
			
 
				+    "        lines.append(line.decode('utf8'))\n",
			
 
				+    "    \n",
			
 
				+    "    # reset file pointer and do LSH\n",
			
 
				+    "    fh.seek(0)\n",
			
 
				+    "    feed = itertools.islice(fh, 1000)\n",
			
 
				+    "    \"\"\"\n",
			
 
				+    "    ## modify the below numbers as input to function candidate_duplicates()\n",
			
 
				+    "    char_ngram= < input_value >\n",
			
 
				+    "    seeds=< input_value >\n",
			
 
				+    "    bands=< input_value >\n",
			
 
				+    "    hashbytes=< input_value >\n",
			
 
				+    "    \"\"\"\n",
			
 
				+    "    # initial values are given below; modify them to bring the number of duplicates as close as possible to 31 (the ground truth)\n",
			
 
				+    "    char_ngram=13\n",
			
 
				+    "    seeds=100\n",
			
 
				+    "    bands=5\n",
			
 
				+    "    hashbytes=8\n",
			
 
				+    "    \n",
			
 
				+    "    candidates = candidate_duplicates(feed, char_ngram=char_ngram, seeds=seeds, bands=bands, hashbytes=hashbytes)\n",
			
 
				+    "\n",
			
 
				+    "# go over all the generated candidates comparing their similarities\n",
			
 
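				+    "# build a hasher with the same settings for the comparison below;\n",
				+    "# the one created inside candidate_duplicates() is local to that function\n",
				+    "hasher = minhash.MinHasher(seeds=seeds, char_ngram=char_ngram, hashbytes=hashbytes)\n",
				+    "similarities = []\n",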
			
 
				+    "for ((line_a, docid_a), (line_b, docid_b)) in candidates:\n",
			
 
				+    "    doc_a, doc_b = lines[line_a], lines[line_b]\n",
			
 
				+    "    shingles_a = shingles(lines[line_a])\n",
			
 
				+    "    shingles_b = shingles(lines[line_b])\n",
			
 
				+    "    \n",
			
 
				+    "    jaccard_sim = jaccard(shingles_a, shingles_b)\n",
			
 
				+    "    fingerprint_a = set(hasher.fingerprint(doc_a.encode('utf8')))\n",
			
 
				+    "    fingerprint_b = set(hasher.fingerprint(doc_b.encode('utf8')))\n",
			
 
				+    "    minhash_sim = len(fingerprint_a & fingerprint_b) / len(fingerprint_a | fingerprint_b)\n",
			
 
				+    "    similarities.append((docid_a, docid_b, jaccard_sim, minhash_sim))\n",
			
 
				+    "\n",
			
 
				+    "for a,b,jsim, msim in random.sample(similarities, k=2 ):\n",
			
 
				+    "    print(\"pair of similar sentences with jaccard_sim score:{} and minhash_sim score:{} --- \\n\".format(str(jsim),str(msim)))\n",
			
 
				+    "    a=int(a)\n",
			
 
				+    "    b=int(b)\n",
			
 
				+    "    text_a=df2.iloc[a,1]\n",
			
 
				+    "    text_b=df2.iloc[b,2]\n",
			
 
				+    "    if text_a==text_b:\n",
			
 
				+    "        print(\"100% duplicates \\n\")\n",
			
 
				+    "    print(\"text_a:\", text_a.split(' ')[:5])\n",
			
 
				+    "    print(\"text_b:\", text_b.split(' ')[:5])\n",
			
 
				+    "    print('-----'*10)\n",
			
 
			
 
				+    "\n",
			
 
				+    "print('\\nThere are **{}** candidate duplicates in total\\n'.format(len(candidates)))\n",
			
 
				+    "random.sample(similarities, k=1)"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "religious-recipe",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Below is the expected output:\n",
			
 
				+    "\n",
			
 
				+    "A naive run\n",
			
 
				+    "\n",
			
 
				+    "        pair of similar sentences with jaccard_sim score:0.8197797952482132 and minhash_sim score:0.639344262295082 --- \n",
			
 
				+    "\n",
			
 
				+    "        text_a: ['The', 'NVIDIA,', 'Facebook,', 'and', 'TensorFlow']\n",
			
 
				+    "        text_b: ['Deep', 'learning', '(DL)', 'is', 'the']\n",
			
 
				+    "        --------------------------------------------------\n",
			
 
				+    "        pair of similar sentences with jaccard_sim score:0.9133693568066934 and minhash_sim score:0.8867924528301887 --- \n",
			
 
				+    "\n",
			
 
				+    "        100% duplicates \n",
			
 
				+    "\n",
			
 
				+    "        text_a: ['The', 'first', 'post', 'in', 'this']\n",
			
 
				+    "        text_b: ['The', 'first', 'post', 'in', 'this']\n",
			
 
				+    "        --------------------------------------------------\n",
			
 
				+    "\n",
			
 
				+    "        There are **3** candidate duplicates in total\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "differential-behavior",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Wow, that is way too low! The ground truth is 31 duplicates."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "entire-brooks",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "<a id=\"TheChallenge\"></a>"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "recorded-jackson",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Go back and rerun cell <a href=\"./Lab2-2_SentenceBoundary_and_Deduplicate.ipynb#Rerun_Cell\">Jump to ReRun Cell</a>\n",
			
 
				+    "\n",
			
 
				+    "The solution will be delivered to you at the end of the bootcamp!\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "prostate-latest",
			
 
				+   "metadata": {
			
 
				+    "tags": []
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
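				+    "import numpy as np\n",
				+    "# true Jaccard similarity for every pair of lines (upper triangle only)\n",
				+    "sims_all = np.zeros((len(lines), len(lines)), dtype=np.float64)\n",
				+    "for i, line in enumerate(lines):\n",
				+    "    for j in range(i+1, len(lines)):\n",
				+    "        sims_all[i, j] = jaccard(shingles(line), shingles(lines[j]))\n",
				+    "# turn the candidates into a dictionary so we have easy access to\n",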
			
 
				+    "# candidates pairs that were found\n",
			
 
				+    "candidates_dict = {(line_a, line_b): (docid_a, docid_b) for ((line_a, docid_a), (line_b, docid_b)) in candidates}\n",
			
 
				+    "found = 0\n",
			
 
				+    "for i in range(len(lines)):\n",
			
 
				+    "    for j in range(i+1, len(lines)):\n",
			
 
				+    "        if sims_all[i, j] >= .9:\n",
			
 
				+    "            # documents i and j have an actual Jaccard similarity >= 90%\n",
			
 
				+    "            found += ((i, j) in candidates_dict or (j, i) in candidates_dict)\n",
			
 
				+    "\n",
			
 
				+    "print('Out of {} pairs with similarity >= 90% {} were found, that\\'s {:.1%}'.format((sims_all >= .9).sum(), found, found / (sims_all >= .9).sum()))\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "yellow-wallpaper",
			
 
				+   "metadata": {
			
 
				+    "tags": []
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "num_candidates = []\n",
			
 
				+    "bands = [2, 5, 10, 20]\n",
			
 
				+    "for num_bands in bands:\n",
			
 
				+    "    with open('groundtruth.txt', 'rb') as fh:\n",
			
 
				+    "        feed = itertools.islice(fh, 1000)\n",
			
 
				+    "        candidates = candidate_duplicates(feed, char_ngram=5, seeds=100, bands=num_bands, hashbytes=4)\n",
			
 
				+    "        num_candidates.append(len(candidates))"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "id": "asian-picking",
			
 
				+   "metadata": {
			
 
				+    "tags": []
			
 
				+   },
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "import matplotlib.pyplot as plt\n",
			
 
				+    "%matplotlib inline\n",
			
 
				+    "fig, ax = plt.subplots(figsize=(8, 6))\n",
			
 
				+    "plt.bar(bands, num_candidates, align='center');\n",
			
 
				+    "plt.title('Number of candidate duplicate pairs found by LSH using 100 minhash fingerprint.');\n",
			
 
				+    "plt.xlabel('Number of bands');\n",
			
 
				+    "plt.ylabel('Number of candidate duplicates');\n",
			
 
				+    "plt.xticks(bands, bands);\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "knowing-master",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "--- \n",
			
 
				+    "## Links and Resources\n",
			
 
				+    "Don't forget to check out additional resources such as [Language Detect](https://github.com/Mimino666/langdetect), [NLTK Sentence Tokenizer](https://www.nltk.org/api/nltk.tokenize.html) and [Locality-Sensitive Hashing](http://snap.stanford.edu/class/cs246-2012/slides/03-lsh.pdf)."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "acquired-browser",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "-----\n",
			
 
				+    "## <p style=\"text-align:center;border:3px; padding: 1em\"> <a href=../../../../Start_Here.ipynb>HOME</a>&nbsp; &nbsp; &nbsp; <a href=../../../Lab2-3_train_own_GPT2BPETokenizer.ipynb>NEXT</a></p>\n"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "running-engineering",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "-----\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "## Licensing \n",
			
 
				+    "\n",
			
 
				+    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
			
 
				+   ]
			
 
				+  }
			
 
				+ ],
			
 
				+ "metadata": {
			
 
				+  "kernelspec": {
			
 
				+   "display_name": "Python 3",
			
 
				+   "language": "python",
			
 
				+   "name": "python3"
			
 
				+  },
			
 
				+  "language_info": {
			
 
				+   "codemirror_mode": {
			
 
				+    "name": "ipython",
			
 
				+    "version": 3
			
 
				+   },
			
 
				+   "file_extension": ".py",
			
 
				+   "mimetype": "text/x-python",
			
 
				+   "name": "python",
			
 
				+   "nbconvert_exporter": "python",
			
 
				+   "pygments_lexer": "ipython3",
			
 
				+   "version": "3.8.8"
			
 
				+  }
			
 
				+ },
			
 
				+ "nbformat": 4,
			
 
				+ "nbformat_minor": 5
			
 
				+}
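
The deduplication step in the notebook above (shingles, fingerprints, bands) can be sketched without the `lsh` package. The block below is a minimal pure-Python illustration, not the `lsh` library's implementation: `minhash_signature` and `candidate_pairs` are hypothetical helper names, `hashlib.blake2b` with a per-seed salt is an assumed stand-in for the library's hash functions, and the sample `docs` are made up for the demonstration.

```python
import hashlib
import itertools
from collections import defaultdict


def shingles(text, char_ngram=5):
    """Set of overlapping character n-grams; the +1 keeps the final n-gram."""
    return {text[i:i + char_ngram] for i in range(len(text) - char_ngram + 1)}


def jaccard(set_a, set_b):
    """Exact Jaccard similarity of two sets (0.0 when both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def minhash_signature(shingle_set, num_hashes=100):
    """Keep the minimum hash value per seed (assumed hash: salted blake2b).

    Requires a non-empty shingle set, i.e. text at least char_ngram long.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return signature


def candidate_pairs(docs, char_ngram=5, num_hashes=100, bands=20):
    """Pairs of document indices whose signatures match on a whole band."""
    if num_hashes % bands != 0:
        raise ValueError("num_hashes must be a multiple of bands")
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for doc_id, text in enumerate(docs):
        sig = minhash_signature(shingles(text, char_ngram), num_hashes)
        for band in range(bands):
            # documents land in the same bucket only if every row of the band agrees
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for members in buckets.values():
        pairs.update(itertools.combinations(members, 2))
    return pairs


# made-up sample documents: 0 and 1 are exact duplicates
docs = [
    "NVIDIA announced new pretrained models for conversational AI today.",
    "NVIDIA announced new pretrained models for conversational AI today.",
    "Scientists searching the universe for gravitational waves use GPUs.",
]
pairs = candidate_pairs(docs)
print(pairs)
print(jaccard(shingles(docs[0]), shingles(docs[1])))
```

More bands (with fewer rows each) make it easier for two documents to collide in at least one band, so the candidate count grows; fewer bands make the filter stricter. This is the same trade-off the mini challenge explores through the `bands` and `seeds` parameters.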