Browse source

update SuperPOD description and README

zenodia 3 years ago
parent
commit
d41c326e5f

+ 23 - 0
ai/Megatron/Dockerfile

@@ -0,0 +1,23 @@
+# Copyright (c) 2020 NVIDIA Corporation.  All rights reserved.
+
+# To build the docker container, run: $ sudo docker build -t ai-multi-gpu:latest .
+# To run: $ sudo docker run --rm -it --gpus=all -p 8888:8888 -p 8000:8000 ai-multi-gpu:latest
+# Finally, open http://127.0.0.1:8888/
+
+# Select Base Image 
+FROM nvcr.io/nvidia/pytorch:21.03-py3
+# Update the repo
+RUN apt-get update -y
+
+# Install required python packages
+RUN pip3 install tokenizers transformers ipywidgets widgetsnbextension  
+RUN jupyter nbextension enable --py widgetsnbextension
+RUN pip3 install nvtx ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 htmlmin tldextract sentence-splitter
+
+##### TODO - change this once the final repo location is settled
+
+# Copy the data
+COPY English/ /workspace/
+
+## Launch JupyterLab by default
+CMD jupyter-lab --no-browser --allow-root --ip=0.0.0.0 --port=8888 --NotebookApp.token="" --notebook-dir=/workspace/python/
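
As a quick sanity check after building the image, the pip packages pinned above can be import-tested inside the container. A minimal sketch (not part of the commit; the list mirrors the `RUN pip3 install` lines, using each package's import name):

```python
# Run inside the container to confirm the Dockerfile's pip packages import cleanly.
import importlib

PACKAGES = [
    "tokenizers", "transformers", "nvtx", "ftfy", "langdetect", "nltk",
    "sentencepiece", "boto3", "tqdm", "regex", "bs4", "htmlmin",
    "tldextract", "sentence_splitter",  # pip name: sentence-splitter
]

for name in PACKAGES:
    try:
        importlib.import_module(name)
        print(f"{name}: OK")
    except ImportError as err:
        print(f"{name}: MISSING ({err})")
```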

+ 14 - 14
ai/Megatron/English/Python/jupyter_notebook/Day2-3_GPT_vocab_merge_files.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "traditional-store",
+   "id": "valid-smoke",
    "metadata": {},
    "source": [
     "# \n",
@@ -25,7 +25,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "logical-junction",
+   "id": "novel-crash",
    "metadata": {},
    "source": [
     "#### let's review the source code of [gpt2 tokenizer](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html)\n",
@@ -47,7 +47,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "engaging-summer",
+   "id": "enhanced-vehicle",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -57,7 +57,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "deluxe-custom",
+   "id": "consolidated-substance",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -70,7 +70,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "lonely-inspection",
+   "id": "copyrighted-present",
    "metadata": {},
    "outputs": [
     {
@@ -88,7 +88,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "intensive-harvest",
+   "id": "critical-specific",
    "metadata": {},
    "source": [
     "## examine the vocab and merge files"
@@ -97,7 +97,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "caroline-dressing",
+   "id": "chemical-ebony",
    "metadata": {},
    "outputs": [
     {
@@ -116,14 +116,14 @@
     "    o = json.load(ip_file)\n",
     "    take=20\n",
     "    rn=random.randint(0,len(o)-1)\n",
-    "    print(\"noted that the Ġ = space +256 to form that control letter\")\n",
+    "    print(\"noted that the Ġ = space + 256 to form that control letter\")\n",
     "    print(list(o.keys())[rn:rn+take])            "
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 12,
-   "id": "green-sally",
+   "id": "mediterranean-outreach",
    "metadata": {},
    "outputs": [
     {
@@ -144,7 +144,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "moving-smile",
+   "id": "controversial-albuquerque",
    "metadata": {},
    "source": [
     "## sanity check load from transformer GPT2Tokenizer "
@@ -153,7 +153,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "sublime-promise",
+   "id": "classical-stretch",
    "metadata": {},
    "outputs": [
     {
@@ -187,7 +187,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "amino-immunology",
+   "id": "cordless-science",
    "metadata": {},
    "outputs": [
     {
@@ -245,7 +245,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "associate-current",
+   "id": "temporal-latest",
    "metadata": {},
    "source": [
     "---\n",
@@ -259,7 +259,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "blind-teacher",
+   "id": "educational-ecology",
    "metadata": {},
    "source": [
     "-----\n",

+ 15 - 15
ai/Megatron/English/Python/jupyter_notebook/Day2-4_jsonfy_and_process2mmap.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "amazing-superintendent",
+   "id": "convertible-whale",
    "metadata": {},
    "source": [
     "# \n",
@@ -24,7 +24,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "aboriginal-proceeding",
+   "id": "surrounded-counter",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -36,7 +36,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "secret-memorial",
+   "id": "complete-lindsay",
    "metadata": {},
    "outputs": [
     {
@@ -55,7 +55,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "conceptual-technician",
+   "id": "conventional-mason",
    "metadata": {},
    "outputs": [
     {
@@ -74,7 +74,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "danish-finder",
+   "id": "functioning-stage",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -86,7 +86,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "possible-scope",
+   "id": "eastern-habitat",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -103,7 +103,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "thrown-oakland",
+   "id": "finite-marina",
    "metadata": {},
    "outputs": [
     {
@@ -120,7 +120,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "balanced-shield",
+   "id": "proof-pakistan",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -155,7 +155,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "essential-coral",
+   "id": "angry-canvas",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -192,7 +192,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "interested-stereo",
+   "id": "regional-stake",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -205,11 +205,11 @@
   },
   {
    "cell_type": "markdown",
-   "id": "veterinary-technical",
+   "id": "similar-commonwealth",
    "metadata": {},
    "source": [
     "---\n",
-    "## OUTPUT should looks liek the following \n",
+    "## OUTPUT should looks similar to the following \n",
     "\n",
     "                    Opening ./Megatron-LM/dataset/EN/extractedNVblogs.json\n",
     "                    > building GPT2BPETokenizer tokenizer ...\n",
@@ -240,7 +240,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "deadly-cleaning",
+   "id": "framed-point",
    "metadata": {},
    "outputs": [
     {
@@ -303,7 +303,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "bridal-frost",
+   "id": "endless-vietnamese",
    "metadata": {},
    "source": [
     "---\n",
@@ -317,7 +317,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "encouraging-vietnam",
+   "id": "married-necklace",
    "metadata": {},
    "source": [
     "-----\n",

+ 49 - 32
ai/Megatron/English/Python/jupyter_notebook/Day2-5_Observe_GPT_runs_vs_performance.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "turned-producer",
+   "id": "governing-exclusion",
    "metadata": {},
    "source": [
     "# \n",
@@ -23,26 +23,26 @@
     "   \n",
     "    - example : naive run vs. improved run \n",
     "        - starts with multiGPUs \n",
-    "    - exercise : beat the record !\n",
+    "    - challenge : beat the record !\n",
     "\n",
-    "it is possible to obtain more than 90% GPU utilizations overall with high tensorcore ops sustained throughout during **training** for all gpus \n"
+    "it is possible to obtain more than 80% GPU utilizations overall with high tensorcore ops sustained throughout during **training** for all gpus \n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "bulgarian-ceiling",
+   "id": "supported-romantic",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
-    "### intro to profiling \n",
+    "### Intro to profiling \n",
+    "\n",
     "\n",
-    "#### NVIDIA Profiling ToolChain\n",
     "<center><img src=\"./Megatron-LM/pics/NVprofilingToolchain.JPG\" width=\"800\"/></center>\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "economic-product",
+   "id": "friendly-lightning",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -54,7 +54,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "parallel-brother",
+   "id": "permanent-mainstream",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -64,14 +64,19 @@
     "<center><img src=\"./Megatron-LM/pics/Alt_callout2terminals.JPG\" width=\"600\"/></center>\n",
     "\n",
     "\n",
-    "            -------- launch profiling sessions to record: visualize on Nsight( please use Nsight Systems version >=2021.3.1 ) ---------\n",
+    "         ----- launch profiling sessions to record: visualize on Nsight( please use Nsight Systems version >=2021.3.1 ) ----\n",
+    "         \n",
+    "[Installing Nsight](https://developer.nvidia.com/gameworksdownload#?dn=nsight-systems-2021-3-1-54)\n",
+    "\n",
+    "[User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html)\n",
+    "\n",
     "<center><img src=\"./Megatron-LM/pics/multigpu_naive_run.jpg\" width=\"1000\"/></center>\n",
     "\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "satellite-drill",
+   "id": "meaning-rubber",
    "metadata": {},
    "source": [
     "---\n",
@@ -81,7 +86,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "norwegian-fault",
+   "id": "ecological-layer",
    "metadata": {},
    "outputs": [
     {
@@ -99,20 +104,32 @@
   },
   {
    "cell_type": "markdown",
-   "id": "cultural-shift",
+   "id": "precise-genius",
    "metadata": {},
    "source": [
     "---\n",
     "### Let's first verify training works properly, \n",
     "modify your configuration and the number of GPUs available to you\n",
     "\n",
-    "training output should look simialr to the following \n"
+    "training output should look simialr to the following \n",
+    "\n",
+    "                training ...\n",
+    "                time (ms) | model-and-optimizer-setup: 412.03 | train/valid/test-data-iterators-setup: 3889.83\n",
+    "                [before the start of training step] datetime: 2021-08-27 02:05:24 \n",
+    "                 iteration       10/      25 | consumed samples:  40 | elapsed time per iteration (ms): 886.5 | \n",
+    "                 learning rate:   0.000E+00 | global batch size: 4 | loss scale: 8388608.0 | number of skipped iterations:10 | \n",
+    "                \n",
+    "                ...\n",
+    "                \n",
+    "                [after training is done] datetime: 2021-08-27 02:05:40 \n",
+    "                saving checkpoint at iteration      25 to ./Megatron-LM/sv_ckpt/\n",
+    "                successfully saved checkpoint at iteration      25 to ./Megatron-LM/sv_ckpt/\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "cross-ribbon",
+   "id": "educated-story",
    "metadata": {},
    "outputs": [
     {
@@ -379,7 +396,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "id": "failing-execution",
+   "id": "numeric-relevance",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -388,7 +405,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "junior-territory",
+   "id": "dominican-limit",
    "metadata": {},
    "source": [
     "---\n",
@@ -399,7 +416,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "earlier-pontiac",
+   "id": "preliminary-feeding",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -408,7 +425,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "robust-columbus",
+   "id": "vanilla-dakota",
    "metadata": {},
    "source": [
     "----------------------------------------------------------\n",
@@ -441,7 +458,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "preceding-trainer",
+   "id": "responsible-senegal",
    "metadata": {},
    "outputs": [
     {
@@ -735,19 +752,19 @@
   },
   {
    "cell_type": "markdown",
-   "id": "aware-drive",
+   "id": "global-network",
    "metadata": {},
    "source": [
     "--------------------------------------------------\n",
     "-----\n",
     "visualizing the profiles via nsight should look similar to the following \n",
-    "\n",
-    "![multigpus naive run](./Megatron-LM/pics/multigpu_naive_run.jpg)"
+    "<center><img src=\"./Megatron-LM/pics/GPUs_utils_naive.JPG\" width=\"1000\"/></center>\n",
+    "<center><img src=\"./Megatron-LM/pics/multigpu_naive_run.jpg\" width=\"1000\"/></center>\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "weird-belgium",
+   "id": "handled-great",
    "metadata": {},
    "source": [
     "---\n",
@@ -775,7 +792,7 @@
   {
    "cell_type": "code",
    "execution_count": 7,
-   "id": "advanced-grounds",
+   "id": "academic-aquatic",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -785,7 +802,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "collect-ladder",
+   "id": "confidential-steam",
    "metadata": {},
    "outputs": [
     {
@@ -1076,18 +1093,18 @@
   },
   {
    "cell_type": "markdown",
-   "id": "liberal-preliminary",
+   "id": "earlier-priest",
    "metadata": {},
    "source": [
     "--------------------------------------------------\n",
     "visualizing the profiles via nsight should look similar to the following \n",
-    "\n",
-    "![multigpus 2nd run](./Megatron-LM/pics/2ndrun.JPG)"
+    "<center><img src=\"./Megatron-LM/pics/gpus_utils_improved.JPG\" width=\"1000\"/></center>\n",
+    "<center><img src=\"./Megatron-LM/pics/2ndrun.JPG\" width=\"1000\"/></center>\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "electoral-warning",
+   "id": "optimum-resource",
    "metadata": {},
    "source": [
     "<a id=\"TheChallenge\"></a>"
@@ -1095,7 +1112,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "rapid-composition",
+   "id": "connected-lincoln",
    "metadata": {},
    "source": [
     "----------------\n",
@@ -1134,7 +1151,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "settled-basic",
+   "id": "fatal-crawford",
    "metadata": {},
    "source": [
     "---\n",
@@ -1144,7 +1161,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "laden-assumption",
+   "id": "agreed-gasoline",
    "metadata": {},
    "source": [
     "-----\n",

+ 9 - 9
ai/Megatron/English/Python/jupyter_notebook/Day2_0_intro.ipynb

@@ -2,18 +2,18 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "cooked-female",
+   "id": "trained-reform",
    "metadata": {},
    "source": [
-    "## The Bootcamp computer environment BerzeLiUs ( SuperPOD in Sweden ) -\n",
-    "The [BerzeLiUS](https://blogs.nvidia.com/blog/2021/03/23/ai-supercomputer-sweden/) system consists of 60 NVIDIA DGX A100 systems, \n",
-    "linked on a 200 Gbit/second NVIDIA Mellanox InfiniBand HDR network. \n",
-    "The same network links the processors to 1.5 petabytes of flash memory on four storage servers from DataDirect Networks.\n"
+    "## The Bootcamp computer environment, a SuperPOD cluster  -\n",
+    "\n",
+    "For this bootcamp We will get access to NVIDIA DGX A100 systems. In general, it is highly recommanded to have access to large compute cluster when training very large language models.\n",
+    "\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "patent-arena",
+   "id": "enabling-jason",
    "metadata": {},
    "source": [
     "## Learning Objectives\n",
@@ -22,7 +22,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "average-strip",
+   "id": "fifth-argument",
    "metadata": {},
    "source": [
     "<!--**IMPORTANT**: Before we start please download the files specified in **Preparation work**.-->\n",
@@ -53,7 +53,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "israeli-going",
+   "id": "fleet-subject",
    "metadata": {},
    "source": [
     "---\n",
@@ -67,7 +67,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "blind-holmes",
+   "id": "accessible-palestine",
    "metadata": {},
    "source": [
     "---\n",

+ 77 - 83
ai/Megatron/English/Python/jupyter_notebook/Day3-5_run_Megatron_with_varying_config.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "postal-promotion",
+   "id": "appreciated-playing",
    "metadata": {},
    "source": [
     "# \n",
@@ -50,7 +50,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "deadly-windows",
+   "id": "homeless-academy",
    "metadata": {},
    "source": [
     "<a id=\"Rerun_Cell\"></a>"
@@ -58,8 +58,8 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
-   "id": "constant-affairs",
+   "execution_count": 30,
+   "id": "through-marketing",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -68,8 +68,8 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
-   "id": "transsexual-costume",
+   "execution_count": 29,
+   "id": "legitimate-thermal",
    "metadata": {},
    "outputs": [
     {
@@ -90,9 +90,9 @@
     "\n",
     "### modify this section to point the file to its own path \n",
     "CHECKPOINT_PATH='./Megatron-LM/sv_ckpt/'\n",
-    "DATA_PATH='../dataset/EN/NVblogs_text_document'\n",
-    "VOCAB_FILE='../dataset/EN/50k/gpt2-vocab.json'\n",
-    "MERGE_FILE='../dataset/EN/50k/gpt2-merges.txt'\n",
+    "DATA_PATH='../dataset/SV/webnyheter2013_text_document'\n",
+    "VOCAB_FILE='../dataset/SV/32k/vocab.json'\n",
+    "MERGE_FILE='../dataset/SV/32k/merges.txt'\n",
     "PROFILE_OUTPUT_PATH='/home/zcharpy/profiles/DLprof/2ndrun/nsys_improved' # modify this to your own profile path\n",
     "\n",
     "#### [TODO]--------------- params in the following block are allowed to change -----------#### \n",
@@ -101,7 +101,7 @@
     "\n",
     "TENSOR_MP_SIZE=8\n",
     "PIPELINE_MP_SIZE=1\n",
-    "LAYERS=32\n",
+    "LAYERS=64\n",
     "HIDDEN_SZ=2048\n",
     "NUM_ATTN_HEADS=32\n",
     "MICRO_BZ=64\n",
@@ -114,7 +114,7 @@
     "DISTRIBUTED_ARGS=\"--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT\"\n",
     "\n",
     "## for nsys run\n",
-    "nsys profile --stats=false --force-overwrite=true --duration=300 --trace=cudnn,cuda,osrt,nvtx -o $PROFILE_OUTPUT_PATH \\\n",
+    "#nsys profile --stats=false --force-overwrite=true --duration=300 --trace=cudnn,cuda,osrt,nvtx -o $PROFILE_OUTPUT_PATH \\\n",
     "python -m torch.distributed.launch $DISTRIBUTED_ARGS \\\n",
     "    ./Megatron-LM/Dlprof_pretrain_gpt.py \\\n",
     "       --tensor-model-parallel-size $TENSOR_MP_SIZE \\\n",
@@ -151,7 +151,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "constant-fighter",
+   "id": "accurate-riding",
    "metadata": {},
    "source": [
     "---\n",
@@ -161,16 +161,16 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 39,
-   "id": "weekly-mechanics",
+   "execution_count": 26,
+   "id": "minimal-lighter",
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "1\n",
-      "1677131776\n"
+      "3\n",
+      "3289513984\n"
      ]
     }
    ],
@@ -180,7 +180,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "concrete-finland",
+   "id": "cutting-patrol",
    "metadata": {},
    "source": [
     "---\n",
@@ -204,19 +204,16 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
-   "id": "experienced-standing",
+   "execution_count": 31,
+   "id": "demographic-macro",
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Collecting data...\n",
       "Initializing NVTX monkey patches\n",
-      "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:144: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead\n",
-      "  warnings.warn(\"torch.distributed.reduce_op is deprecated, please use \"\n",
-      "Done with NVTX monkey patching\n",
+      "Initializing NVTX monkey patches\n",
       "Initializing NVTX monkey patches\n",
       "Initializing NVTX monkey patches\n",
       "Initializing NVTX monkey patches\n",
@@ -226,7 +223,6 @@
       "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:144: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead\n",
       "  warnings.warn(\"torch.distributed.reduce_op is deprecated, please use \"\n",
       "Done with NVTX monkey patching\n",
-      "Initializing NVTX monkey patches\n",
       "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:144: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead\n",
       "  warnings.warn(\"torch.distributed.reduce_op is deprecated, please use \"\n",
       "Done with NVTX monkey patching\n",
@@ -241,6 +237,9 @@
       "Done with NVTX monkey patching\n",
       "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:144: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead\n",
       "  warnings.warn(\"torch.distributed.reduce_op is deprecated, please use \"\n",
+      "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:144: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead\n",
+      "  warnings.warn(\"torch.distributed.reduce_op is deprecated, please use \"\n",
+      "Done with NVTX monkey patching\n",
       "Done with NVTX monkey patching\n",
       "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:144: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead\n",
       "  warnings.warn(\"torch.distributed.reduce_op is deprecated, please use \"\n",
@@ -273,7 +272,7 @@
       "  consumed_valid_samples .......................... 0\n",
       "  data_impl ....................................... mmap\n",
       "  data_parallel_size .............................. 1\n",
-      "  data_path ....................................... ['1.', '../dataset/EN/NVblogs_text_document']\n",
+      "  data_path ....................................... ['1.', '../dataset/SV/webnyheter2013_text_document']\n",
       "  dataloader_type ................................. single\n",
       "  DDP_impl ........................................ local\n",
       "  decoder_seq_length .............................. None\n",
@@ -330,7 +329,7 @@
       "  mask_prob ....................................... 0.15\n",
       "  masked_softmax_fusion ........................... True\n",
       "  max_position_embeddings ......................... 512\n",
-      "  merge_file ...................................... ../dataset/EN/50k/gpt2-merges.txt\n",
+      "  merge_file ...................................... ../dataset/SV/32k/merges.txt\n",
       "  micro_batch_size ................................ 64\n",
       "  min_loss_scale .................................. 1.0\n",
       "  min_lr .......................................... 1e-05\n",
@@ -342,7 +341,7 @@
       "  num_attention_heads ............................. 32\n",
       "  num_channels .................................... 3\n",
       "  num_classes ..................................... 1000\n",
-      "  num_layers ...................................... 32\n",
+      "  num_layers ...................................... 64\n",
       "  num_layers_per_virtual_pipeline_stage ........... None\n",
       "  num_workers ..................................... 2\n",
       "  onnx_safe ....................................... None\n",
@@ -383,13 +382,13 @@
       "  use_one_sent_docs ............................... False\n",
       "  virtual_pipeline_model_parallel_size ............ None\n",
       "  vocab_extra_ids ................................. 0\n",
-      "  vocab_file ...................................... ../dataset/EN/50k/gpt2-vocab.json\n",
+      "  vocab_file ...................................... ../dataset/SV/32k/vocab.json\n",
       "  weight_decay .................................... 0.01\n",
       "  world_size ...................................... 8\n",
       "-------------------- end of arguments ---------------------\n",
       "setting number of micro-batches to constant 8\n",
       "> building GPT2BPETokenizer tokenizer ...\n",
-      " > padded vocab (size: 50257) with 943 dummy tokens (new size: 51200)\n",
+      " > padded vocab (size: 32000) with 768 dummy tokens (new size: 32768)\n",
       "> initializing torch distributed ...\n",
       "> initializing tensor model parallel with size 8\n",
       "> initializing pipeline model parallel with size 1\n",
@@ -399,7 +398,7 @@
       "make: Entering directory '/home/zcharpy/bootcamp/jupyter_notebook/Megatron-LM/megatron/data'\n",
       "make: Nothing to be done for 'default'.\n",
       "make: Leaving directory '/home/zcharpy/bootcamp/jupyter_notebook/Megatron-LM/megatron/data'\n",
-      ">>> done with dataset index builder. Compilation time: 0.622 seconds\n",
+      ">>> done with dataset index builder. Compilation time: 0.167 seconds\n",
       "> compiling and loading fused kernels ...\n",
       "Detected CUDA files, patching ldflags\n",
       "Emitting ninja build file /home/zcharpy/bootcamp/jupyter_notebook/Megatron-LM/megatron/fused_kernels/build/build.ninja...\n",
@@ -419,24 +418,24 @@
       "Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\n",
       "ninja: no work to do.\n",
       "Loading extension module fused_mix_prec_layer_norm_cuda...\n",
-      ">>> done with compiling and loading fused kernels. Compilation time: 31.948 seconds\n",
-      "time to initialize megatron (seconds): 9.410\n",
-      "[after megatron is initialized] datetime: 2021-08-30 05:52:25 \n",
+      ">>> done with compiling and loading fused kernels. Compilation time: 18.065 seconds\n",
+      "time to initialize megatron (seconds): 90.261\n",
+      "[after megatron is initialized] datetime: 2021-08-30 08:59:22 \n",
       "building GPT model ...\n",
-      " > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 215937024\n",
-      " > number of parameters on (tensor, pipeline) model parallel rank (5, 0): 215937024\n",
-      " > number of parameters on (tensor, pipeline) model parallel rank (2, 0): 215937024\n",
-      " > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 215937024\n",
-      " > number of parameters on (tensor, pipeline) model parallel rank (6, 0): 215937024\n",
-      " > number of parameters on (tensor, pipeline) model parallel rank (3, 0): 215937024 > number of parameters on (tensor, pipeline) model parallel rank (4, 0): 215937024\n",
-      "\n",
-      " > number of parameters on (tensor, pipeline) model parallel rank (7, 0): 215937024\n",
+      " > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 412995584\n",
+      " > number of parameters on (tensor, pipeline) model parallel rank (4, 0): 412995584\n",
+      " > number of parameters on (tensor, pipeline) model parallel rank (6, 0): 412995584\n",
+      " > number of parameters on (tensor, pipeline) model parallel rank (7, 0): 412995584\n",
+      " > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 412995584\n",
+      " > number of parameters on (tensor, pipeline) model parallel rank (2, 0): 412995584\n",
+      " > number of parameters on (tensor, pipeline) model parallel rank (3, 0): 412995584\n",
+      " > number of parameters on (tensor, pipeline) model parallel rank (5, 0): 412995584\n",
       "setting training iterations to 0\n",
       "> learning rate decay style: cosine\n",
       "WARNING: could not find the metadata file ./Megatron-LM/sv_ckpt/latest_checkpointed_iteration.txt \n",
       "    will not load any checkpoints and will start from random\n",
-      "time (ms) | load-checkpoint: 54.82\n",
-      "[after model, optimizer, and learning rate scheduler are built] datetime: 2021-08-30 05:52:29 \n",
+      "time (ms) | load-checkpoint: 25.10\n",
+      "[after model, optimizer, and learning rate scheduler are built] datetime: 2021-08-30 08:59:28 \n",
       "> building train, validation, and test datasets ...\n",
       " > datasets target sizes (minimum size):\n",
       "    train:      100\n",
@@ -449,38 +448,37 @@
       "    reading document index...\n",
       "    creating numpy buffer of mmap...\n",
       "    creating memory view of numpy buffer...\n",
-      " > finished creating indexed dataset in 0.005710 seconds\n",
-      "    number of documents: 74\n",
+      " > finished creating indexed dataset in 0.004143 seconds\n",
+      "    number of documents: 1249010\n",
       " > dataset split:\n",
       "    train:\n",
-      "     document indices in [0, 70) total of 70 documents\n",
+      "     document indices in [0, 1185311) total of 1185311 documents\n",
       "    validation:\n",
-      "     document indices in [70, 74) total of 4 documents\n",
+      "     document indices in [1185311, 1247761) total of 62450 documents\n",
       "    test:\n",
-      "     document indices in [74, 74) total of 0 documents\n",
-      " > WARNING: could not find index map files, building the indices on rank 0 ...\n",
-      " > only one epoch required, setting separate_last_epoch to False\n",
-      " > elasped time to build and save doc-idx mapping (seconds): 0.002701\n",
-      "    using:\n",
-      "     number of documents:       70\n",
-      "     number of epochs:          1\n",
-      "     sequence length:           512\n",
-      "     total number of samples:   141\n",
-      " > elasped time to build and save sample-idx mapping (seconds): 0.003840\n",
-      " > building shuffle index with split [0, 141) and [141, 141) ...\n",
-      " > elasped time to build and save shuffle-idx mapping (seconds): 0.001237\n",
-      " > loading doc-idx mapping from ../dataset/EN/NVblogs_text_document_train_indexmap_101ns_512sl_1234s_doc_idx.npy\n",
-      " > loading sample-idx mapping from ../dataset/EN/NVblogs_text_document_train_indexmap_101ns_512sl_1234s_sample_idx.npy\n",
-      " > loading shuffle-idx mapping from ../dataset/EN/NVblogs_text_document_train_indexmap_101ns_512sl_1234s_shuffle_idx.npy\n",
-      "    loaded indexed file in 0.003 seconds\n",
-      "    total number of samples: 142\n",
+      "     document indices in [1247761, 1249010) total of 1249 documents\n",
+      " > loading doc-idx mapping from ../dataset/SV/webnyheter2013_text_document_train_indexmap_101ns_512sl_1234s_doc_idx.npy\n",
+      " > loading sample-idx mapping from ../dataset/SV/webnyheter2013_text_document_train_indexmap_101ns_512sl_1234s_sample_idx.npy\n",
+      " > loading shuffle-idx mapping from ../dataset/SV/webnyheter2013_text_document_train_indexmap_101ns_512sl_1234s_shuffle_idx.npy\n",
+      "    loaded indexed file in 0.005 seconds\n",
+      "    total number of samples: 53948\n",
       "    total number of epochs: 1\n",
-      " > loading doc-idx mapping from ../dataset/EN/NVblogs_text_document_valid_indexmap_5146ns_512sl_1234s_doc_idx.npy\n",
-      " > loading sample-idx mapping from ../dataset/EN/NVblogs_text_document_valid_indexmap_5146ns_512sl_1234s_sample_idx.npy\n",
-      " > loading shuffle-idx mapping from ../dataset/EN/NVblogs_text_document_valid_indexmap_5146ns_512sl_1234s_shuffle_idx.npy\n",
+      " > loading doc-idx mapping from ../dataset/SV/webnyheter2013_text_document_valid_indexmap_5146ns_512sl_1234s_doc_idx.npy\n",
+      " > loading sample-idx mapping from ../dataset/SV/webnyheter2013_text_document_valid_indexmap_5146ns_512sl_1234s_sample_idx.npy\n",
+      " > loading shuffle-idx mapping from ../dataset/SV/webnyheter2013_text_document_valid_indexmap_5146ns_512sl_1234s_shuffle_idx.npy\n",
       "    loaded indexed file in 0.003 seconds\n",
-      "    total number of samples: 5153\n",
-      "    total number of epochs: 664\n",
+      "    total number of samples: 5695\n",
+      "    total number of epochs: 2\n",
+      " > loading doc-idx mapping from ../dataset/SV/webnyheter2013_text_document_test_indexmap_5146ns_512sl_1234s_doc_idx.npy\n",
+      " > loading sample-idx mapping from ../dataset/SV/webnyheter2013_text_document_test_indexmap_5146ns_512sl_1234s_sample_idx.npy\n",
+      " > loading shuffle-idx mapping from ../dataset/SV/webnyheter2013_text_document_test_indexmap_5146ns_512sl_1234s_shuffle_idx.npy\n",
+      "    loaded indexed file in 0.003 seconds\n",
+      "    total number of samples: 5192\n",
+      "    total number of epochs: 91\n",
+      "> building indices for blendable datasets ...\n",
+      " > sample ratios:\n",
+      "   dataset 0, input: 1, achieved: 1\n",
+      "> elapsed time for building blendable dataset indices: 0.00 (sec)\n",
       "> building indices for blendable datasets ...\n",
       " > sample ratios:\n",
       "   dataset 0, input: 1, achieved: 1\n",
@@ -490,22 +488,18 @@
       "   dataset 0, input: 1, achieved: 1\n",
       "> elapsed time for building blendable dataset indices: 0.00 (sec)\n",
       "> finished creating GPT datasets ...\n",
-      "[after dataloaders are built] datetime: 2021-08-30 05:52:33 \n",
+      "[after dataloaders are built] datetime: 2021-08-30 08:59:32 \n",
       "done with setup ...\n",
       "training ...\n",
-      "time (ms) | model-and-optimizer-setup: 3872.16 | train/valid/test-data-iterators-setup: 2957.85\n",
-      "[after training is done] datetime: 2021-08-30 05:52:34 \n",
+      "time (ms) | model-and-optimizer-setup: 6065.80 | train/valid/test-data-iterators-setup: 2661.91\n",
+      "[after training is done] datetime: 2021-08-30 08:59:32 \n",
       "------------------------------------------------------------------------------------------------------------------\n",
-      " validation loss at the end of training for val data | lm loss value: 1.124461E+01 | lm loss PPL: 7.646642E+04 | \n",
+      " validation loss at the end of training for val data | lm loss value: 1.081321E+01 | lm loss PPL: 4.967259E+04 | \n",
       "------------------------------------------------------------------------------------------------------------------\n",
-      "Processing events...\n",
-      "Capturing symbol files...\n",
-      "Saving temporary \"/tmp/nsys-report-c19d-0647-d480-f24c.qdstrm\" file to disk...\n",
-      "Creating final output files...\n",
-      "\n",
-      "Processing [==============================================================100%]\n",
-      "Saved report file to \"/tmp/nsys-report-c19d-0647-d480-f24c.qdrep\"\n",
-      "Report file moved to \"/home/zcharpy/profiles/DLprof/2ndrun/nsys_improved.qdrep\"\n"
+      "Evaluating iter 10/10\n",
+      "-------------------------------------------------------------------------------------------------------------------\n",
+      " validation loss at the end of training for test data | lm loss value: 1.081394E+01 | lm loss PPL: 4.970880E+04 | \n",
+      "-------------------------------------------------------------------------------------------------------------------\n"
      ]
     }
    ],
@@ -515,7 +509,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "spoken-night",
+   "id": "opening-latter",
    "metadata": {},
    "source": [
     "## Remember to copy and paste your output on Slack or Zoom\n",
@@ -524,7 +518,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "interpreted-escape",
+   "id": "jewish-tomorrow",
    "metadata": {},
    "source": [
     "-----\n",

BIN
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/pics/GPUs_utils_naive.JPG


BIN
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/pics/gpus_utils_improved.JPG


+ 45 - 40
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Day3-2_SentenceBoundary_and_Deduplicate.ipynb

@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "virtual-gasoline",
+   "id": "prerequisite-duration",
    "metadata": {},
    "source": [
     "# \n",
@@ -30,7 +30,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "identical-manufacturer",
+   "id": "israeli-buddy",
    "metadata": {},
    "source": [
     "--------------------------------------------------------------------------------------------------------------------\n",
@@ -40,7 +40,12 @@
     "\n",
     "    follow instruction from [Megatron-LM/tools/openwebtext/README](https://github.com/NVIDIA/Megatron-LM/tree/main/tools/openwebtext) in openwebtext clearning folder \n",
     "\n",
-    "    note : in a restricted environment where sudo is not allowed, please follow the below instruction to modify installation\n",
+    "    note : in a restricted environment where sudo is not allowed, please follow the below instruction to modify installation ---\n",
+    "            \n",
+    "            call out a terminal             \n",
+    "   ![call out a terminal ](../../pics/Alt_callout2terminals.JPG)\n",
+    "   \n",
+    "            cd ./jupyter_notebook/Megatron-LM/tools/openwebtext/\n",
     "        \n",
     "            git clone https://github.com/mattilyra/LSH\n",
     "            cd LSH\n",
@@ -53,7 +58,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "stainless-courtesy",
+   "id": "superior-identifier",
    "metadata": {},
    "outputs": [
     {
@@ -106,7 +111,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "returning-internship",
+   "id": "coordinated-costa",
    "metadata": {},
    "source": [
     "-------------------------------------------------------------------------------\n",
@@ -116,7 +121,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "brown-satisfaction",
+   "id": "bored-utility",
    "metadata": {},
    "outputs": [
     {
@@ -139,7 +144,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "induced-aquatic",
+   "id": "mineral-venue",
    "metadata": {},
    "outputs": [
     {
@@ -161,7 +166,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "overhead-panama",
+   "id": "serious-reach",
    "metadata": {},
    "outputs": [
     {
@@ -182,7 +187,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "undefined-fifteen",
+   "id": "mediterranean-disability",
    "metadata": {},
    "source": [
     "-----------------------------------------------------------\n",
@@ -192,7 +197,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "precise-melissa",
+   "id": "fancy-affiliate",
    "metadata": {},
    "outputs": [
     {
@@ -222,7 +227,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "unauthorized-azerbaijan",
+   "id": "flush-apparatus",
    "metadata": {},
    "outputs": [
     {
@@ -251,7 +256,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "integrated-observer",
+   "id": "lonely-february",
    "metadata": {},
    "source": [
     "-----------------------------------------------------------\n",
@@ -261,7 +266,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "interested-companion",
+   "id": "lesbian-finland",
    "metadata": {},
    "outputs": [
     {
@@ -286,7 +291,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "accompanied-tanzania",
+   "id": "consistent-society",
    "metadata": {},
    "source": [
     "-----------------------------------------------------------\n",
@@ -296,7 +301,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "id": "pregnant-defensive",
+   "id": "simple-withdrawal",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -332,7 +337,7 @@
   {
    "cell_type": "code",
    "execution_count": 10,
-   "id": "fleet-above",
+   "id": "employed-mercy",
    "metadata": {},
    "outputs": [
     {
@@ -357,7 +362,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "filled-silence",
+   "id": "palestinian-reason",
    "metadata": {},
    "source": [
     "-----------------------------------------------------------\n",
@@ -367,7 +372,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "radical-facing",
+   "id": "imported-dispatch",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -416,7 +421,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "african-savannah",
+   "id": "altered-darkness",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -430,7 +435,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "standard-auditor",
+   "id": "incoming-invitation",
    "metadata": {},
    "outputs": [
     {
@@ -456,7 +461,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "found-shuttle",
+   "id": "fabulous-brown",
    "metadata": {},
    "source": [
     "## dataset extracted from NVIDIA blog urls "
@@ -465,7 +470,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "regulated-dating",
+   "id": "adjusted-salvation",
    "metadata": {},
    "outputs": [
     {
@@ -540,7 +545,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "sapphire-respect",
+   "id": "painful-uncertainty",
    "metadata": {},
    "source": [
     "## create our own groudtruth dataset"
@@ -549,7 +554,7 @@
   {
    "cell_type": "code",
    "execution_count": 43,
-   "id": "indie-citation",
+   "id": "impressed-wound",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -580,7 +585,7 @@
   {
    "cell_type": "code",
    "execution_count": 49,
-   "id": "earlier-rogers",
+   "id": "complete-powell",
    "metadata": {},
    "outputs": [
     {
@@ -679,7 +684,7 @@
   {
    "cell_type": "code",
    "execution_count": 45,
-   "id": "healthy-bullet",
+   "id": "correct-blocking",
    "metadata": {},
    "outputs": [
     {
@@ -774,7 +779,7 @@
   {
    "cell_type": "code",
    "execution_count": 50,
-   "id": "direct-manchester",
+   "id": "infinite-convergence",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -783,7 +788,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "continental-anthony",
+   "id": "little-quilt",
    "metadata": {},
    "source": [
     "---\n",
@@ -794,7 +799,7 @@
   {
    "cell_type": "code",
    "execution_count": 133,
-   "id": "superior-victim",
+   "id": "distributed-authentication",
    "metadata": {},
    "outputs": [
     {
@@ -891,7 +896,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "rotary-gentleman",
+   "id": "tribal-diagnosis",
    "metadata": {},
    "source": [
     "---\n",
@@ -901,7 +906,7 @@
   {
    "cell_type": "code",
    "execution_count": 102,
-   "id": "marked-latino",
+   "id": "graphic-august",
    "metadata": {},
    "outputs": [
     {
@@ -924,7 +929,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "analyzed-carol",
+   "id": "centered-study",
    "metadata": {},
    "source": [
     "<a id=\"TheChallenge\"></a>"
@@ -932,7 +937,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "pretty-plumbing",
+   "id": "egyptian-peter",
    "metadata": {},
    "source": [
     "---\n",
@@ -957,7 +962,7 @@
   {
    "cell_type": "code",
    "execution_count": 108,
-   "id": "excessive-illness",
+   "id": "current-colonial",
    "metadata": {
     "jupyter": {
      "source_hidden": true
@@ -975,7 +980,7 @@
   {
    "cell_type": "code",
    "execution_count": 103,
-   "id": "gorgeous-brother",
+   "id": "prostate-jackson",
    "metadata": {
     "jupyter": {
      "source_hidden": true
@@ -999,7 +1004,7 @@
   {
    "cell_type": "code",
    "execution_count": 104,
-   "id": "tribal-vienna",
+   "id": "understanding-defeat",
    "metadata": {
     "collapsed": true,
     "jupyter": {
@@ -1033,7 +1038,7 @@
   {
    "cell_type": "code",
    "execution_count": 105,
-   "id": "automotive-steam",
+   "id": "magnetic-indiana",
    "metadata": {
     "jupyter": {
      "source_hidden": true
@@ -1053,7 +1058,7 @@
   {
    "cell_type": "code",
    "execution_count": 106,
-   "id": "coated-separate",
+   "id": "major-mathematics",
    "metadata": {
     "collapsed": true,
     "jupyter": {
@@ -1088,7 +1093,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "complicated-spanish",
+   "id": "appointed-analyst",
    "metadata": {},
    "source": [
     "---\n",
@@ -1102,7 +1107,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "organized-monthly",
+   "id": "furnished-boutique",
    "metadata": {},
    "source": [
     "-----\n",

+ 0 - 639
ai/Megatron/English/Python/jupyter_notebook/timeend2end.ipynb

@@ -1,639 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "alternate-collar",
-   "metadata": {},
-   "source": [
-    "# \n",
-    "\n",
-    "# 5 Monitor GPT training performance with varying config\n",
-    "---\n",
-    "\n",
-    "## Learning Objectives\n",
-    "- **The goal of this lab is to monitor the performance of your training runs with different GPT training configurations **\n",
-    "    - motivation : why should we care ? \n",
-    "    \n",
-    "    Answer : bad config result in very low / inconsistent gpus utilizations which in turn, slow down training and therefore longer experiments per run, it's a lose-lose-lose situation on all sides.\n",
-    "    ![see example](./Megatron-LM/pics/naive_run.JPG)\n",
-    "    \n",
-    "    - example : naive run vs. improved run \n",
-    "      starts with multiGPUs -->  multinode ( if we get at least 2 nodes per person / team ) \n",
-    "    - exercise : beat the record !\n",
-    "\n",
-    "it is possible to obtain more than 90% GPU utilizations overall with high tensorcore ops sustained throughout forward and backward training throughout all gpus used in training. \n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "fifty-swimming",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!rm -fr ./Megatron-LM/sv_ckpt/*"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "copyrighted-belarus",
-   "metadata": {},
-   "source": [
-    "## Let's verify the environment is ready "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "chronic-bradley",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!nvidia-smi"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "personalized-walker",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!nvidia-smi nvlink --status "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "minimal-extreme",
-   "metadata": {},
-   "source": []
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "prostate-trouble",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "id": "industrial-index",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Collecting data...\n",
-      "using world size: 8, data-parallel-size: 8, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 \n",
-      "using torch.float32 for parameters ...\n",
-      "------------------------ arguments ------------------------\n",
-      "  accumulate_allreduce_grads_in_fp32 .............. False\n",
-      "  adam_beta1 ...................................... 0.9\n",
-      "  adam_beta2 ...................................... 0.999\n",
-      "  adam_eps ........................................ 1e-08\n",
-      "  adlr_autoresume ................................. False\n",
-      "  adlr_autoresume_interval ........................ 1000\n",
-      "  apply_query_key_layer_scaling ................... True\n",
-      "  apply_residual_connection_post_layernorm ........ False\n",
-      "  attention_dropout ............................... 0.1\n",
-      "  attention_softmax_in_fp32 ....................... False\n",
-      "  bert_binary_head ................................ True\n",
-      "  bert_load ....................................... None\n",
-      "  bf16 ............................................ False\n",
-      "  bias_dropout_fusion ............................. True\n",
-      "  bias_gelu_fusion ................................ True\n",
-      "  biencoder_projection_dim ........................ 0\n",
-      "  biencoder_shared_query_context_model ............ False\n",
-      "  block_data_path ................................. None\n",
-      "  checkpoint_activations .......................... True\n",
-      "  checkpoint_num_layers ........................... 1\n",
-      "  clip_grad ....................................... 1.0\n",
-      "  consumed_train_samples .......................... 0\n",
-      "  consumed_valid_samples .......................... 0\n",
-      "  data_impl ....................................... mmap\n",
-      "  data_parallel_size .............................. 8\n",
-      "  data_path ....................................... ['../dataset/EN/NVblogs_text_document']\n",
-      "  dataloader_type ................................. single\n",
-      "  DDP_impl ........................................ local\n",
-      "  decoder_seq_length .............................. None\n",
-      "  distribute_checkpointed_activations ............. False\n",
-      "  distributed_backend ............................. nccl\n",
-      "  embedding_path .................................. None\n",
-      "  encoder_seq_length .............................. 512\n",
-      "  eod_mask_loss ................................... False\n",
-      "  eval_interval ................................... 100\n",
-      "  eval_iters ...................................... 10\n",
-      "  evidence_data_path .............................. None\n",
-      "  exit_duration_in_mins ........................... None\n",
-      "  exit_interval ................................... None\n",
-      "  ffn_hidden_size ................................. 4096\n",
-      "  finetune ........................................ False\n",
-      "  fp16 ............................................ False\n",
-      "  fp16_lm_cross_entropy ........................... False\n",
-      "  fp32_residual_connection ........................ False\n",
-      "  global_batch_size ............................... 8\n",
-      "  hidden_dropout .................................. 0.1\n",
-      "  hidden_size ..................................... 1024\n",
-      "  hysteresis ...................................... 2\n",
-      "  ict_head_size ................................... None\n",
-      "  ict_load ........................................ None\n",
-      "  img_dim ......................................... 224\n",
-      "  indexer_batch_size .............................. 128\n",
-      "  indexer_log_interval ............................ 1000\n",
-      "  init_method_std ................................. 0.02\n",
-      "  init_method_xavier_uniform ...................... False\n",
-      "  initial_loss_scale .............................. 4294967296\n",
-      "  kv_channels ..................................... 64\n",
-      "  layernorm_epsilon ............................... 1e-05\n",
-      "  lazy_mpu_init ................................... None\n",
-      "  load ............................................ ./Megatron-LM/sv_ckpt/\n",
-      "  local_rank ...................................... 0\n",
-      "  log_batch_size_to_tensorboard ................... False\n",
-      "  log_interval .................................... 10\n",
-      "  log_learning_rate_to_tensorboard ................ True\n",
-      "  log_loss_scale_to_tensorboard ................... True\n",
-      "  log_num_zeros_in_grad ........................... False\n",
-      "  log_params_norm ................................. False\n",
-      "  log_timers_to_tensorboard ....................... False\n",
-      "  log_validation_ppl_to_tensorboard ............... False\n",
-      "  loss_scale ...................................... None\n",
-      "  loss_scale_window ............................... 1000\n",
-      "  lr .............................................. 0.00015\n",
-      "  lr_decay_iters .................................. None\n",
-      "  lr_decay_samples ................................ None\n",
-      "  lr_decay_style .................................. cosine\n",
-      "  lr_warmup_fraction .............................. 0.01\n",
-      "  lr_warmup_iters ................................. 0\n",
-      "  lr_warmup_samples ............................... 0\n",
-      "  make_vocab_size_divisible_by .................... 128\n",
-      "  mask_prob ....................................... 0.15\n",
-      "  masked_softmax_fusion ........................... True\n",
-      "  max_position_embeddings ......................... 512\n",
-      "  merge_file ...................................... ../dataset/EN/50k/gpt2-merges.txt\n",
-      "  micro_batch_size ................................ 1\n",
-      "  min_loss_scale .................................. 1.0\n",
-      "  min_lr .......................................... 1e-05\n",
-      "  mmap_warmup ..................................... False\n",
-      "  no_load_optim ................................... None\n",
-      "  no_load_rng ..................................... None\n",
-      "  no_save_optim ................................... None\n",
-      "  no_save_rng ..................................... None\n",
-      "  num_attention_heads ............................. 16\n",
-      "  num_channels .................................... 3\n",
-      "  num_classes ..................................... 1000\n",
-      "  num_layers ...................................... 16\n",
-      "  num_layers_per_virtual_pipeline_stage ........... None\n",
-      "  num_workers ..................................... 2\n",
-      "  onnx_safe ....................................... None\n",
-      "  openai_gelu ..................................... False\n",
-      "  optimizer ....................................... adam\n",
-      "  override_lr_scheduler ........................... False\n",
-      "  params_dtype .................................... torch.float32\n",
-      "  patch_dim ....................................... 16\n",
-      "  pipeline_model_parallel_size .................... 1\n",
-      "  query_in_block_prob ............................. 0.1\n",
-      "  rampup_batch_size ............................... None\n",
-      "  rank ............................................ 0\n",
-      "  reset_attention_mask ............................ False\n",
-      "  reset_position_ids .............................. False\n",
-      "  retriever_report_topk_accuracies ................ []\n",
-      "  retriever_score_scaling ......................... False\n",
-      "  retriever_seq_length ............................ 256\n",
-      "  sample_rate ..................................... 1.0\n",
-      "  save ............................................ ./Megatron-LM/sv_ckpt/\n",
-      "  save_interval ................................... 100\n",
-      "  scatter_gather_tensors_in_pipeline .............. True\n",
-      "  seed ............................................ 1234\n",
-      "  seq_length ...................................... 512\n",
-      "  sgd_momentum .................................... 0.9\n",
-      "  short_seq_prob .................................. 0.1\n",
-      "  split ........................................... 949,50,1\n",
-      "  tensor_model_parallel_size ...................... 1\n",
-      "  tensorboard_dir ................................. None\n",
-      "  tensorboard_log_interval ........................ 1\n",
-      "  tensorboard_queue_size .......................... 1000\n",
-      "  titles_data_path ................................ None\n",
-      "  tokenizer_type .................................. GPT2BPETokenizer\n",
-      "  train_iters ..................................... None\n",
-      "  train_samples ................................... 100\n",
-      "  use_checkpoint_lr_scheduler ..................... False\n",
-      "  use_contiguous_buffers_in_ddp ................... False\n",
-      "  use_cpu_initialization .......................... None\n",
-      "  use_one_sent_docs ............................... False\n",
-      "  virtual_pipeline_model_parallel_size ............ None\n",
-      "  vocab_extra_ids ................................. 0\n",
-      "  vocab_file ...................................... ../dataset/EN/50k/gpt2-vocab.json\n",
-      "  weight_decay .................................... 0.01\n",
-      "  world_size ...................................... 8\n",
-      "-------------------- end of arguments ---------------------\n",
-      "setting number of micro-batches to constant 1\n",
-      "> building GPT2BPETokenizer tokenizer ...\n",
-      " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
-      "> initializing torch distributed ...\n",
-      "> initializing tensor model parallel with size 1\n",
-      "> initializing pipeline model parallel with size 1\n",
-      "> setting random seeds to 1234 ...\n",
-      "> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234\n",
-      "> compiling dataset index builder ...\n",
-      "make: Entering directory '/home/zcharpy/bootcamp/jupyter_notebook/Megatron-LM/megatron/data'\n",
-      "make: Nothing to be done for 'default'.\n",
-      "make: Leaving directory '/home/zcharpy/bootcamp/jupyter_notebook/Megatron-LM/megatron/data'\n",
-      ">>> done with dataset index builder. Compilation time: 0.573 seconds\n",
-      "WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.\n",
-      "> compiling and loading fused kernels ...\n",
-      "Detected CUDA files, patching ldflags\n",
-      "Emitting ninja build file /home/zcharpy/bootcamp/jupyter_notebook/Megatron-LM/megatron/fused_kernels/build/build.ninja...\n",
-      "Building extension module scaled_upper_triang_masked_softmax_cuda...\n",
-      "Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\n",
-      "ninja: no work to do.\n",
-      "Loading extension module scaled_upper_triang_masked_softmax_cuda...\n",
-      "Detected CUDA files, patching ldflags\n",
-      "Emitting ninja build file /home/zcharpy/bootcamp/jupyter_notebook/Megatron-LM/megatron/fused_kernels/build/build.ninja...\n",
-      "Building extension module scaled_masked_softmax_cuda...\n",
-      "Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\n",
-      "ninja: no work to do.\n",
-      "Loading extension module scaled_masked_softmax_cuda...\n",
-      "Detected CUDA files, patching ldflags\n",
-      "Emitting ninja build file /home/zcharpy/bootcamp/jupyter_notebook/Megatron-LM/megatron/fused_kernels/build/build.ninja...\n",
-      "Building extension module fused_mix_prec_layer_norm_cuda...\n",
-      "Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\n",
-      "ninja: no work to do.\n",
-      "Loading extension module fused_mix_prec_layer_norm_cuda...\n",
-      ">>> done with compiling and loading fused kernels. Compilation time: 31.516 seconds\n",
-      "time to initialize megatron (seconds): 13.548\n",
-      "[after megatron is initialized] datetime: 2021-08-26 00:28:13 \n",
-      "building GPT model ...\n",
-      " > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 253577216\n",
-      "setting training iterations to 12\n",
-      "> learning rate decay style: cosine\n",
-      "WARNING: could not find the metadata file ./Megatron-LM/sv_ckpt/latest_checkpointed_iteration.txt \n",
-      "    will not load any checkpoints and will start from random\n",
-      "time (ms) | load-checkpoint: 30.87\n",
-      "[after model, optimizer, and learning rate scheduler are built] datetime: 2021-08-26 00:28:14 \n",
-      "> building train, validation, and test datasets ...\n",
-      " > datasets target sizes (minimum size):\n",
-      "    train:      100\n",
-      "    validation: 80\n",
-      "    test:       80\n",
-      "> building train, validation, and test datasets for GPT ...\n",
-      " > building dataset index ...\n",
-      "    reading sizes...\n",
-      "    reading pointers...\n",
-      "    reading document index...\n",
-      "    creating numpy buffer of mmap...\n",
-      "    creating memory view of numpy buffer...\n",
-      " > finished creating indexed dataset in 0.003097 seconds\n",
-      "    number of documents: 74\n",
-      " > dataset split:\n",
-      "    train:\n",
-      "     document indices in [0, 70) total of 70 documents\n",
-      "    validation:\n",
-      "     document indices in [70, 74) total of 4 documents\n",
-      "    test:\n",
-      "     document indices in [74, 74) total of 0 documents\n",
-      " > loading doc-idx mapping from ../dataset/EN/NVblogs_text_document_train_indexmap_100ns_512sl_1234s_doc_idx.npy\n",
-      " > loading sample-idx mapping from ../dataset/EN/NVblogs_text_document_train_indexmap_100ns_512sl_1234s_sample_idx.npy\n",
-      " > loading shuffle-idx mapping from ../dataset/EN/NVblogs_text_document_train_indexmap_100ns_512sl_1234s_shuffle_idx.npy\n",
-      "    loaded indexed file in 0.018 seconds\n",
-      "    total number of samples: 142\n",
-      "    total number of epochs: 1\n",
-      " > loading doc-idx mapping from ../dataset/EN/NVblogs_text_document_valid_indexmap_80ns_512sl_1234s_doc_idx.npy\n",
-      " > loading sample-idx mapping from ../dataset/EN/NVblogs_text_document_valid_indexmap_80ns_512sl_1234s_sample_idx.npy\n",
-      " > loading shuffle-idx mapping from ../dataset/EN/NVblogs_text_document_valid_indexmap_80ns_512sl_1234s_shuffle_idx.npy\n",
-      "    loaded indexed file in 0.022 seconds\n",
-      "    total number of samples: 86\n",
-      "    total number of epochs: 11\n",
-      "> finished creating GPT datasets ...\n",
-      "[after dataloaders are built] datetime: 2021-08-26 00:28:24 \n",
-      "done with setup ...\n",
-      "training ...\n",
-      "time (ms) | model-and-optimizer-setup: 548.83 | train/valid/test-data-iterators-setup: 10068.67\n",
-      "[before the start of training step] datetime: 2021-08-26 00:28:24 \n",
-      " iteration       10/      12 | consumed samples:           80 | elapsed time per iteration (ms): 2141.9 | learning rate: 2.363E-05 | global batch size:     8 | lm loss: 9.601698E+00 | loss scale: 1.0 | grad norm: 1.856 | number of skipped iterations:   0 | number of nan iterations:   0 |\n",
-      "time (ms) | forward-compute: 1252.78 | backward-compute: 643.07 | backward-params-all-reduce: 146.14 | backward-embedding-all-reduce: 0.06 | optimizer: 94.92 | batch-generator: 12.73\n",
-      "[Rank 0] (after 10 iterations) memory (MB) | allocated: 3869.28369140625 | max allocated: 5229.60595703125 | reserved: 7306.0 | max reserved: 7306.0\n",
-      "[after training is done] datetime: 2021-08-26 00:28:47 \n",
-      "------------------------------------------------------------------------------------------------------------------saving checkpoint at iteration      12 to ./Megatron-LM/sv_ckpt/\n",
-      "\n",
-      " validation loss at the end of training for val data | lm loss value: 8.891883E+00 | lm loss PPL: 7.272700E+03 | \n",
-      "------------------------------------------------------------------------------------------------------------------\n",
-      "  successfully saved checkpoint at iteration      12 to ./Megatron-LM/sv_ckpt/\n",
-      "*****************************************\n",
-      "Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. \n",
-      "*****************************************\n",
-      "Processing events...\n",
-      "Capturing symbol files...\n",
-      "Saving temporary \"/tmp/nsys-report-5e6b-aa12-9711-df33.qdstrm\" file to disk...\n",
-      "Creating final output files...\n",
-      "\n",
-      "Processing [==============================================================100%]\n",
-      "Saved report file to \"/tmp/nsys-report-5e6b-aa12-9711-df33.qdrep\"\n",
-      "Exporting 1665820 events: [===============================================100%]\n",
-      "\n",
-      "Exported successfully to\n",
-      "/tmp/nsys-report-5e6b-aa12-9711-df33.sqlite\n",
-      "\n",
-      "\n",
-      "CUDA API Statistics:\n",
-      "\n",
-      " Time(%)  Total Time (ns)  Num Calls    Average    Minimum    Maximum                 Name              \n",
-      " -------  ---------------  ---------  -----------  -------  -----------  -------------------------------\n",
-      "    56.2     285320355235       1696  168231341.5     3530  29703799378  cudaDeviceSynchronize          \n",
-      "    29.6     149955233137       4392   34142812.6     4864  17084387260  cudaMalloc                     \n",
-      "     7.9      39849705340       2960   13462738.3     1275   2244485039  cudaFree                       \n",
-      "     1.7       8434557988     132848      63490.3     4159    108060983  cudaLaunchKernel               \n",
-      "     1.6       8056133957       1136    7091667.2     5385     73711234  cudaHostAlloc                  \n",
-      "     1.4       7314617708      23844     306769.7     6186    758561184  cudaMemcpyAsync                \n",
-      "     0.4       1978894053        720    2748464.0   352582     35632270  cudaIpcOpenMemHandle           \n",
-      "     0.3       1558283230       1064    1464551.9     5957     26847462  cudaFreeHost                   \n",
-      "     0.3       1519497814        720    2110413.6   240178     21756247  cudaIpcCloseMemHandle          \n",
-      "     0.2        869040030      15104      57537.1     7662     32041419  cuLaunchKernel                 \n",
-      "     0.1        760874918      14944      50915.1     5032    103123066  cudaMemsetAsync                \n",
-      "     0.1        416824863      63658       6547.9      361     79176267  cudaStreamIsCapturing_v10000   \n",
-      "     0.1        289499706      16483      17563.5     1293     40910490  cudaEventQuery                 \n",
-      "     0.1        283699184      21864      12975.6      643     60138070  cudaEventRecord                \n",
-      "     0.0        163990222      44832       3657.9      279     24748108  cudaStreamGetCaptureInfo_v10010\n",
-      "     0.0        149197647        512     291401.7     2613     37658462  cudaStreamCreateWithPriority   \n",
-      "     0.0        138131763       4504      30668.7      523     32220166  cudaEventDestroy               \n",
-      "     0.0        113211871       2732      41439.2     2572     30272284  cudaStreamSynchronize          \n",
-      "     0.0         75811639       2848      26619.3     3170     29446146  cudaMemset                     \n",
-      "     0.0         68895475         40    1722386.9   101401     32335797  cuModuleLoadData               \n",
-      "     0.0         36296853       1824      19899.6     6299      6783812  cudaMemcpy                     \n",
-      "     0.0         13397443         40     334936.1    21288      4248804  cuModuleUnload                 \n",
-      "     0.0         12237981       4520       2707.5      381      2838236  cudaEventCreateWithFlags       \n",
-      "     0.0         11913924       2784       4279.4     1276      3291031  cudaStreamWaitEvent            \n",
-      "     0.0           100758         24       4198.3     1730        10252  cuInit                         \n",
-      "\n",
-      "\n",
-      "\n",
-      "CUDA Kernel Statistics:\n",
-      "\n",
-      " Time(%)  Total Time (ns)  Instances    Average     Minimum    Maximum                                                    Name                                                \n",
-      " -------  ---------------  ---------  ------------  -------  -----------  ----------------------------------------------------------------------------------------------------\n",
-      "    88.2     260838216539         88  2964070642.5     9664  29673260794  ncclKernel_AllReduce_RING_LL_Sum_uint8_t(ncclWorkElem)                                              \n",
-      "     6.4      18834798439        280    67267137.3    13952    423636076  ncclKernel_AllReduce_RING_LL_Sum_float(ncclWorkElem)                                                \n",
-      "     1.3       3737595881      13056      286274.2   100127       445790  volta_sgemm_128x32_tn                                                                               \n",
-      "     0.6       1922114218       6240      308031.1    88128      4093366  volta_sgemm_128x32_nt                                                                               \n",
-      "     0.6       1914679497       4528      422853.2   268063      4114387  volta_sgemm_128x64_tn                                                                               \n",
-      "     0.6       1628497447       6144      265054.9    91936       432223  volta_sgemm_128x32_nn                                                                               \n",
-      "     0.3       1019927165         16    63745447.8    12160    185146385  ncclKernel_AllReduce_RING_LL_Sum_int64_t(ncclWorkElem)                                              \n",
-      "     0.3        911914712       1632      558771.3     5824       807487  void multi_tensor_apply_kernel<TensorListMetadata<4>, AdamFunctor<float>, float, float, float, floa…\n",
-      "     0.2        624933625       5888      106136.8    96128       125600  volta_sgemm_64x64_nn                                                                                \n",
-      "     0.2        516397176       9608       53746.6     2495      2498036  void at::native::vectorized_elementwise_kernel<4, at::native::MulScalarFunctor<float, float>, at::d…\n",
-      "     0.1        370199942         96     3856249.4  3595413      4220668  volta_sgemm_64x32_sliced1x4_nn                                                                      \n",
-      "     0.1        365762037       5888       62119.9    53312        81952  volta_sgemm_64x64_tn                                                                                \n",
-      "     0.1        317760917       3072      103437.8    94048       120768  volta_sgemm_64x64_nt                                                                                \n",
-      "     0.1        291143645       5888       49447.0    45408        58304  void at::native::unrolled_elementwise_kernel<at::native::(anonymous namespace)::masked_fill_kernel<…\n",
-      "     0.1        253378189        192     1319678.1   725760      1933403  void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 1, 128, 1>(float*,…\n",
-      "     0.1        251234285       1440      174468.3     3585       228607  void multi_tensor_apply_kernel<TensorListMetadata<2>, ScaleFunctor<float, float>, float>(int, int v…\n",
-      "     0.1        211542704       4352       48608.2    45728        53376  void (anonymous namespace)::softmax_warp_forward<float, float, float, 9, false>(float*, float const…\n",
-      "     0.1        165985735       3168       52394.5     7648        57663  void at::native::(anonymous namespace)::fused_dropout_kernel_vec<float, float, unsigned int, 1, 4>(…\n",
-      "     0.1        154413148       7088       21785.2     7424       259423  void at::native::unrolled_elementwise_kernel<at::native::AddFunctor<float>, at::detail::Array<char*…\n",
-      "     0.0        126251751       6144       20548.8    12864        29088  void at::native::reduce_kernel<128, 4, at::native::ReduceOp<float, at::native::func_wrapper_t<float…\n",
-      "     0.0        119034048       1344       88567.0     5120       108128  void multi_tensor_apply_kernel<TensorListMetadata<1>, L2NormFunctor<float>, float*, float*, bool, i…\n",
-      "     0.0        118285804       6352       18621.8     2560       733214  void at::native::vectorized_elementwise_kernel<4, at::native::AddFunctor<float>, at::detail::Array<…\n",
-      "     0.0        105684444       4352       24284.1    21791        37184  kernel_1                                                                                            \n",
-      "     0.0         98824724       1904       51903.7     2496        70079  void at::native::vectorized_elementwise_kernel<4, at::native::MulFunctor<float>, at::detail::Array<…\n",
-      "     0.0         97740010       1536       63632.8    62784        68160  void (anonymous namespace)::softmax_warp_backward<float, float, float, 9, false>(float*, float cons…\n",
-      "     0.0         82467209       8880        9286.8     6656        16928  void cuApplyLayerNorm<float, float, float>(float*, float*, float*, float const*, int, int, float, f…\n",
-      "     0.0         76246947       1632       46719.9     5920        52928  void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::masked_scale_k…\n",
-      "     0.0         66689595       5952       11204.6     8864        18848  kernel_2                                                                                            \n",
-      "     0.0         65696173       4560       14407.1     2335       344991  void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<float>, at::detail::Array…\n",
-      "     0.0         54335809       1536       35374.9    33120        40640  kernel_4                                                                                            \n",
-      "     0.0         51338187       4352       11796.5     8576        16288  void at::native::unrolled_elementwise_kernel<at::native::copy_device_to_device(at::TensorIterator&,…\n",
-      "     0.0         45344126        176      257637.1   254302       260671  void at::native::unrolled_elementwise_kernel<at::native::DivFunctor<float>, at::detail::Array<char*…\n",
-      "     0.0         44337536        176      251917.8   250559       253407  void at::native::vectorized_elementwise_kernel<4, at::native::exp_kernel_cuda(at::TensorIterator&):…\n",
-      "     0.0         44304532       1536       28844.1    25920        36992  void at::native::(anonymous namespace)::CatArrayBatchedCopy<float, unsigned int, 4, 64, 64>(float*,…\n",
-      "     0.0         33892087       3168       10698.3     8480        18304  void cuComputePartGradGammaBeta<float, float, float>(float const*, float const*, int, int, float co…\n",
-      "     0.0         29677485       3168        9367.9     8032        18464  void cuComputeGradInput<float, float, float>(float const*, float const*, int, int, float const*, fl…\n",
-      "     0.0         28793796       3072        9373.0     6816        16351  kernel_3                                                                                            \n",
-      "     0.0         25638427        176      145672.9   141248       151744  void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::MaxOps<float>, unsig…\n",
-      "     0.0         24934651        192      129868.0     6208       254687  void at::native::unrolled_elementwise_kernel<at::native::MulFunctor<float>, at::detail::Array<char*…\n",
-      "     0.0         24768005        624       39692.3     3872       133248  void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::func_wrapper_t<float…\n",
-      "     0.0         14595499        520       28068.3    11168       322080  _ZN2at6native89_GLOBAL__N__65_tmpxft_00001007_00000000_13_DistributionNormal_compute_86_cpp1_ii_7d8…\n",
-      "     0.0         14342499       3168        4527.3     4031        11296  void cuComputeGradGammaBeta<float, float>(float const*, float const*, int, int, int, float*, float*)\n",
-      "     0.0          4617682        192       24050.4    18080        34848  void at::native::(anonymous namespace)::embedding_backward_feature_kernel<float, float, long>(long*…\n",
-      "     0.0          4383059        352       12451.9    11072        14975  void at::native::(anonymous namespace)::indexSelectLargeIndex<float, long, unsigned int, 2, 2, -2, …\n",
-      "     0.0          2511679        192       13081.7    10368        22560  kernel_0                                                                                            \n",
-      "     0.0          2431636        272        8939.8     7648        10272  void at::native::index_elementwise_kernel<128, 4, at::native::gpu_index_kernel<at::native::index_ke…\n",
-      "     0.0          1403583        352        3987.5     2976        11584  void cub::DeviceReduceSingleTileKernel<cub::DeviceReducePolicy<bool, int, int, cub::Sum>::Policy600…\n",
-      "     0.0          1350339        352        3836.2     2911         5216  void cub::DeviceSelectSweepKernel<cub::DispatchSelectIf<cub::CountingInputIterator<long, long>, cub…\n",
-      "     0.0          1254812        176        7129.6     6464         9217  void at::native::triu_tril_kernel<float, int, false>(at::cuda::detail::TensorInfo<float, int>, at::…\n",
-      "     0.0          1248508        448        2786.8     2399         3840  void (anonymous namespace)::elementwise_kernel_with_index<int, at::native::arange_cuda_out(at::Tens…\n",
-      "     0.0           991701        272        3646.0     2752         4512  void at::native::vectorized_elementwise_kernel<4, at::native::DivFunctor<float>, at::detail::Array<…\n",
-      "     0.0           942900        352        2678.7     2336         3552  void cub::DeviceCompactInitKernel<cub::ScanTileState<int, true>, int*>(cub::ScanTileState<int, true…\n",
-      "     0.0           872668         96        9090.3     8352        10048  void at::native::unrolled_elementwise_kernel<at::native::copy_device_to_device(at::TensorIterator&,…\n",
-      "     0.0           826148        176        4694.0     4352         5601  void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::CompareLTFu…\n",
-      "     0.0           775099         96        8073.9     7424         8992  void at::native::index_elementwise_kernel<128, 4, at::native::gpu_index_kernel<at::native::index_pu…\n",
-      "     0.0           765821        176        4351.3     3936         5184  void at::native::vectorized_elementwise_kernel<4, at::native::log_kernel_cuda(at::TensorIterator&):…\n",
-      "     0.0           688480        176        3911.8     3456         5024  void at::native::unrolled_elementwise_kernel<at::native::BUnaryFunctor<at::native::CompareLTFunctor…\n",
-      "     0.0           608608        176        3458.0     3136         4384  void at::native::unrolled_elementwise_kernel<at::native::BUnaryFunctor<at::native::CompareGEFunctor…\n",
-      "     0.0           548866        176        3118.6     2912         3680  void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::AddFunctor<…\n",
-      "     0.0           543391        176        3087.4     2816         4000  void at::native::vectorized_elementwise_kernel<4, at::native::BitwiseOrFunctor<bool>, at::detail::A…\n",
-      "     0.0           502590        176        2855.6     2655         3488  void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<long>, at::detail::Array<…\n",
-      "     0.0           383140         96        3991.0     3744         4480  cleanup(float*, float*, float*, float*, bool, int)                                                  \n",
-      "     0.0           319140         96        3324.4     2912        16032  void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor<at::native::AddFunctor<…\n",
-      "     0.0           313667         96        3267.4     2912         3968  void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::pow_tensor_sca…\n",
-      "     0.0           287259         96        2992.3     2784         3328  void at::native::vectorized_elementwise_kernel<4, at::native::AUnaryFunctor<at::native::AddFunctor<…\n",
-      "\n",
-      "\n",
-      "\n",
-      "CUDA Memory Operation Statistics (by time):\n",
-      "\n",
-      " Time(%)  Total Time (ns)  Operations   Average   Minimum   Maximum       Operation     \n",
-      " -------  ---------------  ----------  ---------  -------  ---------  ------------------\n",
-      "    89.1       3465094446        1380  2510938.0     1343  757281996  [CUDA memcpy DtoH]\n",
-      "     9.9        385763414       21152    18237.7     2431     527263  [CUDA memcpy DtoD]\n",
-      "     0.8         31439598       17792     1767.1     1247      18912  [CUDA memset]     \n",
-      "     0.2          6520963        3136     2079.4     1247     274623  [CUDA memcpy HtoD]\n",
-      "\n",
-      "\n",
-      "\n",
-      "CUDA Memory Operation Statistics (by size in KiB):\n",
-      "\n",
-      "     Total      Operations  Average   Minimum   Maximum        Operation     \n",
-      " -------------  ----------  --------  -------  ----------  ------------------\n",
-      "   4433452.563       17792   249.182    0.004   10240.000  [CUDA memset]     \n",
-      "   2971617.438        1380  2153.346    0.004  201216.000  [CUDA memcpy DtoH]\n",
-      "     19824.688        3136     6.322    0.004    2048.000  [CUDA memcpy HtoD]\n",
-      " 120454945.375       21152  5694.731    0.004  201216.000  [CUDA memcpy DtoD]\n",
-      "\n",
-      "\n",
-      "\n",
-      "Operating System Runtime API Statistics:\n",
-      "\n",
-      " Time(%)  Total Time (ns)  Num Calls     Average     Minimum     Maximum             Name         \n",
-      " -------  ---------------  ---------  -------------  --------  -----------  ----------------------\n",
-      "    31.8    5593434303269        368  15199549737.1     23610  79120616452  pthread_cond_wait     \n",
-      "    27.9    4910125179481      17570    279460738.7      1963  10130282621  pthread_cond_timedwait\n",
-      "    24.3    4278054729224      34186    125140546.7      1290  44497241962  poll                  \n",
-      "    10.3    1803196509173        355   5079426786.4     23310  33853558996  sem_wait              \n",
-      "     3.6     641269443401        576   1113315005.9      1015   2997228374  futex                 \n",
-      "     0.8     145245328959        688    211112396.7  10065411   1007923775  select                \n",
-      "     0.6     109784339681      50957      2154450.6      1003    109559784  ioctl                 \n",
-      "     0.2      28042017648        117    239675364.5    164118   2317633343  fork                  \n",
-      "     0.1      15313635731        988     15499631.3    106209   1353294979  pthread_create        \n",
-      "     0.1      13206505992      11995      1101000.9       999    319033247  read                  \n",
-      "     0.1      12657774292       4603      2749896.7      1004    119470493  pthread_tryjoin_np    \n",
-      "     0.1       9296666389       2070      4491143.2      1020    742142953  write                 \n",
-      "     0.0       8550210410        167     51198864.7   5064806   1000155935  nanosleep             \n",
-      "     0.0       5473575733        340     16098752.2      1716    753451009  accept                \n",
-      "     0.0       4563965549       5670       804932.2      1000    744010044  recv                  \n",
-      "     0.0       2881757706       8761       328930.2      3498     52730526  open64                \n",
-      "     0.0       2738750168        970      2823453.8      1029    108977691  pthread_cond_signal   \n",
-      "     0.0       1674433469        695      2409256.8      3814    447909461  pthread_join          \n",
-      "     0.0       1307932755        574      2278628.5       999    552483526  waitpid               \n",
-      "     0.0        787392363        116      6787865.2      1039    118398933  pthread_mutex_lock    \n",
-      "     0.0        466428334       7253        64308.3      1000     29379070  send                  \n",
-      "     0.0        365132680       3392       107645.2      2498     19909417  mmap                  \n",
-      "     0.0        342197730        563       607811.2      6018     44240427  connect               \n",
-      "     0.0        249098555       3159        78853.6      1421     34377391  fopen                 \n",
-      "     0.0        217823404        260       837782.3      2407     34361262  pipe2                 \n",
-      "     0.0        217659306         40      5441482.7      1558     48632942  openat                \n",
-      "     0.0        180993013       4202        43073.1      1374     32116690  munmap                \n",
-      "     0.0         77052008         18      4280667.1     61137     13838109  sleep                 \n",
-      "     0.0         61461276       3602        17063.1      2001     20022904  mmap64                \n",
-      "     0.0         48242345        204       236482.1      2249     29899269  open                  \n",
-      "     0.0         39481067        228       173162.6      1032      3767436  recvmsg               \n",
-      "     0.0         38664317        726        53256.6      1774     27390519  socket                \n",
-      "     0.0         36313974       1633        22237.6      1000      1736639  fgets                 \n",
-      "     0.0         22467169       2648         8484.6      1000     11488792  fread                 \n",
-      "     0.0         20593267         73       282099.5     29914      5176911  sem_timedwait         \n",
-      "     0.0         19633001       3116         6300.7      1000      7802890  fclose                \n",
-      "     0.0          8898623         86       103472.4      1007      8486628  fgetc                 \n",
-      "     0.0          6623741       1610         4114.1       999      3032106  sched_yield           \n",
-      "     0.0          3582154          3      1194051.3    343132      2284636  pthread_rwlock_wrlock \n",
-      "     0.0          1912398         26        73553.8      2155      1181798  fopen64               \n",
-      "     0.0          1550002         20        77500.1      2507       938762  fwrite_unlocked       \n",
-      "     0.0          1303265        949         1373.3      1000       123053  fcntl                 \n",
-      "     0.0           551681        160         3448.0      1014       141280  bind                  \n",
-      "     0.0           546706         41        13334.3      1026       224142  getdelim              \n",
-      "     0.0           244769         73         3353.0      1000        22723  fflush                \n",
-      "     0.0           198165         80         2477.1      1081         9301  waitid                \n",
-      "     0.0           172780        122         1416.2      1001         3555  listen                \n",
-      "     0.0           152498         16         9531.1      1347        20498  sendmsg               \n",
-      "     0.0           130015          1       130015.0    130015       130015  pthread_cond_broadcast\n",
-      "     0.0           117991          3        39330.3      2098       107700  fputs                 \n",
-      "     0.0           114666          2        57333.0     19023        95643  wait                  \n",
-      "     0.0           104388         36         2899.7      1074         7418  signal                \n",
-      "     0.0            98697         59         1672.8      1009         6189  sigaction             \n",
-      "     0.0            85207         12         7100.6      4060        27106  pipe                  \n",
-      "     0.0            66940         38         1761.6      1187         3590  pthread_mutex_trylock \n",
-      "     0.0            66109         13         5085.3      3490         7378  fputs_unlocked        \n",
-      "     0.0            64428         54         1193.1      1001         1787  dup                   \n",
-      "     0.0            43299          7         6185.6      3683        16998  wait3                 \n",
-      "     0.0             3502          2         1751.0      1396         2106  dup2                  \n",
-      "\n",
-      "Report file moved to \"/home/zcharpy/profiles/GPT360M_naive.qdrep\"\n",
-      "Report file moved to \"/home/zcharpy/profiles/GPT360M_naive.sqlite\"\n"
-     ]
-    }
-   ],
-   "source": [
-    "!bash ./Megatron-LM/nsys_test.sh"
-   ]
-  },
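-  {
-   "cell_type": "markdown",
-   "id": "offline-profile-note",
-   "metadata": {},
-   "source": [
-    "The run above saved its profile as `GPT360M_naive.qdrep` and `GPT360M_naive.sqlite`. As a minimal sketch (assuming `nsys` is on your `PATH`, with the report path taken from the log above), the summary tables can be regenerated offline without rerunning training:\n",
-    "```\n",
-    "# re-derive the CUDA API / kernel / memory summary tables from the saved report\n",
-    "nsys stats /home/zcharpy/profiles/GPT360M_naive.qdrep\n",
-    "```\n",
-    "The same `.qdrep` file can also be opened in the Nsight Systems GUI for interactive timeline inspection."
-   ]
-  },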
-  {
-   "cell_type": "markdown",
-   "id": "governmental-welding",
-   "metadata": {},
-   "source": [
-    "---\n",
-    "## below is a ReRun cell to experiment training configurations\n",
-    "<a id=\"Rerun_Cell\"></a>"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "requested-clause",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!bash ./Megatron-LM/dlprof_2nd_run.sh"
-   ]
-  },
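-  {
-   "cell_type": "markdown",
-   "id": "gpu-utils-monitoring",
-   "metadata": {},
-   "source": [
-    "While the re-run cell above is executing, watch GPU utilization from a separate terminal. A minimal sketch (assuming `nvidia-smi` is available on the node):\n",
-    "```\n",
-    "# refresh device-level utilization once per second\n",
-    "watch -n 1 nvidia-smi\n",
-    "# or stream per-GPU sm/memory utilization as a compact log\n",
-    "nvidia-smi dmon -s u\n",
-    "```\n",
-    "The `sm` column reported by `nvidia-smi dmon` is the number to compare against the >80% utilization target in the challenge below."
-   ]
-  },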
-  {
-   "cell_type": "markdown",
-   "id": "sunrise-borough",
-   "metadata": {},
-   "source": [
-    "----------------\n",
-    "\n",
-    "## **Challenge ** - the best profile\n",
-    "\n",
-    "with the exact same compute limitations ( i.e # of gpus you currently have ) \n",
-    "\n",
-    "task: modify the [profiling bash script](./Megatron-LM/dlprof_2nd_run.sh) and rerun \n",
-    "<a href=\"./Day2-5_Observe_GPT_runs_vs_performance.ipynb#Rerun_Cell\">Jump to ReRun Cell</a> \n",
-    "monitor the training runs to get an overall >80% gpu utils in **training** runs \n",
-    "\n",
-    "```\n",
-    "    TENSOR_MP_SIZE=1\n",
-    "    PIPELINE_MP_SIZE=1\n",
-    "\n",
-    "    #GPT Config \n",
-    "    LAYERS= \n",
-    "    HIDDEN_SIZE=\n",
-    "    ATTN_HEADS=\n",
-    "    MICRO_BZ=\n",
-    "    GB_BZ=\n",
-    "    SEQ_LEN=\n",
-    "    MAX_POS_EM=\n",
-    "``` \n",
-    "<a id=\"TheChallenge\"></a>"
-   ]
-  },
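-  {
-   "cell_type": "markdown",
-   "id": "challenge-config-sketch",
-   "metadata": {},
-   "source": [
-    "Purely as an illustrative starting point (not a solution; every value below is an assumption to tune for your own GPUs), note the consistency rules a filled-in configuration must satisfy: `HIDDEN_SIZE` must be divisible by `ATTN_HEADS`, `SEQ_LEN` must not exceed `MAX_POS_EM`, and the global batch size must be divisible by `MICRO_BZ` times the data-parallel size.\n",
-    "```\n",
-    "# hypothetical example only - tune for your own hardware\n",
-    "TENSOR_MP_SIZE=1\n",
-    "PIPELINE_MP_SIZE=1\n",
-    "\n",
-    "#GPT Config\n",
-    "LAYERS=16\n",
-    "HIDDEN_SIZE=1024   # divisible by ATTN_HEADS\n",
-    "ATTN_HEADS=16\n",
-    "MICRO_BZ=2\n",
-    "GB_BZ=16           # divisible by MICRO_BZ * data-parallel size\n",
-    "SEQ_LEN=512\n",
-    "MAX_POS_EM=512     # must be >= SEQ_LEN\n",
-    "```"
-   ]
-  },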
-  {
-   "cell_type": "markdown",
-   "id": "streaming-artist",
-   "metadata": {},
-   "source": [
-    "-----\n",
-    "\n",
-    "\n",
-    "## Licensing \n",
-    "\n",
-    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.8.8"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}