
Lab2 notebooks

zenodia 4 years ago
Parent
Current commit
915d0ca26c

+ 3 - 4
ai/Megatron/English/Python/Start_Here.ipynb

@@ -134,7 +134,6 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# verify profiling capability \n",
     "!nsys status -e"
    ]
   },
@@ -191,15 +190,15 @@
     "- **Outlines of Lab 1**\n",
     "    Megatron 101 in half a day - Please go through the below notebooks sequentially.\n",
     "    1. [WebCrawling to obtain raw text data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab1-1_Website_scrapping.ipynb)\n",
-    "    2. [Estimate hours/days needed to execute one end-to-end run per Megatron configuration](./jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb)\n",
-    "    3. [Understanding the core of Megatron - mpu ](./jupyter_notebook/Lab1-3_MegatronFundementals.ipynb)\n",
+    "    2. [Estimate hours/days needed to execute one end-to-end run per Megatron-LM configuration](./jupyter_notebook/Lab1-2_EstimateComputeDaysNeeded.ipynb)\n",
+    "    3. [Understanding the core of Megatron-LM - mpu ](./jupyter_notebook/Lab1-3_MegatronFundementals.ipynb)\n",
     "    4. [About GPT's tokenizer](./jupyter_notebook/Lab1-4_GPT_vocab_merge_files.ipynb)\n",
     "    5. [jsonfy and convert to mmap format](./jupyter_notebook/Lab1-5_jsonfy_and_process2mmap.ipynb)\n",
     "    6. [Megatron runs vs config](./jupyter_notebook/Lab1-6_Observe_GPT_runs_vs_performance.ipynb)\n",
     "\n",
     "\n",
     "- **Outlines of Lab 2**\n",
-    "    Getting started on training own language Megatron GPT models -- Please go through the below notebooks sequentially.\n",
+    "    Getting started on training own language Megatron-LM GPT models -- Please go through the below notebooks sequentially.\n",
     "    1. [Fetch and extract Swedish data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-1_acquiring_data.ipynb)\n",
     "    2. [Find sentence boundary and deduplicate your data](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-2_SentenceBoundary_and_Deduplicate.ipynb)\n",
     "        - [mini challenge - approaching groundtruth](./jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-2_SentenceBoundary_and_Deduplicate.ipynb#TheChallenge)\n",

File diff is too large to display
+ 0 - 25645
ai/Megatron/English/Python/jupyter_notebook/Day3-4_customize_process2mmap.ipynb


+ 0 - 610
ai/Megatron/English/Python/jupyter_notebook/Day3-5_run_Megatron_with_varying_config.ipynb

@@ -1,610 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "posted-subcommittee",
-   "metadata": {},
-   "source": [
-    "# Monitor GPT training performance with varying config\n",
-    "---\n",
-    "\n",
-    "## **Challenge ** - Go big or go home !\n",
-    "- prerequisites : \n",
-    "    - use your current given # of gpus\n",
-    "    - do NOT changing the following parameters **--train-samples 100 **\n",
-    "    - you cannot go OOM \n",
-    "    - you must sustain >60% GPUs utilization in the **training** phase \n",
-    "    - training run must be finished and checkpoint must be saved successfully\n",
-    "\n",
-    "\n",
-    "- task : \n",
-    "        given the above prerequisites, train as BIG a GPT model as possible\n",
-    "\n",
-    "\n",
-    "\n",
-    "- winning criteria : the biggest model wins given the above constraints(=prerequisites).\n",
-    "\n",
-    "    Go directly to the challenge , link here --> <a href=\"./Day3-5_run_Megatron_with_varying_config.ipynb#Rerun_Cell\">Jump to ReRun Cell</a> \n",
-    "\n",
-    "```\n",
-    "                        #### the follow params are allowed to change \n",
-    "                        WORLD_SIZE=8 # <--- remember to change the number of GPUs you actually have in your system\n",
-    "                        GPUS_PER_NODE=8 # <--- remember to change the number of GPUs you actually have in your system\n",
-    "\n",
-    "                        TENSOR_MP_SIZE=8\n",
-    "                        PIPELINE_MP_SIZE=1\n",
-    "                        LYS=32\n",
-    "                        HIDDEN_SZ=2048\n",
-    "                        NUM_ATTN_HEADS=32\n",
-    "                        MICRO_BZ=\n",
-    "                        GLOBAL_BZ=\n",
-    "                        SEQ_LEN=\n",
-    "                        MAX_POS_EM=\n",
-    "                        #### ---------------------------#### \n",
-    "``` \n",
-    "                                ----------------------------For your reference --------------------------\n",
-    "<center><img src=\"./Megatron-LM/pics/GPT3_all.png\" width=\"700\"/></center>"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "turned-lender",
-   "metadata": {},
-   "source": [
-    "---\n",
-    "# Hint :\n",
-    "### call out a terminal and type in **nvidia-smi** to monitor the GPUs' utils and power consumption \n",
-    "### remember to fill up the GPU memory\n",
-    "![call out a terminal ](./Megatron-LM/pics/Alt_callout2terminals.JPG)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "controlling-advertiser",
-   "metadata": {},
-   "source": [
-    "---\n",
-    "## modify and rerun the below to get a even bigger GPT model \n",
-    "<a id=\"MODIFY_CELL\"></a>\n",
-    "\n",
-    "<a href=\"./Day3-5_run_Megatron_with_varying_config.ipynb#Rerun_Cell\">Jump to ReRun Cell</a> "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "improving-speech",
-   "metadata": {},
-   "source": [
-    "<a id=\"Rerun_Cell\"></a>"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "id": "sharing-headline",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!rm -fr ../sv_ckpt/* "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "id": "warming-brooklyn",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Overwriting ./Megatron-LM/profile_SVGPT_BIG.sh\n"
-     ]
-    }
-   ],
-   "source": [
-    "%%writefile ./Megatron-LM/profile_SVGPT_BIG.sh\n",
-    "# Copyright (c) 2020 NVIDIA Corporation.  All rights reserved.\n",
-    "MASTER_ADDR=localhost\n",
-    "MASTER_PORT=6000\n",
-    "NNODES=1 #<-- currently we are using 1 node multigpus\n",
-    "NODE_RANK=0\n",
-    "\n",
-    "### modify this section to point the file to its own path \n",
-    "CHECKPOINT_PATH='../sv_ckpt/'\n",
-    "DATA_PATH='../dataset/SV/webnyheter2013_56kvocab_text_document'\n",
-    "VOCAB_FILE='../dataset/SV/56k/vocab.json'\n",
-    "MERGE_FILE='../dataset/SV/56k/merges.txt'\n",
-    "PROFILE_OUTPUT_PATH='../profiles/SV/nsys_sv_' # modify this to your own profile path\n",
-    "\n",
-    "#### [TODO]--------------- params in the following block are allowed to change -----------#### \n",
-    "WORLD_SIZE=2 # <--- remember to change the number of GPUs you actually have in your system\n",
-    "GPUS_PER_NODE=2 # <--- remember to change the number of GPUs you actually have in your system\n",
-    "\n",
-    "TENSOR_MP_SIZE=2\n",
-    "PIPELINE_MP_SIZE=1\n",
-    "LAYERS=32\n",
-    "HIDDEN_SZ=4096\n",
-    "NUM_ATTN_HEADS=32\n",
-    "MICRO_BZ=8\n",
-    "GLOBAL_BZ=512\n",
-    "SEQ_LEN=512\n",
-    "MAX_POS_EM=512\n",
-    "#### -------------------- end of blocks ------------------------#### \n",
-    "\n",
-    "export OMP_NUM_THREADS=1\n",
-    "DISTRIBUTED_ARGS=\"--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT\"\n",
-    "\n",
-    "## for nsys run\n",
-    "#nsys profile --stats=false --force-overwrite=true --duration=300 --trace=cudnn,cuda,osrt,nvtx -o $PROFILE_OUTPUT_PATH \\\n",
-    "python -m torch.distributed.launch $DISTRIBUTED_ARGS \\\n",
-    "    ./Megatron-LM/Dlprof_pretrain_gpt.py \\\n",
-    "       --tensor-model-parallel-size $TENSOR_MP_SIZE \\\n",
-    "       --pipeline-model-parallel-size $PIPELINE_MP_SIZE \\\n",
-    "       --num-layers $LAYERS \\\n",
-    "       --hidden-size $HIDDEN_SZ \\\n",
-    "       --num-attention-heads $NUM_ATTN_HEADS \\\n",
-    "       --micro-batch-size $MICRO_BZ \\\n",
-    "       --global-batch-size $GLOBAL_BZ \\\n",
-    "       --seq-length $SEQ_LEN \\\n",
-    "       --max-position-embeddings $MAX_POS_EM \\\n",
-    "       --train-samples 100 \\\n",
-    "       --save $CHECKPOINT_PATH \\\n",
-    "       --load $CHECKPOINT_PATH \\\n",
-    "       --data-path 1. $DATA_PATH \\\n",
-    "       --vocab-file $VOCAB_FILE \\\n",
-    "       --merge-file $MERGE_FILE \\\n",
-    "       --data-impl mmap \\\n",
-    "       --split 949,50,1 \\\n",
-    "       --distributed-backend nccl \\\n",
-    "       --lr 0.00015 \\\n",
-    "       --lr-decay-style cosine \\\n",
-    "       --min-lr 1.0e-5 \\\n",
-    "       --weight-decay 1e-2 \\\n",
-    "       --clip-grad 1.0 \\\n",
-    "       --lr-warmup-fraction .01 \\\n",
-    "       --checkpoint-activations \\\n",
-    "       --log-interval 10 \\\n",
-    "       --save-interval 100 \\\n",
-    "       --eval-interval 200 \\\n",
-    "       --eval-iters 10 \\\n",
-    "       --fp16"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "protective-private",
-   "metadata": {},
-   "source": [
-    "---\n",
-    "## Check how big is your model - \n",
-    "modify the parameters in the [params_cnt.sh](./params_cnt.sh)\n",
-    "I got 6.6 Billion :)  what about you ?"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "id": "pretty-laser",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "6\n",
-      "6675628032\n"
-     ]
-    }
-   ],
-   "source": [
-    "!bash params_cnt.sh $LAYERS $HIDDEN_SZ $NUM_ATTN_HEADS $SEQ_LEN"
-   ]
-  },
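Note on the parameter count in the cell above: the contents of `params_cnt.sh` are not part of this diff, so the following is only a sketch of one formula that reproduces the logged value `6675628032`. It uses the closed-form estimate from the "Efficient Large-Scale Language Model Training on GPU Clusters" paper linked in this notebook, P = 12·l·h² + 13·l·h + (V + s)·h, with the unpadded vocabulary size of 56000 reported in the run log. The function name `gpt_param_count` is hypothetical.

```python
def gpt_param_count(num_layers: int, hidden_size: int,
                    vocab_size: int, seq_length: int) -> int:
    """Estimate GPT parameters as 12*l*h^2 + 13*l*h + (V + s)*h.

    Assumption: this is the closed-form estimate from the paper cited in the
    notebook; it is not taken from params_cnt.sh, which is not shown here.
    """
    l, h, V, s = num_layers, hidden_size, vocab_size, seq_length
    transformer_blocks = 12 * l * h * h + 13 * l * h  # attention + MLP + layernorms
    embeddings = (V + s) * h                          # token + position embeddings
    return transformer_blocks + embeddings


# Values from the removed profile_SVGPT_BIG.sh and its run log:
# LAYERS=32, HIDDEN_SZ=4096, SEQ_LEN=512, vocab size 56000 (before padding).
total = gpt_param_count(num_layers=32, hidden_size=4096,
                        vocab_size=56000, seq_length=512)
print(total)  # 6675628032, the value printed by the cell above
```

The number of attention heads does not appear in this estimate, so under this formula changing `NUM_ATTN_HEADS` alone would not change the reported model size.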
-  {
-   "cell_type": "markdown",
-   "id": "acknowledged-thinking",
-   "metadata": {},
-   "source": [
-    "---\n",
-    "#### you should see something similar to the following \n",
-    "\n",
-    "            training ...\n",
-    "            time (ms) | model-and-optimizer-setup: 4013.85 | train/valid/test-data-iterators-setup: 2773.74\n",
-    "            [after training is done] datetime: 2021-08-27 06:24:46 \n",
-    "            ------------------------------------------------------------------------------------------------------------------\n",
-    "             validation loss at the end of training for val data | lm loss value: 1.124495E+01 | lm loss PPL: 7.649290E+04 | \n",
-    "            ------------------------------------------------------------------------------------------------------------------\n",
-    "            Processing events...\n",
-    "            Capturing symbol files...\n",
-    "            Saving temporary \"/tmp/nsys-report-96a7-0101-ea4b-0ee5.qdstrm\" file to disk...\n",
-    "            Creating final output files...\n",
-    "\n",
-    "            Processing [==============================================================100%]\n",
-    "            Saved report file to \"/tmp/nsys-report-96a7-0101-ea4b-0ee5.qdrep\"\n",
-    "            Report file moved to \"/home/zcharpy/profiles/DLprof/2ndrun/nsys_improved.qdrep\""
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "imperial-fellowship",
-   "metadata": {},
-   "source": [
-    "---\n",
-    "# Re-run this cell below to get an even bigger GPT model\n",
-    "\n",
-    "Remember to modify the [params count](./params_cnt.sh) to check how big is your model\n",
-    "\n",
-    "Jump back and mdify the profile_SVGPT_BIG.sh, click here --> \n",
-    "<a href=\"./Day3-5_run_Megatron_with_varying_config.ipynb#MODIFY_CELL\">Jump back to modify and overwrite profile_SVGPT_BIG.sh </a> \n",
-    "<a id=\"Rerun_Cell\"></a>"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "id": "continued-yahoo",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Initializing NVTX monkey patchesInitializing NVTX monkey patches\n",
-      "\n",
-      "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:144: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead\n",
-      "  warnings.warn(\"torch.distributed.reduce_op is deprecated, please use \"\n",
-      "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:144: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead\n",
-      "  warnings.warn(\"torch.distributed.reduce_op is deprecated, please use \"\n",
-      "Done with NVTX monkey patchingDone with NVTX monkey patching\n",
-      "\n",
-      "using world size: 2, data-parallel-size: 1, tensor-model-parallel size: 2, pipeline-model-parallel size: 1 \n",
-      "using torch.float16 for parameters ...\n",
-      "------------------------ arguments ------------------------\n",
-      "  accumulate_allreduce_grads_in_fp32 .............. False\n",
-      "  adam_beta1 ...................................... 0.9\n",
-      "  adam_beta2 ...................................... 0.999\n",
-      "  adam_eps ........................................ 1e-08\n",
-      "  adlr_autoresume ................................. False\n",
-      "  adlr_autoresume_interval ........................ 1000\n",
-      "  apply_query_key_layer_scaling ................... True\n",
-      "  apply_residual_connection_post_layernorm ........ False\n",
-      "  attention_dropout ............................... 0.1\n",
-      "  attention_softmax_in_fp32 ....................... False\n",
-      "  bert_binary_head ................................ True\n",
-      "  bert_load ....................................... None\n",
-      "  bf16 ............................................ False\n",
-      "  bias_dropout_fusion ............................. True\n",
-      "  bias_gelu_fusion ................................ True\n",
-      "  biencoder_projection_dim ........................ 0\n",
-      "  biencoder_shared_query_context_model ............ False\n",
-      "  block_data_path ................................. None\n",
-      "  checkpoint_activations .......................... True\n",
-      "  checkpoint_num_layers ........................... 1\n",
-      "  clip_grad ....................................... 1.0\n",
-      "  consumed_train_samples .......................... 0\n",
-      "  consumed_valid_samples .......................... 0\n",
-      "  data_impl ....................................... mmap\n",
-      "  data_parallel_size .............................. 1\n",
-      "  data_path ....................................... ['1.', '../dataset/SV/webnyheter2013_56kvocab_text_document']\n",
-      "  dataloader_type ................................. single\n",
-      "  DDP_impl ........................................ local\n",
-      "  decoder_seq_length .............................. None\n",
-      "  distribute_checkpointed_activations ............. False\n",
-      "  distributed_backend ............................. nccl\n",
-      "  embedding_path .................................. None\n",
-      "  encoder_seq_length .............................. 512\n",
-      "  eod_mask_loss ................................... False\n",
-      "  eval_interval ................................... 200\n",
-      "  eval_iters ...................................... 10\n",
-      "  evidence_data_path .............................. None\n",
-      "  exit_duration_in_mins ........................... None\n",
-      "  exit_interval ................................... None\n",
-      "  ffn_hidden_size ................................. 16384\n",
-      "  finetune ........................................ False\n",
-      "  fp16 ............................................ True\n",
-      "  fp16_lm_cross_entropy ........................... False\n",
-      "  fp32_residual_connection ........................ False\n",
-      "  global_batch_size ............................... 512\n",
-      "  hidden_dropout .................................. 0.1\n",
-      "  hidden_size ..................................... 4096\n",
-      "  hysteresis ...................................... 2\n",
-      "  ict_head_size ................................... None\n",
-      "  ict_load ........................................ None\n",
-      "  img_dim ......................................... 224\n",
-      "  indexer_batch_size .............................. 128\n",
-      "  indexer_log_interval ............................ 1000\n",
-      "  init_method_std ................................. 0.02\n",
-      "  init_method_xavier_uniform ...................... False\n",
-      "  initial_loss_scale .............................. 4294967296\n",
-      "  kv_channels ..................................... 128\n",
-      "  layernorm_epsilon ............................... 1e-05\n",
-      "  lazy_mpu_init ................................... None\n",
-      "  load ............................................ ../sv_ckpt/\n",
-      "  local_rank ...................................... 0\n",
-      "  log_batch_size_to_tensorboard ................... False\n",
-      "  log_interval .................................... 10\n",
-      "  log_learning_rate_to_tensorboard ................ True\n",
-      "  log_loss_scale_to_tensorboard ................... True\n",
-      "  log_num_zeros_in_grad ........................... False\n",
-      "  log_params_norm ................................. False\n",
-      "  log_timers_to_tensorboard ....................... False\n",
-      "  log_validation_ppl_to_tensorboard ............... False\n",
-      "  loss_scale ...................................... None\n",
-      "  loss_scale_window ............................... 1000\n",
-      "  lr .............................................. 0.00015\n",
-      "  lr_decay_iters .................................. None\n",
-      "  lr_decay_samples ................................ None\n",
-      "  lr_decay_style .................................. cosine\n",
-      "  lr_warmup_fraction .............................. 0.01\n",
-      "  lr_warmup_iters ................................. 0\n",
-      "  lr_warmup_samples ............................... 0\n",
-      "  make_vocab_size_divisible_by .................... 128\n",
-      "  mask_prob ....................................... 0.15\n",
-      "  masked_softmax_fusion ........................... True\n",
-      "  max_position_embeddings ......................... 512\n",
-      "  merge_file ...................................... ../dataset/SV/56k/merges.txt\n",
-      "  micro_batch_size ................................ 8\n",
-      "  min_loss_scale .................................. 1.0\n",
-      "  min_lr .......................................... 1e-05\n",
-      "  mmap_warmup ..................................... False\n",
-      "  no_load_optim ................................... None\n",
-      "  no_load_rng ..................................... None\n",
-      "  no_save_optim ................................... None\n",
-      "  no_save_rng ..................................... None\n",
-      "  num_attention_heads ............................. 32\n",
-      "  num_channels .................................... 3\n",
-      "  num_classes ..................................... 1000\n",
-      "  num_layers ...................................... 32\n",
-      "  num_layers_per_virtual_pipeline_stage ........... None\n",
-      "  num_workers ..................................... 2\n",
-      "  onnx_safe ....................................... None\n",
-      "  openai_gelu ..................................... False\n",
-      "  optimizer ....................................... adam\n",
-      "  override_lr_scheduler ........................... False\n",
-      "  params_dtype .................................... torch.float16\n",
-      "  patch_dim ....................................... 16\n",
-      "  pipeline_model_parallel_size .................... 1\n",
-      "  query_in_block_prob ............................. 0.1\n",
-      "  rampup_batch_size ............................... None\n",
-      "  rank ............................................ 0\n",
-      "  reset_attention_mask ............................ False\n",
-      "  reset_position_ids .............................. False\n",
-      "  retriever_report_topk_accuracies ................ []\n",
-      "  retriever_score_scaling ......................... False\n",
-      "  retriever_seq_length ............................ 256\n",
-      "  sample_rate ..................................... 1.0\n",
-      "  save ............................................ ../sv_ckpt/\n",
-      "  save_interval ................................... 100\n",
-      "  scatter_gather_tensors_in_pipeline .............. True\n",
-      "  seed ............................................ 1234\n",
-      "  seq_length ...................................... 512\n",
-      "  sgd_momentum .................................... 0.9\n",
-      "  short_seq_prob .................................. 0.1\n",
-      "  split ........................................... 949,50,1\n",
-      "  tensor_model_parallel_size ...................... 2\n",
-      "  tensorboard_dir ................................. None\n",
-      "  tensorboard_log_interval ........................ 1\n",
-      "  tensorboard_queue_size .......................... 1000\n",
-      "  titles_data_path ................................ None\n",
-      "  tokenizer_type .................................. GPT2BPETokenizer\n",
-      "  train_iters ..................................... None\n",
-      "  train_samples ................................... 100\n",
-      "  use_checkpoint_lr_scheduler ..................... False\n",
-      "  use_contiguous_buffers_in_ddp ................... False\n",
-      "  use_cpu_initialization .......................... None\n",
-      "  use_one_sent_docs ............................... False\n",
-      "  virtual_pipeline_model_parallel_size ............ None\n",
-      "  vocab_extra_ids ................................. 0\n",
-      "  vocab_file ...................................... ../dataset/SV/56k/vocab.json\n",
-      "  weight_decay .................................... 0.01\n",
-      "  world_size ...................................... 2\n",
-      "-------------------- end of arguments ---------------------\n",
-      "setting number of micro-batches to constant 64\n",
-      "> building GPT2BPETokenizer tokenizer ...\n",
-      " > padded vocab (size: 56000) with 64 dummy tokens (new size: 56064)\n",
-      "> initializing torch distributed ...\n",
-      "> initializing tensor model parallel with size 2\n",
-      "> initializing pipeline model parallel with size 1\n",
-      "> setting random seeds to 1234 ...\n",
-      "> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234\n",
-      "> compiling dataset index builder ...\n",
-      "make: Entering directory '/proj/guest_at_nsc/users/zcharpy/gpubootcamp/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/megatron/data'\n",
-      "make: Nothing to be done for 'default'.\n",
-      "make: Leaving directory '/proj/guest_at_nsc/users/zcharpy/gpubootcamp/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/megatron/data'\n",
-      ">>> done with dataset index builder. Compilation time: 0.145 seconds\n",
-      "> compiling and loading fused kernels ...\n",
-      "Detected CUDA files, patching ldflags\n",
-      "Emitting ninja build file /proj/guest_at_nsc/users/zcharpy/gpubootcamp/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/megatron/fused_kernels/build/build.ninja...\n",
-      "Building extension module scaled_upper_triang_masked_softmax_cuda...\n",
-      "Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\n",
-      "ninja: no work to do.\n",
-      "Loading extension module scaled_upper_triang_masked_softmax_cuda...\n",
-      "Detected CUDA files, patching ldflags\n",
-      "Emitting ninja build file /proj/guest_at_nsc/users/zcharpy/gpubootcamp/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/megatron/fused_kernels/build/build.ninja...\n",
-      "Building extension module scaled_masked_softmax_cuda...\n",
-      "Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\n",
-      "ninja: no work to do.\n",
-      "Loading extension module scaled_masked_softmax_cuda...\n",
-      "Detected CUDA files, patching ldflags\n",
-      "Emitting ninja build file /proj/guest_at_nsc/users/zcharpy/gpubootcamp/ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/megatron/fused_kernels/build/build.ninja...\n",
-      "Building extension module fused_mix_prec_layer_norm_cuda...\n",
-      "Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\n",
-      "ninja: no work to do.\n",
-      "Loading extension module fused_mix_prec_layer_norm_cuda...\n",
-      ">>> done with compiling and loading fused kernels. Compilation time: 2.868 seconds\n",
-      "time to initialize megatron (seconds): 43.936\n",
-      "[after megatron is initialized] datetime: 2021-09-15 11:55:55 \n",
-      "building GPT model ...\n",
-      " > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 3339395072\n",
-      " > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 3339395072\n",
-      "setting training iterations to 0\n",
-      "> learning rate decay style: cosine\n",
-      "WARNING: could not find the metadata file ../sv_ckpt/latest_checkpointed_iteration.txt \n",
-      "    will not load any checkpoints and will start from random\n",
-      "time (ms) | load-checkpoint: 2.66\n",
-      "[after model, optimizer, and learning rate scheduler are built] datetime: 2021-09-15 11:55:56 \n",
-      "> building train, validation, and test datasets ...\n",
-      " > datasets target sizes (minimum size):\n",
-      "    train:      100\n",
-      "    validation: 5120\n",
-      "    test:       5120\n",
-      "> building train, validation, and test datasets for GPT ...\n",
-      " > building dataset index ...\n",
-      "    reading sizes...\n",
-      "    reading pointers...\n",
-      "    reading document index...\n",
-      "    creating numpy buffer of mmap...\n",
-      "    creating memory view of numpy buffer...\n",
-      " > finished creating indexed dataset in 0.004941 seconds\n",
-      "    number of documents: 1249010\n",
-      " > dataset split:\n",
-      "    train:\n",
-      "     document indices in [0, 1185311) total of 1185311 documents\n",
-      "    validation:\n",
-      "     document indices in [1185311, 1247761) total of 62450 documents\n",
-      "    test:\n",
-      "     document indices in [1247761, 1249010) total of 1249 documents\n",
-      " > WARNING: could not find index map files, building the indices on rank 0 ...\n",
-      " > only one epoch required, setting separate_last_epoch to False\n",
-      " > elasped time to build and save doc-idx mapping (seconds): 0.066494\n",
-      "    using:\n",
-      "     number of documents:       1185311\n",
-      "     number of epochs:          1\n",
-      "     sequence length:           512\n",
-      "     total number of samples:   51303\n",
-      " > elasped time to build and save sample-idx mapping (seconds): 0.008808\n",
-      " > building shuffle index with split [0, 51303) and [51303, 51303) ...\n",
-      " > elasped time to build and save shuffle-idx mapping (seconds): 0.002738\n",
-      " > loading doc-idx mapping from ../dataset/SV/webnyheter2013_56kvocab_text_document_train_indexmap_101ns_512sl_1234s_doc_idx.npy\n",
-      " > loading sample-idx mapping from ../dataset/SV/webnyheter2013_56kvocab_text_document_train_indexmap_101ns_512sl_1234s_sample_idx.npy\n",
-      " > loading shuffle-idx mapping from ../dataset/SV/webnyheter2013_56kvocab_text_document_train_indexmap_101ns_512sl_1234s_shuffle_idx.npy\n",
-      "    loaded indexed file in 0.005 seconds\n",
-      "    total number of samples: 51304\n",
-      "    total number of epochs: 1\n",
-      " > WARNING: could not find index map files, building the indices on rank 0 ...\n",
-      " > last epoch number of samples (2438) is larger than 80% of number of samples per epoch (2708), setting separate_last_epoch to False\n",
-      " > elasped time to build and save doc-idx mapping (seconds): 0.005265\n",
-      "    using:\n",
-      "     number of documents:       62450\n",
-      "     number of epochs:          2\n",
-      "     sequence length:           512\n",
-      "     total number of samples:   5416\n",
-      " > elasped time to build and save sample-idx mapping (seconds): 0.001357\n",
-      " > building shuffle index with split [0, 5416) and [5416, 5416) ...\n",
-      " > elasped time to build and save shuffle-idx mapping (seconds): 0.002597\n",
-      " > loading doc-idx mapping from ../dataset/SV/webnyheter2013_56kvocab_text_document_valid_indexmap_5146ns_512sl_1234s_doc_idx.npy\n",
-      " > loading sample-idx mapping from ../dataset/SV/webnyheter2013_56kvocab_text_document_valid_indexmap_5146ns_512sl_1234s_sample_idx.npy\n",
-      " > loading shuffle-idx mapping from ../dataset/SV/webnyheter2013_56kvocab_text_document_valid_indexmap_5146ns_512sl_1234s_shuffle_idx.npy\n",
-      "    loaded indexed file in 0.002 seconds\n",
-      "    total number of samples: 5417\n",
-      "    total number of epochs: 2\n",
-      " > WARNING: could not find index map files, building the indices on rank 0 ...\n",
-      " > last epoch number of samples (12) is smaller than 80% of number of samples per epoch (54), setting separate_last_epoch to True\n",
-      " > elasped time to build and save doc-idx mapping (seconds): 0.004714\n",
-      "    using:\n",
-      "     number of documents:       1249\n",
-      "     number of epochs:          96\n",
-      "     sequence length:           512\n",
-      "     total number of samples:   5188\n",
-      " > elasped time to build and save sample-idx mapping (seconds): 0.001624\n",
-      " > building shuffle index with split [0, 5134) and [5134, 5188) ...\n",
-      " > elasped time to build and save shuffle-idx mapping (seconds): 0.001298\n",
-      " > loading doc-idx mapping from ../dataset/SV/webnyheter2013_56kvocab_text_document_test_indexmap_5146ns_512sl_1234s_doc_idx.npy\n",
-      " > loading sample-idx mapping from ../dataset/SV/webnyheter2013_56kvocab_text_document_test_indexmap_5146ns_512sl_1234s_sample_idx.npy\n",
-      " > loading shuffle-idx mapping from ../dataset/SV/webnyheter2013_56kvocab_text_document_test_indexmap_5146ns_512sl_1234s_shuffle_idx.npy\n",
-      "    loaded indexed file in 0.002 seconds\n",
-      "    total number of samples: 5189\n",
-      "    total number of epochs: 96\n",
-      "> building indices for blendable datasets ...\n",
-      " > sample ratios:\n",
-      "   dataset 0, input: 1, achieved: 1\n",
-      "> elapsed time for building blendable dataset indices: 0.00 (sec)\n",
-      "> building indices for blendable datasets ...\n",
-      " > sample ratios:\n",
-      "   dataset 0, input: 1, achieved: 1\n",
-      "> elapsed time for building blendable dataset indices: 0.00 (sec)\n",
-      "> building indices for blendable datasets ...\n",
-      " > sample ratios:\n",
-      "   dataset 0, input: 1, achieved: 1\n",
-      "> elapsed time for building blendable dataset indices: 0.00 (sec)\n",
-      "> finished creating GPT datasets ...\n",
-      "[after dataloaders are built] datetime: 2021-09-15 11:55:58 \n",
-      "done with setup ...\n",
-      "training ...\n",
-      "time (ms) | model-and-optimizer-setup: 929.42 | train/valid/test-data-iterators-setup: 1004.53\n",
-      "[after training is done] datetime: 2021-09-15 11:55:58 \n",
-      "------------------------------------------------------------------------------------------------------------------\n",
-      " validation loss at the end of training for val data | lm loss value: 1.171452E+01 | lm loss PPL: 1.223352E+05 | \n",
-      "------------------------------------------------------------------------------------------------------------------\n",
-      "Evaluating iter 10/10\n",
-      "-------------------------------------------------------------------------------------------------------------------\n",
-      " validation loss at the end of training for test data | lm loss value: 1.171400E+01 | lm loss PPL: 1.222719E+05 | \n",
-      "-------------------------------------------------------------------------------------------------------------------\n"
-     ]
-    }
-   ],
-   "source": [
-    "!bash ./Megatron-LM/profile_SVGPT_BIG.sh"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "available-relaxation",
-   "metadata": {},
-   "source": [
-    "--- \n",
-    "\n",
-    "## Additional Resources\n",
-    "\n",
-    "Language Models are Few-Shot Learners : https://arxiv.org/pdf/2005.14165.pdf\n",
-    "\n",
-    "Efficient Large-Scale Language Model Training on GPU Clusters : https://arxiv.org/pdf/2104.04473.pdf"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "naked-lodge",
-   "metadata": {},
-   "source": [
-    "---\n",
-    "\n",
-    "## Congratulations on completing the mission !\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "compound-tonight",
-   "metadata": {},
-   "source": [
-    "-----\n",
-    "\n",
-    "\n",
-    "## Licensing \n",
-    "\n",
-    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.8.8"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}

+ 101 - 416
ai/Megatron/English/Python/jupyter_notebook/Lab2-3_train_own_GPT2BPETokenizer.ipynb

@@ -2,50 +2,47 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "respected-vintage",
+   "id": "naval-commodity",
    "metadata": {},
    "source": [
-    "# Train your own GPT compatible Tokenzer and obtain vocab.json & merges.txt\n",
+    "# Train a custom GPTBPE Tokenizer \n",
     "---\n",
     "\n",
     "## Learning Objectives\n",
-    "The goal of this lab is to demonstrate how to train your own GPTBPE tokenizer on your own raw text data \n",
     "\n",
-    "- train your own GPT compatible tokenizer given own text data in own langauge\n",
-    "    1. option 1 - load from pretrained vocab and merge files, and fit to the new corpus \n",
-    "    2. option 2 - train a GPT compatible tokenizer from scratch\n",
+    "To include the vocabulary of the local language in the GPTBPE Tokenizer, we need to train a GPTBPE Tokenizer on raw text data in that language. The trained GPTBPE Tokenizer produces its own vocab.json and merges.txt files, which are compatible with Megatron-LM's GPTBPE Tokenizer. \n",
     "\n",
-    "we will elaborate how to train your own GPT compatible tokenizer and obtain vocab and merge files\n",
+    "Previously, in `Lab2-1_acquiring_data.ipynb`, we acquired our own Swedish raw text data, extracted from the data source språkbank.\n",
+    "The goal of this notebook is therefore to train our own GPTBPE Tokenizer on the Swedish raw text data obtained in `Lab2-1_acquiring_data.ipynb`.\n",
     "\n",
-    "we will be using HuggingFace's ByteLevel BPE Tokenizer and trainer to complete this task\n",
+    "We can either choose to load a previously trained GPTBPE Tokenizer by providing the vocab.json and merges.txt files to the GPTBPE Tokenizer before training further with the raw text data, or we can choose to train a completely new GPTBPE Tokenizer.\n",
     "\n",
-    "--------------------------------------------------------------------------------------------------------------------\n",
-    "First of all, we need to install the [HuggingFace Tokenizer library](https://huggingface.co/transformers/installation.html)"
+    "Both options are covered in this notebook:\n",
+    "\n",
+    "    1. option 1 - load from pretrained vocab and merge files, then continue training with the new raw text.\n",
+    "    2. option 2 - train a GPT compatible tokenizer from scratch.\n",
+    "\n",
+    "\n",
+    "We will use HuggingFace's Tokenizer library and its trainer function to train our own GPTBPE Tokenizer on our own raw text data.\n",
+    "\n",
+    "\n",
+    "First, we will install the [HuggingFace Tokenizer library](https://huggingface.co/transformers/installation.html)"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
-   "id": "transsexual-republican",
+   "execution_count": null,
+   "id": "cathedral-jumping",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Defaulting to user installation because normal site-packages is not writeable\n",
-      "Requirement already satisfied: tokenizers in /home/x_zench/.local/lib/python3.8/site-packages (0.10.3)\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "!pip install tokenizers"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
-   "id": "golden-retailer",
+   "execution_count": null,
+   "id": "designing-occasion",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -56,11 +53,13 @@
   },
   {
    "cell_type": "markdown",
-   "id": "accepting-simon",
+   "id": "separated-article",
    "metadata": {},
    "source": [
-    "-------------------------------------------------------------------------------\n",
-    "## How to use the python script below -       \n",
+    "A Python script for training a custom GPTBPE Tokenizer is provided for your convenience: \n",
+    "\n",
+    "To view the Python script, click on [trainGPTTokenizer.py](./Megatron-LM/sv_utils/trainGPTTokenizer.py)\n",
+    "\n",
     "  trainGPTTokenizer.py [-h] \n",
     "\n",
     "        optional arguments:\n",
@@ -77,217 +76,54 @@
   },
   {
    "cell_type": "markdown",
-   "id": "latin-netscape",
+   "id": "therapeutic-kentucky",
    "metadata": {},
    "source": [
-    "---\n",
-    "## Load pretrained vocab and merge files into the trainer and then train on new txt\n",
-    "#### OUTPUT should look similar to the below ---\n",
-    "        \n",
-    "        loading gpt2bpe english vocab and merge \n",
-    "        include minimal special token end of text \n",
-    "        [00:00:00] Pre-processing files (914 Mo)            ░░░░░░░░                  0%\n",
-    "        [00:00:02] Pre-processing files (914 Mo)            ░░░░░░░░                  1%\n",
-    "        [00:00:05] Pre-processing files (914 Mo)            ░░░░░░░░                  2%\n",
-    "        [00:00:07] Pre-processing files (914 Mo)            ░░░░░░░░                  3%\n",
-    "        [00:00:10] Pre-processing files (914 Mo)            ░░░░░░░░                  4%\n",
-    "        ....\n",
-    "        [00:00:19] Compute merges                           ███████░ 30080    /    32000\n",
-    "        [00:00:19] Compute merges                           ███████░ 31040    /    32000\n",
-    "        [00:00:19] Compute merges                           ████████ 31743    /    31743\n",
-    "\n",
-    "        Trained vocab size: 32000\n",
-    "        saving trained BPE model to :  ./Megatron-LM/dataset/EN/32k/\n",
-    "        model saved ! "
+    "1. option 1 - load from pretrained vocab and merge files, then continue training with the new raw text."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
-   "id": "visible-setup",
+   "execution_count": null,
+   "id": "modular-result",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "loading gpt2bpe english vocab and merge \n",
-      "\n",
-      "include minimal special token end of text \n",
-      "[00:00:00] Pre-processing files (136 Mo)            ░░░░░░░░                  0%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            ░░░░░░░░                  1%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            ░░░░░░░░                  3%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            ░░░░░░░░                  5%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            ░░░░░░░░                  7%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            ░░░░░░░░                  9%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            ░░░░░░░░                 11%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            █░░░░░░░                 14%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            █░░░░░░░                 16%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            █░░░░░░░                 18%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            █░░░░░░░                 21%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            █░░░░░░░                 23%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ██░░░░░░                 26%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ██░░░░░░                 29%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ██░░░░░░                 31%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ██░░░░░░                 34%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ██░░░░░░                 36%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ███░░░░░                 38%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ███░░░░░                 40%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ███░░░░░                 42%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ███░░░░░                 45%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ███░░░░░                 47%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ████░░░░                 50%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ████░░░░                 53%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ████░░░░                 56%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ████░░░░                 59%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ████░░░░                 62%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            █████░░░                 65%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            █████░░░                 68%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            █████░░░                 71%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            █████░░░                 74%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ██████░░                 76%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ██████░░                 79%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ██████░░                 82%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ██████░░                 85%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ███████░                 88%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:03] Pre-processing files (136 Mo)            ███████░                 91%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:03] Pre-processing files (136 Mo)            ███████░                 94%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:03] Pre-processing files (136 Mo)            ███████░                 97%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:04] Pre-processing files (136 Mo)            ████████                100%\n",
-      "[00:00:00] Tokenize words                           ████████ 0        /        0\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ░░░░░░░░ 45279    /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           █░░░░░░░ 90558    /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ██░░░░░░ 140868   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ██░░░░░░ 186147   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ███░░░░░ 236457   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ████░░░░ 286767   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           █████░░░ 337077   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ██████░░ 387387   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ██████░░ 437697   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ███████░ 488007   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ████████ 503185   /   503185\n",
-      "\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs                              █░░░░░░░ 70434    /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs                              ██░░░░░░ 140882   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs                              ███░░░░░ 216347   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs                              ████░░░░ 291798   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs                              █████░░░ 362267   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs                              ██████░░ 432688   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Count pairs                              ████████ 503185   /   503185\n",
-      "\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:04] Compute merges                           ░░░░░░░░ 560      /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:05] Compute merges                           ░░░░░░░░ 1120     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:05] Compute merges                           ░░░░░░░░ 1680     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges                           ░░░░░░░░ 2240     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges                           ░░░░░░░░ 2800     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges                           ░░░░░░░░ 3360     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           ░░░░░░░░ 3920     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           ░░░░░░░░ 4480     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           ░░░░░░░░ 5040     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           ░░░░░░░░ 5600     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           ░░░░░░░░ 6160     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           ░░░░░░░░ 6720     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           █░░░░░░░ 7280     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           █░░░░░░░ 7840     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           █░░░░░░░ 8400     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           █░░░░░░░ 8960     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           █░░░░░░░ 9520     /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           █░░░░░░░ 10080    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           █░░░░░░░ 10640    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           █░░░░░░░ 11200    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           █░░░░░░░ 11760    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           █░░░░░░░ 12320    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           █░░░░░░░ 12880    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           █░░░░░░░ 13440    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ██░░░░░░ 14000    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ██░░░░░░ 14560    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ██░░░░░░ 15120    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ██░░░░░░ 15680    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ██░░░░░░ 16240    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ██░░░░░░ 16800    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ██░░░░░░ 17360    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ██░░░░░░ 17920    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ██░░░░░░ 18480    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ██░░░░░░ 19040    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ██░░░░░░ 19600    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ██░░░░░░ 20160    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ██░░░░░░ 20720    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ███░░░░░ 21280    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ███░░░░░ 21840    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ███░░░░░ 22400    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ███░░░░░ 22960    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ███░░░░░ 23520    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ███░░░░░ 24640    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ███░░░░░ 25200    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ███░░░░░ 25760    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ███░░░░░ 26320    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ███░░░░░ 26880    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ███░░░░░ 27440    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ████░░░░ 28000    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ████░░░░ 28560    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ████░░░░ 29120    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ████░░░░ 29680    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ████░░░░ 30240    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ████░░░░ 30800    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ████░░░░ 31360    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges                           ████░░░░ 32480    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges                           ████░░░░ 33040    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges                           ████░░░░ 34160    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges                           █████░░░ 35280    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges                           █████░░░ 36400    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges                           █████░░░ 37520    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges                           █████░░░ 38640    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges                           █████░░░ 39200    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges                           █████░░░ 40320    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges                           █████░░░ 41440    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges                           ██████░░ 42560    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges                           ██████░░ 43120    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges                           ██████░░ 43680    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges                           ██████░░ 44240    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges                           ██████░░ 45360    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges                           ██████░░ 45920    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges                           ██████░░ 47040    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges                           ██████░░ 48160    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges                           ███████░ 49280    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges                           ███████░ 50400    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges                           ███████░ 51520    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges                           ███████░ 52640    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges                           ███████░ 53760    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges                           ███████░ 54880    /    56000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges                           ████████ 55743    /    55743\n",
-      "\n",
-      "Trained vocab size: 56000\n",
-      "saving trained BPE model to :  ../dataset/SV/56k/\n",
-      "model saved ! \n",
-      "\n",
-      "\n",
-      "\n",
-      "testing ...\n",
-      "\n",
-      "\n",
-      "\n",
-      "['ĠHar', 'ĠnÃ¥gon', 'Ġfunderat', 'ĠpÃ¥', 'Ġvarför', 'Ġman', 'Ġinte', 'ĠfÃ¥r', 'Ġinom', 'hu', 'ste', 'peratur', 'ens', 'Ġkurva', 'Ġsynlig', 'Ġi', 'Ġgraf', 'en', '?', 'ĠÃĦr', 'Ġdet', 'ĠnÃ¥gon', 'Ġsom', 'ĠfrÃ¥gat', 'ĠTher', 'm', 'ia', '?', 'ĠSkulle', 'Ġdet', 'Ġinte', 'Ġvara', 'Ġväsentligt', 'Ġatt', 'Ġkunna', 'Ġkolla', 'Ġhistor', 'iken', 'ĠpÃ¥', 'Ġden', 'ĠdÃ¥', 'Ġman', 'Ġskall', 'Ġställa', 'Ġin', 'Ġkurvan', '?']\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "!python ./Megatron-LM/sv_utils/trainGPTTokenizer.py --infile $raw_text_path --bpe_path $output_trained_tokenizer_model_path --load_pretrained --pretrained_gpt_dir=$pretrained_gpt_dir --vocab_size 56000"
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "roman-advocate",
+   "metadata": {},
+   "source": [
+    "Below is the expected output:\n",
+    "        \n",
+    "        [00:00:14] Compute merges                           ███████░ 51520    /    56000\n",
+    "        [00:00:14] Compute merges                           ███████░ 52640    /    56000\n",
+    "        [00:00:14] Compute merges                           ███████░ 53760    /    56000\n",
+    "        [00:00:14] Compute merges                           ███████░ 54880    /    56000\n",
+    "        [00:00:14] Compute merges                           ████████ 55743    /    55743\n",
+    "\n",
+    "        Trained vocab size: 56000\n",
+    "        saving trained BPE model to :  ../dataset/SV/56k/\n",
+    "        model saved ! \n",
+    "\n",
+    "\n",
+    "\n",
+    "        testing ...\n",
+    "\n",
+    "\n",
+    "\n",
+    "        ['ĠHar', 'ĠnÃ¥gon', 'Ġfunderat', 'ĠpÃ¥', 'Ġvarför', 'Ġman', 'Ġinte', 'ĠfÃ¥r', 'Ġinom', 'hu', 'ste', 'peratur', 'ens', 'Ġkurva', 'Ġsynlig', 'Ġi', 'Ġgraf', 'en', '?', 'ĠÃĦr', 'Ġdet', 'ĠnÃ¥gon', 'Ġsom', 'ĠfrÃ¥gat', 'ĠTher', 'm', 'ia', '?', 'ĠSkulle', 'Ġdet', 'Ġinte', 'Ġvara', 'Ġväsentligt', 'Ġatt', 'Ġkunna', 'Ġkolla', 'Ġhistor', 'iken', 'ĠpÃ¥', 'Ġden', 'ĠdÃ¥', 'Ġman', 'Ġskall', 'Ġställa', 'Ġin', 'Ġkurvan', '?']"
+   ]
+  },
+  {
    "cell_type": "code",
-   "execution_count": 9,
-   "id": "analyzed-pacific",
+   "execution_count": null,
+   "id": "better-consideration",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "merges.txt  vocab.json\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "## verify merges.txt and vocab.json exist\n",
     "!ls ../dataset/SV/56k/"
@@ -295,31 +131,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "brown-pickup",
+   "id": "mysterious-gossip",
    "metadata": {},
    "source": [
-    "---\n",
-    "## Train completely from scratch with the raw txt to obtain vocab.json and merges.txt files\n",
-    "#### OUTPUT should look similar to the below ---\n",
-    "    include minimal special token end of text \n",
-    "    [00:00:00] Pre-processing files (914 Mo)            ░░░░░░░░                  0%\n",
-    "    [00:00:02] Pre-processing files (914 Mo)            ░░░░░░░░                  1%\n",
-    "    [00:00:05] Pre-processing files (914 Mo)            ░░░░░░░░                  2%\n",
-    "    [00:00:07] Pre-processing files (914 Mo)            ░░░░░░░░                  3%\n",
-    "    ...\n",
-    "    [00:00:18] Compute merges                           ███████░ 30400    /    32000\n",
-    "    [00:00:18] Compute merges                           ███████░ 31360    /    32000\n",
-    "    [00:00:19] Compute merges                           ████████ 31743    /    31743\n",
-    "\n",
-    "    Trained vocab size: 32000\n",
-    "    saving trained BPE model to :  ./Megatron-LM/dataset/EN/32k/\n",
-    "    model saved ! \n"
+    "2. option 2 - train a GPT compatible tokenizer from scratch."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
-   "id": "pointed-toner",
+   "execution_count": null,
+   "id": "unexpected-cowboy",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -329,171 +150,44 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
-   "id": "north-reality",
+   "execution_count": null,
+   "id": "atlantic-serbia",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "include minimal special token end of text \n",
-      "[00:00:00] Pre-processing files (136 Mo)            ░░░░░░░░                  0%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            ░░░░░░░░                  1%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            ░░░░░░░░                  4%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            ░░░░░░░░                  7%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            ░░░░░░░░                  9%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            ░░░░░░░░                 12%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            █░░░░░░░                 14%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            █░░░░░░░                 17%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            █░░░░░░░                 20%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            █░░░░░░░                 23%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            ██░░░░░░                 26%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo)            ██░░░░░░                 29%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ██░░░░░░                 32%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ██░░░░░░                 35%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ███░░░░░                 38%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ███░░░░░                 40%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ███░░░░░                 43%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ███░░░░░                 46%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ███░░░░░                 49%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ████░░░░                 52%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ████░░░░                 55%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ████░░░░                 58%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            ████░░░░                 61%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo)            █████░░░                 64%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            █████░░░                 67%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            █████░░░                 70%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            █████░░░                 72%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ██████░░                 75%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ██████░░                 78%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ██████░░                 81%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ██████░░                 84%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ██████░░                 87%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ███████░                 90%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ███████░                 93%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo)            ███████░                 96%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:03] Pre-processing files (136 Mo)            ███████░                 98%\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:03] Pre-processing files (136 Mo)            ████████                100%\n",
-      "[00:00:00] Tokenize words                           ████████ 0        /        0\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ░░░░░░░░ 50310    /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           █░░░░░░░ 100620   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ██░░░░░░ 150930   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ███░░░░░ 201240   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ███░░░░░ 251550   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ████░░░░ 301860   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           █████░░░ 352170   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ██████░░ 402480   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ███████░ 452790   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ███████░ 503100   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words                           ████████ 503185   /   503185\n",
-      "\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs                              █░░░░░░░ 75526    /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs                              ██░░░░░░ 150944   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs                              ███░░░░░ 231465   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs                              ████░░░░ 301961   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs                              █████░░░ 372348   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs                              ███████░ 447777   /   503185\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Count pairs                              ████████ 503185   /   503185\n",
-      "\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:04] Compute merges                           ░░░░░░░░ 320      /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:04] Compute merges                           ░░░░░░░░ 640      /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:05] Compute merges                           ░░░░░░░░ 960      /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:05] Compute merges                           ░░░░░░░░ 1280     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:05] Compute merges                           ░░░░░░░░ 1600     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges                           ░░░░░░░░ 1920     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges                           ░░░░░░░░ 2240     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges                           ░░░░░░░░ 2560     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges                           ░░░░░░░░ 2880     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges                           ░░░░░░░░ 3200     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges                           ░░░░░░░░ 3520     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           ░░░░░░░░ 3840     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           █░░░░░░░ 4160     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           █░░░░░░░ 4480     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           █░░░░░░░ 4800     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           █░░░░░░░ 5120     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           █░░░░░░░ 5440     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           █░░░░░░░ 5760     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           █░░░░░░░ 6080     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges                           █░░░░░░░ 6400     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           █░░░░░░░ 6720     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           █░░░░░░░ 7040     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           █░░░░░░░ 7360     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           █░░░░░░░ 7680     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           ██░░░░░░ 8000     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           ██░░░░░░ 8320     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           ██░░░░░░ 8640     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           ██░░░░░░ 8960     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           ██░░░░░░ 9280     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           ██░░░░░░ 9600     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           ██░░░░░░ 9920     /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges                           ██░░░░░░ 10240    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ██░░░░░░ 10560    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ██░░░░░░ 11200    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ██░░░░░░ 11520    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ███░░░░░ 12160    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ███░░░░░ 12800    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ███░░░░░ 13120    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ███░░░░░ 13440    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ███░░░░░ 14080    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges                           ███░░░░░ 14720    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ███░░░░░ 15360    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ████░░░░ 16000    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ████░░░░ 16640    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ████░░░░ 17280    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ████░░░░ 17920    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ████░░░░ 18560    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ████░░░░ 18880    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           ████░░░░ 19520    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           █████░░░ 20160    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           █████░░░ 20800    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           █████░░░ 21440    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges                           █████░░░ 22400    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           █████░░░ 23040    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ██████░░ 24000    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ██████░░ 24960    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ██████░░ 25920    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ██████░░ 26880    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ██████░░ 27840    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ███████░ 28800    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ███████░ 29440    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ███████░ 30080    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ███████░ 30720    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges                           ███████░ 31360    /    32000\n",
-      "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges                           ████████ 31743    /    31743\n",
-      "\n",
-      "Trained vocab size: 32000\n",
-      "saving trained BPE model to :  ../dataset/SV/32k/\n",
-      "model saved ! \n",
-      "\n",
-      "\n",
-      "\n",
-      "testing ...\n",
-      "\n",
-      "\n",
-      "\n",
-      "['ĠHar', 'ĠnÃ¥gon', 'Ġfunderat', 'ĠpÃ¥', 'Ġvarför', 'Ġman', 'Ġinte', 'ĠfÃ¥r', 'Ġinom', 'hu', 'ste', 'peratur', 'ens', 'Ġkurva', 'Ġsynlig', 'Ġi', 'Ġgraf', 'en', '?', 'ĠÃĦr', 'Ġdet', 'ĠnÃ¥gon', 'Ġsom', 'ĠfrÃ¥gat', 'ĠTher', 'm', 'ia', '?', 'ĠSkulle', 'Ġdet', 'Ġinte', 'Ġvara', 'Ġväsentligt', 'Ġatt', 'Ġkunna', 'Ġkolla', 'Ġhistor', 'iken', 'ĠpÃ¥', 'Ġden', 'ĠdÃ¥', 'Ġman', 'Ġskall', 'Ġställa', 'Ġin', 'Ġkurvan', '?']\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "!python ./Megatron-LM/sv_utils/trainGPTTokenizer.py --infile $raw_text_path --bpe_path $output_trained_tokenizer_model_path --vocab_size 32000"
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "heated-ranch",
+   "metadata": {},
+   "source": [
+    "Below is the expected output:\n",
+    "    \n",
+    "        [00:00:11] Compute merges                           ███████░ 30720    /    32000\n",
+    "        [00:00:11] Compute merges                           ███████░ 31360    /    32000\n",
+    "        [00:00:12] Compute merges                           ████████ 31743    /    31743\n",
+    "\n",
+    "        Trained vocab size: 32000\n",
+    "        saving trained BPE model to :  ../dataset/SV/32k/\n",
+    "        model saved ! \n",
+    "\n",
+    "\n",
+    "\n",
+    "        testing ...\n",
+    "\n",
+    "\n",
+    "\n",
+    "        ['ĠHar', 'ĠnÃ¥gon', 'Ġfunderat', 'ĠpÃ¥', 'Ġvarför', 'Ġman', 'Ġinte', 'ĠfÃ¥r', 'Ġinom', 'hu', 'ste', 'peratur', 'ens', 'Ġkurva', 'Ġsynlig', 'Ġi', 'Ġgraf', 'en', '?', 'ĠÃĦr', 'Ġdet', 'ĠnÃ¥gon', 'Ġsom', 'ĠfrÃ¥gat', 'ĠTher', 'm', 'ia', '?', 'ĠSkulle', 'Ġdet', 'Ġinte', 'Ġvara', 'Ġväsentligt', 'Ġatt', 'Ġkunna', 'Ġkolla', 'Ġhistor', 'iken', 'ĠpÃ¥', 'Ġden', 'ĠdÃ¥', 'Ġman', 'Ġskall', 'Ġställa', 'Ġin', 'Ġkurvan', '?']"
+   ]
+  },
+  {
    "cell_type": "code",
-   "execution_count": 10,
-   "id": "wireless-galaxy",
+   "execution_count": null,
+   "id": "olive-japanese",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "merges.txt  vocab.json\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "## verify the merges.txt and vocab.json exist \n",
     "!ls ../dataset/SV/32k/"
@@ -501,35 +195,26 @@
   },
   {
    "cell_type": "markdown",
-   "id": "requested-sphere",
+   "id": "mineral-middle",
    "metadata": {},
    "source": [
     "--- \n",
-    "\n",
-    "## Additional Resources\n",
-    "\n",
-    "HuggingFace Tokenizer Documentation : https://huggingface.co/docs/tokenizers/python/latest/quicktour.html\n",
-    "\n",
-    "Train GPT-2 in your own langauge : https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171"
+    "## Links and Resources\n",
+    "Don't forget to check out additional resources such as [HuggingFace Tokenizer Documentation](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) and [Train GPTBPE Tokenizer in your own language](https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171)."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "contrary-quantum",
+   "id": "pregnant-template",
    "metadata": {},
    "source": [
-    "---\n",
-    "## Up Next : \n",
-    "\n",
-    "[customize preprocess data python script and convert to mmap](./Day3-4_customize_process2mmap.ipynb)\n",
-    "\n",
-    "## Back To Start Menu\n",
-    "[start menu](../Start_Here.ipynb)"
+    "-----\n",
+    "## <p style=\"text-align:center;border:3px; padding: 1em\"> <a href=../Start_Here.ipynb>HOME</a> &nbsp; &nbsp; &nbsp; <a href=./Lab2-4_customize_process2mmap.ipynb>NEXT</a></p>"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sized-google",
+   "id": "limiting-visiting",
    "metadata": {},
    "source": [
     "-----\n",

+ 561 - 0
ai/Megatron/English/Python/jupyter_notebook/Lab2-4_customize_process2mmap.ipynb

@@ -0,0 +1,561 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "otherwise-masters",
+   "metadata": {},
+   "source": [
+    "## Customize preprocess_data.py\n",
+    "---\n",
+    "\n",
+    "## Learning Objectives\n",
+    "\n",
+    "We fetched our own raw Swedish text data in `Lab2-1_acquiring_data.ipynb`, learned how to find sentence boundaries with custom functions in `Lab2-2_SentenceBoundary_and_Deduplicate.ipynb`, and trained a GPTBPETokenizer fitted to our raw Swedish text. \n",
+    "\n",
+    "We are now ready to incorporate the custom sentence-splitter into preprocess_data.py and convert the raw Swedish text first to JSON format and then to mmap format.\n",
+    "\n",
+    "Therefore, the goal of this notebook is to integrate the knowledge gained from Lab 1 and from the notebooks above, and to challenge ourselves to further customize preprocess_data.py with a <a href=\"./Lab2-4_customize_process2mmap.ipynb#Custom-Sentence-Splitter\">custom sentence-splitter</a> and, in the process, convert the new raw Swedish text to mmap format.\n",
+    "\n",
+    "More specifically, this notebook will cover the steps to :\n",
+    "\n",
+    "1.  Convert the extracted raw Swedish text from webnyheter2013.txt to webnyheter2013.json.\n",
+    "2.  Generate the mmap format files with the default preprocess_data.py first, so that we can still move on to the next notebook if time runs out.\n",
+    "\n",
+    "\n",
+    "Toward the end, there is a Mini-Challenge: <a href=\"./Lab2-4_customize_process2mmap.ipynb#Mini-Challenge\">Jump to view Mini-Challenge</a>.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "statutory-thesis",
+   "metadata": {},
+   "source": [
+    "1. Convert the extracted raw Swedish text from webnyheter2013.txt to webnyheter2013.json."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "horizontal-cause",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!python create_loose_json.py --infile ../dataset/SV/webnyheter2013.txt --outfile ../dataset/SV/webnyheter2013.json"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "reserved-clear",
+   "metadata": {},
+   "source": [
+    "Below is the expected output:\n",
+    "\n",
+    "        process 1000000 documents so far ...\n",
+    "        example:  – Vi har en bra generation som spelat tillsammans ett tag .\n",
+    "\n",
+    "        finished processing 1249010 lines to loose json format"
+   ]
+  },
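To make the "loose JSON" step more concrete, here is a minimal sketch of the kind of conversion it performs. It is not the code of `create_loose_json.py`; it assumes the target format is one JSON object per line with the text stored under the key `text` (the key later passed to `--json-keys`), and that empty lines are skipped. The function name `text_to_loose_json` and the second sample line are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def text_to_loose_json(infile, outfile, key="text"):
    """Write one JSON object per non-empty input line and return the count."""
    count = 0
    with open(infile, encoding="utf-8") as fin, \
            open(outfile, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are not documents
            # ensure_ascii=False keeps Swedish characters readable in the file
            fout.write(json.dumps({key: line}, ensure_ascii=False) + "\n")
            count += 1
    return count


# Tiny demo on temporary files instead of the real webnyheter2013.txt
workdir = Path(tempfile.mkdtemp())
src = workdir / "sample.txt"
dst = workdir / "sample.json"
src.write_text(
    "Vi har en bra generation som spelat tillsammans ett tag .\n"
    "\n"
    "En andra rad .\n",
    encoding="utf-8",
)

n_docs = text_to_loose_json(src, dst)
records = [json.loads(row) for row in dst.read_text(encoding="utf-8").splitlines()]
print(n_docs, records[0]["text"])
```

Each line of the resulting file can be parsed on its own, which is what lets a preprocessing script hand individual lines to separate worker processes.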
+  {
+   "cell_type": "markdown",
+   "id": "fixed-closing",
+   "metadata": {},
+   "source": [
+    "2.  Generate the mmap format files with the default preprocess_data.py first, so that we can still move on to the next notebook if time runs out."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "dried-intro",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "INPUT_JSON_FILE='../dataset/SV/webnyheter2013.json'\n",
+    "OUTPUT_PATH='../dataset/SV/webnyheter2013_56kvocab'\n",
+    "VOCAB_FILE='../dataset/SV/56k/vocab.json'\n",
+    "MERGE_FILE='../dataset/SV/56k/merges.txt'\n",
+    "NUM_CPUS=16"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "addressed-meeting",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!python ./Megatron-LM/tools/preprocess_data.py \\\n",
+    "                       --input $INPUT_JSON_FILE \\\n",
+    "                       --output-prefix $OUTPUT_PATH \\\n",
+    "                       --json-keys text \\\n",
+    "                       --vocab-file $VOCAB_FILE \\\n",
+    "                       --merge-file $MERGE_FILE \\\n",
+    "                       --dataset-impl mmap \\\n",
+    "                       --tokenizer-type GPT2BPETokenizer \\\n",
+    "                       --workers $NUM_CPUS \\\n",
+    "                       --append-eod"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "moderate-future",
+   "metadata": {},
+   "source": [
+    "Below is the expected output:\n",
+    "\n",
+    "    Processed 1248300 documents (52998.601302473544 docs/s, 5.869853647730749 MB/s).\n",
+    "    Processed 1248400 documents (53001.39142986273 docs/s, 5.870136451906283 MB/s).\n",
+    "    Processed 1248500 documents (53004.16423593737 docs/s, 5.870477584597603 MB/s).\n",
+    "    Processed 1248600 documents (53007.072626674184 docs/s, 5.870763528521501 MB/s).\n",
+    "    Processed 1248700 documents (53009.92668081499 docs/s, 5.871081674576178 MB/s).\n",
+    "    Processed 1248800 documents (53012.79399884911 docs/s, 5.871406835923378 MB/s).\n",
+    "    Processed 1248900 documents (53015.61341376629 docs/s, 5.8717617499445 MB/s).\n",
+    "    Processed 1249000 documents (53018.49277365899 docs/s, 5.8720826162486786 MB/s)."
+   ]
+  },
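Once the run finishes, it can help to check that the `.bin` and `.idx` files landed where expected. The sketch below derives the file names from the output prefix, following the `{prefix}_{key}_{level}` pattern that preprocess_data.py uses when it builds its output paths; the helper name `expected_outputs` is hypothetical, and the existence check simply reports whatever is on disk.

```python
from pathlib import Path


def expected_outputs(output_prefix, json_keys=("text",), split_sentences=False):
    """Return the .bin/.idx paths preprocess_data.py is expected to write."""
    # level is "sentence" only when --split-sentences is passed
    level = "sentence" if split_sentences else "document"
    files = []
    for key in json_keys:
        for ext in ("bin", "idx"):
            files.append(f"{output_prefix}_{key}_{level}.{ext}")
    return files


paths = expected_outputs("../dataset/SV/webnyheter2013_56kvocab")
for p in paths:
    print(p, "exists" if Path(p).exists() else "missing")
```

With the settings used in this notebook (key `text`, no sentence splitting), the two names end in `_text_document.bin` and `_text_document.idx`.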
+  {
+   "cell_type": "markdown",
+   "id": "electrical-executive",
+   "metadata": {},
+   "source": [
+    "We now have the default mmap files (xxx.bin and xxx.idx), which guarantees that the data needed by the next notebook is available.\n",
+    "We can now move on to the customization.\n",
+    "\n",
+    "Start by copying the original preprocess_data.py into a new Python script called `MYpreprocess_data.py`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "greatest-receptor",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!cp ./Megatron-LM/tools/preprocess_data.py ./Megatron-LM/tools/MYpreprocess_data.py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "south-devil",
+   "metadata": {},
+   "source": [
+    "<a id=\"Custom-Sentence-Splitter\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "rough-pickup",
+   "metadata": {},
+   "source": [
+    "The code block below is our custom sentence-splitter `cut_sentence_with_quotation_marks`. The function is provided for your convenience, ready to be integrated into `MYpreprocess_data.py`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "prostate-profession",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "import nltk\n",
+    "from nltk.tokenize import sent_tokenize\n",
+    "\n",
+    "def normal_cut_sentence(temp):\n",
+    "    # fall back to NLTK's default sentence tokenizer\n",
+    "    return sent_tokenize(temp)\n",
+    "\n",
+    "def cut_sentence_with_quotation_marks(text):\n",
+    "    # keep each “...” quotation as one sentence; split everything else with NLTK\n",
+    "    p = re.compile(\"“.*?”\")\n",
+    "    sentences = []  # avoid shadowing the built-in name list\n",
+    "    index = 0\n",
+    "    for i in p.finditer(text):\n",
+    "        start, end = i.start(), i.end()\n",
+    "        temp = text[index:start]\n",
+    "        if temp != '':\n",
+    "            sentences += normal_cut_sentence(temp)\n",
+    "        sentences.append(text[start:end])\n",
+    "        index = end\n",
+    "    # text after the last quotation (or the whole text if it has no quotation)\n",
+    "    if index < len(text):\n",
+    "        sentences += normal_cut_sentence(text[index:])\n",
+    "    return sentences"
+   ]
+  },
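To see what a quotation-aware splitter does on a concrete input, here is a small self-contained sketch. It is an illustration under stated assumptions rather than the notebook's function: a simple regular-expression splitter (`simple_split`, hypothetical) stands in for NLTK's `sent_tokenize` so the example runs without downloading the punkt data, and the sketch passes any text after the last quotation to that splitter as well. The sample sentence is made up.

```python
import re

QUOTE = re.compile("“.*?”")  # same pattern as the custom splitter


def simple_split(text):
    """Stand-in for sent_tokenize: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cut_with_quotes(text, split=simple_split):
    """Keep each quotation whole and split the surrounding text normally."""
    sentences, index = [], 0
    for match in QUOTE.finditer(text):
        before = text[index:match.start()]
        if before.strip():
            sentences += split(before)
        sentences.append(match.group())  # the quotation stays in one piece
        index = match.end()
    tail = text[index:]
    if tail.strip():
        sentences += split(tail)
    return sentences


sample = "Hon sa något. “Vänta. Jag kommer!” Sedan gick hon."
result = cut_with_quotes(sample)
print(result)
```

A plain punctuation-based splitter would cut the quotation after `Vänta.`; here it survives as a single sentence.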
+  {
+   "cell_type": "markdown",
+   "id": "large-birthday",
+   "metadata": {},
+   "source": [
+    "<a id=\"Mini-Challenge\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "medical-incident",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## **Mini-Challenge** - integrate the custom sentence splitter into MYpreprocess_data.py\n",
+    "\n",
+    "Task : Modify and overwrite `MYpreprocess_data.py` below to incorporate the custom `cut_sentence_with_quotation_marks`\n",
+    "\n",
+    "Pass : Successfully run MYpreprocess_data.py with the custom sentence splitter cut_sentence_with_quotation_marks and generate the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files.\n",
+    "\n",
+    "Note: the solution will be delivered to you at the end of Lab 2.\n",
+    "\n",
+    "---\n",
+    "Modify the cell below to overwrite `MYpreprocess_data.py`. \n",
+    "After modification, jump to the ReRun cell to produce the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files.\n",
+    "<a id=\"MODIFY_CELL\"></a>\n",
+    "<a href=\"./Lab2-4_customize_process2mmap.ipynb#Rerun_Cell\">Jump to ReRun Cell</a> "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "selected-depth",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%writefile ./Megatron-LM/tools/MYpreprocess_data.py \n",
+    "# coding=utf-8\n",
+    "# Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.\n",
+    "#\n",
+    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+    "# you may not use this file except in compliance with the License.\n",
+    "# You may obtain a copy of the License at\n",
+    "#\n",
+    "#     http://www.apache.org/licenses/LICENSE-2.0\n",
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing, software\n",
+    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+    "# See the License for the specific language governing permissions and\n",
+    "# limitations under the License.\n",
+    "\n",
+    "\"\"\"Processing data for pretraining.\"\"\"\n",
+    "\n",
+    "import argparse\n",
+    "import json\n",
+    "import multiprocessing\n",
+    "import os\n",
+    "import sys\n",
+    "sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),\n",
+    "                                             os.path.pardir)))\n",
+    "import time\n",
+    "\n",
+    "import torch\n",
+    "try:\n",
+    "    import nltk\n",
+    "    nltk_available = True\n",
+    "except ImportError:\n",
+    "    nltk_available = False\n",
+    "\n",
+    "from megatron.tokenizer import build_tokenizer\n",
+    "from megatron.data import indexed_dataset\n",
+    "\n",
+    "\n",
+    "# https://stackoverflow.com/questions/33139531/preserve-empty-lines-with-nltks-punkt-tokenizer\n",
+    "class CustomLanguageVars(nltk.tokenize.punkt.PunktLanguageVars):\n",
+    "\n",
+    "    _period_context_fmt = r\"\"\"\n",
+    "        \\S*                          # some word material\n",
+    "        %(SentEndChars)s             # a potential sentence ending\n",
+    "        \\s*                       #  <-- THIS is what I changed\n",
+    "        (?=(?P<after_tok>\n",
+    "            %(NonWord)s              # either other punctuation\n",
+    "            |\n",
+    "            (?P<next_tok>\\S+)     #  <-- Normally you would have \\s+ here\n",
+    "        ))\"\"\"\n",
+    "\n",
+    "class IdentitySplitter(object):\n",
+    "    def tokenize(self, *text):\n",
+    "        return text\n",
+    "\"\"\"[TODO]: modify this class to integrate the custom sentence splitter above \"\"\"\n",
+    "\n",
+    "class Encoder(object):\n",
+    "    def __init__(self, args):\n",
+    "        self.args = args\n",
+    "    \n",
+    "    def initializer(self):\n",
+    "        # Use Encoder class as a container for global data\n",
+    "        Encoder.tokenizer = build_tokenizer(self.args)\n",
+    "        if self.args.split_sentences:\n",
+    "            if not nltk_available:\n",
+    "                print(\"NLTK is not available to split sentences.\")\n",
+    "                exit()\n",
+    "            splitter = nltk.load(\"tokenizers/punkt/english.pickle\")\n",
+    "            if self.args.keep_newlines:\n",
+    "                # this prevents punkt from eating newlines after sentences\n",
+    "                Encoder.splitter = nltk.tokenize.punkt.PunktSentenceTokenizer(\n",
+    "                    train_text = splitter._params,\n",
+    "                    lang_vars = CustomLanguageVars())\n",
+    "            else:\n",
+    "                Encoder.splitter = splitter\n",
+    "\n",
+    "        else:\n",
+    "            Encoder.splitter = IdentitySplitter()\n",
+    "\n",
+    "    def encode(self, json_line):\n",
+    "        data = json.loads(json_line)\n",
+    "        ids = {}\n",
+    "        for key in self.args.json_keys:\n",
+    "            text = data[key]\n",
+    "            doc_ids = []\n",
+    "            for sentence in Encoder.splitter.tokenize(text):\n",
+    "                sentence_ids = Encoder.tokenizer.tokenize(sentence)\n",
+    "                if len(sentence_ids) > 0:\n",
+    "                    doc_ids.append(sentence_ids)\n",
+    "            if len(doc_ids) > 0 and self.args.append_eod:\n",
+    "                doc_ids[-1].append(Encoder.tokenizer.eod)\n",
+    "            ids[key] = doc_ids\n",
+    "        return ids, len(json_line)\n",
+    "\n",
+    "def get_args():\n",
+    "    parser = argparse.ArgumentParser()\n",
+    "    group = parser.add_argument_group(title='input data')\n",
+    "    group.add_argument('--input', type=str, required=True,\n",
+    "                       help='Path to input JSON')\n",
+    "    group.add_argument('--json-keys', nargs='+', default=['text'],\n",
+    "                       help='space-separated list of keys to extract from json')\n",
+    "    group.add_argument('--split-sentences', action='store_true',\n",
+    "                       help='Split documents into sentences.')\n",
+    "    group.add_argument('--keep-newlines', action='store_true',\n",
+    "                       help='Keep newlines between sentences when splitting.')\n",
+    "\n",
+    "    group = parser.add_argument_group(title='tokenizer')\n",
+    "    group.add_argument('--tokenizer-type', type=str, required=True,\n",
+    "                       choices=['BertWordPieceLowerCase','BertWordPieceCase',\n",
+    "                                'GPT2BPETokenizer'],\n",
+    "                       help='What type of tokenizer to use.')\n",
+    "    group.add_argument('--vocab-file', type=str, default=None,\n",
+    "                       help='Path to the vocab file')\n",
+    "    group.add_argument('--merge-file', type=str, default=None,\n",
+    "                       help='Path to the BPE merge file (if necessary).')\n",
+    "    group.add_argument('--append-eod', action='store_true',\n",
+    "                       help='Append an <eod> token to the end of a document.')\n",
+    "\n",
+    "\n",
+    "    group = parser.add_argument_group(title='output data')\n",
+    "    group.add_argument('--output-prefix', type=str, required=True,\n",
+    "                       help='Path to binary output file without suffix')\n",
+    "    group.add_argument('--dataset-impl', type=str, default='mmap',\n",
+    "                       choices=['lazy', 'cached', 'mmap'])\n",
+    "\n",
+    "    group = parser.add_argument_group(title='runtime')\n",
+    "    group.add_argument('--workers', type=int, default=1,\n",
+    "                       help='Number of worker processes to launch')\n",
+    "    group.add_argument('--log-interval', type=int, default=100,\n",
+    "                       help='Interval between progress updates')\n",
+    "    args = parser.parse_args()\n",
+    "    args.keep_empty = False\n",
+    "\n",
+    "    if args.tokenizer_type.lower().startswith('bert'):\n",
+    "        if not args.split_sentences:\n",
+    "            print(\"Bert tokenizer detected, are you sure you don't want to split sentences?\")\n",
+    "\n",
+    "    # some default/dummy values for the tokenizer\n",
+    "    args.rank = 0\n",
+    "    args.make_vocab_size_divisible_by = 128\n",
+    "    args.tensor_model_parallel_size = 1\n",
+    "    args.vocab_extra_ids = 0\n",
+    "\n",
+    "    return args\n",
+    "\n",
+    "def main():\n",
+    "    args = get_args()\n",
+    "    startup_start = time.time()\n",
+    "\n",
+    "    print(\"Opening\", args.input)\n",
+    "    fin = open(args.input, 'r', encoding='utf-8')\n",
+    "\n",
+    "    if nltk_available and args.split_sentences:\n",
+    "        nltk.download(\"punkt\", quiet=True)\n",
+    "\n",
+    "    encoder = Encoder(args)\n",
+    "    tokenizer = build_tokenizer(args)\n",
+    "    pool = multiprocessing.Pool(args.workers, initializer=encoder.initializer)\n",
+    "    encoded_docs = pool.imap(encoder.encode, fin, 25)\n",
+    "    #encoded_docs = map(encoder.encode, fin)\n",
+    "\n",
+    "    level = \"document\"\n",
+    "    if args.split_sentences:\n",
+    "        level = \"sentence\"\n",
+    "\n",
+    "    print(f\"Vocab size: {tokenizer.vocab_size}\")\n",
+    "    print(f\"Output prefix: {args.output_prefix}\")\n",
+    "    output_bin_files = {}\n",
+    "    output_idx_files = {}\n",
+    "    builders = {}\n",
+    "    for key in args.json_keys:\n",
+    "        output_bin_files[key] = \"{}_{}_{}.bin\".format(args.output_prefix,\n",
+    "                                                      key, level)\n",
+    "        output_idx_files[key] = \"{}_{}_{}.idx\".format(args.output_prefix,\n",
+    "                                                      key, level)\n",
+    "        builders[key] = indexed_dataset.make_builder(output_bin_files[key],\n",
+    "                                               impl=args.dataset_impl,\n",
+    "                                               vocab_size=tokenizer.vocab_size)\n",
+    "\n",
+    "    startup_end = time.time()\n",
+    "    proc_start = time.time()\n",
+    "    total_bytes_processed = 0\n",
+    "    print(\"Time to startup:\", startup_end - startup_start)\n",
+    "\n",
+    "    for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1):\n",
+    "        total_bytes_processed += bytes_processed\n",
+    "        for key, sentences in doc.items():\n",
+    "            if len(sentences) == 0:\n",
+    "                continue\n",
+    "            for sentence in sentences:\n",
+    "                builders[key].add_item(torch.IntTensor(sentence))\n",
+    "            builders[key].end_document()\n",
+    "        if i % args.log_interval == 0:\n",
+    "            current = time.time()\n",
+    "            elapsed = current - proc_start\n",
+    "            mbs = total_bytes_processed/elapsed/1024/1024\n",
+    "            print(f\"Processed {i} documents\",\n",
+    "                  f\"({i/elapsed} docs/s, {mbs} MB/s).\",\n",
+    "                  file=sys.stderr)\n",
+    "\n",
+    "    for key in args.json_keys:\n",
+    "        builders[key].finalize(output_idx_files[key])\n",
+    "\n",
+    "if __name__ == '__main__':\n",
+    "    main()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "raised-victim",
+   "metadata": {},
+   "source": [
+    "The cell below specifies all the input parameters needed to run `MYpreprocess_data.py`. \n",
+    "\n",
+    "Please do **NOT** modify anything in the cell below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "fluid-dayton",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "INPUT_JSON_FILE='../dataset/SV/webnyheter2013.json'\n",
+    "OUTPUT_PATH='../dataset/SV/customSentenceSplit'\n",
+    "VOCAB_FILE='../dataset/SV/32k/vocab.json'\n",
+    "MERGE_FILE='../dataset/SV/32k/merges.txt'\n",
+    "NUM_CPUS=16"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "concerned-protest",
+   "metadata": {},
+   "source": [
+    "Below is a ReRun cell block that runs `MYpreprocess_data.py` and produces the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files.\n",
+    "\n",
+    "<a id=\"Rerun_Cell\"></a>\n",
+    "\n",
+    "To go back and modify `MYpreprocess_data.py`, click on this shortcut link: <a href=\"./Lab2-4_customize_process2mmap.ipynb#MODIFY_CELL\">Jump to Modify MYpreprocess_data.py</a> "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "rolled-welcome",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!python ./Megatron-LM/tools/MYpreprocess_data.py \\\n",
+    "                       --input $INPUT_JSON_FILE \\\n",
+    "                       --output-prefix $OUTPUT_PATH \\\n",
+    "                       --json-keys text \\\n",
+    "                       --vocab-file $VOCAB_FILE \\\n",
+    "                       --merge-file $MERGE_FILE \\\n",
+    "                       --dataset-impl mmap \\\n",
+    "                       --tokenizer-type GPT2BPETokenizer \\\n",
+    "                       --workers $NUM_CPUS \\\n",
+    "                       --append-eod"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "reduced-court",
+   "metadata": {},
+   "source": [
+    "Check whether the two files, customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx, were successfully generated."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "secondary-stereo",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! ls ../dataset/SV/"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "premier-birth",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "## clean up to free up space\n",
+    "!rm ./Megatron-LM/tools/MYpreprocess_data.py"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eastern-ministry",
+   "metadata": {},
+   "source": [
+    "-----\n",
+    "## <p style=\"text-align:center;border:3px; padding: 1em\"> <a href=../Start_Here.ipynb>HOME</a> &nbsp; &nbsp; &nbsp; <a href=./Lab2-5_run_Megatron_with_varying_config.ipynb>NEXT</a></p>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "qualified-admission",
+   "metadata": {},
+   "source": [
+    "-----\n",
+    "\n",
+    "\n",
+    "## Licensing \n",
+    "\n",
+    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
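
The two file names checked at the end of Lab2-4 follow from the naming logic in `MYpreprocess_data.py`: output prefix, JSON key and level, joined by underscores. Below is a minimal sketch of that rule, assuming the same format strings as the script. The helper name `expected_output_files` is hypothetical and is not part of Megatron-LM.

```python
def expected_output_files(output_prefix, json_keys, split_sentences=False):
    """Return the .bin/.idx paths the preprocessing script is expected to write.

    Hypothetical helper: it mirrors the "{prefix}_{key}_{level}" pattern used
    in MYpreprocess_data.py, where level is "sentence" only when
    --split-sentences is passed and "document" otherwise.
    """
    level = "sentence" if split_sentences else "document"
    files = []
    for key in json_keys:
        for ext in ("bin", "idx"):
            files.append("{}_{}_{}.{}".format(output_prefix, key, level, ext))
    return files


if __name__ == "__main__":
    # Same prefix and key as the ReRun cell (no --split-sentences flag).
    for path in expected_output_files("../dataset/SV/customSentenceSplit", ["text"]):
        print(path)
```

Because the ReRun cell does not pass `--split-sentences`, the level is `document`, which is why the notebook looks for `customSentenceSplit_text_document.bin` and `.idx`.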

+ 311 - 0
ai/Megatron/English/Python/jupyter_notebook/Lab2-5_run_Megatron_with_varying_config.ipynb

@@ -0,0 +1,311 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "effective-university",
+   "metadata": {},
+   "source": [
+    "## Scale up model size\n",
+    "---\n",
+    "In previous notebooks, we downloaded and extracted our own Swedish raw text; practiced filtering, cleaning and deduplicating the raw text data; trained our own GPT2BPETokenizer and fitted it to the raw Swedish text; and converted the raw text to mmap format with a custom sentence-splitter integrated.\n",
+    "\n",
+    "Now that we have learned the components needed to customize Megatron-LM's workflow for the needs of a specific language (in this case, Swedish), the next step is to train the Megatron-LM GPT model with the Swedish data. \n",
+    "\n",
+    "However, the compute resources you have, i.e. the number of GPUs available for the training job, put an upper limit on how big a model you can train.\n",
+    "\n",
+    "Let's test this out by presenting a Challenge. \n",
+    "\n",
+    "## **Challenge** - Go big or go home!\n",
+    "\n",
+    "- Constraints : \n",
+    "    - 2 x A100 40GB GPUs are allocated for this challenge.\n",
+    "    - Only the parameters in the **modifiable blocks** are allowed to be changed.\n",
+    "    - Avoid OOM!\n",
+    "    - The training run must finish and the checkpoint must be saved successfully.\n",
+    "\n",
+    "\n",
+    "- Task : \n",
+    "        Given the above constraints, train as BIG a GPT model as possible.\n",
+    "\n",
+    "- Winning criteria : the biggest model wins given the above constraints.\n",
+    "\n",
+    "Note 1: Post the parameters you changed in the **modifiable blocks** to the Slack channels for verification.\n",
+    "\n",
+    "Note 2: We purposefully turned off nsys profiling in this challenge, because profiling with nsys introduces a small overhead, which reduces the maximum achievable model size.\n",
+    "\n",
+    "To go directly to the code block and modify the training configuration, click here: <a href=\"./Lab2-5_run_Megatron_with_varying_config.ipynb#MODIFY_CELL\">Jump to Code Cell and Modify Training Config</a> "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "incorporate-bidding",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "# Hint:\n",
+    "### Open a terminal and run **nvidia-smi** to monitor the GPUs' utilization and power consumption \n",
+    "### Remember to fill up the GPU memory\n",
+    "![call out a terminal ](./Megatron-LM/pics/Alt_callout2terminals.JPG)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "established-substitute",
+   "metadata": {},
+   "source": [
+    "Modify and rerun the code blocks below to obtain an even bigger GPT model. \n",
+    "\n",
+    "\n",
+    "<a id=\"MODIFY_CELL\"></a>\n",
+    "<a href=\"./Lab2-5_run_Megatron_with_varying_config.ipynb#Rerun_Cell\">Jump to ReRun Cell</a> "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "central-scheduling",
+   "metadata": {},
+   "source": [
+    "<a id=\"MODIFY_CELL\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "handy-process",
+   "metadata": {},
+   "source": [
+    "Always clean the checkpoint folder to ensure training starts from scratch."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "human-privacy",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!rm -fr ../sv_ckpt/* "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "chief-latter",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Overwriting ./Megatron-LM/SV_GPT_goingBIG.sh\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%writefile ./Megatron-LM/SV_GPT_goingBIG.sh\n",
+    "# Copyright (c) 2020 NVIDIA Corporation.  All rights reserved.\n",
+    "MASTER_ADDR=localhost\n",
+    "MASTER_PORT=6000\n",
+    "NNODES=1 #<-- currently we are using 1 node multigpus\n",
+    "NODE_RANK=0\n",
+    "WORLD_SIZE=2 \n",
+    "GPUS_PER_NODE=2  \n",
+    "\n",
+    "\n",
+    "CHECKPOINT_PATH='../sv_ckpt/'\n",
+    "DATA_PATH='../dataset/SV/webnyheter2013_56kvocab_text_document'\n",
+    "VOCAB_FILE='../dataset/SV/56k/vocab.json'\n",
+    "MERGE_FILE='../dataset/SV/56k/merges.txt'\n",
+    "PROFILE_OUTPUT_PATH='../profiles/SV/nsys_sv_' # modify this to your own profile path\n",
+    "\n",
+    "#### [TODO]--------------- Begin of modifiable block -----------#### \n",
+    "\n",
+    "TENSOR_MP_SIZE=<FILL_IN>\n",
+    "PIPELINE_MP_SIZE=<FILL_IN>\n",
+    "LAYERS=<FILL_IN>\n",
+    "HIDDEN_SZ=<FILL_IN>\n",
+    "NUM_ATTN_HEADS=<FILL_IN>\n",
+    "MICRO_BZ=<FILL_IN>\n",
+    "GLOBAL_BZ=<FILL_IN>\n",
+    "SEQ_LEN=<FILL_IN>\n",
+    "MAX_POS_EM=<FILL_IN>\n",
+    "\n",
+    "#### -------------------- end of modifiable blocks ------------------------#### \n",
+    "\n",
+    "##################  DO NOT modify anything below this line ##################\n",
+    "export OMP_NUM_THREADS=1\n",
+    "DISTRIBUTED_ARGS=\"--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT\"\n",
+    "\n",
+    "## We turn off nsys profiling decoration to avoid the small overhead\n",
+    "#nsys profile --stats=false --force-overwrite=true --duration=300 --trace=cudnn,cuda,osrt,nvtx -o $PROFILE_OUTPUT_PATH \\\n",
+    "python -m torch.distributed.launch $DISTRIBUTED_ARGS \\\n",
+    "    ./Megatron-LM/Dlprof_pretrain_gpt.py \\\n",
+    "       --tensor-model-parallel-size ${TENSOR_MP_SIZE} \\\n",
+    "       --pipeline-model-parallel-size ${PIPELINE_MP_SIZE} \\\n",
+    "       --num-layers ${LAYERS} \\\n",
+    "       --hidden-size ${HIDDEN_SZ} \\\n",
+    "       --num-attention-heads ${NUM_ATTN_HEADS} \\\n",
+    "       --micro-batch-size ${MICRO_BZ} \\\n",
+    "       --global-batch-size ${GLOBAL_BZ} \\\n",
+    "       --seq-length ${SEQ_LEN} \\\n",
+    "       --max-position-embeddings ${MAX_POS_EM} \\\n",
+    "       --train-samples 100 \\\n",
+    "       --save ${CHECKPOINT_PATH} \\\n",
+    "       --load ${CHECKPOINT_PATH} \\\n",
+    "       --data-path 1. ${DATA_PATH} \\\n",
+    "       --vocab-file ${VOCAB_FILE} \\\n",
+    "       --merge-file ${MERGE_FILE} \\\n",
+    "       --data-impl mmap \\\n",
+    "       --split 949,50,1 \\\n",
+    "       --distributed-backend nccl \\\n",
+    "       --lr 0.00015 \\\n",
+    "       --lr-decay-style cosine \\\n",
+    "       --min-lr 1.0e-5 \\\n",
+    "       --weight-decay 1e-2 \\\n",
+    "       --clip-grad 1.0 \\\n",
+    "       --lr-warmup-fraction .01 \\\n",
+    "       --checkpoint-activations \\\n",
+    "       --log-interval 10 \\\n",
+    "       --save-interval 100 \\\n",
+    "       --eval-interval 200 \\\n",
+    "       --eval-iters 10 \\\n",
+    "       --fp16"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "proprietary-elizabeth",
+   "metadata": {},
+   "source": [
+    "Check how big your model is by modifying the parameters in [params_cnt.sh](./params_cnt.sh).\n",
+    "\n",
+    "I got 6.6 billion :) What about you?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "beginning-homework",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!bash params_cnt.sh "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "radio-secretariat",
+   "metadata": {},
+   "source": [
+    "Below is an example of expected outputs:\n",
+    "    \n",
+    "        6\n",
+    "        6675628032\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fuzzy-assault",
+   "metadata": {},
+   "source": [
+    "Re-run the cell below to get an even bigger GPT model.\n",
+    "\n",
+    "Remember to modify the [params count](./params_cnt.sh) to check how big your model is.\n",
+    "\n",
+    "To jump back and modify SV_GPT_goingBIG.sh, click here: \n",
+    "<a href=\"./Lab2-5_run_Megatron_with_varying_config.ipynb#MODIFY_CELL\">Jump back to modify and overwrite SV_GPT_goingBIG.sh </a> \n",
+    "<a id=\"Rerun_Cell\"></a>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "rental-deputy",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!bash ./Megatron-LM/SV_GPT_goingBIG.sh"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "korean-republic",
+   "metadata": {},
+   "source": [
+    "Below is an example of expected outputs:\n",
+    "\n",
+    "        > elapsed time for building blendable dataset indices: 0.00 (sec)\n",
+    "        > finished creating GPT datasets ...\n",
+    "        [after dataloaders are built] datetime: 2021-09-15 11:55:58 \n",
+    "        done with setup ...\n",
+    "        training ...\n",
+    "        time (ms) | model-and-optimizer-setup: 929.42 | train/valid/test-data-iterators-setup: 1004.53\n",
+    "        [after training is done] datetime: 2021-09-15 11:55:58 \n",
+    "        ------------------------------------------------------------------------------------------------------------------\n",
+    "         validation loss at the end of training for val data | lm loss value: 1.171452E+01 | lm loss PPL: 1.223352E+05 | \n",
+    "        ------------------------------------------------------------------------------------------------------------------\n",
+    "        Evaluating iter 10/10\n",
+    "        -------------------------------------------------------------------------------------------------------------------\n",
+    "         validation loss at the end of training for test data | lm loss value: 1.171400E+01 | lm loss PPL: 1.222719E+05 | \n",
+    "        -------------------------------------------------------------------------------------------------------------------"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "official-concept",
+   "metadata": {},
+   "source": [
+    "--- \n",
+    "\n",
+    "## Additional Resources\n",
+    "\n",
+    "Language Models are Few-Shot Learners : https://arxiv.org/pdf/2005.14165.pdf\n",
+    "\n",
+    "Efficient Large-Scale Language Model Training on GPU Clusters : https://arxiv.org/pdf/2104.04473.pdf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "laden-sender",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "\n",
+    "## Congratulations on completing the mission !\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "premium-treasury",
+   "metadata": {},
+   "source": [
+    "-----\n",
+    "\n",
+    "\n",
+    "## Licensing \n",
+    "\n",
+    "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
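
Before launching a run in the Lab2-5 challenge, it can save time to check that the values chosen in the modifiable block are mutually consistent. The sketch below encodes the divisibility rules we assume Megatron-LM enforces (model-parallel sizes dividing the world size, batch sizes matching the data-parallel size, heads dividing the hidden size, and so on). The function `check_config` is hypothetical, and the exact set of checks is an assumption that should be confirmed against the Megatron-LM version in use. It says nothing about whether a configuration fits in GPU memory.

```python
def check_config(world_size, tensor_mp, pipeline_mp, layers, hidden, heads,
                 micro_bz, global_bz, seq_len, max_pos_em):
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper. The rules are assumptions about what Megatron-LM
    expects, not a copy of its argument validation.
    """
    problems = []
    model_parallel = tensor_mp * pipeline_mp
    if world_size % model_parallel != 0:
        problems.append("WORLD_SIZE is not divisible by TENSOR_MP_SIZE * PIPELINE_MP_SIZE")
        return problems  # data-parallel size is undefined, stop here
    data_parallel = world_size // model_parallel
    if global_bz % (micro_bz * data_parallel) != 0:
        problems.append("GLOBAL_BZ is not divisible by MICRO_BZ * data-parallel size")
    if hidden % heads != 0:
        problems.append("HIDDEN_SZ is not divisible by NUM_ATTN_HEADS")
    if heads % tensor_mp != 0:
        problems.append("NUM_ATTN_HEADS is not divisible by TENSOR_MP_SIZE")
    if layers % pipeline_mp != 0:
        problems.append("LAYERS is not divisible by PIPELINE_MP_SIZE")
    if seq_len > max_pos_em:
        problems.append("SEQ_LEN exceeds MAX_POS_EM")
    return problems


if __name__ == "__main__":
    # Illustrative values only, not a known-good answer to the challenge.
    for problem in check_config(world_size=2, tensor_mp=2, pipeline_mp=1,
                                layers=32, hidden=4096, heads=32,
                                micro_bz=1, global_bz=8,
                                seq_len=512, max_pos_em=512):
        print(problem)
```

A configuration that passes these checks can still run out of memory, so monitoring with `nvidia-smi` remains necessary.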

+ 17 - 17
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-1_acquiring_data.ipynb

@@ -2,13 +2,13 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "monetary-season",
+   "id": "stopped-graph",
    "metadata": {},
    "source": [
     "## Acquire Swedish data \n",
     "---\n",
     "\n",
-    "For data licensing and privacy concerns, we will not providing any data in this bootcamp.\n",
+    "Due to data licensing and privacy concerns, we will not be providing training data in this bootcamp.\n",
     "\n",
     "However, we do need data in order to proceed the customization of Megatron-LM's workflow for Swedish, hence, the first thing we need to do, is to acquire Swedish raw text data.\n",
     "\n",
@@ -18,9 +18,9 @@
     "\n",
     "    1. Download data via wget and download the python script which will be used to extract the Swedish text.\n",
     "    \n",
-    "    2. unzip the data using bunzip and move the data to the correct folder under dataset\n",
+    "    2. unzip the data using bunzip and move the data to the correct folder under dataset.\n",
     "    \n",
-    "    3. Write a custom function to  extract raw txt file from xml file and move the text file to the correct folder under dataset\n",
+    "    3. A custom function is provided to extract the raw text from the xml file and move the text file to the correct folder under dataset.\n",
     "\n",
     "\n",
     "\n",
@@ -32,7 +32,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "leading-lingerie",
+   "id": "realistic-boating",
    "metadata": {},
    "source": [
     "1. Download data via wget and download the python script which will be used to extract the Swedish text."
@@ -41,7 +41,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "distributed-sheet",
+   "id": "electric-motel",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -51,16 +51,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "surprised-huntington",
+   "id": "activated-sigma",
    "metadata": {},
    "source": [
-    "2. unzip the data using bunzip and move the data to the correct folder under dataset"
+    "2. unzip the data using bunzip and move the data to the correct folder under dataset."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "another-university",
+   "id": "blond-begin",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -70,7 +70,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "textile-variance",
+   "id": "interior-disease",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -80,7 +80,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "cardiac-exclusive",
+   "id": "severe-table",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -89,16 +89,16 @@
   },
   {
    "cell_type": "markdown",
-   "id": "liquid-marina",
+   "id": "worse-ceiling",
    "metadata": {},
    "source": [
-    "3. Write a custom function to extract raw txt file from xml file and move the text file `webnyheter2013.txt` to the correct folder under dataset"
+    "3. A custom function is provided to extract the raw text from the xml file and move the text file to the correct folder under dataset."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "enabling-dominant",
+   "id": "afraid-eleven",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -136,7 +136,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "radical-workshop",
+   "id": "dietary-thinking",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -145,7 +145,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "found-anaheim",
+   "id": "false-calgary",
    "metadata": {},
    "source": [
     "-----\n",
@@ -154,7 +154,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "important-arthritis",
+   "id": "elegant-bandwidth",
    "metadata": {},
    "source": [
     "-----\n",

File diff suppressed because it is too large
+ 74 - 71
ai/Megatron/English/Python/jupyter_notebook/Megatron-LM/tools/openwebtext/Lab2-2_SentenceBoundary_and_Deduplicate.ipynb


+ 4 - 4
ai/Megatron/English/Python/jupyter_notebook/params_cnt.sh

@@ -1,8 +1,8 @@
 # Copyright (c) 2020 NVIDIA Corporation.  All rights reserved.
-NLAYERS=32 ## modify this param
-NHIDDEN=4096 ## modify this param
-NHEADS=32 ## modify this param
-SEQ_LEN=512 ## modify this param
+NLAYERS=  ## modify this param
+NHIDDEN=  ## modify this param
+NHEADS=  ## modify this param
+SEQ_LEN= ## modify this param
 VOCAB_SIZE=56000 ## modify this param
 
 MODEL_SIZE=$((($NLAYERS * (12*$NHIDDEN**2 + 13*$NHIDDEN) + ($VOCAB_SIZE * $NHIDDEN) + ($SEQ_LEN * $NHIDDEN) ) / 10**9))
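
The `MODEL_SIZE` expression above can also be evaluated in Python, which makes it easy to try several configurations without editing the shell script. This is a minimal sketch that transcribes the same formula; the function name `gpt_param_count` is hypothetical. The example values are the ones this commit removes from `params_cnt.sh` (32 layers, hidden size 4096, sequence length 512) together with the 56000 vocabulary size.

```python
def gpt_param_count(n_layers, n_hidden, seq_len, vocab_size):
    """Approximate GPT parameter count, transcribed from params_cnt.sh."""
    per_layer = 12 * n_hidden ** 2 + 13 * n_hidden   # one transformer layer
    token_embeddings = vocab_size * n_hidden         # vocabulary embeddings
    position_embeddings = seq_len * n_hidden         # positional embeddings
    return n_layers * per_layer + token_embeddings + position_embeddings


if __name__ == "__main__":
    total = gpt_param_count(n_layers=32, n_hidden=4096, seq_len=512, vocab_size=56000)
    print(total // 10 ** 9)  # 6 (integer division, as in the shell script)
    print(total)             # 6675628032
```

These two numbers match the example output shown in the Lab2-5 notebook. Note that the number of attention heads does not enter this formula.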