{ "cells": [ { "cell_type": "markdown", "id": "alternate-collar", "metadata": {}, "source": [ "# \n", "\n", "# 5 Monitor GPT training performance with varying config\n", "---\n", "\n", "## Learning Objectives\n", "- **The goal of this lab is to monitor the performance of your training runs with different GPT training configurations **\n", " - motivation : why should we care ? \n", " \n", " Answer : bad config result in very low / inconsistent gpus utilizations which in turn, slow down training and therefore longer experiments per run, it's a lose-lose-lose situation on all sides.\n", " ![see example](./Megatron-LM/pics/naive_run.JPG)\n", " \n", " - example : naive run vs. improved run \n", " starts with multiGPUs --> multinode ( if we get at least 2 nodes per person / team ) \n", " - exercise : beat the record !\n", "\n", "it is possible to obtain more than 90% GPU utilizations overall with high tensorcore ops sustained throughout forward and backward training throughout all gpus used in training. \n" ] }, { "cell_type": "code", "execution_count": null, "id": "fifty-swimming", "metadata": {}, "outputs": [], "source": [ "!rm -fr ./Megatron-LM/sv_ckpt/*" ] }, { "cell_type": "markdown", "id": "copyrighted-belarus", "metadata": {}, "source": [ "## Let's verify the environment is ready " ] }, { "cell_type": "code", "execution_count": null, "id": "chronic-bradley", "metadata": {}, "outputs": [], "source": [ "!nvidia-smi" ] }, { "cell_type": "code", "execution_count": null, "id": "personalized-walker", "metadata": {}, "outputs": [], "source": [ "!nvidia-smi nvlink --status " ] }, { "cell_type": "markdown", "id": "minimal-extreme", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "id": "prostate-trouble", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 2, "id": "industrial-index", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting data...\n", "using world size: 8, data-parallel-size: 8, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 \n", "using torch.float32 for parameters ...\n", "------------------------ arguments ------------------------\n", " accumulate_allreduce_grads_in_fp32 .............. False\n", " adam_beta1 ...................................... 0.9\n", " adam_beta2 ...................................... 0.999\n", " adam_eps ........................................ 1e-08\n", " adlr_autoresume ................................. False\n", " adlr_autoresume_interval ........................ 1000\n", " apply_query_key_layer_scaling ................... True\n", " apply_residual_connection_post_layernorm ........ False\n", " attention_dropout ............................... 0.1\n", " attention_softmax_in_fp32 ....................... False\n", " bert_binary_head ................................ True\n", " bert_load ....................................... None\n", " bf16 ............................................ False\n", " bias_dropout_fusion ............................. True\n", " bias_gelu_fusion ................................ True\n", " biencoder_projection_dim ........................ 0\n", " biencoder_shared_query_context_model ............ False\n", " block_data_path ................................. None\n", " checkpoint_activations .......................... True\n", " checkpoint_num_layers ........................... 1\n", " clip_grad ....................................... 1.0\n", " consumed_train_samples .......................... 
0\n", " consumed_valid_samples .......................... 0\n", " data_impl ....................................... mmap\n", " data_parallel_size .............................. 8\n", " data_path ....................................... ['../dataset/EN/NVblogs_text_document']\n", " dataloader_type ................................. single\n", " DDP_impl ........................................ local\n", " decoder_seq_length .............................. None\n", " distribute_checkpointed_activations ............. False\n", " distributed_backend ............................. nccl\n", " embedding_path .................................. None\n", " encoder_seq_length .............................. 512\n", " eod_mask_loss ................................... False\n", " eval_interval ................................... 100\n", " eval_iters ...................................... 10\n", " evidence_data_path .............................. None\n", " exit_duration_in_mins ........................... None\n", " exit_interval ................................... None\n", " ffn_hidden_size ................................. 4096\n", " finetune ........................................ False\n", " fp16 ............................................ False\n", " fp16_lm_cross_entropy ........................... False\n", " fp32_residual_connection ........................ False\n", " global_batch_size ............................... 8\n", " hidden_dropout .................................. 0.1\n", " hidden_size ..................................... 1024\n", " hysteresis ...................................... 2\n", " ict_head_size ................................... None\n", " ict_load ........................................ None\n", " img_dim ......................................... 224\n", " indexer_batch_size .............................. 128\n", " indexer_log_interval ............................ 1000\n", " init_method_std ................................. 0.02\n", " init_method_xavier_uniform ...................... False\n", " initial_loss_scale .............................. 4294967296\n", " kv_channels ..................................... 64\n", " layernorm_epsilon ............................... 1e-05\n", " lazy_mpu_init ................................... None\n", " load ............................................ ./Megatron-LM/sv_ckpt/\n", " local_rank ...................................... 0\n", " log_batch_size_to_tensorboard ................... False\n", " log_interval .................................... 10\n", " log_learning_rate_to_tensorboard ................ True\n", " log_loss_scale_to_tensorboard ................... True\n", " log_num_zeros_in_grad ........................... False\n", " log_params_norm ................................. False\n", " log_timers_to_tensorboard ....................... False\n", " log_validation_ppl_to_tensorboard ............... False\n", " loss_scale ...................................... None\n", " loss_scale_window ............................... 1000\n", " lr .............................................. 0.00015\n", " lr_decay_iters .................................. None\n", " lr_decay_samples ................................ None\n", " lr_decay_style .................................. cosine\n", " lr_warmup_fraction .............................. 0.01\n", " lr_warmup_iters ................................. 0\n", " lr_warmup_samples ............................... 0\n", " make_vocab_size_divisible_by .................... 
128\n", " mask_prob ....................................... 0.15\n", " masked_softmax_fusion ........................... True\n", " max_position_embeddings ......................... 512\n", " merge_file ...................................... ../dataset/EN/50k/gpt2-merges.txt\n", " micro_batch_size ................................ 1\n", " min_loss_scale .................................. 1.0\n", " min_lr .......................................... 1e-05\n", " mmap_warmup ..................................... False\n", " no_load_optim ................................... None\n", " no_load_rng ..................................... None\n", " no_save_optim ................................... None\n", " no_save_rng ..................................... None\n", " num_attention_heads ............................. 16\n", " num_channels .................................... 3\n", " num_classes ..................................... 1000\n", " num_layers ...................................... 16\n", " num_layers_per_virtual_pipeline_stage ........... None\n", " num_workers ..................................... 2\n", " onnx_safe ....................................... None\n", " openai_gelu ..................................... False\n", " optimizer ....................................... adam\n", " override_lr_scheduler ........................... False\n", " params_dtype .................................... torch.float32\n", " patch_dim ....................................... 16\n", " pipeline_model_parallel_size .................... 1\n", " query_in_block_prob ............................. 0.1\n", " rampup_batch_size ............................... None\n", " rank ............................................ 0\n", " reset_attention_mask ............................ False\n", " reset_position_ids .............................. False\n", " retriever_report_topk_accuracies ................ []\n", " retriever_score_scaling ......................... False\n", " retriever_seq_length ............................ 256\n", " sample_rate ..................................... 1.0\n", " save ............................................ ./Megatron-LM/sv_ckpt/\n", " save_interval ................................... 100\n", " scatter_gather_tensors_in_pipeline .............. True\n", " seed ............................................ 1234\n", " seq_length ...................................... 512\n", " sgd_momentum .................................... 0.9\n", " short_seq_prob .................................. 0.1\n", " split ........................................... 949,50,1\n", " tensor_model_parallel_size ...................... 1\n", " tensorboard_dir ................................. None\n", " tensorboard_log_interval ........................ 1\n", " tensorboard_queue_size .......................... 1000\n", " titles_data_path ................................ None\n", " tokenizer_type .................................. GPT2BPETokenizer\n", " train_iters ..................................... None\n", " train_samples ................................... 100\n", " use_checkpoint_lr_scheduler ..................... False\n", " use_contiguous_buffers_in_ddp ................... False\n", " use_cpu_initialization .......................... None\n", " use_one_sent_docs ............................... False\n", " virtual_pipeline_model_parallel_size ............ None\n", " vocab_extra_ids ................................. 0\n", " vocab_file ...................................... 
../dataset/EN/50k/gpt2-vocab.json\n", " weight_decay .................................... 0.01\n", " world_size ...................................... 8\n", "-------------------- end of arguments ---------------------\n", "setting number of micro-batches to constant 1\n", "> building GPT2BPETokenizer tokenizer ...\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", "> initializing torch distributed ...\n", "> initializing tensor model parallel with size 1\n", "> initializing pipeline model parallel with size 1\n", "> setting random seeds to 1234 ...\n", "> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234\n", "> compiling dataset index builder ...\n", "make: Entering directory '/home/zcharpy/bootcamp/jupyter_notebook/Megatron-LM/megatron/data'\n", "make: Nothing to be done for 'default'.\n", "make: Leaving directory '/home/zcharpy/bootcamp/jupyter_notebook/Megatron-LM/megatron/data'\n", ">>> done with dataset index builder. Compilation time: 0.573 seconds\n", "WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.\n", "> compiling and loading fused kernels ...\n", "Detected CUDA files, patching ldflags\n", "Emitting ninja build file /home/zcharpy/bootcamp/jupyter_notebook/Megatron-LM/megatron/fused_kernels/build/build.ninja...\n", "Building extension module scaled_upper_triang_masked_softmax_cuda...\n", "Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\n", "ninja: no work to do.\n", "Loading extension module scaled_upper_triang_masked_softmax_cuda...\n", "Detected CUDA files, patching ldflags\n", "Emitting ninja build file /home/zcharpy/bootcamp/jupyter_notebook/Megatron-LM/megatron/fused_kernels/build/build.ninja...\n", "Building extension module scaled_masked_softmax_cuda...\n", "Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\n", "ninja: no work to do.\n", "Loading extension module scaled_masked_softmax_cuda...\n", "Detected CUDA files, patching ldflags\n", "Emitting ninja build file /home/zcharpy/bootcamp/jupyter_notebook/Megatron-LM/megatron/fused_kernels/build/build.ninja...\n", "Building extension module fused_mix_prec_layer_norm_cuda...\n", "Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\n", "ninja: no work to do.\n", "Loading extension module fused_mix_prec_layer_norm_cuda...\n", ">>> done with compiling and loading fused kernels. 
Compilation time: 31.516 seconds\n", "time to initialize megatron (seconds): 13.548\n", "[after megatron is initialized] datetime: 2021-08-26 00:28:13 \n", "building GPT model ...\n", " > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 253577216\n", "setting training iterations to 12\n", "> learning rate decay style: cosine\n", "WARNING: could not find the metadata file ./Megatron-LM/sv_ckpt/latest_checkpointed_iteration.txt \n", " will not load any checkpoints and will start from random\n", "time (ms) | load-checkpoint: 30.87\n", "[after model, optimizer, and learning rate scheduler are built] datetime: 2021-08-26 00:28:14 \n", "> building train, validation, and test datasets ...\n", " > datasets target sizes (minimum size):\n", " train: 100\n", " validation: 80\n", " test: 80\n", "> building train, validation, and test datasets for GPT ...\n", " > building dataset index ...\n", " reading sizes...\n", " reading pointers...\n", " reading document index...\n", " creating numpy buffer of mmap...\n", " creating memory view of numpy buffer...\n", " > finished creating indexed dataset in 0.003097 seconds\n", " number of documents: 74\n", " > dataset split:\n", " train:\n", " document indices in [0, 70) total of 70 documents\n", " validation:\n", " document indices in [70, 74) total of 4 documents\n", " test:\n", " document indices in [74, 74) total of 0 documents\n", " > loading doc-idx mapping from ../dataset/EN/NVblogs_text_document_train_indexmap_100ns_512sl_1234s_doc_idx.npy\n", " > loading sample-idx mapping from ../dataset/EN/NVblogs_text_document_train_indexmap_100ns_512sl_1234s_sample_idx.npy\n", " > loading shuffle-idx mapping from ../dataset/EN/NVblogs_text_document_train_indexmap_100ns_512sl_1234s_shuffle_idx.npy\n", " loaded indexed file in 0.018 seconds\n", " total number of samples: 142\n", " total number of epochs: 1\n", " > loading doc-idx mapping from ../dataset/EN/NVblogs_text_document_valid_indexmap_80ns_512sl_1234s_doc_idx.npy\n", " > loading sample-idx mapping from ../dataset/EN/NVblogs_text_document_valid_indexmap_80ns_512sl_1234s_sample_idx.npy\n", " > loading shuffle-idx mapping from ../dataset/EN/NVblogs_text_document_valid_indexmap_80ns_512sl_1234s_shuffle_idx.npy\n", " loaded indexed file in 0.022 seconds\n", " total number of samples: 86\n", " total number of epochs: 11\n", "> finished creating GPT datasets ...\n", "[after dataloaders are built] datetime: 2021-08-26 00:28:24 \n", "done with setup ...\n", "training ...\n", "time (ms) | model-and-optimizer-setup: 548.83 | train/valid/test-data-iterators-setup: 10068.67\n", "[before the start of training step] datetime: 2021-08-26 00:28:24 \n", " iteration 10/ 12 | consumed samples: 80 | elapsed time per iteration (ms): 2141.9 | learning rate: 2.363E-05 | global batch size: 8 | lm loss: 9.601698E+00 | loss scale: 1.0 | grad norm: 1.856 | number of skipped iterations: 0 | number of nan iterations: 0 |\n", "time (ms) | forward-compute: 1252.78 | backward-compute: 643.07 | backward-params-all-reduce: 146.14 | backward-embedding-all-reduce: 0.06 | optimizer: 94.92 | batch-generator: 12.73\n", "[Rank 0] (after 10 iterations) memory (MB) | allocated: 3869.28369140625 | max allocated: 5229.60595703125 | reserved: 7306.0 | max reserved: 7306.0\n", "[after training is done] datetime: 2021-08-26 00:28:47 \n", "------------------------------------------------------------------------------------------------------------------saving checkpoint at iteration 12 to ./Megatron-LM/sv_ckpt/\n", "\n", " validation 
loss at the end of training for val data | lm loss value: 8.891883E+00 | lm loss PPL: 7.272700E+03 | \n", "------------------------------------------------------------------------------------------------------------------\n", " successfully saved checkpoint at iteration 12 to ./Megatron-LM/sv_ckpt/\n", "*****************************************\n", "Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. \n", "*****************************************\n", "Processing events...\n", "Capturing symbol files...\n", "Saving temporary \"/tmp/nsys-report-5e6b-aa12-9711-df33.qdstrm\" file to disk...\n", "Creating final output files...\n", "\n", "Processing [==============================================================100%]\n", "Saved report file to \"/tmp/nsys-report-5e6b-aa12-9711-df33.qdrep\"\n", "Exporting 1665820 events: [===============================================100%]\n", "\n", "Exported successfully to\n", "/tmp/nsys-report-5e6b-aa12-9711-df33.sqlite\n", "\n", "\n", "CUDA API Statistics:\n", "\n", " Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name \n", " ------- --------------- --------- ----------- ------- ----------- -------------------------------\n", " 56.2 285320355235 1696 168231341.5 3530 29703799378 cudaDeviceSynchronize \n", " 29.6 149955233137 4392 34142812.6 4864 17084387260 cudaMalloc \n", " 7.9 39849705340 2960 13462738.3 1275 2244485039 cudaFree \n", " 1.7 8434557988 132848 63490.3 4159 108060983 cudaLaunchKernel \n", " 1.6 8056133957 1136 7091667.2 5385 73711234 cudaHostAlloc \n", " 1.4 7314617708 23844 306769.7 6186 758561184 cudaMemcpyAsync \n", " 0.4 1978894053 720 2748464.0 352582 35632270 cudaIpcOpenMemHandle \n", " 0.3 1558283230 1064 1464551.9 5957 26847462 cudaFreeHost \n", " 0.3 1519497814 720 2110413.6 240178 21756247 cudaIpcCloseMemHandle \n", " 0.2 869040030 15104 57537.1 7662 32041419 cuLaunchKernel \n", " 0.1 760874918 14944 50915.1 5032 103123066 cudaMemsetAsync \n", " 0.1 416824863 63658 6547.9 361 79176267 cudaStreamIsCapturing_v10000 \n", " 0.1 289499706 16483 17563.5 1293 40910490 cudaEventQuery \n", " 0.1 283699184 21864 12975.6 643 60138070 cudaEventRecord \n", " 0.0 163990222 44832 3657.9 279 24748108 cudaStreamGetCaptureInfo_v10010\n", " 0.0 149197647 512 291401.7 2613 37658462 cudaStreamCreateWithPriority \n", " 0.0 138131763 4504 30668.7 523 32220166 cudaEventDestroy \n", " 0.0 113211871 2732 41439.2 2572 30272284 cudaStreamSynchronize \n", " 0.0 75811639 2848 26619.3 3170 29446146 cudaMemset \n", " 0.0 68895475 40 1722386.9 101401 32335797 cuModuleLoadData \n", " 0.0 36296853 1824 19899.6 6299 6783812 cudaMemcpy \n", " 0.0 13397443 40 334936.1 21288 4248804 cuModuleUnload \n", " 0.0 12237981 4520 2707.5 381 2838236 cudaEventCreateWithFlags \n", " 0.0 11913924 2784 4279.4 1276 3291031 cudaStreamWaitEvent \n", " 0.0 100758 24 4198.3 1730 10252 cuInit \n", "\n", "\n", "\n", "CUDA Kernel Statistics:\n", "\n", " Time(%) Total Time (ns) Instances Average Minimum Maximum Name \n", " ------- --------------- --------- ------------ ------- ----------- ----------------------------------------------------------------------------------------------------\n", " 88.2 260838216539 88 2964070642.5 9664 29673260794 ncclKernel_AllReduce_RING_LL_Sum_uint8_t(ncclWorkElem) \n", " 6.4 18834798439 280 67267137.3 13952 423636076 ncclKernel_AllReduce_RING_LL_Sum_float(ncclWorkElem) \n", " 1.3 3737595881 
13056 286274.2 100127 445790 volta_sgemm_128x32_tn \n", " 0.6 1922114218 6240 308031.1 88128 4093366 volta_sgemm_128x32_nt \n", " 0.6 1914679497 4528 422853.2 268063 4114387 volta_sgemm_128x64_tn \n", " 0.6 1628497447 6144 265054.9 91936 432223 volta_sgemm_128x32_nn \n", " 0.3 1019927165 16 63745447.8 12160 185146385 ncclKernel_AllReduce_RING_LL_Sum_int64_t(ncclWorkElem) \n", " 0.3 911914712 1632 558771.3 5824 807487 void multi_tensor_apply_kernel, AdamFunctor, float, float, float, floa…\n", " 0.2 624933625 5888 106136.8 96128 125600 volta_sgemm_64x64_nn \n", " 0.2 516397176 9608 53746.6 2495 2498036 void at::native::vectorized_elementwise_kernel<4, at::native::MulScalarFunctor, at::d…\n", " 0.1 370199942 96 3856249.4 3595413 4220668 volta_sgemm_64x32_sliced1x4_nn \n", " 0.1 365762037 5888 62119.9 53312 81952 volta_sgemm_64x64_tn \n", " 0.1 317760917 3072 103437.8 94048 120768 volta_sgemm_64x64_nt \n", " 0.1 291143645 5888 49447.0 45408 58304 void at::native::unrolled_elementwise_kernel(float*,…\n", " 0.1 251234285 1440 174468.3 3585 228607 void multi_tensor_apply_kernel, ScaleFunctor, float>(int, int v…\n", " 0.1 211542704 4352 48608.2 45728 53376 void (anonymous namespace)::softmax_warp_forward(float*, float const…\n", " 0.1 165985735 3168 52394.5 7648 57663 void at::native::(anonymous namespace)::fused_dropout_kernel_vec(…\n", " 0.1 154413148 7088 21785.2 7424 259423 void at::native::unrolled_elementwise_kernel, at::detail::Array, L2NormFunctor, float*, float*, bool, i…\n", " 0.0 118285804 6352 18621.8 2560 733214 void at::native::vectorized_elementwise_kernel<4, at::native::AddFunctor, at::detail::Array<…\n", " 0.0 105684444 4352 24284.1 21791 37184 kernel_1 \n", " 0.0 98824724 1904 51903.7 2496 70079 void at::native::vectorized_elementwise_kernel<4, at::native::MulFunctor, at::detail::Array<…\n", " 0.0 97740010 1536 63632.8 62784 68160 void (anonymous namespace)::softmax_warp_backward(float*, float cons…\n", " 0.0 82467209 8880 9286.8 6656 16928 void cuApplyLayerNorm(float*, float*, float*, float const*, int, int, float, f…\n", " 0.0 76246947 1632 46719.9 5920 52928 void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::masked_scale_k…\n", " 0.0 66689595 5952 11204.6 8864 18848 kernel_2 \n", " 0.0 65696173 4560 14407.1 2335 344991 void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor, at::detail::Array…\n", " 0.0 54335809 1536 35374.9 33120 40640 kernel_4 \n", " 0.0 51338187 4352 11796.5 8576 16288 void at::native::unrolled_elementwise_kernel, at::detail::Array(float*,…\n", " 0.0 33892087 3168 10698.3 8480 18304 void cuComputePartGradGammaBeta(float const*, float const*, int, int, float co…\n", " 0.0 29677485 3168 9367.9 8032 18464 void cuComputeGradInput(float const*, float const*, int, int, float const*, fl…\n", " 0.0 28793796 3072 9373.0 6816 16351 kernel_3 \n", " 0.0 25638427 176 145672.9 141248 151744 void at::native::reduce_kernel<512, 1, at::native::ReduceOp, unsig…\n", " 0.0 24934651 192 129868.0 6208 254687 void at::native::unrolled_elementwise_kernel, at::detail::Array(float const*, float const*, int, int, int, float*, float*)\n", " 0.0 4617682 192 24050.4 18080 34848 void at::native::(anonymous namespace)::embedding_backward_feature_kernel(long*…\n", " 0.0 4383059 352 12451.9 11072 14975 void at::native::(anonymous namespace)::indexSelectLargeIndex::Policy600…\n", " 0.0 1350339 352 3836.2 2911 5216 void cub::DeviceSelectSweepKernel, cub…\n", " 0.0 1254812 176 7129.6 6464 9217 void 
at::native::triu_tril_kernel(at::cuda::detail::TensorInfo, at::…\n", " 0.0 1248508 448 2786.8 2399 3840 void (anonymous namespace)::elementwise_kernel_with_index, at::detail::Array<…\n", " 0.0 942900 352 2678.7 2336 3552 void cub::DeviceCompactInitKernel, int*>(cub::ScanTileState, at::detail::A…\n", " 0.0 502590 176 2855.6 2655 3488 void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor, at::detail::Array<…\n", " 0.0 383140 96 3991.0 3744 4480 cleanup(float*, float*, float*, float*, bool, int) \n", " 0.0 319140 96 3324.4 2912 16032 void at::native::vectorized_elementwise_kernel<4, at::native::BUnaryFunctor" ] }, { "cell_type": "code", "execution_count": null, "id": "requested-clause", "metadata": {}, "outputs": [], "source": [ "!bash ./Megatron-LM/dlprof_2nd_run.sh" ] }, { "cell_type": "code", "execution_count": null, "id": "written-trace", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "sunrise-borough", "metadata": {}, "source": [ "----------------\n", "\n", "## **Challenge** - the best profile\n", "\n", "Using the exact same compute limitations (i.e., the number of GPUs you currently have):\n", "\n", "Task: modify the [profiling bash script](./Megatron-LM/dlprof_2nd_run.sh), jump back to the rerun cell above, and rerun it. \n", "Monitor the training runs until you reach an overall GPU utilization above 80% during **training** runs. \n", "\n", "```\n", " TENSOR_MP_SIZE=1\n", " PIPELINE_MP_SIZE=1\n", "\n", " #GPT Config \n", " LAYERS= \n", " HIDDEN_SIZE=\n", " ATTN_HEADS=\n", " MICRO_BZ=\n", " GB_BZ=\n", " SEQ_LEN=\n", " MAX_POS_EM=\n", "```\n", "\n", "See the hints below for one illustrative configuration and a quick way to watch GPU utilization while you iterate.\n" ] }
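, { "cell_type": "markdown", "id": "challenge-hints", "metadata": {}, "source": [ "### Hints\n", "\n", "While a training run is active, you can sample per-GPU utilization to check whether your configuration keeps all GPUs busy. The cell below is a minimal sketch using `nvidia-smi dmon` (it prints ten samples of the utilization counters, one per second); run it alongside an active training run.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "challenge-monitor", "metadata": {}, "outputs": [], "source": [ "# Sample per-GPU utilization (sm %) ten times while a training run is active.\n", "!nvidia-smi dmon -s u -c 10" ] }, { "cell_type": "markdown", "id": "challenge-config-sketch", "metadata": {}, "source": [ "One illustrative way to fill in the template, assuming a single node with 8 GPUs (a starting point to iterate from, not a known-best answer):\n", "\n", "```\n", " TENSOR_MP_SIZE=2\n", " PIPELINE_MP_SIZE=1\n", "\n", " #GPT Config \n", " LAYERS=16\n", " HIDDEN_SIZE=1024\n", " ATTN_HEADS=16\n", " MICRO_BZ=8\n", " GB_BZ=64\n", " SEQ_LEN=512\n", " MAX_POS_EM=512\n", "```\n", "\n", "Whatever values you choose, keep them consistent: `HIDDEN_SIZE` must be divisible by `ATTN_HEADS`, `SEQ_LEN` cannot exceed `MAX_POS_EM`, and `GB_BZ` must be divisible by `MICRO_BZ` times the data-parallel size (the number of GPUs divided by `TENSOR_MP_SIZE` x `PIPELINE_MP_SIZE`).\n" ] }, { "cell_type": "markdown", "id": "streaming-artist", "metadata": {}, "source": [ "-----\n", "\n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }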