{ "cells": [ { "cell_type": "markdown", "id": "frozen-circumstances", "metadata": {}, "source": [ "# Profiling Megatron-LM training\n", "---\n", "\n", "## Learning Objectives\n", "\n", "The goal of this lab is to profile Megatron-LM GPT model training runs under varying training configurations, in order to verify that GPU performance holds up across multi-GPU or multi-node workloads.\n", "\n", "\n", "**Motivation**: Why should we care about profiling?\n", " \n", "The estimated time-to-compute that we worked through in `Lab1-2_EstimateComputeDaysNeeded.ipynb` assumes that the training run achieves good GPU performance across multi-GPU or multi-node jobs. A poor training configuration can result in low or inconsistent GPU utilization, which in turn can prolong the training run.\n", "\n", "In this notebook, we will cover the following:\n", "\n", " 1. Introduction to the NVIDIA profiling toolchain\n", " 2. Running the profiler to record training runs: naive vs. improved runs\n", " \n", "A challenge will be presented at the end of this notebook: you are tasked with beating the profile of the improved run.\n", "\n", "Use the knowledge gained from `Lab1-2_EstimateComputeDaysNeeded.ipynb` and the profiling lecture presentations to formulate a training-configuration strategy that yields a winning profile.\n", "\n", "Note: TAs and the NVIDIA profiling expert will be available during this session while you work through this notebook. Do reach out to them if you have questions." ] }, { "cell_type": "markdown", "id": "fatal-neutral", "metadata": {}, "source": [ "---\n", "\n", "1. Introduction to the NVIDIA profiling toolchain:\n", "\n", "