{ "cells": [ { "cell_type": "markdown", "id": "inappropriate-diagnosis", "metadata": {}, "source": [ "# Estimate Time\n", "---\n", "\n", "## Learning Objectives\n", "The goal of this lab is to estimate the compute time needed for an end-to-end training run.\n", "\n", "**Motivation**: To request computing resources for a training job on a supercomputing cluster, one must provide information such as the number of nodes/GPUs requested as well as the estimated time for an end-to-end training run.\n", "\n", "Note: now that we have obtained the toy text data via web scraping in the previous lab, the next step is to request compute resources. To train a very large language model, substantial compute resources must be requested and approved in advance. For the cluster admin to allocate resources to the training job, the applicant must provide, at minimum, the number of GPUs needed as well as an estimate of how long one end-to-end training run will take.\n", "\n", "That is what this notebook sets out to do: estimate the training time from the parameters given below.\n", "\n", "Training time (in seconds) is approximated by: `8*T*P / (n*X)`\n", "\n", "- T = dataset size, measured in number of tokens\n", "- P = number of model parameters (for the GPT-3 variants)\n", "- n = number of GPUs in the compute cluster\n", "- X = achieved FLOP/s per GPU\n", "\n", "For example, for GPT-3 175B (P = 175e9) trained on T = 300e9 tokens with n = 1024 GPUs each achieving X = 140 teraFLOP/s, this gives 8 × 300e9 × 175e9 / (1024 × 140e12) ≈ 2.9e6 seconds, i.e. roughly 34 days.\n", "\n", "The above equation was taken from this paper: [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf)\n", "\n", "---------------------------------------------------------------------------------------------------\n", "\n", "Assets provided below for your convenience: \n", "\n", "