{ "cells": [ { "cell_type": "markdown", "id": "proud-packet", "metadata": {}, "source": [ "# 1_Estimate compute hours/days needed to execute one end-to-end run\n", "---\n", "\n", "## Learning Objectives\n", "- **The goal of this lab is to:**\n", "Understand how to reserve compute resources for a given data volume and model configuration for a training run. This is important not only for cluster capacity planning, but also for strategic research planning (how many end-to-end experiments one can run given the available compute capacity and duration).\n", "\n", " - Extract the formula from the paper [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf) for given [GPT-3 variants](https://arxiv.org/pdf/2005.14165.pdf), based on the assumed [teraFLOP/s reference table](https://arxiv.org/pdf/2104.04473.pdf)\n", " - Understand how to estimate the compute needed for a given dataset volume (measured in number of tokens)\n", " - Apply the estimate to your own (or an imaginary) data volume and compute cluster set-up\n", "---------------------------------------------------------------------------------------------------\n", "\n", "Assume the following notation:\n", "- T = dataset size, measured in number of tokens\n", "- P = number of model parameters of the GPT-3 variant\n", "- n = number of GPUs in the compute cluster\n", "- x = achieved throughput per GPU, in FLOP/s\n", "\n", "The paper approximates the end-to-end training time as roughly 8 * T * P / (n * x) seconds.\n", "\n", "You will need the following tables from the above papers for the estimation:\n", "\n", "*(table image not included: GPT-3 model variants and configurations)*\n", "\n", "*(table image not included: achieved teraFLOP/s per GPU)*\n", "\n", "*(table image not included: estimated training times)*
" ] }, { "cell_type": "markdown", "id": "political-oriental", "metadata": {}, "source": [ "---\n", "## Let's do a sanity check\n", "Scenario 1 - given 300 billion tokens, 1024 GPUs, 175 billion model parameters, and an assumed 140 teraFLOP/s per GPU, we should observe around **34 days** for an end-to-end training run.\n", "\n", "Scenario 2 - given 450 billion tokens, 3072 GPUs, 1 trillion model parameters, and an assumed 163 teraFLOP/s per GPU, we should observe around **84 days** for an end-to-end training run.\n" ] },
{ "cell_type": "code", "execution_count": 16, "id": "linear-collector", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ----------------------------------------------------------------------------------------\n", " language model gpt3_175B with 175 Billion parameters will need 33.9 days to train \n", "\n", " ----------------------------------------------------------------------------------------\n", " language model gpt3_1Trillion with 1 Trillion parameters will need 83.2 days to train \n", "\n" ] } ], "source": [ "def calculate_days_needed(T, P, n, x):\n", "    # estimated end-to-end training time: roughly 8*T*P / (n*x) seconds, converted to days\n", "    if x is None:\n", "        return 'not a good SuperPOD use case, let us try a bigger model :)'\n", "    tot = 8 * T * P      # total floating-point operations for the full run\n", "    div = n * x          # aggregate cluster throughput in FLOP/s\n", "    compute_sec = tot / div\n", "    # convert compute seconds to days\n", "    return round(compute_sec / (3600 * 24), 1)\n", "\n", "# sanity check against the figures reported in the paper\n", "T = [300 * 1e+9, 450 * 1e+9]\n", "n = [1024, 3072]\n", "GPT3_models_labels = ['gpt3_175B', 'gpt3_1Trillion']\n", "GPT3_model_params = [175 * 1e+9, 1 * 1e+12]\n", "GPT3_model_params_str = ['175 Billion', '1 Trillion']\n", "# achieved teraFLOP/s per GPU, according to the table above\n", "GPT3_X = [140 * 1e+12, 163 * 1e+12]\n", "for gpt3_name, gpt3_params, gpt3_param_str, x, n_, t in zip(GPT3_models_labels, GPT3_model_params, GPT3_model_params_str, GPT3_X, n, T):\n", "    days_needed = calculate_days_needed(t, gpt3_params, n_, x)\n", "    print(\" ----------------------------------------------------------------------------------------\")\n", "    print(\" language model {} with {} parameters will need {} days to train \\n\".format(gpt3_name, gpt3_param_str, days_needed))\n" ] },
{ "cell_type": "markdown", "id": "fancy-hollywood", "metadata": {}, "source": [ "---\n", "## Exercise\n", "Question: for a GPT-3 model with 70B parameters and approximately 300 billion tokens in the existing dataset, given a quarter of the Berzelius compute availability, how many hours/days would you need for training?\n", "\n", "When you are ready, uncollapse the hidden cell below to check your answer against the solution.\n", "**. . .**\n" ] },
{ "cell_type": "code", "execution_count": 5, "id": "interior-technology", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true, "source_hidden": true } }, "outputs": [ { "data": { "text/plain": [ "115.7" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "T = 300 * 1e+9  # number of tokens in the dataset\n", "n = int(480 * 0.25)  # a quarter of Berzelius' 480 GPUs\n", "x = 140 * 1e+12  # assumed achieved FLOP/s per GPU (140 teraFLOP/s)\n", "gpt3_params = 70 * 1e+9\n", "calculate_days_needed(T, gpt3_params, n, x)" ] },
{ "cell_type": "markdown", "id": "distinguished-electricity", "metadata": {}, "source": [ "---\n", "## Up Next:\n", "\n", "[Understanding the core of Megatron - mpu](./Day2-2_MegatronFundementals.ipynb)\n", "\n", "## Back To Start Menu\n", "[start menu](../Start_Here.ipynb)" ] },
{ "cell_type": "markdown", "id": "brilliant-delta", "metadata": {}, "source": [ "-----\n", "\n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }
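As a standalone reference outside the notebook, the estimation rule used in this lab (training time of roughly 8\*T\*P / (n\*x) seconds) can be condensed into a minimal Python sketch; the three calls reproduce the two sanity-check scenarios and the exercise solution quoted above:

```python
def calculate_days_needed(T, P, n, x):
    """Estimated end-to-end training days: 8*T*P total FLOPs over n*x FLOP/s."""
    compute_sec = 8 * T * P / (n * x)
    return round(compute_sec / (3600 * 24), 1)

# Scenario 1: 300B tokens, 175B-parameter model, 1024 GPUs at 140 teraFLOP/s each
print(calculate_days_needed(300e9, 175e9, 1024, 140e12))   # -> 33.9
# Scenario 2: 450B tokens, 1T-parameter model, 3072 GPUs at 163 teraFLOP/s each
print(calculate_days_needed(450e9, 1e12, 3072, 163e12))    # -> 83.2
# Exercise: 300B tokens, 70B-parameter model, a quarter of Berzelius (120 GPUs)
print(calculate_days_needed(300e9, 70e9, 120, 140e12))     # -> 115.7
```

Note that the linear dependence on T, P, and 1/(n*x) means you can also invert the formula, for example to ask how many tokens fit in a fixed compute budget.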