{ "cells": [ { "cell_type": "markdown", "id": "strong-match", "metadata": {}, "source": [ "# Estimate compute hours/days needed to execute one end-to-end run\n", "---\n", "\n",
"## Learning Objectives\n", "The goal of this lab is to size the problem:\n", "understanding how to calculate the hours/days needed to reserve compute resources for a training job, given an existing data volume and a desired model size.\n", "This matters both for cluster admins doing capacity forecasting and for researchers planning their experiments strategically.\n", "\n",
"- Extracting the formula from the paper [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf), applied to the [GPT-3 variants](https://arxiv.org/pdf/2005.14165.pdf), using the assumed [teraFLOP/s reference table](https://arxiv.org/pdf/2104.04473.pdf)\n", "\n",
"- Understanding how to estimate the compute resources needed for a given dataset volume (measured in number of tokens) and a chosen model size\n", "\n",
"- Applying the estimate to your own imaginary data volume and a figurative compute cluster set-up\n", "---------------------------------------------------------------------------------------------------\n", "\n",
"Assume the following notation:\n", "- T = dataset size, measured in number of tokens\n", "- P = number of model parameters of the GPT-3 variant\n", "- n = number of GPUs in the compute cluster\n", "- x = achieved teraFLOP/s per GPU (converted to FLOP/s in the code)\n", "\n",
"Training time (in seconds) is approximated with this equation: 8*T*P / (n*x)\n", "\n",
"You will need two tables from the papers above for the estimation: the GPT-3 model-variant table and the achieved teraFLOP/s per GPU table.\n", "\n",
"*(The table images are omitted here; the specific values used in this lab appear in the scenarios and code below.)*" ] },
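{ "cell_type": "markdown", "id": "formula-background", "metadata": {}, "source": [ "### Where the factor of 8 comes from (a rough sketch)\n", "\n",
"This note is an informal reading of the estimate, not a quotation from the paper; see the Megatron paper above for the precise derivation. Training a decoder-only transformer costs roughly 6 floating-point operations per parameter per token (about 2 for the forward pass and about 4 for the backward pass), and recomputing activations during the backward pass adds roughly one extra forward pass, for a total of about 8 FLOPs per parameter per token. Dividing the total work by the cluster's aggregate achieved throughput gives the wall-clock time:\n", "\n",
"$$\\text{training time (seconds)} \\approx \\frac{8 \\, T \\, P}{n \\, x}$$\n", "\n",
"Unit check: the numerator is total FLOPs (about 8 FLOPs per parameter per token, times T tokens, times P parameters), and the denominator n*x is FLOP/s across the whole cluster, so the ratio comes out in seconds. The reference table quotes x in teraFLOP/s, so the code below multiplies those values by 1e12." ] },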
{ "cell_type": "markdown", "id": "pregnant-basket", "metadata": {}, "source": [ "---\n", "## Let's do a sanity check\n", "\n",
"**Assumption**: you have an existing dataset and you know its volume (measured in number of tokens).\n", "\n",
"**Scenario 1** - Given 300 billion tokens, you want to train a 175-billion-parameter GPT-3 model, you have access to 1024 GPUs, and the paper's reference table gives 140 teraFLOP/s per GPU.\n", "\n",
"Question: How many hours/days will you need, in the scenario above, for an end-to-end training job?\n", "\n",
"Answer: We should observe around **34 days** for an end-to-end training run.\n", "\n", "---\n", "\n",
"**Scenario 2** - You increase the data volume to 450 billion tokens, you want to train a bigger model, say 1 trillion parameters, you have access to 3072 GPUs, and the paper's reference table gives 163 teraFLOP/s per GPU.\n", "\n",
"Question: How many hours/days will you need, in this scenario, for an end-to-end training job?\n", "\n",
"Answer: We should observe around **84 days** for an end-to-end training run.\n" ] },
{ "cell_type": "code", "execution_count": 16, "id": "played-broadcast", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimated end-to-end training time for each GPT-3 variant:\n", "\n", " ----------------------------------------------------------------------------------------\n", " language model: gpt3_175B with 175 Billion parameters, trained on 300 billion tokens, will need 33.9 days to compute \n", "\n", " ----------------------------------------------------------------------------------------\n", " language model: gpt3_1Trillion with 1 Trillion parameters, trained on 450 billion tokens, will need 83.2 days to compute \n", "\n" ] } ], "source": [
"# T = dataset size, measured in number of tokens\n", "# P = number of model parameters of the GPT-3 variant\n", "# n = number of GPUs in the compute cluster\n", "# x = achieved FLOP/s per GPU\n", "\n",
"def calculate_days_needed(T, P, n, x):\n", "    if x is None:\n", "        return 'not a good SuperPOD use case, let us try a bigger model :)'\n", "    else:\n", "        total_flops = 8 * T * P              # total training FLOPs\n", "        cluster_flops_per_sec = n * x        # aggregate achieved throughput\n", "        compute_sec = total_flops / cluster_flops_per_sec\n", "        # convert compute seconds to days\n", "        to_days = round(compute_sec / (3600 * 24), 1)\n", "        return to_days\n", "\n",
"## sanity check against the figures reported in the paper\n", "T = [300 * 1e+9, 450 * 1e+9]                  # tokens for Scenario 1 and Scenario 2\n", "n = [1024, 3072]                              # GPUs for Scenario 1 and Scenario 2\n", "GPT3_models_labels = ['gpt3_175B', 'gpt3_1Trillion']\n", "GPT3_model_params = [175 * 1e+9, 1 * 1e+12]\n", "GPT3_model_params_str = ['175 Billion', '1 Trillion']\n", "# achieved teraFLOP/s per GPU, according to the paper's reference table\n", "GPT3_X = [140 * 1e+12, 163 * 1e+12]\n", "\n",
"print(\"Estimated end-to-end training time for each GPT-3 variant:\\n\")\n", "for gpt3_name, gpt3_params, gpt3_param_str, x, n_, t in zip(GPT3_models_labels, GPT3_model_params, GPT3_model_params_str, GPT3_X, n, T):\n", "    days_needed = calculate_days_needed(t, gpt3_params, n_, x)\n", "    print(\" ----------------------------------------------------------------------------------------\")\n", "    print(\" language model: {} with {} parameters, trained on {:.0f} billion tokens, will need {} days to compute \\n\".format(gpt3_name, gpt3_param_str, t / 1e9, days_needed))" ] },
{ "cell_type": "markdown", "id": "ruled-score", "metadata": {}, "source": [ "---\n", "## Exercise\n", "\n",
"Question: for a GPT-3 model with 70 billion parameters, approximately 300 billion tokens in the existing dataset, and 1/4 of the Berzelius cluster's compute available,\n", "how many hours/days would you need for an end-to-end training run?\n", "\n",
"When you are ready, uncollapse the cell below to check your answer against the solution.\n", "\n", "**. . .**\n" ] },
{ "cell_type": "code", "execution_count": 5, "id": "circular-northwest", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "data": { "text/plain": [ "115.7" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [
"T = 300 * 1e+9           # number of tokens in the dataset\n", "n = int(480 * 0.25)      # 1/4 of Berzelius (max 480 GPUs)\n", "x = 140 * 1e+12          # assumed achieved FLOP/s per GPU\n", "gpt3_params = 70 * 1e+9  # 70 billion parameters\n", "calculate_days_needed(T, gpt3_params, n, x)" ] },
{ "cell_type": "markdown", "id": "peaceful-colorado", "metadata": {}, "source": [ "---\n", "\n", "## Additional Resources\n", "\n", "Efficient Large-Scale Language Model Training on GPU Clusters: https://arxiv.org/pdf/2104.04473.pdf\n", "\n", "Language Models are Few-Shot Learners: https://arxiv.org/pdf/2005.14165.pdf\n", "\n", "Scaling Laws for Neural Language Models: https://arxiv.org/pdf/2001.08361.pdf" ] },
{ "cell_type": "markdown", "id": "northern-fellowship", "metadata": {}, "source": [ "---\n", "## Up Next\n", "\n", "[Understanding the core of Megatron - mpu](./Day2-2_MegatronFundementals.ipynb)\n", "\n", "## Back To Start Menu\n", "[start menu](../Start_Here.ipynb)" ] },
{ "cell_type": "markdown", "id": "linear-culture", "metadata": {}, "source": [ "-----\n", "\n", "## Licensing\n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0) license." ] }
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }