{ "cells": [ { "cell_type": "markdown", "id": "special-singer", "metadata": {}, "source": [ "# Estimate Time\n", "---\n", "\n", "## Learning Objectives\n", "The goal of this lab is to estimate compute time needed for an end to end training run.\n", "\n", "**Motivation**: In order to request for computing resources for a training job on a cluster, one must provide information such as, the number of nodes/gpus and the estimated time of the training job run.\n", "\n", "Training time (in seconds) is approximated with this equation : 8*T*P/n*X\n", "\n", "- T = dataset size measured in numbers of tokens in the dataset\n", "- P = model parameters for GPT3 varients\n", "- n = number of GPUs in the compute cluster\n", "- x = achieved teraflops per GPU \n", "\n", "\n", "The above equation was extracted from this paper : [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf)\n", "\n", "---------------------------------------------------------------------------------------------------\n", "\n", "Assets provided below for you convenience : \n", "\n", "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "complicated-reproduction", "metadata": {}, "source": [ "---\n", "## Sanity check - \n", "\n", "\n", "\n", "Two scenarios were extracted from the above paper (screenshot above) : [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf) \n", "\n", "**Scenario 1** -\n", "\n", "T = 300Billion tokens # assumed data size measured in tokens\n", "\n", "P = 175 Billion GPT3 model\n", "\n", "n = 1024 GPUs\n", "\n", "x = 140 teraFLOP/s per GPU\n", "\n", "Question : How many hours/ days will you need given the scenaio above for you to compute an end to end training job ?\n", "\n", "Answer : We should observe around **34 days** for an end to end training run\n", "\n", "\n", "**Scenario 2** - \n", "\n", "T = 450 Billion tokens \n", "\n", "P = 1 Trillion parameters GPT 3 model\n", "\n", "n = 3072 \n", "\n", "x = 163 teraFLOP/s per GPU \n", "\n", "Question: How many hours/ days will you need given this scenaio above for you to compute an end to end training job ?\n", "\n", "Answer: We should observe around **84 days** for an end to end training run\n" ] }, { "cell_type": "code", "execution_count": null, "id": "hundred-array", "metadata": {}, "outputs": [], "source": [ "# The following code block contain automatic functions which assist calculating time-to-compute for an end to end training run.\n", "import numpy as np\n", "# T = dataset size measured in numbers of tokens in the dataset\n", "# P = model parameters for GPT3 varients\n", "# n = number of GPUs in the compute cluster\n", "# x = achieved teraflops per GPU \n", "\n", "def calculate_days_needed(T , P , n ,x):\n", " if x is None:\n", " return 'not a good SuperPOD use case, let us try a bigger model :)'\n", " else: \n", " tot=8*T*P\n", " div=n*x\n", " compute_sec=tot/div\n", " #convert compute seconds to days\n", " to_days=round(compute_sec/(3600*24),1)\n", " return to_days\n", "## sanity check against the two scenarios above \n", "T=[300*1e+9, 450*1e+9]\n", "n=[1024,3072]\n", "GPT3_models_labels=[ 'gpt3_175B','gpt3_1Trillion']\n", "GPT3_model_params=[ 175*1e+9,1*1e+12 ]\n", "GPT3_model_params_str=['175 Billion','1Trillion']\n", "#according to the table above\n", "GPT3_X=[140*1e+12,163*1e+12]\n", "print(\"all below are measured with dataset size **300 billion** measured in tokens \\n\")\n", "scene=1\n", "for gpt3_name, gpt3_params, gpt3_param_str, x, n_,t in zip(GPT3_models_labels,GPT3_model_params,GPT3_model_params_str, GPT3_X ,n,T):\n", " days_needed=calculate_days_needed(t,gpt3_params,n_,x)\n", " print(\" ----------------------------scenario {}-----------------------------------\".format(scene))\n", " print(\" language model :{} with {} number of parameters , it will need {} days to compute \\n\".format(gpt3_name, gpt3_param_str, str(days_needed)))\n", " scene+=1" ] }, { "cell_type": "markdown", "id": "noted-sense", "metadata": {}, "source": [ "Below is an example of expected outputs :\n", "\n", " ----------------------------scenario 1-----------------------------------\n", " language model :gpt3_175B with 175 Billion number of parameters , it will need 33.9 days to compute \n", "\n", " ----------------------------scenario 2-----------------------------------\n", " language model :gpt3_1Trillion with 1Trillion number of parameters , it will need 83.2 days to compute\n" ] }, { "cell_type": "markdown", "id": "indie-schema", "metadata": {}, "source": [ "---\n", "**Exercise** -\n", "\n", "For a GPT3 model size of 70B parameters with approximatedly 300 Billion 
{ "cell_type": "markdown", "id": "indie-schema", "metadata": {}, "source": [
 "---\n",
 "**Exercise**\n",
 "\n",
 "You have a GPT-3 model with 70B parameters and an existing dataset of approximately 300 billion tokens.\n",
 "You have requested 1/4 of the total number of GPUs available in [BerzeLiUs](https://www.nsc.liu.se/support/systems/berzelius-getting-started/).\n",
 "\n",
 "Question:\n",
 "\n",
 "How many hours/days would you need for an end-to-end training run?\n"
 ] }, { "cell_type": "code", "execution_count": null, "id": "cosmetic-gregory", "metadata": {}, "outputs": [], "source": [
 "# fill in T (tokens), gpt3_params (parameters), n (GPUs), and x (teraFLOP/s per GPU)\n",
 "T = \n",
 "gpt3_params = \n",
 "n = \n",
 "x = \n",
 "calculate_days_needed(T, gpt3_params, n, x)\n"
 ] },
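{ "cell_type": "markdown", "id": "exercise-sketch", "metadata": {}, "source": [
 "One possible solution sketch follows. Note the assumptions: the total GPU count for Berzelius (assumed here to be 480 A100 GPUs, so a 1/4 allocation is 120 GPUs) and the achieved throughput (assumed ~140 teraFLOP/s per GPU, borrowed from Scenario 1) are not given by this lab; verify the current system size via the BerzeLiUs link above before relying on the result.\n"
 ] }, { "cell_type": "code", "execution_count": null, "id": "exercise-sketch-code", "metadata": {}, "outputs": [], "source": [
 "# A sketch under stated assumptions, not the official solution\n",
 "T = 300 * 1e9           # 300 billion tokens (given in the exercise)\n",
 "gpt3_params = 70 * 1e9  # 70B-parameter GPT-3 model (given in the exercise)\n",
 "n = 480 // 4            # ASSUMPTION: Berzelius totals 480 GPUs, so 1/4 allocation = 120\n",
 "x = 140 * 1e12          # ASSUMPTION: achieved teraFLOP/s per GPU, borrowed from Scenario 1\n",
 "calculate_days_needed(T, gpt3_params, n, x)  # -> roughly 115.7 days under these assumptions\n"
 ] },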
" ] }, { "cell_type": "markdown", "id": "silent-kruger", "metadata": {}, "source": [ "-----\n", "\n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }