{ "cells": [ { "cell_type": "markdown", "id": "proud-packet", "metadata": {}, "source": [ "# \n", "\n", "# 1_Estimate compute hours/days needed to execute one end-to-end run\n", "---\n", "\n", "## Learning Objectives\n", "- **The goal of this lab is to:**\n", "Understand how to reserve compute resources for a given data volume and model configuration for a training run. This is important not only for cluster capacity planning but also for strategic research planning (how many end-to-end experiments one can run given the compute capacity and duration)\n", "\n", " - Extracting the formula from the paper [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/pdf/2104.04473.pdf), for the given [GPT3 variants](https://arxiv.org/pdf/2005.14165.pdf), based on the assumed [Teraflops reference table](https://arxiv.org/pdf/2104.04473.pdf)\n", " - Understanding how to estimate the compute needed per dataset volume (measured in number of tokens)\n", " - Applying it to your own or imaginary data volume and compute cluster set-ups\n", "---------------------------------------------------------------------------------------------------\n", "\n", "- assuming the following information \n", "- T = dataset size, measured in number of tokens in the dataset\n", "- P = number of model parameters for the GPT3 variants\n", "- n = number of GPUs in the compute cluster\n", "- x = achieved teraFLOP/s per GPU \n", "\n", "you will need the following tables from the above papers for the estimation \n", "\n", "
---
## Let's do a sanity check
Scenario 1 - Given 300 billion tokens, 1024 GPUs, and a model with 175 billion parameters, assuming 140 teraFLOP/s per GPU,
we should observe around **34 days** for an end-to-end training run.

Scenario 2 - Given 450 billion tokens, 3072 GPUs, and a model with 1 trillion parameters, assuming 163 teraFLOP/s per GPU,
we should observe around **84 days** for an end-to-end training run.
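
Both scenarios can be worked through by hand with the estimate this lab uses, training time $\approx 8TP/(nx)$, where $T$, $P$, $n$ and $x$ are the quantities defined above and $x$ is expressed in FLOP/s. The arithmetic below is a sketch of that calculation, not a figure copied from the paper, and the results are rounded.

$$
t_1 \approx \frac{8 \cdot (300 \times 10^{9}) \cdot (175 \times 10^{9})}{1024 \cdot (140 \times 10^{12})} \approx 2.93 \times 10^{6}\ \text{s} \approx 33.9\ \text{days}
$$

$$
t_2 \approx \frac{8 \cdot (450 \times 10^{9}) \cdot (1 \times 10^{12})}{3072 \cdot (163 \times 10^{12})} \approx 7.19 \times 10^{6}\ \text{s} \approx 83.2\ \text{days}
$$

Dividing seconds by $3600 \cdot 24 = 86400$ gives days. Both values land within about 1% of the quoted 34 and 84 days.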

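import numpy as np

def calculate_days_needed(T, P, n, x):
    """Estimate end-to-end training time in days as 8*T*P / (n*x).

    T: number of tokens in the dataset
    P: number of model parameters
    n: number of GPUs in the cluster
    x: achieved FLOP/s per GPU (e.g. 140*1e+12 for 140 teraFLOP/s)
    """
    if x is None:
        # no achieved-throughput figure was supplied for this configuration
        return 'not a good SuperPOD use case, let us try a bigger model :)'
    else:
        tot = 8*T*P          # total floating-point operations for the run
        div = n*x            # aggregate FLOP/s across the cluster
        compute_sec = tot/div
        # convert compute seconds to days
        to_days = round(compute_sec/(3600*24), 1)
        return to_days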
## sanity check against the paper-reported figures above
T=[300*1e+9, 450*1e+9]                 # tokens per scenario
n=[1024,3072]                          # GPUs per scenario
GPT3_models_labels=['gpt3_175B','gpt3_1Trillion']
GPT3_model_params=[175*1e+9,1*1e+12]
GPT3_model_params_str=['175 Billion','1 Trillion']
# achieved FLOP/s per GPU, according to the table above
GPT3_X=[140*1e+12,163*1e+12]
print("dataset sizes below are measured in tokens: 300 billion for scenario 1, 450 billion for scenario 2 \n")
for gpt3_name, gpt3_params, gpt3_param_str, x, n_, t in zip(GPT3_models_labels, GPT3_model_params, GPT3_model_params_str, GPT3_X, n, T):
    days_needed=calculate_days_needed(t,gpt3_params,n_,x)
    print(" ----------------------------------------------------------------------------------------")
    print(" language model: {} with {} parameters will need {} days to compute \n".format(gpt3_name, gpt3_param_str, str(days_needed)))
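
The sanity check can also be made self-verifying, so a typo in any input shows up as a failed assertion rather than a plausible-looking number. The sketch below is self-contained and illustrative: `estimate_training_days` and the 2% tolerance are assumptions introduced here, not part of the original lab, and the quoted values are the rounded 34 and 84 days from the scenarios.

```python
def estimate_training_days(tokens, params, n_gpus, flops_per_gpu):
    """Days for one end-to-end run, using time ~= 8*T*P / (n*x)."""
    seconds = 8 * tokens * params / (n_gpus * flops_per_gpu)
    return seconds / (3600 * 24)

# (label, T tokens, P parameters, n GPUs, x FLOP/s per GPU, quoted days)
scenarios = [
    ("gpt3_175B", 300e9, 175e9, 1024, 140e12, 34),
    ("gpt3_1Trillion", 450e9, 1e12, 3072, 163e12, 84),
]

results = {}
for label, T, P, n, x, quoted in scenarios:
    days = estimate_training_days(T, P, n, x)
    results[label] = days
    # Assumed tolerance: the quoted figures are rounded, so allow a 2% gap.
    assert abs(days - quoted) / quoted < 0.02, label
    print(f"{label}: estimated {days:.1f} days, quoted ~{quoted} days")
```

The function returns an unrounded float so the comparison against the quoted figure is not affected by rounding to one decimal place.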

---
## Exercise
Question:
For a GPT3 model with 70B parameters and approximately 300 billion tokens in the existing dataset,
given 1/4 of the BerzeLiUs compute availability,
how many hours/days would you need for one end-to-end training run?
When you are ready, uncollapse the solution below and check your answer against it.
**. . .**

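T=300*1e+9 # number of tokens in the dataset
n=int(480*0.25) # number of GPUs in the compute cluster: 1/4 of the Berzelius maximum of 480 GPUs
x=140*1e+12 # assumed achieved FLOP/s per GPU (140 teraFLOP/s)
gpt3_params=70*1e+9 # number of model parameters
calculate_days_needed(T,gpt3_params,n,x)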
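
Since the exercise asks for hours as well as days, here is a minimal self-contained sketch that reports both and also inverts the same estimate to ask how many GPUs would be needed to finish within a deadline. The helper names and the 30-day target are hypothetical, chosen only for illustration. The 140 teraFLOP/s figure is the value assumed in the solution above, not a measured throughput for a 70B model.

```python
import math

SECONDS_PER_DAY = 3600 * 24

def training_seconds(tokens, params, n_gpus, flops_per_gpu):
    """Seconds for one end-to-end run, using time ~= 8*T*P / (n*x)."""
    return 8 * tokens * params / (n_gpus * flops_per_gpu)

def gpus_needed(tokens, params, flops_per_gpu, target_days):
    """Smallest whole number of GPUs that meets target_days under the same estimate."""
    total_flops = 8 * tokens * params
    return math.ceil(total_flops / (flops_per_gpu * target_days * SECONDS_PER_DAY))

T = 300e9              # tokens in the dataset
P = 70e9               # model parameters
x = 140e12             # assumed achieved FLOP/s per GPU
n = int(480 * 0.25)    # 1/4 of the 480 Berzelius GPUs = 120

seconds = training_seconds(T, P, n, x)
hours = seconds / 3600
days = seconds / SECONDS_PER_DAY
print(f"{n} GPUs: about {hours:.0f} hours ({days:.1f} days)")

target_days = 30       # hypothetical deadline, for illustration only
n_for_target = gpus_needed(T, P, x, target_days)
print(f"GPUs needed to finish in {target_days} days: {n_for_target}")
```

With these inputs the run takes roughly 115.7 days, or about 2,778 hours, on 120 GPUs. Because the estimate is inversely proportional to `n`, halving the duration requires doubling the GPU count, provided the per-GPU throughput `x` stays the same at the larger scale.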