- The Llama 3 Herd of Models
- Llama Team, AI @ Meta¹
- ¹A detailed contributor list can be found in the appendix of this paper.
- Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a
- new set of foundation models, called Llama 3. It is a herd of language models that natively support
- multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with
- 405B parameters and a context window of up to 128K tokens. This paper presents an extensive
- empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language
- models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and
- post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input
- and output safety. The paper also presents the results of experiments in which we integrate image,
- video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach
- performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The
- resulting models are not yet being broadly released as they are still under development.
- Date: July 23, 2024
- Website: https://llama.meta.com/
- 1 Introduction
- Foundation models are general models of language, vision, speech, and/or other modalities that are designed
- to support a large variety of AI tasks. They form the basis of many modern AI systems.
- The development of modern foundation models consists of two main stages: (1) a pre-training stage in which
- the model is trained at massive scale using straightforward tasks such as next-word prediction or captioning,
- and (2) a post-training stage in which the model is tuned to follow instructions, align with human preferences,
- and improve specific capabilities (for example, coding and reasoning).
- In this paper, we present a new set of foundation models for language, called Llama 3. The Llama 3 Herd
- of models natively supports multilinguality, coding, reasoning, and tool usage. Our largest model is a dense
- Transformer with 405B parameters, processing information in a context window of up to 128K tokens. Each
- member of the herd is listed in Table 1. All the results presented in this paper are for the Llama 3.1 models,
- which we will refer to as Llama 3 throughout for brevity.
- We believe there are three key levers in the development of high-quality foundation models: data, scale, and
- managing complexity. We seek to optimize for these three levers in our development process:
- •Data. Compared to prior versions of Llama (Touvron et al., 2023a,b), we improved both the quantity and
- quality of the data we use for pre-training and post-training. These improvements include the development
- of more careful pre-processing and curation pipelines for pre-training data and the development of more
- rigorous quality assurance and filtering approaches for post-training data. We pre-train Llama 3 on a
- corpus of about 15T multilingual tokens, compared to 1.8T tokens for Llama 2.
- •Scale. We train a model at far larger scale than previous Llama models: our flagship language model was
- pre-trained using 3.8 × 10²⁵ FLOPs, almost 50× more than the largest version of Llama 2. Specifically,
- we pre-trained a flagship model with 405B trainable parameters on 15.6T text tokens. As expected per
| Model | Finetuned | Multilingual | Long context | Tool use | Release |
|---|---|---|---|---|---|
| Llama 3 8B | ✗ | ✗¹ | ✗ | ✗ | April 2024 |
| Llama 3 8B Instruct | ✓ | ✗ | ✗ | ✗ | April 2024 |
| Llama 3 70B | ✗ | ✗¹ | ✗ | ✗ | April 2024 |
| Llama 3 70B Instruct | ✓ | ✗ | ✗ | ✗ | April 2024 |
| Llama 3.1 8B | ✗ | ✓ | ✓ | ✗ | July 2024 |
| Llama 3.1 8B Instruct | ✓ | ✓ | ✓ | ✓ | July 2024 |
| Llama 3.1 70B | ✗ | ✓ | ✓ | ✗ | July 2024 |
| Llama 3.1 70B Instruct | ✓ | ✓ | ✓ | ✓ | July 2024 |
| Llama 3.1 405B | ✗ | ✓ | ✓ | ✗ | July 2024 |
| Llama 3.1 405B Instruct | ✓ | ✓ | ✓ | ✓ | July 2024 |

- Table 1 Overview of the Llama 3 Herd of models. All results in this paper are for the Llama 3.1 models.
- scaling laws for foundation models, our flagship model outperforms smaller models trained using the
- same procedure. While our scaling laws suggest our flagship model is an approximately compute-optimal
- size for our training budget, we also train our smaller models for much longer than is compute-optimal.
- The resulting models perform better than compute-optimal models at the same inference budget. We
- use the flagship model to further improve the quality of those smaller models during post-training.
- •Managing complexity. We make design choices that seek to maximize our ability to scale the model
- development process. For example, we opt for a standard dense Transformer model architecture (Vaswani
- et al., 2017) with minor adaptations, rather than a mixture-of-experts model (Shazeer et al., 2017),
- to maximize training stability. Similarly, we adopt a relatively simple post-training procedure based
- on supervised finetuning (SFT), rejection sampling (RS), and direct preference optimization (DPO;
- Rafailov et al. (2023)) as opposed to more complex reinforcement learning algorithms (Ouyang et al.,
- 2022; Schulman et al., 2017) that tend to be less stable and harder to scale.
- The result of our work is Llama 3: a herd of three multilingual¹ language models with 8B, 70B, and 405B
- parameters. We evaluate the performance of Llama 3 on a plethora of benchmark datasets that span a wide
- range of language understanding tasks. In addition, we perform extensive human evaluations that compare
- Llama 3 with competing models. An overview of the performance of the flagship Llama 3 model on key
- benchmarks is presented in Table 2. Our experimental evaluation suggests that our flagship model performs
- on par with leading language models such as GPT-4 (OpenAI, 2023a) across a variety of tasks, and is close to
- matching the state-of-the-art. Our smaller models are best-in-class, outperforming alternative models with
- similar numbers of parameters (Bai et al., 2023; Jiang et al., 2023). Llama 3 also delivers a much better
- balance between helpfulness and harmlessness than its predecessor (Touvron et al., 2023b). We present a
- detailed analysis of the safety of Llama 3 in Section 5.4.
- We are publicly releasing all three Llama 3 models under an updated version of the Llama 3 Community License;
- see https://llama.meta.com. This includes pre-trained and post-trained versions of our 405B parameter
- language model and a new version of our Llama Guard model (Inan et al., 2023) for input and output safety.
- We hope that the open release of a flagship model will spur a wave of innovation in the research community,
- and accelerate a responsible path towards the development of artificial general intelligence (AGI).
- As part of the Llama 3 development process we also develop multimodal extensions to the models, enabling
- image recognition, video recognition, and speech understanding capabilities. These models are still under
- active development and not yet ready for release. In addition to our language modeling results, the paper
- presents results of our initial experiments with those multimodal models.
- ¹The Llama 3 8B and 70B models were pre-trained on multilingual data but were intended for use in English at the time.
| Category | Benchmark | Llama 3 8B | Gemma 2 9B | Mistral 7B | Llama 3 70B | Mixtral 8x22B | GPT 3.5 Turbo | Llama 3 405B | Nemotron 4 340B | GPT-4 (0125) | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| General | MMLU (5-shot) | 69.4 | **72.3** | 61.1 | **83.6** | 76.9 | 70.7 | 87.3 | 82.6 | 85.1 | 89.1 | **89.9** |
| | MMLU (0-shot, CoT) | **73.0** | 72.3△ | 60.5 | **86.0** | 79.9 | 69.8 | 88.6 | 78.7◁ | 85.4 | **88.7** | 88.3 |
| | MMLU-Pro (5-shot, CoT) | **48.3** | – | 36.9 | **66.4** | 56.3 | 49.2 | 73.3 | 62.7 | 64.8 | 74.0 | **77.0** |
| | IFEval | **80.4** | 73.6 | 57.6 | **87.5** | 72.7 | 69.9 | **88.6** | 85.1 | 84.3 | 85.6 | 88.0 |
| Code | HumanEval (0-shot) | **72.6** | 54.3 | 40.2 | **80.5** | 75.6 | 68.0 | 89.0 | 73.2 | 86.6 | 90.2 | **92.0** |
| | MBPP EvalPlus (0-shot) | **72.8** | 71.7 | 49.5 | **86.0** | 78.6 | 82.0 | 88.6 | 72.8 | 83.6 | 87.8 | **90.5** |
| Math | GSM8K (8-shot, CoT) | **84.5** | 76.7 | 53.2 | **95.1** | 88.2 | 81.6 | **96.8** | 92.3♢ | 94.2 | 96.1 | 96.4♢ |
| | MATH (0-shot, CoT) | **51.9** | 44.3 | 13.0 | **68.0** | 54.1 | 43.1 | 73.8 | 41.1 | 64.5 | **76.6** | 71.1 |
| Reasoning | ARC Challenge (0-shot) | 83.4 | **87.6** | 74.2 | **94.8** | 88.7 | 83.7 | **96.9** | 94.6 | 96.4 | 96.7 | 96.7 |
| | GPQA (0-shot, CoT) | **32.8** | – | 28.8 | **46.7** | 33.3 | 30.8 | 51.1 | – | 41.4 | 53.6 | **59.4** |
| Tool use | BFCL | **76.1** | – | 60.4 | 84.8 | – | **85.9** | 88.5 | 86.5 | 88.3 | 80.5 | **90.2** |
| | Nexus | **38.5** | 30.0 | 24.7 | **56.7** | 48.5 | 37.2 | **58.7** | – | 50.3 | 56.1 | 45.7 |
| Long context | ZeroSCROLLS/QuALITY | **81.0** | – | – | **90.5** | – | – | **95.2** | – | **95.2** | 90.5 | 90.5 |
| | InfiniteBench/En.MC | **65.1** | – | – | **78.2** | – | – | **83.4** | – | 72.1 | 82.5 | – |
| | NIH/Multi-needle | **98.8** | – | – | **97.5** | – | – | 98.1 | – | **100.0** | **100.0** | 90.8 |
| Multilingual | MGSM (0-shot, CoT) | **68.9** | 53.2 | 29.9 | **86.9** | 71.1 | 51.4 | **91.6** | – | 85.9 | 90.5 | **91.6** |

- Table 2 Performance of finetuned Llama 3 models on key benchmark evaluations. The table compares the performance of
- the 8B, 70B, and 405B versions of Llama 3 with that of competing models. We boldface the best-performing model in
- each of three model-size equivalence classes. △Results obtained using 5-shot prompting (no CoT). ◁Results obtained
- without CoT. ♢Results obtained using zero-shot prompting.
- 2 General Overview
- The model architecture of Llama 3 is illustrated in Figure 1. The development of our Llama 3 language
- models comprises two main stages:
- •Language model pre-training. We start by converting a large, multilingual text corpus to discrete tokens
- and pre-training a large language model (LLM) on the resulting data to perform next-token prediction.
- In the language model pre-training stage, the model learns the structure of language and obtains large
- amounts of knowledge about the world from the text it is “reading”. To do this effectively, pre-training
- is performed at massive scale: we pre-train a model with 405B parameters on 15.6T tokens using a
- context window of 8K tokens. This standard pre-training stage is followed by a continued pre-training
- stage that increases the supported context window to 128K tokens. See Section 3 for details.
- •Language model post-training. The pre-trained language model has a rich understanding of language
- but it does not yet follow instructions or behave in the way we would expect an assistant to. We
- align the model with human feedback in several rounds, each of which involves supervised finetuning
- (SFT) on instruction tuning data and Direct Preference Optimization (DPO; Rafailov et al., 2024).
- At this post-training² stage, we also integrate new capabilities, such as tool use, and observe strong
- improvements in other areas, such as coding and reasoning. See Section 4 for details. Finally, safety
- mitigations are also incorporated into the model at the post-training stage, the details of which are
- described in Section 5.4.
- The resulting models have a rich set of capabilities. They can answer questions in at least eight languages,
- write high-quality code, solve complex reasoning problems, and use tools out-of-the-box or in a zero-shot way.
- We also perform experiments in which we add image, video, and speech capabilities to Llama 3 using a
- compositional approach. The approach we study comprises the three additional stages illustrated in Figure 28:
- •Multi-modal encoder pre-training. We train separate encoders for images and speech. We train our
- image encoder on large amounts of image-text pairs. This teaches the model the relation between visual
- content and the description of that content in natural language. Our speech encoder is trained using a
- ²In this paper, we use the term “post-training” to refer to any model training that happens outside of pre-training.
- Figure 1 Illustration of the overall architecture and training of Llama 3. Llama 3 is a Transformer language model trained to
- predict the next token of a textual sequence. See text for details.
- self-supervised approach that masks out parts of the speech inputs and tries to reconstruct the masked
- out parts via a discrete-token representation. As a result, the model learns the structure of speech
- signals. See Section 7 for details on the image encoder and Section 8 for details on the speech encoder.
- •Vision adapter training. We train an adapter that integrates the pre-trained image encoder into the
- pre-trained language model. The adapter consists of a series of cross-attention layers that feed image-
- encoder representations into the language model. The adapter is trained on text-image pairs. This
- aligns the image representations with the language representations. During adapter training, we also
- update the parameters of the image encoder but we intentionally do not update the language-model
- parameters. We also train a video adapter on top of the image adapter on paired video-text data. This
- enables the model to aggregate information across frames. See Section 7 for details.
- •Speech adapter training. Finally, we integrate the speech encoder into the model via an adapter that
- converts speech encodings into token representations that can be fed directly into the finetuned language
- model. The parameters of the adapter and encoder are jointly updated in a supervised finetuning stage
- to enable high-quality speech understanding. We do not change the language model during speech
- adapter training. We also integrate a text-to-speech system. See Section 8 for details.
- Our multimodal experiments lead to models that can recognize the content of images and videos, and support
- interaction via a speech interface. These models are still under development and not yet ready for release.
- 3 Pre-Training
- Language model pre-training involves: (1) the curation and filtering of a large-scale training corpus, (2) the
- development of a model architecture and corresponding scaling laws for determining model size, (3) the
- development of techniques for efficient pre-training at large scale, and (4) the development of a pre-training
- recipe. We present each of these components separately below.
- 3.1 Pre-Training Data
- We create our dataset for language model pre-training from a variety of data sources containing knowledge
- until the end of 2023. We apply several de-duplication methods and data cleaning mechanisms on each data
- source to obtain high-quality tokens. We remove domains that contain large amounts of personally identifiable
- information (PII), and domains with known adult content.
- 3.1.1 Web Data Curation
- Much of the data we utilize is obtained from the web and we describe our cleaning process below.
- PII and safety filtering. Among other mitigations, we implement filters designed to remove data from websites
- that are likely to contain unsafe content or high volumes of PII, from domains that have been ranked as harmful
- according to a variety of Meta safety standards, and from domains that are known to contain adult content.
- Text extraction and cleaning. We process the raw HTML content for non-truncated web documents to extract
- high-quality diverse text. To do so, we build a custom parser that extracts the HTML content and optimizes
- for precision in boilerplate removal and content recall. We evaluate our parser’s quality in human evaluations,
- comparing it with popular third-party HTML parsers that optimize for article-like content, and find it
- to perform favorably. We carefully process HTML pages with mathematics and code content to preserve
- the structure of that content. We maintain the image alt attribute text, since mathematical content is often
- represented as pre-rendered images where the math is also provided in the alt attribute. We experimentally
- evaluate different cleaning configurations. We find markdown is harmful to the performance of a model that
- is primarily trained on web data compared to plain text, so we remove all markdown markers.
- De-duplication. We apply several rounds of de-duplication at the URL, document, and line level:
- •URL-level de-duplication. We perform URL-level de-duplication across the entire dataset. We keep the
- most recent version for pages corresponding to each URL.
- •Document-level de-duplication. We perform global MinHash (Broder, 1997) de-duplication across the
- entire dataset to remove near-duplicate documents; a minimal sketch of the idea appears after this list.
- •Line-level de-duplication. We perform aggressive line-level de-duplication similar to ccNet (Wenzek
- et al., 2019). We remove lines that appear more than 6 times in each bucket of 30M documents.
- Although our manual qualitative analysis showed that line-level de-duplication removes not only
- leftover boilerplate from various websites, such as navigation menus and cookie warnings, but also frequent
- high-quality text, our empirical evaluations showed strong improvements.
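- To make the MinHash step concrete, here is a minimal, illustrative sketch of MinHash-based near-duplicate detection; the shingle size, the number of hash functions, and the use of salted MD5 are assumptions for illustration, not the production pipeline (which would additionally use LSH banding to avoid all-pairs comparison).

```python
import hashlib

def shingles(text: str, n: int = 5) -> set[str]:
    """Word n-grams used as the document's feature set (n=5 is an assumption)."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text: str, num_hashes: int = 128) -> list[int]:
    """One seeded hash per 'permutation'; the minimum hash over the feature set
    is kept for each seed, yielding a fixed-size signature per document."""
    feats = shingles(text)
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")
        sig.append(min(int.from_bytes(hashlib.md5(salt + f.encode()).digest()[:8], "little")
                       for f in feats))
    return sig

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity;
    document pairs above a chosen threshold are treated as near-duplicates."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```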
- Heuristic filtering. We develop heuristics to remove additional low-quality documents, outliers, and documents
- with excessive repetitions. Some examples of heuristics include:
- •We use the duplicated n-gram coverage ratio (Rae et al., 2021) to remove lines that consist of repeated
- content, such as logging or error messages. Such lines can be very long and unique, and hence cannot be
- filtered by line-level de-duplication (a sketch of this heuristic appears after this list).
- •We use “dirty word” counting (Raffel et al., 2020) to filter out adult websites that are not covered by
- domain block lists.
- •We use a token-distribution Kullback-Leibler divergence to filter out documents containing excessive
- numbers of outlier tokens compared to the training corpus distribution.
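- The duplicated n-gram coverage ratio from the first heuristic can be sketched as follows; the n-gram length and the removal threshold are illustrative assumptions, not the paper's settings.

```python
from collections import Counter

def duplicated_ngram_coverage(text: str, n: int = 4) -> float:
    """Fraction of n-gram occurrences that belong to an n-gram appearing more
    than once; high values indicate repetitive content such as log output."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / len(grams)

# e.g., drop documents with duplicated_ngram_coverage(doc) > 0.3 (threshold is an assumption)
```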
- Model-based quality filtering. Further, we experiment with applying various model-based quality classifiers
- to sub-select high-quality tokens. These include fast classifiers such as fasttext (Joulin et al., 2017)
- trained to recognize if a given text would be referenced by Wikipedia (Touvron et al., 2023a), as well as more
- compute-intensive Roberta-based classifiers (Liu et al., 2019a) trained on Llama 2 predictions. To train a
- quality classifier based on Llama 2, we create a training set of cleaned web documents, describe the quality
- requirements, and instruct Llama 2’s chat model to determine if the documents meet these requirements. For
- efficiency reasons, we use DistilRoberta (Sanh et al., 2019) to generate quality scores for each document. We
- experimentally evaluate the efficacy of various quality filtering configurations.
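- For illustration, a hedged sketch of how a DistilRoberta-style classifier might score documents using the Hugging Face transformers API; the checkpoint name, the label convention, and the threshold are assumptions, since the actual classifier was finetuned on Llama 2 annotations as described above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "distilroberta-base" is a placeholder; the paper's classifier was finetuned
# on web documents labeled by Llama 2's chat model.
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained("distilroberta-base", num_labels=2)
model.eval()

def quality_score(document: str) -> float:
    """Probability of the 'high quality' class (treating label 1 as high
    quality is an assumption)."""
    inputs = tokenizer(document, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.softmax(dim=-1)[0, 1].item()

# keep = quality_score(doc) > 0.5  # threshold is an assumption
```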
- Code and reasoning data. Similar to DeepSeek-AI et al. (2024), we build domain-specific pipelines that extract
- code and math-relevant web pages. Specifically, both the code and reasoning classifiers are DistilRoberta
- models trained on web data annotated by Llama 2. Unlike the general quality classifier mentioned above, we
- conduct prompt tuning to target web pages containing math deduction, reasoning in STEM areas and code
- interleaved with natural language. Since the token distribution of code and math is substantially different
- than that of natural language, these pipelines implement domain-specific HTML extraction, customized text
- features and heuristics for filtering.
- Multilingual data. Similar to our processing pipelines for English described above, we implement filters to
- remove data from websites that are likely to contain PII or unsafe content. Our multilingual text processing
- pipeline has several unique features:
- •We use a fasttext-based language identification model to categorize documents into 176 languages (a
- minimal usage sketch appears at the end of this subsection).
- •We perform document-level and line-level de-duplication within data for each language.
- •We apply language-specific heuristics and model-based filters to remove low-quality documents.
- In addition, we perform quality ranking of multilingual documents using a multilingual Llama 2-based classifier
- to ensure that high-quality content is prioritized. We determine the amount of multilingual tokens used in
- pre-training experimentally, balancing model performance on English and multilingual benchmarks.
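- A minimal usage sketch of fasttext language identification with the publicly released lid.176 model; the confidence threshold is an assumption.

```python
import fasttext

# lid.176.bin is fastText's public 176-language identification model.
lid_model = fasttext.load_model("lid.176.bin")

def detect_language(text: str) -> tuple[str, float]:
    # fastText's predict() expects single-line input, hence the newline replace.
    labels, probs = lid_model.predict(text.replace("\n", " "), k=1)
    return labels[0].removeprefix("__label__"), float(probs[0])

lang, conf = detect_language("Guten Morgen, wie geht es dir?")
# -> ("de", ~0.99); documents below a confidence threshold could be dropped (assumption)
```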
- 3.1.2 Determining the Data Mix
- To obtain a high-quality language model, it is essential to carefully determine the proportion of different data
- sources in the pre-training data mix. Our main tools in determining this data mix are knowledge classification
- and scaling law experiments.
- Knowledge classification. We develop a classifier to categorize the types of information contained in our web
- data to more effectively determine a data mix. We use this classifier to downsample data categories that are
- over-represented on the web, for example, arts and entertainment.
- Scaling laws for data mix. To determine the best data mix, we perform scaling law experiments in which we
- train several small models on a data mix and use that to predict the performance of a large model on that mix
- (see Section 3.2.1). We repeat this process multiple times for different data mixes to select a new data mix
- candidate. Subsequently, we train a larger model on this candidate data mix and evaluate the performance of
- that model on several key benchmarks.
- Data mix summary. Our final data mix contains roughly 50% of tokens corresponding to general knowledge,
- 25% of mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens.
- 3.1.3 Annealing Data
- Empirically, we find that annealing (see Section 3.4.3) on small amounts of high-quality code and mathematical
- data can boost the performance of pre-trained models on key benchmarks. Akin to Li et al. (2024b), we
- perform annealing with a data mix that upsamples high-quality data in select domains. We do not include
- any training sets from commonly used benchmarks in our annealing data. This enables us to assess the true
- few-shot learning capabilities and out-of-domain generalization of Llama 3.
- Following OpenAI (2023a), we evaluate the efficacy of annealing on the GSM8k (Cobbe et al., 2021) and
- MATH (Hendrycks et al., 2021b) training sets. We find that annealing improved the performance
- of a pre-trained Llama 3 8B model on the GSM8k and MATH validation sets by 24.0% and 6.4%, respectively.
- However, the improvements on the 405B model are negligible, suggesting that our flagship model has strong
- in-context learning and reasoning capabilities and does not require specific in-domain training samples to
- obtain strong performance.
- Using annealing to assess data quality. Similar to Blakeney et al. (2024), we find that annealing enables us to
- judge the value of small domain-specific datasets. We measure the value of such datasets by annealing the
- learning rate of a 50% trained Llama 3 8B model linearly to 0 on 40B tokens. In those experiments, we assign
- 30% weight to the new dataset and the remaining 70% weight to the default data mix; a minimal sketch of this
- setup follows below. Using annealing to evaluate new data sources is more efficient than performing scaling
- law experiments for every small dataset.
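- A minimal sketch of this data-valuation setup; only the 40B-token horizon and the 30/70 mixing weights come from the text, while the function names and the sampling mechanism are illustrative.

```python
import random

ANNEAL_TOKENS = 40e9  # anneal a 50%-trained Llama 3 8B checkpoint over 40B tokens

def annealed_lr(tokens_seen: float, lr_at_start: float) -> float:
    """Linearly decay the learning rate to 0 over the annealing horizon."""
    return lr_at_start * max(0.0, 1.0 - tokens_seen / ANNEAL_TOKENS)

def sample_data_source(rng: random.Random) -> str:
    """Assign 30% weight to the candidate dataset, 70% to the default mix."""
    return "candidate_dataset" if rng.random() < 0.30 else "default_mix"

# The value of the candidate dataset is then judged by the annealed model's
# benchmark performance relative to annealing on the default mix alone.
```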
- 3.2 Model Architecture
- Llama 3 uses a standard, dense Transformer architecture (Vaswani et al., 2017). It does not deviate significantly
- from Llama and Llama 2 (Touvron et al., 2023a,b) in terms of model architecture; our performance gains are
- primarily driven by improvements in data quality and diversity as well as by increased training scale.
- We make a few small modifications compared to Llama 2:
- •We use grouped query attention (GQA; Ainslie et al. (2023)) with 8 key-value heads to improve inference
- speed and to reduce the size of key-value caches during decoding.
- •We use an attention mask that prevents self-attention between different documents within the same
- sequence. We find that this change had limited impact during standard pre-training, but find it to be
- important in continued pre-training on very long sequences.
| | 8B | 70B | 405B |
|---|---|---|---|
| Layers | 32 | 80 | 126 |
| Model Dimension | 4,096 | 8,192 | 16,384 |
| FFN Dimension | 14,336 | 28,672 | 53,248 |
| Attention Heads | 32 | 64 | 128 |
| Key/Value Heads | 8 | 8 | 8 |
| Peak Learning Rate | 3 × 10⁻⁴ | 1.5 × 10⁻⁴ | 8 × 10⁻⁵ |
| Activation Function | SwiGLU | SwiGLU | SwiGLU |
| Vocabulary Size | 128,000 | 128,000 | 128,000 |
| Positional Embeddings | RoPE (θ = 500,000) | RoPE (θ = 500,000) | RoPE (θ = 500,000) |

- Table 3 Overview of the key hyperparameters of Llama 3. We display settings for the 8B, 70B, and 405B language models.
- •We use a vocabulary with 128K tokens. Our token vocabulary combines 100K tokens from the tiktoken³
- tokenizer with 28K additional tokens to better support non-English languages. Compared to the Llama
- 2 tokenizer, our new tokenizer improves compression rates on a sample of English data from 3.17 to
- 3.94 characters per token. This enables the model to “read” more text for the same amount of training
- compute. We also found that adding 28K tokens from select non-English languages improved both
- compression ratios and downstream performance, with no impact on English tokenization.
- •We increase the RoPE base frequency hyperparameter to 500,000. This enables us to better support
- longer contexts; Xiong et al. (2023) showed this value to be effective for context lengths up to 32,768. A
- minimal sketch of RoPE with this base frequency follows below.
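- For illustration, a minimal PyTorch-style sketch of rotary position embeddings with base frequency θ = 500,000; the function names and tensor layout are assumptions, not the training codebase.

```python
import torch

def rope_rotations(head_dim: int, seq_len: int, theta: float = 500_000.0) -> torch.Tensor:
    """Complex rotation factors: one angle per position and per feature pair."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq, head_dim/2)
    return torch.polar(torch.ones_like(angles), angles)

def apply_rope(x: torch.Tensor, rot: torch.Tensor) -> torch.Tensor:
    """Rotate query/key features; x has shape (batch, seq, heads, head_dim)."""
    x_pairs = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    rotated = torch.view_as_real(x_pairs * rot[None, :, None, :])
    return rotated.flatten(start_dim=-2).type_as(x)

q = torch.randn(1, 16, 8, 64)                  # toy shapes for illustration
q_rot = apply_rope(q, rope_rotations(64, 16))  # position-dependent rotation applied
```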
- Llama 3 405B uses an architecture with 126 layers, a token representation dimension of 16,384, and 128
- attention heads; see Table 3 for details. This leads to a model size that is approximately compute-optimal
- according to scaling laws on our data for our training budget of 3.8 × 10²⁵ FLOPs.
- 3.2.1 Scaling Laws
- We develop scaling laws (Hoffmann et al., 2022; Kaplan et al., 2020) to determine the optimal model size for
- our flagship model given our pre-training compute budget. In addition to determining the optimal model size,
- a major challenge is to forecast the flagship model’s performance on downstream benchmark tasks, due to a
- couple of issues: (1) Existing scaling laws typically predict only next-token prediction loss rather than specific
- benchmark performance. (2) Scaling laws can be noisy and unreliable because they are developed based on
- pre-training runs conducted with small compute budgets (Wei et al., 2022b).
- To address these challenges, we implement a two-stage methodology to develop scaling laws that accurately
- predict downstream benchmark performance:
- 1. We first establish a correlation between the compute-optimal model’s negative log-likelihood on down-
- stream tasks and the training FLOPs.
- 2. Next, we correlate the negative log-likelihood on downstream tasks with task accuracy, utilizing both the
- scaling law models and older models trained with higher compute FLOPs. In this step, we specifically
- leverage the Llama 2 family of models.
- This approach enables us to predict downstream task performance given a specific number of training FLOPs
- for compute-optimal models. We use a similar method to select our pre-training data mix (see Section 3.4).
- Scaling law experiments. Concretely, we construct our scaling laws by pre-training models using compute
- budgets between 6 × 10¹⁸ FLOPs and 10²² FLOPs. At each compute budget, we pre-train models ranging
- in size between 40M and 16B parameters, using a subset of model sizes at each compute budget. In these
- training runs, we use a cosine learning rate schedule with a linear warmup for 2,000 training steps. The peak
- learning rate is set between 2 × 10⁻⁴ and 4 × 10⁻⁴ depending on the size of the model. We set the cosine
- decay to 0.1 of the peak value. The weight decay at each step is set to 0.1 times the learning rate at that step.
- We use a fixed batch size for each compute scale, ranging between 250K and 4M tokens. A minimal sketch of
- this learning-rate schedule follows below.
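- A minimal sketch of this learning-rate and weight-decay schedule; the total step count is a placeholder, while the 2,000-step warmup, the 0.1 cosine floor, and the 0.1× weight-decay coupling come from the text.

```python
import math

def learning_rate(step: int, peak_lr: float, warmup_steps: int = 2000,
                  total_steps: int = 100_000, final_ratio: float = 0.1) -> float:
    """Linear warmup followed by cosine decay to final_ratio * peak_lr.
    total_steps is a placeholder; per-run values are not given in the text."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)

def weight_decay(step: int, peak_lr: float) -> float:
    """Per the text, weight decay is 0.1x the current learning rate."""
    return 0.1 * learning_rate(step, peak_lr)
```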
- ³https://github.com/openai/tiktoken/tree/main
- Figure 2 Scaling law IsoFLOPs curves between 6 × 10¹⁸ and 10²² FLOPs (compute budgets: 6e18, 1e19, 3e19, 6e19, 1e20, 3e20, 6e20, 1e21, 3e21, and 1e22). The loss is the negative log-likelihood on a held-out validation set. We approximate measurements at each compute scale using a second-degree polynomial. [Figure: validation loss versus training tokens, one IsoFLOPs curve per compute budget.]
- Figure 3 Number of training tokens in identified compute-optimal models as a function of pre-training compute budget. We include the fitted scaling-law prediction as well (fitted line: α = 0.537, A = 0.299). The compute-optimal models correspond to the parabola minimums in Figure 2. [Figure: training tokens versus compute (FLOPs) with the fitted power law.]
- These experiments give rise to the IsoFLOPs curves in Figure 2. The loss in these curves is measured on
- a separate validation set. We fit the measured loss values using a second-degree polynomial and identify
- the minimums of each parabola. We refer to the minimum of a parabola as the compute-optimal model at the
- corresponding pre-training compute budget.
- We use the compute-optimal models we identified this way to predict the optimal number of training tokens
- for a specific compute budget. To do so, we assume a power-law relation between the compute budget, C, and
- the optimal number of training tokens, N⋆(C):
- N⋆(C) = A · C^α.
- We fit A and α using the data from Figure 2. We find that (α, A) = (0.53, 0.29); the corresponding fit is
- shown in Figure 3. Extrapolation of the resulting scaling law to 3.8 × 10²⁵ FLOPs suggests training a 402B
- parameter model on 16.55T tokens.
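- As a hedged worked example, plugging the more precise coefficients from Figure 3 (α ≈ 0.537, A ≈ 0.299, of which the values above are rounded) into the power law approximately reproduces these numbers; the C ≈ 6NP rule of thumb used to convert a FLOPs budget and token count into a parameter count is a common approximation, not the paper's exact accounting.

```python
C = 3.8e25               # flagship pre-training budget in FLOPs
alpha, A = 0.537, 0.299  # fitted coefficients from Figure 3

optimal_tokens = A * C**alpha             # ~1.6e13, close to the 16.55T quoted above
approx_params = C / (6 * optimal_tokens)  # ~3.9e11, i.e. roughly 390B parameters,
                                          # in the ballpark of the 402B figure
print(f"{optimal_tokens:.2e} tokens, {approx_params:.2e} parameters")
```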
- An important observation is that IsoFLOPs curves become flatter around the minimum as the compute
- budget increases. This implies that the performance of the flagship model is relatively robust to small changes
- in the trade-off between model size and training tokens. Based on this observation, we ultimately decided to
- train a flagship model with 405B parameters.
- Predicting performance on downstream tasks. We use the resulting compute-optimal models to forecast
- the performance of the flagship Llama 3 model on benchmark data sets. First, we linearly correlate the
- (normalized) negative log-likelihood of the correct answer in the benchmark and the training FLOPs. In this
- analysis, we use only the scaling law models trained up to 10²² FLOPs on the data mix described above. Next,
- we establish a sigmoidal relation between the log-likelihood and accuracy using both the scaling law models
- and Llama 2 models, which were trained using the Llama 2 data mix and tokenizer. We show the results of
- this experiment on the ARC Challenge benchmark in Figure 4. We find this two-step scaling law prediction,
- which extrapolates over four orders of magnitude, to be quite accurate: it only slightly underestimates the
- final performance of the flagship Llama 3 model. A minimal sketch of the two-step fit follows below.
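- A minimal sketch of the two-step fit; the exact functional forms and the synthetic stand-in data below are assumptions consistent with the linear and sigmoidal relations described in the text, included only so the sketch runs.

```python
import numpy as np
from scipy.optimize import curve_fit

# Step 1: linear relation between normalized NLL and log10(training FLOPs).
def nll_from_flops(log10_flops, a, b):
    return a * log10_flops + b

# Step 2: sigmoidal relation between NLL and benchmark accuracy.
def accuracy_from_nll(nll, lo, hi, mid, scale):
    return lo + (hi - lo) / (1.0 + np.exp((nll - mid) / scale))

# Synthetic stand-in measurements (the real inputs are the scaling-law models
# up to 1e22 FLOPs and the Llama 2 family).
rng = np.random.default_rng(0)
log10_flops = np.log10([6e18, 1e19, 3e19, 6e19, 1e20, 3e20, 6e20, 1e21, 3e21, 1e22])
nlls = 2.2 - 0.04 * log10_flops + rng.normal(0.0, 0.003, log10_flops.size)
accs = accuracy_from_nll(nlls, 0.25, 1.0, 1.35, 0.05) + rng.normal(0.0, 0.01, nlls.size)

p1, _ = curve_fit(nll_from_flops, log10_flops, nlls)
p2, _ = curve_fit(accuracy_from_nll, nlls, accs, p0=[0.25, 1.0, 1.35, 0.05])

# Chain the two fits to forecast accuracy at the flagship budget of 3.8e25 FLOPs.
predicted = accuracy_from_nll(nll_from_flops(np.log10(3.8e25), *p1), *p2)
```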
- 3.3 Infrastructure, Scaling, and Efficiency
- We describe our hardware and infrastructure that powered Llama 3 405B pre-training at scale and discuss
- several optimizations that lead to improvements in training efficiency.
- 3.3.1 Training Infrastructure
- The Llama 1 and 2 models were trained on Meta’s AI Research SuperCluster (Lee and Sengupta, 2022). As
- we scaled further, the training for Llama 3 was migrated to Meta’s production clusters (Lee et al., 2024). This
- Figure 4 Scaling law forecast for ARC Challenge. Left: Normalized negative log-likelihood of the correct answer on the
- ARC Challenge benchmark as a function of pre-training FLOPs. Right: ARC Challenge benchmark accuracy as a
- function of the normalized negative log-likelihood of the correct answer, shown for the scaling law models, the Llama 2
- models, the scaling law prediction, and Llama 3 405B. This analysis enables us to predict model
- performance on the ARC Challenge benchmark before pre-training commences. See text for details.
- setup optimizes for production-grade reliability, which is essential as we scale up training.
- Compute. Llama 3 405B is trained on up to 16K H100 GPUs, each running at 700W TDP with 80GB HBM3,
- using Meta’s Grand Teton AI server platform (Matt Bowman, 2022). Each server is equipped with eight GPUs
- and two CPUs. Within a server, the eight GPUs are connected via NVLink. Training jobs are scheduled
- using MAST (Choudhury et al., 2024), Meta’s global-scale training scheduler.
- Storage. Tectonic (Pan et al., 2021), Meta’s general-purpose distributed file system, is used to build a storage
- fabric (Battey and Gupta, 2024) for Llama 3 pre-training. It offers 240 PB of storage out of 7,500 servers
- equipped with SSDs, and supports a sustainable throughput of 2 TB/s and a peak throughput of 7 TB/s. A
- major challenge is supporting the highly bursty checkpoint writes that saturate the storage fabric for short
- durations. Checkpointing saves each GPU’s model state, ranging from 1 MB to 4 GB per GPU, for recovery
- and debugging. We aim to minimize GPU pause time during checkpointing and increase checkpoint frequency
- to reduce the amount of lost work after a recovery.
- Network. Llama 3 405B used RDMA over Converged Ethernet (RoCE) fabric based on the Arista 7800
- and Minipack2 Open Compute Project⁴ (OCP) rack switches. Smaller models in the Llama 3 family were
- trained using Nvidia Quantum2 Infiniband fabric. Both RoCE and Infiniband clusters leverage 400 Gbps
- interconnects between GPUs. Despite the underlying network technology differences between these clusters,
- we tune both of them to provide equivalent performance for these large training workloads. We elaborate
- further on our RoCE network since we fully own its design.
- •Network topology. Our RoCE-based AI cluster comprises 24K GPUs⁵ connected by a three-layer Clos
- network (Lee et al., 2024). At the bottom layer, each rack hosts 16 GPUs split between two servers and
- connected by a single Minipack2 top-of-the-rack (ToR) switch. In the middle layer, 192 such racks are
- connected by Cluster Switches to form a pod of 3,072 GPUs with full bisection bandwidth, ensuring no
- oversubscription. At the top layer, eight such pods within the same datacenter building are connected via
- Aggregation Switches to form a cluster of 24K GPUs. However, network connectivity at the aggregation
- layer does not maintain full bisection bandwidth and instead has an oversubscription ratio of 1:7. Our
- model parallelism methods (see Section 3.3.2) and training job scheduler (Choudhury et al., 2024) are
- all optimized to be aware of network topology, aiming to minimize network communication across pods.
- •Load balancing. LLM training produces fat network flows that are hard to load balance across all
- available network paths using traditional methods such as Equal-Cost Multi-Path (ECMP) routing. To
- address this challenge, we employ two techniques. First, our collective library creates 16 network flows
- between two GPUs, instead of just one, thereby reducing the traffic per flow and providing more flows
- ⁴Open Compute Project: https://www.opencompute.org/
- ⁵Note that we use only up to 16K of these 24K GPUs for Llama 3 pre-training.
| GPUs | TP | CP | PP | DP | Seq. Len. | Batch size/DP | Tokens/Batch | TFLOPs/GPU | BF16 MFU |
|---|---|---|---|---|---|---|---|---|---|
| 8,192 | 8 | 1 | 16 | 64 | 8,192 | 32 | 16M | 430 | 43% |
| 16,384 | 8 | 1 | 16 | 128 | 8,192 | 16 | 16M | 400 | 41% |
| 16,384 | 8 | 16 | 16 | 8 | 131,072 | 16 | 16M | 380 | 38% |

- Table 4 Scaling configurations and MFU for each stage of Llama 3 405B pre-training. See text and Figure 5 for descriptions
- of each type of parallelism.
- for load balancing. Second, our Enhanced-ECMP (E-ECMP) protocol effectively balances these 16 flows
- across different network paths by hashing on additional fields in the RoCE header of packets.
- •Congestion control. We use deep-buffer switches in the spine (Gangidi et al., 2024) to accommodate
- transient congestion and buffering caused by collective communication patterns. This setup helps
- limit the impact of persistent congestion and network back pressure caused by slow servers, which is
- common in training. Finally, better load balancing through E-ECMP significantly reduces the chance
- of congestion. With these optimizations, we successfully run a 24K GPU cluster without traditional
- congestion control methods such as Data Center Quantized Congestion Notification (DCQCN).
- 3.3.2 Parallelism for Model Scaling
- To scale training for our largest models, we use 4D parallelism—a combination of four different types of
- parallelism methods—to shard the model. This approach efficiently distributes computation across many
- GPUs and ensures each GPU’s model parameters, optimizer states, gradients, and activations fit in its
- HBM. Our implementation of 4D parallelism is illustrated in Figure 5. It combines tensor parallelism (TP;
- Krizhevsky et al. (2012); Shoeybi et al. (2019); Korthikanti et al. (2023)), pipeline parallelism (PP; Huang
- et al. (2019); Narayanan et al. (2021); Lamy-Poirier (2023)), context parallelism (CP; Liu et al. (2023a)), and
- data parallelism (DP; Rajbhandari et al. (2020); Ren et al. (2021); Zhao et al. (2023b)).
- Tensor parallelism splits individual weight tensors into multiple chunks on different devices. Pipeline parallelism
- partitions the model vertically into stages by layers, so that different devices can process in parallel different
- stages of the full model pipeline. Context parallelism divides the input context into segments, reducing memory
- bottlenecks for very long sequence-length inputs. We use fully sharded data parallelism (FSDP; Rajbhandari
- et al., 2020; Ren et al., 2021; Zhao et al., 2023b), which shards the model, optimizer states, and gradients while
- implementing data parallelism: data is processed in parallel on multiple GPUs and synchronized after each
- training step. Our use of FSDP for Llama 3 shards optimizer states and gradients, but for model shards, we do
- not reshard after forward computation to avoid an extra all-gather communication during backward passes.
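- The mapping from a global GPU index to its [TP, CP, PP, DP] coordinates (the example in Figure 5) can be sketched as follows, with TP innermost and DP outermost; the function name is illustrative.

```python
def parallelism_coords(gpu: int, tp: int, cp: int, pp: int, dp: int) -> tuple[int, int, int, int]:
    """Map a global GPU index to [TP, CP, PP, DP] coordinates, TP innermost.

    With tp=cp=pp=dp=2 this reproduces the Figure 5 example: GPU0 and GPU1
    share a TP group, GPU0/GPU2 a CP group, GPU0/GPU4 a PP group, and
    GPU0/GPU8 a DP group."""
    assert gpu < tp * cp * pp * dp
    return (gpu % tp,
            (gpu // tp) % cp,
            (gpu // (tp * cp)) % pp,
            gpu // (tp * cp * pp))
```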
- GPU utilization. Through careful tuning of the parallelism configuration, hardware, and software, we achieve
- an overall BF16 Model FLOPs Utilization (MFU; Chowdhery et al. (2023)) of 38-43% for the configurations
- shown in Table 4. The slight drop in MFU to 41% on 16K GPUs with DP=128 compared to 43% on 8K
- GPUs with DP=64 is due to the lower batch size per DP group needed to keep the global tokens per batch
- constant during training. A back-of-the-envelope check of these MFU numbers follows below.
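- A hedged back-of-the-envelope check of these MFU numbers; the ~989 TFLOP/s dense BF16 peak for an H100 is the vendor-quoted figure and an assumption here, since MFU accounting details can differ.

```python
# Achieved model TFLOPs per GPU from Table 4, divided by the assumed peak.
H100_BF16_PEAK_TFLOPS = 989.0  # vendor-quoted dense BF16 peak; an assumption

for achieved in (430, 400, 380):
    mfu = achieved / H100_BF16_PEAK_TFLOPS
    print(f"{achieved} TFLOPs/GPU -> MFU ~ {mfu:.1%}")
# ~43.5%, ~40.4%, ~38.4% -- in line with the 43%, 41%, and 38% reported in Table 4
```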
- Pipeline parallelism improvements. We encountered several challenges with existing implementations:
- •Batch size constraint. Current implementations have constraints on the supported batch size per GPU,
- requiring it to be divisible by the number of pipeline stages. For the example in Figure 6, the depth-first
- schedule (DFS) of pipeline parallelism (Narayanan et al., 2021) requires N = PP = 4, while the
- breadth-first schedule (BFS; Lamy-Poirier (2023)) requires N = M, where M is the total number
- of micro-batches and N is the number of contiguous micro-batches for the same stage’s forward or
- backward pass. However, pre-training often needs flexibility to adjust the batch size.
- •Memory imbalance. Existing pipeline parallelism implementations lead to imbalanced resource consump-
- tion. The first stage consumes more memory due to the embedding and the warm-up micro-batches.
- •Computation imbalance. After the last layer of the model, we need to calculate output and loss, making
- this stage the execution latency bottleneck.
- Figure 5 Illustration of 4D parallelism. GPUs are divided into parallelism groups in the order of [TP, CP, PP, DP], where
- DP stands for FSDP. In this example, 16 GPUs are configured with a group size of |TP|=2, |CP|=2, |PP|=2, and
- |DP|=2. A GPU’s position in 4D parallelism is represented as a vector, [D1, D2, D3, D4], where Di is the index on
- the i-th parallelism dimension. In this example, GPU0 [TP0, CP0, PP0, DP0] and GPU1 [TP1, CP0, PP0, DP0] are in
- the same TP group, GPU0 and GPU2 are in the same CP group, GPU0 and GPU4 are in the same PP group, and
- GPU0 and GPU8 are in the same DP group.
- To address these issues, we modify our pipeline schedule as shown in Figure 6, which allows setting N
- flexibly (in this case, N = 5), so that each batch can run an arbitrary number of micro-batches. This allows
- us to run: (1) fewer micro-batches than the number of stages when we have a batch size limit at large scale;
- or (2) more micro-batches to hide point-to-point communication, finding a sweet spot between DFS and
- BFS for the best communication and memory efficiency. To balance the pipeline,
- we remove one Transformer layer each from the first and the last stages. This means that
- the first model chunk on the first stage has only the embedding, and the last model chunk on the last
- stage has only the output projection and loss calculation. To reduce pipeline bubbles, we use an interleaved
- schedule (Narayanan et al., 2021) with V pipeline stages on one pipeline rank. The overall pipeline bubble
- ratio is (PP − 1)/(V · M). Further, we adopt asynchronous point-to-point communication in PP, which
- considerably speeds up training, especially when the document mask introduces extra computation
- imbalance. We enable TORCH_NCCL_AVOID_RECORD_STREAMS to reduce the memory usage from
- asynchronous point-to-point communication. Finally, to reduce memory cost, and based on detailed memory
- allocation profiling, we proactively deallocate tensors that will not be used for future computation, including
- the input and output tensors of each pipeline stage. With these optimizations, we could pre-train
- Llama 3 on sequences of 8K tokens without activation checkpointing.
- Context parallelism for long sequences. We utilize context parallelism (CP) to improve memory efficiency when
- scaling the context length of Llama 3 and to enable training on extremely long sequences of up to 128K tokens.
- In CP, we partition across the sequence dimension; specifically, we partition the input sequence into
- 2 × CP chunks so each CP rank receives two chunks for better load balancing. The i-th CP rank receives
- both the i-th and the (2 × CP − 1 − i)-th chunks, as in the sketch below.
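- A minimal sketch of this chunk assignment; the function name is illustrative.

```python
def cp_chunk_indices(rank: int, cp_size: int) -> tuple[int, int]:
    """Chunks held by a context-parallel rank. The sequence is split into
    2 * CP chunks; pairing chunk i with chunk (2*CP - 1 - i) balances the
    causal-attention work between early and late parts of the sequence."""
    return rank, 2 * cp_size - 1 - rank

# With CP = 4:
#   rank 0 -> chunks (0, 7), rank 1 -> (1, 6), rank 2 -> (2, 5), rank 3 -> (3, 4)
for r in range(4):
    print(r, cp_chunk_indices(r, 4))
```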
- Different from existing CP implementations that overlap communication and computation in a ring-like
- structure (Liu et al., 2023a), our CP implementation adopts an all-gather based method where we first
- all-gather the key (K) and value (V) tensors, and then compute attention output for the local query (Q)
- tensor chunk. Although the all-gather communication latency is exposed in the critical path, we still adopt
- this approach for two main reasons: (1) it is easier and more flexible to support different types of attention
- masks in all-gather based CP attention, such as the document mask; and (2) the exposed all-gather latency
- Figure 6 Illustration of pipeline parallelism in Llama 3. Pipeline parallelism partitions eight pipeline stages (0 to 7) across
- four pipeline ranks (PP ranks 0 to 3), where the GPUs with rank 0 run stages 0 and 4, the GPUs with rank 1 run
- stages 1 and 5, etc. The colored blocks (0 to 9) represent a sequence of micro-batches, where M is the total number of
- micro-batches and N is the number of continuous micro-batches for the same stage’s forward or backward. Our key
- insight is to make N tunable.
- is small, as the communicated K and V tensors are much smaller than the Q tensor due to the use of GQA (Ainslie
- et al., 2023). Hence, the time complexity of attention computation is an order of magnitude larger than that of the
- all-gather (O(S²) versus O(S), where S represents the sequence length in the full causal mask), making the
- all-gather overhead negligible.
- Network-aware parallelism configuration. The order of parallelism dimensions, [TP, CP, PP, DP], is optimized
- for network communication. The innermost parallelism requires the highest network bandwidth and lowest
- latency, and hence is usually constrained to within the same server. The outermost parallelism may spread
- across a multi-hop network and should tolerate higher network latency. Therefore, based on the requirements
- for network bandwidth and latency, we place parallelism dimensions in the order of [TP, CP, PP, DP]. DP
- (i.e., FSDP) is the outermost parallelism because it can tolerate longer network latency by asynchronously
- prefetching sharded model weights and reducing gradients. Identifying the optimal parallelism configuration
- with minimal communication overhead while avoiding GPU memory overflow is challenging. We develop a
- memory consumption estimator and a performance-projection tool, which helped us explore various parallelism
- configurations, project overall training performance, and identify memory gaps effectively.
- Numerical stability. By comparing training loss between different parallelism setups, we fixed several numerical
- issues that impact training stability. To ensure training convergence, we use FP32 gradient accumulation
- during backward computation over multiple micro-batches and also reduce-scatter gradients in FP32 across
- data parallel workers in FSDP. For intermediate tensors, e.g., vision encoder outputs, that are used multiple
- times in the forward computation, the backward gradients are also accumulated in FP32.
- 3.3.3 Collective Communication
- Our collective communication library for Llama 3 is based on a fork of Nvidia’s NCCL library, called NCCLX.
- NCCLX significantly improves the performance of NCCL, especially for higher latency networks. Recall that
- the order of parallelism dimensions is [TP, CP, PP, DP], where DP corresponds to FSDP. The outermost
- parallelism dimensions, PP and DP, may communicate through a multi-hop network, with latency of up to tens
- of microseconds. The original NCCL collectives (all-gather and reduce-scatter in FSDP, and point-to-point
- in PP) require data chunking and staged data copy. This approach incurs several inefficiencies, including
- (1) requiring a large number of small control messages to be exchanged over the network to facilitate data
- transfer, (2) extra memory-copy operations, and (3) using extra GPU cycles for communication. For Llama 3
- training, we address a subset of these inefficiencies by tuning chunking and data transfer to fit our network
- latencies, which can be as high as tens of microseconds for a large cluster. We also allow small control messages
- to traverse our network at a higher priority, especially avoiding being head-of-line blocked in deep-buffer
- core switches. Our ongoing work for future Llama versions involves making deeper changes in NCCLX to
- holistically address all the aforementioned problems.
| Component | Category | Interruption Count | % of Interruptions |
|---|---|---|---|
| Faulty GPU | GPU | 148 | 30.1% |
| GPU HBM3 Memory | GPU | 72 | 17.2% |
| Software Bug | Dependency | 54 | 12.9% |
| Network Switch/Cable | Network | 35 | 8.4% |
| Host Maintenance | Unplanned Maintenance | 32 | 7.6% |
| GPU SRAM Memory | GPU | 19 | 4.5% |
| GPU System Processor | GPU | 17 | 4.1% |
| NIC | Host | 7 | 1.7% |
| NCCL Watchdog Timeouts | Unknown | 7 | 1.7% |
| Silent Data Corruption | GPU | 6 | 1.4% |
| GPU Thermal Interface + Sensor | GPU | 6 | 1.4% |
| SSD | Host | 3 | 0.7% |
| Power Supply | Host | 3 | 0.7% |
| Server Chassis | Host | 2 | 0.5% |
| IO Expansion Board | Host | 2 | 0.5% |
| Dependency | Dependency | 2 | 0.5% |
| CPU | Host | 2 | 0.5% |
| System Memory | Host | 2 | 0.5% |

- Table 5 Root-cause categorization of unexpected interruptions during a 54-day period of Llama 3 405B pre-training. About
- 78% of unexpected interruptions were attributed to confirmed or suspected hardware issues.
- 3.3.4 Reliability and Operational Challenges
- The complexity and potential failure scenarios of 16K GPU training surpass those of much larger CPU clusters
- that we have operated. Moreover, the synchronous nature of training makes it less fault-tolerant—a single
- GPU failure may require a restart of the entire job. Despite these challenges, for Llama 3, we achieved higher
- than 90% effective training time while supporting automated cluster maintenance, such as firmware and Linux
- kernel upgrades (Vigraham and Leonhardi, 2024), which resulted in at least one training interruption daily.
- The effective training time measures the time spent on useful training over the elapsed time.
- During a 54-day snapshot period of pre-training, we experienced a total of 466 job interruptions. Of these, 47
- were planned interruptions due to automated maintenance operations such as firmware upgrades or operator-
- initiated operations like configuration or dataset updates. The remaining 419 were unexpected interruptions,
- which are classified in Table 5. Approximately 78% of the unexpected interruptions are attributed to confirmed
- hardware issues, such as GPU or host component failures, or suspected hardware-related issues like silent data
- corruption and unplanned individual host maintenance events. GPU issues are the largest category, accounting
- for 58.7% of all unexpected issues. Despite the large number of failures, significant manual intervention was
- required only three times during this period, with the rest of issues handled by automation.
- To increase the effective training time, we reduced job startup and checkpointing time, and developed tools
- for fast diagnosis and problem resolution. We extensively use PyTorch’s built-in NCCL flight recorder (Ansel
- et al., 2024), a feature that captures collective metadata and stack traces into a ring buffer, allowing
- us to diagnose hangs and performance issues quickly at scale, particularly with regard to NCCLX. Using
- this, we efficiently record every communication event and the duration of each collective operation, and also
- automatically dump tracing data on NCCLX watchdog or heartbeat timeout. We enable more computationally
- intensive tracing operations and metadata collection selectively as needed live in production through online
- configuration changes (Tang et al., 2015) without needing a code release or job restart.
- Debugging issues in large-scale training is complicated by the mixed use of NVLink and RoCE in our network.
- Data transfer over NVLink typically occurs through load/store operations issued by CUDA kernels, and
- failures in either the remote GPU or NVLink connectivity often manifest as stalled load/store operations
- within CUDA kernels without returning a clear error code. NCCLX enhances the speed and accuracy of failure
- detection and localization through a tight co-design with PyTorch, allowing PyTorch to access NCCLX’s
- internal state and track relevant information. While stalls due to NVLink failures cannot be completely
- prevented, our system monitors the state of the communication library and automatically times out when
- such a stall is detected. Additionally, NCCLX traces the kernel and network activities of each NCCLX
- communication and provides a snapshot of the failing NCCLX collective’s internal state, including finished
- and pending data transfers between all ranks. We analyze this data to debug NCCLX scaling issues.
- Sometimes, hardware issues may cause still-functioning but slow stragglers that are hard to detect. Even a single
- straggler can slow down thousands of other GPUs, often appearing as functioning but slow communications.
- We developed tools to prioritize potentially problematic communications from selected process groups. By
- investigating just a few top suspects, we were usually able to effectively identify the stragglers.
- One interesting observation is the impact of environmental factors on training performance at scale. For
- Llama 3 405B, we noted a diurnal 1-2% throughput variation based on the time of day. This fluctuation is the
- result of higher mid-day temperatures impacting GPU dynamic voltage and frequency scaling.
- During training, tens of thousands of GPUs may increase or decrease power consumption at the same time,
- for example, due to all GPUs waiting for checkpointing or collective communications to finish, or the startup
- or shutdown of the entire training job. When this happens, it can result in instant fluctuations of power
- consumption across the data center on the order of tens of megawatts, stretching the limits of the power grid.
- This is an ongoing challenge for us as we scale training for future, even larger Llama models.
- 3.4 Training Recipe
- The recipe used to pre-train Llama 3 405B consists of three main stages: (1) initial pre-training, (2) long-context
- pre-training, and (3) annealing. The three stages are described separately below. We use similar recipes to
- pre-train the 8B and 70B models.
- 3.4.1 Initial Pre-Training
- We pre-train Llama 3 405B using AdamW with a peak learning rate of 8 × 10⁻⁵, a linear warm-up of 8,000
- steps, and a cosine learning rate schedule decaying to 8 × 10⁻⁷ over 1,200,000 steps. We use a lower batch size
- early in training to improve training stability, and increase it subsequently to improve efficiency. Specifically,
- we use an initial batch size of 4M tokens and sequences of length 4,096, and double these values to a batch
- size of 8M tokens with sequences of length 8,192 after pre-training on 252M tokens. We double the batch size
- again to 16M after pre-training on 2.87T tokens; this ramp is sketched below. We found this training recipe to be
- very stable: we observed few loss spikes and did not require interventions to correct for model training divergence.
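- The staged batch-size ramp as a minimal sketch; the breakpoints and batch sizes come directly from the text, while the function itself is illustrative.

```python
def global_batch_size_tokens(tokens_seen: float) -> int:
    """Batch-size schedule for Llama 3 405B initial pre-training (Section 3.4.1)."""
    if tokens_seen < 252e6:    # first 252M tokens: 4M-token batches, seq len 4,096
        return 4_000_000
    if tokens_seen < 2.87e12:  # until 2.87T tokens: 8M-token batches, seq len 8,192
        return 8_000_000
    return 16_000_000          # thereafter: 16M-token batches
```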
- Adjusting the data mix. We made several adjustments to the pre-training data mix during training to improve
- model performance on particular downstream tasks. In particular, we increased the percentage of non-English
- data during pre-training to improve the multilingual performance of Llama 3. We also upsampled mathematical
- data to improve the model’s mathematical reasoning performance, added more recent web data in the
- later stages of pre-training to advance the model’s knowledge cut-off, and downsampled subsets of the
- pre-training data that were later identified as being lower quality.
- 3.4.2 Long Context Pre-Training
- In the final stages of pre-training, we train on long sequences to support context windows of up to 128K tokens.
- We do not train on long sequences earlier because the compute in self-attention layers grows quadratically in
- the sequence length. We increase the supported context length in increments, pre-training until the model has
- successfully adapted to the increased context length. We assess successful adaptation by measuring whether (1) model
- performance on short-context evaluations has recovered completely and (2) the model perfectly solves
- “needle in a haystack” tasks up to that length. In Llama 3 405B pre-training, we increased the context length
- gradually in six stages, starting from the original 8K context window and ending in the final 128K context
- window. This long-context pre-training stage was performed using approximately 800B training tokens.
- Figure 7 Illustration of the overall post-training approach for Llama 3. Our post-training strategy involves rejection sampling,
- supervised finetuning, and direct preference optimization. See text for details.
- 3.4.3 Annealing
- During pre-training on the final 40M tokens, we linearly annealed the learning rate to 0, maintaining a context
- length of 128K tokens. During this annealing phase, we also adjusted the data mix to upsample data sources
- of very high quality; see Section 3.1.3. Finally, we compute the average of model checkpoints (Polyak (1991)
- averaging) during annealing to produce the final pre-trained model.
- 4 Post-Training
- We produce the aligned Llama 3 models by applying several rounds of post-training,6 or aligning the model
- with human feedback (Ouyang et al., 2022; Rafailov et al., 2024) on top of a pre-trained checkpoint. Each
- round of post-training involves supervised finetuning (SFT) followed by Direct Preference Optimization (DPO;
- Rafailov et al., 2024) on examples collected either via human annotations or generated synthetically. Our
- post-training modeling and data approaches are described in Sections 4.1 and 4.2 respectively. We further
- detail custom data curation strategies to improve reasoning, coding, factuality, multilingual, tool-use, long-context,
- and precise instruction-following capabilities in Section 4.3.
- 4.1 Modeling
- The backbone of our post-training strategy is a reward model and a language model. We first train a reward
- model on top of the pre-trained checkpoint using human-annotated preference data (see Section 4.1.2). We
- then finetune pre-trained checkpoints with supervised finetuning (SFT; see Section 4.1.3), and further align
- the checkpoints with Direct Preference Optimization (DPO; see Section 4.1.4). This process is illustrated
- in Figure 7. Unless otherwise noted, our modeling procedure applies to Llama 3 405B, and we refer to
- Llama 3 405B as Llama 3 for simplicity.
- 4.1.1 Chat Dialog Format
- To tune LLMs for human-AI interaction, we need to define a chat dialog protocol for the model to understand
- human instructions and perform conversational tasks. Compared to its predecessor, Llama 3 has new
- capabilities such as tool use (Section 4.3.5) which may require generating multiple messages and sending
- 6 We use the term “post-training” to refer to any model training that happens outside of pre-training.
- them to different locations (e.g., user, ipython) within a single dialog turn. To support this, we design a new
- multi-message chat protocol which uses various special header and termination tokens. The header tokens
- are used to indicate the source and destination of each message in a conversation. Similarly, the termination
- tokens indicate when it is time to alternate between the human and AI speakers.
- 4.1.2 Reward Modeling
- We train a reward model (RM) covering different capabilities on top of the pre-trained checkpoint. The
- training objective is the same as Llama 2 except that we remove the margin term in the loss, as we observe
- diminishing improvements after data scaling. Following Llama 2, we use all of our preference data for reward
- modeling after filtering out samples with similar responses. In addition to the standard preference pair of (chosen,
- rejected) responses, annotators also create a third “edited response” for some prompts, where the chosen
- response from the pair is further edited for improvement (see Section 4.2.1). Hence, each preference ranking
- sample has two or three responses with a clear ranking (edited > chosen > rejected). We concatenate the
- prompt and multiple responses into a single row during training with responses randomly shuffled. This is an
- approximation to the standard scenario of putting the responses in separate rows and computing the scores,
- but in our ablations, this approach improves training efficiency without a loss in accuracy.
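- As a concrete illustration, below is a minimal PyTorch sketch of the pairwise ranking objective described above: the Llama 2 reward-model loss with the margin term removed. How two- or three-response rankings decompose into pairwise terms is not specified in the text, so the three-way helper here is an assumption.

```python
import torch
import torch.nn.functional as F

def ranking_loss(r_preferred: torch.Tensor, r_dispreferred: torch.Tensor) -> torch.Tensor:
    """Llama 2-style pairwise loss without the margin term:
    L = -log sigmoid(r_preferred - r_dispreferred)."""
    return -F.logsigmoid(r_preferred - r_dispreferred).mean()

def three_way_loss(r_edited, r_chosen, r_rejected):
    """Assumed decomposition of the ranking edited > chosen > rejected
    into all ordered pairwise comparisons (not specified in the paper)."""
    return (ranking_loss(r_edited, r_chosen)
            + ranking_loss(r_edited, r_rejected)
            + ranking_loss(r_chosen, r_rejected)) / 3.0
```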
- 4.1.3 Supervised Finetuning
- The reward model is then used to perform rejection sampling on our human annotation prompts, the details
- of which are described in Section 4.2. Together with this rejection-sampled data and other data sources
- (including synthetic data), we finetune the pre-trained language model using a standard cross entropy loss
- on the target tokens (while masking loss on prompt tokens). More details about the data mix can be found
- in Section 4.2. We refer to this stage as supervised finetuning (SFT; Wei et al., 2022a; Sanh et al., 2022;
- Wang et al., 2022b), even though many of the training targets are model-generated. Our largest models are
- finetuned with a learning rate of $10^{-5}$ over the course of 8.5K to 9K steps. We found these hyperparameter
- settings to work well across different rounds and data mixes.
- 4.1.4 Direct Preference Optimization
- We further train our SFT models with Direct Preference Optimization (DPO; Rafailov et al., 2024) for human
- preference alignment. For training, we primarily use the most recent batches of preference data collected using
- the best performing models from the previous alignment rounds. As a result, our training data conforms better
- to the distribution of the policy model that is being optimized in each round. We also explored on-policy
- algorithms such as PPO (Schulman et al., 2017), but found that DPO required less compute for large-scale
- models and performed better, especially on instruction following benchmarks like IFEval (Zhou et al., 2023).
- For Llama 3, we use a learning rate of $10^{-5}$ and set the $\beta$ hyperparameter to 0.1. In addition, we apply
- the following algorithmic modifications to DPO:
- • Masking out formatting tokens in DPO loss: We mask out special formatting tokens, including header
- and termination tokens (described in Section 4.1.1), from both chosen and rejected responses in the
- loss to stabilize DPO training. We observe that having these tokens contribute to the loss may lead
- to undesired model behaviors such as tail repetition or abruptly generating termination tokens. We
- hypothesize that this is due to the contrastive nature of the DPO loss – the presence of common tokens
- in both chosen and rejected responses leads to a conflicting learning objective as the model needs to
- increase and reduce the likelihood of these tokens simultaneously.
- • Regularization with NLL loss: We add an additional negative log-likelihood (NLL) loss term with a scaling
- coefficient of 0.2 on the chosen sequences, similar to Pang et al. (2024). This helps further stabilize DPO
- training by maintaining desired formatting for generation and preventing the decrease of the log probability
- of chosen responses (Pang et al., 2024; Pal et al., 2024). A minimal sketch combining both modifications follows below.
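- The sketch below assumes per-token log-probabilities of shape (batch, sequence) and 0/1 masks that zero out the special header and termination tokens; the $\beta = 0.1$ and the 0.2 NLL coefficient come from the text, while everything else is an illustrative interface.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,   # per-token log-probs, policy
             ref_chosen_lp, ref_rejected_lp,         # per-token log-probs, frozen reference
             chosen_mask, rejected_mask,             # 0 for header/termination tokens
             beta: float = 0.1, nll_coef: float = 0.2) -> torch.Tensor:
    # Formatting tokens are masked out of both responses before summing.
    chosen_ratio = ((policy_chosen_lp - ref_chosen_lp) * chosen_mask).sum(-1)
    rejected_ratio = ((policy_rejected_lp - ref_rejected_lp) * rejected_mask).sum(-1)
    dpo = -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
    # NLL regularizer on the chosen sequence (Pang et al., 2024).
    nll = -(policy_chosen_lp * chosen_mask).sum(-1).mean()
    return dpo + nll_coef * nll
```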
- 4.1.5 Model Averaging
- Finally, we average models obtained from experiments using various versions of data or hyperparameters at
- each RM, SFT, or DPO stage (Izmailov et al., 2019; Wortsman et al., 2022; Li et al., 2022).
| Dataset | % of comparisons | Avg. # turns per dialog | Avg. # tokens per example | Avg. # tokens in prompt | Avg. # tokens in response |
|---|---|---|---|---|---|
| General English | 81.99% | 4.1 | 1,000.4 | 36.4 | 271.2 |
| Coding | 6.93% | 3.2 | 1,621.0 | 113.8 | 462.9 |
| Multilingual | 5.19% | 1.8 | 1,299.4 | 77.1 | 420.9 |
| Reasoning and tools | 5.89% | 1.6 | 707.7 | 46.6 | 129.9 |
| Total | 100% | 3.8 | 1,041.6 | 44.5 | 284.0 |
- Table 6 Statistics of human preference data. We list statistics of the internally collected human preference data used for
- Llama 3 alignment. We ask annotators to perform multi-turn dialogues with the models and make comparisons among
- responses at each turn. In post-processing, we split each dialogue into multiple examples at the turn level. Each example
- consists of a prompt (including previous dialog if available) and a response (e.g., chosen or rejected response).
- 4.1.6 Iterative Rounds
- Following Llama 2, we apply the above methods in six rounds. In each cycle, we collect new preference
- annotations and SFT data, sampling synthetic data from the latest models.
- 4.2 Post-training Data
- The post-training data composition plays a critical role in the usefulness and behavior of language models. In
- this section, we discuss our human annotation procedures and preference data collection (Section 4.2.1), the
- composition of our SFT data (Section 4.2.2), and methods for data quality control and cleaning (Section 4.2.3).
- 4.2.1 Preference Data
- Our preference data annotation process is similar to Llama 2. We deploy multiple models for annotation after
- each round and sample two responses from two different models for each user prompt. These models can
- be trained with different data mixes and alignment recipes, allowing for different capability strengths (e.g.,
- code expertise) and increased data diversity. We ask annotators to rate the strength of their preference by
- categorizing it into one of four levels, based on how much more they prefer the chosen response over the
- rejected one: significantly better, better, slightly better, or marginally better. We also incorporate an editing
- step after preference ranking to encourage annotators to further improve the preferred response. Annotators
- edit the chosen response directly or prompt the model with feedback to refine its own response. Consequently,
- a portion of our preference data has three responses ranked (edited > chosen > rejected).
- In Table 6, we report the statistics of preference annotations that we use for Llama 3 training. General English
- covers multiple subcategories, such as knowledge-based question answering or precise instruction following,
- which fall outside the scope of specific capabilities. Compared to Llama 2, we observe an increase in the
- average length of prompt and response, suggesting that we train Llama 3 on more complex tasks. In addition,
- we implement a quality analysis and human evaluation process to rigorously assess the data collected, allowing
- us to refine our prompts and provide systematic, actionable feedback to annotators. For example, as Llama 3
- improves after each round, we increase prompt complexity accordingly to target areas where the model lags.
- In each round of post-training, we use all the preference data that is available at the time for reward modeling,
- while only using the latest batches from various capabilities for DPO training. For both reward modeling and
- DPO, we use samples that are labeled as the chosen response being significantly better or better than the
- rejected counterpart for training and discard samples with similar responses.
- 4.2.2 SFT Data
- Our finetuning data largely consists of the following sources:
- • Prompts from our human annotation collection with rejection-sampled responses.
- • Synthetic data targeting specific capabilities (see Section 4.3 for more details).
| Dataset | % of examples | Avg. # turns | Avg. # tokens | Avg. # tokens in context | Avg. # tokens in final response |
|---|---|---|---|---|---|
| General English | 52.66% | 6.3 | 974.0 | 656.7 | 317.1 |
| Code | 14.89% | 2.7 | 753.3 | 378.8 | 374.5 |
| Multilingual | 3.01% | 2.7 | 520.5 | 230.8 | 289.7 |
| Exam-like | 8.14% | 2.3 | 297.8 | 124.4 | 173.4 |
| Reasoning and tools | 21.19% | 3.1 | 661.6 | 359.8 | 301.9 |
| Long context | 0.11% | 6.7 | 38,135.6 | 37,395.2 | 740.5 |
| Total | 100% | 4.7 | 846.1 | 535.7 | 310.4 |
- Table 7 Statistics of SFT data. We list internally collected SFT data used for Llama 3 alignment. Each SFT example
- consists of a context (i.e., all conversation turns except the last one) and a final response.
- • Small amounts of human-curated data (see Section 4.3 for more details).
- As our post-training rounds progress, we develop stronger Llama 3 variants that we use to collect larger
- datasets that cover a wide range of complex capabilities. In this section, we discuss the details for the
- rejection-sampling procedure and the overall composition of our final SFT data mix.
- Rejection sampling. During rejection sampling (RS), for each prompt collected during human annotation
- (Section 4.2.1) we sample K (typically between 10 and 30) outputs from the latest chat model policy (usually
- the best performing checkpoint from the previous post-training iteration, or the best performing checkpoint
- for a particular capability) and use our reward model to select the best candidate, consistent with Bai et al.
- (2022). In later rounds of post-training, we introduce system prompts to steer RS responses to conform with
- desirable tone, style, or formatting, which might be different for different capabilities.
- To increase the efficiency of rejection sampling, we adopt PagedAttention (Kwon et al., 2023). PagedAttention
- enhances memory efficiency through dynamic key-value cache allocation. It supports arbitrary output lengths
- by dynamically scheduling requests based on the current cache capacity. Unfortunately, this carries the risk of
- swap-out when running out of memory. To eliminate such swap overhead, we define a maximum output length
- and perform a request only if sufficient memory is available to fit an output with that length. PagedAttention
- also enables us to share the key-value cache pages for a prompt across all corresponding outputs. Together,
- this leads to a throughput improvement of over 2× during rejection sampling.
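- The rejection-sampling step itself reduces to a simple loop. The sketch below shows only the selection logic, under hypothetical `policy.generate` and `reward_model.score` interfaces; in production, generation is batched through PagedAttention as described above.

```python
def rejection_sample(prompt: str, policy, reward_model,
                     k: int = 20, system_prompt: str | None = None) -> str:
    """Sample K candidates (K is typically 10-30) from the latest policy and
    return the one the reward model scores highest."""
    text = f"{system_prompt}\n{prompt}" if system_prompt else prompt
    candidates = [policy.generate(text) for _ in range(k)]
    scores = [reward_model.score(prompt, c) for c in candidates]
    return candidates[max(range(k), key=scores.__getitem__)]
```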
- Overall data composition. Table 7 shows data statistics for each broad category of our “helpfulness” mix. While
- SFT and preference data contain overlapping domains, they are curated differently, yielding distinct count
- statistics. In Section 4.2.3 we describe techniques for categorizing topic, complexity, and quality of our data
- samples. In each round of post-training, we adjust our overall data mix carefully across these axes to tune
- performance across a wide range of benchmarks. Our final data mix makes multiple epochs over some high-quality
- sources and downsamples others.
- 4.2.3 Data Processing and Quality Control
- Given that most of our training data is model-generated, it requires careful cleaning and quality control.
- Data cleaning. In the early rounds, we observed a number of undesirable patterns common in our data, such
- as excessive use of emojis or exclamation points. Therefore, we implement a series of rule-based data removal
- and modification strategies to filter or clean problematic data. For example, to mitigate overly-apologetic
- tonal issues, we identify overused phrases (such as “I’m sorry” or “I apologize”) and carefully balance the
- proportion of such samples in our dataset.
- Data pruning. We also apply a collection of model-based techniques to remove low-quality training samples
- and improve overall model performance:
- • Topic classification: We first finetune Llama 3 8B into a topic classifier and perform inference over
- all data to classify it into both coarse-grained buckets (e.g., “mathematical reasoning”) and fine-grained
- buckets (e.g., “geometry and trigonometry”).
- • Quality scoring: We use both reward-model and Llama-based signals to obtain a quality score for each
- sample. For the RM-based score, we consider data in the top quartile of RM scores to be high quality.
- For the Llama-based score, we prompt a Llama 3 checkpoint to rate each sample on a three-point scale for
- general English data (accuracy, instruction following, and tone/presentation) and a two-point scale for
- coding data (bug identification and user intention), and consider samples that obtain the maximum
- score to be high quality. The RM-based and Llama-based scores have high disagreement rates, and we find that
- combining these signals yields the best recall on our internal test set. Ultimately, we select examples
- that are marked as high quality by either the RM-based or the Llama-based filter.
- • Difficulty scoring: Because we are also interested in prioritizing examples that are more complex for
- the model, we score data using two measures of difficulty: Instag (Lu et al., 2023) and Llama-based
- scoring. For Instag, we prompt Llama 3 70B to perform intention tagging of SFT prompts, where more
- intentions imply more complexity. We also prompt Llama 3 to measure the difficulty (Liu et al., 2024c)
- of dialogs on a three-point scale.
- • Semantic deduplication: Finally, we perform semantic deduplication (Abbas et al., 2023; Liu et al.,
- 2024c). We first cluster complete dialogs using RoBERTa (Liu et al., 2019b) and, within each cluster,
- sort them by quality score × difficulty score. We then perform greedy selection by iterating through all sorted
- examples and keeping only those whose maximum cosine similarity to the examples seen so far in the
- cluster is less than a threshold (sketched below).
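- The greedy selection in the deduplication step can be sketched in a few lines of NumPy. The embedding source (RoBERTa, per the text) and the scores are external inputs; the threshold value here is illustrative, not from the paper.

```python
import numpy as np

def dedup_cluster(embeddings: np.ndarray, quality: np.ndarray,
                  difficulty: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy semantic deduplication within one cluster: sort by
    quality * difficulty, then keep an example only if its maximum cosine
    similarity to everything kept so far is below the threshold."""
    # Normalize rows so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    order = np.argsort(-(quality * difficulty))  # best examples first
    kept: list[int] = []
    for i in order:
        if not kept or np.max(normed[kept] @ normed[i]) < threshold:
            kept.append(i)
    return kept
```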
- 4.3 Capabilities
- We highlight special efforts to improve performance for specific capabilities such as code (Section 4.3.1),
- multilinguality (Section 4.3.2), math and reasoning (Section 4.3.3), long context (Section 4.3.4), tool use
- (Section 4.3.5), factuality (Section 4.3.6), and steerability (Section 4.3.7).
- 4.3.1 Code
- LLMs for code have received significant attention since the release of Copilot and Codex (Chen et al., 2021).
- Developers are now widely using these models to generate code snippets, debug, automate tasks, and improve
- code quality. For Llama 3, we target improving and evaluating code generation, documentation, debugging,
- and review capabilities for the following high-priority programming languages: Python, Java, JavaScript,
- C/C++, TypeScript, Rust, PHP, HTML/CSS, SQL, and bash/shell. Here, we present our work on improving
- these coding capabilities via training a code expert, generating synthetic data for SFT, improving formatting
- with system prompt steering, and creating quality filters to remove bad samples from our training data.
- Expert training. We train a code expert which we use to collect high quality human annotations for code
- throughout subsequent rounds of post-training. This is accomplished by branching the main pre-training run
- and continuing pre-training on a 1T token mix of mostly (>85%) code data. Continued pre-training on domain-
- specific data has been shown to be effective for improving performance in a specific domain (Gururangan
- et al., 2020). We follow a recipe similar to that of CodeLlama (Rozière et al., 2023). For the last several
- thousand steps of training we perform long-context finetuning (LCFT) to extend the expert’s context length
- to 16K tokens on a high-quality mix of repo-level code data. Finally, we follow post-training
- modeling recipes similar to those described in Section 4.1 to align this model, except with SFT and DPO data mixes primarily
- targeting code. This model is also used for rejection sampling (Section 4.2.2) for coding prompts.
- Synthetic data generation. During development, we identified key issues in code generation, including difficulty
- in following instructions, code syntax errors, incorrect code generation, and difficulty in fixing bugs. While
- intensive human annotation could theoretically resolve these issues, synthetic data generation offers a
- complementary approach at a lower cost and higher scale, unconstrained by the expertise level of annotators.
- As such, we use Llama 3 and the code expert to generate a large quantity of synthetic SFT dialogs.
- We describe three high-level approaches for generating synthetic code data. In total, we generate over 2.7M
- synthetic examples which were used during SFT.
- 1. Synthetic data generation: execution feedback. The 8B and 70B models show significant performance
- improvements when trained on data generated by a larger, more competent model. However, our initial
- experiments revealed that training Llama 3 405B on its own generated data is not helpful (and can
- even degrade performance). To address this limitation, we introduced execution feedback as a source of
- truth, enabling the model to learn from its mistakes and stay on track. In particular, we generate a large
- dataset of approximately one million synthetic coding dialogues using the following process:
- • Problem description generation: First, we generate a large collection of programming problem
- descriptions that span a diverse range of topics, including those in the long tail distribution. To
- achieve this diversity, we sample random code snippets from various sources and prompt the model
- to generate programming problems inspired by these examples. This allows us to tap into a wide
- range of topics and create a comprehensive set of problem descriptions (Wei et al., 2024).
- • Solution generation: Then, we prompt Llama 3 to solve each problem in a given programming
- language. We observe that adding general rules of good programming to the prompt improves the
- generated solution quality. Also, we find it helpful to require the model to explain its thought
- process in comments.
- • Correctness analysis: After generating a solution, it is crucial to recognize that its correctness is
- not guaranteed, and including incorrect solutions in the finetuning dataset could harm the model’s
- quality. While we do not ensure complete correctness, we develop methods to approximate it. To
- achieve this, we extract the source code from the generated solution and apply a combination of
- static and dynamic analysis techniques to test its correctness, including:
- – Static analysis: We run all generated code through a parser and a linter to ensure syntactic
- correctness, catching errors such as syntax errors, use of uninitialized variables or non-imported
- functions, code style issues, typing errors, and others.
- – Unit test generation and execution: For each problem and solution, we prompt the model
- to generate unit tests, which are executed in a containerized environment together with the solution,
- catching run-time execution errors and some semantic errors.
- • Error feedback and iterative self-correction: When a solution fails at any step, we prompt the
- model to revise it. The prompt includes the original problem description, the faulty solution,
- and feedback from the parser/linter/tester (stdout, stderr, and return code); a minimal sketch of
- this loop appears after this list. After a unit test execution failure, the model could either fix
- the code to pass the existing tests or modify its unit tests to accommodate the generated code.
- Only dialogs that pass all checks are included in the final dataset used for supervised finetuning
- (SFT). Notably, we observed that about 20% of solutions were initially incorrect but self-corrected,
- indicating that the model learned from the execution feedback and improved its performance.
- • Fine-tuning and iterative improvement: The finetuning process is conducted over multiple rounds,
- with each round building on the previous one. After each round, the model is improved, generating
- higher-quality synthetic data for the next round. This iterative process allows for progressive
- refinement and enhancement of the model’s performance.
- 2. Synthetic data generation: programming language translation. We observe a performance gap between
- major programming languages (e.g., Python/C++) and less common ones (e.g., TypeScript/PHP). This
- is not surprising as we have less training data for less common programming languages. To mitigate
- this, we supplement our existing data by translating data from common programming languages to
- less common languages (similar to Chen et al. (2023) in the context of reasoning). This is achieved
- by prompting Llama 3 and ensuring quality via syntax parsing, compilation, and execution. Figure 8
- demonstrates an example of synthetic PHP code translated from Python. This improves performance
- significantly for less common languages as measured by the MultiPL-E (Cassano et al., 2023) benchmark.
- Figure 8 Code translation example. We display an example of using Llama 3 to translate Python code (left) to PHP
- code (right) to augment our SFT dataset with a wider range of programming languages.
- Figure 9 Improving generated code quality with system prompts. Left: without system prompt. Right: with system prompt.
- 3. Synthetic data generation: backtranslation. To improve certain coding capabilities (e.g., documentation,
- explanations) where execution feedback is less informative for determining quality, we employ an
- alternative multi-step approach. Using this procedure, we generated approximately 1.2M synthetic
- dialogs related to code explanation, generation, documentation, and debugging. Beginning with code
- snippets from a variety of languages in our pre-training data:
- • Generate: We prompt Llama 3 to generate data that represents our target capability (e.g., we add
- comments and docstrings for the code snippet, or we ask the model to explain a piece of code).
- • Backtranslate: We then prompt the model to “backtranslate” the synthetically generated data to
- the original code (e.g., we prompt the model to generate code only from its documentation, or we
- ask the model to generate code only from its explanation).
- • Filter: Using the original code as a reference, we prompt Llama 3 to determine the quality of
- the output (e.g., we ask the model how faithful the backtranslated code is to the original). We
- then use the generated examples that have the highest self-verification scores in SFT.
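- To make the execution-feedback loop of approach 1 concrete, here is a minimal sketch: a syntax check stands in for the full static-analysis pass, the generated unit tests run as a plain subprocess rather than in a container, and `model.generate` is a hypothetical interface.

```python
import os
import subprocess
import tempfile

def passes_checks(solution: str, tests: str) -> tuple[bool, str]:
    """Static check (parse) plus dynamic check (run the generated unit tests)."""
    try:
        compile(solution, "<solution>", "exec")  # syntactic correctness
    except SyntaxError as e:
        return False, f"SyntaxError: {e}"
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "run.py")
        with open(path, "w") as f:
            f.write(solution + "\n\n" + tests)
        try:
            proc = subprocess.run(["python", path], capture_output=True,
                                  text=True, timeout=30)
        except subprocess.TimeoutExpired:
            return False, "timeout"
        return proc.returncode == 0, proc.stdout + proc.stderr

def generate_with_feedback(problem: str, model, max_rounds: int = 3):
    """Iterative self-correction: revise the solution with parser/tester
    feedback; dialogs that never pass all checks are discarded."""
    solution = model.generate(f"Solve, explaining your reasoning in comments:\n{problem}")
    tests = model.generate(f"Write unit tests for this solution:\n{solution}")
    for _ in range(max_rounds):
        ok, feedback = passes_checks(solution, tests)
        if ok:
            return solution
        solution = model.generate(f"Problem:\n{problem}\nFaulty solution:\n{solution}\n"
                                  f"Feedback:\n{feedback}\nRevise the solution:")
    return None  # dropped from the SFT dataset
```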
- System prompt steering during rejection sampling. During the rejection sampling process, we used code-specific
- system prompts to improve code readability, documentation, thoroughness, and specificity. Recall from
- Section 4.2 that this data is used to finetune the language model. Figure 9 shows an example of how the system
- prompt helps improve the generated code quality — it adds necessary comments, uses more informative
- variable names, saves memory, etc.
- Filtering training data with execution and model-as-judge signals. As described in Section 4.2.3, we occasionally
- encounter quality issues in our rejection-sampled data, such as code blocks containing bugs. Detecting these
- issues in our rejection-sampled data is not as straightforward as it is for our synthetic code data, as the
- rejection-sampled responses typically contain a mix of natural language and code for which the code may not
- always be expected to be executable. (For example, user prompts may explicitly ask for pseudo-code or edits to
- only a very small snippet of an executable program.) To address this, we utilize the “model-as-judge” approach,
- where earlier versions of Llama 3 assess and assign a binary (0/1) score on each of two criteria: code correctness
- and code style. We retain only those samples that achieve a perfect score of 2. Initially, this stringent filtering
- led to a regression in downstream benchmark performance, primarily because it disproportionately removed
- examples with challenging prompts. To counteract this, we strategically revised the responses of some coding
- data categorized as most challenging until they met the Llama-based “model-as-judge” criteria. By refining
- these challenging problems, the coding data achieves a balance between quality and difficulty, resulting in
- optimal downstream performance.
- 4.3.2 Multilinguality
- We describe how we improve Llama 3’s multilingual capabilities, including training an expert specialized on
- substantially more multilingual data, sourcing and generating high quality multilingual instruction tuning
- data for German, French, Italian, Portuguese, Hindi, Spanish, and Thai, and tackling specific challenges of
- multilingual language steering to enhance the overall performance of our model.
- Expert training. Our Llama 3 pre-training data mix contains significantly more English tokens than non-English
- tokens. To collect higher quality human annotations in non-English languages, we train a multilingual expert by
- branching off the pre-training run and continuing to pre-train on a data mix that consists of 90% multilingual
- tokens. We then perform post-training on this expert following Section 4.1. This expert model is used to
- collect higher-quality annotations in non-English languages until pre-training is fully complete.
- Multilingual data collection. Our multilingual SFT data is derived primarily from sources described below. The
- overall distribution is 2.4% human annotations, 44.2% data from other NLP tasks, 18.8% rejection sampled
- data, and 34.6% translated reasoning data.
- • Human annotations: We collect high-quality, manually annotated data from linguists and native speakers.
- These annotations mostly consist of open-ended prompts that represent real-world use cases.
- • Data from other NLP tasks: To further augment our data, we use multilingual training data from other tasks
- and rewrite it into dialog format. For example, we use data from exams-qa (Hardalov et al., 2020)
- and Conic10k (Wu et al., 2023). To improve language alignment, we also use parallel texts from
- GlobalVoices (Prokopidis et al., 2016) and Wikimedia (Tiedemann, 2012). We use LID based filtering
- and Blaser2.0 (Seamless Communication et al., 2023) to remove low quality data. For parallel text data,
- instead of using the bitext pairs directly, we apply a multilingual template inspired by Wei et al. (2022a)
- to better simulate real-life conversations in translation and language learning scenarios.
- • Rejection sampled data: We apply rejection sampling on our human-annotated prompts to generate
- high-quality samples for finetuning, with few modifications compared to the process for English data:
- – Generation: We explored randomly choosing the temperature hyperparameter from the range
- 0.2–1.0 for diverse generations in early rounds of post-training. With high temperature, responses
- for multilingual prompts can get creative and inspiring, but are also susceptible to unnecessary
- or unnatural code-switching. In the final round of post-training, we use a constant value of 0.6
- to balance the trade-off. Additionally, we used specialized system prompts to improve response
- format, structure and general readability.
- – Selection: Prior to reward-model-based selection, we implement multilingual-specific checks to
- ensure a high language-match rate between the prompt and response (e.g., a romanized Hindi prompt
- should not expect a response in Hindi Devanagari script).
- • Translated data: We try to avoid using machine-translated data to finetune the model in order to
- prevent translationese (Bizzoni et al., 2020; Muennighoff et al., 2023) or possible name bias (Wang
- et al., 2022a), gender bias (Savoldi et al., 2021), or cultural bias (Ji et al., 2023). Moreover, we aim to
- prevent the model from being exposed only to tasks that are rooted in an English cultural context, which
- may not be representative of the linguistic and cultural diversity we aim to capture. We made one
- exception to this and translated our synthetic quantitative reasoning data (see Section 4.3.3 for details)
- to improve performance in quantitative reasoning in non-English languages. Due to the simple nature of
- the language in these math problems, the translated samples were found to have little to no quality
- issues. We observed strong gains on MGSM (Shi et al., 2022) from adding this translated data.
- 4.3.3 Math and Reasoning
- We define reasoning as the ability to perform multi-step computations and arrive at the correct final answer.
- Several challenges guide our approach to training models that excel in mathematical reasoning:
- • Lack of prompts: As the complexity of questions increases, the number of valid prompts or questions
- for supervised finetuning (SFT) decreases. This scarcity makes it difficult to create diverse and
- representative training datasets for teaching models various mathematical skills (Yu et al., 2023; Yue
- et al., 2023; Luo et al., 2023; Mitra et al., 2024; Shao et al., 2024; Yue et al., 2024b).
- • Lack of ground truth chains of thought: Effective reasoning requires a step-by-step solution to facilitate
- the reasoning process (Wei et al., 2022c). However, there is often a shortage of ground-truth chains of
- thought, which are essential for guiding the model in how to break down the problem step-by-step and
- reach the final answer (Zelikman et al., 2022).
- • Incorrect intermediate steps: When using model-generated chains of thought, the intermediate steps
- may not always be correct (Cobbe et al., 2021; Uesato et al., 2022; Lightman et al., 2023; Wang et al.,
- 2023a). This inaccuracy can lead to incorrect final answers and needs to be addressed.
- • Teaching models to use external tools: Enhancing models to utilize external tools, such as code interpreters,
- allows them to reason by interleaving code and text (Gao et al., 2023; Chen et al., 2022; Gou et al.,
- 2023). This capability can significantly improve their problem-solving abilities.
- • Discrepancy between training and inference: There is often a discrepancy between how the model is
- finetuned during training and how it is used during inference. During inference, the finetuned model may
- interact with humans or other models, requiring it to improve its reasoning using feedback. Ensuring
- consistency between training and real-world usage is crucial for maintaining reasoning performance.
- To address these challenges, we apply the following methodologies:
- • Addressing the lack of prompts: We source relevant pre-training data from mathematical contexts and
- convert it into a question-answer format which can then be used for supervised finetuning. Additionally,
- we identify mathematical skills where the model under-performs and actively source prompts from
- humans to teach the model such skills. To facilitate this process, we create a taxonomy of mathematical
- skills (Didolkar et al., 2024) and ask humans to provide relevant prompts/questions accordingly.
- • Augmenting training data with step-wise reasoning traces: We use Llama 3 to generate step-by-step
- solutions for a set of prompts. For each prompt, the model produces a variable number of generations.
- These generations are then filtered based on the correct answer (Li et al., 2024a). We also perform self-
- verification, where Llama 3 is used to verify whether a particular step-by-step solution is valid for a given
- question. This process improves the quality of the finetuning data by eliminating instances where the
- model does not produce valid reasoning traces; a minimal sketch of this filtering appears after this list.
- • Filtering incorrect reasoning traces: We train outcome and stepwise reward models (Lightman et al., 2023;
- Wang et al., 2023a) to filter training data where the intermediate reasoning steps were incorrect. These
- reward models are used to eliminate data with invalid step-by-step reasoning, ensuring high-quality
- data for finetuning. For more challenging prompts, we use Monte Carlo Tree Search (MCTS) with
- learned step-wise reward models to generate valid reasoning traces, further enhancing the collection of
- high-quality reasoning data (Xie et al., 2024).
- • Interleaving code and text reasoning: We prompt Llama 3 to solve reasoning problems through a
- combination of textual reasoning and associated Python code (Gou et al., 2023). Code execution is used
- as a feedback signal to eliminate cases where the reasoning chain was not valid, ensuring the correctness
- of the reasoning process.
- • Learning from feedback and mistakes: To simulate human feedback, we utilize incorrect generations (i.e.,
- generations leading to incorrect reasoning traces) and perform error correction by prompting Llama 3 to
- yield correct generations (An et al., 2023b; Welleck et al., 2022; Madaan et al., 2024a). The iterative
- process of using feedback from incorrect attempts and correcting them helps improve the model’s ability
- to reason accurately and learn from its mistakes.
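- A minimal sketch of the answer-based filtering with self-verification (second bullet above) might look as follows; `model.generate` is a hypothetical interface, and the "Final answer:" convention is an assumption of this sketch, not a detail from the paper.

```python
def final_answer(trace: str) -> str:
    """Assumes the prompt instructs the model to end with a 'Final answer:' line."""
    for line in reversed(trace.splitlines()):
        if line.lower().startswith("final answer:"):
            return line.split(":", 1)[1].strip()
    return ""

def filter_traces(question: str, answer: str, model, n_samples: int = 8) -> list[str]:
    """Keep generations whose final answer is correct and that the model
    itself verifies as step-by-step valid."""
    kept = []
    for _ in range(n_samples):
        trace = model.generate(f"Solve step by step:\n{question}")
        if final_answer(trace) != answer:
            continue  # answer-based filtering
        verdict = model.generate(f"Question:\n{question}\nSolution:\n{trace}\n"
                                 "Is every step valid? Answer yes or no.")
        if verdict.strip().lower().startswith("yes"):  # self-verification
            kept.append(trace)
    return kept
```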
- 4.3.4 Long Context
- During the final pre-training stage, we extend the context length of Llama 3 from 8K tokens to 128K tokens
- (see Section 3.4 for more details). Similar to pre-training, we find that during finetuning we must carefully
- tune the recipe to balance short and long-context capabilities.
- SFT and synthetic data generation. Naively applying our existing SFT recipe with only short-context data
- resulted in significant regressions in long-context capabilities from pre-training, highlighting the need to
- incorporate long-context data in our SFT data mix. In practice, however, it is largely impractical to get humans
- to annotate such examples due to the tedious and time-consuming nature of reading lengthy contexts, so we
- predominantly rely on synthetic data to fill this gap. We use earlier versions of Llama 3 to generate synthetic
- data for the key long-context use-cases: (possibly multi-turn) question answering, summarization of
- long documents, and reasoning over code repositories. We describe these in greater detail below.
- • Question answering: We carefully curate a set of long documents from our pre-training mix. We split
- these documents into chunks of 8K tokens and prompt an earlier version of the Llama 3 model to
- generate QA pairs conditioned on randomly selected chunks; a minimal sketch of this step appears after
- this list. During training, the whole document is used as context.
- • Summarization: We apply hierarchical summarization of long-context documents by first summarizing
- chunks of 8K input length using our strongest Llama 3 8K-context model and then summarizing
- the summaries. During training, we provide the full document and prompt the model to summarize the
- document while preserving all the important details. We also generate QA pairs based on the summaries
- of the documents and prompt the model with questions that require global understanding of the whole
- long document.
- • Long context code reasoning: We parse Python files to identify import statements and determine their
- dependencies. From here, we select the most commonly depended-upon files, specifically those referenced
- by at least five other files. We remove one of these key files from a repository and prompt the model to
- identify which files depended on the missing file and to generate the necessary missing code.
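- As an illustration of the question-answering recipe in the first bullet, the sketch below chunks a document into 8K-token pieces, asks an earlier model for QA pairs grounded in random chunks, and pairs each result with the full document as training context. The whitespace tokenizer and `model.generate` are hypothetical stand-ins.

```python
import random

def tokenize(text: str) -> list[str]:
    return text.split()  # whitespace stand-in for a real tokenizer

def detokenize(tokens: list[str]) -> str:
    return " ".join(tokens)

def long_context_qa(document: str, model, chunk_tokens: int = 8192,
                    n_pairs: int = 4) -> list[dict]:
    tokens = tokenize(document)
    chunks = [tokens[i:i + chunk_tokens] for i in range(0, len(tokens), chunk_tokens)]
    examples = []
    for chunk in random.sample(chunks, min(n_pairs, len(chunks))):
        qa = model.generate("Write a question and its answer grounded in this passage:\n"
                            + detokenize(chunk))
        # The training example uses the *whole* document as context, not the chunk.
        examples.append({"context": document, "dialog": qa})
    return examples
```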
- We further categorize these synthetically generated samples based on the sequence length (16K, 32K, 64K
- and 128K) to enable more fine-grained targeting of input lengths.
- Through careful ablations, we observe that mixing 0.1% of synthetically generated long-context data with the
- original short-context data optimizes the performance across both short-context and long-context benchmarks.
- DPO. We observe that using only short-context training data in DPO did not negatively impact long-context
- performance, as long as the SFT model is of high quality on long-context tasks. We suspect this is due to the
- fact that our DPO recipe has fewer optimizer steps than SFT. Given this finding, we keep the standard
- short-context recipe for DPO on top of our long-context SFT checkpoints.
- 4.3.5 Tool Use
- Teaching LLMs to use tools such as search engines or code interpreters hugely expands the range of tasks
- they can solve, transforming them from pure chat models into more general assistants (Nakano et al., 2021;
- Thoppilan et al., 2022; Parisi et al., 2022; Gao et al., 2023; Mialon et al., 2023a; Schick et al., 2024). We train
- Llama 3 to interact with the following tools:
- • Search engine. Llama 3 is trained to use Brave Search7 to answer questions about recent events that go
- beyond its knowledge cutoff or that require retrieving a particular piece of information from the web.
- • Python interpreter. Llama 3 can generate and execute code to perform complex computations, read files
- uploaded by the user, and solve tasks based on them such as question answering, summarization, data
- analysis, or visualization.
- 7https://brave.com/search/api/
- • Mathematical computational engine. Llama 3 can use the Wolfram Alpha API8 to solve math and science
- problems more accurately, or to retrieve accurate information from Wolfram’s database.
- The resulting model is able to use these tools in a chat setup to solve the user’s queries, including in multi-turn
- dialogs. If a query requires multiple tool calls, the model can write a step-by-step plan, call the tools in
- sequence, and do reasoning after each tool call.
- We also improve Llama 3’s zero-shot tool use capabilities — given in-context, potentially unseen tool definitions
- and a user query, we train the model to generate the correct tool call.
- Implementation. We implement our core tools as Python objects with different methods. Zero-shot tools can
- be implemented as Python functions with descriptions and documentation (i.e., examples for how to use them),
- and the model only needs the function’s signature and docstring as context to generate the appropriate call.
- We also convert function definitions and calls to JSON format, e.g., for web API calls. All tool calls are
- executed by the Python interpreter, which must be enabled in the Llama 3 system prompt. Core tools can be
- individually enabled or disabled in the system prompt.
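- As a minimal illustration of this zero-shot tool format, the sketch below renders a Python function's signature and docstring as a JSON tool definition. The example function and the schema layout are hypothetical, not the exact format used for Llama 3.

```python
import inspect
import json

def wolfram_query(expression: str, units: str = "metric") -> str:
    """Evaluate a math or science expression.

    Example: wolfram_query("integrate x^2 dx")
    """
    raise NotImplementedError  # a real tool would call an external API

def tool_definition(fn) -> str:
    """Render a function's signature and docstring as a JSON tool definition
    of the kind placed in the model's context."""
    params = {}
    for name, p in inspect.signature(fn).parameters.items():
        params[name] = {
            "type": (p.annotation.__name__
                     if p.annotation is not inspect.Parameter.empty else "any"),
            "required": p.default is inspect.Parameter.empty,
        }
    return json.dumps({"name": fn.__name__,
                       "description": inspect.getdoc(fn),
                       "parameters": params}, indent=2)

print(tool_definition(wolfram_query))
```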
- Data collection. Different from Schick et al. (2024), we rely on human annotations and preferences to teach
- Llama 3 to use tools. There are two main differences with the post-training pipeline generally used in Llama 3:
- • For tools, dialogs often contain more than a single assistant message (e.g., calling the tool and reasoning
- about the tool output). Thus, we annotate at the message level to collect granular feedback: annotators
- provide a preference between two assistant messages with the same context or, if both contain major
- problems, edit one of the messages. The chosen or edited message is then added to the context and the
- dialog continues. This provides human feedback on both the assistant’s ability to call the tools and to
- reason about the tool outputs. Annotators cannot rank or edit the tool outputs.
- • We do not perform rejection sampling, as we did not observe gains in our tool benchmarks.
- To accelerate the annotation process, we start by bootstrapping basic tool use capabilities by finetuning on
- synthetically generated data from previous Llama 3 checkpoints. Thus, annotators have fewer edits to perform.
- In a similar spirit, as Llama 3 gradually improves through its development, we progressively complexify our
- human annotation protocols: we start with single-turn tool use annotations, before moving to tool use in dialogs,
- and finally annotating for multi-step tool use and data analysis.
- Tool datasets. To create data for tool usage applications, we leverage the following procedure:
- • Single-step tool use: We start with few-shot generation of synthetic user prompts which, by construction,
- require a call to one of our core tools (for example, questions that exceed our knowledge cutoff date).
- Then, still relying on few-shot generation, we generate appropriate tool calls for these prompts, execute
- them, and add the output to the model’s context. Finally, we prompt the model again to generate a
- final answer to the user’s query based on the tool output. We end up with trajectories of the following
- form: system prompt, user prompt, tool call, tool output, final answer. We also filter out around 30% of this
- dataset to remove tool calls that cannot be executed or other formatting issues.
- • Multi-step tool use: We follow a similar protocol and first generate synthetic data to teach the model
- basic multi-step tool use capabilities. To do this, we first prompt Llama 3 to generate user prompts
- that require at least two tool calls, which can be the same or different tools from our core set. Then,
- conditioned on these prompts, we few-shot prompt Llama 3 to generate a solution consisting of interleaved
- reasoning steps and tool calls, similar to ReAct (Yao et al., 2022). See Figure 10 for an example of
- Llama 3 performing a task involving multi-step tool usage.
- • File uploads: We annotate for the following filetypes: .txt, .docx, .pdf, .pptx, .xlsx, .csv, .tsv,
- .py, .json, .jsonl, .html, .xml. Our prompts are based on a provided file and ask the model to summarize
- the contents of the file, find and fix bugs, optimize a piece of code, or perform data analysis or visualization.
- See Figure 11 for an example of Llama 3 performing a task involving a file upload.
- After finetuning on this synthetic data, we gather human annotations in diverse and challenging scenarios
- including multi-turn interactions, more than three step tool use, and instances where a tool call does not yield
- 8https://products.wolframalpha.com/llm-api/documentation
- Figure 10 Multi-step tool usage. Example of Llama 3 performing multi-step planning, reasoning, and tool calling to
- solve a task.
- a satisfying answer. We augment our synthetic data with different system prompts to teach the model to use
- tools only when activated. To train the model to avoid calling tools for simple queries, we also add queries
- from easy math or question answering datasets (Berant et al., 2013; Koncel-Kedziorski et al., 2016; Joshi
- et al., 2017; Amini et al., 2019) and their responses without tools, but with tools activated in system prompt.
- Zero-shot tool use data. We improve Llama 3’s zero-shot tool use abilities (also referred to as function calling)
- by finetuning on a large and diverse set of partly synthetic (function definition, user query, corresponding
- call) tuples. We evaluate our model on a set of unseen tools.
- • Single, nested, and parallel function calling: Calls can be simple, nested (i.e., we pass a function call as an
- argument of another function), or parallel (i.e., the model returns a list of independent function calls).
- Generating a diverse set of functions, queries, and ground truths can be challenging (Mekala et al., 2024),
- and we resort to mining the Stack (Kocetkov et al., 2022) to ground our synthetic user queries in real
- functions. More precisely, we extract function calls and their definitions, clean and filter them (e.g., for
- missing docstrings or non-executable functions), and use Llama 3 to generate a natural language query
- corresponding to the function call; a minimal sketch of this mining step follows this list.
- • Multi-turn function calling: We also generate synthetic data for multi-turn dialogs with function calls,
- following a protocol similar to the one proposed in Li et al. (2023b). We use multiple agents that
- generate domains, APIs, user queries, API calls, and responses, while also ensuring that the generated
- data covers a set of diverse domains and realistic APIs. All agents are variants of Llama 3 prompted in
- different ways depending on their roles and collaborate in a step-by-step manner.
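- The Stack-mining step in the first bullet can be sketched with Python's ast module: extract function definitions, drop those without docstrings, and hand the survivors to Llama 3 to write a matching natural-language query (that final prompting step is elided here, and further filters such as executability are applied downstream).

```python
import ast

def mine_functions(source: str) -> list[dict]:
    """Extract candidate function definitions from one source file,
    filtering out functions with missing docstrings."""
    out = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            doc = ast.get_docstring(node)
            if doc:  # drop functions with missing docstrings
                out.append({"name": node.name,
                            "args": [a.arg for a in node.args.args],
                            "docstring": doc})
    return out
```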
- 4.3.6 Factuality
- Hallucinations remain a major challenge for large language models. Models tend to be overconfident, even in
- domains where they have little knowledge. Despite these shortcomings, they are often used as knowledge bases,
- which can lead to risky outcomes such as the spread of misinformation. While we recognize that factuality
- can go beyond hallucinations, we took a hallucination-first approach here.
- Figure 11 Processing file uploads. Example of Llama 3 performing analysis and visualization of an uploaded file.
- We follow the principle that post-training should align the model to “know what it knows” rather than add
- knowledge (Gekhman et al., 2024; Mielke et al., 2020). Our primary approach involves generating data that
- aligns model generations with subsets of factual data present in the pre-training data. To achieve this, we
- develop a knowledge probing technique that takes advantage of Llama 3’s in-context abilities. This data
- generation process involves the following procedure:
- 1. Extract a data snippet from the pre-training data.
- 2. Generate a factual question about this snippet (context) by prompting Llama 3.
- 3. Sample responses from Llama 3 to the question.
- 4. Score the correctness of the generations using the original context as a reference and Llama 3 as a judge.
- 5. Score the informativeness of the generations using Llama 3 as a judge.
- 6. Generate a refusal for responses which are consistently informative and incorrect across the generations,
- using Llama 3.
- We use data generated from the knowledge probe to encourage the model to only answer questions which it
- has knowledge about, and to refuse answering those questions it is unsure about; a minimal sketch of one pass of this pipeline follows.
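- In the sketch below, `model.generate` and `model.judge` are hypothetical interfaces (`judge` returns a boolean verdict given a rubric, a question, a response, and an optional reference); the refusal rule implements step 6 above, and the handling of the remaining cases is an assumption of this sketch.

```python
def knowledge_probe(snippet: str, model, n_samples: int = 8):
    """One pass of the factuality data-generation procedure described above."""
    question = model.generate(f"Write a factual question answerable from:\n{snippet}")
    responses = [model.generate(question) for _ in range(n_samples)]
    correct = [model.judge("Is the response correct given the reference?",
                           question, r, reference=snippet) for r in responses]
    informative = [model.judge("Is the response informative (not evasive)?",
                               question, r) for r in responses]
    if all(informative) and not any(correct):
        # Consistently informative but incorrect: train the model to refuse.
        refusal = model.generate(f"Write a brief refusal for:\n{question}")
        return {"prompt": question, "target": refusal}
    # Otherwise, keep a correct response as the positive target (assumed).
    good = [r for r, c in zip(responses, correct) if c]
    return {"prompt": question, "target": good[0]} if good else None
```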
|