
  1. The Llama 3 Herd of Models
Llama Team, AI @ Meta¹
¹ A detailed contributor list can be found in the appendix of this paper.
  4. Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a
  5. new set of foundation models, called Llama 3. It is a herd of language models that natively support
  6. multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with
  7. 405B parameters and a context window of up to 128K tokens. This paper presents an extensive
  8. empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language
  9. models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and
  10. post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input
  11. and output safety. The paper also presents the results of experiments in which we integrate image,
  12. video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach
  13. performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The
  14. resulting models are not yet being broadly released as they are still under development.
Date: July 23, 2024
  16. Website: https://llama.meta.com/
  17. 1 Introduction
  18. Foundation models are general models of language, vision, speech, and/or other modalities that are designed
  19. to support a large variety of AI tasks. They form the basis of many modern AI systems.
The development of modern foundation models consists of two main stages: (1) a pre-training stage in which
the model is trained at massive scale using straightforward tasks such as next-word prediction or captioning,
and (2) a post-training stage in which the model is tuned to follow instructions, align with human preferences,
and improve specific capabilities (for example, coding and reasoning).
  24. In this paper, we present a new set of foundation models for language, called Llama 3. The Llama 3 Herd
of models natively supports multilinguality, coding, reasoning, and tool usage. Our largest model is a dense
  26. Transformer with 405B parameters, processing information in a context window of up to 128K tokens. Each
  27. member of the herd is listed in Table 1. All the results presented in this paper are for the Llama 3.1 models,
  28. which we will refer to as Llama 3 throughout for brevity.
  29. We believe there are three key levers in the development of high-quality foundation models: data, scale, and
  30. managing complexity. We seek to optimize for these three levers in our development process:
•Data. Compared to prior versions of Llama (Touvron et al., 2023a,b), we improved both the quantity and
quality of the data we use for pre-training and post-training. These improvements include the development
of more careful pre-processing and curation pipelines for pre-training data and the development of more
rigorous quality assurance and filtering approaches for post-training data. We pre-train Llama 3 on a
corpus of about 15T multilingual tokens, compared to 1.8T tokens for Llama 2.
•Scale. We train a model at far larger scale than previous Llama models: our flagship language model was
pre-trained using 3.8 × 10^25 FLOPs, almost 50× more than the largest version of Llama 2. Specifically,
we pre-trained a flagship model with 405B trainable parameters on 15.6T text tokens.
Model                     Finetuned  Multilingual  Long context  Tool use  Release
Llama 3 8B                ✗          ✗¹            ✗             ✗         April 2024
Llama 3 8B Instruct       ✓          ✗             ✗             ✗         April 2024
Llama 3 70B               ✗          ✗¹            ✗             ✗         April 2024
Llama 3 70B Instruct      ✓          ✗             ✗             ✗         April 2024
Llama 3.1 8B              ✗          ✓             ✓             ✗         July 2024
Llama 3.1 8B Instruct     ✓          ✓             ✓             ✓         July 2024
Llama 3.1 70B             ✗          ✓             ✓             ✗         July 2024
Llama 3.1 70B Instruct    ✓          ✓             ✓             ✓         July 2024
Llama 3.1 405B            ✗          ✓             ✓             ✗         July 2024
Llama 3.1 405B Instruct   ✓          ✓             ✓             ✓         July 2024

Table 1  Overview of the Llama 3 Herd of models. All results in this paper are for the Llama 3.1 models.
As expected per scaling laws for foundation models, our flagship model outperforms smaller models trained using the
  53. same procedure. While our scaling laws suggest our flagship model is an approximately compute-optimal
  54. size for our training budget, we also train our smaller models for much longer than is compute-optimal.
  55. The resulting models perform better than compute-optimal models at the same inference budget. We
  56. use the flagship model to further improve the quality of those smaller models during post-training.
  57. •Managing complexity. We make design choices that seek to maximize our ability to scale the model
  58. development process. For example, we opt for a standard dense Transformer model architecture (Vaswani
  59. et al., 2017) with minor adaptations, rather than for a mixture-of-experts model (Shazeer et al., 2017)
  60. to maximize training stability. Similarly, we adopt a relatively simple post-training procedure based
  61. on supervised finetuning (SFT), rejection sampling (RS), and direct preference optimization (DPO;
  62. Rafailov et al. (2023)) as opposed to more complex reinforcement learning algorithms (Ouyang et al.,
  63. 2022; Schulman et al., 2017) that tend to be less stable and harder to scale.
The result of our work is Llama 3: a herd of three multilingual¹ language models with 8B, 70B, and 405B
  65. parameters. We evaluate the performance of Llama 3 on a plethora of benchmark datasets that span a wide
  66. range of language understanding tasks. In addition, we perform extensive human evaluations that compare
  67. Llama 3 with competing models. An overview of the performance of the flagship Llama 3 model on key
  68. benchmarks is presented in Table 2. Our experimental evaluation suggests that our flagship model performs
  69. on par with leading language models such as GPT-4 (OpenAI, 2023a) across a variety of tasks, and is close to
  70. matching the state-of-the-art. Our smaller models are best-in-class, outperforming alternative models with
  71. similar numbers of parameters (Bai et al., 2023; Jiang et al., 2023). Llama 3 also delivers a much better
  72. balance between helpfulness and harmlessness than its predecessor (Touvron et al., 2023b). We present a
  73. detailed analysis of the safety of Llama 3 in Section 5.4.
  74. We are publicly releasing all three Llama 3 models under an updated version of the Llama 3 Community License;
see https://llama.meta.com. This includes pre-trained and post-trained versions of our 405B parameter
  76. language model and a new version of our Llama Guard model (Inan et al., 2023) for input and output safety.
  77. We hope that the open release of a flagship model will spur a wave of innovation in the research community,
  78. and accelerate a responsible path towards the development of artificial general intelligence (AGI).
  79. As part of the Llama 3 development process we also develop multimodal extensions to the models, enabling
  80. image recognition, video recognition, and speech understanding capabilities. These models are still under
  81. active development and not yet ready for release. In addition to our language modeling results, the paper
  82. presents results of our initial experiments with those multimodal models.
¹ The Llama 3 8B and 70B were pre-trained on multilingual data but were intended for use in English at the time.
Category      Benchmark                 Llama 3 8B  Gemma 2 9B  Mistral 7B  Llama 3 70B  Mixtral 8x22B  GPT 3.5 Turbo  Llama 3 405B  Nemotron 4 340B  GPT-4 (0125)  GPT-4o  Claude 3.5 Sonnet
General       MMLU (5-shot)             69.4        72.3        61.1        83.6         76.9           70.7           87.3          82.6             85.1          89.1    89.9
General       MMLU (0-shot, CoT)        73.0        72.3△       60.5        86.0         79.9           69.8           88.6          78.7◁            85.4          88.7    88.3
General       MMLU-Pro (5-shot, CoT)    48.3        –           36.9        66.4         56.3           49.2           73.3          62.7             64.8          74.0    77.0
General       IFEval                    80.4        73.6        57.6        87.5         72.7           69.9           88.6          85.1             84.3          85.6    88.0
Code          HumanEval (0-shot)        72.6        54.3        40.2        80.5         75.6           68.0           89.0          73.2             86.6          90.2    92.0
Code          MBPP EvalPlus (0-shot)    72.8        71.7        49.5        86.0         78.6           82.0           88.6          72.8             83.6          87.8    90.5
Math          GSM8K (8-shot, CoT)       84.5        76.7        53.2        95.1         88.2           81.6           96.8          92.3♢            94.2          96.1    96.4♢
Math          MATH (0-shot, CoT)        51.9        44.3        13.0        68.0         54.1           43.1           73.8          41.1             64.5          76.6    71.1
Reasoning     ARC Challenge (0-shot)    83.4        87.6        74.2        94.8         88.7           83.7           96.9          94.6             96.4          96.7    96.7
Reasoning     GPQA (0-shot, CoT)        32.8        –           28.8        46.7         33.3           30.8           51.1          –                41.4          53.6    59.4
Tool use      BFCL                      76.1        –           60.4        84.8         –              85.9           88.5          86.5             88.3          80.5    90.2
Tool use      Nexus                     38.5        30.0        24.7        56.7         48.5           37.2           58.7          –                50.3          56.1    45.7
Long context  ZeroSCROLLS/QuALITY       81.0        –           –           90.5         –              –              95.2          –                95.2          90.5    90.5
Long context  InfiniteBench/En.MC       65.1        –           –           78.2         –              –              83.4          –                72.1          82.5    –
Long context  NIH/Multi-needle          98.8        –           –           97.5         –              –              98.1          –                100.0         100.0   90.8
Multilingual  MGSM (0-shot, CoT)        68.9        53.2        29.9        86.9         71.1           51.4           91.6          –                85.9          90.5    91.6

Table 2  Performance of finetuned Llama 3 models on key benchmark evaluations. The table compares the performance of
the 8B, 70B, and 405B versions of Llama 3 with that of competing models. We boldface the best-performing model in
each of three model-size equivalence classes. △ Results obtained using 5-shot prompting (no CoT). ◁ Results obtained
without CoT. ♢ Results obtained using zero-shot prompting.
  117. 2 General Overview
  118. The model architecture of Llama 3 is illustrated in Figure 1. The development of our Llama 3 language
  119. models comprises two main stages:
  120. •Language model pre-training. We start by converting a large, multilingual text corpus to discrete tokens
  121. and pre-training a large language model (LLM) on the resulting data to perform next-token prediction.
  122. In the language model pre-training stage, the model learns the structure of language and obtains large
  123. amounts of knowledge about the world from the text it is “reading”. To do this effectively, pre-training
  124. is performed at massive scale: we pre-train a model with 405B parameters on 15.6T tokens using a
  125. context window of 8K tokens. This standard pre-training stage is followed by a continued pre-training
  126. stage that increases the supported context window to 128K tokens. See Section 3 for details.
  127. •Language model post-training. The pre-trained language model has a rich understanding of language
  128. but it does not yet follow instructions or behave in the way we would expect an assistant to. We
  129. align the model with human feedback in several rounds, each of which involves supervised finetuning
  130. (SFT) on instruction tuning data and Direct Preference Optimization (DPO; Rafailov et al., 2024).
At this post-training² stage, we also integrate new capabilities, such as tool-use, and observe strong
  132. improvements in other areas, such as coding and reasoning. See Section 4 for details. Finally, safety
  133. mitigations are also incorporated into the model at the post-training stage, the details of which are
  134. described in Section 5.4.
  135. The resulting models have a rich set of capabilities. They can answer questions in at least eight languages,
  136. write high-quality code, solve complex reasoning problems, and use tools out-of-the-box or in a zero-shot way.
  137. We also perform experiments in which we add image, video, and speech capabilities to Llama 3 using a
  138. compositional approach. The approach we study comprises the three additional stages illustrated in Figure 28:
  139. •Multi-modal encoder pre-training. We train separate encoders for images and speech. We train our
  140. image encoder on large amounts of image-text pairs. This teaches the model the relation between visual
  141. content and the description of that content in natural language. Our speech encoder is trained using a
² In this paper, we use the term “post-training” to refer to any model training that happens outside of pre-training.
  144. Figure 1 Illustration of the overall architecture and training of Llama 3. Llama 3 is a Transformer language model trained to
  145. predict the next token of a textual sequence. See text for details.
  146. self-supervised approach that masks out parts of the speech inputs and tries to reconstruct the masked
  147. out parts via a discrete-token representation. As a result, the model learns the structure of speech
  148. signals. See Section 7 for details on the image encoder and Section 8 for details on the speech encoder.
  149. •Vision adapter training. We train an adapter that integrates the pre-trained image encoder into the
  150. pre-trained language model. The adapter consists of a series of cross-attention layers that feed image-
  151. encoder representations into the language model. The adapter is trained on text-image pairs. This
  152. aligns the image representations with the language representations. During adapter training, we also
  153. update the parameters of the image encoder but we intentionally do not update the language-model
  154. parameters. We also train a video adapter on top of the image adapter on paired video-text data. This
  155. enables the model to aggregate information across frames. See Section 7 for details.
  156. •Speech adapter training. Finally, we integrate the speech encoder into the model via an adapter that
  157. converts speech encodings into token representations that can be fed directly into the finetuned language
  158. model. The parameters of the adapter and encoder are jointly updated in a supervised finetuning stage
  159. to enable high-quality speech understanding. We do not change the language model during speech
  160. adapter training. We also integrate a text-to-speech system. See Section 8 for details.
  161. Our multimodal experiments lead to models that can recognize the content of images and videos, and support
  162. interaction via a speech interface. These models are still under development and not yet ready for release.
  163. 3 Pre-Training
Language model pre-training involves: (1) the curation and filtering of a large-scale training corpus, (2) the
development of a model architecture and corresponding scaling laws for determining model size, (3) the
development of techniques for efficient pre-training at large scale, and (4) the development of a pre-training
recipe. We present each of these components separately below.
  168. 3.1 Pre-Training Data
  169. We create our dataset for language model pre-training from a variety of data sources containing knowledge
  170. until the end of 2023. We apply several de-duplication methods and data cleaning mechanisms on each data
  171. source to obtain high-quality tokens. We remove domains that contain large amounts of personally identifiable
  172. information (PII), and domains with known adult content.
  173. 3.1.1 Web Data Curation
  174. Much of the data we utilize is obtained from the web and we describe our cleaning process below.
PII and safety filtering. Among other mitigations, we implement filters designed to remove data from websites
that are likely to contain unsafe content or high volumes of PII, domains that have been ranked as harmful
  177. according to a variety of Meta safety standards, and domains that are known to contain adult content.
  179. Text extraction and cleaning. We process the raw HTML content for non-truncated web documents to extract
  180. high-quality diverse text. To do so, we build a custom parser that extracts the HTML content and optimizes
  181. for precision in boilerplate removal and content recall. We evaluate our parser’s quality in human evaluations,
  182. comparing it with popular third-party HTML parsers that optimize for article-like content, and found it
  183. to perform favorably. We carefully process HTML pages with mathematics and code content to preserve
the structure of that content. We maintain the image alt attribute text since mathematical content is often
represented as pre-rendered images where the math is also provided in the alt attribute. We experimentally
  186. evaluate different cleaning configurations. We find markdown is harmful to the performance of a model that
  187. is primarily trained on web data compared to plain text, so we remove all markdown markers.
  188. De-duplication. We apply several rounds of de-duplication at the URL, document, and line level:
  189. •URL-level de-duplication. We perform URL-level de-duplication across the entire dataset. We keep the
  190. most recent version for pages corresponding to each URL.
  191. •Document-level de-duplication. We perform global MinHash (Broder, 1997) de-duplication across the
  192. entire dataset to remove near duplicate documents.
  193. •Line-level de-duplication. We perform aggressive line-level de-duplication similar to ccNet(Wenzek
  194. et al., 2019). We remove lines that appeared more than 6 times in each bucket of 30M documents.
Although our manual qualitative analysis showed that line-level de-duplication removes not only
leftover boilerplate from various websites (such as navigation menus and cookie warnings) but also frequent
high-quality text, our empirical evaluations showed strong improvements.
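To make the line-level step concrete, here is a minimal sketch of the counting logic, assuming documents arrive as plain strings and using the 6-occurrence threshold quoted above; the real pipeline operates over buckets of roughly 30M documents, which is not reproduced here.

```python
from collections import Counter

def dedup_lines(documents, max_count=6):
    """Drop lines that occur more than `max_count` times within a bucket of documents."""
    # First pass: count how often each line appears across the bucket.
    line_counts = Counter()
    for doc in documents:
        line_counts.update(doc.splitlines())

    # Second pass: rebuild each document, keeping only lines under the threshold.
    cleaned = []
    for doc in documents:
        kept = [line for line in doc.splitlines() if line_counts[line] <= max_count]
        cleaned.append("\n".join(kept))
    return cleaned
```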
  198. Heuristic filtering. We develop heuristics to remove additional low-quality documents, outliers, and documents
  199. with excessive repetitions. Some examples of heuristics include:
•We use duplicated n-gram coverage ratio (Rae et al., 2021) to remove lines that consist of repeated
content such as logging or error messages. Such lines can be very long and unique, and hence cannot be
filtered out by line-level de-duplication.
  203. •We use “dirty word” counting (Raffel et al., 2020) to filter out adult websites that are not covered by
  204. domain block lists.
  205. •We use a token-distribution Kullback-Leibler divergence to filter out documents containing excessive
  206. numbers of outlier tokens compared to the training corpus distribution.
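As an illustration of the last heuristic, the sketch below scores a document by the Kullback-Leibler divergence between its token histogram and a reference corpus distribution; the whitespace tokenization, smoothing constant, and threshold are stand-in choices, not the production settings.

```python
import math
from collections import Counter

def passes_kl_filter(doc_tokens, corpus_freqs, threshold=5.0, eps=1e-9):
    """Return True if the document's token distribution stays close to the corpus distribution."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    kl = 0.0
    for token, count in counts.items():
        p = count / total                 # empirical document distribution
        q = corpus_freqs.get(token, eps)  # corpus distribution, smoothed for unseen tokens
        kl += p * math.log(p / q)
    return kl <= threshold

# Documents dominated by outlier tokens (e.g. hex dumps) accumulate a large KL and are dropped.
corpus = {"the": 0.05, "model": 0.01, "uses": 0.005, "data": 0.01}
print(passes_kl_filter("the model uses the data".split(), corpus))          # True
print(passes_kl_filter("0xdeadbeef 0xcafebabe 0x1234".split(), corpus))     # False
```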
  207. Model-based quality filtering. Further, we experiment with applying various model-based quality classifiers
  208. to sub-select high-quality tokens. These include using fast classifiers such as fasttext (Joulin et al., 2017)
  209. trained to recognize if a given text would be referenced by Wikipedia (Touvron et al., 2023a), as well as more
  210. compute-intensive Roberta-based classifiers (Liu et al., 2019a) trained on Llama 2 predictions. To train a
  211. quality classifier based on Llama 2, we create a training set of cleaned web documents, describe the quality
requirements, and instruct Llama 2’s chat model to determine whether a document meets these requirements. For
efficiency reasons, we use DistilRoberta (Sanh et al., 2019) to generate quality scores for each document. We
  214. experimentally evaluate the efficacy of various quality filtering configurations.
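A minimal sketch of how such a model-based filter might be applied at inference time, assuming a fasttext classifier has already been trained; the model path, label name, and threshold below are hypothetical placeholders rather than the classifiers described above.

```python
import fasttext

# Hypothetical artifact: a fasttext model trained to predict whether a page looks like
# text that Wikipedia would reference.
model = fasttext.load_model("quality_classifier.bin")

def keep_document(text, threshold=0.9):
    """Keep a document if the classifier's high-quality probability exceeds the threshold."""
    # fasttext expects a single line of text; collapse newlines before predicting.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__high_quality" and probs[0] >= threshold

corpus = ["Document one ...", "Document two ..."]
filtered = [doc for doc in corpus if keep_document(doc)]
```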
  215. Code and reasoning data. Similar to DeepSeek-AI et al. (2024), we build domain-specific pipelines that extract
  216. code and math-relevant web pages. Specifically, both the code and reasoning classifiers are DistilRoberta
  217. models trained on web data annotated by Llama 2. Unlike the general quality classifier mentioned above, we
  218. conduct prompt tuning to target web pages containing math deduction, reasoning in STEM areas and code
  219. interleaved with natural language. Since the token distribution of code and math is substantially different
  220. than that of natural language, these pipelines implement domain-specific HTML extraction, customized text
  221. features and heuristics for filtering.
  222. Multilingual data. Similar to our processing pipelines for English described above, we implement filters to
  223. remove data from websites that are likely to contain PII or unsafe content. Our multilingual text processing
  224. pipeline has several unique features:
  225. •We use a fasttext-based language identification model to categorize documents into 176 languages.
  226. •We perform document-level and line-level de-duplication within data for each language.
  228. •We apply language-specific heuristics and model-based filters to remove low-quality documents.
  229. In addition, we perform quality ranking of multilingual documents using a multilingual Llama 2-based classifier
  230. to ensure that high-quality content is prioritized. We determine the amount of multilingual tokens used in
  231. pre-training experimentally, balancing model performance on English and multilingual benchmarks.
  232. 3.1.2 Determining the Data Mix
  233. To obtain a high-quality language model, it is essential to carefully determine the proportion of different data
  234. sources in the pre-training data mix. Our main tools in determining this data mix are knowledge classification
  235. and scaling law experiments.
  236. Knowledge classification. We develop a classifier to categorize the types of information contained in our web
  237. data to more effectively determine a data mix. We use this classifier to downsample data categories that are
  238. over-represented on the web, for example, arts and entertainment.
  239. Scaling laws for data mix. To determine the best data mix, we perform scaling law experiments in which we
  240. train several small models on a data mix and use that to predict the performance of a large model on that mix
  241. (see Section 3.2.1). We repeat this process multiple times for different data mixes to select a new data mix
  242. candidate. Subsequently, we train a larger model on this candidate data mix and evaluate the performance of
  243. that model on several key benchmarks.
Data mix summary. Our final data mix contains roughly 50% tokens corresponding to general knowledge,
25% mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens.
  246. 3.1.3 Annealing Data
  247. Empirically, we find that annealing (see Section 3.4.3) on small amounts of high-quality code and mathematical
  248. data can boost the performance of pre-trained models on key benchmarks. Akin to Li et al. (2024b), we
  249. perform annealing with a data mix that upsamples high-quality data in select domains. We do not include
  250. any training sets from commonly used benchmarks in our annealing data. This enables us to assess the true
  251. few-shot learning capabilities and out-of-domain generalization of Llama 3.
Following OpenAI (2023a), we evaluate the efficacy of annealing on the GSM8k (Cobbe et al., 2021) and
MATH (Hendrycks et al., 2021b) training sets. We find that annealing improved the performance
  254. of a pre-trained Llama 3 8B model on the GSM8k and MATH validation sets by 24.0% and 6.4%, respectively.
  255. However, the improvements on the 405B model are negligible, suggesting that our flagship model has strong
  256. in-context learning and reasoning capabilities and does not require specific in-domain training samples to
  257. obtain strong performance.
  258. Using annealing to assess data quality. Similar to Blakeney et al. (2024), we find that annealing enables us to
  259. judge the value of small domain-specific datasets. We measure the value of such datasets by annealing the
  260. learning rate of a 50% trained Llama 3 8B model linearly to 0 on 40B tokens. In those experiments, we assign
  261. 30% weight to the new dataset and the remaining 70% weight to the default data mix. Using annealing to
  262. evaluate new data sources is more efficient than performing scaling law experiments for every small dataset.
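A small sketch of the anneal used for these data-quality probes, assuming the learning rate decays linearly to zero over the 40B-token run and the candidate dataset is sampled with 30% probability; the starting learning rate of the half-trained checkpoint is not stated, so it is a placeholder.

```python
import random

def anneal_lr(tokens_seen, anneal_tokens=40e9, start_lr=1e-4):
    """Linearly anneal the learning rate to 0 over `anneal_tokens` tokens (start_lr is a placeholder)."""
    frac = min(tokens_seen / anneal_tokens, 1.0)
    return start_lr * (1.0 - frac)

def sample_source(rng, new_dataset_weight=0.30):
    """Sample a data source according to the 30%/70% mix used during the anneal."""
    return "candidate_dataset" if rng.random() < new_dataset_weight else "default_mix"

rng = random.Random(0)
print(anneal_lr(20e9))     # halfway through the anneal: half the starting LR
print(sample_source(rng))  # 'candidate_dataset' or 'default_mix'
```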
  263. 3.2 Model Architecture
  264. Llama 3 uses a standard, dense Transformer architecture (Vaswani et al., 2017). It does not deviate significantly
  265. from Llama and Llama 2 (Touvron et al., 2023a,b) in terms of model architecture; our performance gains are
  266. primarily driven by improvements in data quality and diversity as well as by increased training scale.
  267. We make a few small modifications compared to Llama 2:
  268. •We use grouped query attention (GQA; Ainslie et al. (2023)) with 8 key-value heads to improve inference
  269. speed and to reduce the size of key-value caches during decoding.
  270. •We use an attention mask that prevents self-attention between different documents within the same
sequence. We find that this change had limited impact during standard pre-training, but find it to be
  272. important in continued pre-training on very long sequences.
                       8B           70B           405B
Layers                 32           80            126
Model Dimension        4,096        8,192         16,384
FFN Dimension          14,336       28,672        53,248
Attention Heads        32           64            128
Key/Value Heads        8            8             8
Peak Learning Rate     3 × 10^-4    1.5 × 10^-4   8 × 10^-5
Activation Function    SwiGLU
Vocabulary Size        128,000
Positional Embeddings  RoPE (θ = 500,000)

Table 3  Overview of the key hyperparameters of Llama 3. We display settings for 8B, 70B, and 405B language models.
•We use a vocabulary with 128K tokens. Our token vocabulary combines 100K tokens from the tiktoken³
  286. tokenizer with 28K additional tokens to better support non-English languages. Compared to the Llama
  287. 2 tokenizer, our new tokenizer improves compression rates on a sample of English data from 3.17 to
  288. 3.94 characters per token. This enables the model to “read” more text for the same amount of training
  289. compute. We also found that adding 28K tokens from select non-English languages improved both
  290. compression ratios and downstream performance, with no impact on English tokenization.
  291. •We increase the RoPE base frequency hyperparameter to 500,000. This enables us to better support
  292. longer contexts; Xiong et al. (2023) showed this value to be effective for context lengths up to 32,768.
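For reference, a small sketch of how the base frequency θ enters the RoPE computation; the formula is the standard RoPE inverse-frequency schedule with θ = 500,000 as described above, and the head dimension below is an illustrative choice.

```python
import numpy as np

def rope_inverse_frequencies(head_dim, theta=500_000.0):
    """Standard RoPE inverse frequencies: 1 / theta**(2i / d) for each rotated pair of dimensions."""
    return 1.0 / (theta ** (np.arange(0, head_dim, 2) / head_dim))

def rope_angles(positions, head_dim, theta=500_000.0):
    """Rotation angle for every (position, frequency) pair; a larger theta rotates the
    low-frequency pairs more slowly, which helps distinguish positions in long contexts."""
    return np.outer(positions, rope_inverse_frequencies(head_dim, theta))

# Example: angles for the first four positions of a 128-dimensional attention head.
print(rope_angles(np.arange(4), head_dim=128).shape)  # (4, 64)
```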
  293. Llama 3 405B uses an architecture with 126 layers, a token representation dimension of 16,384, and 128
  294. attention heads; see Table 3 for details. This leads to a model size that is approximately compute-optimal
according to scaling laws on our data for our training budget of 3.8 × 10^25 FLOPs.
  296. 3.2.1 Scaling Laws
  297. We develop scaling laws (Hoffmann et al., 2022; Kaplan et al., 2020) to determine the optimal model size for
  298. our flagship model given our pre-training compute budget. In addition to determining the optimal model size,
  299. a major challenge is to forecast the flagship model’s performance on downstream benchmark tasks, due to a
  300. couple of issues: (1) Existing scaling laws typically predict only next-token prediction loss rather than specific
  301. benchmark performance. (2) Scaling laws can be noisy and unreliable because they are developed based on
  302. pre-training runs conducted with small compute budgets (Wei et al., 2022b).
  303. To address these challenges, we implement a two-stage methodology to develop scaling laws that accurately
  304. predict downstream benchmark performance:
  305. 1.We first establish a correlation between the compute-optimal model’s negative log-likelihood on down-
  306. stream tasks and the training FLOPs.
  307. 2.Next, we correlate the negative log-likelihood on downstream tasks with task accuracy, utilizing both the
  308. scaling law models and older models trained with higher compute FLOPs. In this step, we specifically
  309. leverage the Llama 2 family of models.
  310. This approach enables us to predict downstream task performance given a specific number of training FLOPs
  311. for compute-optimal models. We use a similar method to select our pre-training data mix (see Section 3.4).
Scaling law experiments. Concretely, we construct our scaling laws by pre-training models using compute
budgets between 6 × 10^18 FLOPs and 10^22 FLOPs. At each compute budget, we pre-train models ranging
in size between 40M and 16B parameters, using a subset of model sizes at each compute budget. In these
training runs, we use a cosine learning rate schedule with a linear warmup for 2,000 training steps. The peak
learning rate is set between 2 × 10^-4 and 4 × 10^-4 depending on the size of the model. We set the cosine
decay to 0.1 of the peak value. The weight decay at each step is set to 0.1 times the learning rate at that step.
We use a fixed batch size for each compute scale, ranging between 250K and 4M.
³ https://github.com/openai/tiktoken/tree/main
Figure 2  Scaling law IsoFLOPs curves between 6 × 10^18 and 10^22 FLOPs. The loss is the negative log-likelihood
on a held-out validation set. We approximate measurements at each compute scale using a second-degree polynomial.
(Plot: validation loss versus training tokens, one IsoFLOPs curve per compute budget from 6e18 to 1e22 FLOPs.)

Figure 3  Number of training tokens in identified compute-optimal models as a function of pre-training compute
budget. We include the fitted scaling-law prediction as well (fitted line: α = 0.537, A = 0.299). The compute-optimal
models correspond to the parabola minimums in Figure 2.
  346. These experiments give rise to the IsoFLOPs curves in Figure 2. The loss in these curves is measured on
  347. a separate validation set. We fit the measured loss values using a second-degree polynomial and identify
the minimums of each parabola. We refer to the minimum of each parabola as the compute-optimal model at the
  349. corresponding pre-training compute budget.
We use the compute-optimal models we identified this way to predict the optimal number of training tokens
for a specific compute budget. To do so, we assume a power-law relation between the compute budget, C, and
the optimal number of training tokens, N⋆(C):

    N⋆(C) = A C^α.

We fit A and α using the data from Figure 2. We find that (α, A) = (0.53, 0.29); the corresponding fit is
shown in Figure 3. Extrapolation of the resulting scaling law to 3.8 × 10^25 FLOPs suggests training a 402B
parameter model on 16.55T tokens.
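The fit can be sketched as follows: for each compute budget, fit a quadratic to (log tokens, validation loss) measurements and take its minimum as the compute-optimal token count, then regress those optima against compute in log-log space to obtain A and α. The helpers below are a sketch of that recipe, and the final line simply plugs the figure's fitted constants (α ≈ 0.537, A ≈ 0.299) into N⋆(C) = A·C^α; the exact extrapolated token count depends on the precision of those constants.

```python
import numpy as np

def compute_optimal_tokens(log_tokens, losses):
    """Fit a parabola to (log tokens, loss) for one IsoFLOPs curve and return the token
    count at its minimum (the compute-optimal model for that budget)."""
    a, b, _ = np.polyfit(log_tokens, losses, deg=2)
    return np.exp(-b / (2 * a))  # vertex of the parabola, mapped back to a token count

def fit_token_scaling_law(budgets_flops, optimal_tokens):
    """Fit N*(C) = A * C**alpha by linear regression in log-log space."""
    alpha, log_a = np.polyfit(np.log(budgets_flops), np.log(optimal_tokens), deg=1)
    return np.exp(log_a), alpha

# Plugging the reported constants into the power law and extrapolating to the flagship budget:
A, alpha = 0.299, 0.537
print(A * (3.8e25) ** alpha)  # ~1.6e13 tokens, i.e. roughly 16T
```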
An important observation is that IsoFLOPs curves become flatter around the minimum as the compute
  358. budget increases. This implies that performance of the flagship model is relatively robust to small changes in
  359. the trade-off between model size and training tokens. Based on this observation, we ultimately decided to
  360. train a flagship model with 405B parameters.
Predicting performance on downstream tasks. We use the resulting compute-optimal models to forecast
the performance of the flagship Llama 3 model on benchmark datasets. First, we linearly correlate the
(normalized) negative log-likelihood of the correct answer in the benchmark with the training FLOPs. In this
analysis, we use only the scaling law models trained up to 10^22 FLOPs on the data mix described above. Next,
we establish a sigmoidal relation between the log-likelihood and accuracy using both the scaling law models
and Llama 2 models, which were trained using the Llama 2 data mix and tokenizer. We show the results of
this experiment on the ARC Challenge benchmark in Figure 4. We find this two-step scaling law prediction,
which extrapolates over four orders of magnitude, to be quite accurate: it only slightly underestimates the
final performance of the flagship Llama 3 model.
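A sketch of the two-step fit on made-up numbers, assuming a linear relation between log-FLOPs and normalized NLL and a four-parameter sigmoid between NLL and accuracy; the functional forms follow the description above, but the data points, initial guesses, and parameterization are illustrative only.

```python
import numpy as np
from scipy.optimize import curve_fit

# Step 1: linear relation between log compute and normalized NLL of the correct answer.
def nll_from_compute(log_flops, slope, intercept):
    return slope * log_flops + intercept

# Step 2: sigmoidal relation between NLL and benchmark accuracy.
def accuracy_from_nll(nll, lo, hi, mid, scale):
    return lo + (hi - lo) / (1.0 + np.exp((nll - mid) / scale))

# Synthetic observations standing in for the small scaling-law models.
log_flops = np.log10(np.array([6e18, 1e19, 1e20, 1e21, 1e22]))
nll = np.array([1.40, 1.38, 1.33, 1.28, 1.24])
acc = np.array([0.35, 0.38, 0.52, 0.68, 0.78])

(slope, intercept), _ = curve_fit(nll_from_compute, log_flops, nll)
(lo, hi, mid, scale), _ = curve_fit(accuracy_from_nll, nll, acc,
                                    p0=[0.25, 1.0, 1.3, 0.05])

# Extrapolate both fits to the flagship budget of 3.8e25 FLOPs.
pred_nll = nll_from_compute(np.log10(3.8e25), slope, intercept)
pred_acc = accuracy_from_nll(pred_nll, lo, hi, mid, scale)
print(pred_nll, pred_acc)
```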
  370. 3.3 Infrastructure, Scaling, and Efficiency
  371. We describe our hardware and infrastructure that powered Llama 3 405B pre-training at scale and discuss
several optimizations that lead to improvements in training efficiency.
  373. 3.3.1 Training Infrastructure
  374. The Llama 1 and 2 models were trained on Meta’s AI Research SuperCluster (Lee and Sengupta, 2022). As
we scaled further, the training for Llama 3 was migrated to Meta’s production clusters (Lee et al., 2024). This
Figure 4  Scaling law forecast for ARC Challenge. Left: Normalized negative log-likelihood of the correct answer on the
ARC Challenge benchmark as a function of pre-training FLOPs. Right: ARC Challenge benchmark accuracy as a
function of the normalized negative log-likelihood of the correct answer (showing the scaling law models, the Llama 2
models, the scaling law prediction, and Llama 3 405B). This analysis enables us to predict model performance on the
ARC Challenge benchmark before pre-training commences. See text for details.
  388. setup optimizes for production-grade reliability, which is essential as we scale up training.
  389. Compute. Llama 3 405B is trained on up to 16K H100 GPUs, each running at 700W TDP with 80GB HBM3,
  390. using Meta’s Grand Teton AI server platform (Matt Bowman, 2022). Each server is equipped with eight GPUs
  391. and two CPUs. Within a server, the eight GPUs are connected via NVLink. Training jobs are scheduled
  392. using MAST (Choudhury et al., 2024), Meta’s global-scale training scheduler.
  393. Storage. Tectonic (Pan et al., 2021), Meta’s general-purpose distributed file system, is used to build a storage
  394. fabric (Battey and Gupta, 2024) for Llama 3 pre-training. It offers 240 PB of storage out of 7,500 servers
  395. equipped with SSDs, and supports a sustainable throughput of 2 TB/s and a peak throughput of 7 TB/s. A
  396. major challenge is supporting the highly bursty checkpoint writes that saturate the storage fabric for short
  397. durations. Checkpointing saves each GPU’s model state, ranging from 1 MB to 4 GB per GPU, for recovery
  398. and debugging. We aim to minimize GPU pause time during checkpointing and increase checkpoint frequency
  399. to reduce the amount of lost work after a recovery.
Network. Llama 3 405B used RDMA over Converged Ethernet (RoCE) fabric based on the Arista 7800
and Minipack2 Open Compute Project⁴ (OCP) rack switches. Smaller models in the Llama 3 family were
  402. trained using Nvidia Quantum2 Infiniband fabric. Both RoCE and Infiniband clusters leverage 400 Gbps
  403. interconnects between GPUs. Despite the underlying network technology differences between these clusters,
  404. we tune both of them to provide equivalent performance for these large training workloads. We elaborate
  405. further on our RoCE network since we fully own its design.
•Network topology. Our RoCE-based AI cluster comprises 24K GPUs⁵ connected by a three-layer Clos
  407. network (Lee et al., 2024). At the bottom layer, each rack hosts 16 GPUs split between two servers and
  408. connected by a single Minipack2 top-of-the-rack (ToR) switch. In the middle layer, 192 such racks are
  409. connected by Cluster Switches to form a pod of 3,072 GPUs with full bisection bandwidth, ensuring no
  410. oversubscription. At the top layer, eight such pods within the same datacenter building are connected via
  411. Aggregation Switches to form a cluster of 24K GPUs. However, network connectivity at the aggregation
  412. layer does not maintain full bisection bandwidth and instead has an oversubscription ratio of 1:7. Our
  413. model parallelism methods (see Section 3.3.2) and training job scheduler (Choudhury et al., 2024) are
  414. all optimized to be aware of network topology, aiming to minimize network communication across pods.
  415. •Load balancing. LLM training produces fat network flows that are hard to load balance across all
  416. available network paths using traditional methods such as Equal-Cost Multi-Path (ECMP) routing. To
  417. address this challenge, we employ two techniques. First, our collective library creates 16 network flows
  418. between two GPUs, instead of just one, thereby reducing the traffic per flow and providing more flows
⁴ Open Compute Project: https://www.opencompute.org/
⁵ Note that we use only up to 16K of these 24K GPUs for Llama 3 pre-training.
  422. GPUs TP CP PP DP Seq. Len. Batch size/DP Tokens/Batch TFLOPs/GPU BF16 MFU
  423. 8,192 8 1 16 64 8,192 32 16M 430 43%
  424. 16,384 8 1 16 128 8,192 16 16M 400 41%
  425. 16,384 8 16 16 8 131,072 16 16M 380 38%
  426. Table 4 Scaling configurations and MFU for each stage of Llama 3 405B pre-training. See text and Figure 5 for descriptions
  427. of each type of parallelism.
  428. for load balancing. Second, our Enhanced-ECMP (E-ECMP) protocol effectively balances these 16 flows
  429. across different network paths by hashing on additional fields in the RoCE header of packets.
  430. •Congestion control. We use deep-buffer switches in the spine (Gangidi et al., 2024) to accommodate
  431. transient congestion and buffering caused by collective communication patterns. This setup helps
  432. limit the impact of persistent congestion and network back pressure caused by slow servers, which is
  433. common in training. Finally, better load balancing through E-ECMP significantly reduces the chance
  434. of congestion. With these optimizations, we successfully run a 24K GPU cluster without traditional
  435. congestion control methods such as Data Center Quantized Congestion Notification (DCQCN).
  436. 3.3.2 Parallelism for Model Scaling
  437. To scale training for our largest models, we use 4D parallelism—a combination of four different types of
  438. parallelism methods—to shard the model. This approach efficiently distributes computation across many
  439. GPUs and ensures each GPU’s model parameters, optimizer states, gradients, and activations fit in its
  440. HBM. Our implementation of 4D parallelism is illustrated in Figure 5. It combines tensor parallelism (TP;
  441. Krizhevsky et al. (2012); Shoeybi et al. (2019); Korthikanti et al. (2023)), pipeline parallelism (PP; Huang
  442. et al. (2019); Narayanan et al. (2021); Lamy-Poirier (2023)), context parallelism (CP; Liu et al. (2023a)), and
  443. data parallelism (DP; Rajbhandari et al. (2020); Ren et al. (2021); Zhao et al. (2023b)).
  444. Tensor parallelism splits individual weight tensors into multiple chunks on different devices. Pipeline parallelism
  445. partitions the model vertically into stages by layers, so that different devices can process in parallel different
  446. stages of the full model pipeline. Context parallelism divides the input context into segments, reducing memory
  447. bottleneck for very long sequence length inputs. We use fully sharded data parallelism (FSDP; Rajbhandari
  448. et al., 2020; Ren et al., 2021; Zhao et al., 2023b), which shards the model, optimizer, and gradients while
  449. implementing data parallelism which processes data in parallel on multiple GPUs and synchronizes after each
  450. training step. Our use of FSDP for Llama 3 shards optimizer states and gradients, but for model shards we do
  451. not reshard after forward computation to avoid an extra all-gather communication during backward passes.
  452. GPU utilization. Through careful tuning of the parallelism configuration, hardware, and software, we achieve
  453. an overall BF16 Model FLOPs Utilization (MFU; Chowdhery et al. (2023)) of 38-43% for the configurations
  454. shown in Table 4. The slight drop in MFU to 41% on 16K GPUs with DP=128 compared to 43% on 8K
  455. GPUs with DP=64 is due to the lower batch size per DP group needed to keep the global tokens per batch
  456. constant during training.
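As a rough cross-check of these figures, MFU is simply achieved model FLOPs divided by the hardware peak; the sketch below assumes the commonly quoted ~989 TFLOPS dense BF16 peak for an H100 SXM GPU, which is not stated in the text.

```python
def mfu(achieved_tflops_per_gpu, peak_tflops_per_gpu=989.0):
    """Model FLOPs Utilization: achieved model FLOPs as a fraction of the hardware peak."""
    return achieved_tflops_per_gpu / peak_tflops_per_gpu

# Table 4's 430 and 380 TFLOPs/GPU correspond to roughly 43% and 38% MFU.
print(f"{mfu(430):.0%}", f"{mfu(380):.0%}")
```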
  457. Pipeline parallelism improvements. We encountered several challenges with existing implementations:
  458. •Batch size constraint. Current implementations have constraints on supported batch size per GPU,
  459. requiring it to be divisible by the number of pipeline stages. For the example in Figure 6, the depth-first
schedule (DFS) of pipeline parallelism (Narayanan et al., 2021) requires N = PP = 4, while the
breadth-first schedule (BFS; Lamy-Poirier (2023)) requires N = M, where M is the total number
of micro-batches and N is the number of contiguous micro-batches for the same stage’s forward or
backward. However, pre-training often needs flexibility to adjust the batch size.
  464. •Memory imbalance. Existing pipeline parallelism implementations lead to imbalanced resource consump-
  465. tion. The first stage consumes more memory due to the embedding and the warm-up micro-batches.
  466. •Computation imbalance. After the last layer of the model, we need to calculate output and loss, making
  467. this stage the execution latency bottleneck.
Figure 5  Illustration of 4D parallelism. GPUs are divided into parallelism groups in the order of [TP, CP, PP, DP], where
DP stands for FSDP. In this example, 16 GPUs are configured with group sizes |TP| = 2, |CP| = 2, |PP| = 2, and
|DP| = 2. A GPU’s position in 4D parallelism is represented as a vector [D1, D2, D3, D4], where Di is the index on
the i-th parallelism dimension. In this example, GPU0 [TP0, CP0, PP0, DP0] and GPU1 [TP1, CP0, PP0, DP0] are in
the same TP group, GPU0 and GPU2 are in the same CP group, GPU0 and GPU4 are in the same PP group, and
GPU0 and GPU8 are in the same DP group.
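A small sketch of the rank-to-coordinate mapping implied by the caption, with TP innermost and DP (FSDP) outermost; the group sizes are the caption's example values of 2.

```python
def parallel_coords(rank, tp=2, cp=2, pp=2, dp=2):
    """Map a global GPU rank to its [TP, CP, PP, DP] indices (TP innermost, DP outermost)."""
    tp_idx = rank % tp
    cp_idx = (rank // tp) % cp
    pp_idx = (rank // (tp * cp)) % pp
    dp_idx = (rank // (tp * cp * pp)) % dp
    return [tp_idx, cp_idx, pp_idx, dp_idx]

# GPU0 -> [0, 0, 0, 0]; GPU1 differs only in TP; GPU2 in CP; GPU4 in PP; GPU8 in DP.
for r in (0, 1, 2, 4, 8):
    print(r, parallel_coords(r))
```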
To address these issues, we modify our pipeline schedule as shown in Figure 6, which allows setting N
flexibly (in this case N = 5), so that each batch can run an arbitrary number of micro-batches. This allows
us to run: (1) fewer micro-batches than the number of stages when we have a batch size limit at large scale;
or (2) more micro-batches to hide point-to-point communication, finding a sweet spot between the depth-first
schedule (DFS) and the breadth-first schedule (BFS) for the best communication and memory efficiency. To balance the pipeline,
  480. we reduce one Transformer layer each from the first and the last stages, respectively. This means that
  481. the first model chunk on the first stage has only the embedding, and the last model chunk on the last
  482. stage has only output projection and loss calculation. To reduce pipeline bubbles, we use an interleaved
schedule (Narayanan et al., 2021) with V pipeline stages on one pipeline rank. The overall pipeline bubble ratio
is (PP − 1)/(V ∗ M). Further, we adopt asynchronous point-to-point communication in PP, which considerably speeds up
  486. training, especially in cases when the document mask introduces extra computation imbalance. We enable
  487. TORCH_NCCL_AVOID_RECORD_STREAMS to reduce memory usage from asynchronous point-to-point
  488. communication. Finally, to reduce memory cost, based on detailed memory allocation profiling, we proactively
deallocate tensors that will not be used for future computation, including the input and output tensors of each
pipeline stage. With these optimizations, we could pre-train
  491. Llama 3 on sequences of 8K tokens without activation checkpointing.
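A tiny sketch of the bubble-ratio expression above and of how it shrinks with more interleaved stages per rank (V) or more micro-batches (M); PP = 16 matches Table 4, while V and M are illustrative choices.

```python
def pipeline_bubble_ratio(pp, v, m):
    """Pipeline bubble ratio (PP - 1) / (V * M) for an interleaved schedule.

    pp: number of pipeline ranks, v: pipeline stages per rank,
    m: total micro-batches per global batch.
    """
    return (pp - 1) / (v * m)

# More micro-batches or more interleaved stages per rank shrink the bubble.
print(pipeline_bubble_ratio(pp=16, v=2, m=32))   # ~0.23
print(pipeline_bubble_ratio(pp=16, v=4, m=64))   # ~0.06
```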
  492. Context parallelism for long sequences. We utilize context parallelism (CP) to improve memory efficiency when
  493. scaling the context length of Llama 3 and enable training on extremely long sequences up to 128K in length.
In CP, we partition across the sequence dimension, and specifically we partition the input sequence into
2 × CP chunks so that each CP rank receives two chunks for better load balancing. The i-th CP rank receives
both the i-th and the (2 × CP − 1 − i)-th chunks.
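A sketch of that assignment: the sequence is cut into 2 × CP chunks and rank i owns chunks i and 2·CP − 1 − i, pairing an early (cheap under the causal mask) chunk with a late (expensive) one.

```python
def cp_chunks_for_rank(rank, cp_size):
    """Return the two chunk indices owned by a context-parallel rank."""
    return (rank, 2 * cp_size - 1 - rank)

# With CP=4 the sequence is cut into 8 chunks; rank 0 gets (0, 7), rank 3 gets (3, 4).
for r in range(4):
    print(r, cp_chunks_for_rank(r, cp_size=4))
```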
  497. Different from existing CP implementations that overlap communication and computation in a ring-like
  498. structure (Liu et al., 2023a), our CP implementation adopts an all-gather based method where we first
  499. all-gather the key (K) and value (V) tensors, and then compute attention output for the local query (Q)
  500. tensor chunk. Although the all-gather communication latency is exposed in the critical path, we still adopt
  501. this approach for two main reasons: (1) it is easier and more flexible to support different types of attention
  502. masks in all-gather based CP attention, such as the document mask; and (2) the exposed all-gather latency
Figure 6  Illustration of pipeline parallelism in Llama 3. Pipeline parallelism partitions eight pipeline stages (0 to 7) across
four pipeline ranks (PP ranks 0 to 3), where the GPUs with PP rank 0 run stages 0 and 4, the GPUs with PP rank 1 run
stages 1 and 5, etc. The colored blocks (0 to 9) represent a sequence of micro-batches, where M is the total number of
micro-batches and N is the number of contiguous micro-batches for the same stage’s forward or backward. Our key
insight is to make N tunable.
is small as the communicated K and V tensors are much smaller than the Q tensor due to the use of GQA (Ainslie
et al., 2023). Hence, the time complexity of attention computation is an order of magnitude larger than
all-gather (O(S²) versus O(S), where S represents the sequence length in the full causal mask), making the
all-gather overhead negligible.
  513. Network-aware parallelism configuration. The order of parallelism dimensions, [TP, CP, PP, DP], is optimized
  514. for network communication. The innermost parallelism requires the highest network bandwidth and lowest
  515. latency, and hence is usually constrained to within the same server. The outermost parallelism may spread
  516. across a multi-hop network and should tolerate higher network latency. Therefore, based on the requirements
  517. for network bandwidth and latency, we place parallelism dimensions in the order of [TP, CP, PP, DP]. DP
  518. (i.e., FSDP) is the outermost parallelism because it can tolerate longer network latency by asynchronously
  519. prefetching sharded model weights and reducing gradients. Identifying the optimal parallelism configuration
  520. with minimal communication overhead while avoiding GPU memory overflow is challenging. We develop a
memory consumption estimator and a performance-projection tool, which helped us explore various parallelism
configurations, project overall training performance, and identify memory gaps effectively.
  523. Numerical stability. By comparing training loss between different parallelism setups, we fixed several numerical
  524. issues that impact training stability. To ensure training convergence, we use FP32 gradient accumulation
  525. during backward computation over multiple micro-batches and also reduce-scatter gradients in FP32 across
  526. data parallel workers in FSDP. For intermediate tensors, e.g., vision encoder outputs, that are used multiple
  527. times in the forward computation, the backward gradients are also accumulated in FP32.
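A toy sketch of the FP32-accumulation idea (not the FSDP internals): per-micro-batch BF16 gradients are upcast and summed into FP32 master buffers so that many small contributions are not lost to BF16 round-off.

```python
import torch

def accumulate_grads_fp32(micro_batch_grads_bf16, fp32_buffers):
    """Accumulate one micro-batch's BF16 gradients into FP32 master buffers."""
    for grad, buf in zip(micro_batch_grads_bf16, fp32_buffers):
        buf.add_(grad.float())  # upcast before accumulating

# 1,000 tiny gradients: the FP32 buffer sums them faithfully, whereas accumulating
# directly in BF16 loses precision once the running sum dwarfs each increment.
grads = [torch.full((4,), 1e-3, dtype=torch.bfloat16) for _ in range(1000)]
fp32_buffer = [torch.zeros(4, dtype=torch.float32)]
for g in grads:
    accumulate_grads_fp32([g], fp32_buffer)
print(fp32_buffer[0])  # close to 1.0 in every element
```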
  528. 3.3.3 Collective Communication
  529. Our collective communication library for Llama 3 is based on a fork of Nvidia’s NCCL library, called NCCLX.
  530. NCCLX significantly improves the performance of NCCL, especially for higher latency networks. Recall that
  531. the order of parallelism dimensions is [TP, CP, PP, DP], where DP corresponds to FSDP. The outermost
  532. parallelism dimensions, PP and DP, may communicate through a multi-hop network, with latency up to tens
of microseconds. The original NCCL collectives (all-gather and reduce-scatter in FSDP, and point-to-point
in PP) require data chunking and staged data copy. This approach incurs several inefficiencies, including
  535. (1) requiring a large number of small control messages to be exchanged over the network to facilitate data
  536. transfer, (2) extra memory-copy operations, and (3) using extra GPU cycles for communication. For Llama 3
  537. training, we address a subset of these inefficiencies by tuning chunking and data transfer to fit our network
  538. latencies, which can be as high as tens of microseconds for a large cluster. We also allow small control messages
  539. to traverse our network at a higher priority, especially avoiding being head-of-line blocked in deep-buffer
  540. core switches. Our ongoing work for future Llama versions involves making deeper changes in NCCLX to
  541. holistically address all the aforementioned problems.
  543. Component Category Interruption Count % of Interruptions
  544. Faulty GPU GPU 148 30.1%
  545. GPU HBM3 Memory GPU 72 17.2%
  546. Software Bug Dependency 54 12.9%
  547. Network Switch/Cable Network 35 8.4%
Host Maintenance Unplanned Maintenance 32 7.6%
  550. GPU SRAM Memory GPU 19 4.5%
  551. GPU System Processor GPU 17 4.1%
  552. NIC Host 7 1.7%
  553. NCCL Watchdog Timeouts Unknown 7 1.7%
  554. Silent Data Corruption GPU 6 1.4%
  555. GPU Thermal Interface + Sensor GPU 6 1.4%
  556. SSD Host 3 0.7%
  557. Power Supply Host 3 0.7%
  558. Server Chassis Host 2 0.5%
  559. IO Expansion Board Host 2 0.5%
  560. Dependency Dependency 2 0.5%
  561. CPU Host 2 0.5%
  562. System Memory Host 2 0.5%
  563. Table 5 Root-cause categorization of unexpected interruptions during a 54-day period of Llama 3 405B pre-training. About
  564. 78% of unexpected interruptions were attributed to confirmed or suspected hardware issues.
  565. 3.3.4 Reliability and Operational Challenges
  566. The complexity and potential failure scenarios of 16K GPU training surpass those of much larger CPU clusters
  567. that we have operated. Moreover, the synchronous nature of training makes it less fault-tolerant—a single
  568. GPU failure may require a restart of the entire job. Despite these challenges, for Llama 3, we achieved higher
  569. than 90% effective training time while supporting automated cluster maintenance, such as firmware and Linux
  570. kernel upgrades (Vigraham and Leonhardi, 2024), which resulted in at least one training interruption daily.
  571. The effective training time measures the time spent on useful training over the elapsed time.
  572. During a 54-day snapshot period of pre-training, we experienced a total of 466 job interruptions. Of these, 47
  573. were planned interruptions due to automated maintenance operations such as firmware upgrades or operator-
  574. initiated operations like configuration or dataset updates. The remaining 419 were unexpected interruptions,
  575. which are classified in Table 5. Approximately 78% of the unexpected interruptions are attributed to confirmed
  576. hardware issues, such as GPU or host component failures, or suspected hardware-related issues like silent data
  577. corruption and unplanned individual host maintenance events. GPU issues are the largest category, accounting
  578. for 58.7% of all unexpected issues. Despite the large number of failures, significant manual intervention was
required only three times during this period, with the rest of the issues handled by automation.
  580. To increase the effective training time, we reduced job startup and checkpointing time, and developed tools
  581. for fast diagnosis and problem resolution. We extensively use PyTorch’s built-in NCCL flight recorder (Ansel
et al., 2024), a feature that captures collective metadata and stack traces into a ring buffer, allowing
  583. us to diagnose hangs and performance issues quickly at scale, particularly with regard to NCCLX. Using
  584. this, we efficiently record every communication event and the duration of each collective operation, and also
  585. automatically dump tracing data on NCCLX watchdog or heartbeat timeout. We enable more computationally
  586. intensive tracing operations and metadata collection selectively as needed live in production through online
  587. configuration changes (Tang et al., 2015) without needing a code release or job restart.
  588. Debugging issues in large-scale training is complicated by the mixed use of NVLink and RoCE in our network.
  589. Data transfer over NVLink typically occurs through load/store operations issued by CUDA kernels, and
  590. failures in either the remote GPU or NVLink connectivity often manifest as stalled load/store operations
  591. within CUDA kernels without returning a clear error code. NCCLX enhances the speed and accuracy of failure
  593. detection and localization through a tight co-design with PyTorch, allowing PyTorch to access NCCLX’s
  594. internal state and track relevant information. While stalls due to NVLink failures cannot be completely
  595. prevented, our system monitors the state of the communication library and automatically times out when
  596. such a stall is detected. Additionally, NCCLX traces the kernel and network activities of each NCCLX
  597. communication and provides a snapshot of the failing NCCLX collective’s internal state, including finished
  598. and pending data transfers between all ranks. We analyze this data to debug NCCLX scaling issues.
Sometimes, hardware issues may cause still-functioning but slow stragglers that are hard to detect. Even a single
  600. straggler can slow down thousands of other GPUs, often appearing as functioning but slow communications.
  601. We developed tools to prioritize potentially problematic communications from selected process groups. By
  602. investigating just a few top suspects, we were usually able to effectively identify the stragglers.
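A minimal sketch of the idea behind such tooling follows; the detection rule and the numbers are illustrative, not our internal implementation. The idea is to compare how long each rank waits before joining the same collective and flag statistical outliers.

    # Flag potential stragglers: ranks whose wait time before a collective is a large
    # outlier relative to the median (robust z-score). Values below are hypothetical.
    import statistics

    def find_stragglers(arrival_delays_by_rank, threshold=3.0):
        delays = list(arrival_delays_by_rank.values())
        median = statistics.median(delays)
        mad = statistics.median(abs(d - median) for d in delays) or 1e-9
        return [rank for rank, d in arrival_delays_by_rank.items()
                if (d - median) / (1.4826 * mad) > threshold]

    print(find_stragglers({0: 0.011, 1: 0.010, 2: 0.012, 3: 0.094}))  # -> [3]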
  603. One interesting observation is the impact of environmental factors on training performance at scale. For
Llama 3 405B, we noted a diurnal 1-2% throughput variation based on time-of-day. This fluctuation is the
  605. result of higher mid-day temperatures impacting GPU dynamic voltage and frequency scaling.
  606. During training, tens of thousands of GPUs may increase or decrease power consumption at the same time,
  607. for example, due to all GPUs waiting for checkpointing or collective communications to finish, or the startup
  608. or shutdown of the entire training job. When this happens, it can result in instant fluctuations of power
  609. consumption across the data center on the order of tens of megawatts, stretching the limits of the power grid.
  610. This is an ongoing challenge for us as we scale training for future, even larger Llama models.
  611. 3.4 Training Recipe
The recipe used to pre-train Llama 3 405B consists of three main stages: (1) initial pre-training, (2) long-context
pre-training, and (3) annealing. The three stages are described separately below. We use similar recipes to
  614. pre-train the 8B and 70B models.
  615. 3.4.1 Initial Pre-Training
We pre-train Llama 3 405B using AdamW with a peak learning rate of 8 × 10−5, a linear warm-up of 8,000
steps, and a cosine learning rate schedule decaying to 8 × 10−7 over 1,200,000 steps. We use a lower batch size
early in training to improve training stability, and increase it subsequently to improve efficiency. Specifically,
we use an initial batch size of 4M tokens and sequences of length 4,096, and double these values to a batch
size of 8M tokens and sequences of length 8,192 after pre-training on 252M tokens. We double the batch size again to 16M tokens
  621. after pre-training on 2.87T tokens. We found this training recipe to be very stable: we observed few loss
  622. spikes and did not require interventions to correct for model training divergence.
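A minimal sketch of the stated learning rate schedule is given below; it assumes, for illustration, that the 1,200,000-step decay horizon is counted from the end of warm-up, since the exact accounting is not specified above.

    # Linear warm-up to 8e-5 over 8,000 steps, then cosine decay to 8e-7.
    import math

    PEAK_LR, MIN_LR = 8e-5, 8e-7
    WARMUP_STEPS, DECAY_STEPS = 8_000, 1_200_000

    def learning_rate(step):
        if step < WARMUP_STEPS:
            return PEAK_LR * step / WARMUP_STEPS
        progress = min((step - WARMUP_STEPS) / DECAY_STEPS, 1.0)
        return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

    print(learning_rate(8_000))              # peak: 8e-5
    print(learning_rate(8_000 + 1_200_000))  # fully decayed: 8e-7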
Adjusting the data mix. We made several adjustments to the pre-training data mix during training to improve
  624. model performance on particular downstream tasks. In particular, we increased the percentage of non-English
data during pre-training to improve the multilingual performance of Llama 3. We also upsampled mathematical
data to improve the model’s mathematical reasoning performance, added more recent web data in the
later stages of pre-training to advance the model’s knowledge cut-off, and downsampled subsets of the
pre-training data that were later identified as being lower quality.
  629. 3.4.2 Long Context Pre-Training
  630. In the final stages of pre-training, we train on long sequences to support context windows of up to 128K tokens.
  631. We do not train on long sequences earlier because the compute in self-attention layers grows quadratically in
  632. the sequence length. We increase the supported context length in increments, pre-training until the model has
  633. successfully adapted to the increased context length. We assess successful adaptation by measuring whether (1)
model performance on short-context evaluations has recovered completely and (2) the model perfectly solves
  635. “needle in a haystack” tasks up to that length. In Llama 3 405B pre-training, we increased context length
  636. gradually in six stages, starting from the original 8K context window and ending in the final 128K context
  637. window. This long-context pre-training stage was performed using approximately 800B training tokens.
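A minimal sketch of such a needle-in-a-haystack probe follows; model_answer stands in for querying the model, and the filler length would be chosen to reach the target token count.

    # Hide a fact at a random depth in filler text and check whether the model retrieves it.
    import random

    NEEDLE = "The secret code is 7421."

    def make_haystack(needle, n_filler_sentences, depth_fraction):
        filler = ["The sky was a calm shade of blue that afternoon."] * n_filler_sentences
        filler.insert(int(depth_fraction * n_filler_sentences), needle)
        return " ".join(filler)

    def needle_recall(model_answer, n_filler_sentences=4_000, trials=10):
        hits = 0
        for _ in range(trials):
            context = make_haystack(NEEDLE, n_filler_sentences, random.random())
            hits += "7421" in model_answer(context + "\n\nWhat is the secret code?")
        return hits / trials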
  639. Figure 7 Illustration of the overall post-training approach for Llama 3. Our post-training strategy involves rejection sampling,
  640. supervised finetuning, and direct preference optimization. See text for details.
  641. 3.4.3 Annealing
  642. During pre-training on the final 40M tokens, we linearly annealed the learning rate to 0, maintaining a context
  643. length of 128K tokens. During this annealing phase, we also adjusted the data mix to upsample data sources
  644. of very high quality; see Section 3.1.3. Finally, we compute the average of model checkpoints (Polyak (1991)
  645. averaging) during annealing to produce the final pre-trained model.
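A minimal sketch of checkpoint averaging as used here is shown below; it assumes each checkpoint is saved as a flat parameter state dict, and the file names are hypothetical.

    # Average several annealing-phase checkpoints element-wise to form the final model.
    import torch

    def average_checkpoints(paths):
        avg = None
        for path in paths:
            state = torch.load(path, map_location="cpu")
            if avg is None:
                avg = {k: v.clone().float() for k, v in state.items()}
            else:
                for k, v in state.items():
                    avg[k] += v.float()
        return {k: v / len(paths) for k, v in avg.items()}

    # final_state = average_checkpoints(["ckpt_step_a.pt", "ckpt_step_b.pt", "ckpt_step_c.pt"])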
  646. 4 Post-Training
We produce the aligned Llama 3 models by applying several rounds of post-training,6 or aligning the model
  648. with human feedback (Ouyang et al., 2022; Rafailov et al., 2024) on top of a pre-trained checkpoint. Each
  649. round of post-training involves supervised finetuning (SFT) followed by Direct Preference Optimization (DPO;
  650. Rafailov et al., 2024) on examples collected either via human annotations or generated synthetically. Our
  651. post-training modeling and data approaches are described in Sections 4.1 and 4.2 respectively. We further
  652. detail custom data curation strategies to improve the reasoning, coding, factuality, multilingual, tool use, long
  653. context, and precise instruction following in Section 4.3.
  654. 4.1 Modeling
  655. The backbone of our post-training strategy is a reward model and a language model. We first train a reward
  656. model on top of the pre-trained checkpoint using human-annotated preference data (see Section 4.1.2). We
  657. then finetune pre-trained checkpoints with supervised finetuning (SFT; see Section 4.1.3), and further align
  658. the checkpoints with Direct Preference Optimization (DPO; see Section 4.1.4). This process is illustrated
  659. in Figure 7. Unless otherwise noted, our modeling procedure applies to Llama 3 405B, and we refer to
  660. Llama 3 405B as Llama 3 for simplicity.
  661. 4.1.1 Chat Dialog Format
  662. To tune LLMs for human-AI interaction, we need to define a chat dialog protocol for the model to understand
  663. human instructions and perform conversational tasks. Compared to its predecessor, Llama 3 has new
capabilities such as tool use (Section 4.3.5) which may require generating multiple messages and sending
them to different locations (e.g., user, ipython) within a single dialog turn. To support this, we design a new
multi-message chat protocol which uses various special header and termination tokens. The header tokens
are used to indicate the source and destination of each message in a conversation. Similarly, the termination
tokens indicate when it is time to alternate between human and AI speakers.
6 We use the term “post-training” to refer to any model training that happens outside of pre-training.
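A minimal sketch of what such a multi-message format looks like is given below; the specific token strings follow the publicly released Llama 3 chat template and are shown for illustration only.

    # Each message carries a header naming its source/destination role and ends with a
    # termination token; a dialog turn may contain several such messages.
    def render_message(role, content):
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

    def render_dialog(messages):
        return "<|begin_of_text|>" + "".join(render_message(r, c) for r, c in messages)

    dialog = render_dialog([
        ("system", "Environment: ipython"),
        ("user", "Plot the first ten square numbers."),
        ("assistant", "import matplotlib.pyplot as plt\nplt.plot([i * i for i in range(1, 11)])"),
        ("ipython", "[figure rendered]"),          # tool output fed back into the dialog
        ("assistant", "Here is the plot of the first ten square numbers."),
    ])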
  671. 4.1.2 Reward Modeling
  672. We train a reward model (RM) covering different capabilities on top of the pre-trained checkpoint. The
  673. training objective is the same as Llama 2 except that we remove the margin term in the loss, as we observe
  674. diminishing improvements after data scaling. Following Llama 2, we use all of our preference data for reward
modeling after filtering out samples with similar responses. In addition to the standard preference pair of (chosen,
rejected) responses, annotators also create a third “edited response” for some prompts, where the chosen
response from the pair is further edited for improvement (see Section 4.2.1). Hence, each preference ranking
sample has two or three responses with a clear ranking (edited > chosen > rejected). We concatenate the
  679. prompt and multiple responses into a single row during training with responses randomly shuffled. This is an
  680. approximation to the standard scenario of putting the responses in separate rows and computing the scores,
  681. but in our ablations, this approach improves training efficiency without a loss in accuracy.
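For reference, a minimal sketch of the pairwise ranking objective with the margin term removed (a Bradley-Terry style loss on reward score differences) is shown below; with an edited response available, the same loss can be applied to each ordered pair.

    # Pairwise reward-model loss without a margin: -log sigmoid(r_chosen - r_rejected).
    import torch
    import torch.nn.functional as F

    def reward_pair_loss(r_chosen, r_rejected):
        # r_chosen, r_rejected: reward scores for each comparison, shape [batch]
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    loss = reward_pair_loss(torch.tensor([1.3, 0.2]), torch.tensor([0.4, -0.1]))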
  682. 4.1.3 Supervised Finetuning
  683. The reward model is then used to perform rejection sampling on our human annotation prompts, the details
  684. of which are described in Section 4.2. Together with this rejection-sampled data and other data sources
  685. (including synthetic data), we finetune the pre-trained language model using a standard cross entropy loss
  686. on the target tokens (while masking loss on prompt tokens). More details about the data mix can be found
  687. in Section 4.2. We refer to this stage as supervised finetuning (SFT; Wei et al., 2022a; Sanh et al., 2022;
  688. Wang et al., 2022b), even though many of the training targets are model-generated. Our largest models are
finetuned with a learning rate of 10−5 over the course of 8.5K to 9K steps. We found these hyperparameter
  690. settings to work well across different rounds and data mixes.
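A minimal sketch of the SFT objective with loss masked on prompt tokens is shown below; it uses the common convention of an ignore index of -100, and the shapes are illustrative.

    # Cross entropy on target tokens only: prompt positions are assigned the ignore index.
    import torch
    import torch.nn.functional as F

    def sft_loss(logits, input_ids, prompt_len):
        # logits: [seq, vocab] next-token predictions; input_ids: [seq]
        labels = input_ids.clone()
        labels[:prompt_len] = -100                      # mask loss on prompt tokens
        return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)

    vocab, seq = 128, 16
    loss = sft_loss(torch.randn(seq, vocab), torch.randint(0, vocab, (seq,)), prompt_len=6)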
  691. 4.1.4 Direct Preference Optimization
  692. We further train our SFT models with Direct Preference Optimization (DPO; Rafailov et al., 2024) for human
  693. preference alignment. For training, we primarily use the most recent batches of preference data collected using
  694. the best performing models from the previous alignment rounds. As a result, our training data conforms better
  695. to the distribution of the policy model that is being optimized in each round. We also explored on-policy
  696. algorithms such as PPO (Schulman et al., 2017), but found that DPO required less compute for large-scale
  697. models and performed better, especially on instruction following benchmarks like IFEval (Zhou et al., 2023).
For Llama 3, we use a learning rate of 10−5 and set the β hyper-parameter to 0.1. In addition, we apply
  699. the following algorithmic modifications to DPO:
•Masking out formatting tokens in DPO loss: We mask out special formatting tokens including header
  701. and termination tokens (described in Section 4.1.1) from both chosen and rejected responses in the
  702. loss to stabilize DPO training. We observe that having these tokens contribute to the loss may lead
  703. to undesired model behaviors such as tail repetition or abruptly generating termination tokens. We
  704. hypothesize that this is due to the contrastive nature of the DPO loss – the presence of common tokens
  705. in both chosen and rejected responses leads to a conflicting learning objective as the model needs to
  706. increase and reduce the likelihood of these tokens simultaneously.
•Regularization with NLL loss: We add an additional negative log-likelihood (NLL) loss term with a scaling
coefficient of 0.2 on the chosen sequences, similar to Pang et al. (2024). This helps further stabilize DPO
training by maintaining desired formatting for generation and preventing the decrease of log probability
of chosen responses (Pang et al., 2024; Pal et al., 2024). A schematic sketch of the combined objective is given below.
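The sketch below combines both modifications; the per-sequence log-probabilities are assumed to be summed only over non-formatting tokens (i.e., the mask is applied upstream), and the tensor values are illustrative.

    # DPO loss on masked sequence log-probs plus an NLL regularizer on the chosen response.
    import torch
    import torch.nn.functional as F

    def dpo_with_nll(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
                     chosen_token_logps, beta=0.1, nll_coef=0.2):
        margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
        dpo = -F.logsigmoid(beta * margin).mean()
        nll = -chosen_token_logps.mean()                # NLL term on the chosen sequence
        return dpo + nll_coef * nll

    loss = dpo_with_nll(torch.tensor([-120.5]), torch.tensor([-131.0]),
                        torch.tensor([-122.0]), torch.tensor([-129.4]),
                        chosen_token_logps=torch.tensor([-1.1, -0.7, -2.0]))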
  711. 4.1.5 Model Averaging
  712. Finally, we average models obtained from experiments using various versions of data or hyperparameters at
  713. each RM, SFT, or DPO stage (Izmailov et al., 2019; Wortsman et al., 2022; Li et al., 2022).
Dataset % of comparisons Avg. # turns per dialog Avg. # tokens per example Avg. # tokens in prompt Avg. # tokens in response
  717. General English 81.99% 4.1 1,000.4 36.4 271.2
  718. Coding 6.93% 3.2 1,621.0 113.8 462.9
  719. Multilingual 5.19% 1.8 1,299.4 77.1 420.9
  720. Reasoning and tools 5.89% 1.6 707.7 46.6 129.9
  721. Total 100% 3.8 1,041.6 44.5 284.0
  722. Table 6 Statistics of human preference data. We list statistics of the internally collected human preference data used for
  723. Llama 3 alignment. We ask annotators to perform multi-turn dialogues with the models and make comparisons among
responses at each turn. In post-processing, we split each dialogue into multiple examples at the turn level. Each example
  725. consists of a prompt (including previous dialog if available) and a response (e.g., chosen or rejected response).
  726. 4.1.6 Iterative Rounds
  727. Following Llama 2, we apply the above methods in six rounds. In each cycle, we collect new preference
  728. annotations and SFT data, sampling synthetic data from the latest models.
  729. 4.2 Post-training Data
  730. The post-training data composition plays a critical role in the usefulness and behavior of language models. In
  731. this section, we discuss our human annotation procedures and preference data collection (Section 4.2.1), the
  732. composition of our SFT data (Section 4.2.2), and methods for data quality control and cleaning (Section 4.2.3).
  733. 4.2.1 Preference Data
  734. Our preference data annotation process is similar to Llama 2. We deploy multiple models for annotation after
  735. each round and sample two responses from two different models for each user prompt. These models can
be trained with different data mixes and alignment recipes, allowing for different capability strengths (e.g.,
  737. code expertise) and increased data diversity. We ask annotators to rate the strength of their preference by
  738. categorizing it into one of four levels, based on how much more they prefer the chosen response over the
  739. rejected one: significantly better, better, slightly better, or marginally better. We also incorporate an editing
  740. step after preference ranking to encourage annotators to further improve the preferred response. Annotators
  741. edit the chosen response directly or prompt the model with feedback to refine its own response. Consequently,
a portion of our preference data has three responses ranked (edited > chosen > rejected).
  743. In Table 6, we report the statistics of preference annotations that we use for Llama 3 training. General English
covers multiple subcategories such as knowledge-based question answering or precise instruction-following,
  745. which fall outside the scope of specific capabilities. Compared to Llama 2, we observe an increase in the
  746. average length of prompt and response, suggesting that we train Llama 3 on more complex tasks. In addition,
  747. we implement a quality analysis and human evaluation process to rigorously assess the data collected, allowing
  748. us to refine our prompts and provide systematic, actionable feedback to annotators. For example, as Llama 3
  749. improves after each round, we increase prompt complexity accordingly to target areas where the model lags.
  750. In each round of post-training, we use all the preference data that is available at the time for reward modeling,
  751. while only using the latest batches from various capabilities for DPO training. For both reward modeling and
  752. DPO, we use samples that are labeled as the chosen response being significantly better or better than the
  753. rejected counterpart for training and discard samples with similar responses.
  754. 4.2.2 SFT Data
  755. Our finetuning data is largely comprised of the following sources:
  756. •Prompts from our human annotation collection with rejection-sampled responses.
  757. •Synthetic data targeting specific capabilities (see Section 4.3 for more details).
Dataset % of examples Avg. # turns Avg. # tokens Avg. # tokens in context Avg. # tokens in final response
  761. General English 52.66% 6.3 974.0 656.7 317.1
  762. Code 14.89% 2.7 753.3 378.8 374.5
  763. Multilingual 3.01% 2.7 520.5 230.8 289.7
  764. Exam-like 8.14% 2.3 297.8 124.4 173.4
  765. Reasoning and tools 21.19% 3.1 661.6 359.8 301.9
  766. Long context 0.11% 6.7 38,135.6 37,395.2 740.5
  767. Total 100% 4.7 846.1 535.7 310.4
  768. Table 7 Statistics of SFT data. We list internally collected SFT data used for Llama 3 alignment. Each SFT example
  769. consists of a context (i.e., all conversation turns except the last one) and a final response.
  770. •Small amounts of human-curated data (see Section 4.3 for more details).
  771. As our post-training rounds progress, we develop stronger Llama 3 variants that we use to collect larger
  772. datasets that cover a wide range of complex capabilities. In this section, we discuss the details for the
  773. rejection-sampling procedure and overall composition of our final SFT datamix.
  774. Rejection sampling. During rejection sampling (RS), for each prompt collected during human annotation
(Section 4.2.1) we sample K (typically between 10 and 30) outputs from the latest chat model policy (usually
  776. the best performing checkpoint from the previous post-training iteration, or the best performing checkpoint
  777. for a particular capability) and use our reward model to select the best candidate, consistent with Bai et al.
  778. (2022). In later rounds of post-training, we introduce system prompts to steer RS responses to conform with
  779. desirable tone, style, or formatting, which might be different for different capabilities.
  780. To increase the efficiency of rejection sampling, we adopt PagedAttention (Kwon et al., 2023). PagedAttention
  781. enhances memory efficiency through dynamic key-value cache allocation. It supports arbitrary output lengths
  782. by dynamically scheduling requests based on the current cache capacity. Unfortunately, this carries the risk of
  783. swap-out when running out of memory. To eliminate such swap overhead, we define a maximum output length
  784. and perform a request only if sufficient memory is available to fit an output with that length. PagedAttention
  785. also enables us to share the key-value cache pages for a prompt across all corresponding outputs. Together,
this leads to a throughput improvement of over 2× during rejection sampling.
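A minimal sketch of the admission rule described above follows; the block size and request sizes are illustrative, and this is not the scheduler of any particular inference library.

    # Admit a request only if the KV-cache block pool can hold its prompt plus the
    # defined maximum output length, so running requests are never swapped out.
    def schedule(pending, free_blocks, block_size=16):
        admitted = []
        for prompt_tokens, max_output_tokens in pending:
            needed = -(-(prompt_tokens + max_output_tokens) // block_size)  # ceil division
            if needed <= free_blocks:
                free_blocks -= needed
                admitted.append((prompt_tokens, max_output_tokens))
        return admitted

    # (prompt_tokens, max_output_tokens) pairs; only requests that fit are started now.
    print(schedule([(4_000, 1_024), (900, 1_024)], free_blocks=280))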
  787. Overall data composition. Table 7 shows data statistics for each broad category of our “helpfulness” mix. While
  788. SFT and preference data contain overlapping domains, they are curated differently, yielding distinct count
  789. statistics. In Section 4.2.3 we describe techniques for categorizing topic, complexity, and quality of our data
  790. samples. In each round of post-training, we adjust our overall data mix carefully across these axes to tune
  791. performance across a wide range of benchmarks. Our final data mix epochs multiple times on some high
  792. quality sources and downsamples others.
  793. 4.2.3 Data Processing and Quality Control
Given that most of our training data is model-generated, it requires careful cleaning and quality control.
  795. Data cleaning. In the early rounds, we observed a number of undesirable patterns common in our data, such
  796. as excessive use of emojis or exclamation points. Therefore, we implement a series of rule-based data removal
  797. and modification strategies to filter or clean problematic data. For example, to mitigate overly-apologetic
  798. tonal issues, we identify overused phrases (such as “I’m sorry” or “I apologize”) and carefully balance the
  799. proportion of such samples in our dataset.
  800. Data pruning. We also apply a collection of model-based techniques to remove low-quality training samples
  801. and improve overall model performance:
  802. •Topic classification: We first finetune Llama 3 8B into a topic classifier, and perform inference over
  803. all data to classify it into both coarsely-grained buckets (“mathematical reasoning”) and fine-grained
  805. buckets (“geometry and trigonometry”).
  806. •Quality scoring: We use both reward model and Llama-based signals to obtain a quality score for each
  807. sample. For an RM-based score, we consider data that is in the top quartile of RM scores as high quality.
For a Llama-based score, we prompt a Llama 3 checkpoint to rate each sample on a three-point scale for
  809. general English data (accuracy, instruction following, and tone/presentation) and a two-point scale for
  810. coding data (bug identification and user intention), and consider samples that obtain the maximum
score as high quality. The RM and Llama-based scores have high disagreement rates, and we find that
combining these signals yields the best recall on our internal test set. Ultimately, we select examples
that are marked as high quality by the RM or the Llama-based filter.
  814. •Difficulty scoring: Because we are also interested in prioritizing examples that are more complex for
  815. the model, we score data using two measures of difficulty: Instag (Lu et al., 2023) and Llama-based
  816. scoring. For Instag, we prompt Llama 3 70B to perform intention tagging of SFT prompts, where more
  817. intentions implies more complexity. We also prompt Llama 3 to measure the difficulty (Liu et al., 2024c)
  818. of dialogs on a three-point scale.
  819. •Semantic deduplication: Finally, we perform semantic deduplication (Abbas et al., 2023; Liu et al.,
  820. 2024c). We first cluster complete dialogs using RoBERTa (Liu et al., 2019b) and within each cluster
sort them by quality score × difficulty score. We then do greedy selection by iterating through all sorted
  822. examples, and only keeping the ones that have maximum cosine similarity less than a threshold to the
  823. examples seen so far in the cluster.
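A minimal sketch of the greedy pass within a single cluster is shown below; the embeddings would come from RoBERTa in practice, whereas here they are arbitrary vectors, and the threshold is illustrative.

    # Sort a cluster by quality x difficulty, then keep an example only if its maximum
    # cosine similarity to already-kept examples is below the threshold.
    import numpy as np

    def dedupe_cluster(embeddings, scores, threshold=0.95):
        order = np.argsort(scores)[::-1]                 # best quality x difficulty first
        unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        kept = []
        for i in order:
            if all(float(unit[i] @ unit[j]) < threshold for j in kept):
                kept.append(int(i))
        return kept

    kept = dedupe_cluster(np.random.rand(100, 768), np.random.rand(100))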
  824. 4.3 Capabilities
  825. We highlight special efforts to improve performance for specific capabilities such as code (Section 4.3.1),
  826. multilinguality (Section 4.3.2), math and reasoning (Section 4.3.3), long context (Section 4.3.4), tool use
  827. (Section 4.3.5), factuality (Section 4.3.6), and steerability (Section 4.3.7).
  828. 4.3.1 Code
  829. LLMs for code have received significant attention since the release of Copilot and Codex (Chen et al., 2021).
  830. Developers are now widely using these models to generate code snippets, debug, automate tasks, and improve
  831. code quality. For Llama 3, we target improving and evaluating code generation, documentation, debugging,
  832. and review capabilities for the following high priority programming languages: Python, Java, Javascript,
  833. C/C++, Typescript, Rust, PHP, HTML/CSS, SQL, bash/shell. Here, we present our work on improving
  834. these coding capabilities via training a code expert, generating synthetic data for SFT, improving formatting
  835. with system prompt steering, and creating quality filters to remove bad samples from our training data.
  836. Expert training. We train a code expert which we use to collect high quality human annotations for code
  837. throughout subsequent rounds of post-training. This is accomplished by branching the main pre-training run
  838. and continuing pre-training on a 1T token mix of mostly (>85%) code data. Continued pre-training on domain-
  839. specific data has been shown to be effective for improving performance in a specific domain (Gururangan
  840. et al., 2020). We follow a recipe similar to that of CodeLlama (Rozière et al., 2023). For the last several
  841. thousand steps of training we perform long-context finetuning (LCFT) to extend the expert’s context length
to 16K tokens on a high quality mix of repo-level code data. Finally, we follow a post-training modeling
recipe similar to that described in Section 4.1 to align this model, except with SFT and DPO data mixes primarily
  844. targeting code. This model is also used for rejection sampling (Section 4.2.2) for coding prompts.
  845. Synthetic data generation. During development, we identified key issues in code generation, including difficulty
  846. in following instructions, code syntax errors, incorrect code generation, and difficulty in fixing bugs. While
  847. intensive human annotation could theoretically resolve these issues, synthetic data generation offers a
  848. complementary approach at a lower cost and higher scale, unconstrained by the expertise level of annotators.
  849. As such, we use Llama 3 and the code expert to generate a large quantity of synthetic SFT dialogs.
  850. We describe three high-level approaches for generating synthetic code data. In total, we generate over 2.7M
  851. synthetic examples which were used during SFT.
  853. 1.Synthetic data generation: execution feedback. The 8B and 70B models show significant performance
  854. improvements when trained on data generated by a larger, more competent model. However, our initial
  855. experiments revealed that training Llama 3 405B on its own generated data is not helpful (and can
  856. even degrade performance). To address this limitation, we introduced execution feedback as a source of
truth, enabling the model to learn from its mistakes and stay on track. In particular, we generate a large
dataset of approximately one million synthetic coding dialogues using the following process (a schematic sketch of the execution-feedback loop appears after the three data generation approaches below):
  859. •Problem description generation: First, we generate a large collection of programming problem
  860. descriptions that span a diverse range of topics, including those in the long tail distribution. To
  861. achieve this diversity, we sample random code snippets from various sources and prompt the model
  862. to generate programming problems inspired by these examples. This allowed us to tap into a wide
  863. range of topics and create a comprehensive set of problem descriptions (Wei et al., 2024).
  864. •Solution generation: Then, we prompt Llama 3 to solve each problem in a given programming
  865. language. We observe that adding general rules of good programming to the prompt improves the
  866. generated solution quality. Also, we find it is helpful to require the model to explain its thought
  867. process in comments.
  868. •Correctness analysis: After generating a solution, it is crucial to recognize that its correctness is
  869. not guaranteed, and including incorrect solutions in the finetuning dataset could harm the model’s
  870. quality. While we do not ensure complete correctness, we develop methods to approximate it. To
achieve this, we extract the source code from the generated solution and apply a combination of
  872. static and dynamic analysis techniques to test its correctness, including:
  873. –Static analysis : We run all generated code through a parser and a linter to ensure syntactic
  874. correctness, catching errors such as syntax errors, use of uninitialized variables or non-imported
  875. functions, code style issues, typing errors, and others.
  876. –Unit test generation and execution : For each problem and solution, we prompt the model
  877. to generate unit tests, executed in a containerized environment together with the solution,
  878. catching run-time execution errors and some semantic errors.
  879. •Error feedback and iterative self-correction: When a solution fails at any step, we prompt the
model to revise it. The prompt includes the original problem description, the faulty solution,
and feedback from the parser/linter/tester (stdout, stderr, and return code). After a unit test
  882. execution failure, the model could either fix the code to pass the existing tests or modify its unit
  883. tests to accommodate the generated code. Only dialogs that pass all checks are included in the final
  884. dataset, used for supervised finetuning (SFT). Notably, we observed that about 20% of solutions
  885. were initially incorrect but self-corrected, indicating that the model learned from the execution
  886. feedback and improved its performance.
  887. •Fine-tuning and iterative improvement: The finetuning process is conducted over multiple rounds,
  888. with each round building on the previous one. After each round, the model is improved, generating
  889. higher-quality synthetic data for the next round. This iterative process allows for progressive
  890. refinement and enhancement of the model’s performance.
  891. 2.Synthetic data generation: programming language translation. We observe a performance gap between
  892. major programming languages ( e.g., Python/C++) and less common ones ( e.g., Typescript/PHP). This
  893. is not surprising as we have less training data for less common programming languages. To mitigate
  894. this, we supplement our existing data by translating data from common programming languages to
  895. less common languages (similar to Chen et al. (2023) in the context of reasoning). This is achieved
  896. by prompting Llama 3 and ensuring quality via syntax parsing, compilation, and execution. Figure 8
  897. demonstrates an example of synthetic PHP code translated from Python. This improves performance
  898. significantly for less common languages as measured by the MultiPL-E (Cassano et al., 2023) benchmark.
  899. 3.Synthetic data generation: backtranslation. To improve certain coding capabilities (e.g., documentation,
  900. explanations) where execution feedback is less informative for determining quality, we employ an
alternative multi-step approach. Using this procedure, we generated approximately 1.2M synthetic
dialogs related to code explanation, generation, documentation, and debugging.
Figure 8 Code translation example. We display an example of using Llama 3 to translate Python code (left) to PHP
code (right) to augment our SFT dataset with a wider range of programming languages.
Figure 9 Improving generated code quality with system prompts. Left: without system prompt. Right: with system prompt.
Beginning with code snippets from a variety of languages in our pre-training data:
  908. •Generate: We prompt Llama 3 to generate data that represents our target capability (e.g., we add
  909. comments and docstrings for the code snippet, or we ask the model to explain a piece of code).
  910. •Backtranslate: We then prompt the model to “backtranslate” the synthetically generated data to
  911. the original code (e.g., we prompt the model to generate code only from its documentation, or we
  912. ask the model to generate code only from its explanation).
•Filter: Using the original code as a reference, we prompt Llama 3 to determine the quality of
  914. the output (e.g., we ask the model how faithful the backtranslated code is to the original). We
  915. then use the generated examples that have the highest self-verification scores in SFT.
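As referenced earlier, the following is a schematic sketch of the execution-feedback loop for approach 1; the functions generate_solution, generate_unit_tests, and revise stand in for model calls, and sandboxing in a container is omitted for brevity.

    # Statically check a generated solution, run model-written unit tests, and re-prompt
    # the model with the error output whenever a step fails.
    import ast, os, subprocess, sys, tempfile

    def passes_static_analysis(code):
        try:
            ast.parse(code)                               # syntactic correctness only
            return True
        except SyntaxError:
            return False

    def run_unit_tests(solution, tests, timeout=10):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(solution + "\n\n" + tests)
            path = f.name
        try:
            proc = subprocess.run([sys.executable, path], capture_output=True,
                                  text=True, timeout=timeout)
            return proc.returncode == 0, proc.stdout + proc.stderr
        except subprocess.TimeoutExpired:
            return False, "timeout"
        finally:
            os.unlink(path)

    def build_dialog(problem, generate_solution, generate_unit_tests, revise, max_rounds=3):
        solution = generate_solution(problem)
        tests = generate_unit_tests(problem, solution)
        for _ in range(max_rounds):
            if passes_static_analysis(solution):
                ok, feedback = run_unit_tests(solution, tests)
                if ok:
                    return problem, solution              # keep only dialogs that pass all checks
            else:
                feedback = "syntax error"
            solution = revise(problem, solution, feedback)
        return None                                       # discard if never corrected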
  916. System prompt steering during rejection sampling. During the rejection sampling process, we used code specific
system prompts to improve code readability, documentation, thoroughness, and specificity. Recall from
Section 7 that this data is used to finetune the language model. Figure 9 shows an example of how the system
  919. prompt helps improve the generated code quality — it adds necessary comments, uses more informative
  920. variable names, saves memory, etc.
  921. Filtering training data with execution and model-as-judge signals. As described in Section 4.2.3, we occasionally
  922. encounter quality issues in our rejection-sampled data, such as code blocks containing bugs. Detecting these
issues in our rejection-sampled data is not as straightforward as it is for our synthetic code data, as the
  924. rejection-sampled responses typically contain a mix of natural language and code for which the code may not
  926. always be expected to be executable. (For example, user prompts may explicitly ask for pseudo-code or edits to
  927. only a very small snippet of an executable program.) To address this, we utilize the “model-as-judge” approach,
  928. where earlier versions of Llama 3 assess and assign a binary (0/1) score based on two criteria: code correctness
  929. and code style. We retain only those samples that achieve a perfect score of 2. Initially, this stringent filtering
  930. led to a regression in downstream benchmark performance, primarily because it disproportionately removed
  931. examples with challenging prompts. To counteract this, we strategically revise the responses of some coding
data categorized as most challenging until they meet the Llama-based “model-as-judge” criteria. By refining
  933. these challenging problems, the coding data achieves a balance between quality and difficulty, resulting in
  934. optimal downstream performance.
  935. 4.3.2 Multilinguality
  936. We describe how we improve Llama 3’s multilingual capabilities, including training an expert specialized on
  937. substantially more multilingual data, sourcing and generating high quality multilingual instruction tuning
  938. data for German, French, Italian, Portuguese, Hindi, Spanish, and Thai, and tackling specific challenges of
  939. multilingual language steering to enhance the overall performance of our model.
  940. Expert training. Our Llama 3 pre-training data mix contains significantly more English tokens than non-English
  941. tokens. To collect higher quality human annotations in non-English languages, we train a multilingual expert by
  942. branching off the pre-training run and continuing to pre-train on a data mix that consists of 90% multilingual
  943. tokens. We then perform post-training on this expert following Section 4.1. This expert model is then used to
  944. collect higher quality annotations in non-English languages until pre-training was fully complete.
  945. Multilingual data collection. Our multilingual SFT data is derived primarily from sources described below. The
  946. overall distribution is 2.4% human annotations, 44.2% data from other NLP tasks, 18.8% rejection sampled
  947. data, and 34.6% translated reasoning data.
  948. •Human annotations : We collect high-quality, manually annotated data from linguists and native speakers.
  949. These annotations mostly consist of open-ended prompts that represent real world use cases.
•Data from other NLP tasks: To further augment our data, we use multilingual training data from other tasks
and rewrite it into dialog format. For example, we use data from exams-qa (Hardalov et al., 2020)
  952. and Conic10k (Wu et al., 2023). To improve language alignment, we also use parallel texts from
  953. GlobalVoices (Prokopidis et al., 2016) and Wikimedia (Tiedemann, 2012). We use LID based filtering
  954. and Blaser2.0 (Seamless Communication et al., 2023) to remove low quality data. For parallel text data,
  955. instead of using the bitext pairs directly, we apply a multilingual template inspired by Wei et al. (2022a)
  956. to better simulate real-life conversations in translation and language learning scenarios.
  957. •Rejection sampled data : We apply rejection sampling on our human annotated prompts to generate
  958. high-quality samples for finetuning, with few modifications compared to the process for English data:
  959. –Generation : We explored randomly choosing the temperature hyperparameter from the range
0.2−1 for diverse generations in early rounds of post-training. With high temperature, responses
  961. for multilingual prompts can get creative and inspiring, but are also susceptible to unnecessary
  962. or unnatural code-switching. In the final round of post-training, we use a constant value of 0.6
  963. to balance the trade-off. Additionally, we used specialized system prompts to improve response
  964. format, structure and general readability.
  965. –Selection : Prior to reward model based selection, we implement multilingual-specific checks to
  966. ensure high language-match rate between the prompt and response (e.g., a romanized Hindi prompt
  967. should not expect a response in Hindi Devanagari script).
  968. •Translated data : We try to avoid using machine-translated data to finetune the model in order to
  969. prevent translationese (Bizzoni et al., 2020; Muennighoff et al., 2023) or possible name bias (Wang
  970. et al., 2022a), gender bias (Savoldi et al., 2021), or cultural bias (Ji et al., 2023). Moreover, we aim to
  971. prevent the model from being exposed only to tasks that are rooted in English cultural context, which
  972. may not be representative of the linguistic and cultural diversity we aim to capture. We made one
  973. exception to this and translated our synthetic quantitative reasoning data (see Section 4.3.3 for details)
  974. to improve performance in quantitative reasoning in non-English languages. Due to the simple nature of
  976. the language in these math problems, the translated samples were found to have little to no quality
  977. issues. We observed strong gains on MGSM (Shi et al., 2022) from adding this translated data.
  978. 4.3.3 Math and Reasoning
  979. We define reasoning as the ability to perform multi-step computations and arrive at the correct final answer.
  980. Several challenges guide our approach to training models that excel in mathematical reasoning:
  981. •Lack of prompts : As the complexity of questions increases, the number of valid prompts or questions
  982. for Supervised Fine-Tuning (SFT) decreases. This scarcity makes it difficult to create diverse and
  983. representative training datasets for teaching models various mathematical skills (Yu et al., 2023; Yue
  984. et al., 2023; Luo et al., 2023; Mitra et al., 2024; Shao et al., 2024; Yue et al., 2024b).
  985. •Lack of ground truth chain of thought : Effective reasoning requires a step-by-step solution to facilitate
  986. the reasoning process (Wei et al., 2022c). However, there is often a shortage of ground truth chains of
thought, which are essential for guiding the model in how to break down the problem step-by-step and
  988. reach the final answer (Zelikman et al., 2022).
  989. •Incorrect intermediate steps : When using model-generated chains of thought, the intermediate steps
  990. may not always be correct (Cobbe et al., 2021; Uesato et al., 2022; Lightman et al., 2023; Wang et al.,
  991. 2023a). This inaccuracy can lead to incorrect final answers and needs to be addressed.
  992. •Teaching models to use external tools : Enhancing models to utilize external tools, such as code interpreters,
  993. allows them to reason by interleaving code and text (Gao et al., 2023; Chen et al., 2022; Gou et al.,
  994. 2023). This capability can significantly improve their problem-solving abilities.
  995. •Discrepancy between training and inference : There is often a discrepancy between how the model is
  996. finetuned during training and how it is used during inference. During inference, the finetuned model may
  997. interact with humans or other models, requiring it to improve its reasoning using feedback. Ensuring
  998. consistency between training and real-world usage is crucial for maintaining reasoning performance.
  999. To address these challenges, we apply the following methodologies:
•Addressing the lack of prompts: We source relevant pre-training data from mathematical contexts and
convert it into a question-answer format which can then be used for supervised finetuning. Additionally,
we identify mathematical skills where the model under-performs and actively source prompts from
  1003. humans to teach models such skills. To facilitate this process, we create a taxonomy of mathematical
  1004. skills (Didolkar et al., 2024) and ask humans to provide relevant prompts/questions accordingly.
  1005. •Augmenting training data with step-wise reasoning traces : We use Llama 3 to generate step-by-step
  1006. solutions for a set of prompts. For each prompt, the model produces a variable number of generations.
These generations are then filtered based on the correct answer (Li et al., 2024a); a minimal sketch of this filtering appears after this list. We also do self-
  1008. verification where Llama 3 is used to verify whether a particular step-by-step solution is valid for a given
  1009. question. This process improves the quality of the finetuning data by eliminating instances where the
  1010. model does not produce valid reasoning traces.
  1011. •Filtering incorrect reasoning traces : We train outcome and stepwise reward models (Lightman et al., 2023;
  1012. Wang et al., 2023a) to filter training data where the intermediate reasoning steps were incorrect. These
  1013. reward models are used to eliminate data with invalid step-by-step reasoning, ensuring high-quality
  1014. data for finetuning. For more challenging prompts, we use Monte Carlo Tree Search (MCTS) with
  1015. learned step-wise reward models to generate valid reasoning traces, further enhancing the collection of
  1016. high-quality reasoning data (Xie et al., 2024).
  1017. •Interleaving code and text reasoning : We prompt Llama 3 to solve reasoning problems through a
  1018. combination of textual reasoning and associated Python code (Gou et al., 2023). Code execution is used
  1019. as a feedback signal to eliminate cases where the reasoning chain was not valid, ensuring the correctness
  1020. of the reasoning process.
  1021. •Learning from feedback and mistakes : To simulate human feedback, we utilize incorrect generations ( i.e.,
  1022. generations leading to incorrect reasoning traces) and perform error correction by prompting Llama 3 to
  1024. yield correct generations (An et al., 2023b; Welleck et al., 2022; Madaan et al., 2024a). The iterative
  1025. process of using feedback from incorrect attempts and correcting them helps improve the model’s ability
  1026. to reason accurately and learn from its mistakes.
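As noted above, the following is a minimal sketch of the answer-based filtering of generated reasoning traces; the answer-extraction heuristic is deliberately simple and only illustrative.

    # Keep only generations whose final extracted answer matches the reference answer.
    import re

    def extract_final_answer(trace):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", trace)
        return numbers[-1] if numbers else None

    def filter_traces(traces, reference_answer):
        return [t for t in traces if extract_final_answer(t) == str(reference_answer)]

    traces = ["Step 1: 3*4=12. Step 2: 12+5=17. Answer: 17",
              "I think the answer is 21"]
    print(filter_traces(traces, 17))                      # keeps only the first trace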
  1027. 4.3.4 Long Context
  1028. During the final pre-training stage, we extend the context length of Llama 3 from 8K tokens to 128K tokens
  1029. (see Section 3.4 for more details). Similar to pre-training, we find that during finetuning we must carefully
  1030. tune the recipe to balance short and long-context capabilities.
  1031. SFT and synthetic data generation. Naively applying our existing SFT recipe with only short-context data
  1032. resulted in significant regressions in long-context capabilities from pre-training, highlighting the need to
  1033. incorporate long-context data in our SFT data mix. In practice, however, it is largely impractical to get humans
  1034. to annotate such examples due to the tedious and time-consuming nature of reading lengthy contexts, so we
  1035. predominantly rely on synthetic data to fill this gap. We use earlier versions of Llama 3 to generate synthetic
  1036. data based on the key long-context use-cases: (possibly multi-turn) question-answering, summarization for
  1037. long documents, and reasoning over code repositories, and describe them in greater detail below.
  1038. •Question answering: We carefully curate a set of long documents from our pre-training mix. We split
these documents into chunks of 8K tokens, and prompt an earlier version of the Llama 3 model to
  1040. generate QA pairs conditional on randomly selected chunks. During training, the whole document is
  1041. used as context.
  1042. •Summarization: We applied hierarchical summarization of long-context documents by first summarizing
  1043. the chunks of 8K input length using our strongest Llama 3 8K context model and then summarizing
  1044. the summaries. During training we provide the full document and prompt the model to summarize the
  1045. document while preserving all the important details. We also generate QA pairs based on the summaries
  1046. of the documents and prompt the model with questions that require global understanding of the whole
  1047. long document.
•Long context code reasoning: We parse Python files to identify import statements and determine their
  1049. dependencies. From here, we select the most commonly depended-upon files, specifically those referenced
  1050. by at least five other files. We remove one of these key files from a repository and prompt the model to
  1051. identify which files depended on the missing file and to generate the necessary missing code.
  1052. We further categorize these synthetically generated samples based on the sequence length (16K, 32K, 64K
  1053. and 128K) to enable more fine-grained targeting of input lengths.
  1054. Through careful ablations, we observe that mixing 0.1% of synthetically generated long-context data with the
  1055. original short-context data optimizes the performance across both short-context and long-context benchmarks.
  1056. DPO.We observe that using only short context training data in DPO did not negatively impact long-context
  1057. performance as long as the SFT model is high quality in long context tasks. We suspect this is due to the
  1058. fact that our DPO recipe has fewer optimizer steps than SFT. Given this finding, we keep the standard
  1059. short-context recipe for DPO on top of our long-context SFT checkpoints.
  1060. 4.3.5 Tool Use
  1061. Teaching LLMs to use tools such as search engines or code interpreters hugely expands the range of tasks
  1062. they can solve, transforming them from pure chat models into more general assistants (Nakano et al., 2021;
  1063. Thoppilan et al., 2022; Parisi et al., 2022; Gao et al., 2023; Mialon et al., 2023a; Schick et al., 2024). We train
  1064. Llama 3 to interact with the following tools:
  1065. •Search engine. Llama 3 is trained to use Brave Search7to answer questions about recent events that go
  1066. beyond its knowledge cutoff or that require retrieving a particular piece of information from the web.
  1067. •Python interpreter. Llama 3 can generate and execute code to perform complex computations, read files
  1068. uploaded by the user and solve tasks based on them such as question answering, summarization, data
  1069. analysis or visualization.
  1070. 7https://brave.com/search/api/
  1072. •Mathematical computational engine. Llama 3 can use the Wolfram Alpha API8to more accurately solve
  1073. math, science problems, or retrieve accurate information from Wolfram’s database.
  1074. The resulting model is able to use these tools in a chat setup to solve the user’s queries, including in multi-turn
  1075. dialogs. If a query requires multiple tool calls, the model can write a step-by-step plan, call the tools in
  1076. sequence, and do reasoning after each tool call.
  1077. We also improve Llama 3’s zero-shot tool use capabilities — given in-context, potentially unseen tool definitions
  1078. and a user query, we train the model to generate the correct tool call.
  1079. Implementation. We implement our core tools as Python objects with different methods. Zero-shot tools can
  1080. be implemented as Python functions with descriptions, documentation ( i.e., examples for how to use them),
  1081. and the model only needs the function’s signature and docstring as context to generate the appropriate call.
  1082. We also convert function definitions and calls to JSON format, e.g., for web API calls. All tool calls are
executed by the Python interpreter, which must be enabled in the Llama 3 system prompt. Core tools can be
  1084. individually enabled or disabled in the system prompt.
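An illustrative sketch of this setup follows; it is not our internal implementation, and the tool, its schema fields, and the dispatch convention below are invented for the example.

    # Expose a zero-shot tool as a Python function whose signature and docstring are
    # converted into a JSON definition placed in the model's context.
    import inspect, json

    def get_weather(city: str, unit: str = "celsius") -> str:
        """Return the current temperature for a city."""
        return f"22 degrees {unit} in {city}"             # stub implementation

    def to_json_definition(fn):
        sig = inspect.signature(fn)
        return json.dumps({
            "name": fn.__name__,
            "description": inspect.getdoc(fn),
            "parameters": {name: p.annotation.__name__ for name, p in sig.parameters.items()},
        }, indent=2)

    print(to_json_definition(get_weather))
    # A model-produced call such as {"name": "get_weather", "arguments": {"city": "Paris"}}
    # is then dispatched and executed by the Python interpreter.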
  1085. Data collection. Different from Schick et al. (2024), we rely on human annotations and preferences to teach
  1086. Llama 3 to use tools. There are two main differences with the post-training pipeline generally used in Llama 3:
  1087. •For tools, dialogs often contain more than a single assistant message (e.g., calling the tool and reasoning
  1088. about the tool output). Thus, we annotate at the message level to collect granular feedback: annotators
  1089. provide a preference between two assistant messages with the same context or, if both contain major
  1090. problems, edit one of the messages. The chosen or edited message is then added to the context and the
  1091. dialog continues. This provides human feedback for both the assistant’s ability of calling the tools and
  1092. reasoning about the tool outputs. Annotators cannot rank or edit the tool outputs.
  1093. •We do not perform rejection sampling, as we did not observe gains in our tool benchmarks.
  1094. To accelerate the annotation process, we start by bootstrapping basic tool use capabilities by finetuning on
  1095. synthetically generated data from previous Llama 3 checkpoints. Thus, annotators have fewer edits to perform.
  1096. In a similar spirit, as Llama 3 gradually improves through its development, we progressively complexify our
  1097. human annotation protocols: we start by single-turn tool use annotations, before moving to tool use in dialogs,
  1098. and finally annotating for multi-step tool use and data analysis.
  1099. Tool datasets. To create data for tool usage applications, we leverage the following procedure:
  1100. •Single-step tool use: We start by few-shot generation of synthetic user prompts which, by construction,
  1101. require a call to one of our core tools (for example, questions that exceed our knowledge cutoff date).
  1102. Then, still relying on few-shot generation, we generate appropriate tool calls for these prompts, execute
  1103. them, and add the output to the model’s context. Finally, we prompt the model again to generate a
  1104. final answer to the user’s query based on the tool output. We end up with trajectories of the following
form: system prompt, user prompt, tool call, tool output, final answer. We also filter out around 30% of this
dataset to remove tool calls that cannot be executed or other formatting issues.
  1107. •Multi-step tool use: We follow a similar protocol and first generate synthetic data to teach the model
  1108. basic multi-step tool use capabilities. To do this, we first prompt Llama 3 to generate user prompts
  1109. that require at least two tool calls, that can be the same or different tools from our core set. Then,
  1110. conditioned on these prompts, we few-shot prompt Llama 3 to generate a solution consisting of interleaved
  1111. reasoning steps and tool calls, similar to ReAct (Yao et al., 2022). See Figure 10 for an example of
  1112. Llama 3 performing a task involving multi-step tool usage.
  1113. •File uploads: We annotate for the following filetypes: .txt, .docx, .pdf, .pptx, .xlsx, .csv, .tsv,
  1114. .py, .json, .jsonl, .html, .xml . Our prompts are based on a provided file, and ask to summarize the
  1115. contents of the file, find and fix bugs, optimize a piece of code, perform data analysis or visualization.
  1116. See Figure 11 for an example of Llama 3 performing a task involving a file upload.
8 https://products.wolframalpha.com/llm-api/documentation
Figure 10 Multi-step tool usage. Example of Llama 3 performing multi-step planning, reasoning, and tool calling to solve a task.
After finetuning on this synthetic data, we gather human annotations in diverse and challenging scenarios
including multi-turn interactions, more than three step tool use, and instances where a tool call does not yield
a satisfying answer. We augment our synthetic data with different system prompts to teach the model to use
  1124. tools only when activated. To train the model to avoid calling tools for simple queries, we also add queries
  1125. from easy math or question answering datasets (Berant et al., 2013; Koncel-Kedziorski et al., 2016; Joshi
  1126. et al., 2017; Amini et al., 2019) and their responses without tools, but with tools activated in system prompt.
  1127. Zero-shot tool use data. We improve Llama 3 zero-shot tool use abilities (also referred to as function calling)
  1128. by finetuning on a large and diverse set of partly synthetic (functions definitions, user query, corresponding
  1129. call) tuples. We evaluate our model on a set of unseen tools.
•Single, nested, and parallel function calling: Calls can be simple, nested, i.e., we pass a function call as an
argument of another function, or parallel, i.e., the model returns a list of independent function calls.
Generating a diverse set of functions, queries and ground truths can be challenging (Mekala et al., 2024),
and we resort to mining the Stack (Kocetkov et al., 2022) to ground our synthetic user queries in real
functions. More precisely, we extract function calls and their definitions, clean and filter them, e.g., for
  1135. missing docstrings or non-executable functions, and use Llama 3 to generate a natural language query
  1136. corresponding to the function call.
  1137. •Multi-turn function calling: We also generate synthetic data for multi-turn dialogs with function calls,
  1138. following a protocol similar to the one proposed in Li et al. (2023b). We use multiple agents that
  1139. generate domains, APIs, user queries, API calls, and responses, while also ensuring that the generated
  1140. data covers a set of diverse domains and realistic APIs. All agents are variants of Llama 3 prompted in
  1141. different ways depending on their roles and collaborate in a step-by-step manner.
  1142. 4.3.6 Factuality
  1143. Hallucinations remain a major challenge for large language models. Models tend to be overconfident, even in
  1144. domains where they have little knowledge. Despite these shortcomings, they are often used as knowledge bases,
  1145. which can lead to risky outcomes such as the spread of misinformation. While we recognize that factuality
  1146. can go beyond hallucinations, we took a hallucination-first approach here.
  1148. Figure 11 Processing file uploads. Example of Llama 3 performing analysis and visualization of an uploaded file.
  1149. We follow the principle that post-training should align the model to “know what it knows” rather than add
  1150. knowledge (Gekhman et al., 2024; Mielke et al., 2020). Our primary approach involves generating data that
  1151. aligns model generations with subsets of factual data present in the pre-training data. To achieve this, we
  1152. develop a knowledge probing technique that takes advantage of Llama 3’s in-context abilities. This data
  1153. generation process involves the following procedure:
  1154. 1.Extract a data snippet from the pre-training data.
  1155. 2.Generate a factual question about these snippets (context) by prompting Llama 3.
  1156. 3.Sample responses from Llama 3 to the question.
  1157. 4.Score the correctness of the generations using the original context as a reference and Llama 3 as a judge.
  1158. 5.Score the informativeness of the generations using Llama 3 as a judge.
  1159. 6.Generate a refusal for responses which are consistently informative and incorrect across the generations,
  1160. using Llama 3.
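A schematic sketch of this pipeline is given below; generate_question, sample_answers, judge_correct, judge_informative, and write_refusal stand in for prompting Llama 3, and the thresholds are illustrative.

    # Produce a refusal training example when the model is consistently informative
    # but incorrect about a fact from the pre-training snippet.
    def probe_snippet(snippet, generate_question, sample_answers,
                      judge_correct, judge_informative, write_refusal, n_samples=8):
        question = generate_question(snippet)                              # step 2
        answers = sample_answers(question, n=n_samples)                    # step 3
        correct = [judge_correct(a, reference=snippet) for a in answers]   # step 4
        informative = [judge_informative(a) for a in answers]              # step 5
        if sum(informative) / n_samples > 0.8 and sum(correct) / n_samples < 0.2:
            return question, write_refusal(question)                       # step 6
        return None        # otherwise this snippet does not yield a refusal example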
  1161. We use data generated from the knowledge probe to encourage the model to only answer questions which it
  1162. has knowledge about, and refuse answering those questions