- Modern artificial intelligence systems are powered by foundation models. This paper presents a new set of foundation models called Llama 3. It is a herd of language models that support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. Llama 3 delivers comparable quality to leading language models on a plethora of tasks. We publicly release Llama 3 including pre-trained and post-trained versions of the 405B parameter language model. The paper also presents experiments integrating image, video, and speech capabilities into Llama 3.
- This compositional approach performs competitively with the state of the art on image, video, and speech recognition tasks. The resulting multimodal models are not yet being broadly released, as they are still under development.
- Foundation models are general models of language, vision, speech, and/or other modalities that are designed to support a large variety of AI tasks. They form the basis of many modern AI systems.
- The development of modern foundation models consists of two main stages: a pre-training stage and a post-training stage.
- In this paper, we present a new set of foundation models for language, called Llama 3.
- The Llama 3 Herd of models natively supports multilinguality, coding, reasoning, and tool usage.
- Our largest model is a dense Transformer with 405B parameters, processing information in a context window of up to 128K tokens. We believe there are three key levers in the development of high-quality foundation models: data, scale, and managing complexity. We seek to optimize for these three levers in our development process.
- We improved the quantity and quality of the data we use for pre-training and post-training compared to prior versions of Llama. This includes more careful pre-processing and curation pipelines for pre-training data and more rigorous quality assurance and filtering approaches for post-training data. We pre-train Llama 3 on a corpus of about 15T multilingual tokens.
- We train a model at a far larger scale than previous Llama models. Our flagship language model has 405B trainable parameters and was pre-trained on 15.6T text tokens, using significantly more compute than the largest version of Llama 2. All results in this paper are for the Llama 3.1 models.
- We scale our model to an approximately compute-optimal size for our training budget. We also train smaller models for longer than is compute-optimal. These models perform better than compute-optimal models at the same inference budget. We use the flagship model to improve the quality of smaller models after training.
- We make design choices to maximize our ability to scale the model development process. We use a standard dense Transformer model architecture with minor adaptations. We adopt a simple post-training procedure based on supervised finetuning, rejection sampling, and direct preference optimization.
- Our work results in Llama 3, a herd of three multilingual language models with 8B, 70B, and 405B parameters. We evaluate Llama 3 on a large number of benchmark datasets and perform extensive human evaluations comparing it to competing models. The flagship model performs on par with leading language models across a variety of tasks and is close to matching the state of the art. Our smaller models outperform alternative models with a similar number of parameters. Llama 3 also offers a better balance between helpfulness and harmlessness than its predecessor.
- We present a detailed analysis of the safety of Llama 3 in Section 5.4. We are publicly releasing all three Llama 3 models under an updated version of the Llama 3 Community License, including pre-trained and post-trained versions of our 405B parameter language model and a new version of our Llama Guard model for input and output safety. We hope that the open release of the flagship model will spur innovation in the research community and accelerate a responsible path towards the development of artificial general intelligence. As part of the development process, we also developed multimodal extensions to the models, enabling image recognition, video recognition, and speech understanding capabilities.
- Llama 3 8B and Llama 3 70B were pre-trained on multilingual data; these models are intended primarily for use in English.
- Table (benchmark results): performance of finetuned Llama 3 models (8B, 70B, 405B) and of Gemma 2 9B, Mistral 7B, Mixtral 8x22B, GPT-3.5 Turbo, Nemotron 4 340B, GPT-4, and Claude 3.5 Sonnet on key benchmarks, grouped by category: General (MMLU, MMLU 0-shot CoT, MMLU-Pro 5-shot CoT, IFEval), Code (HumanEval 0-shot, MBPP EvalPlus 0-shot), Math (GSM8K, MATH 0-shot CoT), Reasoning (ARC Challenge, GPQA 0-shot CoT), and Tool use (BFCL).
- The development of our Llama 3 language models comprises two main stages:
- Language model pre-training. We start by converting a large, multilingual text corpus to discrete tokens and pre-training a large language model on the resulting data to perform next-token prediction.
- Pre-training a large language model on text data enables it to learn the structure of language and to obtain large amounts of knowledge about the world. This pre-training is performed at massive scale: a model with 405B parameters pre-trained on 15.6T tokens. After the initial pre-training, the model undergoes continued pre-training to increase the supported context window.
- The pre-trained model then requires post-training to align with human expectations and follow instructions. This involves supervised finetuning and Direct Preference Optimization.
- We integrate new capabilities such as tool-use at the post-training stage, resulting in strong improvements in areas like coding and reasoning. The models can answer questions in multiple languages, write high-quality code, solve complex problems, and use tools. We also add image, video, and speech capabilities to Llama 3 using a compositional approach. This approach includes training separate encoders for images and speech, teaching the model the relationship between visual content and text.
- We use the term post-training to refer to any model training that happens outside of pre-training. Llama 3 is a Transformer language model trained to predict the next token of a textual sequence. For speech, we train an encoder using a self-supervised approach that masks out parts of the speech inputs and tries to reconstruct the masked-out parts. For images, we train an adapter that integrates a pre-trained image encoder into the pre-trained language model.
- This adapter aligns the image representations with the language representations. During adapter training, we also update the parameters of the image encoder, and we train a video adapter on paired video-text data to aggregate information across frames. Finally, we integrate the speech encoder into the model via an adapter that converts speech encodings into token representations that can be fed directly into the finetuned language model; the parameters of the adapter and encoder are jointly updated for high-quality speech understanding. Our multimodal experiments lead to models that can recognize the content of images and videos.
- We create a dataset for language model pre-training from various data sources up to 2023. We apply de-duplication methods and data cleaning to obtain high-quality tokens. We remove data containing personally identifiable information and adult content.
- We obtain much of our data from the web and describe our cleaning process below.
- Our cleaning process involves PII and safety filtering. We remove data from websites that may contain unsafe content or high volumes of personally identifiable information. We also filter out domains ranked as harmful by Meta's safety standards and those known to contain adult content.
- Text extraction and cleaning are key steps. We process raw HTML content to extract high-quality, diverse text. Our custom parser is designed to optimize for precision in removing boilerplate and ensuring content recall. We evaluate its quality through human evaluations, comparing it to popular third-party HTML parsers, and find it performs well. We handle HTML pages with math and code content to preserve their structure. We keep image alt attribute text, as math content is often represented this way.
- We experimentally evaluate different cleaning configurations. We find that markdown is harmful to the performance of a model trained on web data compared to plain text, so we remove all markdown markers.
- We apply several rounds of de-duplication at the URL, document, and line level.
- We keep the most recent version for pages corresponding to each URL.
- We perform global de-duplication across the entire dataset to remove near duplicate documents.
- We remove lines that appear more than 6 times in each bucket of 30M documents (a minimal sketch of this step appears below).
- Our manual qualitative analysis shows that line-level de-duplication removes leftover boilerplate from various websites, such as navigation menus and cookie warnings.
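- As a concrete illustration of the line-level de-duplication step above, here is a minimal sketch; the bucket size and the threshold of 6 come from the text, while the function and variable names are hypothetical:

```python
from collections import Counter

LINE_FREQ_THRESHOLD = 6  # drop lines appearing more than 6 times within a bucket

def dedup_lines(bucket_documents):
    """Remove over-represented lines within one bucket of (up to 30M) documents.

    Counting each line once per document is an assumption; the production pipeline
    may count raw occurrences instead.
    """
    line_counts = Counter()
    for doc in bucket_documents:
        line_counts.update(set(doc.splitlines()))

    cleaned = []
    for doc in bucket_documents:
        kept = [line for line in doc.splitlines()
                if line_counts[line] <= LINE_FREQ_THRESHOLD]
        cleaned.append("\n".join(kept))
    return cleaned
```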
- We develop heuristics to remove low-quality documents and outliers. Some examples include using duplicated n-gram coverage ratio to remove repeated content and "dirty word" counting to filter out adult websites. We also use a token-distribution Kullback-Leibler divergence to filter out documents with excessive outlier tokens. Further, we experiment with model-based quality classifiers to sub-select documents.
- These include fast classifiers such as fasttext, trained to recognize whether a given text would be referenced by Wikipedia, as well as Roberta-based classifiers trained on Llama 2 predictions. To train a quality classifier based on Llama 2, we create a training set of cleaned web documents, describe the quality requirements, and instruct Llama 2 to determine whether the documents meet these requirements. For efficiency, we use DistilRoberta to generate quality scores for each document. We experimentally evaluate the efficacy of various quality filtering configurations. We also build domain-specific pipelines that extract code- and math-relevant web pages, using DistilRoberta models trained on web data annotated by Llama 2.
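- A minimal sketch of how such a fasttext-based quality filter might look; the label scheme, training-file name, and confidence threshold are illustrative assumptions, not details from the text:

```python
import fasttext

# Hypothetical training file: one document per line, labeled "__label__hq" if the
# text is the kind that would be referenced by Wikipedia, "__label__lq" otherwise.
model = fasttext.train_supervised(input="quality_train.txt", epoch=5, wordNgrams=2)

def keep_document(text: str, threshold: float = 0.9) -> bool:
    """Keep a web document only if the classifier is confident it is high quality."""
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__hq" and probs[0] >= threshold
```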
- We conduct prompt tuning to target web pages with math deduction and STEM reasoning. The pipeline implements domain-specific HTML extraction and customized text features to handle code and math. We also filter out websites with PII or unsafe content. Our multilingual text processing pipeline has several features including language identification and de-duplication. We use language-specific heuristics and filters to remove low-quality documents.
- We determine the amount of multilingual tokens used in pre-training experimentally to balance model performance on English and multilingual benchmarks.
- To obtain a high-quality language model we carefully determine the proportion of different data sources in the pre-training data mix.
- We use a classifier to categorize the types of information in our web data to determine a data mix and downsample over-represented categories.
- We also perform scaling law experiments to determine the best data mix: we train several small models on a candidate data mix and use them to predict the performance of a large model on that mix.
- We repeat this process multiple times for different data mixes to select a new data mix candidate. We then train a larger model on this candidate data mix and evaluate its performance on several key benchmarks. Our final data mix contains roughly 50% general knowledge tokens, 25% mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens. We find that annealing on small amounts of high-quality code and mathematical data can boost the performance of pre-trained models on key benchmarks. We perform annealing with a data mix that upsamples high-quality data in select domains, excluding any training sets from commonly used benchmarks so that we can assess the model's true few-shot learning capabilities and out-of-domain generalization.
- Annealing on the GSM8k and MATH training sets boosts the performance of a pre-trained Llama 3 8B model by 24.0% on GSM8k and 6.4% on MATH. However, the improvements for the 405B model are negligible, suggesting that the flagship model already has strong in-context learning and reasoning capabilities. Annealing also allows us to judge the value of small domain-specific datasets and thereby assess data quality.
- Evaluating new data sources via annealing is more efficient than performing scaling law experiments for every small dataset.
- Llama 3 uses a standard dense Transformer architecture.
- We make a few small modifications compared to Llama 2:
- We use grouped query attention with 8 key-value heads to improve inference speed.
- We use an attention mask that prevents self-attention between different documents within the same sequence.
- The 8B model has 32 layers and a model dimension of 4,096.
- We use a vocabulary with 128,000 tokens. Our token vocabulary combines 100,000 tokens from the tiktoken tokenizer with 28,000 additional tokens to support non-English languages. This improves compression rates on English data from 3.17 to 3.94 characters per token, allowing the model to read more text for the same amount of training compute. Adding 28,000 tokens from select non-English languages also improves compression ratios and downstream performance without impacting English tokenization. We increase the RoPE base frequency hyperparameter to 500,000.
- This enables us to better support longer contexts; a value of 500,000 is effective for context lengths up to 32,768, as shown by Xiong et al. (2023). Llama 3 405B uses an architecture with 126 layers, a token representation dimension of 16,384, and 128 attention heads. The model size is approximately compute-optimal according to scaling laws on our data for our training budget. We develop scaling laws to determine the optimal model size for our flagship model given our pre-training compute budget. A major challenge is forecasting the flagship model's performance on downstream benchmark tasks, because existing scaling laws typically predict next-token prediction loss rather than benchmark performance and can be noisy and unreliable when developed from small compute budgets.
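- Collecting the architecture hyper-parameters stated above into one place, here is an illustrative configuration sketch for the 405B model (field names are hypothetical; only values mentioned in the text are included):

```python
from dataclasses import dataclass

@dataclass
class Llama3_405BConfig:
    n_layers: int = 126            # Transformer layers
    d_model: int = 16_384          # token representation dimension
    n_heads: int = 128             # attention heads
    n_kv_heads: int = 8            # grouped query attention key/value heads
    vocab_size: int = 128_000      # 100K tiktoken tokens + 28K added non-English tokens
    rope_theta: float = 500_000.0  # RoPE base frequency (effective up to 32,768 tokens)
    max_context: int = 131_072     # 128K-token context window after long-context pre-training
```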
- We implement a two-stage methodology to develop scaling laws that predict downstream benchmark performance.
- First, we establish a correlation between a model's negative log-likelihood on downstream tasks and the training FLOPs.
- Next, we correlate negative log-likelihood on downstream tasks with task accuracy, using both scaling law models and older models trained with higher compute FLOPs.
- We use a similar method to select our pre-training data mix.
- We construct our scaling laws by pre-training models with compute budgets between 6×10^18 FLOPs and 10^22 FLOPs.
- At each compute budget, we pre-train models ranging in size from 40M to 16B parameters, using a subset of model sizes at each budget. Training runs use a cosine learning rate schedule with a linear warmup of 2,000 steps. The peak learning rate is set between 2×10^-4 and 4×10^-4 depending on model size, with cosine decay to 0.1 of the peak value. Weight decay is set to 0.1 times the learning rate at each step. A fixed batch size is used for each compute scale, ranging from 250K to 4M tokens. We measure the loss as the negative log-likelihood on a held-out validation set and, at each compute scale, approximate the relation between validation loss and training FLOPs using a second-degree polynomial.
- Figure 3 shows the number of training tokens in the identified compute-optimal models as a function of pre-training compute budget; these optimal models correspond to the parabola minimums in Figure 2. We use them to predict the optimal number of training tokens for a given compute budget, assuming a power-law relation between compute budget and the optimal number of training tokens. The fit yields alpha = 0.53 and A = 0.29.
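- Restating the power-law fit above in equation form, where C is the pre-training compute budget in FLOPs and N*(C) is the predicted compute-optimal number of training tokens (notation ours):

```latex
N^{\star}(C) = A\,C^{\alpha}, \qquad (\alpha, A) = (0.53,\ 0.29)
```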
- Extrapolating this fit to a budget of 3.8×10^25 FLOPs suggests training a 402B parameter model on 16.55T tokens. The performance of the flagship model is relatively robust to small changes in the trade-off between model size and training tokens; we ultimately trained a flagship model with 405B parameters. To forecast the performance of the flagship Llama 3 model on downstream benchmarks, we use the compute-optimal models to correlate the negative log-likelihood of the correct answer with training FLOPs, and then establish a sigmoidal relation between that log-likelihood and accuracy using the scaling-law models trained up to 10^22 FLOPs.
- We show the results of an experiment on the ARC Challenge benchmark. A two-step scaling law prediction was found to be quite accurate, extrapolating over four orders of magnitude, and only slightly underestimating the final performance of the Llama 3 model.
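- The second step of this two-step methodology can be sketched as a simple sigmoidal curve fit; the data points below are hypothetical placeholders, not measurements from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(nll, a, b, c):
    # Accuracy rises as the normalized NLL of the correct answer falls.
    return c / (1.0 + np.exp(a * (nll - b)))

# Hypothetical (normalized NLL, benchmark accuracy) pairs from scaling-law models
# and older models such as Llama 2.
nll = np.array([1.40, 1.35, 1.30, 1.26, 1.22])
acc = np.array([0.42, 0.52, 0.62, 0.72, 0.80])

params, _ = curve_fit(sigmoid, nll, acc, p0=[15.0, 1.4, 0.85], maxfev=10_000)

# Step one (not shown) fits NLL as a function of training FLOPs; plugging its
# prediction for the flagship run into the fitted sigmoid yields the forecast accuracy.
flagship_nll = 1.19  # hypothetical predicted NLL for the flagship compute budget
forecast_accuracy = sigmoid(flagship_nll, *params)
```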
- We describe the hardware and infrastructure that powered Llama 3 pre-training at scale, and discuss several optimizations that led to improvements in training efficiency.
- The Llama 3 models were trained on Meta's production clusters.
- Figure: scaling-law forecast plotting normalized negative log-likelihood per character against ARC Challenge accuracy for Llama 2 models, the scaling-law models, and the scaling-law prediction for Llama 3. This analysis enables us to predict model performance on the ARC Challenge benchmark before pre-training commences.
- Llama 3 405B is trained on up to 16K H100 GPUs, each running at 700W TDP with 80GB HBM3.
- Training jobs are scheduled using MAST, Meta's global-scale training scheduler.
- Storage: a storage fabric is built for Llama 3 pre-training using Tectonic, Meta's distributed file system. It provides 240 PB of storage across 7,500 servers equipped with SSDs, supporting 2 TB/s sustainable throughput and 7 TB/s peak throughput.
- A challenge is handling bursty checkpoint writes that saturate the storage fabric. Checkpointing saves each GPU's model state, ranging from 1 MB to 4 GB per GPU, for recovery and debugging.
- The goal is to minimize GPU pause time during checkpointing and increase checkpoint frequency to reduce lost work after recovery.
- Llama 3 405B uses an RDMA over Converged Ethernet (RoCE) fabric based on Arista 7800 and Minipack2 Open Compute Project rack switches, while smaller models in the Llama 3 family were trained on an Nvidia Quantum2 InfiniBand fabric. Both solutions interconnect GPUs at 400 Gbps.
- We tune clusters with different network technologies to provide equivalent performance for these large training workloads. Our RoCE-based AI cluster comprises 24,000 GPUs connected by a three-layer Clos network. Each rack hosts 16 GPUs connected by a single top-of-rack switch, and 192 racks form a pod of 3,072 GPUs with full bisection bandwidth. Eight such pods are connected to form a cluster of 24,000 GPUs, with some oversubscription of network connectivity at the aggregation layer.
- To minimize network communication across pods, parallelism methods and a training job scheduler are optimized to be aware of network topology. Load balancing is also a challenge due to fat network flows produced by LLM training, which are hard to balance across all available network paths using traditional methods. Two techniques are employed to address this challenge.
- First, our collective library creates 16 network flows between two GPUs instead of just one, reducing the traffic per flow and providing more flows for load balancing.
- We use up to 16K of 24K GPUs for Llama 3 pre-training.
- Table 4 shows scaling configurations and MFU for each stage of Llama 3.
- Second, our Enhanced-ECMP (E-ECMP) protocol effectively balances these flows across different network paths, reducing congestion. Deep-buffer switches in the spine help accommodate the transient congestion and buffering caused by collective communication patterns, limiting the impact of persistent congestion and network back pressure. Better load balancing through E-ECMP significantly reduces congestion, and we successfully run a 24K GPU cluster without traditional congestion control methods. To scale training of our largest models, we use 4D parallelism, a combination of four different types of parallelism.
- Our implementation efficiently distributes computation across many GPUs, ensuring each GPU's model parameters and gradients fit in its memory. We combine tensor parallelism, pipeline parallelism, context parallelism, and data parallelism to achieve this. Tensor parallelism splits weight tensors into chunks on different devices. Pipeline parallelism partitions the model into stages by layers, allowing different devices to process different stages in parallel. Context parallelism divides the input into segments, reducing memory bottlenecks for long inputs.
- We use fully sharded data parallelism (FSDP) to process data in parallel on multiple GPUs and synchronize after each training step. Our implementation shards the model weights, optimizer states, and gradients. For Llama 3, we shard optimizer states and gradients but do not re-shard model weights after the forward computation, which avoids an extra all-gather communication during backward passes. Through tuning, we achieve 38-43% BF16 Model FLOPs Utilization (MFU); there is a slight drop in utilization when using 16K GPUs due to the lower batch size per DP group. We also explored improvements to pipeline parallelism.
- Several challenges exist with current implementations including batch size constraints and memory imbalance. Batch size constraints require the batch size to be divisible by the number of pipeline stages, limiting flexibility in pre-training. Memory imbalance occurs due to uneven resource consumption among stages, with the first stage consuming more memory. Computation imbalance is also a problem, particularly after the last layer of the model where output and loss calculation creates a latency bottleneck.
- GPUs are divided into parallelism groups in the order of TP, CP, PP, DP. In this example, 16 GPUs are configured with a group size of 2 for each group. A GPU's position in 4D parallelism is represented as a vector [D1, D2, D3, D4]. GPU0 and GPU1 are in the same TP group, GPU0 and GPU2 are in the same CP group, GPU0 and GPU4 are in the same PP group, and GPU0 and GPU8 are in the same DP group. To address these issues, we modify our pipeline schedule to set N flexibly, in this case N = 5, allowing for an arbitrary number of micro-batches in each batch. This enables running fewer micro-batches when there's a batch size limit at large scale, or more micro-batches to hide point-to-point communication.
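- A small sketch of how a global GPU rank could map to its [TP, CP, PP, DP] coordinates under the grouping order in this example; this is a generic row-major decomposition consistent with the example above, not necessarily the exact mapping used in training:

```python
def rank_to_4d(rank: int, tp: int, cp: int, pp: int, dp: int) -> list[int]:
    """Map a global GPU rank to [TP, CP, PP, DP] coordinates.

    TP is the innermost (fastest-varying) dimension and DP the outermost.
    """
    tp_idx = rank % tp
    cp_idx = (rank // tp) % cp
    pp_idx = (rank // (tp * cp)) % pp
    dp_idx = (rank // (tp * cp * pp)) % dp
    return [tp_idx, cp_idx, pp_idx, dp_idx]

# With 16 GPUs and a group size of 2 per dimension, as in the example above:
assert rank_to_4d(1, 2, 2, 2, 2) == [1, 0, 0, 0]  # GPU1 shares GPU0's TP group
assert rank_to_4d(2, 2, 2, 2, 2) == [0, 1, 0, 0]  # GPU2 shares GPU0's CP group
assert rank_to_4d(4, 2, 2, 2, 2) == [0, 0, 1, 0]  # GPU4 shares GPU0's PP group
assert rank_to_4d(8, 2, 2, 2, 2) == [0, 0, 0, 1]  # GPU8 shares GPU0's DP group
```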
- We use a schedule that optimizes communication and memory efficiency. To balance the pipeline, we reduce one Transformer layer each from the first and last stages: the first model chunk on the first stage has only the embedding layer, and the last model chunk on the last stage has only the output projection and loss calculation. We use an interleaved schedule to reduce pipeline bubbles. Additionally, we use asynchronous point-to-point communication to speed up training, which is especially helpful when the document mask introduces extra computation imbalance. We also reduce memory usage by proactively deallocating tensors that are not needed for future computation.
- We utilize context parallelism to improve memory efficiency when scaling the context length of Llama 3 and enable training on extremely long sequences. We partition the input sequence into chunks for better load balancing. Each rank receives two chunks, the i-th and the (2×CP−1−i)-th chunks. We use an all-gather based method to compute attention output for the local query tensor chunk.
- We adopt this all-gather based approach for two main reasons. First, it is easier and more flexible to support different types of attention masks, such as the document mask, in all-gather based CP attention. Second, although the all-gather communication latency is exposed in the critical path, the communicated key and value tensors are much smaller than the query tensor due to the use of GQA, so the communication overhead remains small relative to the attention computation.
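- A minimal sketch of the load-balanced chunk assignment described above, where each context-parallel rank owns the i-th and the (2×CP−1−i)-th of 2×CP chunks; helper names are hypothetical:

```python
def cp_chunks_for_rank(rank: int, cp_size: int) -> tuple[int, int]:
    """Return the two chunk indices owned by a context-parallel rank.

    The sequence is split into 2 * cp_size chunks; pairing chunk i with chunk
    (2 * cp_size - 1 - i) balances work, since attention cost grows with position.
    """
    return rank, 2 * cp_size - 1 - rank

def split_for_rank(tokens, cp_size: int, rank: int):
    """Slice a token sequence into the two chunks assigned to this rank."""
    num_chunks = 2 * cp_size
    chunk_len = len(tokens) // num_chunks
    i, j = cp_chunks_for_rank(rank, cp_size)
    return (tokens[i * chunk_len:(i + 1) * chunk_len],
            tokens[j * chunk_len:(j + 1) * chunk_len])

# Example with CP = 4: rank 0 owns chunks (0, 7) and rank 3 owns chunks (3, 4).
assert cp_chunks_for_rank(0, 4) == (0, 7)
assert cp_chunks_for_rank(3, 4) == (3, 4)
```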
- Our goal is to optimize parallelism dimensions for network communication, ordering them as [TP, CP, PP, DP] to balance bandwidth and latency. The innermost parallelism, TP, requires high bandwidth and low latency, typically confined to a single server. In contrast, the outermost parallelism, DP, can span multiple hops and tolerate higher latency by prefetching model weights and reducing gradients asynchronously.
- We developed a memory consumption estimator and a performance-projection tool to explore various parallelism configurations, project overall training performance, and identify memory gaps effectively.
- We fixed several numerical stability issues by comparing training loss between different parallelism setups.
- To ensure training convergence, we use FP32 gradient accumulation during backward computation over multiple micro-batches, and we also reduce-scatter gradients in FP32 across data-parallel workers in FSDP.
- Intermediate tensors used multiple times in forward computation have backward gradients accumulated in FP32.
- Our collective communication library for Llama 3 is based on NCCLX, a fork of Nvidia’s NCCL library, improving performance especially for higher latency networks.
- Parallelism dimensions include TP, CP, PP, and DP, with DP corresponding to FSDP. PP and DP may communicate through a multi-hop network with latency of up to tens of microseconds. The original NCCL collectives require data chunking and staged data copy, leading to inefficiencies such as exchanging many small control messages, extra memory-copy operations, and using extra GPU cycles for communication. To address this for Llama 3 training, we tuned chunking and data transfer to fit our network latencies, which can be as high as tens of microseconds. We also allow small control messages to traverse the network with higher priority, avoiding head-of-line blocking.
- Future versions of NCCLX will make deeper changes to address these problems.
| Component | Category | Interruptions |
|---|---|---|
| GPU | GPU | 148 |
| HBM3 Memory | GPU | 72 |
| Software Bug | – | 54 |
| Network Switch/Cable | Network | 35 |
| Host Maintenance | – | 32 |
| SRAM Memory | GPU | 19 |
| System Processor | GPU | 17 |
| NIC | Host | 7 |
| NCCL Watchdog Timeouts | – | 7 |
| Silent Data Corruption | – | 6 |
| Thermal Interface | GPU | 6 |
| SSD | Host | 3 |
| Power Supply | Host | 3 |
| Server Chassis | Host | 2 |
| IO Expansion Board | Host | 2 |
| Dependency | – | 2 |
| CPU | Host | 2 |
| System Memory | Host | 2 |
- About 78% of interruptions were attributed to hardware issues.
- Reliability and operational challenges: the complexity and potential failure scenarios of 16K GPU training surpass those of much larger CPU clusters we have operated, and the synchronous nature of training makes it less fault-tolerant, since a single GPU failure can require a restart of the entire job.
- We achieved higher than 90% effective training time while supporting automated cluster maintenance. Effective training time measures the time spent on useful training over the elapsed time. During a 54-day snapshot period of pre-training, we experienced a total of 466 job interruptions. Of these, 47 were planned interruptions due to automated maintenance or operator-initiated operations; the remaining 419 were unexpected.
- Approximately 78% of the unexpected interruptions are attributed to confirmed hardware issues, such as GPU or host component failures, or suspected hardware-related issues like silent data corruption and unplanned individual host maintenance events. GPU issues are the largest category, accounting for 58.7% of all unexpected issues. Despite the large number of failures, significant manual intervention was required only three times during this period, with the rest handled by automation. To increase effective training time, we reduced job startup and checkpointing time and developed tools for fast diagnosis and problem resolution. We use PyTorch's NCCL flight recorder to diagnose hangs and performance issues quickly at scale, particularly with regard to NCCLX. This allows us to efficiently record every communication event and the duration of each collective operation.
- Tracing data is dumped automatically on NCCLX watchdog or heartbeat timeouts, and more computationally intensive tracing operations are enabled selectively as needed in production via online configuration changes. Debugging issues in large-scale training is complicated by the mixed use of NVLink and RoCE in our network. NCCLX enhances failure detection and localization through its co-design with PyTorch, allowing it to track relevant state, and the system monitors communication state to handle stalls caused by NVLink failures.
- We analyze NCCLX communications to debug scaling issues. The system times out when it detects a stall and provides a snapshot of the failing collective's internal state. This includes data transfers between all ranks. Hardware issues can cause slow stragglers that are hard to detect, and a single straggler can slow down thousands of other GPUs. To address this, we developed tools to prioritize potentially problematic communications. By investigating a few top suspects, we can effectively identify the stragglers. Environmental factors also impact training performance at scale, with one example being a 1-2% throughput variation based on time-of-day.
- This diurnal variation is the result of higher mid-day temperatures impacting GPU dynamic voltage and frequency scaling. During training, tens of thousands of GPUs may increase or decrease power consumption at the same time, for example because all GPUs are waiting for checkpointing or collective communications to finish, or because of the startup or shutdown of the entire training job. This can result in instant fluctuations of power consumption across the data center, which remains a challenge as we scale training to even larger models.
- The recipe used to pre-train Llama 3 consists of three main stages: initial pre-training, long-context pre-training, and annealing. We use similar recipes to pre-train other models.
- We pre-train Llama 3 405B using AdamW with a peak learning rate of 8×10^-5, a linear warm-up of 8,000 steps, and a cosine learning rate schedule decaying to 8×10^-7 over 1,200,000 steps (sketched below). To improve training stability, we use a lower batch size early in training and increase it later for efficiency. The initial batch size is 4M tokens with sequences of length 4,096; we double this to a batch size of 8M tokens with sequences of length 8,192 after pre-training on 252M tokens, and double the batch size again to 16M tokens after pre-training on 2.87T tokens. This training recipe proved stable, with few loss spikes and no need for interventions. We also adjusted the pre-training data mix to enhance model performance on specific tasks: we increased the percentage of non-English data to improve Llama 3's multilingual performance and upsampled mathematical data for better mathematical reasoning.
- We added more recent web data in the later stages of pre-training to advance the model's knowledge cut-off, and we downsampled subsets of the pre-training data that were of lower quality.
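- A sketch of the learning-rate schedule described above, implemented from the stated values (8,000-step linear warmup to a peak of 8×10^-5, cosine decay to 8×10^-7 over 1,200,000 steps); whether the decay horizon includes the warmup steps is an assumption:

```python
import math

PEAK_LR = 8e-5
MIN_LR = 8e-7
WARMUP_STEPS = 8_000
TOTAL_STEPS = 1_200_000

def learning_rate(step: int) -> float:
    """Linear warmup followed by cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine
```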
- In the final stages of pre-training, we train on long sequences to support context windows of up to 128K tokens. We do not train on long sequences earlier because the compute in self-attention layers grows quadratically with sequence length.
- We increase the supported context length in increments, pre-training until the model has successfully adapted to the increased context length.
- We assess successful adaptation by measuring whether the model performance on short-context evaluations has recovered and the model solves “needle in a haystack” tasks up to that length.
- In Llama 3 405B pre-training, we increased the context length gradually in six stages, starting from an 8K context window and ending in the final 128K context window. This long-context pre-training stage was performed using approximately 800B training tokens.
- Our post-training strategy involves rejection sampling, supervised finetuning, and direct preference optimization.
- During pre-training on the final 40M tokens, we linearly annealed the learning rate to 0, maintaining a context length of 128K tokens.
- We produce the aligned Llama 3 models by applying several rounds of post-training, on top of a pre-trained checkpoint.
- Our post-training approach involves a reward model and a language model. We train a reward model on a pre-trained checkpoint using human-annotated preference data. We then finetune pre-trained checkpoints with supervised finetuning and align them with Direct Preference Optimization. This process applies to Llama 3 405B unless otherwise noted.
- We refer to Llama 3 405B as Llama 3 for simplicity. To tune LLMs for human-AI interaction, we need to define a chat dialog protocol. Llama 3 has new capabilities such as tool use, which may require generating multiple messages and sending them to different locations (for example, to the user or to a tool) within a single dialog turn. We design a new multi-message chat protocol that uses special header and termination tokens: header tokens indicate the source and destination of each message, while termination tokens signal when it is time to alternate between human and AI turns. We train a reward model covering different capabilities on top of the pre-trained checkpoint.
- We remove the margin term in the loss to improve training efficiency.
- The training objective is the same as Llama 2, using all preference data for reward modeling after filtering out similar responses.
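- With the margin term removed, the pairwise reward-modeling objective reduces to a standard Bradley-Terry style loss over chosen/rejected response pairs, where r_θ(x, y) is the scalar reward for response y to prompt x and σ is the logistic function (notation ours):

```latex
\mathcal{L}_{\mathrm{RM}}
  = -\log \sigma\big( r_{\theta}(x, y_{\mathrm{chosen}}) - r_{\theta}(x, y_{\mathrm{rejected}}) \big)
```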
- Each preference ranking sample has two or three responses with clear ranking.
- We concatenate the prompt and multiple responses into a single row during training with responses randomly shuffled.
- This approach improves training efficiency without losing accuracy.
- Supervised finetuning. The reward model is used to perform rejection sampling on our human annotation prompts. Together with these rejection-sampled responses and other data sources, including synthetic data, we finetune the pre-trained language model using a cross-entropy loss on the target tokens. We refer to this stage as supervised finetuning (SFT). Our largest models are finetuned with a learning rate of 10^-5 over 8.5K to 9K steps. We further train the models with Direct Preference Optimization (DPO) for human preference alignment.
- For DPO training, we primarily use the most recent batches of preference data, collected using the best-performing models from the previous alignment rounds. We explored on-policy algorithms such as PPO, but found that DPO required less compute for large-scale models and performed better, especially on instruction-following benchmarks. For Llama 3, we use a learning rate of 10^-5 and set the beta hyper-parameter to 0.1. We apply algorithmic modifications to DPO, including masking out special formatting tokens (such as header and termination tokens) in the DPO loss to stabilize training.
- We hypothesize that this is due to the contrastive nature of the DPO loss: the presence of common tokens in both the chosen and rejected responses creates a conflicting learning objective, since the model must simultaneously increase and decrease the likelihood of the same tokens.
- We add an NLL loss term with a scaling coefficient of 0.2 on the chosen sequences to stabilize DPO training and maintain desired formatting for generation.
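- Combining the modifications above, the training objective takes roughly the following form, where y_c and y_r are the chosen and rejected responses, π_ref is the reference model, special formatting tokens are masked out of the log-probabilities, and the second term is the negative log-likelihood of the chosen sequence (standard DPO notation; the exact implementation may differ):

```latex
\mathcal{L} = -\log \sigma\!\left(
      \beta \log \frac{\pi_{\theta}(y_c \mid x)}{\pi_{\mathrm{ref}}(y_c \mid x)}
    - \beta \log \frac{\pi_{\theta}(y_r \mid x)}{\pi_{\mathrm{ref}}(y_r \mid x)}
  \right)
  + 0.2\,\mathcal{L}_{\mathrm{NLL}}(y_c),
  \qquad \beta = 0.1
```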
- We also average models obtained from experiments using various data or hyperparameters at each stage.
- We list statistics of human preference data used for Llama 3 alignment. The data shows comparisons of responses in multi-turn dialogues. Each dialogue is split into examples at a turn level, consisting of a prompt and a response.
- We apply the same methods in six rounds, collecting new preference annotations and data in each cycle. The post-training data composition is crucial.
- Our human annotation procedures and preference data collection involve deploying multiple models for annotation after each round, sampling two responses from two different models for each user prompt. Annotators rate the strength of their preference by categorizing it into four levels: significantly better, better, slightly better, or marginally better. We also include an editing step after preference ranking.
- We encourage annotators to further improve the response. Annotators edit the chosen response directly or prompt the model with feedback to refine its own response. Table 6 reports the statistics of preference annotations used for Llama 3 training. General English covers multiple subcategories, including knowledge-based question and answering, and precise instruction-following. We observe an increase in the average length of prompt and response compared to Llama 2, suggesting more complex tasks. We also implement a quality analysis and human evaluation process to assess the data and provide feedback to annotators. As Llama 3 improves, we increase prompt complexity.
- We target areas where the model lags. We use all available preference data for reward modeling, while only the latest batches are used for DPO training. For both, we use samples labeled as significantly better or better than the rejected counterpart and discard samples with similar responses. Our finetuning data comes from human annotation collection with rejection-sampled responses and from synthetic data targeting specific capabilities. The dataset categories include General English, Code, Multilingual, Exam-like, and Reasoning and tools.
- We list internally collected SFT data used for Llama 3 alignment. Each SFT example consists of a context and a final response. Small amounts of human-curated data are used. As post-training rounds progress, we develop stronger Llama 3 variants to collect larger datasets. This section discusses the rejection-sampling procedure and overall composition of the final SFT data. During rejection sampling, for each prompt collected during human annotation, we sample multiple outputs from the latest chat model policy, typically between 10 and 30.
- We use a reward model to select the best candidate, consistent with Bai et al. (2022). In later rounds of post-training, we introduce system prompts to steer responses to conform to a desirable tone, style, or formatting. To increase the efficiency of rejection sampling, we adopt PagedAttention, which enhances memory efficiency through dynamic key-value cache allocation and supports arbitrary output lengths. It also enables us to share key-value cache pages across corresponding outputs, leading to a throughput improvement during rejection sampling.
- Data composition: Table 7 shows statistics for each broad category of our "helpfulness" mix. While SFT and preference data contain overlapping domains, they are curated differently, resulting in distinct count statistics. We adjust our data mix across these axes to tune performance across a wide range of benchmarks. Our final data mix trains for multiple epochs on some high-quality sources and downsamples others.
- Data processing involves careful cleaning and quality control, particularly since most training data is model-generated. In early rounds, we removed undesirable patterns such as excessive use of emojis or exclamation points.
- Strategies to filter or clean problematic data include identifying overused phrases and balancing their proportion in the dataset. Data pruning is also applied to remove low-quality training samples. Techniques used include topic classification and quality scoring. Topic classification involves finetuning a model to classify data into coarse and fine-grained buckets. Quality scoring uses reward models and Llama-based signals to obtain a score for each sample, with top quartile scores considered high quality.
- We score data using two measures of difficulty to prioritize more complex examples for the model. We use Instag and Llama-based scoring. For Instag, we prompt a model to perform intention tagging of prompts, where more intentions indicate more complexity. We also measure the difficulty of dialogs on a three-point scale. We then perform semantic deduplication by clustering complete samples.
- We first cluster complete dialogs using RoBERTa embeddings and, within each cluster, sort them by quality score multiplied by difficulty score. We then select examples greedily, keeping only those whose maximum cosine similarity to examples already selected in the cluster is below a threshold. In the remainder of this section, we highlight special efforts to improve performance for specific capabilities: code, multilinguality, math and reasoning, long context, tool use, factuality, and steerability.
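- A minimal sketch of the cluster-then-greedily-select step described above; the embedding source and similarity threshold are illustrative assumptions:

```python
import numpy as np

def greedy_select(embeddings: np.ndarray, scores: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Select diverse, high-value dialogs within one cluster.

    embeddings: (N, d) unit-normalized sentence embeddings (e.g., from a RoBERTa model).
    scores:     (N,) quality score x difficulty score used to rank candidates.
    threshold:  maximum allowed cosine similarity to any already-selected example.
    """
    order = np.argsort(-scores)  # highest-scoring examples first
    selected: list[int] = []
    for idx in order:
        emb = embeddings[idx]
        if all(float(emb @ embeddings[j]) < threshold for j in selected):
            selected.append(int(idx))
    return selected
```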
- For code, LLMs have gained attention since Copilot and Codex. Developers use these models to generate code snippets, debug, automate tasks, and improve code quality. For Llama 3, we focus on improving and evaluating code generation, documentation, debugging, and review capabilities.
- We present our work on improving coding capabilities in high-priority programming languages such as Python, Java, and C/C++. Our approach involves training a code expert, which we use to collect high-quality human annotations for code. This expert is obtained by continuing pre-training on a large token mix consisting mostly of code data. Following a recipe similar to CodeLlama, we then perform long-context finetuning to extend the expert's context length on repo-level code data.
- We follow a similar post-training modeling recipe to align this expert, with SFT and DPO data mixes targeting code. The resulting model is used for rejection sampling on coding prompts and for synthetic data generation. During development, we identified key issues in code generation, including difficulty in following instructions, code syntax errors, incorrect code generation, and difficulty in fixing bugs. We use Llama 3 and the code expert to generate synthetic SFT dialogs. We describe three approaches for generating synthetic code data, resulting in over 2.7M synthetic examples.
- We used synthetic data generation with execution feedback during SFT. The 8B and 70B models show significant performance improvements when trained on data generated by a larger model. However, training Llama 3 405B on its own generated data is not helpful. To address this, we introduced execution feedback as a source of truth. We generated a large dataset of synthetic coding dialogues using a specific process. This process starts with problem description generation, where we create a collection of programming problem descriptions covering a diverse range of topics. We sample random code snippets and prompt the model to generate problems inspired by these examples.
- We generate problem descriptions spanning a diverse range of topics. To solve each problem, we prompt Llama 3 to generate a solution in a given programming language. Asking the model to follow general rules of good programming improves the solution quality, as does requiring the model to explain its thought process in comments. After generating a solution, we check its correctness using static and dynamic analysis techniques, which includes running the generated code through a parser and a linter to ensure syntactic correctness.
- These static checks catch errors such as syntax errors, use of uninitialized variables or non-imported functions, code style issues, typing errors, and others.
- Unit test generation and execution involves generating unit tests for each problem and solution, executed in a containerized environment together with the solution, catching run-time execution errors and some semantic errors.
- Error feedback and iterative self-correction is done by prompting the model to revise a solution when it fails, using the original problem description, faulty solution, and feedback from the parser/linter/tester.
- Only dialogs that pass all checks are included in the final dataset used for supervised finetuning. Notably, about 20% of solutions were initially incorrect but self-corrected, indicating that the model learned from the execution feedback and improved its performance.
- Fine-tuning and iterative improvement is conducted over multiple rounds, with each round building on the previous one, resulting in improved model performance.
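- A minimal illustration of the static-analysis and unit-test execution checks described above; all helper names are hypothetical, and the real pipeline runs tests inside a containerized sandbox rather than a bare subprocess:

```python
import ast
import subprocess
import tempfile

def static_check(source: str) -> list[str]:
    """Static-analysis stage: report syntax errors before any execution."""
    try:
        ast.parse(source)
        return []
    except SyntaxError as err:
        return [f"syntax error: {err}"]

def run_unit_tests(solution: str, tests: str, timeout_s: int = 10) -> bool:
    """Dynamic stage: execute generated unit tests against the candidate solution."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```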
- A performance gap exists between major programming languages and less common ones, which is mitigated by translating data from common languages to less common languages.
- This is achieved by prompting Llama 3 to translate the data and ensuring the quality of the translated code via syntax parsing, compilation, and execution. Figure 8 shows an example of synthetic PHP code translated from Python. This significantly improves performance for the less common languages.
- We employ an alternative approach for generating synthetic data, called backtranslation, to improve coding capabilities such as documentation and explanations.
- We generated approximately 1.2M synthetic data points.
- Figure 8 shows a code translation example using Llama 3 to translate Python code to PHP code.
- Figure 9 compares the quality of generated code with and without system prompts.
- Starting from code snippets of a variety of languages in our pre-training data, we prompt Llama 3 to generate data that represents our target capability (for example, documentation or code explanations). We then prompt the model to backtranslate the synthetically generated data to the original code, and we use the original code as a reference to determine the quality of the output. We also use system prompt steering during rejection sampling to improve code readability, documentation, thoroughness, and specificity.
- The system prompt helps improve generated code quality, for example by adding comments, using informative variable names, and conserving memory. We also encounter quality issues in rejection-sampled data, such as code blocks containing bugs. Detecting these issues is not straightforward because responses mix natural language and code. To address this, we use a "model-as-judge" approach, in which earlier versions of Llama 3 assess each sample and assign a binary (0 or 1) score on each of two criteria: code correctness and code style.
- We retain only those samples that achieve a perfect score of 2. This stringent filtering led to a regression in downstream benchmark performance, primarily because it removed examples with challenging prompts. To counteract this, we revised the responses of some coding data until they met the model-as-judge criteria. We refined these challenging problems to achieve a balance between quality and difficulty, resulting in optimal downstream performance.
- We improve Llama 3's multilingual capabilities by training on more multilingual data and sourcing high quality instruction tuning data for several languages including German, French, Italian, Portuguese, Hindi, Spanish, and Thai. We also tackle specific challenges of multilingual language steering to enhance the model's overall performance.
- Expert training: we branch off the pre-training run and continue pre-training on a data mix consisting of 90% multilingual tokens. We then perform post-training on this expert and use it to collect higher-quality human annotations in non-English languages.
- Multilingual data is derived from various sources with a distribution of 2.4% human annotations, 44.2% data from other NLP tasks, 18.8% rejection sampled data, and 34.6% translated reasoning data.
- Human annotations are collected from linguists and native speakers, consisting of open-ended prompts that represent real-world use cases.
- We augment the training data with data from other NLP tasks rewritten into dialog format, for example data from exams-qa and Conic10k. To improve language alignment, we use parallel texts from GlobalVoices and Wikimedia. We remove low-quality data using LID-based filtering and Blaser 2.0. For parallel text data, rather than using the bitext pairs directly, we apply a multilingual template to better simulate real-life conversations. We use rejection sampling on human-annotated prompts to generate high-quality samples for finetuning, with the temperature hyperparameter randomly chosen from the range 0.2-1.0 for diverse generations.
- We implement multilingual-specific checks to ensure a high language-match rate between the prompt and response. This includes checks such as ensuring a romanized Hindi prompt does not expect a response in Hindi Devanagari script. We also avoid using machine-translated data to fine-tune the model to prevent issues like translationese and bias.
- One concern is that machine-translated data exposes the model only to an English cultural context, which may not be representative of broader linguistic and cultural diversity. We made one exception: we translated synthetic quantitative reasoning data to improve performance in non-English languages. Owing to the simple language of these math problems, the translated samples had little to no quality issues, and adding this translated data resulted in strong gains on MGSM. We define reasoning as the ability to perform multi-step computations and arrive at the correct final answer. Several challenges guide our approach to training models that excel in mathematical reasoning, one of which is a lack of prompts for complex questions.
- Diverse and representative training datasets are needed for teaching models various mathematical skills.
- Lack of step-by-step solutions is a major issue as effective reasoning requires a chain of thought to facilitate the reasoning process.
- Incorrect intermediate steps can lead to incorrect final answers and need to be addressed.
- Teaching models to use external tools such as code interpreters can enhance their reasoning abilities.
- To improve problem-solving abilities, leveraging relevant pre-training data from mathematical contexts is crucial. This involves converting the data into a question-answer format for supervised fine-tuning. Additionally, identifying areas where the model underperforms and sourcing prompts from humans to teach these skills is essential. Ensuring consistency between training and real-world usage is vital for maintaining reasoning performance.
- We create a taxonomy of mathematical skills and ask humans to provide relevant prompts/questions.
- Augmenting training data with step-wise reasoning traces involves using Llama 3 to generate step-by-step solutions for a set of prompts.
- We filter these generations based on the correct answer and also use Llama 3 for self-verification to ensure the step-by-step solution is valid for a given question.
- We train reward models to filter training data with invalid step-by-step reasoning.
- These models eliminate data with incorrect intermediate reasoning steps.
- Ensuring high-quality data for fine-tuning involves using techniques such as Monte Carlo Tree Search with learned step-wise reward models to generate valid reasoning traces.
- Interleaving code and text reasoning is another approach where Llama 3 solves problems through a combination of textual reasoning and Python code, with code execution serving as a feedback signal to ensure the correctness of the reasoning process.
- Learning from feedback and mistakes is also crucial, where incorrect generations are utilized to simulate human feedback and perform error correction by prompting Llama 3 to yield correct generations.
- These methods improve the model's ability to reason accurately and to learn from its mistakes.
- We extend the context length of Llama 3 from 8K tokens to 128K tokens.
- We must carefully tune the recipe to balance short and long-context capabilities during finetuning.
- Naively applying our existing recipe resulted in significant regressions in long-context capabilities, highlighting the need to incorporate long-context data.
- We rely on synthetic data to fill the gap due to the tedious nature of reading lengthy contexts.
- We use earlier versions of Llama 3 to generate synthetic data based on key long-context use-cases.
- We curate long documents, splitting them into 8K token chunks. We use these chunks to generate QA pairs with a model, with the full document as context during training.
- We apply hierarchical summarization to long documents, first summarizing chunks with a strong model, then summarizing the summaries.
- During training, the full document is used, and the model is prompted to summarize while preserving important details. QA pairs are also generated from summaries, with questions requiring global understanding of the long document.
- Long-context code reasoning: we parse Python files to identify import statements and determine their dependencies. We select the most commonly depended-upon files, those referenced by at least five other files. We remove one of these key files from a repository and prompt the model to identify which files depend on the missing file and to generate the necessary missing code. We categorize synthetically generated samples by sequence length (16K, 32K, 64K, and 128K) to enable fine-grained targeting of input lengths. Careful ablations show that mixing 0.1% of synthetically generated long-context data with the original short-context data optimizes performance across both short-context and long-context benchmarks. We also find that using only short-context training data in DPO does not negatively impact long-context performance, as long as the SFT model is of high quality on long-context tasks.
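- A minimal sketch of the import-dependency analysis described above; function names are hypothetical, and mapping module names back to files in the repository is omitted:

```python
import ast
from collections import defaultdict
from pathlib import Path

def dependency_counts(repo_root: str) -> dict[str, int]:
    """Count, for each imported module, how many files in the repository import it."""
    counts: dict[str, int] = defaultdict(int)
    for path in Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue
        imported = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                imported.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                imported.add(node.module)
        for module in imported:
            counts[module] += 1
    return counts

def key_modules(repo_root: str, min_dependents: int = 5) -> list[str]:
    """Modules depended on by at least five files; removing one yields a training prompt."""
    return [m for m, c in dependency_counts(repo_root).items() if c >= min_dependents]
```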
- We keep the standard short-context recipe for DPO on top of our long-context SFT checkpoints. Teaching LLMs to use tools like search engines or code interpreters makes them more general assistants. We train Llama 3 to interact with a search engine and a Python interpreter. The search engine is used to answer questions about recent events that go beyond its knowledge cutoff. The Python interpreter is used to generate and execute code for complex computations and tasks.
- Llama 3 uses the Wolfram Alpha API to solve math and science problems and retrieve information from Wolfram's database. It can handle multi-turn dialogs and create step-by-step plans for queries that require multiple tool calls. The model can also generate the correct tool call given in-context tool definitions and a user query. We implement core tools as Python objects with different methods, and zero-shot tools can be implemented as Python functions with descriptions and documentation.
- Function definitions and calls can also be converted to JSON format, for example for web API calls. All tool calls are executed by the Python interpreter, which must be enabled in the system prompt; core tools can be individually enabled or disabled in the system prompt. We rely on human annotations and preferences to teach Llama 3 to use tools, differing from Schick et al. (2024). The main difference from the general post-training pipeline is that we annotate at the message level to collect granular feedback: annotators provide a preference between two assistant messages with the same context, or edit one of the messages if both contain major problems. This provides human feedback on both the assistant's ability to call tools and its ability to reason about the tool output.
- We do not perform rejection sampling as it didn't improve our tool benchmarks. To speed up annotation, we fine tune on synthetic data from previous Llama 3 checkpoints to bootstrap basic tool use capabilities. This reduces the number of edits annotators need to make. As Llama 3 improves, we gradually make our annotation protocols more complex, starting with single-turn tool use, then tool use in conversations, and finally multi-step tool use and data analysis. To create tool usage data, we generate synthetic user prompts that require a core tool, such as questions beyond our knowledge cutoff date.
- We generate tool calls for these prompts, execute them, and add the output to the model's context. We then prompt the model to generate a final answer based on the tool output. The resulting trajectories consist of system prompts, user prompts, tool calls, tool outputs, and final answers. We also remove around 30% of this dataset to eliminate tool calls that cannot be executed and other formatting issues.
- We teach the model basic multi-step tool use capabilities by generating synthetic data. We prompt the model to create user prompts that need at least two tool calls, then use few-shot prompting to generate a solution with interleaved reasoning steps and tool calls.
- We annotate files of various types including txt, docx, pdf, pptx, xlsx, csv, tsv, py, json, jsonl, html, and xml. Our prompts are based on the file content and ask the model to summarize, fix bugs, optimize code, perform data analysis or visualization. The model can perform tasks involving multiple tool usage steps. For example, it can use tools to solve a task through multi-step planning and reasoning. We fine tune the model on synthetic data and gather human annotations in diverse scenarios, including multi-turn interactions and tool use that requires more than three steps.
- We improve Llama 3 zero-shot tool use abilities by finetuning on a large and diverse set of function definitions, user queries, and corresponding calls.
- We evaluate our model on unseen tools.
- The model can handle single, nested, and parallel function calling.
- Generating diverse functions, queries, and ground truths can be challenging.
- We use real functions from the Stack to ground our synthetic user queries.
- We extract and clean function calls and their definitions, filtering out those with missing docstrings or non-executable functions. We then use Llama 3 to generate natural language queries corresponding to the function calls. The system also handles multi-turn function calling by generating synthetic data for dialogues with function calls, following a specific protocol. This involves multiple agents that collaborate to create domains, APIs, user queries, API calls, and responses, ensuring diversity and realism.
- Factuality hallucinations are a major challenge for large language models, which tend to be overconfident even in domains where they have little knowledge.
- Such hallucinations can lead to the spread of misinformation. We take a hallucination-first approach here: our primary approach involves generating data that aligns model generations with subsets of factual data present in the pre-training data. We develop a knowledge-probing technique that takes advantage of Llama 3's in-context abilities. This process involves extracting a data snippet from the pre-training data, generating a factual question about the snippet, sampling responses from Llama 3 to the question, and scoring the correctness of the generations.
- We use Llama 3 as a judge to score the correctness and informativeness of the generations. For questions where the responses are consistently informative but incorrect across generations, we use Llama 3 to generate a refusal. We use the data generated by this knowledge probe to encourage the model to only answer questions it has knowledge about, and to refuse to answer questions it is unsure about.
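- A rough sketch of this knowledge-probing loop; generate and judge stand in for calls to Llama 3, and the prompts and the refusal rule shown here are illustrative assumptions rather than the exact ones used in training:

```python
def knowledge_probe(snippets, generate, judge, num_samples: int = 8):
    """Build factuality SFT examples from pre-training snippets.

    generate(prompt) -> str and judge(snippet, question, answer) -> str
    ("correct", "incorrect", or "refusal") are hypothetical wrappers around Llama 3.
    """
    sft_examples = []
    for snippet in snippets:
        question = generate(f"Write a factual question answered by this passage:\n{snippet}")
        answers = [generate(question) for _ in range(num_samples)]
        grades = [judge(snippet, question, answer) for answer in answers]
        if all(g == "incorrect" for g in grades):
            # Consistently informative but wrong: teach the model to refuse this question.
            response = generate(f"Politely say you are not sure about: {question}")
        elif "correct" in grades:
            # The model does know the answer: keep a correct generation as the target.
            response = answers[grades.index("correct")]
        else:
            continue
        sft_examples.append({"prompt": question, "response": response})
    return sft_examples
```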
|