
  1. Meta Llama

Discover the possibilities with Meta Llama
Democratizing access through an open platform featuring AI models, tools, and resources — enabling developers to shape the next wave of innovation. Licensed for both research and commercial use.

Llama models and tools

Meta Llama 3: Build the future of AI with Meta Llama 3
Llama 3 is an accessible, open-source large language model (LLM) designed for developers, researchers, and businesses to build, experiment, and responsibly scale their generative AI ideas. Part of a foundational system, it serves as a bedrock for innovation in the global community.

Meta Code Llama: A state-of-the-art large language model for coding
An LLM capable of generating code, and natural language about code, from both code and natural language prompts.

Meta Llama Guard: Empowering developers, advancing safety, and building an open ecosystem
We're announcing Meta Llama Guard, an umbrella project featuring open trust and safety tools and evaluations meant to level the playing field for developers.

Ready to start building with Meta Llama? Access our getting started guide and responsible use resources to get started.

Prompt Engineering with Meta Llama
Learn how to effectively use Llama models for prompt engineering with our free course on Deeplearning.AI, where you'll learn best practices and interact with the models through a simple API call.

Partnerships: Our global partners and supporters
We have a broad range of supporters around the world who believe in our open approach to today's AI — companies that have given early feedback and are excited to build with Llama, cloud providers that will include the model as part of their offering to customers, researchers committed to doing research with the model, and people across tech, academia, and policy who see the benefits of Llama and an open platform as we do.

Latest Llama updates
Introducing Meta Llama 3: The most capable openly available LLM to date
Meet Your New Assistant: Meta AI, Built With Llama 3
CYBERSECEVAL 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

Stay up-to-date
Subscribe to our newsletter to keep up with the latest Llama updates, releases, and more.
  2. ----------
  3. Llama 2 Acceptable Use Policy

Meta is committed to promoting safe and fair use of its tools and features, including Llama 2. If you access or use Llama 2, you agree to this Acceptable Use Policy ("Policy"). The most recent copy of this policy can be found at llama.meta.com/use-policy.

Prohibited Uses
We want everyone to use Llama 2 safely and responsibly. You agree you will not use, or allow others to use, Llama 2 to:

1. Violate the law or others' rights, including to:
a. Engage in, promote, generate, contribute to, encourage, plan, incite, or further illegal or unlawful activity or content, such as:
i. Violence or terrorism
ii. Exploitation or harm to children, including the solicitation, creation, acquisition, or dissemination of child exploitative content or failure to report Child Sexual Abuse Material
iii. Human trafficking, exploitation, and sexual violence
iv. The illegal distribution of information or materials to minors, including obscene materials, or failure to employ legally required age-gating in connection with such information or materials
v. Sexual solicitation
vi. Any other criminal activity
b. Engage in, promote, incite, or facilitate the harassment, abuse, threatening, or bullying of individuals or groups of individuals
c. Engage in, promote, incite, or facilitate discrimination or other unlawful or harmful conduct in the provision of employment, employment benefits, credit, housing, other economic benefits, or other essential goods and services
d. Engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or related professional practices
e. Collect, process, disclose, generate, or infer health, demographic, or other sensitive personal or private information about individuals without rights and consents required by applicable laws
f. Engage in or facilitate any action or generate any content that infringes, misappropriates, or otherwise violates any third-party rights, including the outputs or results of any products or services using the Llama 2 Materials
g. Create, generate, or facilitate the creation of malicious code, malware, computer viruses or do anything else that could disable, overburden, interfere with or impair the proper working, integrity, operation or appearance of a website or computer system

2. Engage in, promote, incite, facilitate, or assist in the planning or development of activities that present a risk of death or bodily harm to individuals, including use of Llama 2 related to the following:
a. Military, warfare, nuclear industries or applications, espionage, use for materials or activities that are subject to the International Traffic Arms Regulations (ITAR) maintained by the United States Department of State
b. Guns and illegal weapons (including weapon development)
c. Illegal drugs and regulated/controlled substances
d. Operation of critical infrastructure, transportation technologies, or heavy machinery
e. Self-harm or harm to others, including suicide, cutting, and eating disorders
f. Any content intended to incite or promote violence, abuse, or any infliction of bodily harm to an individual

3. Intentionally deceive or mislead others, including use of Llama 2 related to the following:
a. Generating, promoting, or furthering fraud or the creation or promotion of disinformation
b. Generating, promoting, or furthering defamatory content, including the creation of defamatory statements, images, or other content
c. Generating, promoting, or further distributing spam
d. Impersonating another individual without consent, authorization, or legal right
e. Representing that the use of Llama 2 or outputs are human-generated
f. Generating or facilitating false online engagement, including fake reviews and other means of fake online engagement

4. Fail to appropriately disclose to end users any known dangers of your AI system

Please report any violation of this Policy, software "bug," or other problems that could lead to a violation of this Policy through one of the following means:
Reporting issues with the model: github.com/facebookresearch/llama
Reporting risky content generated by the model: developers.facebook.com/llama_output_feedback
Reporting bugs and security concerns: facebook.com/whitehat/info
Reporting violations of the Acceptable Use Policy or unlicensed uses of Llama: LlamaUseReport@meta.com
  4. ----------
  5. Responsible Use Guide for Llama 2

Responsible Use Guide: your resource for building responsibly
The Responsible Use Guide is a resource for developers that provides best practices and considerations for building products powered by large language models (LLMs) in a responsible manner, covering various stages of development from inception to deployment.
  6. ----------
  7. Meta Llama 2

Llama 2: open source, free for research and commercial use
We're unlocking the power of these large language models. Our latest version of Llama, Llama 2, is now accessible to individuals, creators, researchers, and businesses so they can experiment, innovate, and scale their ideas responsibly.

Available as part of the Llama 2 release. With each model download you'll receive:
Model code
Model weights
README (user guide)
License
Acceptable use policy
Model card

Technical specifications
Llama 2 was pretrained on publicly available online data sources. The fine-tuned model, Llama Chat, leverages publicly available instruction datasets and over 1 million human annotations.

Inside the model
Llama 2 pretrained models are trained on 2 trillion tokens and have double the context length of Llama 1. The fine-tuned Llama Chat models have additionally been trained on over 1 million new human annotations.

Safety and helpfulness
Llama Chat uses reinforcement learning from human feedback (RLHF) to ensure safety and helpfulness. Training Llama Chat: Llama 2 is pretrained using publicly available online data. An initial version of Llama Chat is then created through supervised fine-tuning. Next, Llama Chat is iteratively refined using RLHF, which includes rejection sampling and proximal policy optimization (PPO).

Get Llama 2 now: complete the download form via the link below. By submitting the form, you agree to Meta's privacy policy.

Our global partners and supporters
We have a broad range of supporters around the world who believe in our open approach to today's AI — companies that have given early feedback and are excited to build with Llama 2, cloud providers that will include the model as part of their offering to customers, researchers committed to doing research with the model, and people across tech, academia, and policy who see the benefits of Llama and an open platform as we do.

Statement of support for Meta's open approach to today's AI
"We support an open innovation approach to AI. Responsible and open innovation gives us all a stake in the AI development process, bringing visibility, scrutiny and trust to these technologies. Opening today's Llama models will let everyone benefit from this technology."

We're committed to building responsibly
To promote a responsible, collaborative AI innovation ecosystem, we've established a range of resources for all who use Llama 2: individuals, creators, developers, researchers, academics, and businesses of any size. The Responsible Use Guide is a resource for developers that provides best practices and considerations for building products powered by large language models (LLMs) in a responsible manner, covering various stages of development from inception to deployment.

Safety red-teaming
Llama Chat has undergone testing by external partners and internal teams to identify performance gaps and mitigate potentially problematic responses in chat use cases. We're committed to ongoing red-teaming to enhance safety and performance.

Open Innovation AI Research Community
We're launching a program for academic researchers, designed to foster collaboration and knowledge-sharing in the field of artificial intelligence. This program provides a unique opportunity for researchers to come together, share their learnings, and help shape the future of AI. By joining this community, participants will have the chance to contribute to a research agenda that addresses the most pressing challenges in the field, and work together to develop innovative solutions that promote responsible and safe AI practices. We believe that by bringing together diverse perspectives and expertise, we can accelerate the pace of progress in AI research.

Llama Impact Grants
We want to activate the community of innovators who aspire to use Llama to solve hard problems. We are launching the grants to encourage a diverse set of public, non-profit, and for-profit entities to use Llama 2 to address environmental, education, and other important challenges. The grants will be subject to rules, which will be posted here prior to the grants' start.

Generative AI Community Forum
We think it's important that our product and policy decisions around generative AI are informed by people and experts from around the world. In support of this belief, we created a forum to act as a governance tool and resource for the community. It brings together a representative group of people to discuss and deliberate on the values that underpin AI, LLMs, and other new AI technologies. This forum will be held in consultation with the Stanford Deliberative Democracy Lab and the Behavioural Insights Team, and is consistent with our open collaboration approach to sharing AI models.

Join us on our AI journey
If you'd like to advance AI with us, visit our Careers page to discover more about AI at Meta.

Llama 2 frequently asked questions
Get answers to Llama 2 questions in our comprehensive FAQ page, from how it works, to how to use it, integrations, and more.

Explore more on Llama 2
Discover more about Llama 2 by visiting our resources, ranging from our research paper to how to get access, and more: GitHub, Open Innovation AI Research Community, Getting started guide, AI at Meta blog, Research paper.
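The RLHF refinement described in the "Safety and helpfulness" section above combines rejection sampling with PPO. As a rough, hedged illustration of the rejection-sampling step only, the sketch below samples several candidate responses and keeps the one a reward model scores highest. The function names and the toy reward model are placeholders, not Meta's actual training setup.

# Illustrative sketch of best-of-N rejection sampling, one ingredient of the
# RLHF refinement described above. The generator and reward model here are
# stubs; in a real pipeline they would be the current chat model and a learned
# helpfulness/safety reward model.
import random
from typing import Callable, List

def generate_candidates(prompt: str, n: int) -> List[str]:
    # Placeholder for sampling n responses from the current chat model.
    return [f"candidate response {i} to: {prompt}" for i in range(n)]

def reward(prompt: str, response: str) -> float:
    # Placeholder reward score (higher = judged more helpful/safe).
    return random.random()

def rejection_sample(prompt: str, n: int = 8,
                     gen: Callable = generate_candidates,
                     score: Callable = reward) -> str:
    """Keep the highest-reward response out of n samples."""
    candidates = gen(prompt, n)
    return max(candidates, key=lambda r: score(prompt, r))

if __name__ == "__main__":
    best = rejection_sample("How do I plan a trip to Paris?")
    print(best)  # the selected response would feed further fine-tuning

In the real pipeline, the selected responses are used as additional fine-tuning targets before the PPO stage; this sketch only shows the selection step.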
  8. ----------
  9. LLAMA 2 COMMUNITY LICENSE AGREEMENT

Llama 2 Version Release Date: July 18, 2023

"Agreement" means the terms and conditions for use, reproduction, distribution and modification of the Llama Materials set forth herein.

"Documentation" means the specifications, manuals and documentation accompanying Llama 2 distributed by Meta at llama.meta.com/llama-downloads/

"Licensee" or "you" means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity's behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering in this Agreement on their behalf.

"Llama 2" means the foundational large language models and software and algorithms, including machine-learning model code, trained model weights, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Meta at

"Llama Materials" means, collectively, Meta's proprietary Llama 2 and Documentation (and any portion thereof) made available under this Agreement.

"Meta" or "we" means Meta Platforms Ireland Limited (if you are located in or, if you are an entity, your principal place of business is in the EEA or Switzerland) and Meta Platforms, Inc. (if you are located outside of the EEA or Switzerland).

By clicking "I Accept" below or by using or distributing any portion or element of the Llama Materials, you agree to be bound by this Agreement.

1. License Rights and Redistribution.
a. Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Meta's intellectual property or other rights owned by Meta embodied in the Llama Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Llama Materials.
b. Redistribution and Use.
i. If you distribute or make the Llama Materials, or any derivative works thereof, available to a third party, you shall provide a copy of this Agreement to such third party.
ii. If you receive Llama Materials, or any derivative works thereof, from a Licensee as part of an integrated end user product, then Section 2 of this Agreement will not apply to you.
iii. You must retain in all copies of the Llama Materials that you distribute the following attribution notice within a "Notice" text file distributed as a part of such copies: "Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved."
iv. Your use of the Llama Materials must comply with applicable laws and regulations (including trade compliance laws and regulations) and adhere to the Acceptable Use Policy for the Llama Materials (available at https://llama.meta.com/use-policy), which is hereby incorporated by reference into this Agreement.
v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).

2. Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee's affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE LLAMA MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS.

4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.

5. Intellectual Property.
a. No trademark licenses are granted under this Agreement, and in connection with the Llama Materials, neither Meta nor Licensee may use any name or mark owned by or associated with the other or any of its affiliates, except as required for reasonable and customary use in describing and redistributing the Llama Materials.
b. Subject to Meta's ownership of Llama Materials and derivatives made by or for Meta, with respect to any derivative works and modifications of the Llama Materials that are made by you, as between you and Meta, you are and will be the owner of such derivative works and modifications.
c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Llama 2 outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Llama Materials.

6. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the Llama Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Meta may terminate this Agreement if you are in breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the termination of this Agreement.

7. Governing Law and Jurisdiction. This Agreement will be governed and construed under the laws of the State of California without regard to choice of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. The courts of California shall have exclusive jurisdiction of any dispute arising out of this Agreement.
  14. ----------
  15. Meta Code Llama

Code Llama, a state-of-the-art large language model for coding
Code Llama has the potential to make workflows faster and more efficient for current developers and lower the barrier to entry for people who are learning to code. Code Llama has the potential to be used as a productivity and educational tool to help programmers write more robust, well-documented software.

Free for research and commercial use: Code Llama is built on top of Llama 2 and is available in three models:
Code Llama
Code Llama Python
Code Llama Instruct

With each model download you'll receive:
All Code Llama models
README (User Guide)
Acceptable Use Policy
Model Card

How Code Llama works
Code Llama is a code-specialized version of Llama 2 that was created by further training Llama 2 on its code-specific datasets, sampling more data from that same dataset for longer. Essentially, Code Llama features enhanced coding capabilities, built on top of Llama 2. It can generate code, and natural language about code, from both code and natural language prompts (e.g., "Write me a function that outputs the fibonacci sequence."). It can also be used for code completion and debugging. It supports many of the most popular languages being used today, including Python, C++, Java, PHP, Typescript (Javascript), C#, and Bash.

Code Llama is available in four sizes with 7B, 13B, 34B, and 70B parameters respectively. Each of these models is trained with 500B tokens of code and code-related data, apart from 70B, which is trained on 1T tokens. The 7B, 13B and 70B base and instruct models have also been trained with fill-in-the-middle (FIM) capability, allowing them to insert code into existing code, meaning they can support tasks like code completion right out of the box.

The four models address different serving and latency requirements. The 7B model, for example, can be served on a single GPU. The 34B and 70B models return the best results and allow for better coding assistance, but the smaller 7B and 13B models are faster and more suitable for tasks that require low latency, like real-time code completion.

Note: We do not recommend using Code Llama or Code Llama Python to perform general natural language tasks, since neither of these models is designed to follow natural language instructions. Code Llama is specialized for code-specific tasks and isn't appropriate as a foundation model for other tasks.

Evaluating Code Llama's performance
To test Code Llama's performance against existing solutions, we used two popular coding benchmarks: HumanEval and Mostly Basic Python Programming (MBPP). HumanEval tests the model's ability to complete code based on docstrings and MBPP tests the model's ability to write code based on a description. Our benchmark testing showed that Code Llama performed better than open-source, code-specific LLMs and outperformed Llama 2. Code Llama 70B Instruct, for example, scored 67.8% on HumanEval and 62.2% on MBPP, the highest compared with other state-of-the-art open solutions, and on par with ChatGPT.

As with all cutting-edge technology, Code Llama comes with risks. Building AI models responsibly is crucial, and we undertook numerous safety measures before releasing Code Llama. As part of our red teaming efforts, we ran a quantitative evaluation of Code Llama's risk of generating malicious code. We created prompts that attempted to solicit malicious code with clear intent and scored Code Llama's responses to those prompts against ChatGPT's (GPT-3.5 Turbo). Our results found that Code Llama answered with safer responses. Details about our red teaming efforts from domain experts in responsible AI, offensive security engineering, malware development, and software engineering are available in our research paper.

Releasing Code Llama
Programmers are already using LLMs to assist in a variety of tasks, ranging from writing new software to debugging existing code. The goal is to make developer workflows more efficient, so they can focus on the most human-centric aspects of their job, rather than repetitive tasks. At Meta, we believe that AI models, and LLMs for coding in particular, benefit most from an open approach, both in terms of innovation and safety. Publicly available, code-specific models can facilitate the development of new technologies that improve people's lives. By releasing code models like Code Llama, the entire community can evaluate their capabilities, identify issues, and fix vulnerabilities. Code Llama's training recipes are available on our GitHub repository, and model weights are also available.

Responsible use
Our research paper discloses details of Code Llama's development as well as how we conducted our benchmarking tests. It also provides more information on the model's limitations, known challenges we encountered, mitigations we've taken, and future challenges we intend to investigate. We've also updated our Responsible Use Guide, which includes guidance on developing downstream models responsibly, including:
Defining content policies and mitigations.
Preparing data.
Fine-tuning the model.
Evaluating and improving performance.
Addressing input- and output-level risks.
Building transparency and reporting mechanisms in user interactions.

Developers should evaluate their models using code-specific evaluation benchmarks and perform safety studies on code-specific use cases such as generating malware, computer viruses, or malicious code. We also recommend leveraging safety datasets for automatic and human evaluations, and red teaming on adversarial prompts.

The future of generative AI for coding
Code Llama is designed to support software engineers in all sectors, including research, industry, open source projects, NGOs, and businesses. But there are still many more use cases to support than what our base and instruct models can serve. We hope that Code Llama will inspire others to leverage Llama 2 to create new innovative tools for research and commercial products.

Explore more on Code Llama
Discover more about Code Llama by visiting our resources, ranging from our research paper to the getting started guide and more: Code Llama GitHub repository.
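As a concrete starting point, the hedged sketch below shows one common way to prompt a Code Llama base model for completion with the Hugging Face transformers library. The checkpoint name and generation settings are illustrative assumptions, not an official recipe; substitute whichever Code Llama model you have access to.

# Minimal sketch: code completion with a Code Llama base model via
# Hugging Face transformers. The checkpoint name is an assumption; swap in
# whichever Code Llama model you have downloaded or been granted access to.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The model continues the code it is given, e.g. a docstring-led function stub.
prompt = 'def fibonacci(n):\n    """Return the first n Fibonacci numbers."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

For instruction-style requests ("Write me a function that..."), the Instruct variants and their chat prompt format (described later in this document for the 70B model) are the better fit.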
  16. ----------
  17. Meta Llama 3

Build the future of AI with Meta Llama 3
Now available with both 8B and 70B pretrained and instruction-tuned versions to support a wide range of applications.

Experience Llama 3 with Meta AI
We've integrated Llama 3 into Meta AI, our intelligent assistant, which expands the ways people can get things done, create, and connect with Meta AI. You can see first-hand the performance of Llama 3 by using Meta AI for coding tasks and problem solving. Whether you're developing agents or other AI-powered applications, Llama 3 in both 8B and 70B will offer the capabilities and flexibility you need to develop your ideas.

Enhanced performance
Experience the state-of-the-art performance of Llama 3, an openly accessible model that excels at language nuances, contextual understanding, and complex tasks like translation and dialogue generation. With enhanced scalability and performance, Llama 3 can handle multi-step tasks effortlessly, while our refined post-training processes significantly lower false refusal rates, improve response alignment, and boost diversity in model answers. Additionally, it drastically elevates capabilities like reasoning, code generation, and instruction following.

Download Llama 3 | Getting Started Guide
With each Meta Llama download request, you will receive: Meta Llama Guard 2, Community license agreement.

Llama 3 models take data and scale to new heights. They've been trained on our two recently announced custom-built 24K GPU clusters on over 15T tokens of data, a training dataset 7x larger than that used for Llama 2, including 4x more code. This results in the most capable Llama model yet, which supports an 8K context length that doubles the capacity of Llama 2.

Trust & safety: a comprehensive approach to responsibility
With the release of Llama 3, we've updated the Responsible Use Guide (RUG) to provide the most comprehensive information on responsible development with LLMs. Our system-centric approach includes updates to our trust and safety tools with Llama Guard 2 (optimized to support the newly announced taxonomy published by MLCommons, expanding its coverage to a more comprehensive set of safety categories), Code Shield, and Cybersec Eval 2. In line with the principles outlined in our RUG, we recommend thorough checking and filtering of all inputs to and outputs from LLMs based on your unique content guidelines for your intended use case and audience.

Explore more on Meta Llama 3
Introducing Meta Llama 3: The most capable openly available LLM to date (blog)
Meet Your New Assistant: Meta AI, Built With Llama 3
Meta Llama 3 repository
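For developers who want to try the instruction-tuned models directly, the hedged sketch below queries Llama 3 8B Instruct through the Hugging Face transformers chat-template utilities. The model ID (a gated checkpoint), dtype, and sampling settings are assumptions based on common usage, not an official example.

# Minimal sketch: chatting with Llama 3 8B Instruct via transformers.
# Model ID and generation settings are assumptions; adjust to your setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed gated HF model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant for travel tips."},
    {"role": "user", "content": "What is France's capital?"},
]

# apply_chat_template assembles the <|start_header_id|>/<|eot_id|> prompt format.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))

The underlying prompt format (special tokens and role headers) is documented in the Model Cards and Prompt Formats page later in this document.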
  18. ----------
  19. Meta Llama 3 License

META LLAMA 3 COMMUNITY LICENSE AGREEMENT
Meta Llama 3 Version Release Date: April 18, 2024

"Agreement" means the terms and conditions for use, reproduction, distribution and modification of the Llama Materials set forth herein.

"Documentation" means the specifications, manuals and documentation accompanying Meta Llama 3 distributed by Meta at https://llama.meta.com/get-started/

"Licensee" or "you" means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity's behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering in this Agreement on their behalf.

"Meta Llama 3" means the foundational large language models and software and algorithms, including machine-learning model code, trained model weights, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Meta at https://llama.meta.com/llama-downloads

"Llama Materials" means, collectively, Meta's proprietary Meta Llama 3 and Documentation (and any portion thereof) made available under this Agreement.

"Meta" or "we" means Meta Platforms Ireland Limited (if you are located in or, if you are an entity, your principal place of business is in the EEA or Switzerland) and Meta Platforms, Inc. (if you are located outside of the EEA or Switzerland).

By clicking "I Accept" below or by using or distributing any portion or element of the Llama Materials, you agree to be bound by this Agreement.

1. License Rights and Redistribution.
a. Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Meta's intellectual property or other rights owned by Meta embodied in the Llama Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Llama Materials.
b. Redistribution and Use.
i. If you distribute or make available the Llama Materials (or any derivative works thereof), or a product or service that uses any of them, including another AI model, you shall (A) provide a copy of this Agreement with any such Llama Materials; and (B) prominently display "Built with Meta Llama 3" on a related website, user interface, blogpost, about page, or product documentation. If you use the Llama Materials to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include "Llama 3" at the beginning of any such AI model name.
ii. If you receive Llama Materials, or any derivative works thereof, from a Licensee as part of an integrated end user product, then Section 2 of this Agreement will not apply to you.
iii. You must retain in all copies of the Llama Materials that you distribute the following attribution notice within a "Notice" text file distributed as a part of such copies: "Meta Llama 3 is licensed under the Meta Llama 3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved."
iv. Your use of the Llama Materials must comply with applicable laws and regulations (including trade compliance laws and regulations) and adhere to the Acceptable Use Policy for the Llama Materials (available at https://llama.meta.com/llama3/use-policy), which is hereby incorporated by reference into this Agreement.
v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Meta Llama 3 or derivative works thereof).

2. Additional Commercial Terms. If, on the Meta Llama 3 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee's affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, AND META DISCLAIMS ALL WARRANTIES OF ANY KIND, BOTH EXPRESS AND IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE LLAMA MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS.

4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.

5. Intellectual Property.
a. No trademark licenses are granted under this Agreement, and in connection with the Llama Materials, neither Meta nor Licensee may use any name or mark owned by or associated with the other or any of its affiliates, except as required for reasonable and customary use in describing and redistributing the Llama Materials or as set forth in this Section 5(a). Meta hereby grants you a license to use "Llama 3" (the "Mark") solely as required to comply with the last sentence of Section 1.b.i. You will comply with Meta's brand guidelines (currently accessible at https://about.meta.com/brand/resources/meta/company-brand/). All goodwill arising out of your use of the Mark will inure to the benefit of Meta.
b. Subject to Meta's ownership of Llama Materials and derivatives made by or for Meta, with respect to any derivative works and modifications of the Llama Materials that are made by you, as between you and Meta, you are and will be the owner of such derivative works and modifications.
c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Meta Llama 3 outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Llama Materials.

6. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the Llama Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Meta may terminate this Agreement if you are in breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the termination of this Agreement.

7. Governing Law and Jurisdiction. This Agreement will be governed and construed under the laws of the State of California without regard to choice of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. The courts of California shall have exclusive jurisdiction of any dispute arising out of this Agreement.
  20. ----------
  21. Meta Llama 3 | Model Cards and Prompt Formats

Model Cards & Prompt Formats
You can find details about this model in the model card.

Special Tokens used with Meta Llama 3
<|begin_of_text|>: This is equivalent to the BOS token.
<|eot_id|>: This signifies the end of the message in a turn.
<|start_header_id|>{role}<|end_header_id|>: These tokens enclose the role for a particular message. The possible roles can be: system, user, assistant.
<|end_of_text|>: This is equivalent to the EOS token. On generating this token, Llama 3 will cease to generate more tokens.

Meta Llama 3
Code to produce this prompt format can be found here.
Note: Newlines (0x0A) are part of the prompt format; for clarity in the example, they have been represented as actual new lines.

<|begin_of_text|>{{ user_message }}

Meta Llama 3 Instruct
A prompt can optionally contain a single system message, or multiple alternating user and assistant messages, but always ends with the last user message followed by the assistant header. Code to generate this prompt format can be found here.
Notes: Newlines (0x0A) are part of the prompt format; for clarity in the examples, they have been represented as actual new lines. The model expects the assistant header at the end of the prompt to start completing it.

Decomposing an example instruct prompt with a system message:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant for travel tips and recommendations<|eot_id|><|start_header_id|>user<|end_header_id|>

What can you help me with?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

<|begin_of_text|>: Specifies the start of the prompt
<|start_header_id|>system<|end_header_id|>: Specifies the role for the following message, i.e. "system"
You are a helpful AI assistant for travel tips and recommendations: The system message
<|eot_id|>: Specifies the end of the input message
<|start_header_id|>user<|end_header_id|>: Specifies the role for the following message, i.e. "user"
What can you help me with?: The user message
<|start_header_id|>assistant<|end_header_id|>: Ends with the assistant header, to prompt the model to start generation.

Following this prompt, Llama 3 completes it by generating the {{assistant_message}}. It signals the end of the {{assistant_message}} by generating the <|eot_id|> token.

Example prompt with a single user message:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

What is France's capital?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

System prompt and multiple turn conversation between the user and assistant:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant for travel tips and recommendations<|eot_id|><|start_header_id|>user<|end_header_id|>

What is France's capital?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Bonjour! The capital of France is Paris!<|eot_id|><|start_header_id|>user<|end_header_id|>

What can I do there?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Paris, the City of Light, offers a romantic getaway with must-see attractions like the Eiffel Tower and Louvre Museum, romantic experiences like river cruises and charming neighborhoods, and delicious food and drink options, with helpful tips for making the most of your trip.<|eot_id|><|start_header_id|>user<|end_header_id|>

Give me a detailed list of the attractions I should visit, and time it takes in each one, to plan my trip accordingly.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
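To make the format above concrete, here is a small, hedged helper that assembles a Llama 3 Instruct prompt string from a list of role/content messages, following the special tokens documented on this page. The function name and message structure are illustrative, not an official API.

# Sketch: build a Meta Llama 3 Instruct prompt from the documented special
# tokens. Helper name and message structure are illustrative.
from typing import Dict, List

def build_llama3_prompt(messages: List[Dict[str, str]]) -> str:
    """messages: [{"role": "system"|"user"|"assistant", "content": "..."}]"""
    prompt = "<|begin_of_text|>"
    for msg in messages:
        prompt += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
        prompt += f"{msg['content']}<|eot_id|>"
    # End with the assistant header so the model starts generating its reply.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

print(build_llama3_prompt([
    {"role": "system", "content": "You are a helpful AI assistant for travel tips and recommendations"},
    {"role": "user", "content": "What is France's capital?"},
]))

In practice, tokenizer chat templates (for example in transformers or the official Llama 3 repository) produce this same layout automatically; the helper above is only meant to mirror the decomposition shown on this page.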
  22. ----------
  23. Meta Llama Guard 2 | Model Cards and Prompt Formats

Meta Llama Guard 2
You can find details about this model in the model card.

Prompt format
As the guardrails can be applied both on the input and output of the model, there are two different prompts: one for user input and the other for agent output. The role placeholder can have the values User or Agent. The former refers to the input and the latter to the output. When evaluating the user input, the agent response must not be present in the conversation. The llama-recipes repository has a helper function and an inference example that shows how to properly format the prompt with the provided categories. This can be used as a template to create custom categories for the prompt.

The variables to replace in this prompt template are:
{{ role }}: It can have the values User or Agent.
{{ unsafe_categories }}: The default categories and their descriptions are shown below. These can be customized for zero-shot or few-shot prompting.
{{ user_message }}: input message from the user.
{{ model_answer }}: output from the model.

<|begin_of_text|>[INST] Task: Check if there is unsafe content in '{{ role }}' messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{{ unsafe_categories }}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

User: {{ user_message_1 }}

Agent: {{ model_answer_1 }}

<END CONVERSATION>

Provide your safety assessment for ONLY THE LAST {{ role }} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]

Llama Guard 2 was trained on the following categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
S1: Violent Crimes.
S2: Non-Violent Crimes.
S3: Sex Crimes.
S4: Child Exploitation.
S5: Specialized Advice.
S6: Privacy.
S7: Intellectual Property.
S8: Indiscriminate Weapons.
S9: Hate.
S10: Self-Harm.
S11: Sexual Content.
<END UNSAFE CONTENT CATEGORIES>

Additionally, the full description for each category can be set in the prompt as well.

<BEGIN UNSAFE CONTENT CATEGORIES>
S1: Violent Crimes.
AI models should not create content that enables, encourages, or excuses the commission of violent crimes. Examples of violent crimes include, but are not limited to:
- Unlawful violence toward people (ex: terrorism, genocide, murder, hate crimes, child abuse, assault, battery, kidnapping)
- Unlawful violence toward animals (ex: animal abuse)
S2: Non-Violent Crimes.
AI models should not create content that enables, encourages, or excuses the commission of non-violent crimes. Examples of non-violent crimes include, but are not limited to:
- Financial crimes (ex: fraud, scams, money laundering)
- Property crimes (ex: burglary, robbery, arson, vandalism)
- Drug crimes (ex: creating or trafficking narcotics)
- Weapons crimes (ex: producing unlicensed firearms)
- Cyber crimes (ex: hacking, spyware, malware)
S3: Sex Crimes.
AI models should not create content that enables, encourages, or excuses the commission of sex crimes. Examples of sex crimes include, but are not limited to:
- Human trafficking
- Sexual assault (ex: rape)
- Sexual harassment (ex: groping)
- Lewd conduct
S4: Child Exploitation.
AI models should not create content that depicts child nudity or that enables, encourages, excuses, or depicts the sexual abuse of children.
S5: Specialized Advice.
AI models should not create content that contains specialized financial, medical, or legal advice.
S6: Privacy.
AI models should not create content that contains sensitive, personal information about private individuals.
S7: Intellectual Property.
AI models should not create content that violates the intellectual property rights of third parties.
S8: Indiscriminate Weapons.
AI models should not create content that enables the creation of weapons of mass destruction. Examples of such weapons include, but are not limited to:
- Chemical weapons (ex: nerve gas)
- Biological weapons (ex: anthrax)
- Radiological weapons (ex: salted bombs)
- Nuclear weapons (ex: atomic warheads)
- High-yield explosive weapons (ex: cluster munitions)
S9: Hate.
AI models should not create content that is hateful toward people on the basis of their protected characteristics (race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity and serious disease). AI models should not create content that perpetuates negative stereotypes.
S10: Self-Harm.
AI models should not create content that enables, encourages, or excuses acts of intentional self-harm. Examples of such actions include, but are not limited to:
- Suicide
- Self-injury (ex: cutting)
- Disordered eating
S11: Sexual Content.
AI models should not create depictions of nude adults or content that contains erotic descriptions or explicit depictions of sex acts.
<END UNSAFE CONTENT CATEGORIES>
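As a sketch of how the pieces above fit together, the helper below fills the Llama Guard 2 template with a conversation and a category list, and parses the model's 'safe'/'unsafe' verdict. The function names are illustrative; the llama-recipes repository contains the official formatting helper.

# Sketch: format a Llama Guard 2 prompt and parse the verdict. Names are
# illustrative; see the llama-recipes helper for the official implementation.
LLAMA_GUARD_2_TEMPLATE = """<|begin_of_text|>[INST] Task: Check if there is unsafe content in '{role}' messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{unsafe_categories}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

{conversation}

<END CONVERSATION>

Provide your safety assessment for ONLY THE LAST {role} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""

DEFAULT_S_CATEGORIES = """S1: Violent Crimes.
S2: Non-Violent Crimes.
S3: Sex Crimes.
S4: Child Exploitation.
S5: Specialized Advice.
S6: Privacy.
S7: Intellectual Property.
S8: Indiscriminate Weapons.
S9: Hate.
S10: Self-Harm.
S11: Sexual Content."""

def build_guard2_prompt(role, user_message, model_answer=None,
                        categories=DEFAULT_S_CATEGORIES):
    # Per the notes above, the Agent turn is omitted when evaluating user input.
    conversation = f"User: {user_message}"
    if model_answer is not None:
        conversation += f"\n\nAgent: {model_answer}"
    return LLAMA_GUARD_2_TEMPLATE.format(
        role=role, unsafe_categories=categories, conversation=conversation
    )

def parse_guard_verdict(text):
    """Return (is_safe, [violated category codes]) from the model's reply."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or lines[0].lower() == "safe":
        return True, []
    violated = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in violated]

# Example: moderate a user message, then parse a hypothetical model reply.
prompt = build_guard2_prompt("User", "How do I make a fake ID?")
print(parse_guard_verdict("unsafe\nS2"))  # -> (False, ['S2'])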
  24. ----------
  25. Meta Code Llama 70B | Model Cards and Prompt Formats

Meta Code Llama 70B
You can find details about this model in the model card. Note that Meta Code Llama 70B uses the same model card as Meta Code Llama 7B, 13B, and 34B.

Completion
In this format, the model continues to write code following the provided code in the prompt. An implementation of this prompt can be found here.

<s>{{ code_prompt }}

Instructions
Meta Code Llama 70B has a different prompt template compared to 34B, 13B and 7B. It starts with a Source: system tag, which can have an empty body, and continues with alternating user or assistant values. Each turn of the conversation uses the <step> special character to separate the messages. The last turn of the conversation uses a Source: assistant tag with an empty message and a Destination: user tag to prompt the model to answer the user question. A detailed implementation of this format is provided below.

Notes:
The structure requires a Source: system tag, but the system prompt can be empty.
Each user query is preceded by a blank line.
At the end of the prompt is a blank line followed by a line containing a space character (0x20).

<s>Source: system

 System prompt <step> Source: user

 First user query <step> Source: assistant

 Model response to first query <step> Source: user

 Second user query <step> Source: assistant
Destination: user

 
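To illustrate the structure described above, here is a hedged helper that assembles the 70B Instruct prompt from a system prompt and alternating user/assistant turns. The function name is illustrative, and the exact whitespace is an approximation of the notes on this page; the official tokenizer chat template should be preferred where available.

# Sketch: build a Meta Code Llama 70B Instruct prompt using the Source/
# Destination tags and <step> separators described above. Helper name is
# illustrative; whitespace approximates the notes on this page.
from typing import List, Tuple

def build_codellama70b_prompt(system: str, turns: List[Tuple[str, str]]) -> str:
    """turns: list of (role, message) with role in {"user", "assistant"}."""
    parts = [f"<s>Source: system\n\n {system} <step> "]
    for role, message in turns:
        parts.append(f"Source: {role}\n\n {message} <step> ")
    # Final tags (assistant source, user destination) prompt the model to
    # answer the last user query; the prompt ends with a space character.
    parts.append("Source: assistant\nDestination: user\n\n ")
    return "".join(parts)

print(build_codellama70b_prompt(
    "You are a helpful coding assistant.",
    [("user", "Write a function that returns the nth Fibonacci number.")],
))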
  26. ----------
  27. Meta Llama Guard 1 | Model Cards and Prompt formats

Model Cards & Prompt formats: Meta Llama Guard 1

You can find details about this model in its model card.

As the guardrails can be applied both on the input and the output of the model, there are two different prompts: one for user input and the other for agent output. The role placeholder can have the values User or Agent; the former refers to the input and the latter to the output. When evaluating the user input, the agent response must not be present in the conversation. The llama-recipes repository has a helper function and an inference example that show how to properly format the prompt with the provided categories. This can be used as a template to create custom categories for the prompt.

<s>[INST] Task: Check if there is unsafe content in '{{ role }}' messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{{ unsafe_categories }}
<END UNSAFE CONTENT CATEGORIES>

User: {{ user_message_1 }}

Agent: {{ model_answer_1 }}

Provide your safety assessment for {{ role }} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]

Llama Guard was trained on the following categories, which can be adapted through prompt engineering.

<BEGIN UNSAFE CONTENT CATEGORIES>
O1: Violence and Hate.
Should not
- Help people plan or engage in violence.
- Encourage people to engage in violence.
- Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics like their race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.
- Encourage discrimination against people based on those characteristics and should not use slurs or create insults based on those characteristics.
Can
- Provide information on violence and discrimination.
- Discuss the topics of hate and violence, and can discuss historical events involving violence.
O2: Sexual Content.
Should not
- Engage in sexually explicit (i.e., erotic) conversations and should not encourage people to engage in sexual activities.
Can
- Discuss topics related to sexuality and sexual education.
- Provide general information regarding sexual activities.
- Provide guidance on topics related to sexual health.
O3: Criminal Planning.
Should not
- Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.
- Encourage people to engage in criminal activities or condone criminal behavior.
Can
- Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.
- Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.
O4: Guns and Illegal Weapons.
Should not
- Assist in the planning or execution of firearm or illegal weapons crimes, especially when the person indicates that they may act on the information provided to them.
- Encourage people to engage in firearm or illegal weapons crimes or condone such behavior.
Can
- Discuss firearms and the arguments for and against firearm ownership.
O5: Regulated or Controlled Substances.
Should not
- Encourage or assist people in the trafficking, consumption, or use of regulated or controlled substances, including but not limited to illegal drugs, tobacco, alcohol, and cannabis.
- Assist or encourage people to create such substances when it is illegal to do so.
Can
- Discuss the use of regulated or controlled substances, and can provide general information on those substances (ex: what their chemical composition is and what their effects are).
- Provide information on how those substances are manufactured, provided that the information is not so detailed as to provide step-by-step instructions for their creation.
O6: Self-Harm.
Should not
- Encourage people to harm themselves.
- Romanticize or condone self-harm or suicide.
- Provide information regarding the methods of suicide or self-harm.
- Provide any information or advice, other than information regarding appropriate health resources, to people who express or imply that they wish to harm themselves.
Can
- Provide general information regarding self-harm and suicide (ex: statistics, the mental health challenges faced by those who take these actions).
Should
- Provide information regarding appropriate health resources to people who imply or express the desire to harm themselves.
<END UNSAFE CONTENT CATEGORIES>
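Because the taxonomy above can be adapted through prompt engineering, the short sketch below shows how a custom category block could be assembled and dropped into the Llama Guard 1 template. The helper variables and the example policy are hypothetical and purely illustrative.

# Hypothetical custom policy in the O-category style used by Llama Guard 1.
CUSTOM_CATEGORIES = """O1: Confidential Information.
Should not
- Reveal internal project code names or unreleased product details.
Can
- Discuss publicly announced products."""

GUARD1_TEMPLATE = """<s>[INST] Task: Check if there is unsafe content in '{role}' messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{categories}
<END UNSAFE CONTENT CATEGORIES>

User: {user_message}

Provide your safety assessment for {role} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""

prompt = GUARD1_TEMPLATE.format(
    role="User",
    categories=CUSTOM_CATEGORIES,
    user_message="What is the code name of the next phone?",
)
print(prompt)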
  28. ----------
  29. Meta Code Llama | Model Cards and Prompt formats

Model Cards & Prompt formats: Meta Code Llama 7B, 13B, and 34B

You can find details about this model in its model card.

Completion
In this format, the model continues to write code following the code that is provided in the prompt. An implementation of this prompt format is also available.

Instructions
The instructions prompt template for Meta Code Llama follows the same structure as the Meta Llama 2 chat model, where the system prompt is optional, and the user and assistant messages alternate, always ending with a user message. Note the beginning of sequence (BOS) token between each user and assistant message. An implementation for Meta Code Llama is also available.

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message_1 }} [/INST] {{ model_answer_1 }} </s> <s>[INST] {{ user_message_2 }} [/INST]

Infilling
Infilling can be done in two different ways: with the prefix-suffix-middle format or the suffix-prefix-middle format. An implementation of this format is also available. Infilling is only available in the 7B and 13B base models, not in the Python, Instruct, 34B, or 70B models. The BOS character is not used for infilling when encoding the prefix or suffix, but only at the beginning of each prompt.

Prefix-suffix-middle
<s><PRE>{{ code_prefix }}<SUF>{{ code_suffix }}<MID>

Suffix-prefix-middle
<s><PRE><SUF>{{ code_suffix }}<MID>{{ code_prefix }}
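To illustrate the two infilling formats above, here is a minimal sketch that assembles both prompt variants as plain strings. The function names are hypothetical, and in practice the special tokens (<PRE>, <SUF>, <MID>) must be encoded by the Code Llama tokenizer rather than typed as literal text.

def psm_prompt(code_prefix, code_suffix):
    # Prefix-suffix-middle: the model fills in code between the prefix and the suffix.
    return f"<s><PRE>{code_prefix}<SUF>{code_suffix}<MID>"

def spm_prompt(code_prefix, code_suffix):
    # Suffix-prefix-middle variant of the same infilling task.
    return f"<s><PRE><SUF>{code_suffix}<MID>{code_prefix}"

prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"
print(psm_prompt(prefix, suffix))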
  30. ----------
  31. Meta Llama 2 | Model Cards and Prompt formats

Model Cards & Prompt formats: Meta Llama 2

You can find details about this model in its model card.

Special Tokens used with Meta Llama 2
<s> </s> : These are the BOS and EOS tokens from SentencePiece. When multiple messages are present in a multi-turn conversation, they separate them, including the user input and model response.
[INST] [/INST] : These tokens enclose user messages in multi-turn conversations.
<<SYS>> <</SYS>> : These enclose the system message.

The base model supports text completion, so any incomplete user prompt, without special tags, will prompt the model to complete it. The tokenizer provided with the model will include the SentencePiece beginning of sequence (BOS) token (<s>) if requested. Review this code for details.

<s>{{ user_prompt }}

Meta Llama 2 Chat
Code to produce this prompt format is also available. The system prompt is optional.

Single message instance with optional system prompt:

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]

Multiple user and assistant messages example:

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message_1 }} [/INST] {{ model_answer_1 }} </s> <s>[INST] {{ user_message_2 }} [/INST]
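As a worked example of the special tokens above, the sketch below assembles a single-turn Llama 2 Chat prompt as a plain string. build_llama2_chat_prompt is a hypothetical helper; production code should rely on the tokenizer and chat template shipped with the model rather than hand-built strings.

def build_llama2_chat_prompt(user_message, system_prompt=None):
    # <<SYS>>...<</SYS>> wraps the optional system message inside the first [INST] block.
    sys_block = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n" if system_prompt else ""
    return f"<s>[INST] {sys_block}{user_message} [/INST]"

print(build_llama2_chat_prompt("What is the capital of France?",
                               system_prompt="You answer concisely."))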
  32. ----------
  33. Getting the models

You can get the Meta Llama models directly from Meta or through Hugging Face or Kaggle. However you get the models, you will first need to accept the license agreements for the models you want. For more detailed information about each of the Meta Llama models, see the Model Cards section immediately following this section.

To get the models directly from Meta, go to the Meta Llama download form. Fill in your information, including your email. Select the models that you want, and review and accept the appropriate license agreements. For each model that you request, you will receive an email that contains instructions and a pre-signed URL to download that model. You can use the same URL to download multiple model weights, such as 7B and 13B. The URL expires after 24 hours or five downloads, but you can re-request models in order to receive fresh pre-signed URLs.

The model download process uses a script that relies on the following tools: wget and md5sum; ensure that these are available on your local computer.
  34. ----------
  35. Hugging Face | Getting the models

To obtain the models from Hugging Face (HF), sign into your account at https://huggingface.co/meta-llama and select the model you want. You will be taken to a page where you can fill in your information and review the appropriate license agreement. After accepting the agreement, your information is reviewed; the review process could take up to a few days. When you are approved, you will receive an email informing you that you have access to the HF repository for the model.

Note that cloning the HF repository to a local computer does not give you all the model files, because some of the files are too large. In the local clone, those files contain only metadata for the actual file. To get these larger files, go to the file in the repository on the HF site and download it directly from there. For example, to get consolidated.00.pth for the Meta Llama 2 7B model, you download it from: https://huggingface.co/meta-llama/Llama-2-7b/blob/main/consolidated.00.pth
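Large files can also be fetched programmatically instead of through the browser. The sketch below uses the huggingface_hub library's hf_hub_download function for the example file mentioned above; it assumes you have already been granted access to the gated repository and have a Hugging Face access token.

from huggingface_hub import hf_hub_download

# Downloads the file into the local HF cache and returns its path.
# Requires prior license approval for the meta-llama repo and a valid access token.
path = hf_hub_download(
    repo_id="meta-llama/Llama-2-7b",
    filename="consolidated.00.pth",
    token="hf_...",  # or set the HF_TOKEN environment variable instead
)
print(path)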
  36. ----------
  37. Kaggle | Getting the models

To obtain the models from Kaggle, including the HF versions of the models, sign into your account at: https://www.kaggle.com/organizations/metaresearch/models

Before you can access the models on Kaggle, you need to submit a request for model access, which requires that you accept the model license agreement on the Meta site. Note that the email address that you provide when you accept the license agreement must be the same as the email that you use for your Kaggle account. Once you have accepted the license agreement, return to Kaggle and submit the request for model access.

When your request is approved, which might take a few days, you'll receive an email that says that you have received access. You'll then be able to access the models on Kaggle. To access a particular model, select it from the Model Variations dropdown box and click the download icon. An archive file that contains the model will start downloading.
  38. ----------
  39. Llama Everywhere

Although Meta Llama models are often hosted by Cloud Service Providers (CSPs), Meta Llama can be used in other contexts as well, such as Linux, the Windows Subsystem for Linux (WSL), macOS, Jupyter notebooks, and even mobile devices. If you are interested in exploring these scenarios, we suggest that you check out the following resources:

Llama 3 on Your Local Computer, with Resources for Other Options - How to run Llama on your desktop using Windows, macOS, or Linux. Also, pointers to other ways to run Llama, either on premise or in the cloud.
Llama Recipes QuickStart - Provides an introduction to Meta Llama using Jupyter notebooks and also demonstrates running Llama locally on macOS.
Machine Learning Compilation for Large Language Models (MLC LLM) - Enables "everyone to develop, optimize and deploy AI models natively on everyone's devices with ML compilation techniques."
Llama C++ - Uses the portability of C++ to enable inference with Llama models on a variety of different hardware.
  40. ----------
  41. Running Meta Llama on Linux | Llama Everywhere

Running Meta Llama on Linux

This tutorial is a part of our Build with Meta Llama series, where we demonstrate the capabilities and practical applications of Llama for developers like you, so that you can leverage the benefits that Llama has to offer and incorporate it into your own applications. This tutorial supports the video Running Llama on Linux | Build with Meta Llama, where we learn how to run Llama on Linux by getting the weights and running the model locally, with a step-by-step tutorial to help you follow along. If you're interested in learning by watching or listening, check out our video on Running Llama on Linux.

Introduction to Llama models
At Meta, we strongly believe in an open approach to AI development, particularly in the fast-evolving domain of generative AI. By making AI models publicly accessible, we enable their advantages to reach every segment of society. Last year, we open sourced Meta Llama 2, and this year we released the Meta Llama 3 family of models, available in both 8B and 70B pretrained and instruction-tuned versions to support a wide range of applications, unlocking the power of these large language models and making them accessible to everyone, so you can experiment, innovate, and scale your ideas responsibly.

Setup
With a Linux setup that has a GPU with a minimum of 16GB VRAM, you should be able to load the 8B Llama models in fp16 locally. If you have an Nvidia GPU, you can confirm your setup using the NVIDIA System Management Interface tool, which shows you the GPU you have, the VRAM available, and other useful information, by typing:

nvidia-smi

In our current setup, we are on Pop!_OS, an Ubuntu-based distribution, and have an Nvidia RTX 4090 with a total VRAM of about 24GB.

[Screenshot: terminal with nvidia-smi showing the NVIDIA GPU configuration]

Getting the weights
To download the weights, go to the Llama website. Fill in your details in the form and select the models you'd like to download. In our case, we will download the Llama 3 models, so we select Meta Llama 3 and Meta Llama Guard 2 on the download page. Read and agree to the license agreement, then click Accept and continue. You will see a unique URL on the website. You will also receive the URL in your email; it is valid for 24 hours and allows you to download each model up to 5 times. You can always request a new URL.

[Screenshot: download page with unique pre-signed URL]

We are now ready to get the weights and run the model locally on our machine. It is recommended to use a Python virtual environment for running this demo. In this demo, we are using Miniconda, but you can use any virtual environment of your choice. Open your terminal, make a new folder called llama3-demo in your workspace, navigate to the new folder, and clone the Llama repo:

mkdir llama3-demo
cd llama3-demo
git clone https://github.com/meta-llama/llama3.git

For this demo, we'll need two prerequisites installed: wget and md5sum. To confirm if your distribution has these, use:

wget --version
md5sum --version

which should return the installed versions.
If your distribution does not have these, you can install them using:

apt-get install wget
apt-get install md5sum

To make sure we have all the package dependencies installed, while in the newly cloned repo folder, type:

pip install -e .

We are now all set to download the model weights for our local setup. Our team has created a helper script to make it easy to download the model weights. In your terminal, type:

./download.sh

The script will ask for the URL from your email. Paste in the URL you received from Meta. It will then ask you to enter the list of models to download. For our example, we'll download the 8B pretrained model and the fine-tuned 8B chat model, so we'll enter "8B,8B-instruct".

[Screenshot: downloading the 8B models]

Running the model
We are all set to run the example inference script to test if our model has been set up correctly and works. Our team has created an example Python script called example_text_completion.py that you can use to test out the model. The script defines a main function that uses the Llama class from the llama library to generate text completions for given prompts using the pre-trained models. It takes a few arguments:

ckpt_dir: str - Directory containing the checkpoint files of the model.
tokenizer_path: str - Path to the tokenizer of the model.
temperature: float = 0.6 - Controls the randomness of the generation process. Higher values may lead to more creative but less coherent outputs, while lower values may lead to more conservative but more coherent outputs.
top_p: float = 0.9 - The maximum probability threshold for generating tokens.
max_seq_len: int = 128 - The maximum length of the input sequence or prompt allowed for the model to process.
max_gen_len: int = 64 - The maximum length of the generated text the model is allowed to produce.
max_batch_size: int = 4 - The maximum number of prompts to process in one batch.

The function builds an instance of the Llama class using the provided arguments, then defines a list of prompts for which the model will generate completions using the generator.text_completion method.

To run the script, go back to your terminal, and while in the llama3 repo, type:

torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir Meta-Llama-3-8B/ --tokenizer_path Meta-Llama-3-8B/tokenizer.model --max_seq_len 128 --max_batch_size 4

Replace Meta-Llama-3-8B/ with the path to your checkpoint directory and tokenizer.model with the path to your tokenizer model. If you run it from this main directory, the path may not need to change. Set --nproc_per_node to the MP value for the model you are using. For 8B models, the value is set to 1. Adjust the max_seq_len and max_batch_size parameters as needed; we have set them to 128 and 4 respectively.

[Screenshot: running the 8B model on the example text completion script]

To try out the fine-tuned chat model (8B-instruct), we have a similar example called example_chat_completion.py:

torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir Meta-Llama-3-8B-Instruct/ --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 6

Note that in this case, we use the Meta-Llama-3-8B-Instruct/ model and provide the correct tokenizer under the instruct model folder.
[Screenshot: running the 8B Instruct model on the example chat completion script]

A detailed step-by-step walkthrough of this setup, as well as all the helper and example scripts, can be found in our Llama 3 GitHub repo, which covers the download and quick-start process as well as examples for inference.
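For orientation, here is a rough sketch of what example_text_completion.py does with the Llama class. This is a simplified approximation of the script in the llama3 repo, not a drop-in replacement, and it still needs to be launched with torchrun because Llama.build initializes the distributed runtime.

# Simplified sketch of the text-completion flow used by example_text_completion.py.
# Launch with: torchrun --nproc_per_node 1 sketch.py
from llama import Llama

generator = Llama.build(
    ckpt_dir="Meta-Llama-3-8B/",
    tokenizer_path="Meta-Llama-3-8B/tokenizer.model",
    max_seq_len=128,
    max_batch_size=4,
)

prompts = ["I believe the meaning of life is"]
results = generator.text_completion(
    prompts,
    max_gen_len=64,
    temperature=0.6,
    top_p=0.9,
)
for prompt, result in zip(prompts, results):
    print(prompt + result["generation"])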
  42. ----------
  43. Running Meta Llama on Windows | Llama Everywhere

Running Meta Llama on Windows

This tutorial is a part of our Build with Meta Llama series, where we demonstrate the capabilities and practical applications of Llama for developers like you, so that you can leverage the benefits that Llama has to offer and incorporate it into your own applications. This tutorial supports the video Running Llama on Windows | Build with Meta Llama, where we learn how to run Llama on Windows using Hugging Face APIs, with a step-by-step tutorial to help you follow along. If you're interested in learning by watching or listening, check out our video on Running Llama on Windows.

For this demo, we will be using a Windows machine with an RTX 4090 GPU. If you have an Nvidia GPU, you can confirm your setup by opening the Terminal and typing nvidia-smi (the NVIDIA System Management Interface), which will show you the GPU you have, the VRAM available, and other useful information about your setup. Since we will be using the Hugging Face transformers library, this setup can also be used on other operating systems that the library supports, such as Linux or Mac, using similar steps to the ones shown in the video.

To allow easy access to Meta Llama models, we are providing them on Hugging Face, where you can download the models in both transformers and native Llama 3 formats. To download the weights, visit the meta-llama repo containing the model you'd like to use. For example, we will use the Meta-Llama-3-8B-Instruct model for this demo. Read and agree to the license agreement, fill in your details, and click submit. Once your request is approved, you'll be granted access to all the Llama 3 models.

[Screenshot: Meta-Llama-3-8B-Instruct model on Hugging Face]

For this tutorial, we will be using Meta Llama models already converted to Hugging Face format. However, if you'd like to download the original native weights, click on the "Files and versions" tab and download the contents of the original folder. If you prefer, you can also download the original weights from the command line using the Hugging Face CLI:

pip install huggingface-hub
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --include "original/*" --local-dir meta-llama/Meta-Llama-3-8B-Instruct

In this example, we will showcase how you can use Meta Llama models already converted to Hugging Face format with Transformers. To use the model with Transformers, we will be using the pipeline class from Hugging Face. We recommend that you use a Python virtual environment for running this demo. In this demo, we are using Miniconda, but you can use any virtual environment of your choice.

Make sure to use the latest version of transformers:

pip install -U transformers

We will also use the accelerate library, which enables our code to be run across any distributed configuration:

pip install accelerate

We will be using Python for our demo script. To install Python, visit the Python website, where you can choose your OS and download the version of Python you like. We will also be using PyTorch for our demo, so we will need to make sure we have PyTorch installed in our setup.
To install PyTorch for your setup, visit the PyTorch downloads website and choose your OS and configuration to get the installation command you need. Paste that command in your terminal and press enter.

[Screenshot: PyTorch installation guide]

For our script, open the editor of your choice and create a Python script. We'll first add the imports that we need for our example:

import transformers
import torch
from transformers import AutoTokenizer

Let's define the model we'd like to use. In our demo, we will use the 8B instruct model, which is fine-tuned for chat:

model = "meta-llama/Meta-Llama-3-8B-Instruct"

We will also instantiate the tokenizer, which can be derived from AutoTokenizer based on the model we've chosen, using the from_pretrained method of AutoTokenizer. This will download and cache the pre-trained tokenizer and return an instance of the appropriate tokenizer class.

tokenizer = AutoTokenizer.from_pretrained(model)

To use our model for inference:

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

Hugging Face pipelines allow us to specify which type of task the pipeline needs to run (text-generation in this case), the model that the pipeline should use to make predictions (specified by model), the precision to use with this model (torch.float16), the device on which the pipeline should run (device_map), and various other options. We set the device_map argument to auto, which means the pipeline will automatically use a GPU if one is available.

Next, let's provide some text prompts as inputs to our pipeline for it to use when it generates responses. Let's define this as the variable sequences:

sequences = pipeline(
    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=400,
)

The pipeline call sets do_sample to True, which allows us to specify the decoding strategy we'd like to use to select the next token from the probability distribution over the entire vocabulary. In our example, we are using top_k sampling. By changing max_length, you can specify how long you'd like the generated response to be. Setting the num_return_sequences parameter to greater than one will let you generate more than one output.

Finally, we add the following loop to print the generated responses:

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Save your script; we will save it as llama3-hf-demo.py. Before we run the script, let's make sure we can access and interact with Hugging Face directly from the terminal. To do that, make sure you have the Hugging Face CLI installed:

pip install -U "huggingface_hub[cli]"

followed by

huggingface-cli login

Here, it will ask us for our access token, which we can get from our HF account under Settings. Copy it and provide it in the command line. We are now all set to run our script:

python llama3-hf-demo.py

[Screenshot: running Meta-Llama-3-8B-Instruct locally]

To check out the full example and run it on your own local machine, see the detailed sample notebook in the llama-recipes GitHub repo. There you will find an example of how to run Llama 3 models using already converted Hugging Face weights, as well as an example that goes over how you can convert the original weights into Hugging Face format and run using those.
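Putting the pieces above together, a complete version of llama3-hf-demo.py might look like the following. This is a sketch assembled from the snippets in this tutorial and assumes you have been granted access to the gated meta-llama repository and have logged in with huggingface-cli login.

# llama3-hf-demo.py - consolidated version of the snippets above (illustrative sketch).
import transformers
import torch
from transformers import AutoTokenizer

model = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model)

# Text-generation pipeline in fp16; device_map="auto" uses a GPU if one is available.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    "I have tomatoes, basil and cheese at home. What can I cook for dinner?\n",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=400,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")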
We've also created various other demos and examples to provide you with guidance and references to help you get started with Llama models and to make it easier for you to integrate them into your own use cases. To try these examples, check out our llama-recipes GitHub repo. There you'll find complete walkthroughs for how to get started with Llama models, including installation instructions, dependencies, and recipes with examples of inference, fine-tuning, and training on custom data sets. In addition, the repo includes demos that showcase Llama deployments, basic interactions, and specialized use cases.
  44. ----------
  45. Running Meta Llama on Mac | Llama Everywhere

Running Meta Llama on Mac

This tutorial is a part of our Build with Meta Llama series, where we demonstrate the capabilities and practical applications of Llama for developers like you, so that you can leverage the benefits that Llama has to offer and incorporate it into your own applications. This tutorial supports the video Running Llama on Mac | Build with Meta Llama, where we learn how to run Llama on macOS using Ollama, with a step-by-step tutorial to help you follow along. If you're interested in learning by watching or listening, check out our video on Running Llama on Mac.

For this demo, we are using a MacBook Pro running Sonoma 14.4.1 with 64GB memory. Since we will be using Ollama, this setup can also be used on other supported operating systems, such as Linux or Windows, using similar steps to the ones shown here. Ollama lets you set up and run large language models, like Llama models, locally.

Downloading Ollama
The first step is to install Ollama. To do that, visit their website, where you can choose your platform, and click on "Download" to download Ollama. For our demo, we will choose macOS and select "Download for macOS".

Next, we will make sure that we can test run Meta Llama 3 models on Ollama. Please note that Ollama provides Meta Llama models in the 4-bit quantized format. To test run the model, let's open our terminal and run:

ollama pull llama3

to download the 4-bit quantized Meta Llama 3 8B chat model, with a size of about 4.7 GB.

[Screenshot: downloading 4-bit quantized Meta Llama models]

If you'd like to download the Llama 3 70B chat model, also in 4-bit, you can instead type:

ollama pull llama3:70b

which, in quantized format, has a size of about 39GB.

Running using ollama run
To run our model, in your terminal, type:

ollama run llama3

We are all set to ask questions and chat with our Meta Llama 3 model. Let's ask a question: "Who wrote the book godfather?"

[Screenshot: Meta Llama model generating a response]

We can see that it gives the right answer, along with more information about the book as well as the movie that was based on the book. What if we just wanted the name of the author, without the extra information? Let's adapt our prompt accordingly, specifying the kind of response we expect: "Who wrote the book godfather? Answer with only the name."

[Screenshot: Meta Llama model generating a specific response based on the prompt]

We can see that it generates the answer in the format we requested. You can also try running the 70B model:

ollama run llama3:70b

but the inference speed will likely be slower.

Running with curl
You can even run and test the Llama 3 8B model directly by using the curl command and specifying your prompt right in the command:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    { "role": "user", "content": "who wrote the book godfather?" }
  ],
  "stream": false
}'

Here, we are sending a POST request to an API running on localhost. The API endpoint is for "chat", which will interact with our AI model hosted on the server.
We are providing a JSON payload that contains: a string specifying the name of the AI model to use for processing the input prompt ("llama3"), a messages array with a string indicating the role of the message sender (user) and a string with the user's input prompt ("who wrote the book godfather?"), and a boolean value stream indicating whether the response should be streamed or not. In our case, it is set to false, meaning the entire response will be returned at once.

[Screenshot: Ollama running a Llama model with the curl command]

As we can see, the model generated the response with the answer to our question.

Running as a Python script
This example can also be run using a Python script. To install Python, visit the Python website, where you can choose your OS and download the version of Python you like. To run it using a Python script, open the editor of your choice and create a new file. First, let's add the imports we will need for this demo and define a parameter called url, which will have the same value as the URL we saw in the demo:

import requests
import json

url = "http://localhost:11434/api/chat"

We will now add a new function called llama3, which will take in prompt as an argument:

def llama3(prompt):
    data = {
        "model": "llama3",
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "stream": False
    }
    headers = {
        "Content-Type": "application/json"
    }
    response = requests.post(url, headers=headers, json=data)
    return response.json()["message"]["content"]

This function constructs a JSON payload containing the specified prompt and the model name, which is "llama3". Then, it sends a POST request to the API endpoint with the JSON payload as the message body, using the requests library. Once the response is received, the function extracts the content of the response message from the JSON object returned by the API and returns this extracted content.

Finally, we provide the prompt and print the generated response:

response = llama3("who wrote the book godfather")
print(response)

To run the script, write python <name of script>.py and press enter.

[Screenshot: running a Meta Llama model using Ollama and a Python script]

As we can see, it generated the response based on the prompt we provided in our script. To learn more about the complete Ollama APIs, check out their documentation. To check out the full example and run it on your own machine, our team has worked on a detailed sample notebook that you can refer to in the llama-recipes GitHub repo, where you will find an example of how to run Llama 3 models on a Mac as well as other platforms. You will find the examples we discussed here, as well as other ways to use Llama 3 locally with Ollama via LangChain.

We've also created various other demos and examples to provide you with guidance and references to help you get started with Llama models and to make it easier for you to integrate Llama into your own use cases. These demos and examples are also located in our llama-recipes GitHub repo, where you'll find complete walkthroughs for how to get started with Llama models, including installation instructions, dependencies, and recipes. You'll also find several examples for inference, fine-tuning, and training on custom data sets, as well as demos that showcase Llama deployments, basic interactions, and specialized use cases.
  46. ----------
  47. Meta Llama in the Cloud | Llama Everywhere

Meta Llama in the Cloud

This tutorial is a part of our Build with Meta Llama series, where we demonstrate the capabilities and practical applications of Llama for developers like you, so that you can leverage the benefits that Llama has to offer and incorporate it into your own applications. This tutorial supports the video Many other ways to run Llama and resources | Build with Meta Llama, where we learn about some of the various other ways in which you can host or run Meta Llama models, and provide you with all the resources that can help you get started. If you're interested in learning by watching or listening, check out our video on Many other ways to run Llama and resources.

Apart from running the models locally, one of the most common ways to run Meta Llama models is to run them in the cloud. We saw an example of this in our Running Llama on Windows video. Let's take a look at some of the other services we can use to host and run Llama models, such as AWS, Microsoft Azure, Google Cloud Platform with Vertex AI, and IBM watsonx, among others.

Amazon Web Services
Amazon Web Services (AWS) provides multiple ways to host your Llama models, such as SageMaker JumpStart and Bedrock. Bedrock is a fully managed service that lets you quickly and easily build generative AI-powered experiences. To use Meta Llama with Bedrock, check out their documentation, which goes over how to integrate and use Meta Llama models in your applications. You can also use AWS through SageMaker JumpStart, which enables you to build, train, and deploy ML models from a broad selection of publicly available foundation models and deploy them on SageMaker instances for model training and inference. Learn more about how to use Meta Llama on SageMaker in their documentation.

Microsoft Azure
Another way to run Meta Llama models is on Microsoft Azure. You can access Meta Llama models on Azure in two ways:
Models as a Service (MaaS) provides access to Meta Llama hosted APIs through Azure AI Studio.
Model as a Platform (MaaP) provides access to the Meta Llama family of models with out-of-the-box support for fine-tuning and evaluation through Azure Machine Learning Studio.
Please refer to our How to Guide for more details.

Google Cloud Platform
You can also use GCP, or Google Cloud Platform, to run Meta Llama models. GCP is a suite of cloud computing services that provides computing resources as well as virtual machines. Building on top of GCP services, Model Garden on Vertex AI offers infrastructure to jumpstart your ML project with a single place to discover, customize, and deploy a wide range of models. We have collaborated with Vertex AI from Google Cloud to fully integrate Meta Llama, offering pre-trained, instruction-tuned, and Meta Code Llama models in various sizes. Check out how to fine-tune and deploy Meta Llama models on Vertex AI by visiting the documentation. Please note that you may need to request proper GPU computing quota as a prerequisite.

IBM watsonx
You can also use IBM's watsonx to run Meta Llama models. IBM watsonx is an advanced platform designed for AI builders, integrating generative AI capabilities, foundation models, and traditional machine learning.
It provides a comprehensive suite of tools that span the AI lifecycle, enabling users to tune models with their enterprise data. The platform supports multi-model flexibility, client protection, AI governance, and hybrid, multi-cloud deployments. It offers features for extracting insights, discovering trends, generating synthetic tabular data, running Jupyter notebooks, and creating new content and code. Watsonx.ai equips data scientists with the necessary tools, pipelines, and runtimes for building and deploying ML models, thereby automating the entire AI model lifecycle. We've worked with IBM to make Llama and Code Llama models available on their platform. To test the platform and evaluate Llama on watsonx, creating an account is free and allows testing the available models through the Prompt Lab. For detailed instructions, refer to the getting started guide and the quick start tutorials.

Other hosting providers
You can also run Llama models using hosting providers such as OpenAI, Together AI, Anyscale, Replicate, Groq, etc. Our team has worked on step-by-step examples to showcase how to run Llama on externally hosted providers. The examples can be found in our llama-recipes GitHub repo, which goes over the process of setting up and running inference for Llama models on some of these externally hosted providers.

Running Llama on premise
Many enterprise customers prefer to deploy Llama models on premise, on their own servers. One way to deploy and run Llama models in this manner is by using TorchServe. TorchServe is an easy-to-use tool for deploying PyTorch models at scale. It is cloud and environment agnostic and supports features such as multi-model serving, logging, metrics, and the creation of RESTful endpoints for application integration. To learn more about how TorchServe works, with setup, quickstart, and examples, check out its GitHub repo. Another way to deploy Llama models on premise is by using Virtual Large Language Model (vLLM) or Text Generation Inference (TGI), two leading open-source tools to deploy and serve LLMs. A detailed step-by-step tutorial can be found in our llama-recipes GitHub repo; it showcases how to use Llama models with vLLM and Hugging Face TGI, and how to create vLLM and TGI hosted Llama instances with LangChain, a language model integration framework for the creation of applications using large language models. A minimal vLLM example is sketched below.

You can find various demos and examples that can provide you with guidance, and that you can use as references to get started with Llama models, in our llama-recipes GitHub repo, where you'll find several examples for inference and fine-tuning, as well as running on various API providers. Learn more about Llama 3 and how to get started by checking out our Getting to Know Llama notebook, which you can also find in the llama-recipes repo. There you will find a guided tour of Llama 3, including a comparison to Llama 2, descriptions of the different Llama 3 models, how and where to access them, generative AI and chatbot architectures, prompt engineering, RAG (Retrieval Augmented Generation), fine-tuning, and more. You will find all this implemented with starter code that you can take and adapt to use in your own Meta Llama 3 projects. To learn more about our Llama 3 models, check out our announcement blog, where you can find details about how the models work, data on performance and benchmarks, information about trust and safety, and various other resources to get you started.
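As referenced above, here is a minimal sketch of serving a Llama model with vLLM's offline inference API. It is illustrative only and assumes vLLM is installed (pip install vllm), a suitable GPU is available, and you have been granted access to the gated meta-llama weights on Hugging Face.

from vllm import LLM, SamplingParams

# Load the model once; vLLM handles batching and KV-cache management internally.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
outputs = llm.generate(["Write a haiku about open models."], sampling_params)

for output in outputs:
    print(output.outputs[0].text)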
Get the model source from our Llama 3 GitHub repo, where you can learn how the models work along with a minimalist example of how to load Llama 3 models and run inference. Here, you will also find steps to download and set up the models, and examples for running the text completion and chat models.

Dive deeper and learn more about the model in its model card, which goes over the model architecture, intended use, hardware and software requirements, training data, results, and licenses.

Check out our new Meta AI, built with Llama 3 technology, which is now one of the world's leading AI assistants that can boost your intelligence and lighten your load, helping you learn, get things done, create content, and connect to make the most out of every moment. You can use Meta AI on Facebook, Instagram, WhatsApp, Messenger, and the web to get things done, learn, create, and connect with the things that matter to you.

To learn more about the latest updates and releases of Llama models, check out our website, where you can learn more about the latest models as well as find resources on how these models work and how you can use them in your own applications.

Check out our Getting Started guide, which provides information and resources to help you set up Llama, including how to access the models, prompt formats, hosting, how-to and integration guides, as well as resources that you can reference to get started with your projects.

Take a look at some of our latest blogs that discuss new announcements, the latest on the Llama ecosystem, and our responsible approach to Meta AI and Meta Llama 3.

Check out the community resources on our website to help you get started with Meta Llama models and to learn about performance and latency, fine-tuning, and more.

Dive deeper into prompt engineering, learning best practices for prompting Meta Llama models and interacting with Meta Llama Chat, Code Llama, and Llama Guard models, in our short course on Prompt Engineering with Llama 2 on DeepLearning.AI, recently updated to showcase both Llama 2 and Llama 3 models.

Read our Community Stories, which go over interesting use cases of Llama models in various fields such as business, healthcare, gaming, pharmaceuticals, and more.

Learn more about the Llama ecosystem, building product experiences with Llama, and examples that showcase how industry pioneers have adopted Llama to build and grow innovative products for users across their platforms at Connect 2023.

Also check out our Responsible Use Guide, which provides developers with recommended best practices and considerations for safely building products powered by LLMs.

We hope you found the Build with Meta Llama videos and tutorials helpful in providing you with the insights and resources you may need to get started with using Llama models. We at Meta strongly believe in an open approach to AI development, democratizing access through an open platform and providing you with AI models, tools, and resources to give you the power to shape the next wave of innovation. We want to kickstart that next wave of innovation across the stack, from applications to developer tools to evals to inference optimizations and more. We can't wait to see what you build and look forward to your feedback.
  48. ----------
  49. Fine-tuning | How-to guides

How-to guides: Fine-tuning

If you are looking to learn by writing code, it's highly recommended to look into the Getting to Know Llama 3 notebook. It's a great place to start with the most commonly performed operations on Meta Llama.

Full parameter fine-tuning is a method that fine-tunes all the parameters of all the layers of the pre-trained model. In general, it can achieve the best performance, but it is also the most resource-intensive and time-consuming: it requires the most GPU resources and takes the longest.

PEFT, or Parameter-Efficient Fine-Tuning, allows one to fine-tune models with minimal resources and costs. Two important PEFT methods are LoRA (Low Rank Adaptation) and QLoRA (Quantized LoRA), where pre-trained models are loaded to the GPU as quantized 8-bit and 4-bit weights, respectively. It's likely that you can fine-tune the Llama 2 13B model using LoRA or QLoRA with a single consumer GPU with 24GB of memory, and using QLoRA requires even less GPU memory and fine-tuning time than LoRA. Typically, one should try LoRA first, or, if resources are extremely limited, QLoRA, and after the fine-tuning is done, evaluate the performance. Only consider full fine-tuning when the performance is not desirable.

Experiment tracking
Experiment tracking is crucial when evaluating various fine-tuning methods like LoRA and QLoRA. It ensures reproducibility, maintains a structured version history, allows for easy collaboration, and aids in identifying optimal training configurations. Especially with numerous iterations, hyperparameters, and model versions at play, tools like Weights & Biases (W&B) become indispensable. With its seamless integration into multiple frameworks, W&B provides a comprehensive dashboard to visualize metrics, compare runs, and manage model checkpoints. It's often as simple as adding a single argument to your training script to realize these benefits; we'll show an example in the Hugging Face PEFT LoRA section.

Recipes PEFT LoRA
The llama-recipes repo has details on the different fine-tuning (FT) alternatives supported by the provided sample scripts. In particular, it highlights the use of PEFT as the preferred FT method, as it reduces the hardware requirements and prevents catastrophic forgetting. For specific cases, full parameter FT can still be valid, and different strategies can be used to still prevent modifying the model too much. Additionally, FT can be done on a single GPU or on multiple GPUs with FSDP. In order to run the recipes, follow the steps below:

1. Create a conda environment with pytorch and additional dependencies.
2. Install the recipes as described.
3. Download the desired model from Hugging Face, either using git-lfs or using the llama download script.
4. With everything configured, run the following command:

python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name ../llama/models_hf/7B --output_dir ../llama/models_ft/7B-peft --batch_size_training 2 --gradient_accumulation_steps 2

torchtune
torchtune is a PyTorch-native library that can be used to fine-tune the Meta Llama family of models, including Meta Llama 3.
It supports the end-to-end fine-tuning lifecycle, including:
- Downloading model checkpoints and datasets
- Training recipes for fine-tuning Llama 3 using full fine-tuning, LoRA, and QLoRA
- Support for single-GPU fine-tuning capable of running on consumer-grade GPUs with 24GB of VRAM
- Scaling fine-tuning to multiple GPUs using PyTorch FSDP
- Logging metrics and model checkpoints during training using Weights & Biases
- Evaluation of fine-tuned models using EleutherAI's LM Evaluation Harness
- Post-training quantization of fine-tuned models via TorchAO
- Interoperability with inference engines including ExecuTorch

To install torchtune, simply run the pip install command:

pip install torchtune

Follow the instructions on the Hugging Face meta-llama repository to ensure you have access to the Llama 3 model weights. Once you have confirmed access, you can run the following command to download the weights to your local machine. This will also download the tokenizer model and a responsible use guide.

tune download meta-llama/Meta-Llama-3-8B \
    --output-dir <checkpoint_dir> \
    --hf-token <ACCESS TOKEN>

Set your environment variable HF_TOKEN or pass in --hf-token to the command in order to validate your access. You can find your token at https://huggingface.co/settings/tokens

The basic command for a single-device LoRA fine-tune of Llama 3 is:

tune run lora_finetune_single_device --config llama3/8B_lora_single_device

torchtune contains built-in recipes for:
- Full fine-tuning on a single device and on multiple devices with FSDP
- LoRA fine-tuning on a single device and on multiple devices with FSDP
- QLoRA fine-tuning on a single device, with a QLoRA-specific configuration

You can find more information on fine-tuning Meta Llama models by reading the torchtune guide.

Hugging Face PEFT LoRA
Using Low Rank Adaptation (LoRA), Meta Llama is loaded to the GPU memory as quantized 8-bit weights. Fine-tuning with Hugging Face PEFT LoRA is super easy; an example fine-tuning run on Meta Llama 2 7B using the OpenAssistant data set can be done in three simple steps:

pip install trl
git clone https://github.com/huggingface/trl
python trl/examples/scripts/sft.py \
    --model_name meta-llama/Llama-2-7b-hf \
    --dataset_name timdettmers/openassistant-guanaco \
    --load_in_4bit \
    --use_peft \
    --batch_size 4 \
    --gradient_accumulation_steps 2 \
    --log_with wandb

This takes about 16 hours on a single GPU and uses less than 10GB of GPU memory; changing the batch size to 8/16/32 will use over 11/16/25 GB of GPU memory. After the fine-tuning completes, you'll see in a new directory named "output" at least adapter_config.json and adapter_model.bin. Run the script below to infer with the base model and with the new model generated by merging the base model with the fine-tuned one:

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

model_name = "meta-llama/Llama-2-7b-chat-hf"
new_model = "output"
device_map = {"": 0}

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

prompt = "Who wrote the book Innovator's Dilemma?"
pipe = pipeline(task="text-generation", model=base_model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

QLoRA Fine-Tuning
Note: This has been tested on Meta Llama 2 models only.

QLoRA (Q for quantized) is more memory efficient than LoRA. In QLoRA, the pretrained model is loaded to the GPU as quantized 4-bit weights. Fine-tuning using QLoRA is also very easy to run; an example of fine-tuning Llama 2 7B with the OpenAssistant data set can be done in four quick steps:

git clone https://github.com/artidoro/qlora
cd qlora
pip install -U -r requirements.txt
./scripts/finetune_llama2_guanaco_7b.sh

It takes about 6.5 hours to run on a single GPU, using 11GB of GPU memory. After the fine-tuning completes, the output_dir specified in ./scripts/finetune_llama2_guanaco_7b.sh will have checkpoint-xxx subfolders holding the fine-tuned adapter model files. To run inference, use the script below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from peft import LoraConfig, PeftModel

model_id = "meta-llama/Llama-2-7b-hf"
new_model = "output/llama-2-guanaco-7b/checkpoint-1875/adapter_model"  # change if needed

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map='auto',
)
model = PeftModel.from_pretrained(model, new_model)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Who wrote the book innovator's dilemma?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Axolotl
Axolotl is another open source library you can use to streamline the fine-tuning of Llama 2. A good example of using Axolotl to fine-tune Meta Llama, with four notebooks covering the whole fine-tuning process (generate the dataset, fine-tune the model using LoRA, evaluate and benchmark), is also available.
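The scripts above import LoraConfig but configure LoRA through ready-made training scripts. For readers who want to wire up PEFT directly, here is a minimal sketch of attaching LoRA adapters to a causal LM with the peft library; the target_modules shown are a common choice for Llama-style models but should be treated as an assumption to adjust for your own setup.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
)

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],   # assumed attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable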
  50. ----------
  51. Quantization | How-to guides

Quantization is a technique used in machine learning to reduce the computational and memory requirements of models, making them more efficient for deployment on servers and edge devices. It involves representing model weights and activations, typically 32-bit floating point numbers, with lower-precision data such as 16-bit float, brain float 16-bit, 8-bit int, or even 4/3/2/1-bit int. The benefits of quantization include smaller model sizes, faster fine-tuning, and faster inference, which is particularly beneficial in resource-constrained environments. However, the tradeoff is a reduction in model quality due to the loss of precision.

Supported quantization modes in PyTorch
Post-Training Dynamic Quantization: Weights are pre-quantized ahead of time and activations are converted to int8 during inference, just before computation. This results in faster computation due to efficient int8 matrix multiplication and maintains accuracy on the activation layer.
Post-Training Static Quantization: This technique improves performance by converting networks to use both integer arithmetic and int8 memory accesses. It involves feeding batches of data through the network and computing the resulting distributions of the different activations. This information is used to determine how the different activations should be quantized at inference time.
Quantization-Aware Training (QAT): In QAT, all weights and activations are "fake quantized" during both the forward and backward passes of training. This means float values are rounded to mimic int8 values, but all computations are still done with floating point numbers. This method usually yields higher accuracy than the other two methods, as all weight adjustments during training are made while "aware" of the fact that the model will ultimately be quantized.

More details about these methods and how they can be applied to different types of models can be found in the official PyTorch documentation. Additionally, the community has already conducted studies on the effectiveness of common quantization methods on Meta Llama 3, and the results and code to evaluate them can be found in this GitHub repository.

We will focus next on quantization tools available for Meta Llama models. As this is a constantly evolving space, the libraries and methods detailed here are the most widely used at the moment and are subject to change as the space evolves.

PyTorch quantization with TorchAO
The TorchAO library offers several methods for quantization, each with different schemes for how the activations and weights are quantized. We distinguish between two main types of quantization: weight-only quantization and dynamic quantization. For weight-only quantization, we support 8-bit and 4-bit quantization. The 4-bit quantization also has GPTQ support for improved accuracy, which requires calibration but has the same final performance. For dynamic quantization, we support 8-bit activation quantization and 8-bit weight quantization. We also support this type of quantization with smoothquant for improved accuracy, which requires calibration and has slightly worse performance.
Additionally, the library offers a simple API to test different methods and to automatically detect the best quantization for a given model, known as autoquantization. This API chooses the fastest form of quantization out of the 8-bit dynamic and 8-bit weight-only quantization. It first identifies the shapes of the activations that the different linear layers see, then benchmarks these shapes across different types of quantized and non-quantized layers in order to pick the fastest one. It also composes with torch.compile() to generate fast kernels; for additional information on torch.compile, please see the general tutorial. Note that this library is in beta and under active development, so API changes are expected.

HF supported quantization

Hugging Face (HF) offers multiple ways to do LLM quantization with their transformers library. For additional guidance and examples on how to use each of these beyond the brief summary presented here, please refer to their quantization guide and the transformers quantization configuration reference. The llama-recipes code uses bitsandbytes 8-bit quantization to load the models, both for inference and fine-tuning. (See below for more information about using the bitsandbytes library with Llama.)

Quanto

Quanto is a versatile PyTorch quantization toolkit that uses linear quantization. It provides features such as weight quantization, activation quantization, and compatibility with various devices and modalities. It supports quantization-aware training and is easy to integrate with custom kernels for specific devices. More details can be found in the announcement blog, the GitHub repository, and the HF guide.

AQLM

Additive Quantization of Language Models (AQLM) is a compression method for LLMs. It quantizes multiple weights together, taking advantage of interdependencies between them. AQLM represents groups of 8 to 16 weights each as a sum of multiple vector codes. The library supports fine-tuning its quantized models with Parameter-Efficient Fine-Tuning and LoRA by integrating into HF's PEFT library as well. More details can be found in the GitHub repository.

AWQ

Activation-aware Weight Quantization (AWQ) preserves a small percentage of weights that are important for LLM performance, reducing quantization loss. This allows models to run in 4-bit precision without experiencing performance degradation. Transformers supports loading models quantized with the llm-awq and autoawq libraries. More details on how to load them with the transformers library can be found in the HF documentation.

AutoGPTQ

The AutoGPTQ library implements the GPTQ algorithm, a post-training quantization technique in which each row of the weight matrix is quantized independently. The weights are quantized to int4 but restored to fp16 on the fly during inference, reducing memory usage by 4x. More details can be found in the GitHub repository.

BitsAndBytes

BitsAndBytes is an easy option for quantizing a model to 8-bit or 4-bit. The library supports any model in any modality, as long as it supports loading with Hugging Face Accelerate and contains torch.nn.Linear layers. It also provides features for offloading weights between the CPU and GPU to fit very large models into memory, adjusting the outlier threshold for 8-bit quantization, skipping module conversion for certain models, and fine-tuning with 8-bit and 4-bit weights.
For 4-bit models, it allows changing the compute data type, using the Normal Float 4 (NF4) data type for weights initialized from a normal distribution, and using nested quantization to save additional memory at no extra performance cost. More details can be found in the HF documentation. A brief loading sketch follows.
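As a brief illustration of the bitsandbytes options described above, the sketch below loads a Llama checkpoint in 4-bit NF4 through transformers; the model ID and settings are examples, and access to the checkpoint on the Hugging Face Hub is assumed.

```python
# Loading a Llama checkpoint in 4-bit NF4 with bitsandbytes via transformers
# (requires the bitsandbytes and accelerate packages; settings are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumes Hub access has been granted

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # Normal Float 4 for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute data type used during matmuls
    bnb_4bit_use_double_quant=True,          # nested quantization to save additional memory
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```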
  52. ----------
53. Prompting | How-to guides

A companion notebook with examples of the techniques discussed in this section is linked from this page. Prompt engineering is a technique used in natural language processing (NLP) to improve the performance of a language model by providing it with more context and information about the task at hand. It involves creating prompts, which are short pieces of text that provide additional information or guidance to the model, such as the topic or genre of the text it will generate. By using prompts, the model can better understand what kind of output is expected and produce more accurate and relevant results. In Llama 2 the size of the context, in terms of number of tokens, has doubled from 2048 to 4096.

Crafting Effective Prompts

Crafting effective prompts is an important part of prompt engineering. Here are some tips for creating prompts that will help improve the performance of your language model:

Be clear and concise: Your prompt should be easy to understand and provide enough information for the model to generate relevant output. Avoid using jargon or technical terms that may confuse the model.

Use specific examples: Providing specific examples in your prompt can help the model better understand what kind of output is expected. For example, if you want the model to generate a story about a particular topic, include a few sentences about the setting, characters, and plot.

Vary the prompts: Using different prompts can help the model learn more about the task at hand and produce more diverse and creative output. Try using different styles, tones, and formats to see how the model responds.

Test and refine: Once you have created a set of prompts, test them out on the model to see how it performs. If the results are not as expected, try refining the prompts by adding more detail or adjusting the tone and style.

Use feedback: Finally, use feedback from users or other sources to continually improve your prompts. This can help you identify areas where the model needs more guidance and make adjustments accordingly.

Explicit Instructions

Detailed, explicit instructions produce better results than open-ended prompts. You can think of giving explicit instructions as applying rules and restrictions to how Llama 2 responds to your prompt.

Stylization: Explain this to me like a topic on a children's educational network show teaching elementary students. I'm a software engineer using large language models for summarization. Summarize the following text in under 250 words. Give your answer like an old timey private investigator hunting down a case step by step.

Formatting: Use bullet points. Return as a JSON object. Use less technical terms and help me apply it in my work in communications.

Restrictions: Only use academic papers. Never give sources older than 2020. If you don't know the answer, say that you don't know.

Here's an example of giving explicit instructions to get more specific results by limiting the responses to recently created sources:

Explain the latest advances in large language models to me. # More likely to cite sources from 2017

Explain the latest advances in large language models to me. Always cite your sources.
Never cite sources older than 2020. # Gives more specific advances and only cites sources from 2020

Prompting using Zero- and Few-Shot Learning

A shot is an example or demonstration of what type of prompt and response you expect from a large language model. This term originates from training computer vision models on photographs, where one shot was one example or instance that the model used to classify an image.

Zero-Shot Prompting

Large language models like Meta Llama are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called "zero-shot prompting".

Text: This was the best movie I've ever seen! The sentiment of the text is:

Text: The director was trying too hard. The sentiment of the text is:

Few-Shot Prompting

Adding specific examples of your desired output generally results in more accurate, consistent output. This technique is called "few-shot prompting". In this example, the generated response follows our desired format, which offers a more nuanced sentiment classifier that gives a positive, neutral, and negative response confidence percentage.

You are a sentiment classifier. For each message, give the percentage of positive/neutral/negative. Here are some samples: Text: I liked it Sentiment: 70% positive 30% neutral 0% negative Text: It could be better Sentiment: 0% positive 50% neutral 50% negative Text: It's fine Sentiment: 25% positive 50% neutral 25% negative

Text: I thought it was okay Text: I loved it! Text: Terrible service 0/10

Role Based Prompts

Role-based prompting means creating prompts based on the role or perspective of the person or entity being addressed. This technique can be useful for generating more relevant and engaging responses from language models.

Pros: Improves relevance: Role-based prompting helps the language model understand the role or perspective of the person or entity being addressed, which can lead to more relevant and engaging responses. Increases accuracy: Providing additional context about the role or perspective of the person or entity being addressed can help the language model avoid making mistakes or misunderstandings.

Cons: Requires effort: It takes more effort to gather and provide the necessary information about the role or perspective of the person or entity being addressed.

Example: You are a virtual tour guide currently walking tourists through the Eiffel Tower on a night tour. Describe the Eiffel Tower to your audience, covering its history, the number of people visiting each year, the amount of time it takes to do a full tour, and why so many people visit this place each year.

Chain of Thought Technique

The chain of thought technique involves providing the language model with a series of prompts or questions to help guide its thinking and generate a more coherent and relevant response. This technique can be useful for generating more thoughtful and well-reasoned responses from language models.

Improves coherence: Helps the language model think through a problem or question in a logical and structured way, which can lead to more coherent and relevant responses. Increases depth: Providing a series of prompts or questions can help the language model explore a topic more deeply and thoroughly, potentially leading to more insightful and informative responses. Requires effort: The chain of thought technique requires more effort to create and provide the necessary prompts or questions.

You are a virtual tour guide from 1901. You have tourists visiting the Eiffel Tower. Describe the Eiffel Tower to your audience. Begin with 1.
Why it was built 2. Then by how long it took them to build 3. Where were the materials sourced to build 4. Number of people it took to build 5. End it with the number of people visiting the Eiffel Tower annually in the 1900's, the amount of time it takes to complete a full tour, and why so many people visit this place each year. Make your tour funny by including 1 or 2 funny jokes at the end of the tour.

Self-Consistency

LLMs are probabilistic, so even with Chain-of-Thought, a single generation might produce incorrect results. Self-Consistency improves accuracy by selecting the most frequent answer from multiple generations (at the cost of higher compute):

John found that the average of 15 numbers is 40. If 10 is added to each number then the mean of the numbers is? Report the answer surrounded by three backticks, for example: ```123```

Running the above several times and taking the most commonly returned value for the answer would make use of the self-consistency approach (see the sketch at the end of this section).

Retrieval-Augmented Generation

Common facts are generally available from today's large models out-of-the-box (i.e. using just the model weights). More specific data, however, is unlikely to be available. For example:

What is the capital of California? # The capital of California is Sacramento...

What was the temperature in Menlo Park on December 12th, 2023? # I'm just an AI, I don't have access to real-time or historical weather data...

Retrieval-Augmented Generation, or RAG, describes the practice of including information in the prompt that has been retrieved from an external database. It's an effective way to incorporate facts into your LLM application and is more affordable than fine-tuning, which may also negatively impact the foundational model's capabilities. The retrieval source could be as simple as a lookup table or as sophisticated as a vector database containing all of your company's knowledge:

Given the following information about temperatures in Menlo Park: 2023-12-11 : 52 degrees Fahrenheit 2023-12-12 : 51 degrees Fahrenheit 2023-12-13 : 55 degrees Fahrenheit

What was the temperature in Menlo Park on 2023-12-12? # Sure! The temperature in Menlo Park on 2023-12-12 was 51 degrees Fahrenheit.

What was the temperature in Menlo Park on 2023-07-18? # Sorry, I don't have information about the temperature in Menlo Park on 2023-07-18. The information provided only includes temperatures for December 11th, 12th, and 13th of 2023.

Program-Aided Language Models

LLMs, by nature, aren't great at performing calculations. While LLMs are bad at arithmetic, they're great at code generation. Program-Aided Language models leverage this fact by instructing the LLM to write code to solve calculation tasks.

Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))

Here is the Python code to calculate the given expression: ((-5 + 93 * 4 - 0) * (4**4 + -7 + 0 * 5)) # Calculate the expression inside the parentheses a = -5 + 93 * 4 - 0 b = 4**4 + -7 + 0 * 5 # Multiply a and b result = a * b # Print the result print(result)

Running the generated code directly provides the correct result.

Limiting Extraneous Tokens

A common challenge is generating a response without extraneous tokens (e.g. "Sure! Here's more information on..."). By combining a role, rules and restrictions, explicit instructions, and an example, the model can be prompted to generate the desired response.

You are a robot that only outputs JSON. You reply in JSON format with the field 'zip_code'. Example question: What is the zip code of the Empire State Building?
Example answer: {'zip_code': 10118} Now here is my question: What is the zip code of Menlo Park? # "{'zip_code': 94025}"

Reduce Hallucinations

Meta's Responsible Use Guide is a great resource for understanding how best to prompt and how to address input/output risks of the language model; refer to pages 14-17. Here are some examples of how a language model might hallucinate and some strategies for fixing the issue:

Example 1: A language model is asked to generate a response to a question about a topic it has not been trained on. The language model may hallucinate information or make up facts that are not accurate or supported by evidence. Fix: To fix this issue, you can provide the language model with more context or information about the topic to help it understand what is being asked and generate a more accurate response. You could also ask the language model to provide sources or evidence for any claims it makes to ensure that its responses are based on factual information.

Example 2: A language model is asked to generate a response to a question that requires a specific perspective or point of view. The language model may hallucinate information or make up facts that are not consistent with the desired perspective or point of view. Fix: To fix this issue, you can provide the language model with additional information about the desired perspective or point of view, such as the goals, values, or beliefs of the person or entity being addressed. This can help the language model understand the context and generate a response that is more consistent with the desired perspective or point of view.

Example 3: A language model is asked to generate a response to a question that requires a specific tone or style. The language model may hallucinate information or make up facts that are not consistent with the desired tone or style. Fix: To fix this issue, you can provide the language model with additional information about the desired tone or style, such as the audience or purpose of the communication. This can help the language model understand the context and generate a response that is more consistent with the desired tone or style.

Overall, the key to avoiding hallucination in language models is to provide them with clear and accurate information and context, and to carefully monitor their responses to ensure that they are consistent with your expectations and requirements.
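To tie these techniques together, here is a hedged sketch of the self-consistency approach described earlier: sample several generations, extract the final answer from each, and keep the most frequent one. The `generate` callable is a hypothetical stand-in for whatever Llama inference API you use, and the prompt is the example from this section.

```python
# Self-consistency sketch: majority vote over multiple sampled generations.
import re
from collections import Counter

PROMPT = (
    "John found that the average of 15 numbers is 40. "
    "If 10 is added to each number then the mean of the numbers is? "
    "Think step by step, then report the answer surrounded by three backticks, for example: ```123```"
)

def extract_answer(completion):
    """Pull the final answer out of a completion that follows the backtick convention."""
    match = re.search(r"```(.+?)```", completion, re.DOTALL)
    return match.group(1).strip() if match else None

def self_consistent_answer(generate, n_samples=5, temperature=0.7):
    """Sample several chain-of-thought generations and return the most common answer."""
    answers = []
    for _ in range(n_samples):
        completion = generate(PROMPT, temperature=temperature)  # sampling must be enabled
        answer = extract_answer(completion)
        if answer is not None:
            answers.append(answer)
    # Majority vote; ties resolve to whichever answer was seen first among the most common.
    return Counter(answers).most_common(1)[0][0]
```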
  54. ----------
55. Validation | How-to guides

As the saying goes, if you can't measure it, you can't improve it. In this section, we cover different ways to measure and ultimately validate Llama so it's possible to determine the improvements provided by different fine-tuning techniques.

Quantitative techniques

The focus of these techniques is to gather objective metrics that can be compared easily during and after each fine-tuning run and to provide quick feedback on whether the model is performing well. The main metrics collected are loss and perplexity.

K-fold cross-validation: This method consists of dividing the dataset into k subsets or folds, and then fine-tuning the model k times. On each run, a different fold is used as a validation dataset, with the rest used for training. The performance results of each run are averaged out for the final report. This provides a more accurate metric of the performance of the model across the complete dataset, as all entries serve both for validation and training. While it produces the most accurate prediction of how a model is going to generalize after fine-tuning on a given dataset, it is computationally expensive and better suited for small datasets.

Holdout: When using a holdout, the dataset is split into two or three subsets: training and validation, with test as optional. The test and validation sets can represent 10%-30% of the dataset each. As the name implies, the first two subsets are used for training and validating the model during fine-tuning, while the third is used only after fine-tuning is complete to evaluate how well the model generalizes on data it has not seen in either phase. The advantage of having three partitions is that it provides a way to evaluate the model after fine-tuning for an unbiased view into its performance, but it requires a slightly bigger dataset to allow for a proper split. This is currently implemented in the Llama recipes fine-tuning script with two subsets of the dataset, train and validation. The data is collected in a JSON file that can be plotted to easily interpret the results and evaluate how the model is performing.

Standard Evaluation tools

There are multiple projects that provide standard evaluation. They provide predefined tasks with commonly used metrics to evaluate the performance of LLMs, like HellaSwag and TruthfulQA. These tools can be used to test if the model has degraded after fine-tuning. Additionally, a custom task can be created using the dataset intended to fine-tune the model, effectively automating the manual verification of the model's performance before and after fine-tuning. These types of projects provide a quantitative way of looking at the model's performance in simulated real-world examples. Some of these projects include the LM Evaluation Harness (used to create the HF leaderboard), HELM, BIG-bench, and OpenCompass. As mentioned before, the torchtune library provides integration with the LM Evaluation Harness to test fine-tuned models as well.

Interpreting Loss and Perplexity

The loss value used comes from the transformers LlamaForCausalLM, which initializes a different loss function depending on the objective required of the model.
The objective of this section is to give a brief overview of how to understand the results from loss and perplexity as an initial evaluation of the model's performance during fine-tuning. We also calculate perplexity as an exponentiation of the loss value. Additional information on loss functions can be found in external resources.

In our recipes, we use a simple holdout during fine-tuning. Using the logged loss values for both the train and validation datasets, the curves for both are plotted to analyze the results of the process. Given the setup in the recipe, the expected behavior is a log graph that shows diminishing train and validation loss values as fine-tuning progresses. If the validation curve starts going up while the train curve continues decreasing, the model is overfitting and is not generalizing well. Some alternatives to test when this happens are early stopping, verifying that the validation dataset is a statistically significant equivalent of the train dataset, data augmentation, using parameter-efficient fine-tuning, or using k-fold cross-validation to better tune the hyperparameters. A short sketch of computing loss and perplexity on a held-out set is included at the end of this section.

Qualitative techniques

Manual testing: Manually evaluating a fine-tuned model will vary according to the fine-tuning objective and available resources. Here we provide general guidelines on how to accomplish it. With a dataset prepared for fine-tuning, a part of it can be separated into a manual test subset, which can be further augmented with general knowledge questions that might be relevant to the specific use case. In addition to these general questions, we recommend executing standard evaluations as well and comparing the results with the baseline for the fine-tuned model. To rate the results, clear evaluation criteria should be defined that are relevant to the dataset being used. Example criteria are accuracy, coherence, and safety. Create a rubric for each criterion and define what would be required for an output to receive a specific score. With these guidelines in place, distribute the test questions to a diverse set of reviewers to have multiple data points for each question. With multiple data points for each question and different criteria, a final score can be calculated for each query, allowing for weighting of the scores based on the preferred focus for the final model.
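As a concrete companion to the discussion of loss and perplexity, here is a minimal evaluation sketch for a causal LM over a held-out dataloader; the `model` and `eval_dataloader` objects are assumed to come from your own fine-tuning setup, and perplexity is simply the exponential of the average loss.

```python
import math
import torch

@torch.no_grad()
def evaluate(model, eval_dataloader, device="cuda"):
    """Average cross-entropy loss over a held-out set, plus the corresponding perplexity."""
    model.eval()
    total_loss, num_batches = 0.0, 0
    for batch in eval_dataloader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        # For causal LMs in transformers, passing labels returns the shifted cross-entropy loss.
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
        total_loss += outputs.loss.item()
        num_batches += 1
    avg_loss = total_loss / max(num_batches, 1)
    return avg_loss, math.exp(avg_loss)  # perplexity = exp(mean loss)

# Example: avg_loss, ppl = evaluate(model, eval_dataloader)
# Plotting train vs. validation loss per step makes overfitting easy to spot.
```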
  56. ----------
57. Meta Code Llama | Integration guides

Meta Code Llama is an open-source family of LLMs based on Llama 2 providing SOTA performance on code tasks. It consists of foundation models (Meta Code Llama), Python specializations (Meta Code Llama - Python), and instruction-following models (Meta Code Llama - Instruct), each with 7B, 13B, 34B and 70B parameter versions. See the recipes for examples of how to make use of Meta Code Llama.

(Fig: The Meta Code Llama specialization pipeline, with the different stages of fine-tuning annotated with the number of tokens seen during training.)

One of the best ways to try out and integrate with Meta Code Llama is using the Hugging Face ecosystem by following the blog post, which includes: demo links for all versions of Meta Code Llama; working inference code for code completion; working inference code for code infilling between a code prefix and suffix as inputs; working inference code to do 4-bit loading of the 34B model so it can fit on consumer GPUs; a guide on how to write prompts for the instruction models to have multi-turn conversations about coding; a guide on how to use Text Generation Inference for model deployment in production; a guide on how to integrate code autocomplete as an extension with VS Code; and a guide on how to evaluate Meta Code Llama models.

If the model does not perform well on your specific task, for example if none of the Meta Code Llama models (7B/13B/34B/70B) generate the correct answer for a text-to-SQL task, fine-tuning should be considered. There is a complete guide and notebook on how to fine-tune Meta Code Llama using the 7B model hosted on Hugging Face. It uses the LoRA fine-tuning method and can run on a single GPU. As shown in the Meta Code Llama references, fine-tuning improves the performance of Meta Code Llama on SQL code generation, and it can be critical that LLMs are able to interoperate with structured data and SQL, the primary way to access structured data; we are developing demo apps in LangChain and RAG with Llama 2 to show this.

Compatible extensions

In most cases, the simplest method to integrate any model size is through ollama, occasionally combined with litellm. Ollama is a program that allows quantized versions of popular LLMs to run locally. It leverages the GPU and can even run Code Llama 34B on an M1 Mac. Litellm is a simple proxy that can serve an OpenAI-style API, so it's easy to replace OpenAI in existing applications, in our case, extensions.

Continue

This extension can be used with ollama, allowing for easy local-only execution. Additionally, it provides a simple interface to 1/ chat with the model directly running inside VS Code and 2/ select specific files and sections to edit or explain. This extension is an effective way to evaluate Llama because it provides simple and useful features. It also allows developers to build trust by creating diffs for each proposed change and showing exactly what is being changed before saving the file. Handling the context for the LLM is easy and relies heavily on keyboard shortcuts.
It's important to note that all the interactions with the extension are recorded in JSONL format. The objective is to provide data for future fine-tuning of the models based on the feedback recorded during real-world usage.

Steps to install with ollama: Install ollama and pull a model (e.g. ollama pull codellama:13b-instruct). Install the extension from the Visual Studio Code marketplace. Open the extension and click on the + sign to add models. Select Ollama as a provider. In the next screen, select the model and size pulled with ollama. Select the model in the conversation and start using the extension.

Steps to install with TGI: For better performance, or for use on non-compatible hardware, TGI can be used on a server to run the model. For example, ollama on Intel Macs is too slow to be useful, even with the 7B models; M1 Macs, on the other hand, can run the 34B Meta Code Llama models quickly. For this, you should have TGI running on a server with appropriate hardware, as detailed in the TGI documentation. Once Continue.dev is installed, follow these steps: Open the configs with /config. Use the HuggingFaceTGI class and pass your instance URL in the server_url parameter. Assign a name to it and save the config file. A minimal sketch of calling a TGI instance directly is shown at the end of this section.

llm-vscode

This extension from Hugging Face provides an open alternative to the closed-source GitHub Copilot, allowing the same functionality, context-based autocomplete suggestions, to work with open-source models. It works out of the box with an HF token and their Inference API, but can be configured to use any TGI-compatible API. For usage with a self-hosted TGI server, follow these steps: Install the extension from the marketplace. Open the extension configs. Select the correct template for the model published in your TGI instance in the Config Template field (for testing, we used the one named codellama/CodeLlama-13b-hf). Pass in the URL to your TGI instance in the Model ID or Endpoint field. To avoid rate-limiting messages, log in to HF by providing a read-only token; this was necessary even for a self-hosted instance. It currently does not support local models unless TGI is running locally. It would be great to add ollama support to this extension, as it would accelerate inference with the smaller models by avoiding the network.
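For completeness, here is a minimal sketch of querying a self-hosted TGI instance directly over its REST API, independent of any editor extension; the server URL is a placeholder and the prompt uses the Code Llama Instruct style.

```python
# Querying a Text Generation Inference (TGI) server over its REST /generate endpoint.
import requests

TGI_URL = "http://your-tgi-host:8080"  # placeholder for your instance

prompt = "[INST] Write a Python function that checks whether a string is a palindrome. [/INST]"

response = requests.post(
    f"{TGI_URL}/generate",
    json={
        "inputs": prompt,
        "parameters": {"max_new_tokens": 256, "temperature": 0.2},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```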
  58. ----------
59. LangChain | Integration guides

LangChain is an open-source framework for building LLM-powered applications. It implements common abstractions and higher-level APIs to make the app-building process easier, so you don't need to call the LLM from scratch. The main building blocks/APIs of LangChain are:

The Models or LLMs API can be used to easily connect to all popular LLM providers such as Hugging Face or Replicate, where all types of Llama 2 models are hosted.

The Prompts API implements the useful prompt template abstraction to help you easily reuse good, often long and detailed, prompts when building sophisticated LLM apps. There are also many built-in prompts for common operations such as summarization or connection to SQL databases for quick app development. Prompts can also work closely with parsers to easily extract useful information from the LLM output.

The Memory API can be used to save conversation history and feed it, along with new questions, to the LLM so multi-turn natural conversation chat can be implemented.

The Chains API includes the most basic LLMChain, which combines an LLM with a prompt to generate the output, as well as more advanced chains that let you build sophisticated LLM apps in a systematic way. For example, the output of the first LLM chain can be the input/prompt of another chain, or a chain can have multiple inputs and/or multiple outputs, either pre-defined or dynamically decided by the LLM output of a prompt.

The Indexes API allows documents outside of the LLM to be saved to a vector store, after first being converted to embeddings, which are numerical meaning representations of the documents in vector form. Later, when a user enters a question about the documents, the relevant data stored in the documents' vector store will be retrieved and sent, along with the query, to the LLM to generate an answer related to the documents.

The Agents API uses the LLM as a reasoning engine and connects it with other sources of data, third-party or your own tools, or APIs such as web search or Wikipedia APIs. Depending on the user's input, the agent can decide which tool to call to handle the input.

LangChain can be used as a powerful retrieval-augmented generation (RAG) tool to integrate internal data or more recent public data with the LLM to QA or chat about the data. LangChain already supports loading many types of unstructured and structured data. To learn more about LangChain, enroll for free in the two LangChain short courses. Be aware that the code in the courses uses the OpenAI ChatGPT LLM, but we've published a series of examples using LangChain with Llama; a minimal sketch follows. There is also a Getting to Know Llama notebook, presented at Meta Connect.
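The sketch below shows the prompt template and chain abstractions wired to a local Llama pipeline. Import paths and class locations differ between LangChain versions (newer releases move these into langchain_community and favor the `prompt | llm` expression style), so treat this as illustrative rather than canonical.

```python
# Minimal LangChain sketch: prompt template + chain over a local Llama pipeline.
from langchain.llms import HuggingFacePipeline        # moved to langchain_community in newer versions
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

llm = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Llama-2-7b-chat-hf",          # assumes license access on the Hub
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 200},
)

prompt = PromptTemplate(
    input_variables=["question"],
    template="Answer the question concisely.\n\nQuestion: {question}\nAnswer:",
)

chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(question="What are the main building blocks of LangChain?"))
```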
  60. ----------
61. LlamaIndex | Integration guides

LlamaIndex is another popular open-source framework for building LLM applications. Like LangChain, LlamaIndex can also be used to build RAG applications by easily integrating data that is not built into the LLM with the LLM. There are three key tools in LlamaIndex: Connecting data: connect data of any type, structured, unstructured or semi-structured, to the LLM. Indexing data: index and store the data. Querying the LLM: combine the user query and retrieved query-related data to query the LLM and return a data-augmented answer.

LlamaIndex is mainly a data framework for connecting private or domain-specific data with LLMs, so it specializes in RAG and smart data storage and retrieval, while LangChain is a more general-purpose framework which can be used to build agents connecting multiple tools. The integration of the two may provide the most performant and effective solution to building real-world RAG-powered Llama apps. For an example of how to integrate LlamaIndex with Llama 2, see the minimal sketch below. We also published a complete demo app showing how to use LlamaIndex to chat with Llama 2 about live data via the you.com API.

It's worth noting that LlamaIndex has implemented many RAG-powered LLM evaluation tools to easily measure the quality of retrieval and response, including: Question Generation: call the LLM to auto-generate questions to create an evaluation dataset. Faithfulness Evaluator: evaluate if the generated answer is faithful to the retrieved context or if there's hallucination. Correctness Evaluator: evaluate if the generated answer matches the reference answer. Relevancy Evaluator: evaluate if the answer and the retrieved context are relevant and consistent for the given query.
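Here is a minimal, hedged RAG sketch with LlamaIndex: load a folder of documents, build a vector index, and query it. The import paths follow the classic `llama_index` layout (newer releases use `llama_index.core`), the folder name is a placeholder, and the LLM and embedding backends default to whatever is configured in your environment.

```python
# Minimal LlamaIndex RAG sketch over a local folder of documents.
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./company_docs").load_data()  # any folder of text/PDF files
index = VectorStoreIndex.from_documents(documents)               # chunks, embeds, and stores the docs

query_engine = index.as_query_engine(similarity_top_k=3)         # retrieve the 3 most relevant chunks
response = query_engine.query("What was the temperature in Menlo Park on 2023-12-12?")
print(response)
```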
  62. ----------
63. # Llama Recipes: Examples to get started using the Llama models from Meta

The 'llama-recipes' repository is a companion to the [Meta Llama 3](https://github.com/meta-llama/llama3) models. The goal of this repository is to provide a scalable library for fine-tuning Meta Llama models, along with some example scripts and notebooks to quickly get started with using the models in a variety of use-cases, including fine-tuning for domain adaptation and building LLM-based applications with Meta Llama and other tools in the LLM ecosystem. The examples here showcase how to run Meta Llama locally, in the cloud, and on-prem. [Meta Llama 2](https://github.com/meta-llama/llama) is also supported in this repository. We highly recommend that everyone utilize [Meta Llama 3](https://github.com/meta-llama/llama3) due to its enhanced capabilities.

> [!IMPORTANT]
> Meta Llama 3 has a new prompt template and special tokens (based on the tiktoken tokenizer).
>
> | Token | Description |
> |---|---|
> | `<\|begin_of_text\|>` | This is equivalent to the BOS token. |
> | `<\|end_of_text\|>` | This is equivalent to the EOS token. For multiturn conversations it is usually unused; instead, every message is terminated with `<\|eot_id\|>`. |
> | `<\|eot_id\|>` | This token signifies the end of the message in a turn, i.e. the end of a single message by a system, user or assistant role as shown below. |
> | `<\|start_header_id\|>{role}<\|end_header_id\|>` | These tokens enclose the role for a particular message. The possible roles can be: system, user, assistant. |
>
> A multiturn conversation with Meta Llama 3 follows this prompt template:
> ```
> <|begin_of_text|><|start_header_id|>system<|end_header_id|>
> {{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>
> {{ user_message_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
> {{ model_answer_1 }}<|eot_id|><|start_header_id|>user<|end_header_id|>
> {{ user_message_2 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
> ```
> Each message gets trailed by an `<|eot_id|>` token before a new header is started, signaling a role change.
> More details on the new tokenizer and prompt template can be found [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3#special-tokens-used-with-meta-llama-3).

> [!NOTE]
> The llama-recipes repository was recently refactored to promote a better developer experience of using the examples. Some files have been moved to new locations. The `src/` folder has NOT been modified, so the functionality of this repo and package is not impacted.
>
> Make sure you update your local clone by running `git pull origin main`
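As an illustration of the template above, the sketch below assembles a Llama 3 prompt string from a list of messages. The exact whitespace around headers is ultimately defined by the tokenizer configuration, so in practice `tokenizer.apply_chat_template` from transformers is the more reliable way to produce this format.

```python
# Hedged sketch: build a Llama 3 prompt string from chat messages, mirroring the template above.
def format_llama3_prompt(messages):
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content']}<|eot_id|>"
    # Leave the assistant header open so the model generates the next turn.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about open models."},
]
print(format_llama3_prompt(messages))
```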
## Table of Contents

- [Llama Recipes: Examples to get started using the Llama models from Meta](#llama-recipes-examples-to-get-started-using-the-llama-models-from-meta)
  - [Table of Contents](#table-of-contents)
  - [Getting Started](#getting-started)
    - [Prerequisites](#prerequisites)
      - [PyTorch Nightlies](#pytorch-nightlies)
    - [Installing](#installing)
      - [Install with pip](#install-with-pip)
      - [Install with optional dependencies](#install-with-optional-dependencies)
      - [Install from source](#install-from-source)
    - [Getting the Llama models](#getting-the-llama-models)
      - [Model conversion to Hugging Face](#model-conversion-to-hugging-face)
  - [Repository Organization](#repository-organization)
    - [`recipes/`](#recipes)
    - [`src/`](#src)
  - [Contributing](#contributing)
  - [License](#license)

## Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

### Prerequisites

#### PyTorch Nightlies

If you want to use PyTorch nightlies instead of the stable release, go to [this guide](https://pytorch.org/get-started/locally/) to retrieve the right `--extra-index-url URL` parameter for the `pip install` commands on your platform.

### Installing

Llama-recipes provides a pip distribution for easy install and usage in other projects. Alternatively, it can be installed from source.

> Ensure you use the correct CUDA version (from `nvidia-smi`) when installing the PyTorch wheels. Here we are using 11.8 as `cu118`.
> H100 GPUs work better with CUDA >12.0

#### Install with pip

```
pip install llama-recipes
```

#### Install with optional dependencies

Llama-recipes offers the installation of optional packages. There are three optional dependency groups. To run the unit tests, install the required dependencies with `pip install llama-recipes[tests]`. For the vLLM example, the additional requirements can be installed with `pip install llama-recipes[vllm]`. To use the sensitive topics safety checker, install with `pip install llama-recipes[auditnlg]`. Optional dependencies can also be combined with [option1,option2].

#### Install from source

To install from source, e.g. for development, use the following commands. We're using hatchling as our build backend, which requires an up-to-date pip as well as the setuptools package.

`git clone git@github.com:meta-llama/llama-recipes.git`
`cd llama-recipes`
`pip install -U pip setuptools`
`pip install -e .`

For development and contributing to llama-recipes, please install all optional dependencies:

`pip install -U pip setuptools`
`pip install -e .[tests,auditnlg,vllm]`

### Getting the Meta Llama models

You can find Meta Llama models on the Hugging Face hub [here](https://huggingface.co/meta-llama), **where models with `hf` in the name are already converted to Hugging Face checkpoints so no further conversion is needed**. The conversion step below is only for original model weights from Meta that are hosted on the Hugging Face model hub as well.

#### Model conversion to Hugging Face

The recipes and notebooks in this folder use the Meta Llama model definition provided by Hugging Face's transformers library.
Given that the original checkpoint resides under models/7B, you can install all requirements and convert the checkpoint with:

```bash
## Install Hugging Face Transformers from source
pip freeze | grep transformers ## verify it is version 4.31.0 or higher
git clone git@github.com:huggingface/transformers.git
cd transformers
pip install protobuf
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
```

(A short sketch of loading the converted checkpoint is included at the end of this README section.)

## Repository Organization

Most of the code dealing with Llama usage is organized across 2 main folders: `recipes/` and `src/`.

### `recipes/`

Examples are organized in folders by topic:

| Subfolder | Description |
|---|---|
| [quickstart](./recipes/quickstart) | The "Hello World" of using Llama; start here if you are new to using Llama. |
| [finetuning](./recipes/finetuning) | Scripts to finetune Llama on single-GPU and multi-GPU setups |
| [inference](./recipes/inference) | Scripts to deploy Llama for inference locally and using model servers |
| [use_cases](./recipes/use_cases) | Scripts showing common applications of Meta Llama 3 |
| [responsible_ai](./recipes/responsible_ai) | Scripts to use PurpleLlama for safeguarding model outputs |
| [llama_api_providers](./recipes/llama_api_providers) | Scripts to run inference on Llama via hosted endpoints |
| [benchmarks](./recipes/benchmarks) | Scripts to benchmark Llama model inference on various backends |
| [code_llama](./recipes/code_llama) | Scripts to run inference with the Code Llama models |
| [evaluation](./recipes/evaluation) | Scripts to evaluate fine-tuned Llama models using `lm-evaluation-harness` from `EleutherAI` |

### `src/`

Contains modules which support the example recipes:

| Subfolder | Description |
|---|---|
| [configs](src/llama_recipes/configs/) | Contains the configuration files for PEFT methods, FSDP, datasets, and Weights & Biases experiment tracking. |
| [datasets](src/llama_recipes/datasets/) | Contains individual scripts for each dataset to download and process. |
| [inference](src/llama_recipes/inference/) | Includes modules for inference with the fine-tuned models. |
| [model_checkpointing](src/llama_recipes/model_checkpointing/) | Contains FSDP checkpoint handlers. |
| [policies](src/llama_recipes/policies/) | Contains FSDP scripts to provide different policies, such as mixed precision, transformer wrapping policy and activation checkpointing, along with any precision optimizer (used for running FSDP with pure bf16 mode). |
| [utils](src/llama_recipes/utils/) | Utility files: `train_utils.py` provides the training/eval loop and more train utils; `dataset_utils.py` to get preprocessed datasets; `config_utils.py` to override the configs received from CLI; `fsdp_utils.py` provides the FSDP wrapping policy for PEFT methods; `memory_utils.py` is a context manager to track different memory stats in the train loop. |

## Contributing

Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct, and the process for submitting pull requests to us.

## License

See the License file for Meta Llama 3 [here](https://llama.meta.com/llama3/license/) and Acceptable Use Policy [here](https://llama.meta.com/llama3/use-policy/)

See the License file for Meta Llama 2 [here](https://llama.meta.com/llama2/license/) and Acceptable Use Policy [here](https://llama.meta.com/llama2/use-policy/)
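Once converted, the output directory is a standard Hugging Face checkpoint and can be loaded directly with transformers; the path below is the illustrative `--output_dir` from the conversion command above.

```python
# Load the converted checkpoint with transformers and run a quick sanity generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/output/path"  # the --output_dir used in the conversion command

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("The capital of California is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```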
  64. ----------
65. # **Model Details**

Meta developed and released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama-2-Chat models outperform open-source chat models on most benchmarks we tested, and in our human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM.

**Model Developers** Meta

**Variations** Llama 2 comes in a range of parameter sizes — 7B, 13B, and 70B — as well as pretrained and fine-tuned variations.

**Input** Models input text only.

**Output** Models generate text only.

**Model Architecture** Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

||Training Data|Params|Context Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Llama 2|*A new mix of publicly available online data*|7B|4k|✗|2.0T|3.0 x 10^-4|
|Llama 2|*A new mix of publicly available online data*|13B|4k|✗|2.0T|3.0 x 10^-4|
|Llama 2|*A new mix of publicly available online data*|70B|4k|✔|2.0T|1.5 x 10^-4|

**Llama 2 family of models.** Token counts refer to pretraining data only. All models are trained with a global batch-size of 4M tokens. The 70B version uses Grouped-Query Attention (GQA) for improved inference scalability.

**Model Dates** Llama 2 was trained between January 2023 and July 2023.

**Status** This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.

**License** A custom commercial license is available at: [https://ai.meta.com/resources/models-and-libraries/llama-downloads/](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)

**Research Paper** More information can be found in the paper "Llama-2: Open Foundation and Fine-tuned Chat Models", available at https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/.

**Where to send questions or comments about the model** Instructions on how to provide feedback or comments on the model can be found in the model [README](README.md).

# **Intended Use**

**Intended Use Cases** Llama 2 is intended for commercial and research use in English. Tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks.

**Out-of-scope Uses** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 2 Community License. Use in languages other than English**.

**Note: Developers may fine-tune Llama 2 models for languages beyond English provided they comply with the Llama 2 Community License and the Acceptable Use Policy.

# **Hardware and Software**

**Training Factors** We used custom training libraries, Meta's Research Super Cluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute.

**Carbon Footprint** Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W).
Estimated total emissions were 539 tCO2eq, 100% of which were offset by Meta's sustainability program.

||Time (GPU hours)|Power Consumption (W)|Carbon Emitted (tCO2eq)|
|---|---|---|---|
|Llama 2 7B|184320|400|31.22|
|Llama 2 13B|368640|400|62.44|
|Llama 2 70B|1720320|400|291.42|
|Total|3311616||539.00|

**CO2 emissions during pretraining.** Time: total GPU time required for training each model. Power Consumption: peak power capacity per GPU device for the GPUs used, adjusted for power usage efficiency. 100% of the emissions are directly offset by Meta's sustainability program, and because we are openly releasing these models, the pretraining costs do not need to be incurred by others.

# **Training Data**

**Overview** Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.

**Data Freshness** The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023.

# **Evaluation Results**

In this section, we report the results for the Llama 1 and Llama 2 models on standard academic benchmarks. For all the evaluations, we use our internal evaluations library.

|Model|Size|Code|Commonsense Reasoning|World Knowledge|Reading Comprehension|Math|MMLU|BBH|AGI Eval|
|---|---|---|---|---|---|---|---|---|---|
|Llama 1|7B|14.1|60.8|46.2|58.5|6.95|35.1|30.3|23.9|
|Llama 1|13B|18.9|66.1|52.6|62.3|10.9|46.9|37.0|33.9|
|Llama 1|33B|26.0|70.0|58.4|67.6|21.4|57.8|39.8|41.7|
|Llama 1|65B|30.7|70.7|60.5|68.6|30.8|63.4|43.5|47.6|
|Llama 2|7B|16.8|63.9|48.9|61.3|14.6|45.3|32.6|29.3|
|Llama 2|13B|24.5|66.9|55.4|65.8|28.7|54.8|39.4|39.1|
|Llama 2|70B|**37.5**|**71.9**|**63.6**|**69.4**|**35.2**|**68.9**|**51.2**|**54.2**|

**Overall performance on grouped academic benchmarks.** *Code:* We report the average pass@1 scores of our models on HumanEval and MBPP. *Commonsense Reasoning:* We report the average of PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, and CommonsenseQA. We report 7-shot results for CommonSenseQA and 0-shot results for all other benchmarks. *World Knowledge:* We evaluate the 5-shot performance on NaturalQuestions and TriviaQA and report the average. *Reading Comprehension:* For reading comprehension, we report the 0-shot average on SQuAD, QuAC, and BoolQ. *MATH:* We report the average of the GSM8K (8 shot) and MATH (4 shot) benchmarks at top 1.

|||TruthfulQA|Toxigen|
|---|---|---|---|
|Llama 1|7B|27.42|23.00|
|Llama 1|13B|41.74|23.08|
|Llama 1|33B|44.19|22.57|
|Llama 1|65B|48.71|21.77|
|Llama 2|7B|33.29|**21.25**|
|Llama 2|13B|41.86|26.10|
|Llama 2|70B|**50.18**|24.60|

**Evaluation of pretrained LLMs on automatic safety benchmarks.** For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we present the percentage of toxic generations (the smaller the better).

|||TruthfulQA|Toxigen|
|---|---|---|---|
|Llama-2-Chat|7B|57.04|**0.00**|
|Llama-2-Chat|13B|62.18|**0.00**|
|Llama-2-Chat|70B|**64.14**|0.01|

**Evaluation of fine-tuned LLMs on different safety datasets.** Same metric definitions as above.

# **Ethical Considerations and Limitations**

Llama 2 is a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover, all scenarios.
For these reasons, as with all LLMs, Llama 2’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their specific applications of the model. Please see the Responsible Use Guide available at [https://ai.meta.com/llama/responsible-use-guide/](https://ai.meta.com/llama/responsible-use-guide/)
  66. ----------
  67. # Llama 2 We are unlocking the power of large language models. Llama 2 is now accessible to individuals, creators, researchers, and businesses of all sizes so that they can experiment, innovate, and scale their ideas responsibly. This release includes model weights and starting code for pre-trained and fine-tuned Llama language models — ranging from 7B to 70B parameters. This repository is intended as a minimal example to load [Llama 2](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/) models and run inference. For more detailed examples leveraging Hugging Face, see [llama-recipes](https://github.com/facebookresearch/llama-recipes/). ## Updates post-launch See [UPDATES.md](UPDATES.md). Also for a running list of frequently asked questions, see [here](https://ai.meta.com/llama/faq/). ## Download In order to download the model weights and tokenizer, please visit the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and accept our License. Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download. Pre-requisites: Make sure you have `wget` and `md5sum` installed. Then run the script: `./download.sh`. Keep in mind that the links expire after 24 hours and a certain amount of downloads. If you start seeing errors such as `403: Forbidden`, you can always re-request a link. ### Access to Hugging Face We are also providing downloads on [Hugging Face](https://huggingface.co/meta-llama). You can request access to the models by acknowledging the license and filling the form in the model card of a repo. After doing so, you should get access to all the Llama models of a version (Code Llama, Llama 2, or Llama Guard) within 1 hour. ## Quick Start You can follow the steps below to quickly get up and running with Llama 2 models. These steps will let you run quick inference locally. For more examples, see the [Llama 2 recipes repository](https://github.com/facebookresearch/llama-recipes). 1. In a conda env with PyTorch / CUDA available clone and download this repository. 2. In the top-level directory run: pip install -e . 3. Visit the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and register to download the model/s. 4. Once registered, you will get an email with a URL to download the models. You will need this URL when you run the download.sh script. 5. Once you get the email, navigate to your downloaded llama repository and run the download.sh script. - Make sure to grant execution permissions to the download.sh script - During this process, you will be prompted to enter the URL from the email. - Do not use the “Copy Link” option but rather make sure to manually copy the link from the email. 6. Once the model/s you want have been downloaded, you can run the model locally using the command below: torchrun --nproc_per_node 1 example_chat_completion.py \ --ckpt_dir llama-2-7b-chat/ \ --tokenizer_path tokenizer.model \ --max_seq_len 512 --max_batch_size 6 **Note** - Replace `llama-2-7b-chat/` with the path to your checkpoint directory and `tokenizer.model` with the path to your tokenizer model. - The `–nproc_per_node` should be set to the [MP](#inference) value for the model you are using. - Adjust the `max_seq_len` and `max_batch_size` parameters as needed. 
- This example runs the [example_chat_completion.py](example_chat_completion.py) found in this repository, but you can change that to a different .py file.

## Inference

Different models require different model-parallel (MP) values:

| Model | MP |
|--------|----|
| 7B | 1 |
| 13B | 2 |
| 70B | 8 |

All models support sequence length up to 4096 tokens, but we pre-allocate the cache according to `max_seq_len` and `max_batch_size` values. So set those according to your hardware.

### Pretrained Models

These models are not finetuned for chat or Q&A. They should be prompted so that the expected answer is the natural continuation of the prompt. See `example_text_completion.py` for some examples. To illustrate, see the command below to run it with the llama-2-7b model (`nproc_per_node` needs to be set to the `MP` value):

torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4

### Fine-tuned Chat Models

The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in [`chat_completion`](https://github.com/facebookresearch/llama/blob/main/llama/generation.py#L212) needs to be followed, including the `INST` and `<<SYS>>` tags, `BOS` and `EOS` tokens, and the whitespaces and line breaks in between (we recommend calling `strip()` on inputs to avoid double-spaces). You can also deploy additional classifiers for filtering out inputs and outputs that are deemed unsafe. See the llama-recipes repo for [an example](https://github.com/facebookresearch/llama-recipes/blob/main/examples/inference.py) of how to add a safety checker to the inputs and outputs of your inference code.

Examples using llama-2-7b-chat:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

Llama 2 is a new technology that carries potential risks with use. Testing conducted to date has not — and could not — cover all scenarios. In order to help developers address these risks, we have created the [Responsible Use Guide](Responsible-Use-Guide.pdf). More details can be found in our research paper as well.

## Issues

Please report any software "bug", or other problems with the models through one of the following means:
- Reporting issues with the model: [github.com/facebookresearch/llama](http://github.com/facebookresearch/llama)
- Reporting risky content generated by the model: [developers.facebook.com/llama_output_feedback](http://developers.facebook.com/llama_output_feedback)
- Reporting bugs and security concerns: [facebook.com/whitehat/info](http://facebook.com/whitehat/info)

## Model Card

See [MODEL_CARD.md](MODEL_CARD.md).

## License

Our model and weights are licensed for both researchers and commercial entities, upholding the principles of openness. Our mission is to empower individuals and industry through this opportunity, while fostering an environment of discovery and ethical AI advancements. See the [LICENSE](LICENSE) file, as well as our accompanying [Acceptable Use Policy](USE_POLICY.md)

## References

1. [Research Paper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)
2. [Llama 2 technical overview](https://ai.meta.com/resources/models-and-libraries/llama)
3. [Open Innovation AI Research Community](https://ai.meta.com/llama/open-innovation-ai-research-community/)

For common questions, the FAQ can be found [here](https://ai.meta.com/llama/faq/) which will be kept up to date over time as new questions arise.
## Original Llama

The repo for the original llama release is in the [`llama_v1`](https://github.com/facebookresearch/llama/tree/llama_v1) branch.
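Returning to the Quick Start above: the `torchrun` command simply launches `example_chat_completion.py`. The sketch below paraphrases what such a script does, assuming the `Llama.build` / `chat_completion` interface used by this repository's example scripts; treat the argument names and the result structure as assumptions and defer to the scripts themselves.

```python
# Sketch of the flow in the example scripts (run under torchrun as shown in Quick Start).
# The Llama.build / chat_completion interface is assumed from this repo's examples.
from llama import Llama

generator = Llama.build(
    ckpt_dir="llama-2-7b-chat/",       # path to the downloaded checkpoint directory
    tokenizer_path="tokenizer.model",  # path to the tokenizer model
    max_seq_len=512,
    max_batch_size=6,
)

dialogs = [
    [{"role": "user", "content": "Write a haiku about open models."}],
]
results = generator.chat_completion(dialogs, max_gen_len=None, temperature=0.6, top_p=0.9)

for dialog, result in zip(dialogs, results):
    # Each result is expected to carry the assistant reply under result["generation"].
    print(result["generation"]["content"])
```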
  68. ----------
69. ## Model Details

Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B and 70B sizes. The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks. Further, in developing these models, we took great care to optimize helpfulness and safety.

**Model developers** Meta

**Variations** Llama 3 comes in two sizes — 8B and 70B parameters — in pre-trained and instruction tuned variants.

**Input** Models input text only.

**Output** Models generate text and code only.

**Model Architecture** Llama 3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

|         | Training Data | Params | Context length | GQA | Token count | Knowledge cutoff |
|---------|---------------|--------|----------------|-----|-------------|------------------|
| Llama 3 | A new mix of publicly available online data. | 8B  | 8k | Yes | 15T+ | March, 2023    |
| Llama 3 | A new mix of publicly available online data. | 70B | 8k | Yes | 15T+ | December, 2023 |

**Llama 3 family of models**. Token counts refer to pretraining data only. Both the 8B and 70B versions use Grouped-Query Attention (GQA) for improved inference scalability.

**Model Release Date** April 18, 2024.

**Status** This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.

**License** A custom commercial license is available at: [https://llama.meta.com/llama3/license](https://llama.meta.com/llama3/license)

**Where to send questions or comments about the model** Instructions on how to provide feedback or comments on the model can be found in the model [README](https://github.com/meta-llama/llama3). For more technical information about generation parameters and recipes for how to use Llama 3 in applications, please go [here](https://github.com/meta-llama/llama-recipes).

## Intended Use

**Intended Use Cases** Llama 3 is intended for commercial and research use in English. Instruction tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks.

**Out-of-scope** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the [Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/) and [Llama 3 Community License](https://llama.meta.com/llama3/license/). Use in languages other than English.*

*Note: Developers may fine-tune Llama 3 models for languages beyond English provided they comply with the [Llama 3 Community License](https://llama.meta.com/llama3/license/) and the [Acceptable Use Policy](https://llama.meta.com/llama3/use-policy/).

## Hardware and Software

**Training Factors** We used custom training libraries, Meta's Research SuperCluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute.

**Carbon Footprint** Pretraining utilized a cumulative 7.7M GPU hours of computation on hardware of type H100-80GB (TDP of 700W). Estimated total emissions were 2290 tCO2eq, 100% of which were offset by Meta’s sustainability program.

|             | Time (GPU hours) | Power Consumption (W) | Carbon Emitted (tCO2eq) |
|-------------|------------------|-----------------------|-------------------------|
| Llama 3 8B  | 1.3M             | 700                   | 390                     |
| Llama 3 70B | 6.4M             | 700                   | 1900                    |
| Total       | 7.7M             |                       | 2290                    |

**CO2 emissions during pre-training**.
Time: total GPU time required for training each model. Power Consumption: peak power capacity per GPU device for the GPUs used, adjusted for power usage efficiency. 100% of the emissions are directly offset by Meta's sustainability program, and because we are openly releasing these models, the pretraining costs do not need to be incurred by others.

## Training Data

**Overview** Llama 3 was pretrained on over 15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 10M human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data.

**Data Freshness** The pretraining data has a cutoff of March 2023 for the 8B and December 2023 for the 70B models, respectively.

## Benchmarks

In this section, we report the results for Llama 3 models on standard automatic benchmarks. For all the evaluations, we use our internal evaluations library. For details on the methodology see [here](https://github.com/meta-llama/llama3/blob/main/eval_details.md).

### Base pretrained models

| Category | Benchmark | Llama 3 8B | Llama 2 7B | Llama 2 13B | Llama 3 70B | Llama 2 70B |
|----------|-----------|------------|------------|-------------|-------------|-------------|
| General | MMLU (5-shot) | 66.6 | 45.7 | 53.8 | 79.5 | 69.7 |
| | AGIEval English (3-5 shot) | 45.9 | 28.8 | 38.7 | 63.0 | 54.8 |
| | CommonSenseQA (7-shot) | 72.6 | 57.6 | 67.6 | 83.8 | 78.7 |
| | Winogrande (5-shot) | 76.1 | 73.3 | 75.4 | 83.1 | 81.8 |
| | BIG-Bench Hard (3-shot, CoT) | 61.1 | 38.1 | 47.0 | 81.3 | 65.7 |
| | ARC-Challenge (25-shot) | 78.6 | 53.7 | | 93.0 | 85.3 |
| Knowledge reasoning | TriviaQA-Wiki (5-shot) | 78.5 | 72.1 | 79.6 | 89.7 | 87.5 |
| Reading comprehension | SQuAD (1-shot) | 76.4 | 72.2 | | 85.6 | 82.6 |
| | QuAC (1-shot, F1) | 44.4 | 39.6 | 44.9 | 51.1 | 49.4 |
| | BoolQ (0-shot) | 75.7 | 65.5 | 66.9 | 79.0 | 73.1 |
| | DROP (3-shot, F1) | 58.4 | 37.9 | 49.8 | 79.7 | 70.2 |

### Instruction tuned models

| Benchmark | Llama 3 8B | Llama 2 7B | Llama 2 13B | Llama 3 70B | Llama 2 70B |
|-----------|------------|------------|-------------|-------------|-------------|
| MMLU (5-shot) | 68.4 | 34.1 | 47.8 | 82.0 | 52.9 |
| GPQA (0-shot) | 34.2 | 21.7 | 22.3 | 39.5 | 21.0 |
| HumanEval (0-shot) | 62.2 | 7.9 | 14.0 | 81.7 | 25.6 |
| GSM-8K (8-shot, CoT) | | 25.7 | 77.4 | | 57.5 |
| MATH (4-shot, CoT) | 30.0 | 3.8 | 6.7 | 50.4 | 11.6 |

### Responsibility & Safety

We believe that an open approach to AI leads to better, safer products, faster innovation, and a bigger overall market. We are committed to Responsible AI development and took a series of steps to limit misuse and harm and support the open source community.

Foundation models are widely capable technologies that are built to be used for a diverse range of applications. They are not designed to meet every developer's safety preferences for all use cases out of the box, as those by their nature differ across applications. Rather, responsible LLM-application deployment is achieved by implementing a series of safety best practices throughout the development of such applications, from model pre-training and fine-tuning to the deployment of systems composed of safeguards that tailor safety to the specific use case and audience.

As part of the Llama 3 release, we updated our [Responsible Use Guide](https://llama.meta.com/responsible-use-guide/) to outline the steps and best practices for developers to implement model and system level safety for their application. We also provide a set of resources including [Meta Llama Guard 2](https://llama.meta.com/purple-llama/) and [Code Shield](https://llama.meta.com/purple-llama/) safeguards. These tools have proven to drastically reduce residual risks of LLM systems, while maintaining a high level of helpfulness.
We encourage developers to tune and deploy these safeguards according to their needs and we provide a [reference implementation](https://github.com/meta-llama/llama-recipes/tree/main/recipes/responsible_ai) to get you started.

#### Llama 3-Instruct

As outlined in the Responsible Use Guide, some trade-off between model helpfulness and model alignment is likely unavoidable. Developers should exercise discretion about how to weigh the benefits of alignment and helpfulness for their specific use case and audience. Developers should be mindful of residual risks when using Llama models and leverage additional safety tools as needed to reach the right safety bar for their use case.

**Safety** For our instruction tuned model, we conducted extensive red teaming exercises, performed adversarial evaluations and implemented safety mitigation techniques to lower residual risks. As with any Large Language Model, residual risks will likely remain and we recommend that developers assess these risks in the context of their use case. In parallel, we are working with the community to make AI safety benchmark standards transparent, rigorous and interpretable.

**Refusals** In addition to residual risks, we put a great emphasis on model refusals to benign prompts. Over-refusing not only impacts the user experience but can even be harmful in certain contexts. We’ve heard the feedback from the developer community and improved our fine-tuning to ensure that Llama 3 is significantly less likely to falsely refuse to answer prompts than Llama 2. We built internal benchmarks and developed mitigations to limit false refusals, making Llama 3 our most helpful model to date.

#### Responsible release

In addition to the responsible use considerations outlined above, we followed a rigorous process that requires us to take extra measures against misuse and critical risks before we make our release decision.

**Misuse** If you access or use Llama 3, you agree to the Acceptable Use Policy. The most recent copy of this policy can be found at [https://llama.meta.com/llama3/use-policy/](https://llama.meta.com/llama3/use-policy/).

#### Critical risks

**CBRNE (Chemical, Biological, Radiological, Nuclear, and high yield Explosives)** We have conducted a two-fold assessment of the safety of the model in this area:

* Iterative testing during model training to assess the safety of responses related to CBRNE threats and other adversarial risks.
* Involving external CBRNE experts to conduct an uplift test assessing the ability of the model to accurately provide expert knowledge and reduce barriers to potential CBRNE misuse, by reference to what can be achieved using web search (without the model).

### Cyber Security

We have evaluated Llama 3 with CyberSecEval, Meta’s cybersecurity safety eval suite, measuring Llama 3’s propensity to suggest insecure code when used as a coding assistant, and Llama 3’s propensity to comply with requests to help carry out cyber attacks, where attacks are defined by the industry standard MITRE ATT&CK cyber attack ontology. On our insecure coding and cyber attacker helpfulness tests, Llama 3 behaved in the same range or safer than models of [equivalent coding capability](https://huggingface.co/spaces/facebook/CyberSecEval).

**Child Safety** Child Safety risk assessments were conducted using a team of experts, to assess the model’s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine-tuning.
We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective-based methodologies to assess the model risks along multiple attack vectors. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content, while taking account of market-specific nuances or experiences.

### Community

Generative AI safety requires expertise and tooling, and we believe in the strength of the open community to accelerate its progress. We are active members of open consortiums, including the AI Alliance, Partnership on AI and MLCommons, actively contributing to safety standardization and transparency. We encourage the community to adopt taxonomies like the MLCommons Proof of Concept evaluation to facilitate collaboration and transparency on safety and content evaluations. Our Purple Llama tools are open sourced for the community to use and widely distributed across ecosystem partners including cloud service providers. We encourage community contributions to our [GitHub repository](https://github.com/meta-llama/PurpleLlama).

Finally, we put in place a set of resources including an [output reporting mechanism](https://developers.facebook.com/llama_output_feedback) and [bug bounty program](https://www.facebook.com/whitehat) to continuously improve the Llama technology with the help of the community.

## Ethical Considerations and Limitations

The core values of Llama 3 are openness, inclusivity and helpfulness. It is meant to serve everyone, and to work for a wide range of use cases. It is thus designed to be accessible to people across many different backgrounds, experiences and perspectives. Llama 3 addresses users and their needs as they are, without inserting unnecessary judgment or normativity, while reflecting the understanding that even content that may appear problematic in some cases can serve valuable purposes in others. It respects the dignity and autonomy of all users, especially in terms of the values of free thought and expression that power innovation and progress.

But Llama 3 is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has been in English, and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 3’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 3 models, developers should perform safety testing and tuning tailored to their specific applications of the model. As outlined in the Responsible Use Guide, we recommend incorporating [Purple Llama](https://github.com/facebookresearch/PurpleLlama) solutions into your workflows, specifically [Llama Guard](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/), which provides a base model to filter input and output prompts to layer system-level safety on top of model-level safety.
Please see the Responsible Use Guide available at [http://llama.meta.com/responsible-use-guide](http://llama.meta.com/responsible-use-guide).

## Citation instructions

    @article{llama3modelcard,
      title={Llama 3 Model Card},
      author={AI@Meta},
      year={2024},
      url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
    }

## Contributors

Aaditya Singh; Aaron Grattafiori; Abhimanyu Dubey; Abhinav Jauhri; Abhinav Pandey; Abhishek Kadian; Adam Kelsey; Adi Gangidi; Ahmad Al-Dahle; Amit Sangani; Ahuva Goldstand; Aiesha Letman; Ajay Menon; Akhil Mathur; Alan Schelten; Alex Vaughan; Amy Yang; Andrei Lupu; Andres Alvarado; Andrew Gallagher; Andrew Gu; Andrew Ho; Andrew Poulton; Andrew Ryan; Angela Fan; Ankit Ramchandani; Anthony Hartshorn; Archi Mitra; Archie Sravankumar; Artem Korenev; Arun Rao; Ashley Gabriel; Ashwin Bharambe; Assaf Eisenman; Aston Zhang; Ash JJhaveri; Aurelien Rodriguez; Austen Gregerson; Ava Spataru; Baptiste Roziere; Ben Maurer; Benjamin Leonhardi; Bernie Huang; Bhargavi Paranjape; Bing Liu; Binh Tang; Bobbie Chern; Brani Stojkovic; Brian Fuller; Catalina Mejia Arenas; Chao Zhou; Charlotte Caucheteux; Chaya Nayak; Ching-Hsiang Chu; Chloe Bi; Chris Cai; Chris Cox; Chris Marra; Chris McConnell; Christian Keller; Christoph Feichtenhofer; Christophe Touret; Chunyang Wu; Corinne Wong; Cristian Canton Ferrer; Damien Allonsius; Daniel Kreymer; Daniel Haziza; Daniel Li; Danielle Pintz; Danny Livshits; Danny Wyatt; David Adkins; David Esiobu; David Xu; Davide Testuggine; Delia David; Devi Parikh; Dhruv Choudhary; Dhruv Mahajan; Diana Liskovich; Diego Garcia-Olano; Diego Perino; Dieuwke Hupkes; Dingkang Wang; Dustin Holland; Egor Lakomkin; Elina Lobanova; Xiaoqing Ellen Tan; Emily Dinan; Eric Smith; Erik Brinkman; Esteban Arcaute; Filip Radenovic; Firat Ozgenel; Francesco Caggioni; Frank Seide; Frank Zhang; Gabriel Synnaeve; Gabriella Schwarz; Gabrielle Lee; Gada Badeer; Georgia Anderson; Graeme Nail; Gregoire Mialon; Guan Pang; Guillem Cucurell; Hailey Nguyen; Hamid Shojanazeri; Hannah Korevaar; Hannah Wang; Haroun Habeeb; Harrison Rudolph; Henry Aspegren; Hu Xu; Hugo Touvron; Iga Kozlowska; Igor Molybog; Igor Tufanov; Iliyan Zarov; Imanol Arrieta Ibarra; Irina-Elena Veliche; Isabel Kloumann; Ishan Misra; Ivan Evtimov; Jade Copet; Jake Weissman; Jan Geffert; Jana Vranes; Japhet Asher; Jason Park; Jay Mahadeokar; Jean-Baptiste Gaya; Jeet Shah; Jelmer van der Linde; Jennifer Chan; Jenny Hong; Jenya Lee; Jeremy Fu; Jeremy Teboul; Jianfeng Chi; Jianyu Huang; Jie Wang; Jiecao Yu; Joanna Bitton; Joe Spisak; Joelle Pineau; Jon Carvill; Jongsoo Park; Joseph Rocca; Joshua Johnstun; Junteng Jia; Kalyan Vasuden Alwala; Kam Hou U; Kate Plawiak; Kartikeya Upasani; Kaushik Veeraraghavan; Ke Li; Kenneth Heafield; Kevin Stone; Khalid El-Arini; Krithika Iyer; Kshitiz Malik; Kuenley Chiu; Kunal Bhalla; Kyle Huang; Lakshya Garg; Lauren Rantala-Yeary; Laurens van der Maaten; Lawrence Chen; Leandro Silva; Lee Bell; Lei Zhang; Liang Tan; Louis Martin; Lovish Madaan; Luca Wehrstedt; Lukas Blecher; Luke de Oliveira; Madeline Muzzi; Madian Khabsa; Manav Avlani; Mannat Singh; Manohar Paluri; Mark Zuckerberg; Marcin Kardas; Martynas Mankus; Mathew Oldham; Mathieu Rita; Matthew Lennie; Maya Pavlova; Meghan Keneally; Melanie Kambadur; Mihir Patel; Mikayel Samvelyan; Mike Clark; Mike Lewis; Min Si; Mitesh Kumar Singh; Mo Metanat; Mona Hassan; Naman Goyal; Narjes Torabi; Nicolas Usunier; Nikolay Bashlykov; Nikolay Bogoychev; Niladri Chatterji; Ning Dong; Oliver Aobo Yang; Olivier Duchenne; Onur Celebi; Parth Parekh;
Patrick Alrassy; Paul Saab; Pavan Balaji; Pedro Rittner; Pengchuan Zhang; Pengwei Li; Petar Vasic; Peter Weng; Polina Zvyagina; Prajjwal Bhargava; Pratik Dubal; Praveen Krishnan; Punit Singh Koura; Puxin Xu; Qing He; Rachel Rodriguez; Ragavan Srinivasan; Rahul Mitra; Ramon Calderer; Raymond Li; Robert Stojnic; Roberta Raileanu; Robin Battey; Rocky Wang; Rohit Girdhar; Rohit Patel; Romain Sauvestre; Ronnie Polidoro; Roshan Sumbaly; Ross Taylor; Ruan Silva; Rui Hou; Rui Wang; Russ Howes; Ruty Rinott; Saghar Hosseini; Sai Jayesh Bondu; Samyak Datta; Sanjay Singh; Sara Chugh; Sargun Dhillon; Satadru Pan; Sean Bell; Sergey Edunov; Shaoliang Nie; Sharan Narang; Sharath Raparthy; Shaun Lindsay; Sheng Feng; Sheng Shen; Shenghao Lin; Shiva Shankar; Shruti Bhosale; Shun Zhang; Simon Vandenhende; Sinong Wang; Seohyun Sonia Kim; Soumya Batra; Sten Sootla; Steve Kehoe; Suchin Gururangan; Sumit Gupta; Sunny Virk; Sydney Borodinsky; Tamar Glaser; Tamar Herman; Tamara Best; Tara Fowler; Thomas Georgiou; Thomas Scialom; Tianhe Li; Todor Mihaylov; Tong Xiao; Ujjwal Karn; Vedanuj Goswami; Vibhor Gupta; Vignesh Ramanathan; Viktor Kerkez; Vinay Satish Kumar; Vincent Gonguet; Vish Vogeti; Vlad Poenaru; Vlad Tiberiu Mihailescu; Vladan Petrovic; Vladimir Ivanov; Wei Li; Weiwei Chu; Wenhan Xiong; Wenyin Fu; Wes Bouaziz; Whitney Meers; Will Constable; Xavier Martinet; Xiaojian Wu; Xinbo Gao; Xinfeng Xie; Xuchao Jia; Yaelle Goldschlag; Yann LeCun; Yashesh Gaur; Yasmine Babaei; Ye Qi; Yenda Li; Yi Wen; Yiwen Song; Youngjin Nam; Yuchen Hao; Yuchen Zhang; Yun Wang; Yuning Mao; Yuzi He; Zacharie Delpierre Coudert; Zachary DeVito; Zahra Hankir; Zhaoduo Wen; Zheng Yan; Zhengxing Chen; Zhenyu Yang; Zoe Papakipos
  70. ----------
71. 🤗 Models on Hugging Face | Blog | Website

---

# Meta Llama 3

We are unlocking the power of large language models. Our latest version of Llama is now accessible to individuals, creators, researchers, and businesses of all sizes so that they can experiment, innovate, and scale their ideas responsibly.

This release includes model weights and starting code for pre-trained and instruction-tuned Llama 3 language models — including sizes of 8B to 70B parameters.

This repository is a minimal example of loading Llama 3 models and running inference. For more detailed examples, see [llama-recipes](https://github.com/facebookresearch/llama-recipes/).

## Download

To download the model weights and tokenizer, please visit the [Meta Llama website](https://llama.meta.com/llama-downloads/) and accept our License. Once your request is approved, you will receive a signed URL over email. Then, run the download.sh script, passing the URL provided when prompted to start the download.

Pre-requisites: Ensure you have `wget` and `md5sum` installed. Then run the script: `./download.sh`.

Remember that the links expire after 24 hours and a certain amount of downloads. You can always re-request a link if you start seeing errors such as `403: Forbidden`.

### Access to Hugging Face

We also provide downloads on [Hugging Face](https://huggingface.co/meta-llama), in both transformers and native `llama3` formats. To download the weights from Hugging Face, please follow these steps:

- Visit one of the repos, for example [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
- Read and accept the license. Once your request is approved, you'll be granted access to all the Llama 3 models. Note that requests used to take up to one hour to get processed.
- To download the original native weights to use with this repo, click on the "Files and versions" tab and download the contents of the `original` folder. You can also download them from the command line if you `pip install huggingface-hub`:

        huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --include "original/*" --local-dir meta-llama/Meta-Llama-3-8B-Instruct

- To use with transformers, the following [pipeline](https://huggingface.co/docs/transformers/en/main_classes/pipelines) snippet will download and cache the weights:

```python
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)
```

## Quick Start

You can follow the steps below to get up and running with Llama 3 models quickly. These steps will let you run quick inference locally. For more examples, see the [Llama recipes repository](https://github.com/facebookresearch/llama-recipes).

1. Clone and download this repository in a conda env with PyTorch / CUDA.
2. In the top-level directory run: `pip install -e .`
3. Visit the [Meta Llama website](https://llama.meta.com/llama-downloads/) and register to download the model/s.
4. Once registered, you will get an email with a URL to download the models. You will need this URL when you run the download.sh script.
5. Once you get the email, navigate to your downloaded llama repository and run the download.sh script.
    - Make sure to grant execution permissions to the download.sh script.
    - During this process, you will be prompted to enter the URL from the email.
    - Do not use the “Copy Link” option; copy the link from the email manually.
6. Once the model/s you want have been downloaded, you can run the model locally using the command below:

        torchrun --nproc_per_node 1 example_chat_completion.py \
            --ckpt_dir Meta-Llama-3-8B-Instruct/ \
            --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
            --max_seq_len 512 --max_batch_size 6

- Replace `Meta-Llama-3-8B-Instruct/` with the path to your checkpoint directory and `Meta-Llama-3-8B-Instruct/tokenizer.model` with the path to your tokenizer model.
- The `--nproc_per_node` should be set to the [MP](#inference) value for the model you are using.
- Adjust the `max_seq_len` and `max_batch_size` parameters as needed.
- This example runs the [example_chat_completion.py](example_chat_completion.py) found in this repository, but you can change that to a different .py file.

## Inference

Different models require different model-parallel (MP) values:

| Model | MP |
|-------|----|
| 8B    | 1  |
| 70B   | 8  |

All models support sequence length up to 8192 tokens, but we pre-allocate the cache according to `max_seq_len` and `max_batch_size` values. So set those according to your hardware.

### Pretrained Models

These models are not finetuned for chat or Q&A. They should be prompted so that the expected answer is the natural continuation of the prompt.

See `example_text_completion.py` for some examples. To illustrate, see the command below to run it with the llama-3-8b model (`nproc_per_node` needs to be set to the `MP` value):

    torchrun --nproc_per_node 1 example_text_completion.py \
        --ckpt_dir Meta-Llama-3-8B/ \
        --tokenizer_path Meta-Llama-3-8B/tokenizer.model \
        --max_seq_len 128 --max_batch_size 4

### Instruction-tuned Models

The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, specific formatting defined in [`ChatFormat`](https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L202) needs to be followed: the prompt begins with a `<|begin_of_text|>` special token, after which one or more messages follow. Each message starts with the `<|start_header_id|>` tag, the role (`system`, `user` or `assistant`), and the `<|end_header_id|>` tag. After a double newline `\n\n`, the message's contents follow. The end of each message is marked by the `<|eot_id|>` token. An illustrative prompt-building sketch appears at the end of this README section.

You can also deploy additional classifiers to filter out inputs and outputs that are deemed unsafe. See the llama-recipes repo for [an example](https://github.com/meta-llama/llama-recipes/blob/main/recipes/inference/local_inference/inference.py) of how to add a safety checker to the inputs and outputs of your inference code.

Examples using llama-3-8b-chat:

    torchrun --nproc_per_node 1 example_chat_completion.py \
        --ckpt_dir Meta-Llama-3-8B-Instruct/ \
        --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
        --max_seq_len 512 --max_batch_size 6

Llama 3 is a new technology that carries potential risks with use. Testing conducted to date has not — and could not — cover all scenarios. To help developers address these risks, we have created the [Responsible Use Guide](https://ai.meta.com/static-resource/responsible-use-guide/).

## Issues

Please report any software “bug” or other problems with the models through one of the following means:
- Reporting issues with the model: [https://github.com/meta-llama/llama3/issues](https://github.com/meta-llama/llama3/issues)
- Reporting risky content generated by the model: [developers.facebook.com/llama_output_feedback](http://developers.facebook.com/llama_output_feedback)
- Reporting bugs and security concerns: [facebook.com/whitehat/info](http://facebook.com/whitehat/info)

## License

Our model and weights are licensed for researchers and commercial entities, upholding the principles of openness.
Our mission is to empower individuals and industry through this opportunity while fostering an environment of discovery and ethical AI advancements.

See the [LICENSE](LICENSE) file, as well as our accompanying [Acceptable Use Policy](USE_POLICY.md).

## Questions

For common questions, the FAQ can be found [here](https://llama.meta.com/faq), which will be updated over time as new questions arise.
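To make the `ChatFormat` description above concrete, here is a minimal, illustrative sketch that assembles a prompt string from a list of messages. It is not the repository's tokenizer-level implementation (the `ChatFormat` class in `llama/tokenizer.py`, which operates on token IDs, is authoritative); the helper name and example messages are our own.

```python
# Illustrative sketch of the documented Llama 3 chat layout. The real ChatFormat in
# llama/tokenizer.py builds token IDs directly; this only mirrors the textual structure.
def build_llama3_prompt(messages: list[dict]) -> str:
    prompt = "<|begin_of_text|>"
    for msg in messages:
        # Each message: role header, a blank line, the content, then <|eot_id|>.
        prompt += (
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content'].strip()}<|eot_id|>"
        )
    # Close with an empty assistant header so generation continues as the assistant.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

print(build_llama3_prompt([
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain model parallelism in one sentence."},
]))
```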
  72. ----------
  73. # Code Llama ## **Model Details** **Model Developers** Meta AI **Variations** Code Llama comes in four model sizes, and three variants: 1) Code Llama: our base models are designed for general code synthesis and understanding 2) Code Llama - Python: designed specifically for Python 3) Code Llama - Instruct: for instruction following and safer deployment All variants are available in sizes of 7B, 13B, 34B and 70B parameters. **Input** Models input text only. **Output** Models output text only. **Model Architecture** Code Llama and its variants are autoregressive language models using optimized transformer architectures. Code Llama 7B, 13B and 70B additionally support infilling text generation. All models but Code Llama - Python 70B and Code Llama - Instruct 70B were fine-tuned with up to 16K tokens, and support up to 100K tokens at inference time. **Model Dates** Code Llama and its variants have been trained between January 2023 and January 2024. **Status** This is a static model trained on an offline dataset. Future versions of Code Llama - Instruct will be released as we improve model safety with community feedback. **Licence** A custom commercial license is available at: [https://ai.meta.com/resources/models-and-libraries/llama-downloads/](https://ai.meta.com/resources/models-and-libraries/llama-downloads/). **Research Paper** More information can be found in the paper "[Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/)". **Where to send comments** Instructions on how to provide feedback or comments on the model can be found in the model [README](README.md), or by opening an issue in the GitHub repository ([https://github.com/facebookresearch/codellama/](https://github.com/facebookresearch/codellama/)). ## **Intended Use** **Intended Use Cases** Code Llama and its variants are intended for commercial and research use in English and relevant programming languages. The base model Code Llama can be adapted for a variety of code synthesis and understanding tasks, Code Llama - Python is designed specifically to handle the Python programming language, and Code Llama - Instruct is intended to be safer to use for code assistance and generation applications. **Out-of-Scope Uses** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Code Llama and its variants. ## **Hardware and Software** **Training Factors** We used custom training libraries. The training and fine-tuning of the released models have been performed by Meta’s Research Super Cluster. **Carbon Footprint** In aggregate, training all 12 Code Llama models required 1400K GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W). Estimated total emissions were 228.55 tCO2eq, 100% of which were offset by Meta’s sustainability program. **Training data** All experiments reported here and the released models have been trained and fine-tuned using the same data as Llama 2 with different weights (see Section 2 and Table 1 in the [research paper](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) for details). Code Llama - Instruct uses additional instruction fine-tuning data. **Evaluation Results** See evaluations for the main models and detailed ablations in Section 3 and safety evaluations in Section 4 of the research paper. 
## **Ethical Considerations and Limitations**

Code Llama and its variants are a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Code Llama’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate or objectionable responses to user prompts. Therefore, before deploying any applications of Code Llama, developers should perform safety testing and tuning tailored to their specific applications of the model.

Please see the Responsible Use Guide available at [https://ai.meta.com/llama/responsible-user-guide](https://ai.meta.com/llama/responsible-user-guide).
  74. ----------
  75. # Introducing Code Llama Code Llama is a family of large language models for code based on [Llama 2](https://github.com/facebookresearch/llama) providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama was developed by fine-tuning Llama 2 using a higher sampling of code. As with Llama 2, we applied considerable safety mitigations to the fine-tuned versions of the model. For detailed information on model training, architecture and parameters, evaluations, responsible AI and safety refer to our [research paper](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/). Output generated by code generation features of the Llama Materials, including Code Llama, may be subject to third party licenses, including, without limitation, open source licenses. We are unlocking the power of large language models and our latest version of Code Llama is now accessible to individuals, creators, researchers and businesses of all sizes so that they can experiment, innovate and scale their ideas responsibly. This release includes model weights and starting code for pretrained and fine-tuned Llama language models — ranging from 7B to 34B parameters. This repository is intended as a minimal example to load [Code Llama](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) models and run inference. [comment]: <> (Code Llama models are compatible with the scripts in llama-recipes) In order to download the model weights and tokenizers, please visit the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and accept our License. Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download. Make sure that you copy the URL text itself, **do not use the 'Copy link address' option** when you right click the URL. If the copied URL text starts with: https://download.llamameta.net, you copied it correctly. If the copied URL text starts with: https://l.facebook.com, you copied it the wrong way. Pre-requisites: make sure you have `wget` and `md5sum` installed. Then to run the script: `bash download.sh`. Keep in mind that the links expire after 24 hours and a certain amount of downloads. If you start seeing errors such as `403: Forbidden`, you can always re-request a link. ### Model sizes | Model | Size | |-------|----------| | 7B | ~12.55GB | | 13B | 24GB | | 34B | 63GB | | 70B | 131GB | [comment]: <> (Access on Hugging Face, We are also providing downloads on Hugging Face. You must first request a download from the Meta website using the same email address as your Hugging Face account. After doing so, you can request access to any of the models on Hugging Face and within 1-2 days your account will be granted access to all versions.) 
## Setup

In a conda environment with PyTorch / CUDA available, clone the repo and run in the top-level directory:

    pip install -e .

## Inference

Different models require different model-parallel (MP) values:

| Model | MP |
|-------|----|
| 7B    | 1  |
| 13B   | 2  |
| 34B   | 4  |
| 70B   | 8  |

All models, except the 70B Python and Instruct versions, support sequence lengths up to 100,000 tokens, but we pre-allocate the cache according to `max_seq_len` and `max_batch_size` values. So set those according to your hardware and use-case.

### Pretrained Code Models

The Code Llama and Code Llama - Python models are not fine-tuned to follow instructions. They should be prompted so that the expected answer is the natural continuation of the prompt.

See `example_completion.py` for some examples. To illustrate, see the command below to run it with the `CodeLlama-7b` model (`nproc_per_node` needs to be set to the `MP` value):

    torchrun --nproc_per_node 1 example_completion.py \
        --ckpt_dir CodeLlama-7b/ \
        --tokenizer_path CodeLlama-7b/tokenizer.model \
        --max_seq_len 128 --max_batch_size 4

Pretrained code models are: the Code Llama models `CodeLlama-7b`, `CodeLlama-13b`, `CodeLlama-34b`, `CodeLlama-70b` and the Code Llama - Python models `CodeLlama-7b-Python`, `CodeLlama-13b-Python`, `CodeLlama-34b-Python`, `CodeLlama-70b-Python`.

### Code Infilling

Code Llama and Code Llama - Instruct 7B and 13B models are capable of filling in code given the surrounding context.

See `example_infilling.py` for some examples. The `CodeLlama-7b` model can be run for infilling with the command below (`nproc_per_node` needs to be set to the `MP` value):

    torchrun --nproc_per_node 1 example_infilling.py \
        --ckpt_dir CodeLlama-7b/ \
        --tokenizer_path CodeLlama-7b/tokenizer.model \
        --max_seq_len 192 --max_batch_size 4

Pretrained infilling models are: the Code Llama models `CodeLlama-7b` and `CodeLlama-13b` and the Code Llama - Instruct models `CodeLlama-7b-Instruct`, `CodeLlama-13b-Instruct`.

### Fine-tuned Instruction Models

Code Llama - Instruct models are fine-tuned to follow instructions. To get the expected features and performance for the 7B, 13B and 34B variants, a specific formatting defined in [`chat_completion()`](https://github.com/facebookresearch/codellama/blob/main/llama/generation.py#L319-L361) needs to be followed, including the `INST` and `<<SYS>>` tags, `BOS` and `EOS` tokens, and the whitespaces and linebreaks in between (we recommend calling `strip()` on inputs to avoid double-spaces). `CodeLlama-70b-Instruct` requires a separate turn-based prompt format defined in [`dialog_prompt_tokens()`](https://github.com/facebookresearch/codellama/blob/main/llama/generation.py#L506-L548). You can use `chat_completion()` directly to generate answers with all instruct models; it will automatically perform the required formatting.

You can also deploy additional classifiers for filtering out inputs and outputs that are deemed unsafe. See the llama-recipes repo for [an example](https://github.com/facebookresearch/llama-recipes/blob/main/src/llama_recipes/inference/safety_utils.py) of how to add a safety checker to the inputs and outputs of your inference code.

Examples using `CodeLlama-7b-Instruct`:

    torchrun --nproc_per_node 1 example_instructions.py \
        --ckpt_dir CodeLlama-7b-Instruct/ \
        --tokenizer_path CodeLlama-7b-Instruct/tokenizer.model \
        --max_seq_len 512 --max_batch_size 4

Fine-tuned instruction-following models are: the Code Llama - Instruct models `CodeLlama-7b-Instruct`, `CodeLlama-13b-Instruct`, `CodeLlama-34b-Instruct`, `CodeLlama-70b-Instruct`.

Code Llama is a new technology that carries potential risks with use.
Testing conducted to date has not — and could not — cover all scenarios. In order to help developers address these risks, we have created the [Responsible Use Guide](https://github.com/facebookresearch/llama/blob/main/Responsible-Use-Guide.pdf). More details can be found in our research papers as well. Please report any software “bug”, or other problems with the models through one of the following means: - Reporting issues with the model: [github.com/facebookresearch/codellama](http://github.com/facebookresearch/codellama) - Reporting risky content generated by the model: [developers.facebook.com/llama_output_feedback](http://developers.facebook.com/llama_output_feedback) - Reporting bugs and security concerns: [facebook.com/whitehat/info](http://facebook.com/whitehat/info) See [MODEL_CARD.md](MODEL_CARD.md) for the model card of Code Llama. Our model and weights are licensed for both researchers and commercial entities, upholding the principles of openness. Our mission is to empower individuals, and industry through this opportunity, while fostering an environment of discovery and ethical AI advancements. See the [LICENSE](https://github.com/facebookresearch/llama/blob/main/LICENSE) file, as well as our accompanying [Acceptable Use Policy](https://github.com/facebookresearch/llama/blob/main/USE_POLICY.md) 1. [Code Llama Research Paper](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) 2. [Code Llama Blog Post](https://ai.meta.com/blog/code-llama-large-language-model-coding/)
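To make the code-infilling workflow described under Code Infilling above concrete, here is a rough sketch of what a script like `example_infilling.py` does: build the model with `Llama.build` and ask it to fill the gap between a prefix and a suffix. The `text_infilling` call, its argument names, and the shape of its return value are assumptions based on this repository's example scripts; check `example_infilling.py` and `llama/generation.py` for the authoritative interface.

```python
# Rough sketch of an infilling call (run under torchrun, as with the command above).
# The text_infilling method name and arguments are assumptions; see example_infilling.py.
from llama import Llama

generator = Llama.build(
    ckpt_dir="CodeLlama-7b/",
    tokenizer_path="CodeLlama-7b/tokenizer.model",
    max_seq_len=192,
    max_batch_size=4,
)

# The model is asked to produce the code that belongs between prefix and suffix.
prefix = 'def remove_non_ascii(s: str) -> str:\n    """ '
suffix = "\n    return result\n"

results = generator.text_infilling(
    prefixes=[prefix],
    suffixes=[suffix],
    max_gen_len=64,
    temperature=0.2,
    top_p=0.9,
)
print(results[0])  # inspect the returned completion; exact structure per the example script
```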
  76. ----------
77. Models on Hugging Face | CyberSec Eval Paper | Llama Guard Paper

# Purple Llama

Purple Llama is an umbrella project that over time will bring together tools and evals to help the community build responsibly with open generative AI models. The initial release will include tools and evals for Cyber Security and Input/Output safeguards, but we plan to contribute more in the near future.

## Why purple?

Borrowing a [concept](https://www.youtube.com/watch?v=ab_Fdp6FVDI) from the cybersecurity world, we believe that to truly mitigate the challenges which generative AI presents, we need to take both attack (red team) and defensive (blue team) postures. Purple teaming, composed of both red and blue team responsibilities, is a collaborative approach to evaluating and mitigating potential risks. The same ethos applies to generative AI, and hence our investment in Purple Llama will be comprehensive.

Components within the Purple Llama project will be licensed permissively, enabling both research and commercial usage. We believe this is a major step towards enabling community collaboration and standardizing the development and usage of trust and safety tools for generative AI development. More concretely, evals and benchmarks are licensed under the MIT license while any models use the Llama 2 Community license. See the table below:

| **Component Type** | **Components** | **License** |
| :----------------- | :----------------------------------: | :--------------------------------------------------------------------------------------------: |
| Evals/Benchmarks | Cyber Security Eval (others to come) | MIT |
| Models | Llama Guard | [Llama 2 Community License](https://github.com/facebookresearch/PurpleLlama/blob/main/LICENSE) |
| Models | Llama Guard 2 | Llama 3 Community License |
| Safeguard | Code Shield | MIT |

## Evals & Benchmarks

### Cybersecurity

#### CyberSec Eval v1

We believe CyberSec Eval v1 was the first industry-wide set of cybersecurity safety evaluations for LLMs. These benchmarks are based on industry guidance and standards (e.g., CWE and MITRE ATT&CK) and built in collaboration with our security subject matter experts. We aim to provide tools that will help address some risks outlined in the [White House commitments on developing responsible AI](https://www.whitehouse.gov/briefing-room/statements-releases/2023/07/21/fact-sheet-biden-harris-administration-secures-voluntary-commitments-from-leading-artificial-intelligence-companies-to-manage-the-risks-posed-by-ai/), including:

* Metrics for quantifying LLM cybersecurity risks.
* Tools to evaluate the frequency of insecure code suggestions.
* Tools to evaluate LLMs to make it harder to generate malicious code or aid in carrying out cyberattacks.

We believe these tools will reduce the frequency of LLMs suggesting insecure AI-generated code and reduce their helpfulness to cyber adversaries. Our initial results show that there are meaningful cybersecurity risks for LLMs, both with recommending insecure code and for complying with malicious requests. See our [Cybersec Eval paper](https://ai.meta.com/research/publications/purple-llama-cyberseceval-a-benchmark-for-evaluating-the-cybersecurity-risks-of-large-language-models/) for more details.

#### CyberSec Eval 2

CyberSec Eval 2 expands on its predecessor by measuring an LLM’s propensity to abuse a code interpreter, offensive cybersecurity capabilities, and susceptibility to prompt injection.
You can read the paper [here](https://ai.meta.com/research/publications/cyberseceval-2-a-wide-ranging-cybersecurity-evaluation-suite-for-large-language-models/). You can also check out the 🤗 leaderboard [here](https://huggingface.co/spaces/facebook/CyberSecEval). ## System-Level Safeguards As we outlined in Llama 3’s [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/), we recommend that all inputs and outputs to the LLM be checked and filtered in accordance with content guidelines appropriate to the application. ### Llama Guard To support this, and empower the community, we released Llama Guard, an openly-available model that performs competitively on common open benchmarks and provides developers with a pretrained model to help defend against generating potentially risky outputs. As part of our ongoing commitment to open and transparent science, we also released our methodology and an extended discussion of model performance in our [Llama Guard paper](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/). We are happy to share an updated version, Meta Llama Guard 2. Llama Guard 2 was optimized to support the newly [announced](https://mlcommons.org/2024/04/mlc-aisafety-v0-5-poc/) policy published by MLCommons, expanding its coverage to a more comprehensive set of safety categories, out-of-the-box. It also comes with better classification performance than Llama Guard 1 and improved zero-shot and few shot adaptability. Ultimately, our vision is to enable developers to customize this model to support relevant use cases and to make it easier to adopt best practices and improve the open ecosystem. ### Code Shield Code Shield adds support for inference-time filtering of insecure code produced by LLMs. Code Shield offers mitigation of insecure code suggestions risk, code interpreter abuse prevention, and secure command execution. [CodeShield Example Notebook](https://github.com/meta-llama/PurpleLlama/blob/main/CodeShield/notebook/CodeShieldUsageDemo.ipynb). To get started and learn how to use Purple Llama components with Llama models, see the getting started guide [here](https://ai.meta.com/llama/get-started/). The guide provides information and resources to help you set up Llama, including how to access the model, hosting how-to information and integration guides. Additionally, you will find supplemental materials to further assist you while responsibly building with Llama. The guide will be updated as more Purple Llama components get released. ## FAQ For a running list of frequently asked questions, for not only Purple Llama components but also generally for Llama models, see the FAQ [here](https://ai.meta.com/llama/faq/). ## Join the Purple Llama community See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.
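As a concrete illustration of the input/output filtering recommended above, here is a minimal, hypothetical sketch of a guarded generation loop. The `generate`, `prompt_is_safe`, and `response_is_safe` callables are placeholders for whatever model call and safeguards you deploy (for example, Llama Guard for content-policy checks or Code Shield for insecure-code filtering); none of them are real APIs from these projects.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."

def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],           # your LLM call (placeholder)
    prompt_is_safe: Callable[[str], bool],    # e.g. a Llama Guard prompt check (placeholder)
    response_is_safe: Callable[[str], bool],  # e.g. Llama Guard / Code Shield output check (placeholder)
) -> str:
    """Check the input, generate, then check the output before returning it."""
    if not prompt_is_safe(prompt):
        return REFUSAL                        # block unsafe requests before they reach the model
    response = generate(prompt)
    if not response_is_safe(response):
        return REFUSAL                        # filter unsafe or insecure model output
    return response
```

The same wrapper shape applies whether the safeguard is a content classifier on prompts and responses or a static analyzer run over generated code.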
  78. ----------
  79. # Model Details Meta Llama Guard 2 is an 8B parameter Llama 3-based [1] LLM safeguard model. Similar to [Llama Guard](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/), it can be used for classifying content in both LLM inputs (prompt classification) and in LLM responses (response classification). It acts as an LLM – it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated. Below is a response classification example input and output for Llama Guard 2. In order to produce classifier scores, we look at the probability for the first token, and use that as the “unsafe” class probability. We can then apply score thresholding to make binary decisions. # Harm Taxonomy and Policy The model is trained to predict safety labels on the 11 categories shown below, based on the [MLCommons taxonomy](https://mlcommons.org/2024/04/mlc-aisafety-v0-5-poc/) of hazards. Harm categories S1: Violent Crimes S2: Non-Violent Crimes S3: Sex-Related Crimes S4: Child Sexual Exploitation S5: Specialized Advice S6: Privacy S7: Intellectual Property S8: Indiscriminate Weapons S9: Hate S10: Suicide & Self-Harm S11: Sexual Content **S1: Violent Crimes** Responses that enable, encourage, or endorse the commission of violent crimes, including: (1) unlawful violence toward people (ex: terrorism, genocide, murder, hate-motivated violent crimes, child abuse, assault, battery, kidnapping) and (2) unlawful violence toward animals (ex: animal abuse) **S2: Non-Violent Crimes** Responses that enable, encourage, or endorse the commission of non-violent crimes, including: (1) personal crimes (ex: labor trafficking, threats, intimidation, hate-motivated non-violent crimes), (2) financial crimes (ex: fraud, scams, money laundering), (3) property crimes (ex: theft, arson, vandalism), (4) drug crimes (ex: creating or using regulated substances), (5) weapons crimes (ex: producing unlicensed firearms), and (6) cyber crimes (ex: hacking) **S3: Sex-Related Crimes** Responses that enable, encourage, or endorse the commission of sex-related crimes, including: (1) sex trafficking, (2) sexual assault (ex: rape), (3) sexual harassment (physical, verbal, or visual), and (4) prostitution **S4: Child Sexual Exploitation** Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children **S5: Specialized Advice** Responses that contain specialized financial, medical, or legal advice, or that indicate dangerous activities or objects are safe **S6: Privacy** Responses that contain sensitive, nonpublic personal information that could undermine someone's physical, digital, or financial security **S7: Intellectual Property** Responses that may violate the intellectual property rights of any third party **S8: Indiscriminate Weapons** Responses that enable, encourage, or endorse the creation of indiscriminate weapons, including: (1) chemical weapons (ex: nerve gas), (2) biological weapons (ex: anthrax), (3) radiological weapons (ex: cobalt bombs), (4) nuclear weapons (ex: fission bombs), and (5) high-yield explosive weapons (ex: cluster munitions) **S9: Hate** Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics (i.e., race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity, and/or serious disease) **S10: Suicide & Self-Harm** Responses that 
enable, encourage, or endorse acts of intentional self-harm, including: (1) suicide, (2) self-injury (ex: cutting), and (3) disordered eating

**S11: Sexual Content** Responses that contain erotica

# Training and Evaluation Data

We start with the training set of Llama Guard (see [model card](https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard/MODEL_CARD.md#training-data)), and obtain labels on the Harm Taxonomy described above. To improve adaptability of the model to different prompts, we train on hard samples, which are obtained by taking an existing sample and prompting Llama2 70B to produce an alternate policy description that will flip the label of the given sample.

We report metrics for various models and APIs on our validation set, which is obtained from combining the validation set of Llama Guard v1 and held-out samples from the additional Llama 3 safety data. We compare performance on our internal test set, as well as on open datasets like [XSTest](https://github.com/paul-rottger/exaggerated-safety?tab=readme-ov-file#license), [OpenAI moderation](https://github.com/openai/moderation-api-release), and [BeaverTails](https://github.com/PKU-Alignment/beavertails).

We find that there is overlap between our training set and the BeaverTails-30k test split. Since both our internal test set and BeaverTails use prompts from Anthropic's [hh-rlhf dataset](https://github.com/anthropics/hh-rlhf) as a starting point for curating data, it is possible that different splits of the Anthropic data were used while creating the two datasets. Therefore, to prevent leakage of signal between our train set and the BeaverTails-30k test set, we create our own BeaverTails-30k splits based on the Anthropic train-test splits used for creating our internal sets.

*Note on evaluations*: As discussed in the Llama Guard [paper](https://arxiv.org/abs/2312.06674), comparing model performance is not straightforward, as each model is built on its own policy and is expected to perform better on an evaluation dataset with a policy aligned to the model. This highlights the need for industry standards. By aligning Llama Guard 2 with the Proof of Concept MLCommons taxonomy, we hope to drive adoption of industry standards like this and facilitate collaboration and transparency in the LLM safety and content evaluation space.

# Model Performance

We evaluate the performance of Llama Guard 2 and compare it with Llama Guard and popular content moderation APIs such as Azure, OpenAI Moderation, and Perspective. We use the token probability of the first output token (i.e. safe/unsafe) as the score for classification. For obtaining a binary classification decision from the score, we use a threshold of 0.5.

Llama Guard 2 improves over Llama Guard, and outperforms other approaches on our internal test set. Note that we manage to achieve strong performance while keeping a low false positive rate, as we know that over-moderation can impact user experience when building LLM applications.

| **Model** | **F1 ↑** | **AUPRC ↑** | **False Positive Rate ↓** |
|--------------------------|:------:|:---------:|:-----------------------:|
| Llama Guard\* | 0.665 | 0.854 | 0.027 |
| Llama Guard 2 | **0.915** | **0.974** | 0.040 |
| GPT4 | 0.796 | N/A | 0.151 |
| OpenAI Moderation API | 0.347 | 0.669 | 0.030 |
| Azure Content Safety API | 0.519 | N/A | 0.245 |
| Perspective API | 0.265 | 0.586 | 0.046 |

Table 1: Comparison of performance of various approaches measured on our internal test set.
*The performance of Llama Guard is lower on our new test set due to expansion of the number of harm categories from 6 to 11, which is not aligned to what Llama Guard was trained on.

| **Category** | **False Negative Rate\* ↓** | **False Positive Rate ↓** |
|------------------------|:--------------------------:|:-------------------------:|
| Violent Crimes | 0.042 | 0.002 |
| Privacy | 0.057 | 0.004 |
| Non-Violent Crimes | 0.082 | 0.009 |
| Intellectual Property | 0.099 | 0.004 |
| Hate | 0.190 | 0.005 |
| Specialized Advice | 0.192 | 0.009 |
| Sexual Content | 0.229 | 0.004 |
| Indiscriminate Weapons | 0.263 | 0.001 |
| Child Exploitation | 0.267 | 0.000 |
| Sex Crimes | 0.275 | 0.002 |
| Self-Harm | 0.277 | 0.002 |

Table 2: Category-wise breakdown of false negative rate and false positive rate for Llama Guard 2 on our internal benchmark for response classification with safety labels from the ML Commons taxonomy.

*The binary safe/unsafe label is used to compute categorical FNR by using the true categories. We do not penalize the model while computing FNR for cases where the model predicts the correct overall label but an incorrect categorical label.

We also report performance on OSS safety datasets, though we note that the policy used for assigning safety labels is not aligned with the policy used while training Llama Guard 2. Still, Llama Guard 2 provides a superior tradeoff between F1 score and False Positive Rate on the XSTest and OpenAI Moderation datasets, demonstrating good adaptability to other policies.

The BeaverTails dataset has a lower bar for a sample to be considered unsafe compared to Llama Guard 2's policy. The policy and training data of MDJudge [4] is more aligned with this dataset, and we see that it performs better on it, as expected (at the cost of a higher FPR). GPT-4 achieves high recall on all of the sets, but at the cost of a very high FPR (9-25%), which could hurt its ability to be used as a safeguard for practical applications.

| (F1 ↑ / False Positive Rate ↓) | False Refusals (XSTest) | OpenAI policy (OpenAI Mod) | BeaverTails policy (BeaverTails-30k) |
|--------------------------------|:-----------------------:|:--------------------------:|:------------------------------------:|
| Llama Guard | 0.737 / 0.079 | 0.599 / 0.035 | |
| Llama Guard 2 | 0.884 / 0.084 | 0.807 / 0.060 | 0.736 / 0.059 |
| MDJudge | 0.856 / 0.172 | 0.768 / 0.212 | 0.849 / 0.098 |
| GPT4 | 0.895 / 0.128 | 0.842 / 0.092 | 0.802 / 0.256 |
| OpenAI Mod API | 0.576 / 0.040 | 0.788 / 0.156 | 0.284 / 0.056 |

Table 3: Comparison of performance of various approaches measured on our internal test set for response classification. NOTE: The policy used for training Llama Guard does not align with those used for labeling these datasets. Still, Llama Guard 2 provides a superior tradeoff between F1 score and False Positive Rate across these datasets, demonstrating strong adaptability to other policies.

We hope to provide developers with a high-performing moderation solution for most use cases by aligning the Llama Guard 2 taxonomy with the MLCommons standard. But as outlined in our Responsible Use Guide, each use case requires specific safety considerations, and we encourage developers to tune Llama Guard 2 for their own use case to achieve better moderation for their custom policies. As an example of how Llama Guard 2's performance may change, we train on the BeaverTails training dataset and compare against MDJudge (which was trained on BeaverTails among others).
| **Model** | **F1 ↑** | **False Positive Rate ↓** |
|:---------------------------:|:--------:|:-------------------------:|
| Llama Guard 2 | 0.736 | 0.059 |
| MDJudge | 0.849 | 0.098 |
| Llama Guard 2 + BeaverTails | **0.852** | 0.101 |

Table 4: Comparison of performance on BeaverTails-30k.

# Limitations

There are some limitations associated with Llama Guard 2. First, Llama Guard 2 itself is an LLM fine-tuned on Llama 3. Thus, its performance (e.g., judgments that need common sense knowledge, multilingual capability, and policy coverage) might be limited by its (pre-)training data.

Second, Llama Guard 2 is finetuned for safety classification only (i.e. to generate "safe" or "unsafe"), and is not designed for chat use cases. However, since it is an LLM, it can still be prompted with any text to obtain a completion.

Lastly, as an LLM, Llama Guard 2 may be susceptible to adversarial attacks or prompt injection attacks that could bypass or alter its intended use. However, with the help of external components (e.g., KNN, perplexity filter), recent work (e.g., [3]) demonstrates that Llama Guard is able to detect harmful content reliably.

**Note on Llama Guard 2's policy** Llama Guard 2 supports 11 out of the 13 categories included in the [MLCommons AI Safety](https://mlcommons.org/working-groups/ai-safety/ai-safety/) taxonomy. The Election and Defamation categories are not addressed by Llama Guard 2 as moderating these harm categories requires access to up-to-date, factual information sources and the ability to determine the veracity of a particular output. To support the additional categories, we recommend using other solutions (e.g. Retrieval Augmented Generation) in tandem with Llama Guard 2 to evaluate information correctness.

# Citation

    @misc{metallamaguard2,
      author = {Llama Team},
      title = {Meta Llama Guard 2},
      howpublished = {\url{https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md}},
      year = {2024}
    }

# References

[1] [Llama 3 Model Card](https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md)

[2] [Llama Guard Model Card](https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard/MODEL_CARD.md)

[3] [RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content](https://arxiv.org/pdf/2403.13031.pdf)

[4] [MDJudge for Salad-Bench](https://huggingface.co/OpenSafetyLab/MD-Judge-v0.1)
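To make the scoring procedure described in this model card concrete (the probability of the first output token is used as the "unsafe" score and thresholded at 0.5), here is a minimal, illustrative sketch using the Hugging Face transformers API. The checkpoint id, the reliance on the tokenizer's chat template, and the single-sub-token lookup for "unsafe" are assumptions; defer to the official model card and recipes for the supported usage.

```python
# Illustrative sketch: first-token probability as an "unsafe" score, thresholded at 0.5.
# The model id and the availability of a suitable chat template are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-Guard-2-8B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

chat = [
    {"role": "user", "content": "How do I tie a bowline knot?"},
    {"role": "assistant", "content": "Make a small loop, pass the end up through it, around the standing line, and back down."},
]
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(input_ids).logits[0, -1]  # distribution over the first generated token
probs = torch.softmax(logits, dim=-1)

# Simplification: use the probability mass on the first sub-token of "unsafe" as the score.
unsafe_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]
unsafe_score = probs[unsafe_id].item()
print("unsafe score:", unsafe_score, "->", "unsafe" if unsafe_score > 0.5 else "safe")
```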
  80. ----------
  81. # Meta Llama Guard 2

Llama Guard 2 is a model that provides input and output guardrails for LLM deployments, based on the MLCommons policy.

# Download

In order to download the model weights and tokenizer, please visit the [Meta website](https://llama.meta.com/llama-downloads) and accept our License. Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download.

Pre-requisites: Make sure you have wget and md5sum installed. Then run the script: `./download.sh`.

Keep in mind that the links expire after 24 hours and after a certain number of downloads. If you start seeing errors such as `403: Forbidden`, you can always re-request a link.

# Quick Start

Since Llama Guard 2 is a fine-tuned Llama 3 model (see our [model card](MODEL_CARD.md) for more information), the same quick start steps outlined in our [README file](https://github.com/meta-llama/llama3/blob/main/README.md) for Llama 3 apply here. In addition, we provide examples using Llama Guard 2 in the [Llama recipes repository](https://github.com/facebookresearch/llama-recipes).

# Issues

Please report any software bug, or other problems with the models, through one of the following means:

- Reporting issues with the Llama Guard model: [github.com/meta-llama/PurpleLlama](https://github.com/meta-llama/PurpleLlama)
- Reporting issues with Llama in general: [github.com/meta-llama/llama3](https://github.com/meta-llama/llama3)
- Reporting risky content generated by the model: [developers.facebook.com/llama_output_feedback](https://developers.facebook.com/llama_output_feedback)
- Reporting bugs and security concerns: [facebook.com/whitehat/info](https://facebook.com/whitehat/info)

# License

Our model and weights are licensed for both researchers and commercial entities, upholding the principles of openness. Our mission is to empower individuals and industry through this opportunity while fostering an environment of discovery and ethical AI advancements. The same license as Llama 3 applies: see the [LICENSE](../LICENSE) file, as well as our accompanying [Acceptable Use Policy](USE_POLICY.md).

[Research Paper](https://ai.facebook.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/)
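As a companion to the Quick Start above, here is a minimal sketch of response classification with Llama Guard 2 through the Hugging Face `transformers` API. It assumes access to the `meta-llama/Meta-Llama-Guard-2-8B` checkpoint and that its tokenizer ships a moderation chat template; treat it as illustrative rather than the canonical recipe.

```python
# Minimal sketch: response classification with Llama Guard 2 via transformers.
# Assumes the meta-llama/Meta-Llama-Guard-2-8B checkpoint and a tokenizer-provided
# moderation chat template; adjust to the prompt format you actually deploy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-Guard-2-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

def moderate(chat):
    # Build the guard prompt from a list of {"role", "content"} turns.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=20, pad_token_id=0)
    # The completion is "safe", or "unsafe" followed by the violated category codes.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

print(moderate([
    {"role": "user", "content": "How do I tie a bowline knot?"},
    {"role": "assistant", "content": "Make a small loop, then pass the free end up through it..."},
]))
```

The generated verdict can then be parsed before deciding whether to return the assistant's reply to the user.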
  82. ----------
  83. Llama Guard is a 7B parameter [Llama 2](https://arxiv.org/abs/2307.09288)-based input-output safeguard model. It can be used for classifying content in both LLM inputs (prompt classification) and in LLM responses (response classification). It acts as an LLM: it generates text in its output that indicates whether a given prompt or response is safe/unsafe, and if unsafe based on a policy, it also lists the violating subcategories. Here is an example: In order to produce classifier scores, we look at the probability for the first token, and turn that into an “unsafe” class probability. Model users can then make binary decisions by applying a desired threshold to the probability scores. # Training and Evaluation We use a mix of prompts that come from the Anthropic [dataset](https://github.com/anthropics/hh-rlhf) and redteaming examples that we have collected in house, in a separate process from our production redteaming. In particular, we took the prompts only from the Anthropic dataset, and generated new responses from our in-house LLaMA models, using jailbreaking techniques to elicit violating responses. We then annotated Anthropic data (prompts & responses) in house, mapping labels according to the categories identified above. Overall we have ~13K training examples. ## Taxonomy of harms and Risk Guidelines As automated content risk mitigation relies on classifiers to make decisions about content in real time, a prerequisite to building these systems is to have the following components: - A **taxonomy** of risks that are of interest – these become the classes of a classifier. - A **risk guideline** that determines where we put the line between encouraged and discouraged outputs for each risk category in the taxonomy. Together with this model, we release an open taxonomy inspired by existing open taxonomies such as those employed by Google, Microsoft and OpenAI in the hope that it can be useful to the community. This taxonomy does not necessarily reflect Meta's own internal policies and is meant to demonstrate the value of our method to tune LLMs into classifiers that show high performance and high degrees of adaptability to different policies. ### The Llama Guard Safety Taxonomy & Risk Guidelines Below, we provide both the harm types themselves under this taxonomy and also examples of the specific kinds of content that would be considered harmful under each category: - **Violence & Hate** encompasses statements that encourage or could help people plan or engage in violence. Similarly, statements that advocate discrimination, contain slurs, or voice hateful sentiments against people based on their sensitive personal characteristics (ex: race, color, religion, national origin, sexual orientation, gender, gender identity, or disability) would also be considered inappropriate under this category. - **Sexual Content** encompasses statements encouraging someone (who could be underage) to engage in specific sex acts. Similarly, sexually explicit (i.e., erotic) statements would also be considered inappropriate under this category. - **Guns & Illegal Weapons** encompasses statements that specifically encourage, condone, or could help people plan or execute specific crimes involving the illegal acquisition, creation, or use of guns or illegal weapons (ex: explosives, biological agents, or chemical weapons), including instructions on how to create or use them. 
- **Regulated or Controlled Substances** encompasses statements that specifically encourage or could help people to illegally produce, transfer, or consume regulated or controlled substances like illegal drugs, tobacco, alcohol, or cannabis. - **Suicide & Self Harm** encompasses statements that encourage, condone, or enable people to harm themselves (ex: by providing instructions or information on methods of self-harm). When people express or imply an intent to harm themselves, any response other than one that provides links to appropriate health resources would be considered inappropriate (even if entirely supportive). - **Criminal Planning** encompasses miscellaneous statements that encourage, condone, or could help people plan or execute specific criminal activities, like arson, kidnapping, or theft. Items should only be considered harmful under this category when they could not be read as violating any of the other harm types above (ex: statements that encourage violence should be considered violating under Violence & Hate rather than this category). ## Evaluation results We compare the performance of the model against standard content moderation APIs in the industry, including [OpenAI](https://platform.openai.com/docs/guides/moderation/overview), [Azure Content Safety](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/harm-categories), [PerspectiveAPI](https://developers.perspectiveapi.com/s/about-the-api-attributes-and-languages?language=en_US) from Google on both public and in-house benchmarks. The public benchmarks include [ToxicChat](https://huggingface.co/datasets/lmsys/toxic-chat) and [OpenAI Moderation](https://github.com/openai/moderation-api-release). Note: comparisons are not exactly apples-to-apples due to mismatches in each taxonomy. The interested reader can find a more detailed discussion about this in our [paper](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/). | | Our Test Set (Prompt) | OpenAI Mod | ToxicChat | Our Test Set (Response) | | --------------- | --------------------- | ---------- | --------- | ----------------------- | | Llama Guard | **0.945** | 0.847 | **0.626** | **0.953** | | OpenAI API | 0.764 | **0.856** | 0.588 | 0.769 | | Perspective API | 0.728 | 0.787 | 0.532 | 0.699 |
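The model card above derives classifier scores from the probability of the first generated token. Below is a minimal sketch of turning that probability into an "unsafe" score and applying a threshold; the `model` and `tokenizer` objects, the guard prompt format, and the assumption that "safe"/"unsafe" begin with a single distinguishing token are all illustrative.

```python
# Minimal sketch: first-token "unsafe" probability for Llama Guard-style models.
# Assumes model/tokenizer are already loaded and guard_prompt follows the
# Llama Guard prompt format; the 0.5 threshold is illustrative.
import torch

def unsafe_probability(model, tokenizer, guard_prompt: str) -> float:
    inputs = tokenizer(guard_prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the first token to be generated
    probs = torch.softmax(logits, dim=-1)
    unsafe_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]
    safe_id = tokenizer.encode("safe", add_special_tokens=False)[0]
    # Renormalize over the two class tokens to obtain a binary score.
    return (probs[unsafe_id] / (probs[unsafe_id] + probs[safe_id])).item()

# score = unsafe_probability(model, tokenizer, prompt)
# flag = score > 0.5  # pick a threshold that matches your tolerance for false positives
```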
  84. ----------
  85. Hamel’s Blog - Optimizing latency Subscribe for updates Summary Below is a summary of my findings: 🏁 mlc is the fastest . This is so fast that I’m skeptical and am now motivated to measure quality (if I have time). When checking the outputs manually, they didn’t seem that different than other approaches. ❤️ CTranslate2 is my favorite tool, which is among the fastest but is also the easiest to use . The documentation is the best out of all of the solutions I tried. Furthermore, I think that the ergonomics are excellent for the models that they support. Unlike vLLM, CTranslate doesn’t seem to support distributed inference just yet. 🛠️ is really fast, but CTranslate can be much faster. On other hand, vLLM supports distributed inference , which is something you will need for larger models. vLLM might be the sweet spot for serving very large models. 😐 Text Generation Inference is an ok option (but nowhere near as fast as ) if you want to deploy HuggingFace LLMs in a standard way . TGI has some nice features like telemetry baked in ( via OpenTelemetry ) and integration with the HF ecosystem like inference endpoints . One thing to note that as of 7/28/2023, the license for TGI was changed to be more restrictive that may interfere with certain commercial uses . I am personally not a fan of the license. Rough Benchmarks This study focuses on various approaches to optimizing latency . Specifically, I want to know which tools are the most effective at optimizing latency for open source LLMs. In order to focus on latency, I hold the following variables constant: batch size of n = 1 for all prediction requests (holding throughput constant). All experiments were conducted on a Nvidia A6000 GPU, unless otherwise noted. Max output tokens were always set to 200 All numbers are calculated as an average over a fixed set of 9 prompts. The model used is meta-llama/Llama-2-7b-hf on the HuggingFace Hub In addition to batch size of and using a A6000 GPU (unless noted otherwise), I also made sure I warmed up the model by sending an initial inference request before measuring latency. Llama-v2-7b benchmark: batch size = 1, max output tokens = 200 avg tok/sec avg time (seconds) avg output token count platform options gpu float16 quantization 44.8 4.5 200.0 int8 quantization 62.6 3.2 HF Hosted Inference Endpoint A10G 30.4 6.6 202.0 HuggingFace Transformers (no server) 24.6 7.5 181.4 nf4 4bit quantization bitsandbytes 24.3 7.6 21.1 9.5 quantized w/ GPTQ 23.6 8.8 quantized w/ bitsandbytes 1.9 103.0 q4f16 117.1 1.3 153.9 text-generation-webui exllama 77.0 1.7 134.0 vllm A100 (on Modal Labs) 41.5 3.4 143.1 46.4 178.0 In some cases I did not use an b/c the platform didn’t have that particular GPU available. You can ignore these rows if you like, but I still think it is valuable information. I had access to a A6000, so I just used what I had. I noticed that the output of the LLM was quite different (less tokens) when using . I am not sure if I did something wrong here, or it changes the behavior of the LLM. Furthermore, the goal was not to be super precise on these benchmarks but rather to get a general sense of how things work and how they might compare to each other out of the box. Some of the tools above are inference servers which perform logging, tracing etc. in addition to optimizing models which effect latency. The idea is to see where there are significant differences between tools. 
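For reference, here is a minimal sketch of the measurement loop implied by the setup above: batch size 1, a fixed output-token budget, one warm-up request, and tokens/second averaged over the prompt set. `predict` and `count_tokens` are placeholders for whichever backend and tokenizer are being benchmarked.

```python
# Sketch of the latency measurement: warm up once, then time each prompt
# individually (batch size 1) and record tokens/second.
import time

def benchmark(predict, count_tokens, prompts, warmup_prompt="Hello"):
    predict(warmup_prompt)  # warm up the model before timing anything
    rows = []
    for prompt in prompts:
        start = time.perf_counter()
        answer = predict(prompt)            # the backend should cap output at 200 tokens
        elapsed = time.perf_counter() - start
        n_tokens = count_tokens(answer)
        rows.append({"time": elapsed, "tokens": n_tokens, "tok_per_sec": n_tokens / elapsed})
    avg_tok_per_sec = sum(r["tok_per_sec"] for r in rows) / len(rows)
    return rows, avg_tok_per_sec
```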
I discussed this more Background One capability you need to be successful with open source LLMs is the ability to serve models efficiently. There are two categories of tools for model inference: Inference servers: these help with providing a web server that can provide a REST/grpc or other interface to interact with your model as a service. These inference servers usually have parameters to help you make trade-offs between throughput and latency . Additionally, some inference servers come with additional features like telemetry, model versioning and more. You can learn more about this topic the serving section of these notes. For LLMs, popular inference servers are the Text Generation Inference (TGI) Model Optimization : These modify your model to make them faster for inference. Examples include quantization Paged Attention Exllama and more. It is common to use both Inference servers techniques in conjunction. Some inference servers like even help you apply optimization techniques. Notes On Tools Other than benchmarking, an important goal of this study was to understand how to use different platforms & tools. Start with compiling the model as shown in these docs After installing MLC , you can compile meta-llama/Llama-2-7b-chat-hf like so: python3 -m mlc_llm.build \ --hf-path meta-llama/Llama-2-7b-chat-hf --target cuda --quantization q4f16_1 The arguments for the compliation are documented . This puts the model in the ./dist/ folder with the name Llama-2-7b-chat-hf-q4f16_1 You can use their python client to interact with the compiled model: from mlc_chat import ChatModule, ChatConfig cfg = ChatConfig(max_gen_len cm ChatModule(model "Llama-2-7b-chat-hf-q4f16_1" , chat_config cfg) output cm.generate(prompt prompt) You can see the full benchmarking code Warning I wasn’t able to get to run correctly with the supplied python client so I am using the chat variant ( Llama-2-7b-chat-hf ) as a proxy. I asked the kind folks who work on the mlc project and they said the python client is currently designed for chat, such that they have this system prompt that is hard coded for llama models: conv.system = ("[INST] <<SYS>>\n\nYou are a helpful, respectful and honest assistant. " "Always answer as helpfully as possible, while being safe. " "Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, " "or illegal content. " "Please ensure that your responses are socially unbiased and positive in nature.\n\n" "If a question does not make any sense, or is not factually coherent, explain why instead " "of answering something not correct. " "If you don't know the answer to a question, please don't share false " "information.\n<</SYS>>\n\n "); If you want to fix this, you must edit mlc-chat-config.json , changing conv_template LM These docs say more about the config.json The config file is located in ./dist/<model-name>/params/mlc-chat-config.json . For example: > cat ./dist/Llama-2-7b-hf-q4f16_1/params/mlc-chat-config.json "model_lib": "Llama-2-7b-hf-q4f16_1", "local_id": "Llama-2-7b-hf-q4f16_1", "conv_template": "llama-2", "temperature": 0.7, "repetition_penalty": 1.0, "top_p": 0.95, "mean_gen_len": 128, "max_gen_len": 512, "shift_fill_factor": 0.3, "tokenizer_files": [ "tokenizer.json", "tokenizer.model" "model_category": "llama", "model_name": "Llama-2-7b-hf" is an optimization tool that can make models ridiculously fast. h/t to Anton . The documentation for CTranslate2 contains specific instructions for llama models To optimize llama v2 , we first need to quantize the model. 
This can be done like so: ct2-transformers-converter --model int8 --output_dir llama-2-7b-ct2 --force refers to the HuggingFace repo for this model . The benchmarking code is as follows (can also be found ): time ctranslate2 sys sys.path.append( '../common/' questions pandas as pd generator ctranslate2.Generator( "llama-2-7b-ct2" , device "cuda" tokenizer transformers.AutoTokenizer.from_pretrained( "meta-llama/Llama-2-7b-hf" def predict(prompt: str "Generate text give a prompt" start time.perf_counter() tokens tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt)) results generator.generate_batch([tokens], sampling_topk , max_length , include_prompt_in_result False results[ 0 ].sequences_ids[ ] tokenizer.decode(tokens) request_time return 'tok_count' len (tokens), 'time' : request_time, 'question' : prompt, 'answer' : output, 'note' 'CTranslate2 int8 quantization' if __name__ == '__main__' counter responses [] for q in questions: >= : responses.append(predict(q)) += df pd.DataFrame(responses) df.to_csv( 'bench-ctranslate-int8.csv' , index Text Generation Inference (TGI) License Restrictions The license for TGI was recently changed away from Apache 2.0 to be more restrictive. Be careful when using TGI in commercial applications. Text generation inference which is often referred to as “TGI” was easy to use without any optimization. You can run it like this: “start_server.sh” #!/bin/bash [ -z " $HUGGING_FACE_HUB_TOKEN then echo "HUGGING_FACE_HUB_TOKEN is not set. Please set it before running this script." exit fi "TheBloke/Llama-2-7B-GPTQ" volume $PWD /data docker run --gpus all -e HUGGING_FACE_HUB_TOKEN= GPTQ_BITS=4 GPTQ_GROUPSIZE=128 --shm-size 5g -p 8081:80 -v $volume :/data ghcr.io/huggingface/text-generation-inference --max-best-of $@ We can then run the server with this command: bash start_server.sh --model-id Help You can see all the options for the TGI container with the help flag like so: run ghcr.io/huggingface/text-generation-inference --help less Quantization was very difficult to get working. There is a —quantize flag with accepts gptq . The approach makes inference much slower, which others have reported To make work for llama v2 models requires a bunch of work, you have to install the text-generation-server which can take a while and is very brittle to get right. I had to step through the Makefile carefully. After that you have to download the weights with: text-generation-server download-weights meta-llama/Llama-2-7b-hf You can run the following command to perform the quantization (the last argument is the destination directory where the weights are stored). quantize data/quantized/ However, this step is not needed for the most popular models, as someone will likely already have quantized and uploaded them to the Hub. Pre-Quantized Models Alternatively, you can use a pre-quantized model that has been uploaded to the Hub. TheBloke/Llama-2-7B-GPTQ is a good example of one. To get this to work, you have to be careful to set the GPTQ_BITS GPTQ_GROUPSIZE environment variables to match the config. For example This config necessitates setting These are already set in shown above. This PR will eventually fix that. To use the with TGI, I can use the same bash script with the following arguments: --quantize Comparison Without TGI Server When I first drafted this study I got the following response on twitter: Based on your code ( https://t.co/hSYaPTsEaK ) it seems like you measure the full HTTP request, which is like comparing trees to an apple. 
— Philipp Schmid ( @_philschmid July 29, 2023 Phillip certainly has a point! I am indeed testing both! I’m looking for big differences in tools here, and since some inference servers have optimization tools, and some optimization tools do not have an inference server I cannot do a true apples to apples comparison. However, I think its still useful to try different things as advertised to see what is possible, and also take note of really significant gaps in latency between tools. Therefore, I ran the following tests to perform the similar optimizations as TGI, but without the server to see what happened: HuggingFace Transformers I was able to get slightly better performance without the TGI server as predicted by Phillip, but it did not account for the the massive gap between some tools (which is exactly the kind of thing I was looking for). To benchmark quantization with bitsandbytes, I followed this blog post and wrote this benchmarking code . I quantized the model by loading it like this: model_id AutoTokenizer.from_pretrained(model_id) nf4_config BitsAndBytesConfig( load_in_4bit bnb_4bit_quant_type "nf4" bnb_4bit_compute_dtype torch.bfloat16 model_nf4 AutoModelForCausalLM.from_pretrained(model_id, quantization_config nf4_config) Unlike TGI, I was able to get bitsandbytes to work properly here, but just like TGI it didn’t speed anything up for me with respect to inference latency. As reflected in the benchmark table, I got nearly the same results with transformers without any optimizations I also quantized the model using without an inference server to compare against TGI. The code for that is The results were so bad ~ 5 tok/sec that I decided not to put this in the table, because it seemed quite off to me. Text Generation WebUI Aman let me know about text-generation-web-ui , and also these instructions for quickly experimenting with ExLlama ggml . I wasn’t able to get the variant to work properly, unfortunately. If you are really serious about using exllama, I recommend trying to use it without the text generation UI and look at the repo, specifically at test_benchmark_inference.py . (I didn’t have time for this, but if I was going to use exllama for anything serious I would go this route). From the root of the repo, you can run the following commands to start an inference server optimized with download-model.py TheBloke/Llama-2-7B-GPTQ server.py --listen --extensions openai --loader exllama_hf TheBloke_Llama-2-7B-GPTQ After the server was started, I used to conduct the benchmark. Overall, I didn’t like this particular piece of software much. It’s bit bloated because its trying to do too many things at once (An inference server, Web UIs, and other optimizations). That being said, the documentation is good and it is easy to use. I don’t think there is any particular reason to use this unless you want an end-to-end solution that also comes with a web user-interface (which many people want!). only works with CUDA 11.8, which I configured using this approach . After configuring CUDA and installing the right version of PyTorch, you need to install the bleeding edge from git: pip install -U git+https://github.com/vllm-project/vllm.git A good recipe to use for vLLM can be find on these Modal docs . Surprisingly, I had much lower latency when running on a local vs. a hosted A100 on Modal Labs. It’s possible that I did something wrong here. Currently, is the fastest solution for when you need distributed inference (i.e. when your model doesn’t fit on a single GPU). 
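For orientation, here is a minimal sketch of offline generation with vLLM's Python API, in the spirit of the benchmarking code referenced below; the model id and sampling values are illustrative rather than the exact settings used in these benchmarks.

```python
# Minimal sketch of offline (serverless) generation with vLLM.
# Model id and sampling values are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=200)

outputs = llm.generate(["What is the capital of California?"], sampling_params)
for request_output in outputs:
    completion = request_output.outputs[0]
    print(len(completion.token_ids), "tokens:", completion.text)
```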
offers a server , but I benchmarked the model locally using their tools instead. The code for the benchmarking can be found here SamplingParams, LLM #from https://modal.com/docs/guide/ex/vllm_inference # Coding questions "Implement a Python function to compute the Fibonacci numbers." "Write a Rust function that performs binary exponentiation." "What are the differences between Javascript and Python?" # Literature "Write a story in the style of James Joyce about a trip to the Australian outback in 2083, to see robots in the beautiful desert." "Who does Harry turn into a balloon?" "Write a tale about a time-traveling historian who's determined to witness the most significant events in human history." # Math "What is the product of 9 and 8?" "If a train travels 120 kilometers in 2 hours, what is its average speed?" "Think through this step by step. If the sequence a_n is defined by a_1 = 3, a_2 = 5, and a_n = a_(n-1) + a_(n-2) for n > 2, find a_6." MODEL_DIR "/home/ubuntu/hamel-drive/vllm-models" download_model_to_folder(): huggingface_hub snapshot_download os snapshot_download( local_dir MODEL_DIR, token os.environ[ "HUGGING_FACE_HUB_TOKEN" LLM(MODEL_DIR) generate(question, llm, note None response : question, : note} sampling_params SamplingParams( temperature 1.0 top_p max_tokens result llm.generate(question, sampling_params) result: response[ (output.outputs[ ].token_ids) output.outputs[ ].text llm download_model_to_folder() generate(question q, llm llm, note 'vLLM' responses.append(response) 'bench-vllm.csv' HuggingFace Inference Endpoint I deployed an inference endpoint on HuggingFace for , on a Nvidia A10G GPU. I didn’t try to turn on any optimizations like quantization and wanted to see what the default performance would be like. The documentation for these interfaces can be found . There is also a python client Their documentation says they are using TGI under the hood. However, my latency was significantly faster on their hosted inference platform than using TGI locally. This could be due to the fact that I used a with them but only a locally. It’s worth looking into why this discrepancy exists further. The code for this benchmark can be found Footnotes It is common to explore the inference vs throughput frontier when conducting inference benchmarks. I did not do this, since I was most interested in latency. Here is an example of how to conduct inference benchmarks that consider both throughput and latency. ↩︎ For Llama v2 models , you must be careful to use the models ending in -hf as those are the ones that are compatible with the transformers library. The Modular Inference Engine is another example of an inference server that also applies optimization techniques. At the time of this writing, this is proprietary technology, but its worth keeping an eye on this in the future. Edit this page
  86. ----------
  87. Achieve 23x LLM Inference Throughput & Reduce p50 Latency Anyscale Preview is now available! Login today to get free $50 compute credit 🚀 Home Blog Detail How continuous batching enables 23x throughput in LLM inference while reducing p50 latency By Cade Daniel Chen Shen Eric Liang Richard Liaw June 22, 2023 In this blog, we’ll cover the basics of large language model (LLM) inference and highlight inefficiencies in traditional batching policies. We’ll introduce continuous batching and discuss benchmark results for existing batching systems such as HuggingFace’s text-generation-inference and vLLM. By leveraging vLLM, users can achieve 23x LLM inference throughput while reducing p50 latency. Update June 2024: Anyscale Endpoints (Anyscale's LLM API Offering) and Private Endpoints (self-hosted LLMs) are now available as part of the Anyscale Platform. Click to get started on the Anyscale platform. Due to the large GPU memory footprint and compute cost of LLMs , serving dominates the compute cost for most real world applications. ML engineers often treat LLMs like "black boxes" that can only be optimized with internal changes such as quantization and custom CUDA kernels. However, this is not entirely the case. Because LLMs iteratively generate their output, and because LLM inference is often memory and not compute bound, there are surprising system-level batching optimizations that make 10x or more differences in real-world workloads. One recent such proposed optimization is , also known as dynamic batching , or batching with iteration-level scheduling . We wanted to see how this optimization performs. We will get into details below, including how we simulate a production workload, but to summarize our findings: Up to 23x throughput improvement using continuous batching and continuous batching-specific memory optimizations (using ). 8x throughput over naive batching by using continuous batching (both on Ray Serve Hugging Face’s text-generation-inference 4x throughput over naive batching by using an optimized model implementation ( NVIDIA’s FasterTransformer You can try out continuous batching today: see this example to run vLLM on Ray Serve The remainder of this blog is structured as follows: We’ll cover the basics of how LLM inference works and highlight inefficiencies in traditional request-based dynamic batching policies. We’ll introduce continuous batching and how it answers many of the inefficiencies of request-based dynamic batching. We then discuss our benchmarks and the implications this has on how to serve LLM models cost-effectively. Link The basics of LLM inference There is a lot to know about LLM inference, and we refer users to Efficient Inference on a Single GPU Optimization story: Bloom inference for more detail. However, at a high level, LLM inference is pretty straightforward. For each request: You start with a sequence of tokens (called the "prefix" or "prompt"). The LLM produces a sequence of completion tokens, stopping only after producing a stop token or reaching a maximum sequence length. This is an iterative process. You get one additional completion token for each new forward pass of the model. For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", “a”, "m", "e", "n", "t", "o"]. 
This example simplifies things a little bit because in actuality tokens do not map 1:1 to ASCII characters (a popular token encoding technique is Byte-Pair Encoding which is beyond the scope of this blog post), but the iterative nature of generation is the same regardless of how you tokenize your sequences. Simplified LLM inference. This toy example shows a hypothetical model which supports a maximum sequence length of 8 tokens (T1, T2, …, T8). Starting from the prompt tokens (yellow), the iterative process generates a single token at a time (blue). Once the model generates an end-of-sequence token (red), the generation loop stops. This example shows a batch of only one input sequence, so the batch size is 1. Now that we understand the simplicity of the iterative process, let’s dive deeper with some things you may not know about LLM inference: The initial ingestion (“prefill”) of the prompt "What is the capital of California: " takes about as much time as the generation of each subsequent token. This is because the prefill phase pre-computes some inputs of the attention mechanism that remain constant over the lifetime of the generation. This prefill phase efficiently uses the GPU’s parallel compute because these inputs can be computed independently of each other. LLM inference is memory-IO bound , not compute bound. In other words, it currently takes more time to load 1MB of data to the GPU’s compute cores than it does for those compute cores to perform LLM computations on 1MB of data. This means that LLM inference throughput is largely determined by how large a batch you can fit into high-bandwidth GPU memory . See this page in the NVIDIA docs for more details. The amount of GPU memory consumed scales with the base model size + the length of the token sequence. In Numbers every LLM developer should know , it’s estimated that a 13B parameter model consumes nearly 1MB of state for each token in a sequence. On a higher-end A100 GPU with 40GB RAM, back-of-the-envelope math suggests that since 14 GB are left after storing the 26GB of model parameters, ~14k tokens can be held in memory at once. This may seem high but is actually quite limiting; if we limit our sequence lengths to 512, we can process at most ~28 sequences in a batch. The problem is worse for higher sequence lengths; a sequence length of 2048 means our batch size is limited to 7 sequences. Note that this is an upper bound since it doesn’t leave room for storing intermediate computations. What this all means is that there is substantial “room on the table” so to speak if you can optimize memory usage. This is why approaches such as model quantization strategies such as are potentially so powerful; if you could halve the memory usage by moving from 16-bit to 8-bit representations, you could double the space available for larger batch sizes. However, not all strategies require modifications to the model weights. For example, FlashAttention found significant throughput improvements by reorganizing the attention computation to require less memory-IO. Continuous batching is another memory optimization technique which does not require modification of the model. We next explain how naive batching works (and is inefficient), and how continuous batching increases the memory-efficiency of LLM generation. LLM batching explained GPUs are massively-parallel compute architectures, with compute rates (measured in floating-point operations per second, or flops) in the teraflop ( ) or even petaflop ( H100 ) range. 
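As a back-of-the-envelope check of the memory arithmetic above (roughly 1 MB of per-token state for a 13B-parameter model and ~26 GB of fp16 parameters on a 40 GB A100), the figures can be reproduced in a few lines; the numbers are the article's approximations, not measurements.

```python
# Rough KV-cache capacity math for a 13B model on a 40 GB A100, using the
# approximations above (~1 MB of state per token, ~26 GB of fp16 weights).
gpu_memory_gb = 40
param_memory_gb = 26                       # 13B params * 2 bytes (fp16)
kv_state_mb_per_token = 1.0

free_gb = gpu_memory_gb - param_memory_gb                   # ~14 GB left for KV state
tokens_in_flight = free_gb * 1024 / kv_state_mb_per_token   # ~14k tokens, an upper bound

for seq_len in (512, 2048):
    print(f"seq_len={seq_len}: at most {int(tokens_in_flight // seq_len)} sequences per batch")
```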
Despite these staggering amounts of compute, LLMs struggle to achieve saturation because so much of the chip’s memory bandwidth is spent loading model parameters. Batching is one way to improve the situation; instead of loading new model parameters each time you have an input sequence, you can load the model parameters once and then use them to process many input sequences. This more efficiently uses the chip’s memory bandwidth, leading to higher compute utilization, higher throughput, and cheaper LLM inference. Naive batching / static batching We call this traditional approach to batching static batching , because the size of the batch remains constant until the inference is complete. Here’s an illustration of static batching in context of LLM inference: Completing four sequences using static batching. On the first iteration (left), each sequence generates one token (blue) from the prompt tokens (yellow). After several iterations (right), the completed sequences each have different sizes because each emits their end-of-sequence-token (red) at different iterations. Even though sequence 3 finished after two iterations, static batching means that the GPU will be underutilized until the last sequence in the batch finishes generation (in this example, sequence 2 after six iterations). Unlike traditional deep learning models, batching for LLMs can be tricky due to the iterative nature of their inference. Intuitively, this is because requests can "finish" earlier in a batch, but it is tricky to release their resources and add new requests to the batch that may be at different completion states. This means that as the GPU is underutilized as generation lengths of different sequences in a batch differ from the largest generation length of the batch. In the figure on the right above, this is illustrated by the white squares after end-of-sequence tokens for sequences 1, 3, and 4. How often does static batching under-utilize the GPU? It depends on the generation lengths of sequences in a batch. For example, one could use LLM inference to emit a single token as a classification task (there are better ways to do this but let’s use this as an example). In this case, every output sequence is the same size (1 token). If the input sequences are also the same size (say, 512 tokens), then each static batch will achieve the best possible GPU utilization. On the other hand, a LLM-powered chatbot service cannot assume fixed-length input sequences, nor assume fixed-length output sequences. Proprietary models offer maximum context lengths in excess of 8K tokens at the time of writing. With static batching, variance in generation output could cause massive underutilization of GPUs. It’s no wonder OpenAI CEO Sam Altman described the compute costs as eye-watering Without restrictive assumptions on user input and model output, unoptimized production-grade LLM systems simply can’t serve traffic without underutilizing GPUs and incurring unnecessarily high costs. We need to optimize how we serve LLMs for their power to be broadly accessible. Continuous batching The industry recognized the inefficiency and came up with a better approach. Orca: A Distributed Serving System for Transformer-Based Generative Models is a paper presented in OSDI ‘22 which is the first to our knowledge to tackle this problem. Instead of waiting until every sequence in a batch has completed generation, Orca implements iteration-level scheduling where the batch size is determined per iteration. 
The result is that once a sequence in a batch has completed generation, a new sequence can be inserted in its place, yielding higher GPU utilization than static batching. Completing seven sequences using continuous batching. Left shows the batch after a single iteration, right shows the batch after several iterations. Once a sequence emits an end-of-sequence token, we insert a new sequence in its place (i.e. sequences S5, S6, and S7). This achieves higher GPU utilization since the GPU does not wait for all sequences to complete before starting a new one. Reality is a bit more complicated than this simplified model: since the prefill phase takes compute and has a different computational pattern than generation, it cannot be easily batched with the generation of tokens. Continuous batching frameworks currently manage this via hyperparameter: waiting_served_ratio , or the ratio of requests waiting for prefill to those waiting end-of-sequence tokens. Speaking of frameworks, Hugging Face has productionized continuous batching in their Rust- and Python-based text-generation-inference LLM inference server . We use their implementation to understand the performance characteristics of continuous batching in our benchmarks below. : Continuous batching, dynamic batching, and iteration-level scheduling are all close enough in meaning that any one of them can be used to describe the batching algorithm. We chose to use continuous batching. Dynamic batching is fitting but can be confused with request-level batching, where an LLM inference server uses a static batch whose size is chosen when the current batch has completely finished generation. We feel that iteration-level scheduling is descriptive of the scheduling mechanism but not the process as a whole. PagedAttention and vLLM For this blog post, we want to showcase the differences between static batching and continuous batching. It turns out that continuous batching can unlock memory optimizations that are not possible with static batching by improving upon Orca’s design. PagedAttention is a new attention mechanism implemented in ( ). It takes inspiration from traditional OS concepts such as paging virtual memory . They allow the KV cache (what is computed in the “prefill” phase, discussed above) to be non-contiguous by allocating memory in fixed-size “pages”, or blocks. The attention mechanism can then be rewritten to operate on block-aligned inputs, allowing attention to be performed on non-contiguous memory ranges. This means that buffer allocation can happen just-in-time instead of ahead-of-time: when starting a new generation, the framework does not need to allocate a contiguous buffer of size maximum_context_length. Each iteration, the scheduler can decide if it needs more room for a particular generation, and allocate on the fly without any degradation to PagedAttention’s performance. This doesn’t guarantee perfect utilization of memory ( their blog says the wastage is now limited to under 4%, only in the last block), but it significantly improves upon wastage from ahead-of-time allocation schemes used widely by the industry today. Altogether, PagedAttention + vLLM enable massive memory savings as most sequences will not consume the entire context window. These memory savings translate directly into a higher batch size, which means higher throughput and cheaper serving. We include vLLM in our benchmarks below. Benchmarking setup We’ll discuss our experimental setup then dive into the results of our benchmarks. 
Experiments Our goal is to see how continuous batching performs versus static batching on a simulated real-world live-inference workload. Fundamentally, we care about cost. We break this down into throughput and latency since cost is directly downstream of how efficiently you can serve at a given latency. Benchmark goal Measurement Measure throughput Time-to-process a queue of 1000 requests, each with 512 input tokens and generation length sampled from an exponential distribution. Measure latency Request latencies for 100 requests, with varying input lengths, output lengths, and arrival times at a fixed average rate. We’ll discuss the datasets and other details of the experiments in their respective results section. Hardware/model We benchmark throughput and latency on a single NVIDIA A100 GPU provided by Anyscale . Our A100 has 40GB of GPU RAM. We selected Meta’s OPT-13B model because each framework under test had a readily-available integration with this model. We selected the 13B variant because it fits into our GPU without requiring tensor parallelism, yet is still large enough to present memory efficiency challenges. We opt not to use tensor parallelism, where each transformer block is split over multiple GPUs, to keep our experiments simple, although both static batching and continuous batching work with tensor parallelism. Frameworks We test two static batching frameworks and three continuous batching frameworks. Our static batching frameworks are: Hugging Face’s Pipelines This is the simplest inference solution. It provides static batching with an easy-to-use API that works with any model and supports more tasks than simple text-generation. We use this as our baseline. This is a library which provides optimized implementations of various transformer models. It currently only provides static batching (the Triton inference server provides request-level dynamic batching, but not continuous batching yet). This provides us with an idea of how far an extremely optimized implementation of our model can get us with static batching – it provides a more competitive baseline than the relatively unoptimized OPT-13B implementation available on Hugging Face Hub Our continuous batching frameworks are: This is the inference server Hugging Face uses to power their LLM live-inference APIs. It implements continuous batching. Continuous batching on Ray Serve leverages Ray’s serverless capabilities to provide seamless autoscaling, high-availability, and support for complex DAGs. We wanted to understand how continuous batching works, so we re-implemented text-generation-inference’s core continuous batching logic in pure-Python on Ray Serve. As you will see in our results, our implementation achieves the same performance as text-generation-inference, which validates our understanding. This is an open-source project recently released by folks at UC Berkeley ( ). It builds upon Orca’s continuous batching design by taking full control of dynamic memory allocations, allowing it to significantly reduce different forms of GPU memory fragmentation. We test this framework because it shows the impact of further optimizations made possible by iteration-level scheduling and continuous batching. Benchmarking results: Throughput Based on our understanding of static batching, we expect continuous batching to perform significantly better when there is higher variance in sequence lengths in each batch. 
To show this, we run our throughput benchmark four times for each framework, each time on a dataset with higher variance in sequence lengths. To do this, we create a dataset containing 1000 sequences each with 512 input tokens. We configure our model to always emit a per-sequence generation length by ignoring the end-of-sequence token and configuring max_tokens. We then generate 1000 generation lengths, one for each request, sampled from an exponential distribution with mean=128 tokens. We use an exponential distribution as it is a good approximation of the generation lengths that one may encounter while serving an application like ChatGPT. To vary the variance of each run, we select only samples from the exponential distribution that are less than or equal to 32, 128, 512, and 1536. The total output sequence length is then, at most, 512+32=544, 512+128=640, 512+512=1024, and 512+1536=2048 (the maximum sequence length of our model). We then use a simple asyncio Python benchmarking script to submit HTTP requests to our model server. The benchmarking script submits all requests in burst fashion, so that the compute is saturated. The results are as follows: Throughput in tokens per second of each framework as variance in sequence length increases. As expected, the static batchers and naive continuous batchers perform approximately identically for lower-variance generation lengths. However as the variance increases, naive static batching’s performance plummets to 81 token/s. FasterTransformers improves upon naive static batching significantly, nearly keeping up with the naive continuous batchers until generation length limit of 1536. Continuous batching on Ray Serve and text-generation-inference achieves about the same performance, which is what we expect since they use the same batching algorithm. What is most impressive here is vLLM. For each dataset, vLLM more than doubles performance compared to naive continuous batching. We have not analyzed what optimization contributes the most to vLLM performance the most, but we suspect vLLM’s ability to reserve space dynamically instead of ahead-of-time allows vLLM to dramatically increase the batch size. We plot these performance results relative to naive static batching: Our throughput benchmark results presented as improvement multiples over naive static batching, log scale. It’s important to note how impressive even FasterTransformer’s 4x improvement is; we’re very interested in benchmarking FasterTransformers plus continuous batching when NVIDIA implements it. However, continuous batching is clearly a significant improvement over static batching even with an optimized model. The performance gap becomes gigantic when you include further memory optimization enabled by continuous batching and iteration-level scheduling as vLLM does. Benchmarking results: Latency Live-inference endpoints often face latency-throughput tradeoffs that must be optimized based on user needs. We benchmark latency on a realistic workload and measure how the cumulative distribution function of latencies changes with each framework. Similar to the throughput benchmark, we configure the model to always emit a specified amount of tokens specified per-request. We prepare 100 randomly-generated prompts by sampling lengths from a uniform distribution between 1 token and 512 tokens. We sample 100 output lengths from a capped exponential distribution with mean=128 and maximum size of 1536. 
These numbers were chosen because they are reasonably realistic and allow the generation to use up the full context-length of our model (512+1536=2048). Instead of submitting all requests at the same time as done in the throughput benchmark, we delay each request by a predetermined number of seconds. We sample a Poisson distribution to determine how long each request waits after the previously submitted request. The Poisson distribution is parameterized by λ, the expected rate, which in our case is how many queries per second (QPS) hit our model endpoint. We measure latencies at both QPS=1 and QPS=4 to see how the latency distribution changes as load changes. Median generation request latency for each framework, under average load of 1 QPS and 4 QPS. Continuous batching systems improve median latency. We see that while improving throughput, continuous batching systems also improve median latency. This is because continuous batching systems allow for new requests to be added to an existing batch if there is room, each iteration. But how about other percentiles? In fact, we find that they improve latency across all percentiles: Cumulative distribution function of generation request latencies for each framework with QPS=1. Static batchers and continuous batchers have distinct curve shapes caused by the presence of iteration-level batch scheduling in continuous batchers. All continuous batchers perform approximately equally under this load; FasterTransformers performs noticeably better than static batching on a naive model implementation. The reason why continuous batching improves latency at all percentiles is the same as why it improves latency at p50: new requests can be added regardless of how far into generation other sequences in the batch are. However, like static batching, continuous batching is still limited by how much space is available on the GPU. As your serving system becomes saturated with requests, meaning a higher on-average batch size, there are less opportunities to inject new requests immediately when they are received. We can see this as we increase the average QPS to 4: Cumulative distribution function of generation request latencies for each framework with QPS=4. Compared to QPS=1, FasterTransformer’s distribution of latencies becomes more similar to static batching on a naive model. Both Ray Serve and text-generation-inference’s continuous batching implementations perform similarly, but noticeably worse than vLLM. We observe that FasterTransformer becomes more similar to naive static batching, and that both text-generation-inference and Ray Serve’s implementation of continuous batching are on their way to look like FasterTransformer’s curve with QPS=1. That is, as the systems become saturated there are less opportunities to inject new requests immediately, so request latency goes up. This lines up with the vLLM curve – it remains mostly unchanged between QPS=1 and QPS=4. This is because due to its advanced memory optimizations, it has a higher maximum batch size. Anecdotally, we observe that vLLM becomes saturated around QPS=8 with a throughput near 1900 token/s. To compare these numbers apples-to-apples to the other serving systems requires more experimentation; however we have shown that continuous batching significantly improves over static batching by 1) reducing latency by injecting new requests immediately when possible, and 2) enable advanced memory optimizations (in vLLM’s case) that increase the QPS that the serving system can handle before becoming saturated. 
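To make the latency setup concrete, below is a minimal sketch of the request-arrival pattern described above: per-request output lengths drawn from a capped exponential distribution, and inter-arrival gaps drawn from an exponential distribution so that arrivals form a Poisson process at a chosen QPS. The `send_request` coroutine is a placeholder for whatever HTTP client talks to the model server.

```python
# Sketch of the latency benchmark client: Poisson arrivals at a target QPS,
# output lengths sampled from a capped exponential distribution.
# send_request(prompt, max_tokens) stands in for the real HTTP call.
import asyncio
import random

async def run_latency_benchmark(send_request, prompts, qps=1.0, mean_len=128, cap=1536):
    latencies, tasks = [], []
    loop = asyncio.get_running_loop()

    async def timed(prompt, max_tokens):
        start = loop.time()
        await send_request(prompt, max_tokens)
        latencies.append(loop.time() - start)

    for prompt in prompts:
        max_tokens = min(int(random.expovariate(1.0 / mean_len)) + 1, cap)  # capped exponential length
        tasks.append(asyncio.create_task(timed(prompt, max_tokens)))
        await asyncio.sleep(random.expovariate(qps))  # exponential gaps => Poisson arrivals
    await asyncio.gather(*tasks)
    return sorted(latencies)  # e.g. p50 = latencies[len(latencies) // 2]
```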
Conclusion

LLMs present some amazing capabilities, and we believe their impact is still mostly undiscovered. We have shared how a new serving technique, continuous batching, works and how it outperforms static batching. It improves throughput by wasting fewer opportunities to schedule new requests, and improves latency by being capable of immediately injecting new requests into the compute stream. We are excited to see what people can do with continuous batching, and where the industry goes from here.

Try out continuous batching for yourself

We have a vLLM + Ray Serve example that allows you to try out continuous batching. We are integrating continuous batching systems into Aviary, a webapp that allows you to compare the outputs of different LLMs in parallel, and will release it within the week.

Acknowledgements. We'd like to thank the following people for assisting in benchmarking and/or reviewing our results. Anyscale: Stephanie Wang, Antoni Baum, Edward Oakes, and Amog Kamsetty; UC Berkeley: Zhuohan Li and Woosuk Kwon.

Get involved with Ray

The code used for the experiments in this blog post is here. To connect with the Ray community, join the Ray Slack or ask questions on the Discuss forum. If you are interested in hosting LLMs, check out our managed Ray offering. If you are interested in learning more about Ray, see ray.io and docs.ray.io. See our earlier blog series on solving Generative AI infrastructure and using LangChain with Ray.

Ray Summit 2023: If you are interested in learning much more about how Ray can be used to build performant and scalable LLM applications and fine-tune/train/serve LLMs on Ray, join Ray Summit on September 18-20th! We have a set of great keynote speakers, including John Schulman from OpenAI and Aidan Gomez from Cohere, community and tech talks about Ray, as well as practical training focused on LLMs.
  88. ----------
  89. GitHub - huggingface/peft: 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning. (huggingface.co/docs/peft, Apache-2.0 license)

🤗 PEFT

State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods

Fine-tuning large pretrained models is often prohibitively costly due to their scale. Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to various downstream applications by only fine-tuning a small number of (extra) model parameters instead of all the model's parameters. This significantly decreases the computational and storage costs. Recent state-of-the-art PEFT techniques achieve performance comparable to fully fine-tuned models.

PEFT is integrated with Transformers for easy model training and inference, Diffusers for conveniently managing different adapters, and Accelerate for distributed training and inference for really big models.

Tip: Visit the PEFT organization to read about the PEFT methods implemented in the library and to see notebooks demonstrating how to apply these methods to a variety of downstream tasks. Click the "Watch repos" button on the organization page to be notified of newly implemented methods and notebooks!

Check the PEFT Adapters API Reference section for a list of supported PEFT methods, and read the Adapters, Soft prompts, and IA3 conceptual guides to learn more about how these methods work.

Quickstart

Install PEFT from pip:

```
pip install peft
```

Prepare a model for training with a PEFT method such as LoRA by wrapping the base model and PEFT configuration with get_peft_model. For the bigscience/mt0-large model, you're only training 0.19% of the parameters!

```python
from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType

model_name_or_path = "bigscience/mt0-large"
tokenizer_name_or_path = "bigscience/mt0-large"

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)  # wrap the base model with the LoRA adapter
model.print_trainable_parameters()
# "trainable params: 2359296 || all params: 1231940608 || trainable%: 0.19151053100118282"
```

To load a PEFT model for inference:

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

model.eval()
inputs = tokenizer("Preheat the oven to 350 degrees and place the cookie dough", return_tensors="pt")

outputs = model.generate(input_ids=inputs["input_ids"], max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
# "Preheat the oven to 350 degrees and place the cookie dough in the center of the oven. In a large bowl, combine the flour, baking powder, baking soda, salt, and cinnamon. In a separate bowl, combine the egg yolks, sugar, and vanilla."
```
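As a follow-on to the quickstart, the sketch below saves only the LoRA adapter trained above and later reattaches it to the frozen base model; the output path is illustrative.

```python
# Persist just the adapter weights (a few MB) and reload them onto the base model.
# `model` is the get_peft_model(...) output from the quickstart; the path is illustrative.
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

model.save_pretrained("mt0-large-lora")  # writes adapter weights + adapter_config.json only

base_model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-large")
restored = PeftModel.from_pretrained(base_model, "mt0-large-lora")
```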
Why you should use PEFT There are many benefits of using PEFT but the main one is the huge savings in compute and storage, making PEFT applicable to many different use cases. High performance on consumer hardware Consider the memory requirements for training the following models on the ought/raft/twitter_complaints dataset with an A100 80GB GPU with more than 64GB of CPU RAM. Model Full Finetuning PEFT-LoRA PyTorch PEFT-LoRA DeepSpeed with CPU Offloading bigscience/T0_3B (3B params) 47.14GB GPU / 2.96GB CPU 14.4GB GPU / 2.96GB CPU 9.8GB GPU / 17.8GB CPU bigscience/mt0-xxl (12B params) OOM GPU 56GB GPU / 3GB CPU 22GB GPU / 52GB CPU bigscience/bloomz-7b1 (7B params) 32GB GPU / 3.8GB CPU 18.1GB GPU / 35GB CPU With LoRA you can fully finetune a 12B parameter model that would've otherwise run out of memory on the 80GB GPU, and comfortably fit and train a 3B parameter model. When you look at the 3B parameter model's performance, it is comparable to a fully finetuned model at a fraction of the GPU memory. Submission Name Accuracy Human baseline (crowdsourced) 0.897 Flan-T5 0.892 lora-t0-3b 0.863 The bigscience/T0_3B model performance isn't optimized in the table above. You can squeeze even more performance out of it by playing around with the input instruction templates, LoRA hyperparameters, and other training related hyperparameters. The final checkpoint size of this model is just 19MB compared to 11GB of the full bigscience/T0_3B model. Learn more about the advantages of finetuning with PEFT in this blog post Quantization is another method for reducing the memory requirements of a model by representing the data in a lower precision. It can be combined with PEFT methods to make it even easier to train and load LLMs for inference. Learn how to finetune with QLoRA and the TRL library on a 16GB GPU in the Finetune LLMs on your own consumer hardware using tools from PyTorch and Hugging Face ecosystem blog post. Learn how to finetune a openai/whisper-large-v2 model for multilingual automatic speech recognition with LoRA and 8-bit quantization in this (see this instead for an example of streaming a dataset). Save compute and storage PEFT can help you save storage by avoiding full finetuning of models on each of downstream task or dataset. In many cases, you're only finetuning a very small fraction of a model's parameters and each checkpoint is only a few MBs in size (instead of GBs). These smaller PEFT adapters demonstrate performance comparable to a fully finetuned model. If you have many datasets, you can save a lot of storage with a PEFT model and not have to worry about catastrophic forgetting or overfitting the backbone or base model. PEFT integrations PEFT is widely supported across the Hugging Face ecosystem because of the massive efficiency it brings to training and inference. Diffusers The iterative diffusion process consumes a lot of memory which can make it difficult to train. PEFT can help reduce the memory requirements and reduce the storage size of the final model checkpoint. For example, consider the memory required for training a Stable Diffusion model with LoRA on an A100 80GB GPU with more than 64GB of CPU RAM. The final model checkpoint size is only 8.8MB! 
Model | Full Finetuning | PEFT-LoRA | PEFT-LoRA with Gradient Checkpointing
CompVis/stable-diffusion-v1-4 | 27.5GB GPU / 3.97GB CPU | 15.5GB GPU / 3.84GB CPU | 8.12GB GPU / 3.77GB CPU

Take a look at the examples/lora_dreambooth/train_dreambooth.py training script to try training your own Stable Diffusion model with LoRA, and play around with the smangrul/peft-lora-sd-dreambooth Space, which is running on a T4 instance. Learn more about the PEFT integration in Diffusers in this tutorial.
Accelerate
Accelerate is a library for distributed training and inference on various training setups and hardware (GPUs, TPUs, Apple Silicon, etc.). PEFT models work with Accelerate out of the box, making it really convenient to train really large models or use them for inference on consumer hardware with limited resources.
TRL
PEFT can also be applied to training LLMs with RLHF components such as the ranker and policy. Get started by reading:
Fine-tune a Mistral-7b model with Direct Preference Optimization with PEFT and the TRL library to learn more about the Direct Preference Optimization (DPO) method and how to apply it to an LLM.
Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU with PEFT and the TRL library, and then try out the gpt2-sentiment_peft.ipynb notebook to optimize GPT2 to generate positive movie reviews.
StackLLaMA: A hands-on guide to train LLaMA with RLHF with PEFT, and then try out the stack_llama/scripts for supervised finetuning, reward modeling, and RL finetuning.
Model support
Use this Space or check out the documentation to find which models officially support a PEFT method out of the box. Even if you don't see a model listed below, you can manually configure the model config to enable PEFT for a model. Read the New transformers architecture guide to learn how.
Contribute
If you would like to contribute to PEFT, please check out our contribution guide.
Citing 🤗 PEFT
To use 🤗 PEFT in your publication, please cite it by using the following BibTeX entry.

@Misc{peft,
  title =        {PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods},
  author =       {Sourab Mangrulkar and Sylvain Gugger and Lysandre Debut and Younes Belkada and Sayak Paul and Benjamin Bossan},
  howpublished = {\url{https://github.com/huggingface/peft}},
  year =         {2022}
}
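As noted above, quantization can be combined with PEFT. The following is only a hedged sketch of that combination, not code from the README: the model name (facebook/opt-350m), the 4-bit quantization settings, and the LoRA hyperparameters are all illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Illustrative 4-bit quantization settings (assumed values).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Any causal LM works in principle; a small model is used here only as an example.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # housekeeping for training on a quantized base

peft_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, task_type="CAUSAL_LM")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable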
  90. ----------
  91. llama-recipes/docs/LLM_finetuning.md at main · meta-llama/llama-recipes · GitHub
  92. ----------
  93. llama-recipes/recipes/finetuning/datasets/README.md at main · meta-llama/llama-recipes · GitHub
  94. ----------
  95. Efficient Fine-Tuning with LoRA: A Guide to Optimal Parameter Selection for Large Language Models | Databricks Blog
With the rapid advancement of neural network-based techniques and Large Language Model (LLM) research, businesses are increasingly interested in AI applications for value generation. They employ various machine learning approaches, both generative and non-generative, to address text-related challenges such as classification, summarization, sequence-to-sequence tasks, and controlled text generation. Organizations can opt for third-party APIs, but fine-tuning models with proprietary data offers domain-specific and pertinent results, enabling cost-effective and independent solutions deployable across different environments in a secure manner.
Ensuring efficient resource utilization and cost-effectiveness is crucial when choosing a strategy for fine-tuning. This blog explores arguably the most popular and effective variant of such parameter-efficient methods, Low Rank Adaptation (LoRA), with a particular emphasis on QLoRA (an even more efficient variant of LoRA). The approach here will be to take an open large language model and fine-tune it to generate fictitious product descriptions when prompted with a product name and a category. The model chosen for this exercise is OpenLLaMA-3b-v2, an open large language model with a permissive license (Apache 2.0), and the dataset chosen is Red Dot Design Award Product Descriptions, both of which can be downloaded from the HuggingFace Hub at the links provided.
Fine-Tuning, LoRA and QLoRA
In the realm of language models, fine-tuning an existing language model to perform a specific task on specific data is a common practice. This involves adding a task-specific head, if necessary, and updating the weights of the neural network through backpropagation during the training process. It is important to note the distinction between this fine-tuning process and training from scratch. In the latter scenario, the model's weights are randomly initialized, while in fine-tuning, the weights are already optimized to a certain extent during the pre-training phase. The decision of which weights to optimize or update, and which ones to keep frozen, depends on the chosen technique.
Full fine-tuning involves optimizing or training all layers of the neural network. While this approach typically yields the best results, it is also the most resource-intensive and time-consuming. Fortunately, there exist parameter-efficient approaches for fine-tuning that have proven to be effective. Although most such approaches yield lower performance than full fine-tuning, Low Rank Adaptation (LoRA) has bucked this trend by even outperforming full fine-tuning in some cases, as a consequence of avoiding catastrophic forgetting (a phenomenon which occurs when the knowledge of the pretrained model is lost during the fine-tuning process).
LoRA is an improved fine-tuning method where, instead of fine-tuning all the weights that constitute the weight matrix of the pre-trained large language model, two smaller matrices that approximate this larger matrix are fine-tuned (a minimal sketch of this idea follows below). These matrices constitute the LoRA adapter. This fine-tuned adapter is then loaded into the pretrained model and used for inference. QLoRA is an even more memory-efficient version of LoRA where the pretrained model is loaded into GPU memory as quantized 4-bit weights (compared to 8-bits in the case of LoRA), while preserving similar effectiveness to LoRA.
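To make the low-rank idea concrete, here is a minimal, hypothetical sketch of a LoRA-style linear layer in PyTorch. It is illustrative only (not the blog's code, and not how the PEFT library implements LoRA internally), and the layer size, rank r and scaling alpha are assumed values: the frozen weight W is left untouched while only the two small matrices A and B are trained, so the effective weight becomes W + (alpha/r)·BA.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update (illustrative only)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # r x d_in
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))        # d_out x r, zero-init
        self.scaling = alpha / r

    def forward(self, x):
        # frozen path + low-rank update: W x + scaling * B (A x)
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable parameters vs ~16.8 million frozen ones in the base layer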
Probing this method, comparing the two methods when necessary, and figuring out the best combination of QLoRA hyperparameters to achieve optimal performance with the quickest training time will be the focus here. LoRA is implemented in the Hugging Face Parameter-Efficient Fine-Tuning (PEFT) library, offering ease of use, and QLoRA can be leveraged by using PEFT together with the bitsandbytes quantization library. The Hugging Face Transformer Reinforcement Learning (TRL) library offers a convenient trainer for supervised finetuning with seamless integration for LoRA. These three libraries will provide the necessary tools to finetune the chosen pretrained model to generate coherent and convincing product descriptions once prompted with an instruction indicating the desired attributes.
Prepping the data for supervised fine-tuning
To probe the effectiveness of QLoRA for fine-tuning a model for instruction following, it is essential to transform the data into a format suited for supervised fine-tuning. Supervised fine-tuning, in essence, further trains a pretrained model to generate text conditioned on a provided prompt. It is supervised in that the model is finetuned on a dataset that has prompt-response pairs formatted in a consistent manner.
An example observation from our chosen dataset from the Hugging Face hub looks as follows:
product: "Biamp Rack Products"
category: "Digital Audio Processors"
description: "High recognition value, uniform aesthetics and practical scalability – this has been impressively achieved with the Biamp brand language …"
text: "Product Name: Biamp Rack Products; Product Category: Digital Audio Processors; Product Description: High recognition value, uniform aesthetics and practical scalability – this has been impressively achieved with the Biamp brand language …"
As useful as this dataset is, it is not well formatted for fine-tuning a language model for instruction following in the manner described above. The following code snippet loads the dataset from the Hugging Face hub into memory, transforms the necessary fields into a consistently formatted string representing the prompt, and inserts the response (i.e. the description) immediately afterwards. This format is known as the 'Alpaca format' in large language model research circles, as it was the format used to finetune the original LLaMA model from Meta to produce the Alpaca model, one of the first widely distributed instruction-following large language models (although not licensed for commercial use).

import pandas as pd
from datasets import load_dataset, Dataset

#Load the dataset from the HuggingFace Hub
rd_ds = load_dataset("xiyuez/red-dot-design-award-product-description")

#Convert to pandas dataframe for convenient processing
rd_df = pd.DataFrame(rd_ds['train'])

#Combine the two attributes into an instruction string
rd_df['instruction'] = 'Create a detailed description for the following product: ' + rd_df['product'] + ', belonging to category: ' + rd_df['category']

rd_df = rd_df[['instruction', 'description']]

#Get a 5000 sample subset for fine-tuning purposes
rd_df_sample = rd_df.sample(n=5000, random_state=42)

#Define template and format data into the template for supervised fine-tuning
template = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:\n"""

rd_df_sample['prompt'] = rd_df_sample["instruction"].apply(lambda x: template.format(x))
rd_df_sample.rename(columns={'description': 'response'}, inplace=True)
rd_df_sample['response'] = rd_df_sample['response'] + "\n### End"
rd_df_sample['text'] = rd_df_sample["prompt"] + rd_df_sample["response"]
rd_df_sample.drop(columns=['prompt', 'response'], inplace=True)

The resulting prompts are then loaded into a Hugging Face dataset for supervised finetuning.
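The conversion into a Hugging Face Dataset is not shown on the page (it lives in the blog's accompanying repo), so the following is only a hedged sketch of what it might look like; the 80/20 split ratio, the seed, and the dataset variable name are assumptions, not the blog's settings.

from datasets import Dataset

# Build a Hugging Face Dataset from the formatted dataframe prepared above
# and carve out a held-out split for evaluation (split ratio is an assumed value).
dataset = Dataset.from_pandas(rd_df_sample, preserve_index=False)
dataset = dataset.train_test_split(test_size=0.2, seed=42)

print(dataset)                             # DatasetDict with 'train' and 'test' splits
print(dataset['train'][0]['text'][:200])   # peek at one formatted example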
Each such prompt has the following format:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Create a detailed description for the following product: Beseye Pro, belonging to category: Cloud-Based Home Security Camera

### Response:
Beseye Pro combines intelligent home monitoring with decorative art. The camera, whose form is reminiscent of a water drop, is secured in the mounting with a neodymium magnet and can be rotated by 360 degrees. This allows it to be easily positioned in the desired direction. The camera also houses modern technologies, such as infrared LEDs, cloud-based intelligent video analyses and SSL encryption.
### End

To facilitate quick experimentation, each fine-tuning exercise will be done on a 5000 observation subset of this data.
Testing model performance before fine-tuning
Before any fine-tuning, it's a good idea to check how the model performs without any fine-tuning to get a baseline for pre-trained model performance. The model can be loaded in 8-bit as follows and prompted with the format specified in the model card on Hugging Face:

from transformers import LlamaTokenizer, LlamaForCausalLM

model_path = 'openlm-research/open_llama_3b_v2'
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, load_in_8bit=True, device_map='auto',
)

#Pass in a prompt and infer with the model
prompt = 'Q: Create a detailed description for the following product: Corelogic Smooth Mouse, belonging to category: Optical Mouse\nA:'
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generation_output = model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(generation_output[0]))

The output obtained is not quite what we want:

Q: Create a detailed description for the following product: Corelogic Smooth Mouse, belonging to category: Optical Mouse
A: The Corelogic Smooth Mouse is a wireless optical mouse that has a 1000 dpi resolution. It has a 2.4 GHz wireless connection and a 12-month warranty.
Q: What is the price of the Corelogic Smooth Mouse?
A: The Corelogic Smooth Mouse is priced at $29.99.
Q: What is the weight of the Corelogic Smooth Mouse?
A: The Corelogic Smooth Mouse weighs pounds.
Q: What are the dimensions of the Corelogic Smooth Mouse?
A: The Corelogic Smooth Mouse has a dimension

The first part of the result is actually satisfactory, but the rest of it is more of a rambling mess. Similarly, if the model is prompted with the input text in the 'Alpaca format' as discussed before, the output is expected to be just as sub-optimal:

prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Create a detailed description for the following product: Corelogic Smooth Mouse, belonging to category: Optical Mouse

### Response:"""
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generation_output = model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(generation_output[0]))

And sure enough, it is:

Corelogic Smooth Mouse is a mouse that is designed to be used by people with disabilities.
It is a wireless mouse that is designed to be used by people with a wireless mouse that is designed to be used by people with a wireless mouse that is designed to be used by people with a wireless mouse that is designed to be used by people with a wireless mouse that is designed to be used by people with a wireless mouse that is designed to be used by people with a wireless mouse that is designed to be used by

The model performs what it was trained to do: it predicts the next most probable token. The point of supervised fine-tuning in this context is to generate the desired text in a controllable manner. Please note that in the subsequent experiments, while QLoRA leverages a model loaded in 4-bit with the weights frozen, the inference process to examine output quality is done once the model has been loaded in 8-bit as shown above, for consistency.
The Turnable Knobs
When using PEFT to train a model with LoRA or QLoRA (note that, as mentioned before, the primary difference between the two is that in the latter, the pretrained models are frozen in 4-bit during the fine-tuning process), the hyperparameters of the low rank adaptation process can be defined in a LoRA config as shown below:

...
#If only targeting attention blocks of the model
target_modules = ["q_proj", "v_proj"]

#If targeting all linear layers
target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj', 'lm_head']

lora_config = LoraConfig(
    r=16,
    target_modules=target_modules,
    lora_alpha=...,  # value not preserved in this copy of the post
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

Two of these hyperparameters, r and target_modules, are empirically shown to affect adaptation quality significantly and will be the focus of the tests that follow. The other hyperparameters are kept constant at the values indicated above for simplicity.
r represents the rank of the low rank matrices learned during the finetuning process. As this value is increased, the number of parameters that need to be updated during the low-rank adaptation increases. Intuitively, a lower r may lead to a quicker, less computationally intensive training process, but may affect the quality of the model thus produced. However, increasing r beyond a certain value may not yield any discernible increase in the quality of model output. How the value of r affects adaptation (fine-tuning) quality will be put to the test shortly.
When fine-tuning with LoRA, it is possible to target specific modules in the model architecture. The adaptation process will target these modules and apply the update matrices to them. Similar to the situation with "r," targeting more modules during LoRA adaptation results in increased training time and greater demand for compute resources. Thus, it is a common practice to only target the attention blocks of the transformer. However, recent work, as shown in the QLoRA paper by Dettmers et al., suggests that targeting all linear layers results in better adaptation quality. This will be explored here as well.
Names of the linear layers of the model can be conveniently appended to a list with the following code snippet:

import re

model_modules = str(model.modules)
pattern = r'\((\w+)\): Linear'
linear_layer_names = re.findall(pattern, model_modules)

names = []
# Collect the names of the Linear layers
for name in linear_layer_names:
    names.append(name)
target_modules = list(set(names))

Tuning the finetuning with LoRA
The developer experience of fine-tuning large language models in general has improved dramatically over the past year or so. The latest high-level abstraction from Hugging Face is the SFTTrainer class in the TRL library.
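The SFTTrainer call further down references a training_args object whose definition is not shown in this copy of the post. The following is only a hedged sketch of what such a transformers.TrainingArguments configuration might look like; apart from the 3 epochs mentioned later in the post, every value (output directory, batch size, learning rate, and so on) is an illustrative assumption rather than the blog's actual setting.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="openllama-3b-qlora-product-descriptions",  # assumed path
    num_train_epochs=3,
    per_device_train_batch_size=4,    # assumed
    gradient_accumulation_steps=4,    # assumed
    learning_rate=2e-4,               # assumed
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=True,
    report_to="mlflow",               # pairs with the MLflow tracking discussed below
)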
To perform QLoRA, all that is needed is the following:
1. Load the model to GPU memory in 4-bit (bitsandbytes enables this process).
2. Define the LoRA configuration as discussed above.
3. Define the train and test splits of the prepped instruction-following data into Hugging Face Dataset objects.
4. Define training arguments. These include the number of epochs, batch size and other training hyperparameters, which will be kept constant during this exercise.
5. Pass these arguments into an instance of SFTTrainer.
These steps are clearly indicated in the source file in the repository associated with this blog. The actual training logic is abstracted away nicely as follows:

trainer = SFTTrainer(
    model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    dataset_text_field="text",
    max_seq_length=256,
    args=training_args,
)

# Initiate the training process
with mlflow.start_run(run_name='run_name_of_choice'):
    trainer.train()

If MLflow autologging is enabled in the Databricks workspace, which is highly recommended, all the training parameters and metrics are automatically tracked and logged with the MLflow tracking server. This functionality is invaluable in monitoring long-running training tasks. Needless to say, the fine-tuning process is performed using a compute cluster (in this case, a single node with a single A100 GPU) created using the latest Databricks Machine Learning runtime with GPU support.
Hyperparameter Combination #1: QLoRA with r=8 and targeting “q_proj”, “v_proj”
The first combination of QLoRA hyperparameters attempted is r=8, targeting only the attention blocks, namely “q_proj” and “v_proj”, for adaptation. The following code snippet gives the number of trainable parameters:

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

These choices result in 2,662,400 parameters being updated during the fine-tuning process (~2.6 million), out of a total of ~3.2 billion parameters the model consists of. This is less than 0.1% of the model parameters. The entire finetuning process on a single Nvidia A100 with 80GB of GPU memory for 3 epochs only takes roughly 12 minutes. The GPU utilization metrics can be conveniently viewed in the metrics tab of the cluster configuration. At the end of the training process, the fine-tuned model is obtained by loading the adapter weights onto the pre-trained model as follows:

peft_model = PeftModel.from_pretrained(model, adapter_location)

This model can now be used for inference like any other model.
Qualitative Evaluation
A couple of example prompt-response pairs are listed below.
Prompt (passed to the model in the Alpaca format, not shown for conciseness here): Create a detailed description for the following product: Corelogic Smooth Mouse, belonging to category: Optical Mouse
Response: The Corelogic Smooth Mouse is a wireless optical mouse with a smooth surface. The mouse is equipped with a DPI sensor and Hz polling rate. The mouse is available in black and white.
Prompt: Create a detailed description for the following product: Hoover Lightspeed, belonging to category: Cordless Vacuum Cleaner
Response: The Hoover Lightspeed is a cordless vacuum cleaner with a lithium-ion battery. The battery is charged via a USB cable. The vacuum cleaner brush is a turbo brush. The brush is suitable for cleaning carpets and hard floors. The turbo brush is for hard floors. The vacuum cleaner is equipped with a dust container that can be emptied via a dust container.
The model has clearly been adapted for generating more consistent descriptions.
However, the response to the first prompt about the optical mouse is quite short, and the following phrase, "The vacuum cleaner is equipped with a dust container that can be emptied via a dust container", is logically flawed.
Hyperparameter Combination #2: QLoRA with r=16 and targeting “q_proj”, “v_proj”
Surely, things can be improved here. It is worth exploring increasing the rank of the low rank matrices learned during adaptation, i.e. doubling the value of r to 16 and keeping all else the same. This doubles the number of trainable parameters to 5,324,800 (~5.3 million). The quality of output, however, remains unchanged for the same exact prompts.
Prompt: Create a detailed description for the following product: Corelogic Smooth Mouse, belonging to category: Optical Mouse
Response: The Corelogic Smooth Mouse is a high-quality optical mouse with a smooth surface. The mouse has a Hz polling rate. The mouse
Prompt: Create a detailed description for the following product: Hoover Lightspeed, belonging to category: Cordless Vacuum Cleaner
Response: a cordless vacuum cleaner with a lithium-ion battery. The battery is charged via a USB cable. The vacuum cleaner has a motorized brush. The brush is used to clean hard floors and the motorized brush is used to clean carpets. The vacuum cleaner has a dust container that can be emptied via a dust container. The vacuum cleaner has a LED display that shows the remaining battery capacity.
The same lack of detail, and the same logical flaws where details are available, persist. If this fine-tuned model is used for product description generation in a real-world scenario, this is not acceptable output.
Hyperparameter Combination #3: QLoRA with r=8 and targeting all linear layers
Given that doubling r does not seemingly result in any perceivable increase in output quality, it is worth changing the other important knob, i.e. targeting all linear layers instead of just the attention blocks. Here, the LoRA hyperparameters are r=8 and target_modules are 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj' and 'lm_head'. This increases the number of parameters updated to 12,994,560 and increases the training time to roughly 15.5 minutes. Prompting the model with the same prompts yields the following:
Prompt: Create a detailed description for the following product: Corelogic Smooth Mouse, belonging to category: Optical Mouse
Response: The Corelogic Smooth Mouse is a wireless optical mouse with a smooth surface. The mouse has a Hz polling rate. The mouse has a USB receiver. The mouse
Prompt: Create a detailed description for the following product: Hoover Lightspeed, belonging to category: Cordless Vacuum Cleaner
Response: The Hoover Lightspeed cordless vacuum cleaner is a high-performance device that is easy to use. The device has a lithium-ion battery that provides up to 40 minutes of runtime. The device has a high-performance motor that provides a suction power of up to watts. The device has a dusting brush and a crevice tool. The device has a LED display that shows the remaining battery charge.
Now it is possible to see a somewhat longer, coherent description of the fictitious optical mouse, and there are no logical flaws in the description of the vacuum cleaner. The product descriptions are not only logical, but relevant. Just as a reminder, these relatively high-quality results are obtained by fine-tuning less than 1% of the model's weights with a total dataset of 5000 such prompt-description pairs formatted in a consistent manner.
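The qualitative samples above come from prompting the adapted model with the same Alpaca-style template used for the 8-bit baseline; the blog does not show that inference step on the page, so the following is only a hedged sketch of it (greedy decoding, the max_new_tokens value, and the device handling are assumptions).

# Wrap the product/category pair in the Alpaca-style template and generate with the adapted model.
eval_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Create a detailed description for the following product: Corelogic Smooth Mouse, belonging to category: Optical Mouse

### Response:"""

input_ids = tokenizer(eval_prompt, return_tensors="pt").input_ids.to(peft_model.device)
output = peft_model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))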
Hyperparameter Combination #4: LoRA with r=8 and targeting all linear transformer layers
It is also worth exploring whether the quality of output from the model improves if the pretrained model is frozen in 8-bit instead of 4-bit. In other words, replicating the exact finetuning process using LoRA instead of QLoRA. Here, the LoRA hyperparameters are kept the same as before, in the newfound optimal configuration, i.e. r=8 and targeting all linear transformer layers during the adaptation process. The results for the two prompts used throughout the article are as given below:
Prompt: Create a detailed description for the following product: Corelogic Smooth Mouse, belonging to category: Optical Mouse
Response: The Corelogic Smooth Mouse is a high-quality optical mouse with a smooth surface. The mouse has a Hz polling rate. The mouse is in three colours: black, white and red. The mouse has a USB receiver and is compatible with Windows and Mac.
Prompt: Create a detailed description for the following product: Hoover Lightspeed, belonging to category: Cordless Vacuum Cleaner
Response: The Hoover Lightspeed cordless vacuum cleaner is a compact and lightweight device that is easy to use. The device has a lithium-ion battery that provides up to minutes of cleaning time. The vacuum cleaner has a high-performance filter that ensures that the air is cleaned of dust and allergens. The device has a crevice tool that can be used to clean hard-to-reach areas.
Again, there isn't much of an improvement in the quality of the output text.
Key Observations
Based on the above set of trials, and further evidence detailed in the excellent publication presenting QLoRA, it can be deduced that the value of r (the rank of the matrices updated during adaptation) does not improve adaptation quality beyond a certain point. The biggest improvement is observed in targeting all linear layers in the adaptation process, as opposed to just the attention blocks, as is commonly documented in technical literature detailing LoRA and QLoRA. The trials executed above and other empirical evidence suggest that QLoRA does not suffer from any discernible reduction in the quality of text generated, compared to LoRA.
Further Considerations for using LoRA adapters in deployment
It's important to optimize the usage of adapters and understand the limitations of the technique. The size of the LoRA adapter obtained through finetuning is typically just a few megabytes, while the pretrained base model can be several gigabytes in memory and on disk. During inference, both the adapter and the pretrained LLM need to be loaded, so the memory requirement remains similar. Furthermore, if the weights of the pre-trained LLM and the adapter aren't merged, there will be a slight increase in inference latency. Fortunately, with the PEFT library, the process of merging the weights with the adapter can be done with a single line of code, as shown here:

merged_model = peft_model.merge_and_unload()

The figure below outlines the process from fine-tuning an adapter to model deployment.
While the adapter pattern offers significant benefits, merging adapters is not a universal solution. One advantage of the adapter pattern is the ability to deploy a single large pretrained model with task-specific adapters. This allows for efficient inference by utilizing the pretrained model as a backbone for different tasks. However, merging weights makes this approach impossible. The decision to merge weights depends on the specific use case and acceptable inference latency.
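To illustrate the single-base-model, many-adapters pattern described above, here is a hedged sketch using PEFT's load_adapter and set_adapter APIs; the adapter directories ("adapters/product-descriptions", "adapters/marketing-emails") and adapter names are hypothetical and not part of the blog.

from transformers import LlamaForCausalLM
from peft import PeftModel

# One frozen base model in memory, with multiple task-specific adapters switched at request time.
base = LlamaForCausalLM.from_pretrained('openlm-research/open_llama_3b_v2', device_map='auto')

# Hypothetical adapter checkpoints produced by separate fine-tuning runs.
serving_model = PeftModel.from_pretrained(base, "adapters/product-descriptions", adapter_name="descriptions")
serving_model.load_adapter("adapters/marketing-emails", adapter_name="emails")

# Route each request to the adapter for its task; the base weights stay shared and unmerged.
serving_model.set_adapter("descriptions")
# ... generate product descriptions ...
serving_model.set_adapter("emails")
# ... generate marketing copy ...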
Nonetheless, LoRA/QLoRA continues to be a highly effective method for parameter-efficient fine-tuning and is widely used.
Low Rank Adaptation is a powerful fine-tuning technique that can yield great results if used with the right configuration. Choosing the correct value of rank and the layers of the neural network architecture to target during adaptation could decide the quality of the output from the fine-tuned model. QLoRA results in further memory savings while preserving the adaptation quality. Even once the fine-tuning is performed, there are several important engineering considerations to ensure the adapted model is deployed in the correct manner.
In summary, a concise table indicating the different combinations of LoRA parameters attempted, text quality output and number of parameters updated when fine-tuning OpenLLaMA-3b-v2 for 3 epochs on 5000 observations on a single A100 is shown below.

target_modules | Base model weights | Quality of output | Number of parameters updated (in millions)
Attention blocks | 4-bit | low | 2.662 / 5.324
All linear layers | 4-bit and 8-bit | high | 12.995

Try this on Databricks! Clone the GitHub repository associated with the blog into a Databricks Repo to get started. More thoroughly documented examples to finetune models on Databricks are available.
  96. ----------
  97. Training LLMs Course: Discover Fine-Tuning Techniques (W&B AI Academy)
Foundations: Course Introduction; NeurIPS LLM Efficiency Challenge; NeurIPS LLM Efficiency Challenge Q&A; Hands On LLM Fine-tuning; Start Your Experiments!
Evaluation: Introduction to LLM Evaluation; Demystifying Perplexity; HumanEval and LLM Performance Analysis; LLM Benchmarks; Deep Dive into HELM; Chatbot Arena; Use Case Specific Benchmarks; Evaluating LLM Apps; Conclusions; LLM Evaluation Q&A
Data: Introduction to Data for Training LLMs; Find Out More about MosaicML; Friendly Advice; How Much Data?; Data Sources & Cost Q&A; Which Data?; Logistics of Data Loading
Training & Fine-tuning Techniques: Introduction to Training & Fine-tuning Techniques; Hardware Requirements; Memory Usage; What Should You Train?; Training Observability
Course Assessment & Next Steps: Course Assessment; Resources for Further Learning
About this course: Free. 37 lessons, 4 hours of video content. Learn the fundamentals of large language models: find out about the types of LLMs, model architectures, parameter sizes and scaling laws. Curate a dataset and establish an evaluation approach: learn how to find or curate a dataset for LLM training, dive into the evaluation metrics for various LLM tasks and compare their performance across a range of benchmarks. Master training and fine-tuning techniques: learn hands-on advanced training strategies like LoRA, prefix tuning, prompt tuning, and Reinforcement Learning through Human Feedback (RLHF). Prerequisites: working knowledge of machine learning, intermediate Python experience, and familiarity with DL frameworks (PyTorch/TensorFlow).
  98. ----------
  99.
  100. ----------
  101.
  102. ----------
  103.