1 year ago · 216447a490
--- a/README.md
+++ b/README.md
@@ -1,12 +1,14 @@
 
				 # Llama Cookbook: The Official Guide to building with Llama Models
			
 
				 
			
 
				+Check out our latest model tutorial here: [Build with Llama 4 Scout](./getting-started/build_with_llama_4.ipynb)
			
 
				+
			
 
				 Welcome to the official repository for helping you get started with [inference](https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/inference/), [fine-tuning](https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/finetuning) and [end-to-end use-cases](https://github.com/meta-llama/llama-cookbook/tree/main/end-to-end-use-cases) of building with the Llama Model family.
			
 
				 
			
 
				 This repository covers the most popular community approaches, use-cases and the latest recipes for Llama Text and Vision models.
			
 
				 
			
 
				 > [!TIP]
			
 
				 > Popular getting started links:
			
 
				-> * [Build with Llama Tutorial](https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/build_with_Llama_3_2.ipynb)
			
 
				+> * [Build with Llama 4 Scout](https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/build_with_llama_4.ipynb)
			
 
				 > * [Multimodal Inference with Llama 3.2 Vision](https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/inference/local_inference/README.md#multimodal-inference)
			
 
				 > * [Inferencing using Llama Guard (Safety Model)](https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/responsible_ai/llama_guard/)
			
 
				 
			
@@ -29,19 +31,19 @@ This repository covers the most popular community approaches, use-cases and the
 
				 ## FAQ:
			
 
				 ## FAQ:
			
 
				 
			
 
				-- **Q:** What happened to llama-recipes?  
			
 
				+- **Q:** What happened to llama-recipes?
			
 
				   **A:** We recently renamed llama-recipes to llama-cookbook.
			
 
				 
			
 
				-- **Q:** Prompt Template changes for Multi-Modality?  
			
 
				+- **Q:** Prompt Template changes for Multi-Modality?
			
 
				   **A:** Llama 3.2 follows the same prompt template as Llama 3.1, with a new special token `<|image|>` representing the input image for the multimodal models. More details on the prompt templates for image reasoning, tool-calling, and code interpreter can be found [on the documentation website](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_2).
			
 
				 
			
 
				-- **Q:** I have some questions for Fine-Tuning, is there a section to address these?  
			
 
				+- **Q:** I have some questions for Fine-Tuning, is there a section to address these?
			
 
				   **A:** Check out the Fine-Tuning FAQ [here](https://github.com/meta-llama/llama-cookbook/tree/main/src/docs/).
			
 
				 
			
 
				-- **Q:** Some links are broken/folders are missing:  
			
 
				+- **Q:** Some links are broken/folders are missing:
			
 
				   **A:** We recently did a refactor of the repo, [archive-main](https://github.com/meta-llama/llama-cookbook/tree/archive-main) is a snapshot branch from before the refactor.
			
 
				 
			
 
				-- **Q:** Where can we find details about the latest models?  
			
 
				+- **Q:** Where can we find details about the latest models?
			
 
				   **A:** Official [Llama models website](https://www.llama.com).
			
 
				 
			
 
				 ## Contributing
			
--- a/getting-started/Prompt_Engineering_with_Llama.ipynb
+++ b/getting-started/Prompt_Engineering_with_Llama.ipynb
@@ -1,769 +0,0 @@
 
				-{
			
 
				- "cells": [
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "<a href=\"https://colab.research.google.com/github/meta-llama/llama-cookbook/blob/main/getting-started/Prompt_Engineering_with_Llama.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
			
 
				-    "\n",
			
 
				-    "# Prompt Engineering with Llama\n",
			
 
				-    "\n",
			
 
				-    "Prompt engineering is using natural language to produce a desired response from a large language model (LLM).\n",
			
 
				-    "\n",
			
 
				-    "This interactive guide covers prompt engineering & best practices with Llama.\n",
			
 
				-    "\n",
			
 
				-    "Note: The notebook can be extended to any (latest) Llama models."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## Introduction"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Why now?\n",
			
 
				-    "\n",
			
 
				-    "[Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762) introduced the world to transformer neural networks (originally for machine translation). Transformers ushered in an era of generative AI with diffusion models for image creation and large language models (`LLMs`) as **programmable deep learning networks**.\n",
			
 
				-    "\n",
			
 
				-    "Programming foundational LLMs is done with natural language – it doesn't require training/tuning like ML models of the past. This has opened the door to a massive amount of innovation and a paradigm shift in how technology can be deployed. The science/art of using natural language to program language models to accomplish a task is referred to as **Prompt Engineering**."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Llama Models\n",
			
 
				-    "\n",
			
 
				-    "In 2023, Meta introduced the [Llama language models](https://ai.meta.com/llama/) (Llama Chat, Code Llama, Llama Guard). These are general purpose, state-of-the-art LLMs.\n",
			
 
				-    "\n",
			
 
				-    "Llama models come in varying parameter sizes. The smaller models are cheaper to deploy and run; the larger models are more capable.\n",
			
 
				-    "\n",
			
 
				-    "#### Llama 3.1\n",
			
 
				-    "1. `llama-3.1-8b` - base pretrained 8 billion parameter model\n",
			
 
				-    "1. `llama-3.1-70b` - base pretrained 70 billion parameter model\n",
			
 
				-    "1. `llama-3.1-405b` - base pretrained 405 billion parameter model\n",
			
 
				-    "1. `llama-3.1-8b-instruct` - instruction fine-tuned 8 billion parameter model\n",
			
 
				-    "1. `llama-3.1-70b-instruct` - instruction fine-tuned 70 billion parameter model\n",
			
 
				-    "1. `llama-3.1-405b-instruct` - instruction fine-tuned 405 billion parameter model (flagship)\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "#### Llama 3\n",
			
 
				-    "1. `llama-3-8b` - base pretrained 8 billion parameter model\n",
			
 
				-    "1. `llama-3-70b` - base pretrained 70 billion parameter model\n",
			
 
				-    "1. `llama-3-8b-instruct` - instruction fine-tuned 8 billion parameter model\n",
			
 
				-    "1. `llama-3-70b-instruct` - instruction fine-tuned 70 billion parameter model (flagship)\n",
			
 
				-    "\n",
			
 
				-    "#### Llama 2\n",
			
 
				-    "1. `llama-2-7b` - base pretrained 7 billion parameter model\n",
			
 
				-    "1. `llama-2-13b` - base pretrained 13 billion parameter model\n",
			
 
				-    "1. `llama-2-70b` - base pretrained 70 billion parameter model\n",
			
 
				-    "1. `llama-2-7b-chat` - chat fine-tuned 7 billion parameter model\n",
			
 
				-    "1. `llama-2-13b-chat` - chat fine-tuned 13 billion parameter model\n",
			
 
				-    "1. `llama-2-70b-chat` - chat fine-tuned 70 billion parameter model (flagship)\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## Getting an LLM\n",
			
 
				-    "\n",
			
 
				-    "Large language models are deployed and accessed in a variety of ways, including:\n",
			
 
				-    "\n",
			
 
				-    "1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama on your MacBook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).\n",
			
 
				-    "    * Best for privacy/security or if you already have a GPU.\n",
			
 
				-    "1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama on cloud providers like AWS, Azure, GCP, and others.\n",
			
 
				-    "    * Best for customizing models and their runtime (ex. fine-tuning a model for your use case).\n",
			
 
				-    "1. **Hosted API**: Call LLMs directly via an API. There are many companies that provide Llama inference APIs including AWS Bedrock, Replicate, Anyscale, Together and others.\n",
			
 
				-    "    * Easiest option overall."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Hosted APIs\n",
			
 
				-    "\n",
			
 
				-    "Hosted APIs are the easiest way to get started. We'll use them here. There are usually two main endpoints:\n",
			
 
				-    "\n",
			
 
				-    "1. **`completion`**: generate a response to a given prompt (a string).\n",
			
 
				-    "1. **`chat_completion`**: generate the next message in a list of messages, enabling more explicit instruction and context for use cases like chatbots."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## Tokens\n",
			
 
				-    "\n",
			
 
				-    "LLMs process inputs and outputs in chunks called *tokens*. Think of these, roughly, as words – each model will have its own tokenization scheme. For example, this sentence...\n",
			
 
				-    "\n",
			
 
				-    "> Our destiny is written in the stars.\n",
			
 
				-    "\n",
			
 
				-    "...is tokenized into `[\"Our\", \" destiny\", \" is\", \" written\", \" in\", \" the\", \" stars\", \".\"]` for Llama 3. See [this](https://tiktokenizer.vercel.app/?model=meta-llama%2FMeta-Llama-3-8B) for an interactive tokenizer tool.\n",
			
 
				-    "\n",
			
 
				-    "Tokens matter most when you consider API pricing and internal behavior (ex. hyperparameters).\n",
			
 
				-    "\n",
			
 
				-    "Each model has a maximum context length that your prompt cannot exceed. That's 128K tokens for Llama 3.1, 4K for Llama 2, and 100K for Code Llama.\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## Notebook Setup\n",
			
 
				-    "\n",
			
 
				-    "The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 3.1 chat using [Groq](https://console.groq.com/playground?model=llama3-70b-8192).\n",
			
 
				-    "\n",
			
 
				-    "To install prerequisites run:"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "import sys\n",
			
 
				-    "!{sys.executable} -m pip install groq"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "import os\n",
			
 
				-    "from typing import Dict, List\n",
			
 
				-    "from groq import Groq\n",
			
 
				-    "\n",
			
 
				-    "# Get a free API key from https://console.groq.com/keys\n",
			
 
				-    "os.environ[\"GROQ_API_KEY\"] = \"YOUR_GROQ_API_KEY\"\n",
			
 
				-    "\n",
			
 
				-    "LLAMA3_405B_INSTRUCT = \"llama-3.1-405b-reasoning\" # Note: Groq currently only gives access here to paying customers for 405B model\n",
			
 
				-    "LLAMA3_70B_INSTRUCT = \"llama-3.1-70b-versatile\"\n",
			
 
				-    "LLAMA3_8B_INSTRUCT = \"llama-3.1-8b-instant\"\n",
			
 
				-    "\n",
			
 
				-    "DEFAULT_MODEL = LLAMA3_70B_INSTRUCT\n",
			
 
				-    "\n",
			
 
				-    "client = Groq()\n",
			
 
				-    "\n",
			
 
				-    "def assistant(content: str):\n",
			
 
				-    "    return { \"role\": \"assistant\", \"content\": content }\n",
			
 
				-    "\n",
			
 
				-    "def user(content: str):\n",
			
 
				-    "    return { \"role\": \"user\", \"content\": content }\n",
			
 
				-    "\n",
			
 
				-    "def chat_completion(\n",
			
 
				-    "    messages: List[Dict],\n",
			
 
				-    "    model: str = DEFAULT_MODEL,\n",
			
 
				-    "    temperature: float = 0.6,\n",
			
 
				-    "    top_p: float = 0.9,\n",
			
 
				-    ") -> str:\n",
			
 
				-    "    response = client.chat.completions.create(\n",
			
 
				-    "        messages=messages,\n",
			
 
				-    "        model=model,\n",
			
 
				-    "        temperature=temperature,\n",
			
 
				-    "        top_p=top_p,\n",
			
 
				-    "    )\n",
			
 
				-    "    return response.choices[0].message.content\n",
			
 
				-    "        \n",
			
 
				-    "\n",
			
 
				-    "def completion(\n",
			
 
				-    "    prompt: str,\n",
			
 
				-    "    model: str = DEFAULT_MODEL,\n",
			
 
				-    "    temperature: float = 0.6,\n",
			
 
				-    "    top_p: float = 0.9,\n",
			
 
				-    ") -> str:\n",
			
 
				-    "    return chat_completion(\n",
			
 
				-    "        [user(prompt)],\n",
			
 
				-    "        model=model,\n",
			
 
				-    "        temperature=temperature,\n",
			
 
				-    "        top_p=top_p,\n",
			
 
				-    "    )\n",
			
 
				-    "\n",
			
 
				-    "def complete_and_print(prompt: str, model: str = DEFAULT_MODEL):\n",
			
 
				-    "    print(f'==============\\n{prompt}\\n==============')\n",
			
 
				-    "    response = completion(prompt, model)\n",
			
 
				-    "    print(response, end='\\n\\n')\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Completion APIs\n",
			
 
				-    "\n",
			
 
				-    "Let's try Llama 3.1!"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "complete_and_print(\"The typical color of the sky is: \")"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "complete_and_print(\"which model version are you?\")"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Chat Completion APIs\n",
			
 
				-    "Chat completion models provide additional structure to interacting with an LLM. An array of structured message objects is sent to the LLM instead of a single piece of text. This message list provides the LLM with some \"context\" or \"history\" from which to continue.\n",
			
 
				-    "\n",
			
 
				-    "Typically, each message contains `role` and `content`:\n",
			
 
				-    "* Messages with the `system` role are used to provide core instruction to the LLM by developers.\n",
			
 
				-    "* Messages with the `user` role are typically human-provided messages.\n",
			
 
				-    "* Messages with the `assistant` role are typically generated by the LLM."
			
 
				-   ]
			
 
				-  },
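The role/content structure described in this cell can be sketched without calling any API. This is a minimal illustration, not the notebook's own helper code: `message` and `build_messages` are hypothetical names, and the `system` message shows where a developer instruction would typically go.

```python
from typing import Dict, List, Tuple


def message(role: str, content: str) -> Dict[str, str]:
    """Build one chat message in the role/content shape described above."""
    return {"role": role, "content": content}


def build_messages(
    instruction: str,
    turns: List[Tuple[str, str]],
    question: str,
) -> List[Dict[str, str]]:
    """Assemble a message list: system instruction, prior turns, new question."""
    messages = [message("system", instruction)]
    for user_text, assistant_text in turns:
        messages.append(message("user", user_text))
        messages.append(message("assistant", assistant_text))
    messages.append(message("user", question))
    return messages


messages = build_messages(
    "Answer in one short sentence.",
    [("My favorite color is blue.", "That's great to hear!")],
    "What is my favorite color?",
)
for m in messages:
    print(f"{m['role']}: {m['content']}")
```

A list built this way has the shape a chat completion endpoint usually expects as its `messages` argument.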
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "response = chat_completion(messages=[\n",
			
 
				-    "    user(\"My favorite color is blue.\"),\n",
			
 
				-    "    assistant(\"That's great to hear!\"),\n",
			
 
				-    "    user(\"What is my favorite color?\"),\n",
			
 
				-    "])\n",
			
 
				-    "print(response)\n",
			
 
				-    "# \"Sure, I can help you with that! Your favorite color is blue.\""
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### LLM Hyperparameters\n",
			
 
				-    "\n",
			
 
				-    "#### `temperature` & `top_p`\n",
			
 
				-    "\n",
			
 
				-    "These APIs also take parameters which influence the creativity and determinism of your output.\n",
			
 
				-    "\n",
			
 
				-    "At each step, LLMs generate a list of most likely tokens and their respective probabilities. The least likely tokens are \"cut\" from the list (based on `top_p`), and then a token is randomly selected from the remaining candidates (`temperature`).\n",
			
 
				-    "\n",
			
 
				-    "In other words: `top_p` controls the breadth of vocabulary in a generation and `temperature` controls the randomness within that vocabulary. A temperature of ~0 produces *almost* deterministic results.\n",
			
 
				-    "\n",
			
 
				-    "[Read more about temperature setting here](https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api-a-few-tips-and-tricks-on-controlling-the-creativity-deterministic-output-of-prompt-responses/172683).\n",
			
 
				-    "\n",
			
 
				-    "Let's try it out:"
			
 
				-   ]
			
 
				-  },
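To make the interaction between `temperature` and `top_p` concrete, here is a toy sampler over a hand-written set of logits. It is a sketch under stated assumptions: `sample_token` and `TOY_LOGITS` are hypothetical, the logit values are invented, and the ordering used (temperature scaling first, then the `top_p` cut) is one common choice; real inference servers may order or implement these steps differently.

```python
import math
import random


def sample_token(logits, temperature=0.6, top_p=0.9, rng=random):
    """Sample one token from a dict mapping token -> logit (toy values)."""
    # Temperature rescales the logits; values near 0 sharpen the distribution.
    scaled = {tok: logit / max(temperature, 1e-6) for tok, logit in logits.items()}
    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # top_p keeps the smallest set of most likely tokens whose cumulative
    # probability reaches top_p; everything else is cut.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens, weights = zip(*kept)
    # Draw from the surviving candidates in proportion to their probability.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Invented logits for the next token after "The sky is".
TOY_LOGITS = {"blue": 3.0, "gray": 1.5, "orange": 0.5, "green": -1.0}

print(sample_token(TOY_LOGITS, temperature=0.01, top_p=0.01))
print(sample_token(TOY_LOGITS, temperature=1.0, top_p=1.0))
```

With both values near zero only the single most likely token survives the cut, so repeated calls return the same token. With both at 1.0 every token stays eligible and repeated calls can differ.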
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "def print_tuned_completion(temperature: float, top_p: float):\n",
			
 
				-    "    response = completion(\"Write a haiku about llamas\", temperature=temperature, top_p=top_p)\n",
			
 
				-    "    print(f'[temperature: {temperature} | top_p: {top_p}]\\n{response.strip()}\\n')\n",
			
 
				-    "\n",
			
 
				-    "print_tuned_completion(0.01, 0.01)\n",
			
 
				-    "print_tuned_completion(0.01, 0.01)\n",
			
 
				-    "# These two generations are highly likely to be the same\n",
			
 
				-    "\n",
			
 
				-    "print_tuned_completion(1.0, 1.0)\n",
			
 
				-    "print_tuned_completion(1.0, 1.0)\n",
			
 
				-    "# These two generations are highly likely to be different"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## Prompting Techniques"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Explicit Instructions\n",
			
 
				-    "\n",
			
 
				-    "Detailed, explicit instructions produce better results than open-ended prompts:"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "complete_and_print(prompt=\"Describe quantum physics in one short sentence of no more than 12 words\")\n",
			
 
				-    "# Returns a succinct explanation of quantum physics that mentions particles and states existing simultaneously."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "You can think of giving explicit instructions as applying rules and restrictions to how Llama 3 responds to your prompt.\n",
			
 
				-    "\n",
			
 
				-    "- Stylization\n",
			
 
				-    "    - `Explain this to me like a topic on a children's educational network show teaching elementary students.`\n",
			
 
				-    "    - `I'm a software engineer using large language models for summarization. Summarize the following text in under 250 words:`\n",
			
 
				-    "    - `Give your answer like an old timey private investigator hunting down a case step by step.`\n",
			
 
				-    "- Formatting\n",
			
 
				-    "    - `Use bullet points.`\n",
			
 
				-    "    - `Return as a JSON object.`\n",
			
 
				-    "    - `Use less technical terms and help me apply it in my work in communications.`\n",
			
 
				-    "- Restrictions\n",
			
 
				-    "    - `Only use academic papers.`\n",
			
 
				-    "    - `Never give sources older than 2020.`\n",
			
 
				-    "    - `If you don't know the answer, say that you don't know.`\n",
			
 
				-    "\n",
			
 
				-    "Here's an example of using explicit instructions to get more specific results by limiting the response to recently created sources."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "complete_and_print(\"Explain the latest advances in large language models to me.\")\n",
			
 
				-    "# More likely to cite sources from 2017\n",
			
 
				-    "\n",
			
 
				-    "complete_and_print(\"Explain the latest advances in large language models to me. Always cite your sources. Never cite sources older than 2020.\")\n",
			
 
				-    "# Gives more specific advances and only cites sources from 2020 or later"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Example Prompting using Zero- and Few-Shot Learning\n",
			
 
				-    "\n",
			
 
				-    "A shot is an example or demonstration of what type of prompt and response you expect from a large language model. This term originates from training computer vision models on photographs, where one shot was one example or instance that the model used to classify an image ([Fei-Fei et al. (2006)](http://vision.stanford.edu/documents/Fei-FeiFergusPerona2006.pdf)).\n",
			
 
				-    "\n",
			
 
				-    "#### Zero-Shot Prompting\n",
			
 
				-    "\n",
			
 
				-    "Large language models like Llama 3 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called \"zero-shot prompting\".\n",
			
 
				-    "\n",
			
 
				-    "Let's try using Llama 3 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "complete_and_print(\"Text: This was the best movie I've ever seen! \\n The sentiment of the text is: \")\n",
			
 
				-    "# Returns positive sentiment\n",
			
 
				-    "\n",
			
 
				-    "complete_and_print(\"Text: The director was trying too hard. \\n The sentiment of the text is: \")\n",
			
 
				-    "# Returns negative sentiment"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "\n",
			
 
				-    "#### Few-Shot Prompting\n",
			
 
				-    "\n",
			
 
				-    "Adding specific examples of your desired output generally results in more accurate, consistent output. This technique is called \"few-shot prompting\".\n",
			
 
				-    "\n",
			
 
				-    "In this example, the generated response follows our desired format: a more nuanced sentiment classification that reports confidence percentages for positive, neutral, and negative.\n",
			
 
				-    "\n",
			
 
				-    "See also: [Zhao et al. (2021)](https://arxiv.org/abs/2102.09690), [Liu et al. (2021)](https://arxiv.org/abs/2101.06804), [Su et al. (2022)](https://arxiv.org/abs/2209.01975), [Rubin et al. (2022)](https://arxiv.org/abs/2112.08633).\n",
			
 
				-    "\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "def sentiment(text):\n",
			
 
				-    "    response = chat_completion(messages=[\n",
			
 
				-    "        user(\"You are a sentiment classifier. For each message, give the percentage of positive/neutral/negative.\"),\n",
			
 
				-    "        user(\"I liked it\"),\n",
			
 
				-    "        assistant(\"70% positive 30% neutral 0% negative\"),\n",
			
 
				-    "        user(\"It could be better\"),\n",
			
 
				-    "        assistant(\"0% positive 50% neutral 50% negative\"),\n",
			
 
				-    "        user(\"It's fine\"),\n",
			
 
				-    "        assistant(\"25% positive 50% neutral 25% negative\"),\n",
			
 
				-    "        user(text),\n",
			
 
				-    "    ])\n",
			
 
				-    "    return response\n",
			
 
				-    "\n",
			
 
				-    "def print_sentiment(text):\n",
			
 
				-    "    print(f'INPUT: {text}')\n",
			
 
				-    "    print(sentiment(text))\n",
			
 
				-    "\n",
			
 
				-    "print_sentiment(\"I thought it was okay\")\n",
			
 
				-    "# More likely to return a balanced mix of positive, neutral, and negative\n",
			
 
				-    "print_sentiment(\"I loved it!\")\n",
			
 
				-    "# More likely to return 100% positive\n",
			
 
				-    "print_sentiment(\"Terrible service 0/10\")\n",
			
 
				-    "# More likely to return 100% negative"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Role Prompting\n",
			
 
				-    "\n",
			
 
				-    "Llama will often give more consistent responses when given a role ([Kong et al. (2023)](https://arxiv.org/pdf/2308.07702)). Roles give context to the LLM on what type of answers are desired.\n",
			
 
				-    "\n",
			
 
				-    "Let's use Llama 3 to create a more focused, technical response for a question around the pros and cons of using PyTorch."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "complete_and_print(\"Explain the pros and cons of using PyTorch.\")\n",
			
 
				-    "# More likely to give a general overview covering areas like documentation and the PyTorch community, and to mention a steep learning curve\n",
			
 
				-    "\n",
			
 
				-    "complete_and_print(\"Your role is a machine learning expert who gives highly technical advice to senior engineers who work with complicated datasets. Explain the pros and cons of using PyTorch.\")\n",
			
 
				-    "# Often results in more technical benefits and drawbacks that provide more technical details on how model layers"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Chain-of-Thought\n",
			
 
				-    "\n",
			
 
				-    "Simply adding a phrase encouraging step-by-step thinking \"significantly improves the ability of large language models to perform complex reasoning\" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called \"CoT\" or \"Chain-of-Thought\" prompting.\n",
			
 
				-    "\n",
			
 
				-    "Llama 3.1 now reasons step-by-step naturally without the addition of the phrase. This section remains for completeness."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "prompt = \"Who lived longer, Mozart or Elvis?\"\n",
			
 
				-    "\n",
			
 
				-    "complete_and_print(prompt)\n",
			
 
				-    "# Llama 2 would often give the incorrect answer of \"Mozart\"\n",
			
 
				-    "\n",
			
 
				-    "complete_and_print(f\"{prompt} Let's think through this carefully, step by step.\")\n",
			
 
				-    "# Gives the correct answer \"Elvis\""
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Self-Consistency\n",
			
 
				-    "\n",
			
 
				-    "LLMs are probabilistic, so even with Chain-of-Thought, a single generation might produce incorrect results. Self-Consistency ([Wang et al. (2022)](https://arxiv.org/abs/2203.11171)) improves accuracy by selecting the most frequent answer from multiple generations (at the cost of higher compute):"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "import re\n",
			
 
				-    "from statistics import mode\n",
			
 
				-    "\n",
			
 
				-    "def gen_answer():\n",
			
 
				-    "    response = completion(\n",
			
 
				-    "        \"John found that the average of 15 numbers is 40. \"\n",
			
 
				-    "        \"If 10 is added to each number then the mean of the numbers is? \"\n",
			
 
				-    "        \"Report the answer surrounded by backticks (example: `123`)\",\n",
			
 
				-    "    )\n",
			
 
				-    "    match = re.search(r'`(\\d+)`', response)\n",
			
 
				-    "    if match is None:\n",
			
 
				-    "        return None\n",
			
 
				-    "    return match.group(1)\n",
			
 
				-    "\n",
			
 
				-    "answers = [gen_answer() for i in range(5)]\n",
			
 
				-    "\n",
			
 
				-    "print(\n",
			
 
				-    "    f\"Answers: {answers}\\n\",\n",
			
 
				-    "    f\"Final answer: {mode(answers)}\",\n",
			
 
				-    "    )\n",
			
 
				-    "\n",
			
 
				-    "# Sample runs of Llama-3-70B (all correct):\n",
			
 
				-    "# ['60', '50', '50', '50', '50'] -> 50\n",
			
 
				-    "# ['50', '50', '50', '60', '50'] -> 50\n",
			
 
				-    "# ['50', '50', '60', '50', '50'] -> 50"
			
 
				-   ]
			
 
				-  },
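The voting step can also be written so that generations with no parseable answer are skipped rather than counted as a candidate. The sketch below is an illustration only: `self_consistent_answer` is a hypothetical helper, and the canned list stands in for real model calls so the example runs offline.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    generate: Callable[[], Optional[str]],
    n: int = 5,
) -> Optional[str]:
    """Call `generate` n times and return the most frequent non-None answer."""
    answers = [generate() for _ in range(n)]
    # Drop generations where no answer could be extracted.
    valid = [answer for answer in answers if answer is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Stand-in for five model generations; None marks an unparseable response.
canned = iter(["50", None, "60", "50", "50"])
print(self_consistent_answer(lambda: next(canned)))
```

In practice `generate` would wrap a model call plus the answer-extraction step, and `n` trades accuracy against compute cost.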
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Retrieval-Augmented Generation\n",
			
 
				-    "\n",
			
 
				-    "You'll probably want to use factual knowledge in your application. You can extract common facts from today's large models out-of-the-box (i.e. using just the model weights):"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "complete_and_print(\"What is the capital of California?\")\n",
			
 
				-    "# Gives the correct answer \"Sacramento\""
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "However, more specific facts, or private information, cannot be reliably retrieved. The model will either declare it does not know or hallucinate an incorrect answer:"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "complete_and_print(\"What was the temperature in Menlo Park on December 12th, 2023?\")\n",
			
 
				-    "# \"I'm just an AI, I don't have access to real-time weather data or historical weather records.\"\n",
			
 
				-    "\n",
			
 
				-    "complete_and_print(\"What time is my dinner reservation on Saturday and what should I wear?\")\n",
			
 
				-    "# \"I'm not able to access your personal information [..] I can provide some general guidance\""
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "Retrieval-Augmented Generation, or RAG, describes the practice of including in the prompt information you've retrieved from an external database ([Lewis et al. (2020)](https://arxiv.org/abs/2005.11401v4)). It's an effective way to incorporate facts into your LLM application and is more affordable than fine-tuning, which may be costly and may degrade the foundation model's capabilities.\n",
			
 
				-    "\n",
			
 
				-    "This could be as simple as a lookup table or as sophisticated as a vector database (e.g. one built on [FAISS](https://github.com/facebookresearch/faiss)) containing all of your company's knowledge:"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "MENLO_PARK_TEMPS = {\n",
			
 
				-    "    \"2023-12-11\": \"52 degrees Fahrenheit\",\n",
			
 
				-    "    \"2023-12-12\": \"51 degrees Fahrenheit\",\n",
			
 
				-    "    \"2023-12-13\": \"51 degrees Fahrenheit\",\n",
			
 
				-    "}\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "def prompt_with_rag(retrieved_info, question):\n",
			
 
				-    "    complete_and_print(\n",
			
 
				-    "        f\"Given the following information: '{retrieved_info}', respond to: '{question}'\"\n",
			
 
				-    "    )\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "def ask_for_temperature(day):\n",
			
 
				-    "    temp_on_day = MENLO_PARK_TEMPS.get(day) or \"unknown temperature\"\n",
			
 
				-    "    prompt_with_rag(\n",
			
 
				-    "        f\"The temperature in Menlo Park was {temp_on_day} on {day}\",  # Retrieved fact\n",
			
 
				-    "        f\"What is the temperature in Menlo Park on {day}?\",  # User question\n",
			
 
				-    "    )\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "ask_for_temperature(\"2023-12-12\")\n",
			
 
				-    "# \"Sure! The temperature in Menlo Park on 2023-12-12 was 51 degrees Fahrenheit.\"\n",
			
 
				-    "\n",
			
 
				-    "ask_for_temperature(\"2023-07-18\")\n",
			
 
				-    "# \"I'm not able to provide the temperature in Menlo Park on 2023-07-18 as the information provided states that the temperature was unknown.\""
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Program-Aided Language Models\n",
			
 
				-    "\n",
			
 
				-    "LLMs, by nature, aren't great at performing calculations. Let's try:\n",
			
 
				-    "\n",
			
 
				-    "$$\n",
			
 
				-    "((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n",
			
 
				-    "$$\n",
			
 
				-    "\n",
			
 
				-    "(The correct answer is 91383.)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "complete_and_print(\"\"\"\n",
			
 
				-    "Calculate the answer to the following math problem:\n",
			
 
				-    "\n",
			
 
				-    "((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n",
			
 
				-    "\"\"\")\n",
			
 
				-    "# Gives incorrect answers like 92448, 92648, 95463"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "[Gao et al. (2022)](https://arxiv.org/abs/2211.10435) introduced the concept of \"Program-aided Language Models\" (PAL). While LLMs are bad at arithmetic, they're great for code generation. PAL leverages this fact by instructing the LLM to write code to solve calculation tasks."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "complete_and_print(\n",
			
 
				-    "    \"\"\"\n",
			
 
				-    "    # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n",
			
 
				-    "    \"\"\",\n",
			
 
				-    ")"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "# The following code was generated by Llama 3 70B:\n",
			
 
				-    "\n",
			
 
				-    "result = ((-5 + 93 * 4 - 0) * (4**4 - 7 + 0 * 5))\n",
			
 
				-    "print(result)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Limiting Extraneous Tokens\n",
			
 
				-    "\n",
			
 
				-    "A common struggle with Llama 2 is getting output without extraneous tokens (e.g. \"Sure! Here's more information on...\"), even when the model is explicitly instructed to be concise and to skip any preamble. Llama 3.x follows such instructions more reliably.\n",
			
 
				-    "\n",
			
 
				-    "Check out this improvement that combines a role, rules and restrictions, explicit instructions, and an example:"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "complete_and_print(\n",
			
 
				-    "    \"Give me the zip code for Menlo Park in JSON format with the field 'zip_code'\",\n",
			
 
				-    ")\n",
			
 
				-    "# Likely returns the JSON and also \"Sure! Here's the JSON...\"\n",
			
 
				-    "\n",
			
 
				-    "complete_and_print(\n",
			
 
				-    "    \"\"\"\n",
			
 
				-    "    You are a robot that only outputs JSON.\n",
			
 
				-    "    You reply in JSON format with the field 'zip_code'.\n",
			
 
				-    "    Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}\n",
			
 
				-    "    Now here is my question: What is the zip code of Menlo Park?\n",
			
 
				-    "    \"\"\",\n",
			
 
				-    ")\n",
			
 
				-    "# \"{'zip_code': 94025}\""
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## Additional References\n",
			
 
				-    "- [PromptingGuide.ai](https://www.promptingguide.ai/)\n",
			
 
				-    "- [LearnPrompting.org](https://learnprompting.org/)\n",
			
 
				-    "- [Lil'Log Prompt Engineering Guide](https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/)\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "attachments": {},
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## Author & Contact\n",
			
 
				-    "\n",
			
 
				-    "Edited by [Dalton Flanagan](https://www.linkedin.com/in/daltonflanagan/) (dalton@meta.com) with contributions from Mohsen Agsen, Bryce Bortree, Ricardo Juan Palma Duran, Kaolin Fire, Thomas Scialom."
			
 
				-   ]
			
 
				-  }
			
 
				- ],
			
 
				- "metadata": {
			
 
				-  "captumWidgetMessage": [],
			
 
				-  "dataExplorerConfig": [],
			
 
				-  "kernelspec": {
			
 
				-   "display_name": "Python 3 (ipykernel)",
			
 
				-   "language": "python",
			
 
				-   "name": "python3"
			
 
				-  },
			
 
				-  "language_info": {
			
 
				-   "codemirror_mode": {
			
 
				-    "name": "ipython",
			
 
				-    "version": 3
			
 
				-   },
			
 
				-   "file_extension": ".py",
			
 
				-   "mimetype": "text/x-python",
			
 
				-   "name": "python",
			
 
				-   "nbconvert_exporter": "python",
			
 
				-   "pygments_lexer": "ipython3",
			
 
				-   "version": "3.10.14"
			
 
				-  },
			
 
				-  "last_base_url": "https://bento.edge.x2p.facebook.net/",
			
 
				-  "last_kernel_id": "161e2a7b-2d2b-4995-87f3-d1539860ecac",
			
 
				-  "last_msg_id": "4eab1242-d815b886ebe4f5b1966da982_543",
			
 
				-  "last_server_session_id": "4a7b41c5-ed66-4dcb-a376-22673aebb469",
			
 
				-  "operator_data": [],
			
 
				-  "outputWidgetContext": []
			
 
				- },
			
 
				- "nbformat": 4,
			
 
				- "nbformat_minor": 4
			
 
				-}
			
--- a/getting-started/README.md
+++ b/getting-started/README.md
@@ -2,8 +2,7 @@
 
				 
			
 
				 If you are new to developing with Meta Llama models, this is where you should start. This folder contains introductory-level notebooks across different techniques relating to Meta Llama.
			
 
				 
			
 
				-* The [Build_with_Llama 3.2](./build_with_Llama_3_2.ipynb) notebook showcases a comprehensive walkthrough of the new capabilities of Llama 3.2 models, including multimodal use cases, function/tool calling, Llama Stack, and Llama on edge.
			
 
				-* The [Prompt_Engineering_with_Llama](./Prompt_Engineering_with_Llama.ipynb) notebook showcases the various ways to elicit appropriate outputs from Llama. Take this notebook for a spin to get a feel for how Llama responds to different inputs and generation parameters.
			
 
				+* The [Build_with_Llama 4](./build_with_llama_4.ipynb) notebook provides a comprehensive walkthrough of the new capabilities of Llama 4 Scout models, including long context, multi-image input, and function calling.
			
 
				 * The [inference](./inference/) folder contains scripts to deploy Llama for inference on server and mobile. See also [3p_integrations/vllm](../3p-integrations/vllm/) and [3p_integrations/tgi](../3p-integrations/tgi/) for hosting Llama on open-source model servers.
			
 
				 * The [RAG](./RAG/) folder contains a simple Retrieval-Augmented Generation application using Llama.
			
 
				 * The [finetuning](./finetuning/) folder contains resources to help you finetune Llama on your custom datasets, for both single- and multi-GPU setups. The scripts use the native llama-cookbook finetuning code found in [finetuning.py](../src/llama_cookbook/finetuning.py) which supports these features:
			
--- a/getting-started/build_with_Llama_3_2.ipynb
+++ b/getting-started/build_with_Llama_3_2.ipynb
--- a/getting-started/build_with_llama_4.ipynb
+++ b/getting-started/build_with_llama_4.ipynb
--- a/src/docs/img/a_llama_dressed_as_a_professional_mountain.jpeg
+++ b/src/docs/img/a_llama_dressed_as_a_professional_mountain.jpeg
--- a/src/docs/img/k1.jpg
+++ b/src/docs/img/k1.jpg
--- a/src/docs/img/k1_resized.jpg
+++ b/src/docs/img/k1_resized.jpg
--- a/src/docs/img/k2.jpg
+++ b/src/docs/img/k2.jpg
--- a/src/docs/img/k2_resized.jpg
+++ b/src/docs/img/k2_resized.jpg
--- a/src/docs/img/k3.jpg
+++ b/src/docs/img/k3.jpg
--- a/src/docs/img/k3_resized.jpg
+++ b/src/docs/img/k3_resized.jpg
--- a/src/docs/img/k4.jpg
+++ b/src/docs/img/k4.jpg
--- a/src/docs/img/k4_resized.jpg
+++ b/src/docs/img/k4_resized.jpg