Update prompt guide for Llama 3 (#484)

Jeff Tang 11 months ago
parent commit 6da989a774
1 changed file with 82 additions and 81 deletions

+ 82 - 81
recipes/quickstart/Prompt_Engineering_with_Llama_2.ipynb

@@ -5,11 +5,13 @@
    "cell_type": "markdown",
    "cell_type": "markdown",
    "metadata": {},
    "metadata": {},
    "source": [
    "source": [
-    "# Prompt Engineering with Llama 2\n",
+    "<a href=\"https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/quickstart/Prompt_Engineering_with_Llama_3.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
+    "\n",
+    "# Prompt Engineering with Llama 3\n",
     "\n",
     "\n",
     "Prompt engineering is using natural language to produce a desired response from a large language model (LLM).\n",
     "Prompt engineering is using natural language to produce a desired response from a large language model (LLM).\n",
     "\n",
     "\n",
-    "This interactive guide covers prompt engineering & best practices with Llama 2."
+    "This interactive guide covers prompt engineering & best practices with Llama 3."
   ]
  },
  {
@@ -41,7 +43,13 @@
     "\n",
     "\n",
     "In 2023, Meta introduced the [Llama language models](https://ai.meta.com/llama/) (Llama Chat, Code Llama, Llama Guard). These are general purpose, state-of-the-art LLMs.\n",
     "In 2023, Meta introduced the [Llama language models](https://ai.meta.com/llama/) (Llama Chat, Code Llama, Llama Guard). These are general purpose, state-of-the-art LLMs.\n",
     "\n",
     "\n",
-    "Llama 2 models come in 7 billion, 13 billion, and 70 billion parameter sizes. Smaller models are cheaper to deploy and run (see: deployment and performance); larger models are more capable.\n",
+    "Llama models come in varying parameter sizes. The smaller models are cheaper to deploy and run; the larger models are more capable.\n",
+    "\n",
+    "#### Llama 3\n",
+    "1. `llama-3-8b` - base pretrained 8 billion parameter model\n",
+    "1. `llama-3-70b` - base pretrained 70 billion parameter model\n",
+    "1. `llama-3-8b-instruct` - instruction fine-tuned 8 billion parameter model\n",
+    "1. `llama-3-70b-instruct` - instruction fine-tuned 70 billion parameter model (flagship)\n",
     "\n",
     "\n",
     "#### Llama 2\n",
     "#### Llama 2\n",
     "1. `llama-2-7b` - base pretrained 7 billion parameter model\n",
     "1. `llama-2-7b` - base pretrained 7 billion parameter model\n",
@@ -69,12 +77,15 @@
     "1. `codellama-7b` - code fine-tuned 7 billion parameter model\n",
     "1. `codellama-7b` - code fine-tuned 7 billion parameter model\n",
     "1. `codellama-13b` - code fine-tuned 13 billion parameter model\n",
     "1. `codellama-13b` - code fine-tuned 13 billion parameter model\n",
     "1. `codellama-34b` - code fine-tuned 34 billion parameter model\n",
     "1. `codellama-34b` - code fine-tuned 34 billion parameter model\n",
+    "1. `codellama-70b` - code fine-tuned 70 billion parameter model\n",
     "1. `codellama-7b-instruct` - code & instruct fine-tuned 7 billion parameter model\n",
     "1. `codellama-7b-instruct` - code & instruct fine-tuned 7 billion parameter model\n",
     "2. `codellama-13b-instruct` - code & instruct fine-tuned 13 billion parameter model\n",
     "2. `codellama-13b-instruct` - code & instruct fine-tuned 13 billion parameter model\n",
     "3. `codellama-34b-instruct` - code & instruct fine-tuned 34 billion parameter model\n",
     "3. `codellama-34b-instruct` - code & instruct fine-tuned 34 billion parameter model\n",
+    "3. `codellama-70b-instruct` - code & instruct fine-tuned 70 billion parameter model\n",
     "1. `codellama-7b-python` - Python fine-tuned 7 billion parameter model\n",
     "1. `codellama-7b-python` - Python fine-tuned 7 billion parameter model\n",
     "2. `codellama-13b-python` - Python fine-tuned 13 billion parameter model\n",
     "2. `codellama-13b-python` - Python fine-tuned 13 billion parameter model\n",
-    "3. `codellama-34b-python` - Python fine-tuned 34 billion parameter model"
+    "3. `codellama-34b-python` - Python fine-tuned 34 billion parameter model\n",
+    "3. `codellama-70b-python` - Python fine-tuned 70 billion parameter model"
   ]
  },
  {
@@ -86,11 +97,11 @@
     "\n",
     "\n",
     "Large language models are deployed and accessed in a variety of ways, including:\n",
     "Large language models are deployed and accessed in a variety of ways, including:\n",
     "\n",
     "\n",
-    "1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama 2 on your Macbook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).\n",
+    "1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama on your Macbook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).\n",
     "    * Best for privacy/security or if you already have a GPU.\n",
     "    * Best for privacy/security or if you already have a GPU.\n",
-    "1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama 2 on cloud providers like AWS, Azure, GCP, and others.\n",
+    "1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama on cloud providers like AWS, Azure, GCP, and others.\n",
     "    * Best for customizing models and their runtime (ex. fine-tuning a model for your use case).\n",
     "    * Best for customizing models and their runtime (ex. fine-tuning a model for your use case).\n",
-    "1. **Hosted API**: Call LLMs directly via an API. There are many companies that provide Llama 2 inference APIs including AWS Bedrock, Replicate, Anyscale, Together and others.\n",
+    "1. **Hosted API**: Call LLMs directly via an API. There are many companies that provide Llama inference APIs including AWS Bedrock, Replicate, Anyscale, Together and others.\n",
     "    * Easiest option overall."
     "    * Easiest option overall."
    ]
    ]
   },
   },
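As a concrete illustration of the self-hosting option, here is a minimal sketch using llama-cpp-python; the model path is hypothetical and assumes a quantized Llama 3 GGUF file has been downloaded locally.

```python
# Minimal self-hosting sketch with llama-cpp-python (hypothetical model path;
# assumes a locally downloaded, quantized Llama 3 GGUF file).
from llama_cpp import Llama

llm = Llama(model_path="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Name two uses of prompt engineering."}]
)
print(out["choices"][0]["message"]["content"])
```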
@@ -118,11 +129,11 @@
     "\n",
     "\n",
     "> Our destiny is written in the stars.\n",
     "> Our destiny is written in the stars.\n",
     "\n",
     "\n",
-    "...is tokenized into `[\"our\", \"dest\", \"iny\", \"is\", \"written\", \"in\", \"the\", \"stars\"]` for Llama 2.\n",
+    "...is tokenized into `[\"Our\", \" destiny\", \" is\", \" written\", \" in\", \" the\", \" stars\", \".\"]` for Llama 3. See [this](https://tiktokenizer.vercel.app/?model=meta-llama%2FMeta-Llama-3-8B) for an interactive tokenizer tool.\n",
     "\n",
     "\n",
     "Tokens matter most when you consider API pricing and internal behavior (ex. hyperparameters).\n",
     "Tokens matter most when you consider API pricing and internal behavior (ex. hyperparameters).\n",
     "\n",
     "\n",
-    "Each model has a maximum context length that your prompt cannot exceed. That's 4096 tokens for Llama 2 and 100K for Code Llama. \n"
+    "Each model has a maximum context length that your prompt cannot exceed. That's 8K tokens for Llama 3, 4K for Llama 2, and 100K for Code Llama. \n"
   ]
  },
  {
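To inspect Llama 3's tokenization yourself, one option is the Hugging Face transformers tokenizer; a sketch, assuming access to the gated meta-llama/Meta-Llama-3-8B checkpoint:

```python
# Sketch: inspect Llama 3 tokenization with transformers
# (assumes access to the gated meta-llama/Meta-Llama-3-8B repo).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
ids = tokenizer.encode("Our destiny is written in the stars.", add_special_tokens=False)
print(len(ids))                              # token count drives pricing and context limits
print(tokenizer.convert_ids_to_tokens(ids))  # BPE pieces; a leading space renders as "Ġ"
```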
@@ -132,7 +143,7 @@
    "source": [
    "source": [
     "## Notebook Setup\n",
     "## Notebook Setup\n",
     "\n",
     "\n",
-    "The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 2 chat using [Replicate](https://replicate.com/meta/llama-2-70b-chat) and use LangChain to easily set up a chat completion API.\n",
+    "The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 3 chat using [Grok](https://console.groq.com/playground?model=llama3-70b-8192).\n",
     "\n",
     "\n",
     "To install prerequisites run:"
     "To install prerequisites run:"
    ]
    ]
@@ -143,7 +154,8 @@
    "metadata": {},
    "metadata": {},
    "outputs": [],
    "outputs": [],
    "source": [
    "source": [
-    "pip install langchain replicate"
+    "import sys\n",
+    "!{sys.executable} -m pip install groq"
   ]
  },
  {
@@ -152,64 +164,54 @@
    "metadata": {},
    "metadata": {},
    "outputs": [],
    "outputs": [],
    "source": [
    "source": [
-    "from typing import Dict, List\n",
-    "from langchain.llms import Replicate\n",
-    "from langchain.memory import ChatMessageHistory\n",
-    "from langchain.schema.messages import get_buffer_string\n",
     "import os\n",
     "import os\n",
+    "from typing import Dict, List\n",
+    "from groq import Groq\n",
     "\n",
     "\n",
-    "# Get a free API key from https://replicate.com/account/api-tokens\n",
-    "os.environ[\"REPLICATE_API_TOKEN\"] = \"YOUR_KEY_HERE\"\n",
+    "# Get a free API key from https://console.groq.com/keys\n",
+    "os.environ[\"GROQ_API_KEY\"] = \"YOUR_GROQ_API_KEY\"\n",
     "\n",
     "\n",
-    "LLAMA2_70B_CHAT = \"meta/llama-2-70b-chat:2d19859030ff705a87c746f7e96eea03aefb71f166725aee39692f1476566d48\"\n",
-    "LLAMA2_13B_CHAT = \"meta/llama-2-13b-chat:f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d\"\n",
+    "LLAMA3_70B_INSTRUCT = \"llama3-70b-8192\"\n",
+    "LLAMA3_8B_INSTRUCT = \"llama3-8b-8192\"\n",
     "\n",
     "\n",
-    "# We'll default to the smaller 13B model for speed; change to LLAMA2_70B_CHAT for more advanced (but slower) generations\n",
-    "DEFAULT_MODEL = LLAMA2_13B_CHAT\n",
+    "DEFAULT_MODEL = LLAMA3_70B_INSTRUCT\n",
     "\n",
     "\n",
-    "def completion(\n",
-    "    prompt: str,\n",
-    "    model: str = DEFAULT_MODEL,\n",
+    "client = Groq()\n",
+    "\n",
+    "def assistant(content: str):\n",
+    "    return { \"role\": \"assistant\", \"content\": content }\n",
+    "\n",
+    "def user(content: str):\n",
+    "    return { \"role\": \"user\", \"content\": content }\n",
+    "\n",
+    "def chat_completion(\n",
+    "    messages: List[Dict],\n",
+    "    model = DEFAULT_MODEL,\n",
     "    temperature: float = 0.6,\n",
     "    temperature: float = 0.6,\n",
     "    top_p: float = 0.9,\n",
     "    top_p: float = 0.9,\n",
     ") -> str:\n",
     ") -> str:\n",
-    "    llm = Replicate(\n",
+    "    response = client.chat.completions.create(\n",
+    "        messages=messages,\n",
     "        model=model,\n",
     "        model=model,\n",
-    "        model_kwargs={\"temperature\": temperature,\"top_p\": top_p, \"max_new_tokens\": 1000}\n",
+    "        temperature=temperature,\n",
+    "        top_p=top_p,\n",
     "    )\n",
     "    )\n",
-    "    return llm(prompt)\n",
+    "    return response.choices[0].message.content\n",
+    "        \n",
     "\n",
     "\n",
-    "def chat_completion(\n",
-    "    messages: List[Dict],\n",
-    "    model = DEFAULT_MODEL,\n",
+    "def completion(\n",
+    "    prompt: str,\n",
+    "    model: str = DEFAULT_MODEL,\n",
     "    temperature: float = 0.6,\n",
     "    temperature: float = 0.6,\n",
     "    top_p: float = 0.9,\n",
     "    top_p: float = 0.9,\n",
     ") -> str:\n",
     ") -> str:\n",
-    "    history = ChatMessageHistory()\n",
-    "    for message in messages:\n",
-    "        if message[\"role\"] == \"user\":\n",
-    "            history.add_user_message(message[\"content\"])\n",
-    "        elif message[\"role\"] == \"assistant\":\n",
-    "            history.add_ai_message(message[\"content\"])\n",
-    "        else:\n",
-    "            raise Exception(\"Unknown role\")\n",
-    "    return completion(\n",
-    "        get_buffer_string(\n",
-    "            history.messages,\n",
-    "            human_prefix=\"USER\",\n",
-    "            ai_prefix=\"ASSISTANT\",\n",
-    "        ),\n",
-    "        model,\n",
-    "        temperature,\n",
-    "        top_p,\n",
+    "    return chat_completion(\n",
+    "        [user(prompt)],\n",
+    "        model=model,\n",
+    "        temperature=temperature,\n",
+    "        top_p=top_p,\n",
     "    )\n",
     "    )\n",
     "\n",
     "\n",
-    "def assistant(content: str):\n",
-    "    return { \"role\": \"assistant\", \"content\": content }\n",
-    "\n",
-    "def user(content: str):\n",
-    "    return { \"role\": \"user\", \"content\": content }\n",
-    "\n",
     "def complete_and_print(prompt: str, model: str = DEFAULT_MODEL):\n",
     "def complete_and_print(prompt: str, model: str = DEFAULT_MODEL):\n",
     "    print(f'==============\\n{prompt}\\n==============')\n",
     "    print(f'==============\\n{prompt}\\n==============')\n",
     "    response = completion(prompt, model)\n",
     "    response = completion(prompt, model)\n",
@@ -223,7 +225,7 @@
    "source": [
    "source": [
     "### Completion APIs\n",
     "### Completion APIs\n",
     "\n",
     "\n",
-    "Llama 2 models tend to be wordy and explain their rationale. Later we'll explore how to manage the response length."
+    "Let's try Llama 3!"
   ]
  },
  {
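A first call with the helper above might look like this (the prompts are illustrative):

```python
# Sketch: simple single-prompt completions.
complete_and_print("The typical color of the sky is: ")
complete_and_print("Which model version are you?")
```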
@@ -345,7 +347,7 @@
    "cell_type": "markdown",
    "cell_type": "markdown",
    "metadata": {},
    "metadata": {},
    "source": [
    "source": [
-    "You can think about giving explicit instructions as using rules and restrictions to how Llama 2 responds to your prompt.\n",
+    "You can think about giving explicit instructions as using rules and restrictions to how Llama 3 responds to your prompt.\n",
     "\n",
     "\n",
     "- Stylization\n",
     "- Stylization\n",
     "    - `Explain this to me like a topic on a children's educational network show teaching elementary students.`\n",
     "    - `Explain this to me like a topic on a children's educational network show teaching elementary students.`\n",
@@ -387,9 +389,9 @@
     "\n",
     "\n",
     "#### Zero-Shot Prompting\n",
     "#### Zero-Shot Prompting\n",
     "\n",
     "\n",
-    "Large language models like Llama 2 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called \"zero-shot prompting\".\n",
+    "Large language models like Llama 3 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called \"zero-shot prompting\".\n",
     "\n",
     "\n",
-    "Let's try using Llama 2 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting."
+    "Let's try using Llama 3 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting."
   ]
  },
  {
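A zero-shot sentiment prompt of the kind described could be sketched as (the example texts are illustrative):

```python
# Sketch: zero-shot sentiment detection -- no labeled examples are provided.
complete_and_print("Text: This was the best movie I've ever seen! \n The sentiment of the text is: ")
complete_and_print("Text: The director was trying too hard. \n The sentiment of the text is: ")
```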
@@ -459,9 +461,9 @@
    "source": [
    "source": [
     "### Role Prompting\n",
     "### Role Prompting\n",
     "\n",
     "\n",
-    "Llama 2 will often give more consistent responses when given a role ([Kong et al. (2023)](https://browse.arxiv.org/pdf/2308.07702.pdf)). Roles give context to the LLM on what type of answers are desired.\n",
+    "Llama will often give more consistent responses when given a role ([Kong et al. (2023)](https://browse.arxiv.org/pdf/2308.07702.pdf)). Roles give context to the LLM on what type of answers are desired.\n",
     "\n",
     "\n",
-    "Let's use Llama 2 to create a more focused, technical response for a question around the pros and cons of using PyTorch."
+    "Let's use Llama 3 to create a more focused, technical response for a question around the pros and cons of using PyTorch."
   ]
  },
  {
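The comparison described above can be sketched by asking the same question without and then with a role:

```python
# Sketch: the same question, first without a role, then with an expert role.
complete_and_print("Explain the pros and cons of using PyTorch.")

complete_and_print(
    "Your role is a machine learning expert who gives highly technical advice to senior engineers. "
    "Explain the pros and cons of using PyTorch."
)
```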
@@ -484,7 +486,9 @@
    "source": [
    "source": [
     "### Chain-of-Thought\n",
     "### Chain-of-Thought\n",
     "\n",
     "\n",
-    "Simply adding a phrase encouraging step-by-step thinking \"significantly improves the ability of large language models to perform complex reasoning\" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called \"CoT\" or \"Chain-of-Thought\" prompting:"
+    "Simply adding a phrase encouraging step-by-step thinking \"significantly improves the ability of large language models to perform complex reasoning\" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called \"CoT\" or \"Chain-of-Thought\" prompting.\n",
+    "\n",
+    "Llama 3 now reasons step-by-step naturally without the addition of the phrase. This section remains for completeness."
   ]
  },
  {
@@ -493,10 +497,12 @@
    "metadata": {},
    "metadata": {},
    "outputs": [],
    "outputs": [],
    "source": [
    "source": [
-    "complete_and_print(\"Who lived longer Elvis Presley or Mozart?\")\n",
-    "# Often gives incorrect answer of \"Mozart\"\n",
+    "prompt = \"Who lived longer, Mozart or Elvis?\"\n",
+    "\n",
+    "complete_and_print(prompt)\n",
+    "# Llama 2 would often give the incorrect answer of \"Mozart\"\n",
     "\n",
     "\n",
-    "complete_and_print(\"Who lived longer Elvis Presley or Mozart? Let's think through this carefully, step by step.\")\n",
+    "complete_and_print(f\"{prompt} Let's think through this carefully, step by step.\")\n",
     "# Gives the correct answer \"Elvis\""
     "# Gives the correct answer \"Elvis\""
    ]
    ]
   },
   },
@@ -523,10 +529,9 @@
     "    response = completion(\n",
     "    response = completion(\n",
     "        \"John found that the average of 15 numbers is 40.\"\n",
     "        \"John found that the average of 15 numbers is 40.\"\n",
     "        \"If 10 is added to each number then the mean of the numbers is?\"\n",
     "        \"If 10 is added to each number then the mean of the numbers is?\"\n",
-    "        \"Report the answer surrounded by three backticks, for example: ```123```\",\n",
-    "        model = LLAMA2_70B_CHAT\n",
+    "        \"Report the answer surrounded by backticks (example: `123`)\",\n",
     "    )\n",
     "    )\n",
-    "    match = re.search(r'```(\\d+)```', response)\n",
+    "    match = re.search(r'`(\\d+)`', response)\n",
     "    if match is None:\n",
     "    if match is None:\n",
     "        return None\n",
     "        return None\n",
     "    return match.group(1)\n",
     "    return match.group(1)\n",
@@ -538,10 +543,10 @@
     "    f\"Final answer: {mode(answers)}\",\n",
     "    f\"Final answer: {mode(answers)}\",\n",
     "    )\n",
     "    )\n",
     "\n",
     "\n",
-    "# Sample runs of Llama-2-70B (all correct):\n",
-    "# [50, 50, 750, 50, 50]  -> 50\n",
-    "# [130, 10, 750, 50, 50] -> 50\n",
-    "# [50, None, 10, 50, 50] -> 50"
+    "# Sample runs of Llama-3-70B (all correct):\n",
+    "# ['60', '50', '50', '50', '50'] -> 50\n",
+    "# ['50', '50', '50', '60', '50'] -> 50\n",
+    "# ['50', '50', '60', '50', '50'] -> 50"
   ]
  },
  {
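For context, the full cell samples the model several times and takes the majority answer; a sketch of that loop (the gen_answer name and sample count of 5 are assumed):

```python
# Sketch of the self-consistency loop around the fragments above: sample
# several independent generations and vote for the most common answer.
import re
from statistics import mode

def gen_answer() -> str | None:
    response = completion(
        "John found that the average of 15 numbers is 40."
        "If 10 is added to each number then the mean of the numbers is?"
        "Report the answer surrounded by backticks (example: `123`)",
    )
    match = re.search(r'`(\d+)`', response)
    return match.group(1) if match else None

answers = [gen_answer() for _ in range(5)]
print(f"Answers: {answers}", f"Final answer: {mode(answers)}")
```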
@@ -560,7 +565,7 @@
    "metadata": {},
    "metadata": {},
    "outputs": [],
    "outputs": [],
    "source": [
    "source": [
-    "complete_and_print(\"What is the capital of the California?\", model = LLAMA2_70B_CHAT)\n",
+    "complete_and_print(\"What is the capital of the California?\")\n",
     "# Gives the correct answer \"Sacramento\""
     "# Gives the correct answer \"Sacramento\""
    ]
    ]
   },
   },
@@ -677,7 +682,6 @@
     "    \"\"\"\n",
     "    \"\"\"\n",
     "    # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n",
     "    # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n",
     "    \"\"\",\n",
     "    \"\"\",\n",
-    "    model=\"meta/codellama-34b:67942fd0f55b66da802218a19a8f0e1d73095473674061a6ea19f2dc8c053152\"\n",
     ")"
     ")"
    ]
    ]
   },
   },
@@ -687,12 +691,10 @@
    "metadata": {},
    "metadata": {},
    "outputs": [],
    "outputs": [],
    "source": [
    "source": [
-    "# The following code was generated by Code Llama 34B:\n",
+    "# The following code was generated by Llama 3 70B:\n",
     "\n",
     "\n",
-    "num1 = (-5 + 93 * 4 - 0)\n",
-    "num2 = (4**4 + -7 + 0 * 5)\n",
-    "answer = num1 * num2\n",
-    "print(answer)"
+    "result = ((-5 + 93 * 4 - 0) * (4**4 - 7 + 0 * 5))\n",
+    "print(result)"
   ]
  },
  {
@@ -702,7 +704,7 @@
    "source": [
    "source": [
     "### Limiting Extraneous Tokens\n",
     "### Limiting Extraneous Tokens\n",
     "\n",
     "\n",
-    "A common struggle is getting output without extraneous tokens (ex. \"Sure! Here's more information on...\").\n",
+    "A common struggle with Llama 2 is getting output without extraneous tokens (ex. \"Sure! Here's more information on...\"), even if explicit instructions are given to Llama 2 to be concise and no preamble. Llama 3 can better follow instructions.\n",
     "\n",
     "\n",
     "Check out this improvement that combines a role, rules and restrictions, explicit instructions, and an example:"
     "Check out this improvement that combines a role, rules and restrictions, explicit instructions, and an example:"
    ]
    ]
@@ -715,7 +717,6 @@
    "source": [
    "source": [
     "complete_and_print(\n",
     "complete_and_print(\n",
     "    \"Give me the zip code for Menlo Park in JSON format with the field 'zip_code'\",\n",
     "    \"Give me the zip code for Menlo Park in JSON format with the field 'zip_code'\",\n",
-    "    model = LLAMA2_70B_CHAT,\n",
     ")\n",
     ")\n",
     "# Likely returns the JSON and also \"Sure! Here's the JSON...\"\n",
     "# Likely returns the JSON and also \"Sure! Here's the JSON...\"\n",
     "\n",
     "\n",
@@ -726,7 +727,6 @@
     "    Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}\n",
     "    Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}\n",
     "    Now here is my question: What is the zip code of Menlo Park?\n",
     "    Now here is my question: What is the zip code of Menlo Park?\n",
     "    \"\"\",\n",
     "    \"\"\",\n",
-    "    model = LLAMA2_70B_CHAT,\n",
     ")\n",
     ")\n",
     "# \"{'zip_code': 94025}\""
     "# \"{'zip_code': 94025}\""
    ]
    ]
@@ -770,7 +770,8 @@
    "mimetype": "text/x-python",
    "mimetype": "text/x-python",
    "name": "python",
    "name": "python",
    "nbconvert_exporter": "python",
    "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3"
+   "pygments_lexer": "ipython3",
+   "version": "3.10.14"
  },
  "last_base_url": "https://bento.edge.x2p.facebook.net/",
  "last_kernel_id": "161e2a7b-2d2b-4995-87f3-d1539860ecac",