@@ -5,11 +5,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# Prompt Engineering with Llama 2\n",
+ "<a href=\"https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/quickstart/Prompt_Engineering_with_Llama_3.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
+ "\n",
+ "# Prompt Engineering with Llama 3\n",
"\n",
"Prompt engineering is using natural language to produce a desired response from a large language model (LLM).\n",
"\n",
- "This interactive guide covers prompt engineering & best practices with Llama 2."
+ "This interactive guide covers prompt engineering & best practices with Llama 3."
]
},
{
@@ -41,7 +43,13 @@
"\n",
"In 2023, Meta introduced the [Llama language models](https://ai.meta.com/llama/) (Llama Chat, Code Llama, Llama Guard). These are general purpose, state-of-the-art LLMs.\n",
"\n",
- "Llama 2 models come in 7 billion, 13 billion, and 70 billion parameter sizes. Smaller models are cheaper to deploy and run (see: deployment and performance); larger models are more capable.\n",
+ "Llama models come in varying parameter sizes. The smaller models are cheaper to deploy and run; the larger models are more capable.\n",
+ "\n",
+ "#### Llama 3\n",
+ "1. `llama-3-8b` - base pretrained 8 billion parameter model\n",
+ "1. `llama-3-70b` - base pretrained 70 billion parameter model\n",
+ "1. `llama-3-8b-instruct` - instruction fine-tuned 8 billion parameter model\n",
+ "1. `llama-3-70b-instruct` - instruction fine-tuned 70 billion parameter model (flagship)\n",
"\n",
"#### Llama 2\n",
"1. `llama-2-7b` - base pretrained 7 billion parameter model\n",
@@ -69,12 +77,15 @@
"1. `codellama-7b` - code fine-tuned 7 billion parameter model\n",
"1. `codellama-13b` - code fine-tuned 13 billion parameter model\n",
"1. `codellama-34b` - code fine-tuned 34 billion parameter model\n",
+ "1. `codellama-70b` - code fine-tuned 70 billion parameter model\n",
"1. `codellama-7b-instruct` - code & instruct fine-tuned 7 billion parameter model\n",
"2. `codellama-13b-instruct` - code & instruct fine-tuned 13 billion parameter model\n",
"3. `codellama-34b-instruct` - code & instruct fine-tuned 34 billion parameter model\n",
+ "3. `codellama-70b-instruct` - code & instruct fine-tuned 70 billion parameter model\n",
"1. `codellama-7b-python` - Python fine-tuned 7 billion parameter model\n",
"2. `codellama-13b-python` - Python fine-tuned 13 billion parameter model\n",
- "3. `codellama-34b-python` - Python fine-tuned 34 billion parameter model"
+ "3. `codellama-34b-python` - Python fine-tuned 34 billion parameter model\n",
+ "3. `codellama-70b-python` - Python fine-tuned 70 billion parameter model"
]
},
{
@@ -86,11 +97,11 @@
"\n",
"Large language models are deployed and accessed in a variety of ways, including:\n",
"\n",
- "1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama 2 on your Macbook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).\n",
+ "1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama on your Macbook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).\n",
"    * Best for privacy/security or if you already have a GPU.\n",
- "1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama 2 on cloud providers like AWS, Azure, GCP, and others.\n",
+ "1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama on cloud providers like AWS, Azure, GCP, and others.\n",
"    * Best for customizing models and their runtime (ex. fine-tuning a model for your use case).\n",
- "1. **Hosted API**: Call LLMs directly via an API. There are many companies that provide Llama 2 inference APIs including AWS Bedrock, Replicate, Anyscale, Together and others.\n",
+ "1. **Hosted API**: Call LLMs directly via an API. There are many companies that provide Llama inference APIs including AWS Bedrock, Replicate, Anyscale, Together and others.\n",
"    * Easiest option overall."
]
},
@@ -118,11 +129,11 @@
"\n",
"> Our destiny is written in the stars.\n",
"\n",
- "...is tokenized into `[\"our\", \"dest\", \"iny\", \"is\", \"written\", \"in\", \"the\", \"stars\"]` for Llama 2.\n",
+ "...is tokenized into `[\"Our\", \" destiny\", \" is\", \" written\", \" in\", \" the\", \" stars\", \".\"]` for Llama 3. See [this](https://tiktokenizer.vercel.app/?model=meta-llama%2FMeta-Llama-3-8B) for an interactive tokenizer tool.\n",
"\n",
"Tokens matter most when you consider API pricing and internal behavior (ex. hyperparameters).\n",
"\n",
- "Each model has a maximum context length that your prompt cannot exceed. That's 4096 tokens for Llama 2 and 100K for Code Llama. \n"
+ "Each model has a maximum context length that your prompt cannot exceed. That's 8K tokens for Llama 3, 4K for Llama 2, and 100K for Code Llama. \n"
]
},
{
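To make the token counts above concrete, the sketch below (our illustration, not part of the notebook; it assumes `transformers` is installed and that you have access to the gated `meta-llama/Meta-Llama-3-8B` repository on Hugging Face) shows how to inspect Llama 3's tokenization:

```python
# Sketch: inspect how the Llama 3 tokenizer splits a sentence and count tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "Our destiny is written in the stars."
tokens = tokenizer.tokenize(text)                       # subword strings
ids = tokenizer.encode(text, add_special_tokens=False)  # integer token ids

print(tokens)    # leading spaces are folded into the following token
print(len(ids))  # this count is what accrues against the 8K context window
```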
@@ -132,7 +143,7 @@
"source": [
"## Notebook Setup\n",
"\n",
- "The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 2 chat using [Replicate](https://replicate.com/meta/llama-2-70b-chat) and use LangChain to easily set up a chat completion API.\n",
+ "The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 3 chat using [Groq](https://console.groq.com/playground?model=llama3-70b-8192).\n",
"\n",
"To install prerequisites run:"
]
@@ -143,7 +154,8 @@
"metadata": {},
"outputs": [],
"source": [
- "pip install langchain replicate"
+ "import sys\n",
+ "!{sys.executable} -m pip install groq"
]
},
{
@@ -152,64 +164,54 @@
"metadata": {},
"outputs": [],
"source": [
- "from typing import Dict, List\n",
- "from langchain.llms import Replicate\n",
- "from langchain.memory import ChatMessageHistory\n",
- "from langchain.schema.messages import get_buffer_string\n",
"import os\n",
+ "from typing import Dict, List\n",
+ "from groq import Groq\n",
"\n",
- "# Get a free API key from https://replicate.com/account/api-tokens\n",
- "os.environ[\"REPLICATE_API_TOKEN\"] = \"YOUR_KEY_HERE\"\n",
+ "# Get a free API key from https://console.groq.com/keys\n",
+ "os.environ[\"GROQ_API_KEY\"] = \"YOUR_GROQ_API_KEY\"\n",
"\n",
- "LLAMA2_70B_CHAT = \"meta/llama-2-70b-chat:2d19859030ff705a87c746f7e96eea03aefb71f166725aee39692f1476566d48\"\n",
- "LLAMA2_13B_CHAT = \"meta/llama-2-13b-chat:f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d\"\n",
+ "LLAMA3_70B_INSTRUCT = \"llama3-70b-8192\"\n",
+ "LLAMA3_8B_INSTRUCT = \"llama3-8b-8192\"\n",
"\n",
- "# We'll default to the smaller 13B model for speed; change to LLAMA2_70B_CHAT for more advanced (but slower) generations\n",
- "DEFAULT_MODEL = LLAMA2_13B_CHAT\n",
+ "DEFAULT_MODEL = LLAMA3_70B_INSTRUCT\n",
"\n",
- "def completion(\n",
- "    prompt: str,\n",
- "    model: str = DEFAULT_MODEL,\n",
+ "client = Groq()\n",
+ "\n",
+ "def assistant(content: str):\n",
+ "    return { \"role\": \"assistant\", \"content\": content }\n",
+ "\n",
+ "def user(content: str):\n",
+ "    return { \"role\": \"user\", \"content\": content }\n",
+ "\n",
+ "def chat_completion(\n",
+ "    messages: List[Dict],\n",
+ "    model = DEFAULT_MODEL,\n",
"    temperature: float = 0.6,\n",
"    top_p: float = 0.9,\n",
") -> str:\n",
- "    llm = Replicate(\n",
+ "    response = client.chat.completions.create(\n",
+ "        messages=messages,\n",
"        model=model,\n",
- "        model_kwargs={\"temperature\": temperature,\"top_p\": top_p, \"max_new_tokens\": 1000}\n",
+ "        temperature=temperature,\n",
+ "        top_p=top_p,\n",
"    )\n",
- "    return llm(prompt)\n",
+ "    return response.choices[0].message.content\n",
+ "    \n",
"\n",
- "def chat_completion(\n",
|
|
|
- " messages: List[Dict],\n",
|
|
|
- " model = DEFAULT_MODEL,\n",
|
|
|
+ "def completion(\n",
|
|
|
+ " prompt: str,\n",
|
|
|
+ " model: str = DEFAULT_MODEL,\n",
|
|
|
" temperature: float = 0.6,\n",
|
|
|
" top_p: float = 0.9,\n",
|
|
|
") -> str:\n",
|
|
|
- " history = ChatMessageHistory()\n",
|
|
|
- " for message in messages:\n",
|
|
|
- " if message[\"role\"] == \"user\":\n",
|
|
|
- " history.add_user_message(message[\"content\"])\n",
|
|
|
- " elif message[\"role\"] == \"assistant\":\n",
|
|
|
- " history.add_ai_message(message[\"content\"])\n",
|
|
|
- " else:\n",
|
|
|
- " raise Exception(\"Unknown role\")\n",
|
|
|
- " return completion(\n",
|
|
|
- " get_buffer_string(\n",
|
|
|
- " history.messages,\n",
|
|
|
- " human_prefix=\"USER\",\n",
|
|
|
- " ai_prefix=\"ASSISTANT\",\n",
|
|
|
- " ),\n",
|
|
|
- " model,\n",
|
|
|
- " temperature,\n",
|
|
|
- " top_p,\n",
|
|
|
+ " return chat_completion(\n",
|
|
|
+ " [user(prompt)],\n",
|
|
|
+ " model=model,\n",
|
|
|
+ " temperature=temperature,\n",
|
|
|
+ " top_p=top_p,\n",
|
|
|
" )\n",
|
|
|
"\n",
|
|
|
- "def assistant(content: str):\n",
|
|
|
- " return { \"role\": \"assistant\", \"content\": content }\n",
|
|
|
- "\n",
|
|
|
- "def user(content: str):\n",
|
|
|
- " return { \"role\": \"user\", \"content\": content }\n",
|
|
|
- "\n",
|
|
|
"def complete_and_print(prompt: str, model: str = DEFAULT_MODEL):\n",
|
|
|
" print(f'==============\\n{prompt}\\n==============')\n",
|
|
|
" response = completion(prompt, model)\n",
|
|
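Once the setup cell above has been run with a valid `GROQ_API_KEY`, a quick sanity check of the helpers could look like the sketch below (the prompts are our own illustrations):

```python
# Sketch: exercise the helpers defined in the setup cell.
# `completion` wraps a single user turn; `chat_completion` takes a full
# message list built with the `user`/`assistant` helpers.
print(completion("In one sentence, what is prompt engineering?"))

print(
    chat_completion(
        [
            user("My favorite color is blue."),
            assistant("Nice choice! Blue it is."),
            user("What is my favorite color?"),
        ]
    )
)
# The second call should answer "blue", since the earlier turns are sent as context.
```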
@@ -223,7 +225,7 @@
"source": [
"### Completion APIs\n",
"\n",
- "Llama 2 models tend to be wordy and explain their rationale. Later we'll explore how to manage the response length."
+ "Let's try Llama 3!"
]
},
{
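A first call can be as simple as the sketch below (the prompt is our own illustration):

```python
# Sketch: a minimal completion call using the helper from the setup cell.
complete_and_print("The typical color of the sky is: ")
```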
@@ -345,7 +347,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "You can think about giving explicit instructions as using rules and restrictions to how Llama 2 responds to your prompt.\n",
+ "You can think about giving explicit instructions as using rules and restrictions on how Llama 3 responds to your prompt.\n",
"\n",
"- Stylization\n",
"    - `Explain this to me like a topic on a children's educational network show teaching elementary students.`\n",
@@ -387,9 +389,9 @@
"\n",
"#### Zero-Shot Prompting\n",
"\n",
- "Large language models like Llama 2 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called \"zero-shot prompting\".\n",
+ "Large language models like Llama 3 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called \"zero-shot prompting\".\n",
"\n",
- "Let's try using Llama 2 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting."
+ "Let's try using Llama 3 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting."
]
},
{
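As an illustration of zero-shot prompting (the prompt wording here is ours; the notebook's own code cell may differ), a sentiment check with the `complete_and_print` helper could be:

```python
# Sketch: zero-shot sentiment detection -- an instruction and the text to
# classify, with no worked examples included in the prompt.
complete_and_print(
    "Classify the sentiment of the following review as positive, negative, or neutral: "
    "'The battery life is great, but the screen scratches far too easily.'"
)
```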
@@ -459,9 +461,9 @@
"source": [
"### Role Prompting\n",
"\n",
- "Llama 2 will often give more consistent responses when given a role ([Kong et al. (2023)](https://browse.arxiv.org/pdf/2308.07702.pdf)). Roles give context to the LLM on what type of answers are desired.\n",
+ "Llama will often give more consistent responses when given a role ([Kong et al. (2023)](https://browse.arxiv.org/pdf/2308.07702.pdf)). Roles give context to the LLM on what type of answers are desired.\n",
"\n",
- "Let's use Llama 2 to create a more focused, technical response for a question around the pros and cons of using PyTorch."
+ "Let's use Llama 3 to create a more focused, technical response for a question around the pros and cons of using PyTorch."
]
},
{
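For example (an illustrative sketch; the notebook's own cell may phrase this differently), assigning a role before the PyTorch question steers the answer toward technical trade-offs:

```python
# Sketch: the same question without and with a role.
complete_and_print("Explain the pros and cons of using PyTorch.")

complete_and_print(
    "Your role is a machine learning expert who gives highly technical advice "
    "to senior engineers. Explain the pros and cons of using PyTorch."
)
```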
@@ -484,7 +486,9 @@
"source": [
"### Chain-of-Thought\n",
"\n",
- "Simply adding a phrase encouraging step-by-step thinking \"significantly improves the ability of large language models to perform complex reasoning\" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called \"CoT\" or \"Chain-of-Thought\" prompting:"
+ "Simply adding a phrase encouraging step-by-step thinking \"significantly improves the ability of large language models to perform complex reasoning\" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called \"CoT\" or \"Chain-of-Thought\" prompting.\n",
+ "\n",
+ "Llama 3 now reasons step-by-step naturally without the addition of such a phrase. This section remains for completeness."
]
},
{
@@ -493,10 +497,12 @@
"metadata": {},
"outputs": [],
"source": [
- "complete_and_print(\"Who lived longer Elvis Presley or Mozart?\")\n",
- "# Often gives incorrect answer of \"Mozart\"\n",
+ "prompt = \"Who lived longer, Mozart or Elvis?\"\n",
+ "\n",
+ "complete_and_print(prompt)\n",
+ "# Llama 2 would often give the incorrect answer of \"Mozart\"\n",
"\n",
- "complete_and_print(\"Who lived longer Elvis Presley or Mozart? Let's think through this carefully, step by step.\")\n",
+ "complete_and_print(f\"{prompt} Let's think through this carefully, step by step.\")\n",
"# Gives the correct answer \"Elvis\""
]
},
@@ -523,10 +529,9 @@
"    response = completion(\n",
"        \"John found that the average of 15 numbers is 40.\"\n",
"        \"If 10 is added to each number then the mean of the numbers is?\"\n",
- "        \"Report the answer surrounded by three backticks, for example: ```123```\",\n",
- "        model = LLAMA2_70B_CHAT\n",
+ "        \"Report the answer surrounded by backticks (example: `123`)\",\n",
"    )\n",
- "    match = re.search(r'```(\\d+)```', response)\n",
+ "    match = re.search(r'`(\\d+)`', response)\n",
"    if match is None:\n",
"        return None\n",
"    return match.group(1)\n",
@@ -538,10 +543,10 @@
"    f\"Final answer: {mode(answers)}\",\n",
"    )\n",
"\n",
- "# Sample runs of Llama-2-70B (all correct):\n",
- "# [50, 50, 750, 50, 50] -> 50\n",
- "# [130, 10, 750, 50, 50] -> 50\n",
- "# [50, None, 10, 50, 50] -> 50"
+ "# Sample runs of Llama-3-70B (all correct):\n",
+ "# ['60', '50', '50', '50', '50'] -> 50\n",
+ "# ['50', '50', '50', '60', '50'] -> 50\n",
+ "# ['50', '50', '60', '50', '50'] -> 50"
]
},
{
@@ -560,7 +565,7 @@
"metadata": {},
"outputs": [],
"source": [
- "complete_and_print(\"What is the capital of the California?\", model = LLAMA2_70B_CHAT)\n",
+ "complete_and_print(\"What is the capital of California?\")\n",
"# Gives the correct answer \"Sacramento\""
]
},
@@ -677,7 +682,6 @@
"    \"\"\"\n",
"    # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n",
"    \"\"\",\n",
- "    model=\"meta/codellama-34b:67942fd0f55b66da802218a19a8f0e1d73095473674061a6ea19f2dc8c053152\"\n",
")"
]
},
@@ -687,12 +691,10 @@
"metadata": {},
"outputs": [],
"source": [
- "# The following code was generated by Code Llama 34B:\n",
+ "# The following code was generated by Llama 3 70B:\n",
"\n",
- "num1 = (-5 + 93 * 4 - 0)\n",
- "num2 = (4**4 + -7 + 0 * 5)\n",
- "answer = num1 * num2\n",
- "print(answer)"
+ "result = ((-5 + 93 * 4 - 0) * (4**4 - 7 + 0 * 5))\n",
+ "print(result)"
]
},
{
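As a quick check (ours, not the notebook's), evaluating the expression directly shows what the generated snippet should print:

```python
# Direct evaluation of the expression from the comment above
# (^ in that comment denotes exponentiation, i.e. 4**4).
num1 = -5 + 93 * 4 - 0   # 367
num2 = 4**4 - 7 + 0 * 5  # 249
print(num1 * num2)       # 91383
```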
@@ -702,7 +704,7 @@
"source": [
"### Limiting Extraneous Tokens\n",
"\n",
- "A common struggle is getting output without extraneous tokens (ex. \"Sure! Here's more information on...\").\n",
+ "A common struggle with Llama 2 is getting output without extraneous tokens (ex. \"Sure! Here's more information on...\"), even when it is explicitly instructed to be concise and to skip the preamble. Llama 3 follows such instructions more reliably.\n",
"\n",
"Check out this improvement that combines a role, rules and restrictions, explicit instructions, and an example:"
]
@@ -715,7 +717,6 @@
"source": [
"complete_and_print(\n",
"    \"Give me the zip code for Menlo Park in JSON format with the field 'zip_code'\",\n",
- "    model = LLAMA2_70B_CHAT,\n",
")\n",
"# Likely returns the JSON and also \"Sure! Here's the JSON...\"\n",
"\n",
@@ -726,7 +727,6 @@
"    Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}\n",
"    Now here is my question: What is the zip code of Menlo Park?\n",
"    \"\"\",\n",
- "    model = LLAMA2_70B_CHAT,\n",
")\n",
"# \"{'zip_code': 94025}\""
]
@@ -770,7 +770,8 @@
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
- "pygments_lexer": "ipython3"
+ "pygments_lexer": "ipython3",
+ "version": "3.10.14"
},
"last_base_url": "https://bento.edge.x2p.facebook.net/",
"last_kernel_id": "161e2a7b-2d2b-4995-87f3-d1539860ecac",