{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports & Env Setup" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "import sys\n", "import os\n", "from dotenv import load_dotenv\n", "load_dotenv()\n", "\n", "import dspy\n", "sys.path.append(os.path.abspath('../'))\n", "from prompt_migrator.benchmarks import llama_mmlu_pro, leaderboard_mmlu_pro" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configuration" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "NUM_THREADS = 16\n", "\n", "FEW_SHOTS = 5\n", "\n", "# See https://docs.litellm.ai/docs/providers/vllm for details\n", "TASK_MODEL = dspy.LM(\n", " \"hosted_vllm/meta-llama/Llama-3.3-70B-Instruct\",\n", " api_base='http://localhost:8000/v1',  # base URL of the local vLLM server (OpenAI-compatible endpoint)\n", " # api_version: Optional[str] = None,\n", " # api_key: Optional[str] = None,\n", " # seed: Optional[int] = None,\n", " # max_tokens: Optional[int] = None,\n", " # timeout: Optional[Union[float, int]] = None,\n", ")\n", "PROMPT_MODEL = dspy.LM(\n", " \"hosted_vllm/meta-llama/Llama-3.3-70B-Instruct\",\n", " api_base='http://localhost:8000/v1',  # base URL of the local vLLM server (OpenAI-compatible endpoint)\n", " # api_version: Optional[str] = None,\n", " # api_key: Optional[str] = None,\n", " # seed: Optional[int] = None,\n", " # max_tokens: Optional[int] = None,\n", " # timeout: Optional[Union[float, int]] = None,\n", ")\n", "\n", "dspy.configure(lm=TASK_MODEL)\n", "\n", "# Swap in llama_mmlu_pro (or another benchmark module) to use a different benchmark\n", "benchmark = leaderboard_mmlu_pro\n", "\n", "# Without chain of thought:\n", "# program = dspy.Predict(\n", "# benchmark.signature(\"\")\n", "# )\n", "\n", "# With chain of thought:\n", "program = dspy.ChainOfThought(\n", " benchmark.signature(\"\") # put your initial system prompt here, or leave blank\n", ")\n", "\n", "evaluate = dspy.Evaluate(\n", " devset=[],\n", " metric=benchmark.metric,\n", " 
num_threads=NUM_THREADS,\n", " display_progress=True,\n", " display_table=True,\n", " return_all_scores=True,\n", " return_outputs=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load dataset" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1203, 2165, 8664)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trainset, valset, testset = benchmark.datasets(\n", " train_size=0.1,\n", " validation_size=0.2,\n", ")\n", "\n", "len(trainset), len(valset), len(testset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optimize Subset + Evaluation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025/01/15 17:44:49 INFO dspy.teleprompt.mipro_optimizer_v2: \n", "RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:\n", "num_trials: 7\n", "minibatch: False\n", "num_candidates: 5\n", "valset size: 20\n", "\n", "2025/01/15 17:44:49 INFO dspy.teleprompt.mipro_optimizer_v2: \n", "==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==\n", "2025/01/15 17:44:49 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.\n", "\n", "2025/01/15 17:44:49 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=5 sets of demonstrations...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Bootstrapping set 1/5\n", "Bootstrapping set 2/5\n", "Bootstrapping set 3/5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " 20%|███████████████████████████████████████████████ | 4/20 [00:20<01:23, 5.19s/it]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.\n", "Bootstrapping set 4/5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " 
40%|██████████████████████████████████████████████████████████████████████████████████████████████ | 8/20 [00:52<01:18, 6.56s/it]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Bootstrapped 4 full traces after 8 examples for up to 1 rounds, amounting to 8 attempts.\n", "Bootstrapping set 5/5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " 20%|███████████████████████████████████████████████ | 4/20 [00:21<01:24, 5.29s/it]\n", "2025/01/15 17:46:23 INFO dspy.teleprompt.mipro_optimizer_v2: \n", "==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==\n", "2025/01/15 17:46:23 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Bootstrapped 2 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025/01/15 17:46:47 INFO dspy.teleprompt.mipro_optimizer_v2: \n", "Proposing instructions...\n", "\n", "2025/01/15 17:47:40 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:\n", "\n", "2025/01/15 17:47:40 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `question`, `options`, produce the fields `reasoning`, `answer`.\n", "\n", "2025/01/15 17:47:40 INFO dspy.teleprompt.mipro_optimizer_v2: 1: To address the task effectively, provide a detailed, step-by-step explanation for your reasoning when answering multiple-choice questions across various subjects, including biology, chemistry, physics, and social sciences. Ensure your response includes the following elements: \n", "1. A clear understanding of the question being asked.\n", "2. An evaluation of each option based on relevant knowledge and critical thinking.\n", "3. A logical deduction of the most appropriate answer.\n", "4. 
A concise summary of your reasoning process.\n", "5. The final answer choice selected from the provided options.\n", "\n", "When constructing your response, consider the complexity and diversity of the questions, and tailor your reasoning to demonstrate a broad range of knowledge and analytical skills. This approach will facilitate the development of a robust and reliable question-answering system capable of handling a wide spectrum of educational and general knowledge queries.\n", "\n", "2025/01/15 17:47:40 INFO dspy.teleprompt.mipro_optimizer_v2: 2: To answer multiple-choice questions that require reasoning and analysis of the subject matter, follow these steps: \n", "\n", "1. Read the question carefully and identify the key concepts and information provided.\n", "2. Analyze the options and determine which ones are plausible based on the information provided in the question.\n", "3. Use a combination of natural language processing and knowledge retrieval to generate a step-by-step reasoning process to arrive at an answer.\n", "4. Evaluate the reasoning process and select the most appropriate answer based on the analysis.\n", "5. Provide a clear and concise explanation of the reasoning process used to arrive at the answer.\n", "\n", "Given the fields `question`, `options`, produce the fields `reasoning`, `answer` by following the above steps and using a language model to generate a step-by-step explanation of how to arrive at the answer.\n", "\n", "2025/01/15 17:47:40 INFO dspy.teleprompt.mipro_optimizer_v2: 3: To address the given task effectively, I propose the following instruction: \n", "\n", "\"Given a multiple-choice question across various subjects, including but not limited to biology, chemistry, physics, and social sciences, and a list of possible options, generate a detailed, step-by-step reasoning process to arrive at the correct answer. Consider the context, key concepts, and any relevant theories or principles that apply to the question. 
Ensure the reasoning is clear, logical, and easy to follow, and conclude by selecting the correct answer from the provided options.\n", "\n", "2025/01/15 17:47:40 INFO dspy.teleprompt.mipro_optimizer_v2: 4: You are a highly skilled expert in a high-stakes testing environment, and your task is to answer a series of challenging multiple-choice questions to determine your suitability for a prestigious position. The questions will cover a wide range of subjects, including biology, chemistry, physics, and social sciences. You must carefully read each question, analyze the options, and provide a step-by-step reasoning process to arrive at the correct answer. The correct answer and your reasoning will be evaluated by a panel of judges, and your performance will determine your eligibility for the position. Given the fields `question`, `options`, produce the fields `reasoning`, `answer` to demonstrate your expertise and critical thinking skills.\n", "\n", "2025/01/15 17:47:40 INFO dspy.teleprompt.mipro_optimizer_v2: \n", "\n", "2025/01/15 17:47:40 INFO dspy.teleprompt.mipro_optimizer_v2: Evaluating the default program...\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Average Metric: 5.00 / 20 (25.0%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:24<00:00, 1.24s/it]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025/01/15 17:48:05 INFO dspy.evaluate.evaluate: Average Metric: 5 / 20 (25.0%)\n", "2025/01/15 17:48:05 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 25.0\n", "\n", "2025/01/15 17:48:05 INFO dspy.teleprompt.mipro_optimizer_v2: ==> STEP 3: FINDING OPTIMAL PROMPT PARAMETERS <==\n", "2025/01/15 17:48:05 INFO dspy.teleprompt.mipro_optimizer_v2: We will evaluate the program over a series of trials with different combinations of instructions and 
few-shot examples to find the optimal combination using Bayesian Optimization.\n", "\n", "2025/01/15 17:48:05 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 1 / 7 =====\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Average Metric: 15.00 / 20 (75.0%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:23<00:00, 1.19s/it]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025/01/15 17:48:29 INFO dspy.evaluate.evaluate: Average Metric: 15 / 20 (75.0%)\n", "2025/01/15 17:48:29 INFO dspy.teleprompt.mipro_optimizer_v2: \u001b[92mBest full score so far!\u001b[0m Score: 75.0\n", "2025/01/15 17:48:29 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 75.0 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 1'].\n", "2025/01/15 17:48:29 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [25.0, 75.0]\n", "2025/01/15 17:48:29 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 75.0\n", "2025/01/15 17:48:29 INFO dspy.teleprompt.mipro_optimizer_v2: =======================\n", "\n", "\n", "2025/01/15 17:48:29 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 2 / 7 =====\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Average Metric: 13.00 / 16 (81.2%): 80%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 16/20 [00:12<00:01, 2.94it/s]" ] } ], "source": [ "subset_size = 20\n", "optimizer = dspy.MIPROv2(\n", " metric=benchmark.metric,\n", " auto=\"light\",\n", " num_threads=NUM_THREADS,\n", " task_model=TASK_MODEL,\n", " prompt_model=PROMPT_MODEL,\n", " max_labeled_demos=FEW_SHOTS,\n", ")\n", "\n", "optimized_program = optimizer.compile(\n", " program,\n", " trainset=trainset[:subset_size],\n", " 
valset=valset[:subset_size],\n", " requires_permission_to_run=False,\n", ")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BEST PROMPT:\n", " Given the fields `question`, `options`, produce the fields `reasoning`, `answer`.\n" ] } ], "source": [ "print(\"BEST PROMPT:\\n\", optimized_program.predict.signature.instructions)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average Metric: 11.00 / 20 (55.0%): 100%|██████████| 20/20 [00:12<00:00, 1.65it/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025/01/15 13:55:52 INFO dspy.evaluate.evaluate: Average Metric: 11 / 20 (55.0%)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/html": [ "
\n", " | question | \n", "options | \n", "example_answer | \n", "reasoning | \n", "pred_answer | \n", "metric | \n", "
---|---|---|---|---|---|---|
0 | \n", "In 1935 roughly how many Americans were in favor of Social Securit... | \n", "[A. 20%, B. 30%, C. 60%, D. 100%, E. 40%, F. 10%, G. 80%, H. 90%, ... | \n", "H | \n", "Not supplied for this particular example. | \n", "C | \n", "\n", " |
1 | \n", "A circular opening of radius 0.8mm in an opaque screen is illumina... | \n", "['A. Wavelength of light for bright spot: 1.25 × 10^-6 m, Waveleng... | \n", "B | \n", "To find the wavelength of light for both bright and dark spots, we... | \n", "B. Wavelength of light for bright spot: 1.17 × 10^-6 m, Wavelength... | \n", "\n", " |
2 | \n", "For the dissociation reaction cl_2(g) > 2Cl(g) at 1200 °K, calcula... | \n", "['A. 1.10 × 10^-4', 'B. 6.89 × 10^-6', 'C. 8.97 × 10^-5', 'D. 1.23... | \n", "G | \n", "To calculate the equilibrium constant \\( K_p \\) for the dissociati... | \n", "B | \n", "\n", " |
3 | \n", "With what force does the Earth attract the moon? | \n", "['A. 1.2 × 10^25 dynes', 'B. 5.0 × 10^25 dynes', 'C. 3.0 × 10^25 d... | \n", "D | \n", "The force with which the Earth attracts the Moon can be calculated... | \n", "D | \n", "✔️ [True] | \n", "
4 | \n", "A beam of electrons has speed 10^7 m/s. It is desired to use the m... | \n", "['A. 0.1 m', 'B. 1 mm', 'C. 1 μm', 'D. 0.01 m', 'E. 1 m', 'F. 10 m... | \n", "E | \n", "To find the radius of the circle in which the electron beam will t... | \n", "E | \n", "✔️ [True] | \n", "
5 | \n", "Calculate the roost probable distribution and the thermodynamicpro... | \n", "['A. most probable distribution is number 3; \\\\Omega = 250; entrop... | \n", "I | \n", "Not supplied for this particular example. | \n", "B | \n", "\n", " |
6 | \n", "A cylinder with a movable piston contains a gas at pressure P = 1 ... | \n", "['A. 6 × 10^5 Pa', 'B. 9 × 10^5 Pa', 'C. 3 × 10^5 Pa', 'D. 5 × 10^... | \n", "H | \n", "According to Boyle's Law, for a given amount of gas at constant te... | \n", "H | \n", "✔️ [True] | \n", "
7 | \n", "If current real GDP is $5000 and full employment real GDP is at $4... | \n", "['A. A decrease in taxes and buying bonds in an open market operat... | \n", "F | \n", "The current real GDP of $5000 is above the full employment real GD... | \n", "G | \n", "\n", " |
8 | \n", "Describe the function of the lateral-line system in fishes. | \n", "['A. The lateral-line system in fishes helps with the digestion of... | \n", "G | \n", "The lateral-line system in fishes is a specialized sensory system ... | \n", "G | \n", "✔️ [True] | \n", "
9 | \n", "In class, John's teacher tells him that she will give him the coin... | \n", "['A. sensory memory decay', 'B. retroactive interference', 'C. fai... | \n", "C | \n", "John's inability to identify the pictures on the coins and bills, ... | \n", "C | \n", "✔️ [True] | \n", "
10 | \n", "What is a deflationary gap? | \n", "['A. A situation where total demand is too small to absorb the tot... | \n", "A | \n", "A deflationary gap occurs when the total demand in an economy is i... | \n", "A | \n", "✔️ [True] | \n", "
11 | \n", "How many kcal are there in one gram of ethanol?\\n | \n", "['A. 37.6 kJ or 9.0 kcal per g', 'B. 15.6 kJ or 3.7 kcal per g', '... | \n", "H | \n", "Ethanol has a caloric value of approximately 7.1 kcal per gram. Th... | \n", "H | \n", "✔️ [True] | \n", "
12 | \n", "A man was visiting his friend at the friend's cabin. The man decid... | \n", "['A. No, because the man did not intend to burn down the cabin and... | \n", "A | \n", "The key issue in determining whether the man will be found guilty ... | \n", "A | \n", "✔️ [True] | \n", "
13 | \n", "A man hired a videographer to film his daughter's wedding. The wri... | \n", "['A. The contract is open to interpretation and does not explicitl... | \n", "C | \n", "The most persuasive argument for the videographer's contention is ... | \n", "E | \n", "\n", " |
14 | \n", "In what state is the 1999 movie Magnolia' set? | \n", "[A. Nevada, B. Arizona, C. New York, D. Georgia, E. California, F.... | \n", "E | \n", "The 1999 movie \"Magnolia\" is set in California. The film explores ... | \n", "E | \n", "✔️ [True] | \n", "
15 | \n", "In the phrase 'Y2K' what does 'K' stand for? | \n", "[A. kind, B. millennium, C. kingdom, D. key, E. kilo, F. kernel, G... | \n", "J | \n", "In the phrase 'Y2K', 'K' stands for 'kilo', which is a prefix mean... | \n", "J | \n", "✔️ [True] | \n", "
16 | \n", "A resident in an exclusive residential area is a marine biologist.... | \n", "['A. not recover, because she suffered injury only because she had... | \n", "G | \n", "In this case, the neighbor is suing the resident based on strict l... | \n", "E | \n", "\n", " |
17 | \n", "This question refers to the following information. \"I travelled th... | \n", "['A. Lorenzo de Medici', 'B. Sir Francis Drake', 'C. Hernán Cortés... | \n", "A | \n", "Ibn Battuta's travels were primarily focused on the Islamic world ... | \n", "A | \n", "✔️ [True] | \n", "
18 | \n", "Nussbaum claims that in cross-cultural communication, inhabitants ... | \n", "['A. in a utilitarian way.', 'B. in a Cartesian way.', 'C. in a ni... | \n", "J | \n", "Not supplied for this particular example. | \n", "H | \n", "\n", " |
19 | \n", "For a macroscopic object of mass $1.0 \\mathrm{~g}$ moving with spe... | \n", "[A. 8$10^{26}$, B. 1$10^{27}$, C. 1$10^{26}$, D. 9$10^{26}$, E. 6$... | \n", "J | \n", "To find the quantum number \\( n \\) for a macroscopic object in a o... | \n", "C | \n", "\n", " |