{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports & Env Setup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/justinai/.conda/envs/prompt-migration/lib/python3.10/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:\n", "* 'fields' has been removed\n", " warnings.warn(message, UserWarning)\n" ] } ], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "import sys\n", "import os\n", "from dotenv import load_dotenv\n", "load_dotenv()\n", "\n", "import dspy\n", "# Add the parent directory to sys.path so that `benchmarks` can be imported\n", "sys.path.append(os.path.abspath('../'))\n", "from benchmarks import llama_mmlu_pro, leaderboard_mmlu_pro" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configuration" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "NUM_THREADS = 32\n", "\n", "FEW_SHOTS = 5\n", "\n", "# See https://docs.litellm.ai/docs/providers/vllm for details\n", "TASK_MODEL = dspy.LM(\n", "    \"hosted_vllm/meta-llama/Llama-3.3-70B-Instruct\",\n", "    api_base='http://localhost:8000/v1',  # endpoint of the locally hosted vLLM server\n", "    # api_version: Optional[str] = None,\n", "    # api_key: Optional[str] = None,\n", "    # seed: Optional[int] = None,\n", "    # max_tokens: Optional[int] = None,\n", "    # timeout: Optional[Union[float, int]] = None,\n", ")\n", "PROMPT_MODEL = dspy.LM(\n", "    \"hosted_vllm/meta-llama/Llama-3.3-70B-Instruct\",\n", "    api_base='http://localhost:8000/v1',  # endpoint of the locally hosted vLLM server\n", "    # api_version: Optional[str] = None,\n", "    # api_key: Optional[str] = None,\n", "    # seed: Optional[int] = None,\n", "    # max_tokens: Optional[int] = None,\n", "    # timeout: Optional[Union[float, int]] = None,\n", ")\n", "\n", "dspy.configure(lm=TASK_MODEL)\n", "\n", "# Swap in leaderboard_mmlu_pro (or another benchmark module) to evaluate a different benchmark\n", "benchmark = llama_mmlu_pro\n", "\n", "# Without chain of thought:\n", "# program = dspy.Predict(\n", "# 
benchmark.signature(\"\")\n", "# )\n", "\n", "# With chain of thought:\n", "program = dspy.ChainOfThought(\n", " benchmark.signature(\"You are a helpful assistant designed to help with multiple choice question.\") # put your initial system prompt here, or leave blank\n", ")\n", "\n", "evaluate = dspy.Evaluate(\n", " devset=[],\n", " metric=benchmark.metric,\n", " num_threads=NUM_THREADS,\n", " display_progress=True,\n", " display_table=True,\n", " return_all_scores=True,\n", " return_outputs=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load dataset" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1197, 2156, 8626)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trainset, valset, testset = benchmark.datasets(\n", " train_size=0.1,\n", " validation_size=0.2,\n", ")\n", "\n", "len(trainset), len(valset), len(testset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Baseline Benchmark" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BASE PROMPT:\n", " You are a helpful assistant designed to help with multiple choice question. Always return a JSON object with the following format:\n", "{\n", " \"reasoning\": \"Step-by-step reasoning here.\",\n", " \"answer\": \"Final answer (A, B, C, etc.)\"\n", "}\n", "Do NOT return plain text. 
Only return a valid JSON object with these keys.\n", "CPU times: user 178 μs, sys: 18 μs, total: 196 μs\n", "Wall time: 176 μs\n" ] } ], "source": [ "%%time\n", "print(\"BASE PROMPT:\\n\", program.predict.signature.instructions)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting execution...\n", "Average Metric: 0.00 / 8626 (0.0%): 100%|██████████████████████████████████████████████████████████████████████████████████| 8626/8626 [7:07:06<00:00, 2.97s/it]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025/01/22 08:15:35 INFO dspy.evaluate.evaluate: Average Metric: 0 / 8626 (0.0%)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/html": [ "
| | question | options | example_reasoning | example_answer | pred_reasoning | pred_answer | metric |
|---|---|---|---|---|---|---|---|
| 0 | How does Freudian theory account for homosexuality? | {'A': 'According to Freudian theory, homosexuality is a result of ... | | D | Freudian theory, as proposed by Sigmund Freud, attempts to explain... | D | |
| 1 | Find the remainder when 25^1059 is divided by 23. | {'A': '22', 'B': '6', 'C': '11', 'D': '5', 'E': '13', 'F': '3', 'G... | | | To find the remainder when 25^1059 is divided by 23, we can use mo... | I | |
| 2 | A company sells its product at two different prices in two differe... | {'A': 'Experimental', 'B': 'Predictive', 'C': 'Causal', 'D': 'Obse... | | A | To identify the optimal price for the product, the research needs ... | A | |
| 3 | In 1989 scientists from Norway discovered that there are far more ... | {'A': 'up to 5000000', 'B': 'up to 2500000', 'C': 'up to 500000000... | | B | To answer this question, we need to consider the findings of the N... | B | |
| 4 | Two processors, M-5 and M-7, implement the same instruction set. P... | {'A': 'All three statements are true', 'B': 'None of the statement... | | C | To determine which of the statements are true, let's analyze each ... | C | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 8621 | What is meant by translocation? What theories have been advanced t... | {'A': 'Translocation is the movement of water from the roots to ot... | | I | To answer this question, we first need to understand what transloc... | I | |
| 8622 | (1) Where is the near point of an eye for which a spectacle lens o... | {'A': '30 cm, -150 cm', 'B': '40 cm, -170 cm', 'C': '65 cm, -210 c... | | F | To find the near point of an eye for which a spectacle lens of pow... | F | |
| 8623 | In the popular nursery rhyme \"Mary Had a Little Lamb\", her lambwou... | {'A': \"The lamb had a habitual path that coincidentally aligned wi... | | B | The nursery rhyme 'Mary Had a Little Lamb' describes a scenario wh... | B | |
| 8624 | Studies into the etiology of Schizophrenia indicated a genetic pre... | {'A': 'Excess dopamine or sensitivity to dopamine could be a contr... | | B | The question requires identifying the incorrect statement regardin... | B | |
| 8625 | Describe each of the following situations in terms of de-mand elas... | {'A': 'a) Elastic demand, b) Inelastic demand, c) Elastic demand',... | | C | To determine the elasticity of demand in each situation, we need t... | C | |
8626 rows × 7 columns
\n", "