{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports & Env Setup" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "import sys\n", "import os\n", "from dotenv import load_dotenv\n", "load_dotenv()\n", "\n", "import dspy\n", "sys.path.append(os.path.abspath('../'))\n", "from benchmarks import llama_mmlu_pro, leaderboard_mmlu_pro" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configuration" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "NUM_THREADS = 16\n", "\n", "FEW_SHOTS = 5\n", "\n", "# See https://docs.litellm.ai/docs/providers/vllm for details\n", "TASK_MODEL = dspy.LM(\n", "    \"hosted_vllm/meta-llama/Llama-3.3-70B-Instruct\",\n", "    api_base='http://localhost:8000/v1',  # base URL of the local vLLM server\n", "    # api_version: Optional[str] = None,\n", "    # api_key: Optional[str] = None,\n", "    # seed: Optional[int] = None,\n", "    # max_tokens: Optional[int] = None,\n", "    # timeout: Optional[Union[float, int]] = None,\n", ")\n", "PROMPT_MODEL = dspy.LM(\n", "    \"hosted_vllm/meta-llama/Llama-3.3-70B-Instruct\",\n", "    api_base='http://localhost:8000/v1',  # base URL of the local vLLM server\n", "    # api_version: Optional[str] = None,\n", "    # api_key: Optional[str] = None,\n", "    # seed: Optional[int] = None,\n", "    # max_tokens: Optional[int] = None,\n", "    # timeout: Optional[Union[float, int]] = None,\n", ")\n", "\n", "dspy.configure(lm=TASK_MODEL)\n", "\n", "# Benchmark module to run; swap in leaderboard_mmlu_pro (also imported above) if needed\n", "benchmark = llama_mmlu_pro\n", "\n", "# Without chain of thought:\n", "# program = dspy.Predict(\n", "# benchmark.signature(\"\")\n", "# )\n", "\n", "# With chain of thought:\n", "program = dspy.ChainOfThought(\n", "    benchmark.signature(\"You are a helpful assistant designed to help with multiple choice question.\") # put your initial system prompt here, or leave blank\n", ")\n", "\n", "evaluate = dspy.Evaluate(\n", " 
devset=[],\n", " metric=benchmark.metric,\n", " num_threads=NUM_THREADS,\n", " display_progress=True,\n", " display_table=True,\n", " return_all_scores=True,\n", " return_outputs=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load dataset" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1197, 2156, 8626)" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trainset, valset, testset = benchmark.datasets(\n", " train_size=0.1,\n", " validation_size=0.2,\n", ")\n", "\n", "len(trainset), len(valset), len(testset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Baseline Benchmark" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BASE PROMPT:\n", " You are a helpful assistant designed to help with multiple choice question.\n" ] } ], "source": [ "print(\"BASE PROMPT:\\n\", program.predict.signature.instructions)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average Metric: 71.00 / 99 (71.7%): 99%|████████████████████████████████████████▌| 99/100 [01:16<00:01, 1.58s/it]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025/01/16 11:41:56 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'A (1/4) in. thick double leather belt is used on a cast steel pulley 50 in. in diameter which rotates at 1000 rpm and transmits 100 hp. Calculate the belt width using the following data: Coefficient of friction between cast-steel and leather = 0.40. 
Safe stress for belting = 300 psi Joint efficiency = 70 percent.', 'options': {'A': '7(1/2) in.', 'B': '7 in.', 'C': '9 in.', 'D': '6 in.', 'E': '5(1/2) in.', 'F': '9(1/2) in.', 'G': '10 in.', 'H': '8(1/2) in.', 'I': '8 in.', 'J': '11 in.'}, 'answer': 'I'}) (input_keys={'options', 'question'}): Expected dict_keys(['reasoning', 'answer']) but got dict_keys(['reasoning']). Set `provide_traceback=True` to see the stack trace.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Average Metric: 71.00 / 99 (71.7%): 100%|████████████████████████████████████████| 100/100 [01:30<00:00, 1.11it/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025/01/16 11:41:56 INFO dspy.evaluate.evaluate: Average Metric: 71.0 / 100 (71.0%)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/html": [ "
\n", " | question | \n", "options | \n", "example_answer | \n", "reasoning | \n", "pred_answer | \n", "metric | \n", "answer | \n", "
---|---|---|---|---|---|---|---|
0 | \n", "Describe the evolution of the reptilian excretory system to accoun... | \n", "{'A': 'The excretory system includes a secondary bladder for water... | \n", "J | \n", "The evolution of the reptilian excretory system from an aquatic to... | \n", "J | \n", "✔️ [True] | \n", "NaN | \n", "
1 | \n", "A scientist used his car to transport a large quantity of highly f... | \n", "{'A': 'No, because the doctor should have been more careful around... | \n", "D | \n", "To determine if the doctor will prevail in a claim against the sci... | \n", "D | \n", "✔️ [True] | \n", "NaN | \n", "
2 | \n", "Which of the following could be used as a test for autocorrelation... | \n", "{'A': 'The Dickey-Fuller test', 'B': 'The Jarque-Bera test', 'C': ... | \n", "G | \n", "To determine which of the following could be used as a test for au... | \n", "G | \n", "✔️ [True] | \n", "NaN | \n", "
3 | \n", "Write the balanced cell reaction and calculate theemfat 298 K of t... | \n", "{'A': '.25 V', 'B': '.114 V', 'C': '0.0157963 V', 'D': '.1298 V', ... | \n", "\n", " | To solve this problem, we first need to write the balanced cell re... | \n", "B | \n", "\n", " | NaN | \n", "
4 | \n", "Assume a temperature of 300 K and find the wavelength of the photo... | \n", "{'A': '2100.0', 'B': '2200.0', 'C': '1600.0', 'D': '1400.0', 'E': ... | \n", "G | \n", "To find the wavelength of the photon necessary to cause an electro... | \n", "J | \n", "\n", " | NaN | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
95 | \n", "A pure lead bar 10 cm long is maintained with one end at T &=300 K... | \n", "{'A': '2.56e-07', 'B': '6.40e-06', 'C': '6.40e-07', 'D': '5.12e-06... | \n", "H | \n", "To find the thermoelectric power for lead, we first need to unders... | \n", "H | \n", "✔️ [True] | \n", "NaN | \n", "
96 | \n", "Which of the following is another name for evading the issue? | \n", "{'A': 'hasty generalization', 'B': 'slippery slope', 'C': '\"you to... | \n", "G | \n", "To answer this question, we need to understand what \"evading the i... | \n", "G | \n", "✔️ [True] | \n", "NaN | \n", "
97 | \n", "A spherical charge distribution varies with the radius r by the eq... | \n", "{'A': 'It increases as r approaches infinity.', 'B': 'It increases... | \n", "G | \n", "To determine how the electric field strength varies with distance ... | \n", "F | \n", "\n", " | NaN | \n", "
98 | \n", "Where in the balance sheet does each of the following belong? (A) ... | \n", "{'A': \"(A) Liability section, (B) Asset side, (C) Owner's Equity s... | \n", "J | \n", "To determine where each of the given items belongs on the balance ... | \n", "J | \n", "✔️ [True] | \n", "NaN | \n", "
99 | \n", "A $360-\\mathrm{lb}$ gorilla climbs a tree to a height of $20 \\math... | \n", "{'A': '6000 $\\\\mathrm{ft-lb}$', 'B': '3600 $\\\\mathrm{ft-lb}$', 'C'... | \n", "F | \n", "To find the work done by the gorilla climbing the tree, we can use... | \n", "F | \n", "✔️ [True] | \n", "NaN | \n", "
100 rows × 7 columns
\n", "\n", " | question | \n", "options | \n", "example_answer | \n", "reasoning | \n", "pred_answer | \n", "metric | \n", "answer | \n", "
---|---|---|---|---|---|---|---|
0 | \n", "Describe the evolution of the reptilian excretory system to accoun... | \n", "{'A': 'The excretory system includes a secondary bladder for water... | \n", "J | \n", "The transition from an aquatic to a terrestrial habitat imposed si... | \n", "J | \n", "✔️ [True] | \n", "NaN | \n", "
1 | \n", "A scientist used his car to transport a large quantity of highly f... | \n", "{'A': 'No, because the doctor should have been more careful around... | \n", "D | \n", "To prevail in a claim based on strict liability, the doctor must s... | \n", "D | \n", "✔️ [True] | \n", "NaN | \n", "
2 | \n", "Which of the following could be used as a test for autocorrelation... | \n", "{'A': 'The Dickey-Fuller test', 'B': 'The Jarque-Bera test', 'C': ... | \n", "G | \n", "The question asks for a test that can be used to detect autocorrel... | \n", "G | \n", "✔️ [True] | \n", "NaN | \n", "
3 | \n", "Write the balanced cell reaction and calculate theemfat 298 K of t... | \n", "{'A': '.25 V', 'B': '.114 V', 'C': '0.0157963 V', 'D': '.1298 V', ... | \n", "\n", " | To solve this problem, we first need to write the balanced cell re... | \n", "D | \n", "\n", " | NaN | \n", "
4 | \n", "Assume a temperature of 300 K and find the wavelength of the photo... | \n", "{'A': '2100.0', 'B': '2200.0', 'C': '1600.0', 'D': '1400.0', 'E': ... | \n", "G | \n", "To find the wavelength of the photon necessary to cause an electro... | \n", "J | \n", "\n", " | NaN | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
195 | \n", "Which statement is true? | \n", "{'A': 'All trapezoids are rectangles because they have at least on... | \n", "D | \n", "To determine which statement is true, we need to evaluate each opt... | \n", "J | \n", "\n", " | NaN | \n", "
196 | \n", "Select the best English interpretation of the given proposition, u... | \n", "{'A': 'All large apartments are bigger than some houses.', 'B': 'S... | \n", "E | \n", "The given proposition is (∃x)[(Ax • Lx) • (∃y)(Hy • Bxy)]. Breakin... | \n", "E | \n", "✔️ [True] | \n", "NaN | \n", "
197 | \n", "f(X) = [\\pi(1 + X^2)]^-1- \\infty < x < \\infty. If Y = X^2, what is... | \n", "{'A': 'h(y) = [2 / {\\\\pi(1 + \\\\sqrt{y})}] for y > 0 and = 0 otherw... | \n", "G | \n", "To find the density function of Y, given that Y = X^2, we first ne... | \n", "G | \n", "✔️ [True] | \n", "NaN | \n", "
198 | \n", "Two thin convex lenses of focal lengths f_1 and f_2 are separated ... | \n", "{'A': '[(3f_2) / 2]', 'B': '(f_1 + f_2) / 2', 'C': '(2f_2) / 3', '... | \n", "A | \n", "The focal length of the combination of two thin convex lenses can ... | \n", "A | \n", "✔️ [True] | \n", "NaN | \n", "
199 | \n", "Let $X$ be uniformly distributed over $\\{1, 2, \\ldots, m\\}$. Assum... | \n", "{'A': '0.3', 'B': '0.4', 'C': '0.1', 'D': '0.0', 'E': '0.7', 'F': ... | \n", "D | \n", "To solve this problem, we first need to understand the process and... | \n", "D | \n", "✔️ [True] | \n", "NaN | \n", "
200 rows × 7 columns
\n", "\n", " | question | \n", "options | \n", "example_answer | \n", "reasoning | \n", "pred_answer | \n", "metric | \n", "
---|---|---|---|---|---|---|
0 | \n", "Describe the evolution of the reptilian excretory system to accoun... | \n", "{'A': 'The excretory system includes a secondary bladder for water... | \n", "J | \n", "The evolution of the reptilian excretory system from an aquatic to... | \n", "J | \n", "✔️ [True] | \n", "
1 | \n", "A scientist used his car to transport a large quantity of highly f... | \n", "{'A': 'No, because the doctor should have been more careful around... | \n", "D | \n", "To determine if the doctor will prevail in a claim against the sci... | \n", "D | \n", "✔️ [True] | \n", "
2 | \n", "Which of the following could be used as a test for autocorrelation... | \n", "{'A': 'The Dickey-Fuller test', 'B': 'The Jarque-Bera test', 'C': ... | \n", "G | \n", "The Breusch-Godfrey test is a statistical test used to detect auto... | \n", "G | \n", "✔️ [True] | \n", "
3 | \n", "Write the balanced cell reaction and calculate theemfat 298 K of t... | \n", "{'A': '.25 V', 'B': '.114 V', 'C': '0.0157963 V', 'D': '.1298 V', ... | \n", "\n", " | To solve this problem, we need to write the balanced cell reaction... | \n", "D | \n", "\n", " |
4 | \n", "Assume a temperature of 300 K and find the wavelength of the photo... | \n", "{'A': '2100.0', 'B': '2200.0', 'C': '1600.0', 'D': '1400.0', 'E': ... | \n", "G | \n", "To find the wavelength of the photon necessary to cause an electro... | \n", "J | \n", "\n", " |
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
295 | \n", "We were first able to accurately measure the diameter of Pluto from: | \n", "{'A': \"Lunar-based observations made during NASA's Apollo missions... | \n", "H | \n", "The diameter of Pluto was first accurately measured through observ... | \n", "D | \n", "\n", " |
296 | \n", "Which of the following is a clustering algorithm in machine learning? | \n", "{'A': 'Linear Regression', 'B': 'CART', 'C': 'Logistic Regression'... | \n", "D | \n", "Clustering algorithms in machine learning are used to group simila... | \n", "D | \n", "✔️ [True] | \n", "
297 | \n", "In a population in Denmark, the relative fitness of the allele for... | \n", "{'A': '7.82 × 10^-5', 'B': '3.14 × 10^-5', 'C': '1.19 × 10^-4', 'D... | \n", "H | \n", "To find the mutation rate, we first need to understand the relatio... | \n", "D | \n", "\n", " |
298 | \n", "Miss Jones has been concerned about her health lately. She has not... | \n", "{'A': 'herpes', 'B': 'trichomoniasis', 'C': 'pubic lice', 'D': 'sy... | \n", "C | \n", "Given Miss Jones' symptoms of itching and skin irritation around h... | \n", "C | \n", "✔️ [True] | \n", "
299 | \n", "Parfit claims that the magnitude of pains: | \n", "{'A': 'can be precisely compared.', 'B': 'can be compared, but onl... | \n", "B | \n", "Parfit's claim is related to the comparability of pains, which is ... | \n", "B | \n", "✔️ [True] | \n", "
300 rows × 6 columns
\n", "