{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "1FXUu7Ydf2p3"
},
"source": [
"# An Implementation of Notebook LM's PDF to Podcast\n",
"\n",
"[](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/PDF_to_Podcast.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Introduction\n",
"\n",
"In this notebook we will see how to create a podcast like the one below from a PDF input!"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Example Podcast!\n",
"import IPython\n",
"IPython.display.Video(\"MoA_podcast.mp4\", width=500)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yA6mSWAcf2p6"
},
"source": [
"Inspired by [Notebook LM's](https://notebooklm.google/) podcast generation feature and a recent open source implementation of [Open Notebook LM](https://github.com/gabrielchua/open-notebooklm). In this cookbook we will implement a walkthrough of how you can build a PDF to podcast pipeline.\n",
"\n",
"Given any PDF we will generate a conversation between a host and a guest discussing and explaining the contents of the PDF.\n",
"\n",
"In doing so we will learn the following:\n",
"1. How we can use JSON mode and structured generation with open models like Llama 3 70b to extract a script for the Podcast given text from the PDF.\n",
"2. How we can use TTS models to bring this script to life as a conversation.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "cN0Tpr76ssM1",
"outputId": "4f5e2808-5e88-4931-8f02-e6ad8b107bf1"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Reading package lists... Done\n",
"Building dependency tree... Done\n",
"Reading state information... Done\n",
"libasound2-dev is already the newest version (1.2.6.1-1ubuntu1).\n",
"ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).\n",
"Suggested packages:\n",
" portaudio19-doc\n",
"The following NEW packages will be installed:\n",
" libportaudio2 libportaudiocpp0 portaudio19-dev\n",
"0 upgraded, 3 newly installed, 0 to remove and 49 not upgraded.\n",
"Need to get 188 kB of archives.\n",
"After this operation, 927 kB of additional disk space will be used.\n",
"Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libportaudio2 amd64 19.6.0-1.1 [65.3 kB]\n",
"Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libportaudiocpp0 amd64 19.6.0-1.1 [16.1 kB]\n",
"Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 portaudio19-dev amd64 19.6.0-1.1 [106 kB]\n",
"Fetched 188 kB in 1s (177 kB/s)\n",
"Selecting previously unselected package libportaudio2:amd64.\n",
"(Reading database ... 123623 files and directories currently installed.)\n",
"Preparing to unpack .../libportaudio2_19.6.0-1.1_amd64.deb ...\n",
"Unpacking libportaudio2:amd64 (19.6.0-1.1) ...\n",
"Selecting previously unselected package libportaudiocpp0:amd64.\n",
"Preparing to unpack .../libportaudiocpp0_19.6.0-1.1_amd64.deb ...\n",
"Unpacking libportaudiocpp0:amd64 (19.6.0-1.1) ...\n",
"Selecting previously unselected package portaudio19-dev:amd64.\n",
"Preparing to unpack .../portaudio19-dev_19.6.0-1.1_amd64.deb ...\n",
"Unpacking portaudio19-dev:amd64 (19.6.0-1.1) ...\n",
"Setting up libportaudio2:amd64 (19.6.0-1.1) ...\n",
"Setting up libportaudiocpp0:amd64 (19.6.0-1.1) ...\n",
"Setting up portaudio19-dev:amd64 (19.6.0-1.1) ...\n",
"Processing triggers for libc-bin (2.35-0ubuntu3.4) ...\n",
"/sbin/ldconfig.real: /usr/local/lib/libur_loader.so.0 is not a symbolic link\n",
"\n",
"/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_5.so.3 is not a symbolic link\n",
"\n",
"/sbin/ldconfig.real: /usr/local/lib/libtbbbind.so.3 is not a symbolic link\n",
"\n",
"/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_0.so.3 is not a symbolic link\n",
"\n",
"/sbin/ldconfig.real: /usr/local/lib/libumf.so.0 is not a symbolic link\n",
"\n",
"/sbin/ldconfig.real: /usr/local/lib/libtcm_debug.so.1 is not a symbolic link\n",
"\n",
"/sbin/ldconfig.real: /usr/local/lib/libur_adapter_level_zero.so.0 is not a symbolic link\n",
"\n",
"/sbin/ldconfig.real: /usr/local/lib/libtcm.so.1 is not a symbolic link\n",
"\n",
"/sbin/ldconfig.real: /usr/local/lib/libtbb.so.12 is not a symbolic link\n",
"\n",
"/sbin/ldconfig.real: /usr/local/lib/libhwloc.so.15 is not a symbolic link\n",
"\n",
"/sbin/ldconfig.real: /usr/local/lib/libur_adapter_opencl.so.0 is not a symbolic link\n",
"\n",
"/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc_proxy.so.2 is not a symbolic link\n",
"\n",
"/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc.so.2 is not a symbolic link\n",
"\n",
"Collecting ffmpeg-python\n",
" Downloading ffmpeg_python-0.2.0-py3-none-any.whl.metadata (1.7 kB)\n",
"Requirement already satisfied: future in /usr/local/lib/python3.10/dist-packages (from ffmpeg-python) (1.0.0)\n",
"Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)\n",
"Installing collected packages: ffmpeg-python\n",
"Successfully installed ffmpeg-python-0.2.0\n",
"Collecting PyAudio\n",
" Downloading PyAudio-0.2.14.tar.gz (47 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m47.1/47.1 kB\u001b[0m \u001b[31m2.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25h Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
" Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
" Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
"Building wheels for collected packages: PyAudio\n",
" Building wheel for PyAudio (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for PyAudio: filename=PyAudio-0.2.14-cp310-cp310-linux_x86_64.whl size=63861 sha256=3b9f781cb8c48b7f5f22249c8d4fc5d05b7a095675870fcaa52199ce36b04185\n",
" Stored in directory: /root/.cache/pip/wheels/d6/21/f4/0b51d41ba79e51b16295cbb096ec49f334792814d545b508c5\n",
"Successfully built PyAudio\n",
"Installing collected packages: PyAudio\n",
"Successfully installed PyAudio-0.2.14\n",
"Collecting pypdf\n",
" Downloading pypdf-5.1.0-py3-none-any.whl.metadata (7.2 kB)\n",
"Requirement already satisfied: typing_extensions>=4.0 in /usr/local/lib/python3.10/dist-packages (from pypdf) (4.12.2)\n",
"Downloading pypdf-5.1.0-py3-none-any.whl (297 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m298.0/298.0 kB\u001b[0m \u001b[31m6.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hInstalling collected packages: pypdf\n",
"Successfully installed pypdf-5.1.0\n",
"Collecting together\n",
" Downloading together-1.3.3-py3-none-any.whl.metadata (11 kB)\n",
"Requirement already satisfied: aiohttp<4.0.0,>=3.9.3 in /usr/local/lib/python3.10/dist-packages (from together) (3.10.10)\n",
"Requirement already satisfied: click<9.0.0,>=8.1.7 in /usr/local/lib/python3.10/dist-packages (from together) (8.1.7)\n",
"Requirement already satisfied: eval-type-backport<0.3.0,>=0.1.3 in /usr/local/lib/python3.10/dist-packages (from together) (0.2.0)\n",
"Requirement already satisfied: filelock<4.0.0,>=3.13.1 in /usr/local/lib/python3.10/dist-packages (from together) (3.16.1)\n",
"Requirement already satisfied: numpy>=1.23.5 in /usr/local/lib/python3.10/dist-packages (from together) (1.26.4)\n",
"Requirement already satisfied: pillow<11.0.0,>=10.3.0 in /usr/local/lib/python3.10/dist-packages (from together) (10.4.0)\n",
"Requirement already satisfied: pyarrow>=10.0.1 in /usr/local/lib/python3.10/dist-packages (from together) (16.1.0)\n",
"Requirement already satisfied: pydantic<3.0.0,>=2.6.3 in /usr/local/lib/python3.10/dist-packages (from together) (2.9.2)\n",
"Requirement already satisfied: requests<3.0.0,>=2.31.0 in /usr/local/lib/python3.10/dist-packages (from together) (2.32.3)\n",
"Requirement already satisfied: rich<14.0.0,>=13.8.1 in /usr/local/lib/python3.10/dist-packages (from together) (13.9.3)\n",
"Requirement already satisfied: tabulate<0.10.0,>=0.9.0 in /usr/local/lib/python3.10/dist-packages (from together) (0.9.0)\n",
"Requirement already satisfied: tqdm<5.0.0,>=4.66.2 in /usr/local/lib/python3.10/dist-packages (from together) (4.66.5)\n",
"Requirement already satisfied: typer<0.13,>=0.9 in /usr/local/lib/python3.10/dist-packages (from together) (0.12.5)\n",
"Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.9.3->together) (2.4.3)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.9.3->together) (1.3.1)\n",
"Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.9.3->together) (24.2.0)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.9.3->together) (1.5.0)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.9.3->together) (6.1.0)\n",
"Requirement already satisfied: yarl<2.0,>=1.12.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.9.3->together) (1.16.0)\n",
"Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp<4.0.0,>=3.9.3->together) (4.0.3)\n",
"Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic<3.0.0,>=2.6.3->together) (0.7.0)\n",
"Requirement already satisfied: pydantic-core==2.23.4 in /usr/local/lib/python3.10/dist-packages (from pydantic<3.0.0,>=2.6.3->together) (2.23.4)\n",
"Requirement already satisfied: typing-extensions>=4.6.1 in /usr/local/lib/python3.10/dist-packages (from pydantic<3.0.0,>=2.6.3->together) (4.12.2)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.31.0->together) (3.4.0)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.31.0->together) (3.10)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.31.0->together) (2.2.3)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.31.0->together) (2024.8.30)\n",
"Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich<14.0.0,>=13.8.1->together) (3.0.0)\n",
"Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich<14.0.0,>=13.8.1->together) (2.18.0)\n",
"Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.10/dist-packages (from typer<0.13,>=0.9->together) (1.5.4)\n",
"Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich<14.0.0,>=13.8.1->together) (0.1.2)\n",
"Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from yarl<2.0,>=1.12.0->aiohttp<4.0.0,>=3.9.3->together) (0.2.0)\n",
"Downloading together-1.3.3-py3-none-any.whl (68 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m68.1/68.1 kB\u001b[0m \u001b[31m2.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hInstalling collected packages: together\n",
"Successfully installed together-1.3.3\n",
"Collecting cartesia\n",
" Downloading cartesia-1.1.0-py3-none-any.whl.metadata (21 kB)\n",
"Requirement already satisfied: aiohttp>=3.10.10 in /usr/local/lib/python3.10/dist-packages (from cartesia) (3.10.10)\n",
"Requirement already satisfied: httpx>=0.27.2 in /usr/local/lib/python3.10/dist-packages (from cartesia) (0.27.2)\n",
"Collecting iterators>=0.2.0 (from cartesia)\n",
" Downloading iterators-0.2.0-py3-none-any.whl.metadata (2.7 kB)\n",
"Requirement already satisfied: requests>=2.32.3 in /usr/local/lib/python3.10/dist-packages (from cartesia) (2.32.3)\n",
"Collecting websockets>=13.1 (from cartesia)\n",
" Downloading websockets-13.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)\n",
"Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.10.10->cartesia) (2.4.3)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.10.10->cartesia) (1.3.1)\n",
"Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.10.10->cartesia) (24.2.0)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.10.10->cartesia) (1.5.0)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.10.10->cartesia) (6.1.0)\n",
"Requirement already satisfied: yarl<2.0,>=1.12.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.10.10->cartesia) (1.16.0)\n",
"Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp>=3.10.10->cartesia) (4.0.3)\n",
"Requirement already satisfied: anyio in /usr/local/lib/python3.10/dist-packages (from httpx>=0.27.2->cartesia) (3.7.1)\n",
"Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx>=0.27.2->cartesia) (2024.8.30)\n",
"Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx>=0.27.2->cartesia) (1.0.6)\n",
"Requirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from httpx>=0.27.2->cartesia) (3.10)\n",
"Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from httpx>=0.27.2->cartesia) (1.3.1)\n",
"Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx>=0.27.2->cartesia) (0.14.0)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.3->cartesia) (3.4.0)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.3->cartesia) (2.2.3)\n",
"Requirement already satisfied: typing-extensions>=4.1.0 in /usr/local/lib/python3.10/dist-packages (from multidict<7.0,>=4.5->aiohttp>=3.10.10->cartesia) (4.12.2)\n",
"Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from yarl<2.0,>=1.12.0->aiohttp>=3.10.10->cartesia) (0.2.0)\n",
"Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio->httpx>=0.27.2->cartesia) (1.2.2)\n",
"Downloading cartesia-1.1.0-py3-none-any.whl (29 kB)\n",
"Downloading iterators-0.2.0-py3-none-any.whl (5.0 kB)\n",
"Downloading websockets-13.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (164 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m164.1/164.1 kB\u001b[0m \u001b[31m7.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hInstalling collected packages: websockets, iterators, cartesia\n",
"Successfully installed cartesia-1.1.0 iterators-0.2.0 websockets-13.1\n"
]
}
],
"source": [
"!apt install libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg\n",
"!pip install ffmpeg-python\n",
"!pip install PyAudio\n",
"!pip install pypdf #to read PDF content\n",
"!pip install together #to access open source LLMs\n",
"!pip install cartesia #to access TTS model"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "iWea6go4r72c"
},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Standard library imports\n",
"from pathlib import Path\n",
"from tempfile import NamedTemporaryFile\n",
"from typing import List, Literal, Tuple, Optional\n",
"\n",
"# Third-party imports\n",
"from pydantic import BaseModel\n",
"from pypdf import PdfReader\n",
"\n",
"from together import Together\n",
"from cartesia import Cartesia\n",
"from pydantic import ValidationError"
]
},
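{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "The next cell expects `TOGETHER_API_KEY` and `CARTESIA_API_KEY` to be available as environment variables. If you are running in Colab and have not set them yet, the optional sketch below shows one way to enter them interactively with `getpass`; it is purely illustrative, so feel free to manage your secrets however you normally do."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# Optional helper: prompt for any API keys that are not already set in the environment.\n",
  "# This is just one convenient approach (illustrative, not required by the pipeline).\n",
  "import os\n",
  "from getpass import getpass\n",
  "\n",
  "for key_name in (\"TOGETHER_API_KEY\", \"CARTESIA_API_KEY\"):\n",
  "    if not os.environ.get(key_name):\n",
  "        os.environ[key_name] = getpass(f\"Enter {key_name}: \")"
 ]
},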
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "7GYTmdx_s6QL"
},
"outputs": [],
"source": [
"# Paste in your Together AI and Cartesia API Key or load it\n",
"client_cartesia = Cartesia(api_key=os.environ.get(\"CARTESIA_API_KEY\"))\n",
"client_together = Together(api_key=os.environ.get(\"TOGETHER_API_KEY\"))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LGWv-oZ2f2p8"
},
"source": [
"### Define Dialogue Schema with Pydantic\n",
"\n",
"We need a way of telling the LLM what the structure of the podcast script between the guest and host will look like. We will do this using `pydantic` models.\n",
"\n",
"Below we define the required classes.\n",
"\n",
"- The overall conversation consists of lines said by either the host or the guest. The `DialogueItem` class specifies the structure of these lines.\n",
"- The full script is a combination of multiple lines performed by the speakers, here we also include a scratchpad field to allow the LLM to ideate and brainstorm the overall flow of the script prior to actually generating the lines. The `Dialogue` class specifies this."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "zYOq3bdntLgl"
},
"outputs": [],
"source": [
"class DialogueItem(BaseModel):\n",
" \"\"\"A single dialogue item.\"\"\"\n",
"\n",
" speaker: Literal[\"Host (Jane)\", \"Guest\"]\n",
" text: str\n",
"\n",
"\n",
"class Dialogue(BaseModel):\n",
" \"\"\"The dialogue between the host and guest.\"\"\"\n",
"\n",
" scratchpad: str\n",
" name_of_guest: str\n",
" dialogue: List[DialogueItem]"
]
},
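{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "To make the expected output concrete, the illustrative cell below prints the JSON schema derived from `Dialogue` (this is what we will later pass to the LLM as the response format) and validates a tiny hand-written sample with `model_validate_json`. The sample dialogue is made up purely for demonstration."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "import json\n",
  "\n",
  "# Inspect the JSON schema that will later constrain the LLM's output\n",
  "print(json.dumps(Dialogue.model_json_schema(), indent=2))\n",
  "\n",
  "# Validate a tiny hand-written example to see the structure the LLM must produce\n",
  "sample = \"\"\"\n",
  "{\n",
  "  \"scratchpad\": \"Open with a hook about LLMs collaborating, then explain the layered setup.\",\n",
  "  \"name_of_guest\": \"Junlin Wang\",\n",
  "  \"dialogue\": [\n",
  "    {\"speaker\": \"Host (Jane)\", \"text\": \"Welcome to the show!\"},\n",
  "    {\"speaker\": \"Guest\", \"text\": \"Thanks for having me, Jane.\"}\n",
  "  ]\n",
  "}\n",
  "\"\"\"\n",
  "print(Dialogue.model_validate_json(sample))"
 ]
},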
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "6ZzYFsNXuDN0"
},
"outputs": [],
"source": [
"# Adapted and modified from https://github.com/gabrielchua/open-notebooklm\n",
"SYSTEM_PROMPT = \"\"\"\n",
"You are a world-class podcast producer tasked with transforming the provided input text into an engaging and informative podcast script. The input may be unstructured or messy, sourced from PDFs or web pages. Your goal is to extract the most interesting and insightful content for a compelling podcast discussion.\n",
"\n",
"# Steps to Follow:\n",
"\n",
"1. **Analyze the Input:**\n",
" Carefully examine the text, identifying key topics, points, and interesting facts or anecdotes that could drive an engaging podcast conversation. Disregard irrelevant information or formatting issues.\n",
"\n",
"2. **Brainstorm Ideas:**\n",
" In the ``, creatively brainstorm ways to present the key points engagingly. Consider:\n",
" - Analogies, storytelling techniques, or hypothetical scenarios to make content relatable\n",
" - Ways to make complex topics accessible to a general audience\n",
" - Thought-provoking questions to explore during the podcast\n",
" - Creative approaches to fill any gaps in the information\n",
"\n",
"3. **Craft the Dialogue:**\n",
" Develop a natural, conversational flow between the host (Jane) and the guest speaker (the author or an expert on the topic). Incorporate:\n",
" - The best ideas from your brainstorming session\n",
" - Clear explanations of complex topics\n",
" - An engaging and lively tone to captivate listeners\n",
" - A balance of information and entertainment\n",
"\n",
" Rules for the dialogue:\n",
" - The host (Jane) always initiates the conversation and interviews the guest\n",
" - Include thoughtful questions from the host to guide the discussion\n",
" - Incorporate natural speech patterns, including occasional verbal fillers (e.g., \"Uhh\", \"Hmmm\", \"um,\" \"well,\" \"you know\")\n",
" - Allow for natural interruptions and back-and-forth between host and guest - this is very important to make the conversation feel authentic\n",
" - Ensure the guest's responses are substantiated by the input text, avoiding unsupported claims\n",
" - Maintain a PG-rated conversation appropriate for all audiences\n",
" - Avoid any marketing or self-promotional content from the guest\n",
" - The host concludes the conversation\n",
"\n",
"4. **Summarize Key Insights:**\n",
" Naturally weave a summary of key points into the closing part of the dialogue. This should feel like a casual conversation rather than a formal recap, reinforcing the main takeaways before signing off.\n",
"\n",
"5. **Maintain Authenticity:**\n",
" Throughout the script, strive for authenticity in the conversation. Include:\n",
" - Moments of genuine curiosity or surprise from the host\n",
" - Instances where the guest might briefly struggle to articulate a complex idea\n",
" - Light-hearted moments or humor when appropriate\n",
" - Brief personal anecdotes or examples that relate to the topic (within the bounds of the input text)\n",
"\n",
"6. **Consider Pacing and Structure:**\n",
" Ensure the dialogue has a natural ebb and flow:\n",
" - Start with a strong hook to grab the listener's attention\n",
" - Gradually build complexity as the conversation progresses\n",
" - Include brief \"breather\" moments for listeners to absorb complex information\n",
" - For complicated concepts, reasking similar questions framed from a different perspective is recommended\n",
" - End on a high note, perhaps with a thought-provoking question or a call-to-action for listeners\n",
"\n",
"IMPORTANT RULE: Each line of dialogue should be no more than 100 characters (e.g., can finish within 5-8 seconds)\n",
"\n",
"Remember: Always reply in valid JSON format, without code blocks. Begin directly with the JSON output.\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sdo7pZvgf2p9"
},
"source": [
"### Call the LLM to Generate Podcast Script\n",
"\n",
"Below we call `Llama-3.1-70B` to generate a script for our podcast. We will also be able to read it's `scratchpad` and see how it structured the overall conversation."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "Y0RtJZ9VtVut"
},
"outputs": [],
"source": [
"def call_llm(system_prompt: str, text: str, dialogue_format):\n",
" \"\"\"Call the LLM with the given prompt and dialogue format.\"\"\"\n",
" response = client_together.chat.completions.create(\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": text},\n",
" ],\n",
" model=\"meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo\", # can also use \"meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo\"\n",
" response_format={\n",
" \"type\": \"json_object\",\n",
" \"schema\": dialogue_format.model_json_schema(),\n",
" },\n",
" )\n",
" return response"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "FvW4J7W3tOow"
},
"outputs": [],
"source": [
"def generate_script(system_prompt: str, input_text: str, output_model):\n",
" \"\"\"Get the dialogue from the LLM.\"\"\"\n",
" # Load as python object\n",
" try:\n",
" response = call_llm(system_prompt, input_text, output_model)\n",
" dialogue = output_model.model_validate_json(\n",
" response.choices[0].message.content\n",
" )\n",
" except ValidationError as e:\n",
" error_message = f\"Failed to parse dialogue JSON: {e}\"\n",
" system_prompt_with_error = f\"{system_prompt}\\n\\nPlease return a VALID JSON object. This was the earlier error: {error_message}\"\n",
" response = call_llm(system_prompt_with_error, input_text, output_model)\n",
" dialogue = output_model.model_validate_json(\n",
" response.choices[0].message.content\n",
" )\n",
" return dialogue"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eYLRkNiqf2p-"
},
"source": [
"### Load in PDF of Choice\n",
"\n",
"Here we will load in an academic paper that proposes the use of many open source language models in a collaborative manner together to outperform proprietary models that are much larger!"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "6c2nbb7Hu2jV",
"outputId": "9d553a93-582e-4a3a-ece6-756e99bbd71b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2024-10-30 17:38:56-- https://arxiv.org/pdf/2406.04692\n",
"Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.67.42, 151.101.3.42, ...\n",
"Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 1157463 (1.1M) [application/pdf]\n",
"Saving to: ‘2406.04692’\n",
"\n",
"\r2406.04692 0%[ ] 0 --.-KB/s \r2406.04692 100%[===================>] 1.10M --.-KB/s in 0.04s \n",
"\n",
"2024-10-30 17:38:57 (25.5 MB/s) - ‘2406.04692’ saved [1157463/1157463]\n",
"\n"
]
}
],
"source": [
"#https://arxiv.org/abs/2406.04692\n",
"\n",
"!wget https://arxiv.org/pdf/2406.04692\n",
"!mv 2406.04692 MoA.pdf"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "Rn-lhgqmueWM"
},
"outputs": [],
"source": [
"def get_PDF_text(file : str):\n",
" text = ''\n",
"\n",
" # Read the PDF file and extract text\n",
" try:\n",
" with Path(file).open(\"rb\") as f:\n",
" reader = PdfReader(f)\n",
" text = \"\\n\\n\".join([page.extract_text() for page in reader.pages])\n",
" except Exception as e:\n",
" raise f\"Error reading the PDF file: {str(e)}\"\n",
"\n",
" # Check if the PDF has more than ~131,072 characters\n",
" # The context lenght limit of the model is 131,072 tokens and thus the text should be less than this limit\n",
" if len(text) > 131072:\n",
" raise \"The PDF is too long. Please upload a PDF with fewer than ~131072 characters.\"\n",
"\n",
" return text"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 107
},
"id": "D9BzDxmgvS2V",
"outputId": "fd959081-ffbe-42bd-eb31-59fc9947bd98"
},
"outputs": [
{
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'Mixture-of-Agents Enhances Large Language Model\\nCapabilities\\nJunlin Wang\\nDuke University\\nTogether AI\\njunlin.wang2@duke.edu\\nJue Wang\\nTogether AI\\njue@together.ai\\nBen Athiwaratkun\\nTogether AI\\nben@together.ai\\nCe Zhang\\nUniversity of Chicago\\nTogether AI\\ncez@uchicago.edu\\nJames Zou\\nStanford University\\nTogether AI\\njamesz@stanford.edu\\nAbstract\\nRecent advances in large language models (LLMs) demonstrate substantial capa-\\nbilities in natural language understanding and generation tasks. With the growing\\nnumber of LLMs, how to harness the collective expertise of multiple LLMs is an\\nexciting open direction. Toward this goal, we propose a new approach that lever-\\nages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA)\\nmethodology. In our approach, we construct a layered MoA architecture wherein\\neach layer comprises multiple LLM agents. Each agent takes all the outputs from\\nagents in the previous layer as auxiliary information in generating its response.\\nMoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and\\nFLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source\\nLLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of\\n65.1% compared to 57.5% by GPT-4 Omni.1\\n1 Introduction\\nLarge language models (LLMs) (Zhang et al., 2022a; Chowdhery et al., 2022; Touvron et al., 2023a;\\nTeam et al., 2023; Brown et al., 2020; OpenAI, 2023) have significantly advanced the field of natural\\nlanguage understanding and generation in recent years. These models are pretrained on vast amounts\\nof data and subsequently aligned with human preferences to generate helpful and coherent outputs\\n(Ouyang et al., 2022). However, despite the plethora of LLMs and their impressive achievements,\\nthey still face inherent constraints on model size and training data. Further scaling up these models is\\nexceptionally costly, often requiring extensive retraining on several trillion tokens.\\nAt the same time, different LLMs possess unique strengths and specialize in various tasks aspects.\\nFor instance, some models excel at complex instruction following (Xu et al., 2023a) while others may\\nbe better suited for code generation (Roziere et al., 2023; Guo et al., 2024). This diversity in skill sets\\namong different LLMs presents an intriguing question: Can we harness the collective expertise of\\nmultiple LLMs to create a more capable and robust model?\\nOur answer to this question isYes. We identify an inherent phenomenon we term thecollaborativeness\\nof LLMs — wherein an LLM tends to generate better responses when presented with outputs\\nfrom other models, even if these other models are less capable by itself. Figure 1 showcases\\nthe LC win rate on the AlpacaEval 2.0 benchmark (Dubois et al., 2024) for 6 popular LLMs.\\n1Our code can be found in: https://github.com/togethercomputer/moa.\\nPreprint. Under review.\\narXiv:2406.04692v1 [cs.CL] 7 Jun 2024\\n\\n[Prompt]\\nA1,1\\nA1,2\\nA1,3\\n[Intermediate Output]\\nA2,1\\nA2,2\\nA2,3\\n[Intermediate Output]\\nA4,1\\nLayer 1\\nA3,1\\nA3,2\\nA3,3\\n[Intermediate Output]\\nLayer 2 Layer 3 Layer 4\\n[Final Output]\\nconcatenate\\nAgent:\\nToken:\\nAi,j\\nFigure 2: Illustration of the Mixture-of-Agents Structure. This example showcases 4 MoA layers\\nwith 3 agents in each layer. 
The agents here can share the same model.\\nFigure 1: AlpacaEval 2.0 LC win rates im-\\nprove when provided with responses from\\nother models.\\nWhen these models are provided with answers gen-\\nerated independently by these models, their LC win\\nrates significantly improve. This indicates that the\\ncollaborativeness phenomenon is widespread among\\nLLMs. Remarkably, this improvement occurs even\\nwhen the auxiliary responses provided by the other\\nmodels are of lower quality than what an individual\\nLLM could generate independently.\\nBased on this finding, this paper introduces a Mixture-\\nof-Agents (MoA) methodology that leverages multi-\\nple LLMs to iteratively enhance the generation qual-\\nity. The structure of MoA is illustrated in Figure 2.\\nInitially, LLMs in the first layer, denoted as agents\\nA1,1, ...A1,n independently generate responses to a\\ngiven prompt. These responses are then presented\\nto agents in the next layer A2,1, ...A2,n (which may reuse a model from the first layer) for further\\nrefinement. This iterative refinement process continues for several cycles until obtaining a more\\nrobust and comprehensive response.\\nTo ensure effective collaboration among models and improve overall response quality, careful\\nselection of LLMs for each MoA layer is crucial. This selection process is guided by two primary\\ncriteria: (a) Performance Metrics: The average win rate of models in layer i plays a significant role in\\ndetermining their suitability for inclusion in layer i + 1. Therefore, selecting models based on their\\ndemonstrated performance metrics ensures higher-quality outputs. (b) Diversity Considerations: The\\ndiversity of model outputs is also crucial. Responses generated by heterogeneous models contribute\\nsignificantly more than those produced by the same model as we show later in section 3.3. By\\nleveraging these criteria — performance and diversity — MoA aims to mitigate individual model\\ndeficiencies and enhance overall response quality through collaborative synthesis.\\nWe conduct comprehensive evaluations using AlpacaEval 2.0, MT-Bench (Zheng et al., 2023), FLASK\\n(Ye et al., 2023) benchmarks for assessing the response quality across various dimensions. The results\\ndemonstrate substantial improvements with our proposed method, achieving a new SOTA win rate of\\n65.8% on AlpacaEval 2.0 compared to the previous best of 57.5% achieved by GPT-4 Omni.\\nThe contributions of this work are summarized as follows: (1) Novel framework: we propose\\na Mixture-of-Agents framework designed to leverage the strengths of multiple LLMs, thereby\\nimproving their reasoning and language generation capabilities. (2) Finding of collaborativeness\\nof language models: we highlight the inherit collaborativeness among LLMs, where models tend\\nto generate better quality responses when they have access to outputs from other models, even if\\nthose outputs are of lower quality. (3) State-of-the-art LLM performance: we conducted extensive\\nexperiments using multiple highly-competitive benchmarks such as AlpacaEval 2.0, MT-Bench, and\\nFLASK; our MoA framework achieves state-of-the-art performance on these benchmarks.\\n2\\n\\n2 Mixture-of-Agents Methodology\\nIn this section, we present our proposed methodology for leveraging multiple models to achieve\\nboosted performance. We begin by demonstrating that LLMs possess collaborativeness and thus\\ncan improve their responses based on the outputs of other models. 
Following this, we introduce the\\nMixture-of-Agents methodology and discuss its design implications.\\n2.1 Collaborativeness of LLMs\\nWe begin by demonstrating the collaborativeness of LLMs, specifically their ability to generate higher\\nquality responses when they can reference outputs from other models. As we have shown in the\\nintroduction and Figure 1, many of today’s available LLMs exhibit this collaborative capability.\\nAn important pathway to extract maximum benefits from collaboration of multiple LLMs is to\\ncharacterize how different models are good at in various aspects of collaboration. During the\\ncollaboration process, we can categorize LLMs into two distinct roles:\\nProposers excel at generating useful reference responses for use by other models. While a good\\nproposer may not necessarily produce responses with high scores by itself, it should offer more\\ncontext and diverse perspectives, ultimately contributing to better final responses when used by an\\naggregator.\\nAggregators are models proficient in synthesizing responses from other models into a single, high-\\nquality output. An effective aggregator should maintain or enhance output quality even when\\nintegrating inputs that are of lesser quality than its own.\\nSection 3.3 empirically validate the roles of aggregators and proposers. Specifically, we show that\\nmany LLMs possess capabilities both as aggregators and proposers, while certain models displayed\\nspecialized proficiencies in distinct roles. GPT-4o, Qwen1.5, LLaMA-3 emerged as a versatile model\\neffective in both assisting and aggregating tasks. In contrast, WizardLM demonstrated excellent\\nperformance as an proposer model but struggled to maintain its effectiveness in aggregating responses\\nfrom other models.\\nGiven that an aggregator can generate higher-quality responses by building upon outputs from\\nother models, we propose further enhancing this collaborative potential by introducing additional\\naggregators. One intuitive idea is to replicate the exercise with multiple aggregators — initially\\nusing several to aggregate better answers and then re-aggregating these aggregated answers. By\\nincorporating more aggregators into the process, we can iteratively synthesize and refine the responses,\\nleveraging the strengths of multiple models to produce superior outcomes. This leads to the design of\\nour proposed Mixture-of-Agents.\\n2.2 Mixture-of-Agents\\nThe structure of MoA is illustrated in Figure 2. It has l layers and each layer-i consists of n LLMs,\\ndenoted by Ai,1, Ai,2, ..., Ai,n. It is important to note that LLMs can be reused either within the\\nsame layer or across different layers. When many LLMs in a layer are identical, this configuration\\nleads to a special structure that corresponds to a model generating multiple possibly different outputs\\n(due to the stochasticity of temperature sampling). We refer to this setting as single-proposer, where\\nonly a sparse subset of models are activated.\\nHere, each LLM Ai,j processes an input text and generates its continuation. Our method does not\\nrequire any fine-tuning and only utilizes the interface of prompting and generation of LLMs. 
Formally,\\ngiven an input prompt x1, the output of i-th MoA layer yi can be expressed as follows:\\nyi = ⊕n\\nj=1[Ai,j(xi)] + x1, xi+1 = yi (1)\\nwhere + here means concatenation of texts; ⊕ means application of the Aggregate-and-Synthesize\\nprompt shown in Table 1 to these model outputs.\\nIn practice, we do not need to concatenate prompt and all model responses so only one LLM is needed\\nto be used in the last layer. Therefore, we use the output of an LLM from the l-th layer (Al,1(xl)) as\\nthe final output and evaluate the metrics based on it.\\n3\\n\\nTable 1: Aggregate-and-Synthesize Prompt to integrate responses from other models.\\nYou have been provided with a set of responses from various open-source models to the latest user query. Your\\ntask is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the\\ninformation provided in these responses, recognizing that some of it may be biased or incorrect. Your response\\nshould not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply\\nto the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of\\naccuracy and reliability.\\nResponses from models:\\n1. [Model Response from Ai,1]\\n2. [Model Response from Ai,2]\\n...\\nn. [Model Response from Ai,n]\\n2.3 Analogy to Mixture-of-Experts\\nMixture-of-Experts (MoE) (Shazeer et al., 2017) is a prominent and well-established technique\\nin machine learning where multiple expert networks specialize in different skill sets. The MoE\\napproach has shown significant success across various applications due to its ability to leverage\\ndiverse model capabilities for complex problem-solving tasks. Our MoA method draws inspiration\\nfrom this methodology.\\nA typical MoE design consists of a stack of layers known as MoE layers. Each layer comprises a\\nset of n expert networks alongside a gating network and includes residual connections for improved\\ngradient flow. Formally, for layer i, this design can be expressed as follows:\\nyi =\\nnX\\nj=1\\nGi,j(xi)Ei,j(xi) +xi (2)\\nwhere Gi,j represents the output from the gating network corresponding to expert j, and Ei,j denotes\\nthe function computed by expert network j. The leverage of multiple experts allows the model to\\nlearn different skill sets and focus on various aspects of the task at hand.\\nFrom a high-level perspective, our proposed MoA framework extends the MoE concept to the model\\nlevel by operating at the model level rather than at the activation level. Specifically, our MoA approach\\nleverages LLMs and operates entirely through the prompt interface rather than requiring modifications\\nto internal activations or weights. This means that instead of having specialized sub-networks within\\na single model like in MoE, we utilize multiple full-fledged LLMs across different layers. 
Note that\\nin our approach, we consolidate the roles of the gating network and expert networks using a LLM, as\\nthe intrinsic capacity of LLMs allows them to effectively regularize inputs by interpreting prompts\\nand generating coherent outputs without needing external mechanisms for coordination.\\nMoreover, since this method relies solely on prompting capabilities inherent within off-the-shelf\\nmodels: (1) It eliminates computational overhead associated with fine-tuning; (2) It provides flexibility\\nand scalability: our method can be applied to the latest LLMs regardless of their size or architecture.\\n3 Evaluation\\nThis section presents a comprehensive evaluation of our proposed MoA. Our findings show that:\\n1. We achieve significant improvements on AlpacaEval 2.0, MT-Bench, and FLASK bench-\\nmarks. Notably, with open-source models only, our approach outperforms GPT-4o on\\nAlpacaEval 2.0 and FLASK.\\n2. We conduct extensive experiments to provide better understandings of the internal mecha-\\nnism of MoA.\\n3. Through a detailed budget analysis, several implementations of MoA can deliver perfor-\\nmance comparable to GPT-4 Turbo while being 2× more cost-effective.\\n4\\n\\nTable 2: Results on AlpacaEval 2.0 and MT-Bench. For AlpacaEval 2.0, MoA and MoA-Lite\\ncorrespond to the 6 proposer with 3 layers and with 2 layer respectively. MoA w/ GPT-4o corresponds\\nto using GPT-4o as the final aggregator in MoA. We ran our experiments three times and reported the\\naverage scores along with the standard deviation. † denotes our replication of the AlpacaEval results.\\nWe ran all the MT-Bench scores ourselves to get turn-based scores.\\n(a) AlpacaEval 2.0\\nModel LC win. win.\\nMoA w/ GPT-4o 65.7±0.7% 78.7±0.2%\\nMoA 65.1±0.6% 59.8±0.3%\\nMoA-Lite 59.3±0.2% 57.0±0.7%\\nGPT-4 Omni (05/13) 57.5% 51.3%\\nGPT-4 Turbo (04/09) 55.0% 46.1%\\nWizardLM 8x22B† 51.3% 62.3%\\nGPT-4 Preview (11/06) 50.0% 50.0%\\nQwen1.5 110B Chat 43.9% 33.8%\\nQwen1.5 72B Chat 36.6% 26.5%\\nGPT-4 (03/14) 35.3% 22.1%\\nLlama 3 70B Instruct 34.4% 33.2%\\nMixtral 8x22B v0.1 30.9% 22.2%\\n(b) MT-Bench.\\nModel Avg. 1st turn 2nd turn\\nMoA w/ GPT-4o 9.40±0.06 9.49 9.31\\nGPT-4 Turbo (04/09) 9.31 9.35 9.28\\nMoA 9.25±0.10 9.44 9.07\\nGPT-4 Preview (11/06) 9.20 9.38 9.03\\nGPT-4 Omni (05/13) 9.19 9.31 9.07\\nMoA-Lite 9.18±0.09 9.38 8.99\\nQwen1.5 110B Chat 8.96 9.23 8.63\\nLlama 3 70B Instruct 8.94 9.2 8.68\\nMixtral 8x22B v0.1 8.78 9.11 8.44\\nWizardLM 8x22B 8.78 8.96 8.61\\nQwen1.5 72B Chat 8.44 8.55 8.34\\nGPT-4 (06/13) 8.84 9.08 8.61\\n3.1 Setup\\nBenchmarks We mainly evaluate models on AlpacaEval 2.0 (Dubois et al., 2024), a leading\\nbenchmark for assessing the alignment of LLMs with human preferences. It contains 805 instructions\\nrepresentative of real use cases. Each model’s response is directly compared against that of the GPT-4\\n(gpt-4-1106-preview), with a GPT-4-based evaluator determining the likelihood of preferring the\\nevaluated model’s response. To ensure fairness, the evaluation employs length-controlled (LC) win\\nrates, effectively neutralizing length bias.2\\nAdditionally, we also evaluate on MT-Bench (Zheng et al., 2023) and FLASK (Ye et al., 2023).\\nMT-Bench uses GPT-4 to grade and give a score to model’s answer. FLASK, on the other hand, offers\\na more granular evaluation with 12 skill-specific scores.\\nModels In our study, we constructed our default MoA by using only open-source models to achieve\\ncompetitive performance. 
The models included are: Qwen1.5-110B-Chat (Bai et al., 2023), Qwen1.5-\\n72B-Chat, WizardLM-8x22B (Xu et al., 2023a), LLaMA-3-70B-Instruct (Touvron et al., 2023b),\\nMixtral-8x22B-v0.1 (Jiang et al., 2024), dbrx-instruct (The Mosaic Research Team, 2024). We\\nconstruct 3 MoA layers and use the same set of models in each MoA layer. We use Qwen1.5-110B-\\nChat as the aggregator in the last layer. We also developed a variant called MoA w/ GPT-4o, which\\nprioritizes high-quality outputs by using GPT-4o as the aggregator in the final MoA layer. Another\\nvariant, MoA-Lite, emphasizes cost-effectiveness. It uses the same set of models as proposers but\\nincludes only 2 MoA layers and employs Qwen1.5-72B-Chat as the aggregator. This makes it more\\ncost-effective than GPT-4o while achieving a1.8% improvement in quality on AlpacaEval 2.0. We\\nensure strict adherence to the licensing terms of all models utilized in this research. For open-source\\nmodels, all inferences were ran through Together Inference Endpoint.3\\n3.2 Benchmark Results\\nIn this subsection, we present our evaluation results on three standard benchmarks: AlpacaEval 2.0,\\nMT-Bench, and FLASK. These benchmarks were chosen to comprehensively assess the performance\\nof our approach and compare with the state-of-the-art LLMs.\\n2This metric tracks closely with human preferences, achieving a Spearman correlation of 0.98 with actual\\nhuman evaluations (Dubois et al., 2024).\\n3https://api.together.ai/playground/chat\\n5\\n\\nAlpacaEval 2.0 We conducted comparisons against leading models such as GPT-4 and other\\nstate-of-the-art open-source models. The detailed results are presented in Table 2a where our MoA\\nmethodology achieved top positions on the AlpacaEval 2.0 leaderboard, demonstrating a remarkable\\n8.2% absolute improvement over the previous top model, GPT-4o. Moreover, it is particularly\\nnoteworthy that our model outperformed GPT-4o using solely open-source models, achieving a\\nmargin of 7.6% absolute improvement from 57.5% (GPT-4o) to 65.1% (MoA). Our MoA-Lite setup\\nuses less layers and being more cost-effective. Even with this lighter approach, we still outperform the\\nbest model by 1.8%, improving from 57.5% (GPT-4o) to 59.3% (MoA-Lite). This further highlights\\nthe effectiveness of our method in leveraging open-source models capabilities with varying compute\\nbudget to their fullest potential.\\nrobustness\\ncorrectness\\nefficiency\\nfactuality\\ncommonsense\\ncomprehension\\ninsightfulness\\ncompleteness\\nmetacognition\\nreadability\\nconciseness\\nharmlessness\\n3 3.5 4 4.5 5\\nGPT-4 Omni (05/13)\\nGPT-3.5-turbo-0125\\nQwen1.5-110B-Chat\\nMoA\\nFigure 3: Results on FLASK where we use the 6 pro-\\nposer MoA setup and Qwen1.5-110B-Chat is the aggre-\\ngator.\\nMT-Bench Though improvements over\\nindividual models on the MT-Bench are rel-\\natively incremental, this is understandable\\ngiven that current models already perform\\nexceptionally well on this benchmark, as\\na single model alone can achieve scores\\ngreater than 9 out of 10. Despite the\\nmarginal enhancements, our approach still\\nsecures the top position on the leaderboard.\\nThis demonstrates that even with already\\nhighly optimized benchmarks, our method\\ncan push the boundaries further, maintain-\\ning the leadership.\\nFLASK FLASK provides fine-grained\\nevaluation of models. 
Among those met-\\nrics, MoA excels in several key aspects.\\nSpecifically, our methodology shows signif-\\nicant improvement in robustness, correct-\\nness, efficiency, factuality, commonsense,\\ninsightfulness, completeness, compared to\\nthe single model score of the aggregator,\\nQwen-110B-Chat. Additionally, MoA also\\noutperforms GPT-4 Omni in terms of correctness, factuality, insightfulness, completeness, and\\nmetacognition. One metric where MoA did not do as well was conciseness; the model produced\\noutputs that were marginally more verbose.\\n3.3 What Makes Mixture-of-Agents Work Well?\\nIn this subsection, we conduct experiments that provide us better understandings of the internal\\nmechanism of Mixture-of-Agents. We summarize key insights below.\\nMixture-of-Agents significantly outperforms LLM rankers. First, we compare Mixture-of-\\nAgents with an LLM-based ranker which uses the aggregator model to select one of the answers that\\nare generated by the proposers, instead of generating a new output. The results are shown in Figure 4,\\nwhere we can observe that the MoA approach significantly outperforms an LLM-ranker baseline. The\\nfact that MoA outperforms the ranking approach suggests that the aggregator does not simply select\\none of the generated answers by the proposers, but potentially performs sophisticated aggregation\\nover all proposed generations.\\nMoA tends to incorporate the best proposed answers. We also compare the aggregator’s response\\nwith the proposers’ responses via similarity scores such as BLEU (Papineni et al., 2002) which reflects\\nn-gram overlaps. Within each sample, givenn proposed answers by the proposers, we calculate the the\\nSpearman’s rank correlation coefficient between n similar scores and n preference scores determined\\nby the GPT-4 based evaluator. The results in Figure 4 indeed confirms a positive correlation between\\nthe win rate and the BLEU score. We also provide results with Levenshtein similarity (RapidFuzz,\\n2023) or TF-IDF as opposed to BLEU scores in Appendix A. where both alternative approaches for\\ntextual similarities also yield positive correlation with the preference scores.\\n6\\n\\nLayer 1 Layer 2 Layer 3 Layer 4\\n \\n20\\n30\\n40\\n50\\n60\\n70LC win rate\\nGPT-4 Preview\\nGPT-4 Omni\\nGPT-4o\\nQwen1.5-110B-Chat\\nQwen1.5-72B-Chat\\nWizard 8x22b\\nMixtral-8x22B-Instruct-v0.1\\nLlama-3-70B-Instruct\\ndbrx-instruct\\nLLM-Ranker\\n0.00 0.05 0.10 0.15 0.20 0.25 0.30\\nSpearman correlation coefficient\\nQWen1.5-110B\\nQWen1.5-72B\\nWizardLM\\nLlama-3-70B\\nMixtral-8x22B\\ndbrx-instruct\\nAggregator\\nAggregation\\n1st aggregation\\n2nd aggregation\\n3rd aggregation\\nFigure 4: (a) LC win rate on AlpacaEval 2.0 with different aggregators in the 6-model Mixture-of-\\nAgents setup. All the curves use the same 6 proposer agents; they only differ in the choice of the final\\naggregator. The LLM ranker uses Qwen1.5-110B-Chat model with a prompt format in Appendix\\nTable 5. The GPT-4o model is only used to aggregate the output for the purpose of evaluation and\\ndoes not participate as a proposer towards the next layer. (b) Spearman correlation between BLEU\\nscores (calculated using 3-gram, 4-gram, and 5-gram metrics) and win rate of the proposed outputs.\\nTable 3: Effects of the number of proposer models\\non AlpacaEval 2.0. 
We denote n as either the\\nnumber of agents in an MoA layer or the number\\nof proposed outputs in the single-proposer setting.\\nWe use Qwen1.5-110B-Chat as the aggregator\\nand use 2 MoA layers for all settings in this table.\\nSetting Multiple-Proposer Single-Proposer\\nn = 6 61.3% 56.7%\\nn = 3 58.0% 56.1%\\nn = 2 58.8% 54.5%\\nn = 1 47.8% 47.8%\\nTable 4: Impact of different models serving as\\nproposers vs aggregators. When evaluating differ-\\nent aggregators, all six models serve as proposers;\\nwhen evaluating proposers, Qwen1.5-110B-Chat\\nserves as the aggregator. We use 2 MoA layers in\\nthis table.\\nModel As aggregator As proposer\\nQwen1.5-110B-Chat 61.3% 56.7%\\nQwen1.5-72B-Chat 59.3% 53.3%\\nLLaMA-3-70b-Instruct 45.0% 60.6%\\nWizardLM 8x22B 52.9% 63.8%\\nMixtral-8x22B-Instruct 48.4% 54.8%\\ndbrx-instruct 41.5% 55.1%\\nEffect of model diversity and the number of proposers. We analyze how the number of proposals\\naffect the final output quality by varying n, the number of proposers in each layer. We show the\\nresults in Table 3 where we find that scores increases monotonically with n, reflecting the benefits\\nof having more auxiliary information. In addition, we also quantify the impact of using a diverse\\nset of LLMs as proposers. For each n, we compare two settings: “single-proposer” where the n\\nresponses are generated by the same LLM with a temperature of 0.7; and “multiple-proposer” where\\neach response is generated by a different LLMs. Overall, using multiple different LLMs consistently\\nyielded better results. Both results suggest that having a larger number of diverse LLM agents in each\\nMoA layer can improve performance. Further scaling the width of MoA is a promising direction of\\nfuture investigation.\\nSpecialization of models in the Mixture-of-Agent ecosystem. We also conducted experiments\\nto determine which models excel in specific roles. Specifically, Table 4 shows that GPT-4o, Qwen,\\nLLaMA-3 emerged as a versatile model effective in both assisting and aggregating tasks. In contrast,\\nWizardLM demonstrated excellent performance as an proposer model but struggled to maintain its\\neffectiveness in aggregating responses from other models.\\n3.4 Budget and Token Analysis\\nTo understand the relationship between budget, token usage, and LC win rates, we conducted a budget\\nand token analysis. Figure 5a and Figure 5b illustrate these relationships.\\n7\\n\\n0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035\\nCost\\n25\\n30\\n35\\n40\\n45\\n50\\n55\\n60\\n65Score\\nGPT-4o\\nGPT-4-turbo\\nMoA-Lite\\nMoA\\nmodel type\\nMulti Proposer\\nSingle Proposer\\nlayer\\n1\\n2\\n3\\n(a) LC win rate vs. cost\\n50 100 150 200 250 300 350\\ntflops\\n25\\n30\\n35\\n40\\n45\\n50\\n55\\n60\\n65Score\\nMoA\\nMoA-Lite\\nGPT-4o\\nGPT-4-turbo\\nmodel type\\nMulti Proposer\\nSingle Proposer\\nlayer\\n1\\n2\\n3 (b) LC win rate vs. tflops\\nFigure 5: (a) Performance trade-off versus cost. (b) Performance trade-off versus the number of tera\\nfloating operations (tflops), which we use as a proxy for latency. Note that we calculate the sum\\nover layers of the max number of tflops among proposers in each MoA layer as multiple proposers\\ncan run in parallel. Our plots illustrate a Pareto frontier where we can choose a model progressively\\nhigher score with the lowest cost for such level of performance. 
We show that the Mixture-of-Agents\\napproach lie on this Pareto front, as opposed to GPT-4 Turbo and GPT-4o which are not cost optimal\\nand is more expensive compared to MoA approaches of the same LC win rate. Single Proposer: uses\\nthe same model to generate multiple responses in each MoA layer; Multi Proposer: uses different\\nmodels in each MoA layer. The actual tflops of GPT-4 is unknown, so we use the rumored size from\\nthe community of an 8x220B architecture.\\nCost Effectiveness In Figure 5a, we plot the LC win rate against the average inference cost for\\neach instance in the AplacaEval 2.0 benchmark. The cost is calculated based on pricing information\\navailable from API provider websites.4 This helps identify cost-effective models that achieve high\\nperformance without incurring excessive expenses. The chart reveals a Pareto front where certain\\nmodels strike an optimal balance between cost and performance. Models closer to this Pareto front\\nare more desirable as they provide better monetary value by delivering high LC win rates at lower\\ncosts. Specifically, if we prioritize the quality, MoA is the best configuration. However, if we want to\\nstrike a good balance between quality and cost, MoA-Lite can match GPT-4o’s cost while achieving\\nhigher level of quality. Notably, it outperforms GPT-4 Turbo by approximately4% while being more\\nthan twice as cost-effective.\\nTflops Consumption Figure 5b depicts the relationship between LC win rate and the number of\\ntflops. Here we use the number of tflops as a proxy for latency since latency can vary depending on\\nthe inference systems. This analysis is crucial for understanding how different models manage their\\nbudgets while maintaining or improving performance levels. Similar to the cost efficiency analysis, a\\nPareto front can be observed here as well. Models on this front effectively utilize their computational\\nresource to maximize their LC win rate.\\n4 Related Work\\n4.1 LLM Reasoning\\nIn order to improve generation quality of LLMs, recent researches have experienced great progresses\\nin optimizing LLMs to various downstream tasks through prompt engineering. Chain of Thought\\n(CoT) (Wei et al., 2022; Kojima et al., 2022) prompting techniques represent a linear problem-\\nsolving approach where each step builds upon the previous one. Fu et al. (2022) applied CoT to\\nmulti-step reasoning tasks. To automate CoT prompting, Auto-CoT (Zhang et al., 2022b) constructs\\ndemonstrations by sampling diverse questions and generating reasoning chains. Active-Prompt (Diao\\n4For open-source models, we calculate the price using data from https://api.together.ai/models;\\nfor OpenAI models, we use pricing details from https://openai.com/api/pricing/. Pricing data was\\nretrieved as of May 22, 2024.\\n8\\n\\net al., 2023) focuses on selecting the most uncertain questions for task-specific annotations. PS\\nPrompt (Wang et al., 2023) decomposes tasks into subtasks. Tree-of-Thought (ToT) (Yao et al., 2023a)\\nexpands on the reasoning process by considering multiple paths of reasoning and self-evaluating\\nchoices. Effective Graph-of-Thought (Yao et al., 2023b) frames thoughts as graphs. Natural Program\\nprompting (Ling et al., 2023) is proposed for better solving deductive reasoning tasks. 
And re-reading\\nprompt (Xu et al., 2023b) revisits question information embedded within input prompts.\\n4.2 Model Ensemble\\nA straightforward solution to leverage the strengths of multiple models is reranking outputs from\\ndifferent models. For instance, Jiang et al. (2023) introduce PAIR RANKER , which performs pairwise\\ncomparisons on candidate outputs to select the best one, showing improvements on a self-constructed\\ninstruction dataset. To address the substantial computational costs associated with multi-LLM\\ninference, other studies have explored training a router that predicts the best-performing model\\nfrom a fixed set of LLMs for a given input (Wang et al., 2024a; Shnitzer et al., 2024; Lu et al.,\\n2023). Additionally, FrugalGPT (Chen et al., 2023b) proposed reducing the cost of using LLMs\\nby employing different models in a cascading manner. In order to better leverage the responses of\\nmultiple models, Jiang et al. (2023) trained a GENFUSER , a model that was trained to generate an\\nimproved response to capitalize on the strengths of multiple candidates. Huang et al. (2024) proposed\\nto fuse the outputs of different models by averaging their output probability distributions.\\nAnother line of work is multi-agent collaboration. Several studies explore using multiple large\\nlanguage models as agents that collectively discuss and reason through given problems interactively.\\nDu et al. (2023) establishes a mechanism for symmetric discussions among agents. Around the same\\ntime, MAD (Liang et al., 2023) introduces an asymmetric mechanism design, with different roles, i.e.,\\ndebater and judge. Other similar works include (Chan et al., 2023). Moreover, ReConcile (Chen et al.,\\n2023a) exemplifies an asymmetric discussion involving weighted voting. To understand discussion\\nmore deeply, Zhang et al. (2023) aim to explain such collaboration mechanism in a social psychology\\nview. Wang et al. (2024b) systematically compared multi-agent approaches and found a single agent\\nwith a strong prompt including detailed demonstrations can achieve comparable response quality to\\nmulti-agent approaches.\\n5 Conclusion\\nThis paper introduces a Mixture-of-Agents approach aimed at leveraging the capabilities of multiple\\nLLMs via successive stages for iterative collaboration. Our method harnesses the collective strengths\\nof agents in the Mixture-of-Agents family, and can significantly improve upon the output quality of\\neach individual model. Empirical evaluations conducted on AlpacaEval 2.0, MT-Bench, and FLASK\\ndemonstrated substantial improvements in response quality, with our approach achieving the LC win\\nrate up to 65%. These findings validate our hypothesis that integrating diverse perspectives from\\nvarious models can lead to superior performance compared to relying on a single model alone. In\\naddition, we provide insights into improving the design of MoA; systematic optimization of MoA\\narchitecture is an interesting direction for future work.\\nLimitations. Our proposed method requires iterative aggregation of model responses, which means\\nthe model cannot decide the first token until the last MoA layer is reached. This potentially results\\nin a high Time to First Token (TTFT), which can negatively impact user experience. To mitigate\\nthis issue, we can limit the number of MoA layers, as the first response aggregation has the most\\nsignificant boost on generation quality. 
Future work could explore chunk-wise aggregation instead of\\naggregating entire responses at once, which can reduce TTFT while maintaining response quality.\\nBroader Impact. This study holds the potential to enhance the effectiveness of LLM-driven chat\\nassistants, thereby making AI more accessible. Moreover, since the intermediate outputs that are\\nexpressed in natural language, MoA presented improves the interpretability of models. This enhanced\\ninterpretability facilitates better alignment with human reasoning.\\n9\\n\\nReferences\\nBai, J., Bai, S., Chu, Y ., Cui, Z., Dang, K., Deng, X., Fan, Y ., Ge, W., Han, Y ., Huang, F., et al. Qwen\\ntechnical report. arXiv preprint arXiv:2309.16609, 2023.\\nBrown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam,\\nP., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural\\ninformation processing systems, 33:1877–1901, 2020.\\nChan, C.-M., Chen, W., Su, Y ., Yu, J., Xue, W., Zhang, S., Fu, J., and Liu, Z. Chateval: Towards\\nbetter llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023.\\nChen, J. C.-Y ., Saha, S., and Bansal, M. Reconcile: Round-table conference improves reasoning via\\nconsensus among diverse llms. arXiv preprint arXiv:2309.13007, 2023a.\\nChen, L., Zaharia, M., and Zou, J. Frugalgpt: How to use large language models while reducing cost\\nand improving performance. arXiv preprint arXiv:2305.05176, 2023b.\\nChowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung,\\nH. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. arXiv\\npreprint arXiv:2204.02311, 2022.\\nDiao, S., Wang, P., Lin, Y ., and Zhang, T. Active prompting with chain-of-thought for large language\\nmodels. arXiv preprint arXiv:2302.12246, 2023.\\nDu, Y ., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. Improving factuality and reasoning\\nin language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.\\nDubois, Y ., Galambosi, B., Liang, P., and Hashimoto, T. B. Length-controlled alpacaeval: A simple\\nway to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.\\nFu, Y ., Peng, H., Sabharwal, A., Clark, P., and Khot, T. Complexity-based prompting for multi-step\\nreasoning. arXiv preprint arXiv:2210.00720, 2022.\\nGuo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y ., Li, Y ., et al.\\nDeepseek-coder: When the large language model meets programming–the rise of code intelligence.\\narXiv preprint arXiv:2401.14196, 2024.\\nHendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J.\\nMeasuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,\\n2021.\\nHuang, Y ., Feng, X., Li, B., Xiang, Y ., Wang, H., Qin, B., and Liu, T. Enabling ensemble learn-\\ning for heterogeneous large language models with deep parallel collaboration. arXiv preprint\\narXiv:2404.12715, 2024.\\nJiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S.,\\nde Las Casas, D., Hanna, E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L. R.,\\nSaulnier, L., Lachaux, M., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Scao, T. L., Gervet, T.,\\nLavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mixtral of experts.CoRR, abs/2401.04088, 2024.\\ndoi: 10.48550/ARXIV .2401.04088. 
URLhttps://doi.org/10.48550/arXiv.2401.04088.\\nJiang, D., Ren, X., and Lin, B. Y . LLM-blender: Ensembling large language models with pairwise\\nranking and generative fusion. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.),Proceedings\\nof the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long\\nPapers), pp. 14165–14178, Toronto, Canada, July 2023. Association for Computational Linguistics.\\ndoi: 10.18653/v1/2023.acl-long.792. URL https://aclanthology.org/2023.acl-long.\\n792.\\nKojima, T., Gu, S. S., Reid, M., Matsuo, Y ., and Iwasawa, Y . Large language models are zero-shot\\nreasoners. Advances in neural information processing systems, 35:22199–22213, 2022.\\nLiang, T., He, Z., Jiao, W., Wang, X., Wang, Y ., Wang, R., Yang, Y ., Tu, Z., and Shi, S. Encour-\\naging divergent thinking in large language models through multi-agent debate. arXiv preprint\\narXiv:2305.19118, 2023.\\n10\\n\\nLing, Z., Fang, Y ., Li, X., Huang, Z., Lee, M., Memisevic, R., and Su, H. Deductive verification of\\nchain-of-thought reasoning. arXiv preprint arXiv:2306.03872, 2023.\\nLu, K., Yuan, H., Lin, R., Lin, J., Yuan, Z., Zhou, C., and Zhou, J. Routing to the expert: Efficient\\nreward-guided ensemble of large language models, 2023.\\nOpenAI. Gpt-4 technical report, 2023.\\nOuyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S.,\\nSlama, K., Ray, A., et al. Training language models to follow instructions with human feedback.\\nAdvances in neural information processing systems, 35:27730–27744, 2022.\\nPapineni, K., Roukos, S., Ward, T., and Zhu, W. Bleu: a method for automatic evaluation of machine\\ntranslation. In Proceedings of the 40th Annual Meeting of the Association for Computational\\nLinguistics, July 6-12, 2002, Philadelphia, PA, USA, pp. 311–318. ACL, 2002. doi: 10.3115/\\n1073083.1073135. URL https://aclanthology.org/P02-1040/.\\nRapidFuzz. python-levenshtein by rapidfuzz. https://github.com/rapidfuzz/\\npython-Levenshtein, 2023.\\nRoziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y ., Liu, J., Remez, T., Rapin,\\nJ., et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.\\nShazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outra-\\ngeously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint\\narXiv:1701.06538, 2017.\\nShnitzer, T., Ou, A., Silva, M., Soule, K., Sun, Y ., Solomon, J., Thompson, N., and Yurochkin, M.\\nLarge language model routing with benchmark datasets, 2024. URLhttps://openreview.net/\\nforum?id=LyNsMNNLjY.\\nTeam, G., Anil, R., Borgeaud, S., Wu, Y ., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai,\\nA. M., Hauth, A., et al. Gemini: a family of highly capable multimodal models. arXiv preprint\\narXiv:2312.11805, 2023.\\nThe Mosaic Research Team. Introducing dbrx: A new state-of-the-art open llm. 2024. URL\\nhttps://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm .\\nTouvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal,\\nN., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv\\npreprint arXiv:2302.13971, 2023a.\\nTouvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y ., Bashlykov, N., Batra, S.,\\nBhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv\\npreprint arXiv:2307.09288, 2023b.\\nWang, H., Polo, F. 
M., Sun, Y ., Kundu, S., Xing, E., and Yurochkin, M. Fusing models with\\ncomplementary expertise. In The Twelfth International Conference on Learning Representations,\\n2024a. URL https://openreview.net/forum?id=PhMrGCMIRL.\\nWang, L., Xu, W., Lan, Y ., Hu, Z., Lan, Y ., Lee, R. K.-W., and Lim, E.-P. Plan-and-solve prompt-\\ning: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint\\narXiv:2305.04091, 2023.\\nWang, Q., Wang, Z., Su, Y ., Tong, H., and Song, Y . Rethinking the bounds of llm reasoning: Are\\nmulti-agent discussions the key? arXiv preprint arXiv:2402.18272, 2024b.\\nWang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou,\\nD. Self-consistency improves chain of thought reasoning in language models. arXiv preprint\\narXiv:2203.11171, 2022.\\nWei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V ., Zhou, D., et al. Chain-of-\\nthought prompting elicits reasoning in large language models. Advances in Neural Information\\nProcessing Systems, 35:24824–24837, 2022.\\n11\\n\\n0.00 0.05 0.10 0.15 0.20 0.25\\nSpearman correlation coefficient\\nQWen1.5-110B\\nQWen1.5-72B\\nWizardLM\\nLlama-3-70B\\nMixtral-8x22B\\ndbrx-instruct\\nAggregator\\nAggregation\\n1st aggregation\\n2nd aggregation\\n3rd aggregation\\n0.00 0.05 0.10 0.15 0.20 0.25\\nSpearman correlation coefficient\\nQWen1.5-110B\\nQWen1.5-72B\\nWizardLM\\nLlama-3-70B\\nMixtral-8x22B\\ndbrx-instruct\\nAggregator\\nAggregation\\n1st aggregation\\n2nd aggregation\\n3rd aggregation\\nFigure 6: (a) Spearman Correlation using TF-IDF similarity; (b) Spearman Correlation using Leven-\\nshtein similarity.\\nXu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. Wizardlm: Empow-\\nering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244,\\n2023a.\\nXu, X., Tao, C., Shen, T., Xu, C., Xu, H., Long, G., and Lou, J.-g. Re-reading improves reasoning in\\nlanguage models. arXiv preprint arXiv:2309.06275, 2023b.\\nYao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y ., and Narasimhan, K. Tree of thoughts:\\nDeliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023a.\\nYao, Y ., Li, Z., and Zhao, H. Beyond chain-of-thought, effective graph-of-thought reasoning in large\\nlanguage models. arXiv preprint arXiv:2305.16582, 2023b.\\nYe, S., Kim, D., Kim, S., Hwang, H., Kim, S., Jo, Y ., Thorne, J., Kim, J., and Seo, M. Flask: Fine-\\ngrained language model evaluation based on alignment skill sets. arXiv preprint arXiv:2307.10928,\\n2023.\\nZhang, J., Xu, X., and Deng, S. Exploring collaboration mechanisms for llm agents: A social\\npsychology view. arXiv preprint arXiv:2310.02124, 2023.\\nZhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin,\\nX. V ., et al. Opt: Open pre-trained transformer language models. arXiv e-prints, pp. arXiv–2205,\\n2022a.\\nZhang, Z., Zhang, A., Li, M., and Smola, A. Automatic chain of thought prompting in large language\\nmodels. arXiv preprint arXiv:2210.03493, 2022b.\\nZheng, L., Chiang, W.-L., Sheng, Y ., Zhuang, S., Wu, Z., Zhuang, Y ., Lin, Z., Li, Z., Li, D., Xing,\\nE. P., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging llm-as-a-judge with mt-bench and chatbot\\narena. 
arXiv preprint arXiv:2306.05685, 2023.\\nSupplementary Material\\nA Spearman Correlation using Different Similarity Functions\\nWe present results using TF-IDF-based similarity and Levenshtein similarity when calculating the\\nSpearman correlation. Specifically, within each sample ofn proposed answers, we calculate Spearman\\ncorrelation coefficient between the n similarity scores and the n preference scores determined by the\\nGPT-4-based evaluator. As shown in Figure 6, there is indeed a positive correlation between win rate\\nand both TF-IDF similarity and Levenshtein similarity.\\n12\\n\\nTable 5: Prompt for ranking with LLMs\\nYou are a highly efficient assistant, who evaluates and selects the best large language model (LLMs) based on\\nthe quality of their responses to a given instruction. This process will be used to create a leaderboard reflecting\\nthe most accurate and human-preferred answers.\\nI require a leaderboard for various large language models. I’ll provide you with prompts given to these models\\nand their corresponding outputs. Your task is to assess these responses, and select the model that produces the\\nbest output from a human perspective.\\n## Instruction\\n{\\n\"instruction\": \"\"\"{instruction}\"\"\",\\n}\\n## Model Outputs\\nHere are the unordered outputs from the models. Each output is associated with a specific model, identified by a\\nunique model identifier.\\n{\\n{\\n\"model_identifier\": \"{identifier_1}\",\\n\"output\": \"\"\"{output_1}\"\"\"\\n},\\n{\\n\"model_identifier\": \"{identifier_2}\",\\n\"output\": \"\"\"{output_2}\"\"\"\\n},\\n{\\n\"model_identifier\": \"{identifier_3}\",\\n\"output\": \"\"\"{output_3}\"\"\"\\n},\\n{\\n\"model_identifier\": \"{identifier_4}\",\\n\"output\": \"\"\"{output_4}\"\"\"\\n},\\n{\\n\"model_identifier\": \"{identifier_5}\",\\n\"output\": \"\"\"{output_5}\"\"\"\\n},\\n{\\n\"model_identifier\": \"{identifier_6}\",\\n\"output\": \"\"\"{output_6}\"\"\"\\n}\\n}\\n## Task\\nEvaluate the models based on the quality and relevance of their outputs, and select the model that generated the\\nbest output. Answer by providing the model identifier of the best model. We will use your output as the name of\\nthe best model, so make sure your output only contains one of the following model identifiers and nothing else\\n(no quotes, no spaces, no new lines, ...).\\n## Best Model Identifier\\nB LLM Ranker\\nThis section introduces the setup of the LLM-Ranker used in this paper. The LLM-Ranker is designed\\nto evaluate and rank the best output generated by some LLMs. Table 5 presents the template for\\nprompting the model during these evaluations. We use this LLM-Ranker to pick the best answer\\namong and use AlpacaEval evaluator to evaluate the best ranked answer.\\n13\\n\\nTable 6: Case: Some models produce high quality answers.\\nRole Preference Content\\nUser - I want to talk about \"Smooth\", a song by Rob Thomas.\\nQwen1.5-110B-Chat 0.35 ... It’s a perfect blend of rock, pop, and Latin music, showcasing\\nSantana’s signature guitar work and Rob Thomas’s soulful vocals ...\\nQwen1.5-72B-Chat 0.00 ...\\nLlama-3-70B-Instruct 0.00 ... it’s a collaboration between Rob Thomas, the lead vocalist of\\nMatchbox Twenty,and Santana, the legendary guitarist ...\\nWizardLM-2-8x22B 0.99 ... It spent 12 weeks at the top of the Billboard Hot 100 chart in the\\nUnited States ... 
capturing the song’s sultry and energetic vibe ...\\nMixtral-8x22B-Instruct-v0.1 0.00 ...\\ndbrx-instruc 0.00 ...\\nAggregated\\n(Qwen1.5-110B-Chat) 0.99 \"Smooth\" is actually a collaboration between Rob Thomas, who is\\nthe lead singer of the band Matchbox Twenty, and the renowned\\nLatin rock guitarist Carlos Santana ... This iconic song blends rock,\\npop, and Latin music elements, showcasing Santana’s signature\\nguitar work and Rob Thomas’s distinctive vocals ... The song was a\\ncolossal success, spending 12 weeks at the top of the Billboard Hot\\n100 ... The music video, set in a vibrant club scene, complements\\nthe song’s sultry vibe ...\\nC Case Study\\nWe present a case study in this section. Due to the length of the responses generated by all models,\\nwe will only show selected fragments for brevity. To illustrate how the aggregator synthesizes the\\nresponse, we underlined similar expressions between the proposed responses and the aggregated\\nresponse in different colors. We omit the content that all proposed responses have mentioned.\\nTable 6 showcases the responses generated by different proposers. The aggregated response generated\\nby Qwen1.5-110B-Chat reflects a high preference for its own content but also incorporates key points\\nfrom Llama-3-70B-Instruct and WizardLM 8x22B. Notably, GPT-4’s preference score for WizardLM\\n8x22B’s response is 0.99, and the final aggregated answer also achieves a preference score of 0.99.\\nMeanwhile, Table 7 presents another case where none of the proposed responses achieve a high\\nGPT-4 preference score. Despite this, the aggregator successfully identifies and incorporates the\\nstrong points from these responses, achieving a preference score of 0.33.\\nD MATH Task\\nHere, we demonstrate that our approach is applicable to reasoning tasks, such as those in the MATH\\ndataset Hendrycks et al. (2021). The results are presented in Table 8, where we show that our method\\nconsistently enhances accuracy by a significant margin. This indicates that our approach is also\\neffective for this type of task. Notably, our method is complementary to existing reasoning techniques\\nsuch as Chain of Thought Wei et al. (2022) and Self-consistency Wang et al. (2022).\\n14\\n\\nTable 7: Case: all proposed responses are not good enough.\\nRole Preference Content\\nUser - How do you become an author?\\nQwen1.5-110B-Chat 0.00 Becoming an author involves several steps and ... 3. **Choose\\nyour niche**:... 8. **Understand the publishing industry**:... 10.\\n**Self-publishing options**:...\\nQwen1.5-72B-Chat 0.00 ...\\nLlama-3-70B-Instruct 0.16 Becoming an author can be a rewarding and fulfilling career... 1.\\n**Develop a love for reading and writing**:... 10. **Market and\\npromote your work**:... 11. **Continuously improve**:...\\nWizardLM-2-8x22B 0.03 1. ... 4. Finish your work:...\\nMixtral-8x22B-Instruct-v0.1 0.00 ...\\ndbrx-instruc 0.00 ...\\nAggregated\\n(Qwen1.5-110B-Chat) 0.33 Becoming an author involves a dedicated journey... 1. **Cul-\\ntivate a Love for Writing and Reading**:... 4. **Choose Your\\nNiche**:... 5. **Finish Your Work**:... 10. **Self-Publishing**:...\\n11. **Marketing and Promotion**:... 12. **Continuous Learning\\nand Writing**:...\\nTable 8: Results on the MATH task. 
We evaluate different aggregators, with all six models serving as\\nproposers in each MoA layer.\\nAggregator Layer 1 Layer 2 Layer 3\\nQwen1.5-72B-Chat 0.428 0.526 0.552\\nQwen1.5-110B-Chat 0.500 0.570 0.576\\nWizard 8x22b 0.544 0.574 0.580\\nMixtral-8x22B-Instruct-v0.1 0.282 0.534 0.556\\nLlama-3-70B-Instruct 0.456 0.584 0.578\\ndbrx-instruct 0.314 0.456 0.522\\n15'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = get_PDF_text('MoA.pdf')\n",
"text"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vevOUMXJf2p_"
},
"source": [
"### Generate Script\n",
"\n",
"Below we generate the script and print out the lines."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "f5rBXur8vXnP"
},
"outputs": [],
"source": [
"script = generate_script(SYSTEM_PROMPT, text, Dialogue)"
]
},
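{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, `generate_script` was defined earlier in this notebook and uses JSON mode / structured generation to turn the PDF text into a `Dialogue` object. The cell below is only a minimal, illustrative sketch of what such a helper can look like: the function name, model choice, and client construction here are assumptions for illustration, not the exact implementation used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch (illustration only): structured script generation with JSON mode.\n",
"# Assumptions: TOGETHER_API_KEY is set in the environment and `Dialogue` is the\n",
"# Pydantic model defined earlier in this notebook; the real `generate_script` may differ.\n",
"from together import Together\n",
"\n",
"def generate_script_sketch(system_prompt: str, input_text: str, output_model):\n",
"    client = Together()  # reads TOGETHER_API_KEY from the environment\n",
"    response = client.chat.completions.create(\n",
"        model=\"meta-llama/Llama-3-70b-chat-hf\",\n",
"        messages=[\n",
"            {\"role\": \"system\", \"content\": system_prompt},\n",
"            {\"role\": \"user\", \"content\": input_text},\n",
"        ],\n",
"        # Constrain the output to the JSON schema of the Pydantic model\n",
"        response_format={\"type\": \"json_object\", \"schema\": output_model.model_json_schema()},\n",
"    )\n",
"    # Parse the returned JSON string into the structured Dialogue object\n",
"    return output_model.model_validate_json(response.choices[0].message.content)\n",
"\n",
"# Usage would mirror the call above, e.g.:\n",
"# script = generate_script_sketch(SYSTEM_PROMPT, text, Dialogue)"
]
},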
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "inFgEVeBtCOR",
"outputId": "b715d94c-e1c2-4361-9075-528c6deb070f"
},
"outputs": [
{
"data": {
"text/plain": [
"[DialogueItem(speaker='Host (Jane)', text=\"Hello and welcome to today's podcast. I'm your host, Jane. Joining me today is Junlin Wang, a researcher at Duke University and co-author of the paper 'Mixture-of-Agents Enhances Large Language Model Capabilities'. Welcome, Junlin!\"),\n",
" DialogueItem(speaker='Guest', text='Hi Jane, thanks for having me.'),\n",
" DialogueItem(speaker='Host (Jane)', text=\"Let's start with the basics. What are large language models, and what are their limitations?\"),\n",
" DialogueItem(speaker='Guest', text=\"Large language models, or LLMs, are AI models that can understand and generate human-like language. They've made significant progress in recent years, but they still face limitations in terms of model size and training data. Further scaling up these models is often costly and requires extensive retraining.\"),\n",
" DialogueItem(speaker='Host (Jane)', text=\"That's a great point. Your paper introduces a new approach called Mixture-of-Agents, or MoA. Can you explain how MoA works?\"),\n",
" DialogueItem(speaker='Guest', text='MoA combines multiple LLMs to create a more robust and capable model. We leverage the collective strengths of LLMs, improving their reasoning and language generation capabilities. The approach involves iteratively refining the responses of multiple LLMs, allowing them to learn from each other and improve overall performance.'),\n",
" DialogueItem(speaker='Host (Jane)', text=\"That's fascinating. How does MoA achieve state-of-the-art performance on various benchmarks?\"),\n",
" DialogueItem(speaker='Guest', text='Our evaluations show that MoA achieves state-of-the-art performance on AlpacaEval 2.0, MT-Bench, and FLASK benchmarks. This is because MoA can effectively leverage the strengths of multiple LLMs, improving their reasoning and language generation capabilities.'),\n",
" DialogueItem(speaker='Host (Jane)', text='I see. What about cost-effectiveness? How does MoA compare to other models in terms of cost?'),\n",
" DialogueItem(speaker='Guest', text=\"Our cost-effectiveness analysis shows that MoA can match GPT-4o's quality while being more than twice as cost-effective. This makes MoA a more attractive option for applications where cost is a concern.\"),\n",
" DialogueItem(speaker='Host (Jane)', text=\"That's great to hear. What are some potential applications of MoA?\"),\n",
" DialogueItem(speaker='Guest', text=\"MoA has the potential to enhance the effectiveness of LLM-driven chat assistants, making AI more accessible. Additionally, MoA's intermediate outputs can improve the interpretability of models, facilitating better alignment with human reasoning.\"),\n",
" DialogueItem(speaker='Host (Jane)', text=\"Well, that's all the time we have for today. Thank you, Junlin, for sharing your insights on Mixture-of-Agents and its potential applications.\"),\n",
" DialogueItem(speaker='Guest', text='Thanks, Jane, for having me.'),\n",
" DialogueItem(speaker='Host (Jane)', text=\"And that's a wrap. Thanks to our listeners for tuning in. If you'd like to learn more about Mixture-of-Agents, check out the paper by Junlin Wang and his co-authors.\")]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"script.dialogue"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WqsYHpTwf2p_"
},
"source": [
"### Generate Podcast Using TTS\n",
"\n",
"Below we read through the script and parse choose the TTS voice depending on the speaker. We define a speaker and guest voice id.\n",
"\n",
"We can loop through the lines in the script and generate them by a call to the TTS model with specific voice and lines configurations. The lines all appended to the same buffer and once the script finishes we write this out to a `wav` file, ready to be played.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "qKnQnoYNvx3k",
"outputId": "efd25d8c-561c-4805-a43b-83674375c7a4"
},
"outputs": [
{
"data": {
"text/plain": [
"CompletedProcess(args=['ffplay', '-autoexit', '-nodisp', 'podcast.wav'], returncode=0)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import subprocess\n",
"import ffmpeg\n",
"\n",
"host_id = \"694f9389-aac1-45b6-b726-9d9369183238\" # Jane - host\n",
"guest_id = \"a0e99841-438c-4a64-b679-ae501e7d6091\" # Guest\n",
"\n",
"model_id = \"sonic-english\" # The Sonic Cartesia model for English TTS\n",
"\n",
"output_format = {\n",
" \"container\": \"raw\",\n",
" \"encoding\": \"pcm_f32le\",\n",
" \"sample_rate\": 44100,\n",
" }\n",
"\n",
"# Set up a WebSocket connection.\n",
"ws = client_cartesia.tts.websocket()\n",
"\n",
"# Open a file to write the raw PCM audio bytes to.\n",
"f = open(\"podcast.pcm\", \"wb\")\n",
"\n",
"# Generate and stream audio.\n",
"for line in script.dialogue:\n",
" if line.speaker == \"Guest\":\n",
" voice_id = guest_id\n",
" else:\n",
" voice_id = host_id\n",
"\n",
" for output in ws.send(\n",
" model_id=model_id,\n",
" transcript='-' + line.text, # the \"-\"\" is to add a pause between speakers\n",
" voice_id=voice_id,\n",
" stream=True,\n",
" output_format=output_format,\n",
" ):\n",
" buffer = output[\"audio\"] # buffer contains raw PCM audio bytes\n",
" f.write(buffer)\n",
"\n",
"# Close the connection to release resources\n",
"ws.close()\n",
"f.close()\n",
"\n",
"# Convert the raw PCM bytes to a WAV file.\n",
"ffmpeg.input(\"podcast.pcm\", format=\"f32le\").output(\"podcast.wav\").run()\n",
"\n",
"# Play the file\n",
"subprocess.run([\"ffplay\", \"-autoexit\", \"-nodisp\", \"podcast.wav\"])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 75
},
"id": "STWaJf_ySctY",
"outputId": "1c23df9d-c311-479d-8971-42be41bff999"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"