{
"cells": [
{
"cell_type": "markdown",
"id": "d0ef8fea",
"metadata": {},
"source": [
"# Summarization pipeline with chunking"
]
},
{
"cell_type": "markdown",
"id": "a2240877",
"metadata": {},
"source": [
"*Copyright (c) Meta Platforms, Inc. and affiliates.\n",
"This software may be used and distributed according to the terms of the Llama Community License Agreement.*"
]
},
{
"cell_type": "markdown",
"id": "4eb01d0c",
"metadata": {},
"source": [
"This tutorial shows you how to build a robust summarization pipeline for long documents. We will create an \"Intelligent Summarization Assistant\" that uses Llama 4 to summarize a document that is too long to be processed in a single pass.\n",
"\n",
"While models like Llama 4 have massive context windows, summarizing extremely long texts can sometimes cause details to be \"lost in the middle.\" To solve this, we will implement the **Map-Reduce** pattern: first, we'll \"map\" a summarization task over smaller, coherent chunks of the text, and then \"reduce\" those individual summaries into a final, high-fidelity overview.\n",
"\n",
"| Component | Choice | Why |\n",
"| :----------------- | :----------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------- |\n",
"| **Model** | `Llama-4-Maverick-17B-128E-Instruct-FP8` | A powerful model ideal for high-quality summarization at both the chunk and final summary stages. |\n",
"| **Pattern** | Map-Reduce Summarization | A fundamental pattern for processing long documents. We \"map\" a summarization function over each chunk, then \"reduce\" the resulting summaries into a final one. |\n",
"| **Infrastructure** | Llama API | Provides access to Llama 4 models using the `llama_api_client` SDK. |\n",
"---\n",
"\n",
"**Note on Inference Providers:** This tutorial uses the Llama API for demonstration purposes. However, you can run Llama 4 models with any preferred inference provider. Common examples include [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-meta.html) and [Together AI](https://together.ai/llama). The core logic of this tutorial can be adapted to any of these providers.\n",
"\n",
"## What you will learn\n",
"\n",
"- **How to implement a robust pipeline** for summarizing documents of any length.\n",
"- **The foundational \"Map-Reduce\" pattern** for document processing.\n",
"- **Techniques for \"semantic chunking\"** to split a document logically while preserving context.\n",
"- **How to craft effective, stage-specific prompts** for a multi-step LLM pipeline.\n",
"- **How to chain LLM calls** to perform complex, multi-stage tasks.\n",
"\n",
"## Install dependencies\n",
"\n",
"You will need two libraries for this project: `tiktoken` for accurate token counting, and the official `llama-api-client`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "175c12af-fa25-4035-afdd-bcfe482b2c5a",
"metadata": {},
"outputs": [],
"source": [
"!pip install --quiet tiktoken llama-api-client"
]
},
{
"cell_type": "markdown",
"id": "a0376ad7-8391-4bc3-b207-c0dd356b0410",
"metadata": {},
"source": [
"## Imports & Llama API client setup\n",
"\n",
"Import the necessary modules and initialize the `LlamaAPIClient`. This requires a Llama API key to be available as an environment variable. If you do not have a Llama API key, please get one from [Meta Llama API](https://llama.developer.meta.com/). \n",
"\n",
"Remember, we use the Llama API for this tutorial, but you can adapt this section to use your preferred inference provider."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "a5ac6a27-a662-4445-8e15-0a89aa30587d",
"metadata": {},
"outputs": [],
"source": [
"import os, sys, re\n",
"from typing import List\n",
"import tiktoken\n",
"from llama_api_client import LlamaAPIClient\n",
"\n",
"# --- Llama client ---\n",
"API_KEY = os.getenv(\"LLAMA_API_KEY\")\n",
"if not API_KEY:\n",
" sys.exit(\"❌ Please set the LLAMA_API_KEY environment variable.\")\n",
"\n",
"client = LlamaAPIClient(api_key=API_KEY)"
]
},
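{
"cell_type": "markdown",
"id": "alt-provider-note",
"metadata": {},
"source": [
"### Optional: adapting the client to another provider\n",
"\n",
"The rest of this tutorial only needs a chat-completions-style client, so swapping providers is mostly a matter of changing the setup above. The cell below is a minimal sketch, assuming an OpenAI-compatible endpoint (Together AI is used as the example); the base URL, the `TOGETHER_API_KEY` environment variable, and the model identifier are assumptions to verify against your provider's documentation. If you stay with the Llama API, skip this cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "alt-provider-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: pointing an OpenAI-compatible client at another inference provider.\n",
"# Assumptions: `pip install openai`, a TOGETHER_API_KEY environment variable,\n",
"# and the (assumed) model identifier below being available on that provider.\n",
"from openai import OpenAI\n",
"\n",
"alt_client = OpenAI(\n",
"    base_url=\"https://api.together.xyz/v1\",  # provider's OpenAI-compatible endpoint\n",
"    api_key=os.getenv(\"TOGETHER_API_KEY\"),\n",
")\n",
"\n",
"resp = alt_client.chat.completions.create(\n",
"    model=\"meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8\",  # assumed model slug\n",
"    messages=[{\"role\": \"user\", \"content\": \"Reply with one short sentence.\"}],\n",
")\n",
"print(resp.choices[0].message.content)"
]
},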
{
"cell_type": "markdown",
"id": "537de7f1-869f-4868-b107-87b4e1fc8c8b",
"metadata": {},
"source": [
"## Step 1: Get the data\n",
"\n",
"This tutorial uses a markdown version of the Meta research paper, [ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context](https://ai.meta.com/research/publications/astro-teaching-language-models-to-reason-by-reflecting-and-backtracking-in-context/). The file, `ASTRO-Teaching_Language_Models_to_Reason.md`, is included in the `data` sub-directory of the repository, making it easy to follow along.\n",
"\n",
"> We are using a markdown file for this tutorial because it preserves the document's structure with headers, which is useful for semantic chunking. If you are working with other formats like PDFs, you can use parsing services like [LlamaParse](https://www.llamaindex.ai/llamaparse) to convert them to markdown."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "ce0721a0-5624-415f-bab2-b0ba1bedab97",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"✅ Successfully loaded document: 142,921 characters.\n"
]
}
],
"source": [
"file_path = \"data/ASTRO-Teaching_Language_Models_to_Reason.md\"\n",
"\n",
"try:\n",
" with open(file_path, 'r', encoding='utf-8') as f:\n",
" document_text = f.read()\n",
"except FileNotFoundError:\n",
" raise FileNotFoundError(\n",
" f\"Error: The file was not found at {file_path}\"\n",
" )\n",
"\n",
"if document_text:\n",
" print(f\"✅ Successfully loaded document: {len(document_text):,} characters.\")"
]
},
{
"cell_type": "markdown",
"id": "4a0224bf-b718-4915-ad98-7bf0e737bd27",
"metadata": {},
"source": [
"## Step 2: The logic of chunking\n",
"\n",
"### Why Chunk?\n",
"\n",
"For long documents, even with a large context window, summarizing in a single pass can lead to context degradation, where the model may under-weigh details from the middle of the text.\n",
"\n",
"To ensure all parts of the document are processed with equal focus, we use a **map-reduce** approach. Breaking the document into smaller, coherent chunks for individual summarization guarantees a more detailed and high-quality final result.\n",
"\n",
"### How to chunk?\n",
"\n",
"An effective chunking strategy is critical. Simply splitting the text by a fixed token count can break sentences or separate related ideas. A better approach is **semantic chunking**. Our strategy has two levels:\n",
"\n",
"1. **Header-based splitting:** First, the document is split into large sections based on its markdown headers (`#`, `##`, `###`). This preserves the document's logical structure.\n",
"2. **Paragraph-based Chunking:** Each large section is then divided into the final, smaller chunks. This process respects paragraph boundaries and a specified token limit, ensuring the chunks are both semantically coherent and sized appropriately for the LLM.\n",
"\n",
"> **Note on Generalization:** This tutorial's header-based splitting is optimized for markdown documents. For other formats (like plain text or PDFs), you can generalize this header-based splitting approach by identifying similar structural elements. For instance, you could split by chapter titles, numbered sections, or use regular expressions to find custom patterns that define logical breaks in your document. The principle of multi-level semantic chunking remains the same.\n",
"\n",
"### Choosing the Right Chunk Size\n",
"\n",
"While our chunking strategy prioritizes semantic boundaries (headers and paragraphs) over fixed token counts, we still need to set a maximum size for our chunks. This ensures that even the largest semantic chunk fits comfortably within the model's context window.\n",
"\n",
"The `CHUNK_SIZE_TOKENS` constant serves as this upper limit. Finding the right value is a trade-off:\n",
"\n",
"* **Set Too High:** The limit might still be larger than the model's context window (once the prompt is included), causing API calls to fail.\n",
"* **Set Too Low:** This could force the chunking logic to split paragraphs or other logical units too aggressively, reducing the quality of the summaries. It also increases the number of API calls, leading to higher cost and latency.\n",
"\n",
"The `16000` token limit in this tutorial is a conservative size for models with large context windows (usually 128k for models available on the Llama API). It leaves ample room for the prompt while ensuring each chunk is large enough to provide meaningful context for summarization.\n",
"\n",
"> **Note on Local Processing:** All processing up to this point, including loading the data and chunking the text, happens locally. We have not yet made any calls to the Llama API. The token counting is done with a local library to ensure our chunks are the right size for the API calls in the next steps."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f825f541-c63b-4747-8ba9-7e675da427ab",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total chunks created: 54\n",
"Average token count per chunk: 661.94\n",
"Max token count in a chunk: 6357\n",
"Min token count in a chunk: 3\n",
"--------------------------------------------------\n",
"Top 5 Chunks:\n",
"Chunk 0:\n",
"# ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context\n",
"\n",
"Joongwon Kim1,2, Anirudh Goyal1, Liang Tan1, Hannaneh Hajishirzi2, Srini Iyer1, Tianlu Wang1\n",
"\n",
"1AI at Meta, 2University of Washington\n",
"\n",
"We introduce Astro, the \"Autoregressive Search-Taught Reasoner\", a framework for training language models to reason like search algorithms, explicitly leveraging self-reflection, backtracking, and exploration in their outputs. Recently, training large language models (LLMs) via reinforcement learning (RL) has led to the advent of reasoning models with greatly enhanced reasoning capabilities. Open-source replications of reasoning models, while successful, build upon models that already exhibit strong reasoning capabilities along with search behavior observed even before RL. As a result, it is yet unclear how to boost the reasoning capabilities of other non-reasoner models including Llama 3. Astro teaches such models to internalize structured search behavior through a synthetic dataset derived from Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. By converting search traces into natural language chain-of-thoughts that capture both successes and recoveries from failure, Astro bootstraps models with a rich prior for exploration during RL. We finetune our models on these search-derived traces and further improve performance via RL with verifiable rewards. We apply Astro to the Llama 3 family of models and achieve absolute performance gains of 16.0% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024, especially improving upon challenging problems that require iterative correction. Our results demonstrate that search-inspired training offers a principled way to instill robust reasoning capabilities into open LLMs.\n",
"\n",
"Date: June 23, 2025\n",
"Correspondence: Joongwon Kim at jwonkim@meta.com\n",
"\n",
"| **stepwise solutions** ## Step 1: Define the problem and identify what we need to find. We need to find the time it takes for Aya to complete her walk and stop at the coffee shop when walking at a speed of $s + \\frac{1}{2}$ kilometers per hour, including the time $t$ spent in the coffee shop. ## Step 2: Set up the equations based on the information given. Let's denote the total time for the walk and coffee shop at speed $s$ as 4 hours or 240 minutes, and at speed $s+2$ as 2 hours and 24 minutes, or 144 minutes ... The final answer is \\boxed{398}. Llama-3.1-70B-Instruct | **long CoT solutions** Procedure Cloning SFT RL ASTRO Let's begin by finding the time that it takes for Aya to complete her walk and stop at the coffee shop ... But wait, are we solving the problem correctly so far? Hmm... Our solution may not be correct so far. Let's go back to where we set up the equations ... Therefore Aya spent a total of 204 minutes. But wait, are we solving the problem correctly so far? Hmm... Our solution seems to be correct so far. The final answer is \\boxed{204}. Llama-3.1-70B-ASTRO-RL | \tMATH-500\tAMC 2023\tAIME 2024Llama-3.1-70B-Instruct
Llama-3.1-70B-ASTRO-SFT\t+16.0%\t\t
Llama-3.1-70B-ASTRO-RL\t\t+26.9%\t+20.0% |\n",
"| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------- |\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"Figure 1 Astro teaches Llama-3.1-70B-Instruct to perform self-reflection and backtracking in-context and improves its mathematical reasoning, achieving 81.8% on MATH-500, 64.4% on AMC 2023 and 30.0% on AIME 2024 (pass@1).\n",
"\n",
"\n",
"--------------------------------------------------\n",
"Chunk 1:\n",
"## 1 Introduction\n",
"\n",
"Training large language models (LLMs) via reinforcement learning (RL) has greatly improved their reasoning capabilities, leading to the advent of reasoning models such as OpenAI o1 (OpenAI, 2024), DeepSeek-R1 (DeepSeek-AI, 2025) or Gemini 2.5 (Google, 2025). A prominent feature of reasoning models is their ability to iteratively refine their outputs with a behavior similar to search – a process which involves reflecting on their own outputs and backtracking to a previous state (Xiang et al., 2025). While open-source replications of reasoning models achieve notable performance improvements, they rely on distillation from existing reasoning\n",
"\n",
"---\n",
"\n",
"\n",
"\n",
"Diagram showing three stages: Monte Carlo Tree Search, Procedure Cloning, and Reinforcement Learning\n",
"\n",
"Figure 2 An overview of Astro. Given a math reasoning problem, we first perform Monte Carlo Tree Search (MCTS) in a stepwise manner with verifiable rewards and obtain a search tree where each node contains a discrete reasoning step with its associated Q-value. We then linearize the visited sequence of nodes, including intermediate nodes with incorrect answers, into a solution that integrates backtracking and self-reflection in natural language. Then, we perform supervised fine-tuning (SFT) on the search-integrated solutions and bootstrap our policy to perform autoregressive search. Finally, we further improve the policy's search and reasoning capabilities with reinforcement learning (RL).\n",
"\n",
"models (Li et al., 2025; Muennighoff et al., 2025) or direct RL (Hu et al., 2025; Yu et al., 2025) from LLMs that (1) already contain reflective behavior and strong reasoning capabilities (Chang et al., 2025; Liu et al., 2025), and (2) exhibit spurious performance gains from incorrect or noisy reward signals during RL (Lv et al., 2025; Shao et al., 2025). Hence it is unclear from a scientific perspective how reasoning models can be built from other LLMs that do not exhibit the aforementioned behavior, such as Llama 3 (AI at Meta, 2024).\n",
"\n",
"We introduce ASTRO, the \"Autoregressive Search-Taught Reasoner\", a framework that systematically infuses search-like behavior into language models ab initio to improve their reasoning capabilities. The fundamental principle guiding Astro is search, where our policy explores the solution space by selecting actions, reflecting on its own solution, and backtracking to a previous step if needed. Astro trains language models to perform autoregressive search – instead of using external search scaffolds such as beam search to solve reasoning problems, Astro internalizes the search procedure and generates entire search trajectories, including reflections and backtracks, in a single inference pass. Models trained using Astro exhibit improved reasoning abilities by frequently re-evaluating their solutions and backtracking until they reach a final answer of high confidence. Moreover, such models generate structured reasoning traces that can be mapped to a directed graph with each vertex representing a discrete reasoning step, allowing for a richer understanding of their reasoning processes.\n",
"\n",
"Astro operates in three stages: (1) search trajectory generation, (2) supervised fine-tuning and (3) reinforcement learning. We initially bootstrap our models with search behavior by generating search trajectories to be used for training data via procedure cloning (Yang et al., 2022; Laskin et al., 2022) – we perform search with custom scaffolding over our language model policy to explore over different solution trajectories for each math problem, and we train our policy without using scaffolds at test time to predict the entire sequence of actions, including intermediate actions that lead to incorrect answers, that ultimately end with a successful terminal state. Then, we further optimize our policy via RL to improve their reasoning and search capabilities. Astro provides beneficial priors for RL during its data generation stage by systematically injecting self-reflection and backtracking patterns to the search traces via procedure cloning.\n",
"\n",
"First, we generate synthetic data, also called the cold-start data (DeepSeek-AI, 2025; Qwen, 2025), to instill autoregressive search priors to our models. To this end, we use Monte Carlo Tree Search (MCTS) to explore the solution space of challenging math problems and build search trees with diverse reasoning traces. We linearize each search tree into a sequence of nodes visited during the search process, which we then convert into a natural language chain-of-thought (CoT, Wei et al. (2022)) that integrates self-reflection and backtracking. Then, we sample about 36K high-quality CoT solutions across three open-source math datasets.\n",
"\n",
"2\n",
"\n",
"---\n",
"\n",
"\n",
"\n",
"| Monte Carlo Tree Search (MCTS) ⇒ Linearization ⇒ Procedure Cloning in Language | | |\n",
"| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n",
"| Problem: A box $2$ centimeters high, $3$ centimeters wide, and $5$ centimeters long can hold $40$ grams of clay. A second box with twice the height, three times the width, and the same length as the first box can hold $n$ grams of clay. What is $n$? | Node 3 | Let's determine the dimensions of the second box by calculating its height, which is twice the height of the first box, so its height is $2 \\times 2 = 4$ centimeters, and its width, which is three times the width of the first box, so its width is $3 \\times 3 = 9$ centimeters, while noting that the length of the second box is the same as the first box, which ... (solution omitted) ... |\n",
"| ## Step 1: Determine the dimensions of the second box (Node 3, Q=0.962) The second box has twice the height of the first box, so its height is $2 \\times 2 = 4$ centimeters. It has three times the width of the first box, so its width is $3 \\times 3 = 9$ centimeters. The length of the second box is the same as the first box, which is $5$ centimeters. | Node 16 | Let's use this information to calculate n, the amount of clay that the second box can hold, using the formula n = 40 \\* (volume of second box / volume of first box) = 40 \\* (4 \\* 9 \\* 5) / (2 \\* 3 \\* 5) = 40 \\* (12) = 40 \\* 6 = 480 grams. But wait, are we correctly solving the problem so far? Hmm... I'm not sure if our solution is correct so far. Let's go back to where we calculated the volume of the second box to be 180 cubic centimeters. Given the established dimensions of the boxes, let's calculate the volume of the first box, where its volume is ... |\n",
"| ## Step 2: Calculate the volume of the second box (Node 16, Q=0.934) The second box has twice the height (2 \\* 2 = 4 cm), three times the width (3 \\* 3 = 9 cm), and the same length (5 cm) as the first box. So, the volume of the second box = height \\* width \\* length = 4 \\* 9 \\* 5 = 180 cubic centimeters. | Node 43 | (solution omitted) ... the second box is 6 times larger than the first box, it follows that the second box can hold 6 times as much clay as the first box. Therefore, to find n, the amount of clay the second box can hold, we calculate n = 6 \\* 40 = 240 grams. But wait, are we correctly solving the problem so far? Hmm... Our solution seems to be correct so far. |\n",
"| ## Step 3: Calculate the volume of the first box (Node 43, Q=0.768) The volume of the first box = height \\* width \\* length = 2 \\* 3 \\* 5 = 30 cubic centimeters. ## Step 4: Calculate n, the amount of clay that the second box can hold (Node 68, Q=0.0) The ratio of the volumes of the two boxes will equal the ratio of the amounts of clay that they can hold. ... (solution omitted) ... Therefore, n = 40 \\* (volume of second box / volume of first box) = 40 \\* (4 \\* 9 \\* 5) / (2 \\* 3 \\* 5) = 40 \\* (12) = 40 \\* 6 = 480 grams. | Node 68 | |\n",
"| ## Step 3: Calculate the volume of the first box (Node 44, Q=0.874) The volume of the first box = height \\* width \\* length = 2 \\* 3 \\* 5 = 30 cubic centimeters. ## Step 4: Calculate the ratio of the volume of the second box to the first box (Node 85, Q=1.0) ... (solution omitted) ... The ratio = volume of the second box / volume of the first box = 180 / 30 = 6. | Node 44 | |\n",
"| | Node 85 | |\n",
"| ## Step 5: Calculate the amount of clay the second box holds (Node 96, Q=1.0) Since the second box is 6 times larger than the first box, it can hold 6 times as much clay as the first box. Therefore, n = 6 \\* 40 = 240 grams. | Node 96 | The final answer is: $\\boxed{240}$ |\n",
"\n",
"\n",
"Figure 3 Example of search trajectory generation via procedure cloning. We use the policy to search through the solution space via MCTS and keep track of terminal nodes with incorrect answers, as well as terminal nodes with correct answers. Then, we linearize the search tree such that it incorporates backtracking from the incorrect terminal node (Node 68) to its greatest common ancestor (Node 16) with the correct terminal node (Node 96). Finally, we rewrite the node sequence into a long chain-of-thought, injecting self-reflection and backtracking phrases into the CoTs.\n",
"\n",
"We then perform supervised fine-tuning (SFT) to infuse autoregressive search behavior into the Llama 3 family of models (AI at Meta, 2024). After fine-tuning for just one epoch, our SFT checkpoint based on llama-3.1-70b-instruct achieves 69.6% on MATH-500, 55.0% on AMC 2023 and 13.3% on AIME 2024, and outperforms its counterpart trained on the same set of problems but without search priors. Our qualitative analyses show that even simply performing SFT with high-quality search traces can infuse search capabilities, including backtracking and self-reflection behavior, into a language model.\n",
"\n",
"Finally, we perform reinforcement learning (RL) on our models to further improve their reasoning capabilities. Our training prompts are derived from open-source math problems of moderate to high difficulties for our policies. We use a modified form of Group Relative Policy Optimization (GRPO, Shao et al. (2024)) that is very similar to that of Dr. GRPO (Liu et al., 2025) to update our policies. After RL, our policy based on llama-3.1-70b-instruct achieves 81.8% in MATH-500, 64.4% in AMC 2023 and 30.0% in AIME 2024 (pass@1). We show that our model trained end-to-end using Astro outperforms its counterpart similarly optimized with RL but initialized from a SFT checkpoint trained without search priors – this demonstrates the importance of leveraging self-reflection and backtracking as priors for improving reasoning via RL. Our work provides a clear recipe for improving the reasoning capabilities of language models by instilling autoregressive search priors with SFT and leveraging such priors to further improve the models via RL.\n",
"\n",
"\n",
"--------------------------------------------------\n",
"Chunk 2:\n",
"## 2 Search Trajectory Generation\n",
"\n",
"Astro begins by generating a dataset of search traces, expressed as long chain-of-thoughts (Wei et al., 2022) that encode self-reflection and backtracking in natural language, via procedure cloning. To this end, we first obtain search trees that explore a wide solution space for each math problem using Monte Carlo Tree Search (MCTS) in a stepwise manner, strategically balancing exploration and exploitation with verifier-based rewards to obtain diverse and high-quality solutions exploring different reasoning traces (Section 2.2).\n",
"\n",
"We then linearize the search trees into sequences of nodes that explore various states, including intermediate nodes with incorrect answers, until arriving at a high-quality solution leading to the correct answer (Section 2.3). Finally, we translate each node sequence into a chain-of-thought that integrates self-reflection and backtracking in natural language, and we add each long chain-of-thought to our final dataset (Section 2.4). The resulting dataset encodes beneficial self-reflection and backtracking priors for training language models to perform autoregressive search for solving challenging math problems via supervised fine-tuning and reinforcement learning (Section 3). Refer to Figure 3 for a visual example of our search trajectory generation pipeline.\n",
"\n",
"---\n",
"\n",
"\n",
"\n",
"\n",
"--------------------------------------------------\n",
"Chunk 3:\n",
"## 2.1 Problem Formulation and Overview\n",
"\n",
"**Problem formulation.** Our data generation setup is a Markov Decision Process (MDP) (Puterman, 1994), where the language model functions as the policy ΠLM and explores the solution space to the input x, while obtaining rewards in terminal states from a verifier V based on the correct answer. Here we assume that ΠLM solves math problems in a stepwise manner, where each step st represents a sequence of tokens y1 · · · y|st| encapsulating a minimal unit of reasoning required to solve x. Then, each state St represents a combination of the input prompt and the sequence of steps generated by the policy, i.e. St = (x, s0, · · · , st). Meanwhile, the action at+1 represents the next step st+1 taken by ΠLM to address x. Refer to Figure 3 for examples of the steps defined in our setup.\n",
"\n",
"Given this setup, we teach a language model to predict a sequence of states (S0 · · · Send) in response to x such that the states explore reasoning steps leading to correct and incorrect answers, until the LM arrives at Send and terminates its search by accepting the correct answer as its final answer.\n",
"\n",
"**Overview.** We generate training data for Astro in three main stages outlined below:\n",
"\n",
"1. For each x we generate a search tree T, where each node ni represents the state Si and each edge (ni, nj) represents the action aj, i.e. the next step sj taken from Si to Sj, using Monte Carlo Tree Search (MCTS) to explore the solution space based on verifier-based rewards from rollouts (Section 2.2).\n",
"\n",
"2. We linearize T into a sequence of nodes L = (n0, · · · , nend), a subsequence of the entire history of nodes visited by ΠLM until arriving at nend, the terminal node with the correct answer. Some adjacent pairs of nodes (nt, nt+1) in L are such that nt+1 is an ancestor of nt in T, which corresponds to self-reflection and backtracking during the search procedure (Section 2.3).\n",
"\n",
"3. We translate L into a chain-of-thought solution y = (y0, · · · , yend) that integrates self-reflection and backtracking in natural language, and we add (x, y) to our final dataset (Section 2.4).\n",
"\n",
"\n",
"--------------------------------------------------\n",
"Chunk 4:\n",
"## 2.2 Monte Carlo Tree Search\n",
"\n",
"We use our language model policy ΠLM to obtain a search tree with diverse solution traces to each input x by running Monte Carlo Tree Search (MCTS). By using MCTS, we explore a diverse solution space while balancing exploration and exploitation with reliable guidance from reward signals obtained from full rollouts. Here, we prompt x to elicit stepwise solutions from ΠLM, and assign reward scores with our verifier V to compare the predicted answer with the correct answer.\n",
"\n",
"Monte Carlo Tree Search employs three main stages – selection, expansion and backpropagation – to select promising next steps, expand the search tree, and update the quality metric of each reasoning step.\n",
"\n",
"**Selection.** At state St with k actions generated by ΠLM from St, we balance exploration and exploitation to select the most promising node from which to further perform tree search. We use the Predictor+Upper Confidence bounds applied to Trees (PUCT, Silver et al. (2016)) for selection to balance exploration and exploitation during tree search. From any state St, given the action index i ∈ [1...k], the quality score of taking action ai from state St – Q(St, ai), the total visit count of St – N(St), and the visit count of taking action ai from St – N(St, ai), we perform selection as:\n",
"\n",
"$$S^*_{t+1} = \\underset{(S_{t+1}=S_t \\rightarrow a_i)}{\\text{argmax}} \\left[Q(S_t, a_i) + c_{\\text{puct}} \\cdot \\Pi_{\\text{LM}}(a_i|S_t)\\sqrt{\\frac{N(S_t)}{1 + N(S_t, a_i)}}\\right]$$\n",
"\n",
"**Expansion.** From state St, ΠLM takes x and the sequence of steps (s0, · · · , st) as the input, and first samples k actions which each correspond to the next step for solving x. For each action, we sample M rollouts and score the full solution using V to match the predicted answer with the reference answer. Then, we average the scores across the rollouts for each new action ai (i ∈ [1...k]) to compute the reward scores for the new states. We add a new node nt+1, associated with each new state St+1, to T.\n",
"\n",
"$$R(S_{t+1}) = \\frac{1}{M} \\sum_{j\\in[1...M]} V(\\Pi_{\\text{LM},j}(S_{t+1}))$$\n",
"\n",
"---\n",
"\n",
"\n",
"\n",
"Backpropagation. We backpropagate the reward scores obtained during expansion from the leaf node to the root node to recursively update their Q-values. The updates consist of (1) incrementing the visit count of each state (Eq. 3), and (2) updating the Q-values of each (state, action) pair using the Q-values and visit counts of the children nodes of St+1 = (St, a), along with the rollout-based reward score R(St+1) (Eq. 4).\n",
"\n",
"$$N(s_t) = N(s_t) + 1$$\n",
"\n",
"$$Q(S_t, a) = \\frac{\\sum_{i=1}^K Q(S_{t+1}, a_i) \\cdot N(S_{t+1}, a_i) + R(S_{t+1})}{\\sum_{i=1}^K N(S_{t+1}, a_i) + 1}$$\n",
"\n",
"We repeat the procedure above for multiple iterations to explore the solution space for each math problem and build the search trees. We use llama-3.3-70b-instruct as our policy ΠLM and generate k = 8 actions during each expansion step with M = 16 rollouts, cpuct = 1.0, 32 iterations and maximum tree depth of 50.\n",
"\n",
"\n",
"--------------------------------------------------\n"
]
}
],
"source": [
"# --- Constants & Configuration ---\n",
"ENCODING_MODEL = \"o200k_base\"\n",
"CHUNK_SIZE_TOKENS = 16000 # A practical chunk size\n",
"\n",
"def count_tokens(text: str, encoding: tiktoken.Encoding) -> int:\n",
" \"\"\"Helper function to count tokens in a string.\"\"\"\n",
" return len(encoding.encode(text))\n",
"\n",
"def chunk_document(\n",
" markdown_text: str,\n",
" chunk_size: int = CHUNK_SIZE_TOKENS,\n",
" headers_to_split_on: List[str] = [\"#\", \"##\", \"###\"]\n",
") -> List[str]:\n",
" \"\"\"\n",
" Chunks a markdown document, preserving header context for each chunk.\n",
" \"\"\"\n",
" # 1. Split the document by headers to get sections\n",
" header_pattern = \"|\".join(f\"^{h}\\\\s\" for h in headers_to_split_on)\n",
" sections = re.split(f\"({header_pattern})\", markdown_text, flags=re.MULTILINE)\n",
" if sections and not sections[0].strip():\n",
" sections.pop(0)\n",
"\n",
" if len(sections) > 1:\n",
" sections = list(zip(sections[0::2], sections[1::2]))\n",
" else:\n",
" sections = []\n",
"\n",
" encoding = tiktoken.get_encoding(ENCODING_MODEL)\n",
" final_chunks = []\n",
"\n",
" # 2. Process each section\n",
" for header, content in sections:\n",
" header_token_count = count_tokens(header, encoding)\n",
" \n",
" if header_token_count + count_tokens(content, encoding) <= chunk_size:\n",
" final_chunks.append(header + content)\n",
" continue\n",
"\n",
" # Split the content by paragraphs\n",
" paragraphs = content.split('\\n\\n')\n",
" current_chunk_paragraphs = []\n",
" current_chunk_tokens = header_token_count\n",
"\n",
" for para in paragraphs:\n",
" para_tokens = count_tokens(para, encoding)\n",
"\n",
" # If a paragraph is too large to fit with the header, it must be truncated.\n",
" if header_token_count + para_tokens > chunk_size:\n",
" available_tokens = chunk_size - header_token_count\n",
" para_token_ids = encoding.encode(para)\n",
" truncated_ids = para_token_ids[:available_tokens]\n",
" para = encoding.decode(truncated_ids, errors='ignore')\n",
" para_tokens = len(truncated_ids)\n",
" print(f\"Warning: Truncating a paragraph to {para_tokens} \"\n",
" f\"tokens to fit the chunk size.\")\n",
"\n",
" # If the current chunk is not empty and the new paragraph doesn't fit,\n",
" # finalize the current chunk before starting a new one.\n",
" if (current_chunk_paragraphs and \n",
" (current_chunk_tokens + para_tokens > chunk_size)):\n",
" final_chunks.append(header + \"\\n\\n\".join(current_chunk_paragraphs))\n",
" current_chunk_paragraphs = []\n",
" current_chunk_tokens = header_token_count\n",
"\n",
" current_chunk_paragraphs.append(para)\n",
" current_chunk_tokens += para_tokens\n",
"\n",
" # Add the last remaining chunk\n",
" if current_chunk_paragraphs:\n",
" final_chunks.append(header + \"\\n\\n\".join(current_chunk_paragraphs))\n",
" \n",
" return final_chunks\n",
"\n",
"# Now, let's chunk our document\n",
"chunks = chunk_document(document_text)\n",
"\n",
"# --- Print Statistics and a Sample Chunk ---\n",
"if chunks:\n",
" print(f\"Total chunks created: {len(chunks)}\")\n",
" encoding = tiktoken.get_encoding(ENCODING_MODEL)\n",
" token_counts = [count_tokens(chunk, encoding) for chunk in chunks]\n",
" avg_tokens = sum(token_counts) / len(token_counts)\n",
" print(f\"Average token count per chunk: {avg_tokens:.2f}\")\n",
" print(f\"Max token count in a chunk: {max(token_counts)}\")\n",
" print(f\"Min token count in a chunk: {min(token_counts)}\")\n",
" print(\"-\" * 50)\n",
" print(\"Top 5 Chunks:\")\n",
" for i, chunk in enumerate(chunks[:5]):\n",
" print(f\"Chunk {i}:\")\n",
" print(chunk)\n",
" print(\"-\" * 50)"
]
},
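{
"cell_type": "markdown",
"id": "plaintext-split-note",
"metadata": {},
"source": [
"### Aside: splitting non-markdown documents\n",
"\n",
"The generalization note in Step 2 mentions that header-based splitting can be adapted to other formats. The cell below is a minimal sketch, assuming plain-text documents whose sections begin with numbered headings such as `1. Introduction` or `2.3 Results`; the pattern and the helper name `split_plain_text_sections` are illustrative and not used by the rest of the pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "plaintext-split-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: generalizing header-based splitting to plain text.\n",
"# Assumption: sections begin with numbered headings (e.g., \"1. Introduction\").\n",
"import re\n",
"from typing import List\n",
"\n",
"SECTION_HEADING = re.compile(r\"^\\d+(?:\\.\\d+)*\\.?\\s+\\S.*$\", re.MULTILINE)\n",
"\n",
"def split_plain_text_sections(text: str) -> List[str]:\n",
"    \"\"\"Split plain text into sections at numbered headings (illustrative helper).\"\"\"\n",
"    starts = [m.start() for m in SECTION_HEADING.finditer(text)]\n",
"    if not starts:\n",
"        return [text]  # no headings found; treat the whole text as one section\n",
"    # Keep any preamble before the first heading as its own section.\n",
"    boundaries = ([0] if starts[0] > 0 else []) + starts + [len(text)]\n",
"    return [text[a:b].strip() for a, b in zip(boundaries, boundaries[1:]) if text[a:b].strip()]\n",
"\n",
"# Each resulting section could then go through the same paragraph-based\n",
"# chunking loop used in chunk_document() above."
]
},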
{
"cell_type": "markdown",
"id": "843c4267-b959-4e37-9d6f-bc30ae628574",
"metadata": {},
"source": [
"## Step 3: The \"map\" stage - summarizing each chunk\n",
"\n",
"With the document split into manageable, semantically coherent chunks, we can begin the \"Map\" stage. This means we apply the same operation—in this case, summarization—to each chunk independently.\n",
"\n",
"### Prompt engineering\n",
"\n",
"The quality of the summaries depends heavily on the quality of the prompts. For this stage, the prompt must instruct the model to create a summary of a small piece of a larger document. It is crucial to tell the model to focus *only* on the provided text and not to add outside information."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "678a1afc-3a17-419d-a424-549a0788b8a8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Summary of chunk 0:\n",
"- ASTRO is a framework for training language models to reason like search algorithms.\n",
"- ASTRO leverages self-reflection, backtracking, and exploration in language model outputs.\n",
"- ASTRO uses a synthetic dataset derived from Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories.\n",
"- The framework finetunes models on search-derived traces and improves performance via reinforcement learning (RL) with verifiable rewards.\n",
"- ASTRO is applied to the Llama 3 family of models.\n",
"- Absolute performance gains achieved: 16.0% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024.\n",
"- Llama-3.1-70B-ASTRO-RL achieves 81.8% on MATH-500, 64.4% on AMC 2023, and 30.0% on AIME 2024 (pass@1).\n",
"--------------------------------------------------\n",
"Summary of chunk 1:\n",
"- ASTRO is a framework that infuses search-like behavior into language models to improve their reasoning capabilities.\n",
"- ASTRO operates in three stages: search trajectory generation, supervised fine-tuning, and reinforcement learning.\n",
"- Search trajectory generation uses Monte Carlo Tree Search (MCTS) to explore the solution space of math problems and builds search trees with diverse reasoning traces.\n",
"- About 36K high-quality chain-of-thought (CoT) solutions are sampled across three open-source math datasets.\n",
"- Supervised fine-tuning (SFT) is performed on the search-integrated solutions to infuse autoregressive search behavior into the models.\n",
"- The SFT checkpoint based on llama-3.1-70b-instruct achieves 69.6% on MATH-500, 55.0% on AMC 2023, and 13.3% on AIME 2024 after fine-tuning for one epoch.\n",
"- Reinforcement learning (RL) is performed using a modified form of Group Relative Policy Optimization (GRPO) to further improve the models' reasoning capabilities.\n",
"- After RL, the policy based on llama-3.1-70b-instruct achieves 81.8% in MATH-500, 64.4% in AMC 2023, and 30.0% in AIME 2024 (pass@1).\n",
"--------------------------------------------------\n",
"Summary of chunk 2:\n",
"* Astro generates a dataset of search traces via procedure cloning.\n",
"* Search trees are obtained using Monte Carlo Tree Search (MCTS) with verifier-based rewards.\n",
"* Search trees are linearized into sequences of nodes exploring various states.\n",
"* Node sequences are translated into chains-of-thought integrating self-reflection and backtracking in natural language.\n",
"* The resulting dataset encodes self-reflection and backtracking priors for training language models.\n",
"* The dataset is used for supervised fine-tuning and reinforcement learning to solve math problems.\n",
"--------------------------------------------------\n",
"Summary of chunk 3:\n",
"* The data generation setup is a Markov Decision Process (MDP).\n",
"* The language model functions as the policy ΠLM and explores the solution space to the input x.\n",
"* Each state St represents a combination of the input prompt and the sequence of steps generated by the policy.\n",
"* The goal is to teach a language model to predict a sequence of states (S0 · · · Send) in response to x.\n",
"* Training data for Astro is generated in three main stages: \n",
" 1. Generating a search tree T using Monte Carlo Tree Search (MCTS).\n",
" 2. Linearizing T into a sequence of nodes L.\n",
" 3. Translating L into a chain-of-thought solution y that integrates self-reflection and backtracking in natural language.\n",
"--------------------------------------------------\n",
"Summary of chunk 4:\n",
"- Monte Carlo Tree Search (MCTS) is used with language model policy ΠLM to obtain a search tree with diverse solution traces.\n",
"- MCTS involves three stages: selection, expansion, and backpropagation.\n",
"- Selection uses Predictor+Upper Confidence bounds applied to Trees (PUCT) to balance exploration and exploitation.\n",
"- The selection formula is: $$S^*_{t+1} = \\underset{(S_{t+1}=S_t \\rightarrow a_i)}{\\text{argmax}} \\left[Q(S_t, a_i) + c_{\\text{puct}} \\cdot \\Pi_{\\text{LM}}(a_i|S_t)\\sqrt{\\frac{N(S_t)}{1 + N(S_t, a_i)}}\\right]$$\n",
"- Expansion involves sampling k actions, scoring full solutions using verifier V, and averaging scores across M rollouts.\n",
"- The reward score formula is: $$R(S_{t+1}) = \\frac{1}{M} \\sum_{j\\in[1...M]} V(\\Pi_{\\text{LM},j}(S_{t+1}))$$\n",
"- Backpropagation updates Q-values and visit counts using equations: \n",
" $$N(s_t) = N(s_t) + 1$$\n",
" $$Q(S_t, a) = \\frac{\\sum_{i=1}^K Q(S_{t+1}, a_i) \\cdot N(S_{t+1}, a_i) + R(S_{t+1})}{\\sum_{i=1}^K N(S_{t+1}, a_i) + 1}$$\n",
"- The policy ΠLM used is llama-3.3-70b-instruct.\n",
"- Parameters used are: k = 8, M = 16, cpuct = 1.0, 32 iterations, and maximum tree depth of 50.\n",
"--------------------------------------------------\n"
]
}
],
"source": [
"LLM_MODEL = \"Llama-4-Maverick-17B-128E-Instruct-FP8\"\n",
"DOC_TITLE = (\"ASTRO: Teaching Language Models to Reason by Reflecting and \"\n",
" \"Backtracking In-Context\")\n",
"\n",
"MAP_PROMPT = \"\"\"\n",
"Your role is to create a concise, factual summary of a text chunk from the \n",
"research paper titled \"{document_title}\".\n",
"- Extract only key facts, figures, and statements from the chunk text itself.\n",
"- Omit any conversational introductions or conclusions. Do not explain what you \n",
" are doing.\n",
"- If a chunk contains no substantive information (e.g., only headers, formatting, \n",
" or boilerplate), output the exact phrase: \"No substantive information.\"\n",
"\n",
"**Text Chunk:**\n",
"{chunk_text}\n",
"\"\"\"\n",
"\n",
"def map_summarize_chunk(chunk_text: str, document_title: str) -> str:\n",
" \"\"\"\n",
" Summarizes a single chunk of text using the 'map' prompt.\n",
" \"\"\"\n",
" try:\n",
" resp = client.chat.completions.create(\n",
" model=LLM_MODEL,\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": MAP_PROMPT.format(\n",
" document_title=document_title, chunk_text=chunk_text)},\n",
" ],\n",
" temperature=0.1, # Low temperature for deterministic summaries\n",
" )\n",
" return resp.completion_message.content.text\n",
" except Exception as e:\n",
" print(f\" Error summarizing chunk: {e}\")\n",
" return \"\" # Return empty string on failure\n",
"\n",
"# Let's test the map function on the first few chunks\n",
"if chunks:\n",
" for i, chunk in enumerate(chunks[:5]):\n",
" summary = map_summarize_chunk(chunk, DOC_TITLE)\n",
" print(f\"Summary of chunk {i}:\")\n",
" print(summary)\n",
" print(\"-\" * 50)"
]
},
{
"cell_type": "markdown",
"id": "f8c221de-0c8f-435f-a872-d65acb622b2f",
"metadata": {},
"source": [
"## Step 4: The \"reduce\" stage: creating the final summary\n",
"\n",
"With the \"map\" stage complete, we now have a list of individual summaries for each chunk. The \"reduce\" stage combines these into a single, coherent executive summary.\n",
"\n",
"### Prompt engineering for synthesis\n",
"\n",
"The prompt for this stage is different. We are no longer just summarizing; we are *synthesizing*. The prompt instructs the model to weave the individual points from the chunk summaries into a flowing, well-written narrative."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "65ef4d59-6ec6-4de4-8214-0a5632bdacf7",
"metadata": {},
"outputs": [],
"source": [
"REDUCE_PROMPT = \"\"\"\n",
"You are a research assistant tasked with creating an executive summary.\n",
"You have been given a series of concise summaries from different sections of a \n",
"research paper.\n",
"Your goal is to synthesize these individual summaries into a single, well-written, \n",
"and coherent executive summary.\n",
"The final summary should read like a standalone document, flowing logically from \n",
"one topic to the next.\n",
"\n",
"**Summaries of Report Sections:**\n",
"{chunk_summaries}\n",
"\"\"\"\n",
"\n",
"MAX_CONTEXT_WINDOW = 100000\n",
"\n",
"def reduce_create_final_summary(chunk_summaries: List[str]) -> str:\n",
" \"\"\"\n",
" Combines chunk summaries into a final executive summary using the 'reduce' prompt.\n",
" \"\"\"\n",
" summaries_text = \"\\\\n\\\\n---\\\\n\\\\n\".join(chunk_summaries)\n",
" \n",
" encoding = tiktoken.get_encoding(ENCODING_MODEL)\n",
" if count_tokens(summaries_text, encoding) > MAX_CONTEXT_WINDOW:\n",
" # For this tutorial, we'll truncate to fit. A more advanced implementation\n",
" # might run another map-reduce pass (recursive reduction).\n",
" print(\"Warning: Combined summaries are too large; will be truncated for \"\n",
" \"final summary.\")\n",
" tokens = encoding.encode(summaries_text)\n",
" truncated_tokens = tokens[:MAX_CONTEXT_WINDOW]\n",
" summaries_text = encoding.decode(truncated_tokens, errors='ignore')\n",
"\n",
" try:\n",
" resp = client.chat.completions.create(\n",
" model=LLM_MODEL,\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": REDUCE_PROMPT.format(\n",
" chunk_summaries=summaries_text)},\n",
" ],\n",
" temperature=0.3, # Slightly higher for more fluid, natural writing\n",
" )\n",
" return resp.completion_message.content.text\n",
" except Exception as e:\n",
" print(f\" Error creating final summary: {e}\")\n",
" return \"\""
]
},
{
"cell_type": "markdown",
"id": "9654471a-d764-4b25-9d03-23a2242cbd4f",
"metadata": {},
"source": [
"## Step 5: Bringing it all together\n",
"\n",
"The following code runs the full pipeline:\n",
"1. **Map:** Iterate through a subset of our chunks and generate a summary for each one.\n",
"2. **Reduce:** Take all the generated chunk summaries and synthesize them into our final executive summary.\n",
"\n",
"To keep this tutorial fast and interactive, we'll only process the first 25 chunks. In a production scenario, you would process all chunks."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "dc7c344a-fa6d-4422-a6c8-72fb7c58dd61",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- MAP: Summarizing 25 individual chunks ---\n",
"\\nSuccessfully summarized 25 chunks.\n",
"\\nOriginal token count: 19,127\n",
"Summarized token count: 4,163\n",
"Compression rate: 78.23%\n",
"\\n--- REDUCE: Creating final summary ---\n",
"\\n==================================================\n",
" FINAL EXECUTIVE SUMMARY\n",
"==================================================\n",
"Here is a synthesized executive summary based on the provided summaries:\n",
"\n",
"**Executive Summary**\n",
"\n",
"This report introduces ASTRO, a novel framework designed to enhance the reasoning capabilities of language models by infusing search-like behavior into their outputs. ASTRO operates in three stages: data generation using Monte Carlo Tree Search (MCTS), supervised fine-tuning (SFT), and reinforcement learning (RL). The framework leverages self-reflection, backtracking, and exploration in language model outputs to improve their performance on mathematical problem-solving tasks.\n",
"\n",
"The data generation stage utilizes MCTS to build search trees, which are then linearized into node sequences and translated into long Chain-of-Thoughts (CoTs) that integrate self-reflection and backtracking in natural language. The resulting dataset is used for SFT and RL to fine-tune the Llama 3 family of models.\n",
"\n",
"The ASTRO-trained models demonstrate significant performance gains on various mathematical benchmarks, including MATH-500, AMC 2023, and AIME 2024. Specifically, Llama-3.1-70B-ASTRO-RL achieves 81.8% on MATH-500, 64.4% on AMC 2023, and 30.0% on AIME 2024 (pass@1). The models also exhibit improved self-reflection and backtracking capabilities, generating longer CoTs and achieving better training efficacy and upper bound during RL.\n",
"\n",
"The report highlights the importance of search priors in improving the model's reasoning capabilities and demonstrates that ASTRO-trained models outperform those trained without explicit search priors across all benchmarks. The results suggest that ASTRO is a promising framework for enhancing the mathematical reasoning abilities of language models.\n",
"\n",
"Overall, this research contributes to the development of more advanced language models that can reason like search algorithms, with potential applications in various domains that require complex problem-solving capabilities.\n"
]
}
],
"source": [
"# For this demonstration, we'll process a subset of chunks.\n",
"# In a real application, you would process all of them.\n",
"CHUNKS_TO_PROCESS = 25\n",
"chunks_to_summarize = chunks[:CHUNKS_TO_PROCESS]\n",
"\n",
"print(f\"--- MAP: Summarizing {len(chunks_to_summarize)} individual chunks ---\")\n",
"chunk_summaries = [map_summarize_chunk(chunk, DOC_TITLE) \n",
" for chunk in chunks_to_summarize]\n",
"chunk_summaries = [summary for summary in chunk_summaries \n",
" if summary.strip()] # Filter out errors\n",
"print(f\"\\\\nSuccessfully summarized {len(chunk_summaries)} chunks.\")\n",
"\n",
"# --- Calculate compression rate ---\n",
"encoding = tiktoken.get_encoding(ENCODING_MODEL)\n",
"original_tokens = sum(count_tokens(chunk, encoding) \n",
" for chunk in chunks_to_summarize)\n",
"summarized_tokens = sum(count_tokens(summary, encoding) \n",
" for summary in chunk_summaries)\n",
"if original_tokens > 0:\n",
" compression_rate = (1 - (summarized_tokens / original_tokens)) * 100\n",
" print(f\"\\\\nOriginal token count: {original_tokens:,}\")\n",
" print(f\"Summarized token count: {summarized_tokens:,}\")\n",
" print(f\"Compression rate: {compression_rate:.2f}%\")\n",
"\n",
"print(\"\\\\n--- REDUCE: Creating final summary ---\")\n",
"final_summary = reduce_create_final_summary(chunk_summaries)\n",
"\n",
"# --- Display Final Result ---\n",
"print(\"\\\\n\" + \"=\" * 50)\n",
"print(\" FINAL EXECUTIVE SUMMARY\")\n",
"print(\"=\" * 50)\n",
"print(final_summary)"
]
},
{
"cell_type": "markdown",
"id": "ed7d0941-d51a-409c-922e-95f7c6f1bca8",
"metadata": {},
"source": [
"## Future enhancement: Handling extremely long documents with recursive reduction\n",
"\n",
"If you are summarizing an entire book, the combined text of your *chunk summaries* might still be too long for the model's context window. The solution is **recursive reduction**.\n",
"\n",
"You run the same map-reduce process again on the chunk summaries themselves:\n",
"1. Generate 500 chunk summaries from the original document.\n",
"2. Group these 500 summaries into batches of 50.\n",
"3. Run your `reduce_create_final_summary` function on each batch, producing 10 \"super summaries\".\n",
"4. Finally, run the reduce function one last time on the 10 \"super summaries\" to get your final executive summary.\n",
"\n",
"This approach enables you to scale this summarization technique to documents of virtually any length."
]
},
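{
"cell_type": "markdown",
"id": "recursive-reduce-note",
"metadata": {},
"source": [
"The cell below is a minimal sketch of recursive reduction. It reuses `reduce_create_final_summary` and `chunk_summaries` from the earlier cells; the helper name `recursive_reduce` and the `batch_size` value are illustrative assumptions, and running it makes additional API calls."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "recursive-reduce-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: recursive reduction for documents whose chunk summaries are still too long.\n",
"from typing import List\n",
"\n",
"def recursive_reduce(summaries: List[str], batch_size: int = 50) -> str:\n",
"    \"\"\"Repeatedly reduce batches of summaries until a single summary remains.\"\"\"\n",
"    # Base case: the summaries fit into a single reduce call.\n",
"    if len(summaries) <= batch_size:\n",
"        return reduce_create_final_summary(summaries)\n",
"\n",
"    # Reduce each batch into a \"super summary\", then recurse on those.\n",
"    super_summaries = [\n",
"        reduce_create_final_summary(summaries[i:i + batch_size])\n",
"        for i in range(0, len(summaries), batch_size)\n",
"    ]\n",
"    return recursive_reduce(super_summaries, batch_size)\n",
"\n",
"# Example (uncomment to run; makes one API call per batch plus the final call):\n",
"# final_summary = recursive_reduce(chunk_summaries, batch_size=10)\n",
"# print(final_summary)"
]
},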
{
"cell_type": "markdown",
"id": "00f0e52c-2432-49df-9263-790e7add7ae4",
"metadata": {},
"source": [
"## Next steps and upgrade paths\n",
"\n",
"This tutorial provides a solid foundation for a powerful summarization pipeline. You can extend it in several ways for a production-grade application.\n",
"\n",
"| Need | Where to look |\n",
"| :----------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n",
"| **More advanced chunking** | For more robust document splitting, explore libraries such as LangChain or LlamaIndex, which offer \"Recursive Character Text Splitters\" that can handle complex documents and code. These can split based on code syntax, markdown structure, and more. |\n",
"| **Alternative patterns** | The \"Map-Reduce\" pattern is not the only option. Learn about the **\"Refine\" pattern**, where the model iteratively builds upon and refines a summary by processing one chunk at a time. This can be better for creating a single, highly coherent narrative. |\n",
"| **Question & Answering** | If your goal is to ask questions of a long document instead of summarizing it, the best approach is **Retrieval-Augmented Generation (RAG)**. This involves storing chunks in a vector database and retrieving only the most relevant ones to answer a user's question. See our [Contextual chunking RAG recipe](https://github.com/meta-llama/llama-cookbook/tree/main/end-to-end-use-cases/Contextual-Chunking-RAG). |\n"
]
}
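,
{
"cell_type": "markdown",
"id": "refine-pattern-note",
"metadata": {},
"source": [
"As a companion to the table above, the cell below sketches the \"Refine\" pattern: the model carries a running summary forward and revises it one chunk at a time. It reuses `client`, `LLM_MODEL`, `chunks`, and `DOC_TITLE` from earlier cells; `REFINE_PROMPT` and `refine_summarize` are illustrative names, not part of the map-reduce pipeline above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "refine-pattern-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: the \"Refine\" pattern - iteratively revise one summary, chunk by chunk.\n",
"REFINE_PROMPT = \"\"\"\n",
"You are refining a summary of the research paper \"{document_title}\".\n",
"\n",
"**Current summary:**\n",
"{existing_summary}\n",
"\n",
"**New text chunk:**\n",
"{chunk_text}\n",
"\n",
"Rewrite the summary so that it also covers the new chunk. Keep it concise and factual.\n",
"\"\"\"\n",
"\n",
"def refine_summarize(chunk_list, document_title: str) -> str:\n",
"    \"\"\"Build a single summary by refining it with one chunk at a time.\"\"\"\n",
"    summary = \"\"\n",
"    for chunk in chunk_list:\n",
"        resp = client.chat.completions.create(\n",
"            model=LLM_MODEL,\n",
"            messages=[{\"role\": \"user\", \"content\": REFINE_PROMPT.format(\n",
"                document_title=document_title,\n",
"                existing_summary=summary or \"(empty)\",\n",
"                chunk_text=chunk)}],\n",
"            temperature=0.2,\n",
"        )\n",
"        summary = resp.completion_message.content.text\n",
"    return summary\n",
"\n",
"# Example (uncomment to run; makes one API call per chunk):\n",
"# print(refine_summarize(chunks[:5], DOC_TITLE))"
]
}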
],
"metadata": {
"kernelspec": {
"display_name": "My Project (uv)",
"language": "python",
"name": "my-uv-project"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}