# Summarization pipeline with chunking

*Copyright (c) Meta Platforms, Inc. and affiliates.
This software may be used and distributed according to the terms of the Llama Community License Agreement.*

<a href="https://colab.research.google.com/github/meta-llama/llama-cookbook/blob/main/end-to-end-use-cases/summarization/summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This tutorial shows you how to build a robust summarization pipeline for long documents. We will create an "Intelligent Summarization Assistant" that uses Llama 4 to summarize a document that is too long to be processed in a single pass.

While models like Llama 4 have massive context windows, summarizing extremely long texts can sometimes cause details to be "lost in the middle." To solve this, we will implement the **Map-Reduce** pattern: first, we'll "map" a summarization task over smaller, coherent chunks of the text, and then "reduce" those individual summaries into a final, high-fidelity overview.
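
As a rough illustration of the pattern described above, the sketch below wires a map step and a reduce step together with plain Python functions. The names `map_reduce`, `first_sentence`, and `join_summaries` are hypothetical stand-ins chosen for this example; in the actual pipeline, both steps are LLM calls.

```python
from typing import Callable, List


def map_reduce(
    chunks: List[str],
    map_fn: Callable[[str], str],
    reduce_fn: Callable[[List[str]], str],
) -> str:
    """Apply map_fn to every chunk, then combine the results with reduce_fn."""
    partial_results = [map_fn(chunk) for chunk in chunks]  # "map" stage
    return reduce_fn(partial_results)                      # "reduce" stage


# Toy stand-ins for the LLM calls (assumptions for illustration only):
# keep the first sentence of each chunk, then join the partial results.
def first_sentence(chunk: str) -> str:
    return chunk.split(".")[0].strip() + "."


def join_summaries(summaries: List[str]) -> str:
    return " ".join(summaries)


toy_chunks = [
    "Chunk one introduces the method. It has more detail here.",
    "Chunk two reports results. It also lists caveats.",
]
print(map_reduce(toy_chunks, first_sentence, join_summaries))
# Chunk one introduces the method. Chunk two reports results.
```

Because the map step treats each chunk independently, the two stages can be developed and tested separately.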

| Component          | Choice                                     | Why                                                                                                                              |
| :----------------- | :----------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------- |
| **Model**          | `Llama-4-Maverick-17B-128E-Instruct-FP8`     | A powerful model ideal for high-quality summarization at both the chunk and final summary stages. |
| **Pattern**        | Map-Reduce Summarization               | A fundamental pattern for processing long documents. We "map" a summarization function over each chunk, then "reduce" the resulting summaries into a final one. |
| **Infrastructure** | Llama API                  | Provides access to Llama 4 models using the `llama_api_client` SDK.                                     |
---

**Note on Inference Providers:** This tutorial uses the Llama API for demonstration purposes. However, you can run Llama 4 models with any preferred inference provider. Common examples include [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-meta.html) and [Together AI](https://together.ai/llama). The core logic of this tutorial can be adapted to any of these providers.

## What you will learn

- **How to implement a robust pipeline** for summarizing documents of any length.
- **The foundational "Map-Reduce" pattern** for document processing.
- **Techniques for "semantic chunking"** to split a document logically while preserving context.
- **How to craft effective, stage-specific prompts** for a multi-step LLM pipeline.
- **How to chain LLM calls** to perform complex, multi-stage tasks.

## Install dependencies

You will need two libraries for this project: `tiktoken` for fast, local token counting, and the official `llama-api-client`.

In [2]:
!pip install --quiet tiktoken llama-api-client

## Imports & Llama API client setup

Import the necessary modules and initialize the `LlamaAPIClient`. This requires a Llama API key to be available as an environment variable. If you do not have a Llama API key, please get one from [Meta Llama API](https://llama.developer.meta.com/). 

Remember, we use the Llama API for this tutorial, but you can adapt this section to use your preferred inference provider.

In [11]:
import os, sys, re
from typing import List
import tiktoken
from llama_api_client import LlamaAPIClient

# --- Llama client ---
API_KEY = os.getenv("LLAMA_API_KEY")
if not API_KEY:
    sys.exit("❌  Please set the LLAMA_API_KEY environment variable.")

client = LlamaAPIClient(api_key=API_KEY)

## Step 1: Get the data

This tutorial uses a markdown version of the Meta research paper, [ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context](https://ai.meta.com/research/publications/astro-teaching-language-models-to-reason-by-reflecting-and-backtracking-in-context/). The file, `ASTRO-Teaching_Language_Models_to_Reason.md`, is included in the `data` sub-directory of the repository, making it easy to follow along.

> We are using a markdown file for this tutorial because it preserves the document's structure with headers, which is useful for semantic chunking. If you are working with other formats like PDFs, you can use parsing services like [LlamaParse](https://www.llamaindex.ai/llamaparse) to convert them to markdown.

In [12]:
file_path = "data/ASTRO-Teaching_Language_Models_to_Reason.md"

try:
    with open(file_path, 'r', encoding='utf-8') as f:
        document_text = f.read()
except FileNotFoundError:
    raise FileNotFoundError(
        f"Error: The file was not found at {file_path}"
    )

if document_text:
    print(f"✅  Successfully loaded document: {len(document_text):,} characters.")

✅  Successfully loaded document: 142,921 characters.


## Step 2: The logic of chunking

### Why chunk?

For long documents, even with a large context window, summarizing in a single pass can lead to context degradation, where the model underweights details from the middle of the text.

To ensure all parts of the document are processed with equal focus, we use a **map-reduce** approach. Summarizing smaller, coherent chunks individually tends to produce a more detailed, higher-quality final result.

### How to chunk?

An effective chunking strategy is critical. Simply splitting the text by a fixed token count can break sentences or separate related ideas. A better approach is **semantic chunking**. Our strategy has two levels:

1.  **Header-based splitting:** First, the document is split into large sections based on its markdown headers (`#`, `##`, `###`). This preserves the document's logical structure.
2.  **Paragraph-based chunking:** Each large section is then divided into the final, smaller chunks. This process respects paragraph boundaries and a specified token limit, ensuring the chunks are both semantically coherent and sized appropriately for the LLM.

> **Note on Generalization:** This tutorial's header-based splitting is optimized for markdown documents. For other formats (like plain text or PDFs), you can generalize this header-based splitting approach by identifying similar structural elements. For instance, you could split by chapter titles, numbered sections, or use regular expressions to find custom patterns that define logical breaks in your document. The principle of multi-level semantic chunking remains the same.
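
To make the generalization above more concrete, here is a minimal sketch for plain text that splits at lines that look like chapter titles or numbered headings. The `SECTION_START` pattern and the `split_plain_text` helper are assumptions for illustration; real documents will likely need a pattern tuned to their own layout.

```python
import re
from typing import List

# Hypothetical pattern: a line starting with "Chapter <n>" or with a numbered
# heading such as "2. Methods" or "3.1 Results". Adjust it to your documents.
SECTION_START = re.compile(
    r"^(?:Chapter\s+\d+\b|\d+(?:\.\d+)*\.?\s+[A-Z])", re.MULTILINE
)


def split_plain_text(text: str) -> List[str]:
    """Split text at every line that looks like the start of a section."""
    starts = [m.start() for m in SECTION_START.finditer(text)]
    # Include 0 so that any text before the first heading is kept as well.
    boundaries = sorted(set([0] + starts + [len(text)]))
    pieces = [text[a:b].strip() for a, b in zip(boundaries, boundaries[1:])]
    return [piece for piece in pieces if piece]


sample = (
    "Preface text.\n\n"
    "1. Introduction\nIntro body.\n\n"
    "2. Methods\nMethods body.\n\n"
    "2.1 Data\nData body."
)
for section in split_plain_text(sample):
    print(repr(section))
```

Each resulting section can then go through the same paragraph-level splitting that this tutorial applies to markdown sections.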

### Choosing the right chunk size

While our chunking strategy prioritizes semantic boundaries (headers and paragraphs) over fixed token counts, we still need to set a maximum size for our chunks. This ensures that even the largest semantic chunk fits comfortably within the model's context window.

The `CHUNK_SIZE_TOKENS` constant serves as this upper limit. Finding the right value is a trade-off:

*   **Set too high:** The limit might still be larger than the model's context window (once the prompt is included), causing API calls to fail.
*   **Set too low:** This could force the chunking logic to split paragraphs or other logical units too aggressively, reducing the quality of the summaries. It also increases the number of API calls, leading to higher cost and latency.

The `16000` token limit in this tutorial is a conservative size for models with large context windows (usually 128k for models available on the Llama API). It leaves ample room for the prompt while ensuring each chunk is large enough to provide meaningful context for summarization.
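
The trade-off above can be checked with simple arithmetic. The sketch below estimates the largest chunk that would still fit; the prompt overhead, output reservation, and safety margin are illustrative assumptions, not measured values, and `max_chunk_tokens` is a hypothetical helper.

```python
# All numbers below are illustrative assumptions; check your model's limits.
CONTEXT_WINDOW_TOKENS = 128_000  # context window size mentioned above
PROMPT_OVERHEAD_TOKENS = 200     # assumed size of the instruction template
MAX_OUTPUT_TOKENS = 2_000        # assumed room reserved for the summary
SAFETY_MARGIN_PERCENT = 10       # assumed headroom for tokenizer differences


def max_chunk_tokens(
    context_window: int,
    prompt_overhead: int,
    max_output: int,
    safety_margin_percent: int,
) -> int:
    """Largest chunk that still leaves room for the prompt and the reply."""
    usable = context_window - prompt_overhead - max_output
    return usable * (100 - safety_margin_percent) // 100


budget = max_chunk_tokens(
    CONTEXT_WINDOW_TOKENS,
    PROMPT_OVERHEAD_TOKENS,
    MAX_OUTPUT_TOKENS,
    SAFETY_MARGIN_PERCENT,
)
print(budget)  # 113220
```

Under these assumptions, a 16,000-token chunk uses only a small fraction of the available budget, which is why the limit is described as conservative.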

> **Note on Local Processing:** All processing up to this point, including loading the data and chunking the text, happens locally. We have not yet made any calls to the Llama API. The token counting is done with a local library to ensure our chunks are the right size for the API calls in the next steps.

In [13]:
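# --- Constants & Configuration ---
# "o200k_base" is an OpenAI encoding shipped with tiktoken, not the Llama 4
# tokenizer, so token counts are estimates. The conservative chunk size
# leaves plenty of headroom for that difference.
ENCODING_MODEL = "o200k_base"
CHUNK_SIZE_TOKENS = 16000  # Upper limit per chunk, in estimated tokens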

def count_tokens(text: str, encoding: tiktoken.Encoding) -> int:
    """Helper function to count tokens in a string."""
    return len(encoding.encode(text))

def chunk_document(
    markdown_text: str,
    chunk_size: int = CHUNK_SIZE_TOKENS,
    headers_to_split_on: List[str] = ["#", "##", "###"]
) -> List[str]:
    """
    Chunks a markdown document, preserving header context for each chunk.
    """
    # 1. Split the document on header lines. The capture group keeps each full
    #    header line, so re.split returns
    #    [preamble, header_1, content_1, header_2, content_2, ...].
    header_pattern = "|".join(f"^{h}[ \\t].*$" for h in headers_to_split_on)
    parts = re.split(f"({header_pattern})", markdown_text, flags=re.MULTILINE)
    sections = list(zip(parts[1::2], parts[2::2]))
    if parts[0].strip():
        # Text before the first header becomes a section without a header
        sections.insert(0, ("", parts[0]))

    encoding = tiktoken.get_encoding(ENCODING_MODEL)
    final_chunks = []

    # 2. Process each section
    for header, content in sections:
        header_token_count = count_tokens(header, encoding)

        if header_token_count + count_tokens(content, encoding) <= chunk_size:
            final_chunks.append(header + content)
            continue

        # The section is too large: split its content by paragraphs and start
        # every resulting chunk with the section's header line.
        prefix = header + "\n\n" if header else ""
        paragraphs = [p for p in content.split('\n\n') if p.strip()]
        current_chunk_paragraphs = []
        current_chunk_tokens = header_token_count

        for para in paragraphs:
            para_tokens = count_tokens(para, encoding)

            # If a paragraph is too large to fit with the header, it must be truncated.
            if header_token_count + para_tokens > chunk_size:
                available_tokens = chunk_size - header_token_count
                para_token_ids = encoding.encode(para)
                truncated_ids = para_token_ids[:available_tokens]
                para = encoding.decode(truncated_ids, errors='ignore')
                para_tokens = len(truncated_ids)
                print(f"Warning: Truncating a paragraph to {para_tokens} "
                      f"tokens to fit the chunk size.")

            # If the current chunk is not empty and the new paragraph doesn't fit,
            # finalize the current chunk before starting a new one.
            if (current_chunk_paragraphs and
                (current_chunk_tokens + para_tokens > chunk_size)):
                final_chunks.append(prefix + "\n\n".join(current_chunk_paragraphs))
                current_chunk_paragraphs = []
                current_chunk_tokens = header_token_count

            current_chunk_paragraphs.append(para)
            current_chunk_tokens += para_tokens

        # Add the last remaining chunk
        if current_chunk_paragraphs:
            final_chunks.append(prefix + "\n\n".join(current_chunk_paragraphs))

    return final_chunks

# Now, let's chunk our document
chunks = chunk_document(document_text)

# --- Print Statistics and a Sample Chunk ---
if chunks:
    print(f"Total chunks created: {len(chunks)}")
    encoding = tiktoken.get_encoding(ENCODING_MODEL)
    token_counts = [count_tokens(chunk, encoding) for chunk in chunks]
    avg_tokens = sum(token_counts) / len(token_counts)
    print(f"Average token count per chunk: {avg_tokens:.2f}")
    print(f"Max token count in a chunk: {max(token_counts)}")
    print(f"Min token count in a chunk: {min(token_counts)}")
    print("-" * 50)
    print("Top 5 Chunks:")
    for i, chunk in enumerate(chunks[:5]):
        print(f"Chunk {i}:")
        print(chunk)
        print("-" * 50)

Total chunks created: 54
Average token count per chunk: 661.94
Max token count in a chunk: 6357
Min token count in a chunk: 3
--------------------------------------------------
Top 5 Chunks:
Chunk 0:
# ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context

Joongwon Kim<sup>1,2</sup>, Anirudh Goyal<sup>1</sup>, Liang Tan<sup>1</sup>, Hannaneh Hajishirzi<sup>2</sup>, Srini Iyer<sup>1</sup>, Tianlu Wang<sup>1</sup>

<sup>1</sup>AI at Meta, <sup>2</sup>University of Washington

We introduce Astro, the "Autoregressive Search-Taught Reasoner", a framework for training language models to reason like search algorithms, explicitly leveraging self-reflection, backtracking, and exploration in their outputs. Recently, training large language models (LLMs) via reinforcement learning (RL) has led to the advent of reasoning models with greatly enhanced reasoning capabilities. Open-source replications of reasoning models, while successful, build upon models that already e

## Step 3: The "map" stage: summarizing each chunk

With the document split into manageable, semantically coherent chunks, we can begin the "Map" stage. This means we apply the same operation—in this case, summarization—to each chunk independently.
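
Because each chunk is summarized independently, the map stage can also run concurrently. The sketch below is one possible way to do that with the standard library; `map_in_parallel` and `fake_summarize` are hypothetical names, and the worker count is an assumption that should be tuned to your provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def map_in_parallel(
    chunks: List[str],
    summarize_fn: Callable[[str], str],
    max_workers: int = 4,
) -> List[str]:
    """Run summarize_fn over all chunks concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        return list(pool.map(summarize_fn, chunks))


# Stand-in for an API call so the sketch runs without network access.
def fake_summarize(chunk: str) -> str:
    return chunk[:20].upper()


print(map_in_parallel(["first chunk", "second chunk"], fake_summarize))
# ['FIRST CHUNK', 'SECOND CHUNK']
```

Threads suit this workload because the time is spent waiting on network responses rather than on local computation.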

### Prompt engineering

The quality of the summaries depends heavily on the quality of the prompts. For this stage, the prompt must instruct the model to create a summary of a small piece of a larger document. It is crucial to tell the model to focus *only* on the provided text and not to add outside information.

In [15]:
LLM_MODEL = "Llama-4-Maverick-17B-128E-Instruct-FP8"
DOC_TITLE = ("ASTRO: Teaching Language Models to Reason by Reflecting and "
             "Backtracking In-Context")

MAP_PROMPT = """
Your role is to create a concise, factual summary of a text chunk from the 
research paper titled "{document_title}".
- Extract only key facts, figures, and statements from the chunk text itself.
- Omit any conversational introductions or conclusions. Do not explain what you 
  are doing.
- If a chunk contains no substantive information (e.g., only headers, formatting, 
  or boilerplate), output the exact phrase: "No substantive information."

**Text Chunk:**
{chunk_text}
"""

def map_summarize_chunk(chunk_text: str, document_title: str) -> str:
    """
    Summarizes a single chunk of text using the 'map' prompt.
    """
    try:
        resp = client.chat.completions.create(
            model=LLM_MODEL,
            messages=[
                {"role": "user", "content": MAP_PROMPT.format(
                    document_title=document_title, chunk_text=chunk_text)},
            ],
            temperature=0.1,  # Low temperature for more consistent summaries
        )
        return resp.completion_message.content.text
    except Exception as e:
        print(f"    Error summarizing chunk: {e}")
        return "" # Return empty string on failure

# Let's test the map function on the first few chunks
if chunks:
    for i, chunk in enumerate(chunks[:5]):
        summary = map_summarize_chunk(chunk, DOC_TITLE)
        print(f"Summary of chunk {i}:")
        print(summary)
        print("-" * 50)

Summary of chunk 0:
- ASTRO is a framework for training language models to reason like search algorithms.
- ASTRO leverages self-reflection, backtracking, and exploration in language model outputs.
- ASTRO uses a synthetic dataset derived from Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories.
- The framework finetunes models on search-derived traces and improves performance via reinforcement learning (RL) with verifiable rewards.
- ASTRO is applied to the Llama 3 family of models.
- Absolute performance gains achieved: 16.0% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024.
- Llama-3.1-70B-ASTRO-RL achieves 81.8% on MATH-500, 64.4% on AMC 2023, and 30.0% on AIME 2024 (pass@1).
--------------------------------------------------
Summary of chunk 1:
- ASTRO is a framework that infuses search-like behavior into language models to improve their reasoning capabilities.
- ASTRO operates in three stages: search trajectory generation, supervised fine-tuning, a
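
The map prompt asks the model to answer with the exact phrase "No substantive information." for chunks that contain nothing useful. A small filter, sketched below, could drop those answers before the reduce stage. The helper name `drop_empty_summaries` is hypothetical, and the normalization rules (ignoring surrounding quotes, case, and the final period) are assumptions about how the model might vary its reply.

```python
from typing import List

# Sentinel phrase requested by the map prompt for chunks without content.
NO_INFO_SENTINEL = "No substantive information."


def _normalize(text: str) -> str:
    """Ignore surrounding whitespace, quotes, case, and a trailing period."""
    return text.strip().strip('"').rstrip(".").lower()


def drop_empty_summaries(summaries: List[str]) -> List[str]:
    """Remove blank summaries and ones that only contain the sentinel."""
    sentinel = _normalize(NO_INFO_SENTINEL)
    return [
        summary for summary in summaries
        if _normalize(summary) and _normalize(summary) != sentinel
    ]


example = [
    "- ASTRO is a framework.",
    "No substantive information.",
    "   ",
]
print(drop_empty_summaries(example))
# ['- ASTRO is a framework.']
```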

## Step 4: The "reduce" stage: creating the final summary

With the "map" stage complete, we now have a list of individual summaries for each chunk. The "reduce" stage combines these into a single, coherent executive summary.

### Prompt engineering for synthesis

The prompt for this stage is different. We are no longer just summarizing; we are *synthesizing*. The prompt instructs the model to weave the individual points from the chunk summaries into a flowing, well-written narrative.

In [16]:
REDUCE_PROMPT = """
You are a research assistant tasked with creating an executive summary.
You have been given a series of concise summaries from different sections of a 
research paper.
Your goal is to synthesize these individual summaries into a single, well-written, 
and coherent executive summary.
The final summary should read like a standalone document, flowing logically from 
one topic to the next.

**Summaries of Report Sections:**
{chunk_summaries}
"""

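# Token budget for the combined chunk summaries (estimated with tiktoken),
# kept below the model's context window to leave room for the prompt and reply.
MAX_CONTEXT_WINDOW = 100000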

def reduce_create_final_summary(chunk_summaries: List[str]) -> str:
    """
    Combines chunk summaries into a final executive summary using the 'reduce' prompt.
    """
    summaries_text = "\n\n---\n\n".join(chunk_summaries)
    
    encoding = tiktoken.get_encoding(ENCODING_MODEL)
    if count_tokens(summaries_text, encoding) > MAX_CONTEXT_WINDOW:
        # For this tutorial, we'll truncate to fit. A more advanced implementation
        # might run another map-reduce pass (recursive reduction).
        print("Warning: Combined summaries are too large; will be truncated for "
              "final summary.")
        tokens = encoding.encode(summaries_text)
        truncated_tokens = tokens[:MAX_CONTEXT_WINDOW]
        summaries_text = encoding.decode(truncated_tokens, errors='ignore')

    try:
        resp = client.chat.completions.create(
            model=LLM_MODEL,
            messages=[
                {"role": "user", "content": REDUCE_PROMPT.format(
                    chunk_summaries=summaries_text)},
            ],
            temperature=0.3, # Slightly higher for more fluid, natural writing
        )
        return resp.completion_message.content.text
    except Exception as e:
        print(f"    Error creating final summary: {e}")
        return ""

## Step 5: Bringing it all together

The following code runs the full pipeline:
1.  **Map:** Iterate through a subset of our chunks and generate a summary for each one.
2.  **Reduce:** Take all the generated chunk summaries and synthesize them into our final executive summary.

To keep this tutorial fast and interactive, we'll only process the first 25 chunks. In a production scenario, you would process all chunks.

In [17]:
# For this demonstration, we'll process a subset of chunks.
# In a real application, you would process all of them.
CHUNKS_TO_PROCESS = 25
chunks_to_summarize = chunks[:CHUNKS_TO_PROCESS]

print(f"--- MAP: Summarizing {len(chunks_to_summarize)} individual chunks ---")
chunk_summaries = [map_summarize_chunk(chunk, DOC_TITLE) 
                   for chunk in chunks_to_summarize]
chunk_summaries = [summary for summary in chunk_summaries 
                   if summary.strip()]  # Filter out errors
print(f"\nSuccessfully summarized {len(chunk_summaries)} chunks.")

# --- Calculate compression rate ---
encoding = tiktoken.get_encoding(ENCODING_MODEL)
original_tokens = sum(count_tokens(chunk, encoding) 
                      for chunk in chunks_to_summarize)
summarized_tokens = sum(count_tokens(summary, encoding) 
                        for summary in chunk_summaries)
if original_tokens > 0:
    compression_rate = (1 - (summarized_tokens / original_tokens)) * 100
    print(f"\nOriginal token count: {original_tokens:,}")
    print(f"Summarized token count: {summarized_tokens:,}")
    print(f"Compression rate: {compression_rate:.2f}%")

print("\n--- REDUCE: Creating final summary ---")
final_summary = reduce_create_final_summary(chunk_summaries)

# --- Display Final Result ---
print("\n" + "=" * 50)
print("           FINAL EXECUTIVE SUMMARY")
print("=" * 50)
print(final_summary)

--- MAP: Summarizing 25 individual chunks ---

Successfully summarized 25 chunks.

Original token count: 19,127
Summarized token count: 4,163
Compression rate: 78.23%

--- REDUCE: Creating final summary ---
           FINAL EXECUTIVE SUMMARY
Here is a synthesized executive summary based on the provided summaries:

**Executive Summary**

This report introduces ASTRO, a novel framework designed to enhance the reasoning capabilities of language models by infusing search-like behavior into their outputs. ASTRO operates in three stages: data generation using Monte Carlo Tree Search (MCTS), supervised fine-tuning (SFT), and reinforcement learning (RL). The framework leverages self-reflection, backtracking, and exploration in language model outputs to improve their performance on mathematical problem-solving tasks.

The data generation stage utilizes MCTS to build search trees, which are then linearized into node sequences and translated into long Chain-of-Thoughts (CoTs) that integrate se

## Future enhancement: Handling extremely long documents with recursive reduction

If you are summarizing an entire book, the combined text of your *chunk summaries* might still be too long for the model's context window. The solution is **recursive reduction**.

You run the same map-reduce process again on the chunk summaries themselves:
1.  Generate 500 chunk summaries from the original document.
2.  Group these 500 summaries into batches of 50.
3.  Run your `reduce_create_final_summary` function on each batch, producing 10 "super summaries".
4.  Finally, run the reduce function one last time on the 10 "super summaries" to get your final executive summary.

This approach enables you to scale this summarization technique to documents of virtually any length.
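
The steps above can be sketched as a loop that keeps reducing in batches until one summary remains. This is a minimal illustration rather than the tutorial's implementation: `batch` and `recursive_reduce` are hypothetical helpers, `reduce_fn` stands in for a function such as `reduce_create_final_summary`, and batching by count (instead of by token budget) is a simplifying assumption.

```python
from typing import Callable, List


def batch(items: List[str], batch_size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def recursive_reduce(
    summaries: List[str],
    reduce_fn: Callable[[List[str]], str],
    batch_size: int = 50,
) -> str:
    """Reduce summaries in batches until a single summary remains."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    if not summaries:
        return ""
    while len(summaries) > 1:
        # Each pass shrinks the list by roughly a factor of batch_size.
        summaries = [reduce_fn(group) for group in batch(summaries, batch_size)]
    return summaries[0]


# Stand-in for the LLM call: report how many summaries were merged.
def fake_reduce(group: List[str]) -> str:
    return f"summary of {len(group)} items"


print(recursive_reduce([f"s{i}" for i in range(500)], fake_reduce))
# summary of 10 items
```

With 500 inputs and a batch size of 50, the first pass produces 10 intermediate summaries and the second pass merges them into one, matching the walkthrough above.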

## Next steps and upgrade paths

This tutorial provides a solid foundation for a powerful summarization pipeline. You can extend it in several ways for a production-grade application.

| Need                           | Where to look                                                                                                                                                                                                                            |
| :----------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **More advanced chunking**     | For more robust document splitting, explore libraries such as LangChain or LlamaIndex, which offer "Recursive Character Text Splitters" that can handle complex documents and code. These can split based on code syntax, markdown structure, and more. |
| **Alternative patterns**       | The "Map-Reduce" pattern is not the only option. Learn about the **"Refine" pattern**, where the model iteratively builds upon and refines a summary by processing one chunk at a time. This can be better for creating a single, highly coherent narrative. |
| **Question answering**         | If your goal is to ask questions of a long document instead of summarizing it, the best approach is **Retrieval-Augmented Generation (RAG)**. This involves storing chunks in a vector database and retrieving only the most relevant ones to answer a user's question. See our [Contextual chunking RAG recipe](https://github.com/meta-llama/llama-cookbook/tree/main/end-to-end-use-cases/Contextual-Chunking-RAG). |
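
For comparison with map-reduce, the "Refine" pattern mentioned in the table can be sketched as a simple fold over the chunks. The function names below are hypothetical, and the two callables stand in for LLM prompts (one that summarizes the first chunk, one that updates a running summary with a new chunk).

```python
from typing import Callable, List


def refine_summarize(
    chunks: List[str],
    initial_fn: Callable[[str], str],
    refine_fn: Callable[[str, str], str],
) -> str:
    """Build one running summary by folding in chunks one at a time."""
    if not chunks:
        return ""
    running_summary = initial_fn(chunks[0])
    for chunk in chunks[1:]:
        # Each step sees the summary so far plus exactly one new chunk.
        running_summary = refine_fn(running_summary, chunk)
    return running_summary


# Toy stand-ins for the two prompts, for illustration only.
def start(chunk: str) -> str:
    return chunk


def extend(summary: str, chunk: str) -> str:
    return f"{summary} + {chunk}"


print(refine_summarize(["intro", "method", "results"], start, extend))
# intro + method + results
```

Unlike map-reduce, these calls are sequential, so the pattern trades parallelism for a summary that is built with knowledge of everything seen so far.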
