DocumentLens: Rich Document Parsing with LLMs

A powerful tool for extracting structured data from rich documents (PDFs) using Llama models.

Overview

This tool uses Llama models to extract text, tables, images, and charts from PDFs, converting unstructured document data into structured, machine-readable formats. It supports:

  • Text extraction: Extract and structure main text, titles, captions, etc.
  • Table extraction: Convert complex tables into structured data formats
  • Image extraction: Extract images with contextual descriptions and captions
  • Chart extraction: Convert charts and graphs into structured JSON data
  • Multiple output formats: JSON, CSV, Excel, and SQL database storage
  • Vector search capabilities: Semantic search across extracted content

The tool is designed to handle complex documents with high accuracy and provides flexible configuration options to tailor extraction tasks to specific needs.

Project Structure

structured_parser/
├── src/
│   ├── structured_extraction.py  # Main entry point and extraction logic
│   ├── utils.py                  # Utility functions and classes
│   ├── typedicts.py              # Type definitions
│   ├── json_to_table.py          # Database integration functions
│   └── config.yaml               # Configuration file
├── pdfs/                        # Sample PDFs and extraction results
├── README.md                    # This file
├── CONTRIBUTING.md              # Contribution guidelines
└── requirements.txt             # Python dependencies

Installation

Prerequisites

  • Python 3.9+
  • [Optional] Local GPU for offline inference

Setup

  1. Clone the repository:

git clone https://github.com/meta-llama/llama-cookbook.git
cd llama-cookbook

  2. Install project-specific dependencies:

cd end-to-end-use-cases/structured_parser
pip install -r requirements.txt

  3. Configure the tool (see the Configuration section): set the API key, the inference model, and any other options in src/config.yaml.

Usage

Extract text from a PDF:

python src/structured_extraction.py path/to/document.pdf text

Extract charts and tables, and save them as CSV files:

python src/structured_extraction.py path/to/document.pdf charts,tables --save_tables_as_csv

Process a directory of PDFs and export tables to Excel:

python src/structured_extraction.py path/to/pdf_directory text,tables --export_excel

Extract all artifact types, save them to the database, and export tables to Excel:

python src/structured_extraction.py path/to/document.pdf text,tables,images,charts --save_to_db --export_excel

Configuration

The tool is configured via src/config.yaml. Key configuration options include:

Model Configuration

model:
  backend: openai-compat  # [offline-vllm, openai-compat]

  # For openai-compat
  base_url: "https://api.llama.com/compat/v1"
  api_key: "YOUR_API_KEY"
  model_id: "Llama-4-Maverick-17B-128E-Instruct-FP8"

  # For offline-vllm
  path: "/path/to/checkpoint"
  tensor_parallel_size: 4
  max_model_len: 32000
  max_num_seqs: 32

Inference Parameters

extraction_inference:
  temperature: 0.2
  top_p: 0.9
  max_completion_tokens: 32000
  seed: 42

Database Configuration

database:
  sql_db_path: "sqlite3.db"
  vector_db_path: "chroma.db"

Artifact Configuration

The tool includes configurable prompts and output schemas for each artifact type (text, tables, images, charts). These can be modified in the config.yaml file to customize extraction behavior for specific document types; see Adding New Artifact Types below for the expected structure.

Output Formats

JSON Output

The primary output is a structured JSON file containing all extracted artifacts, annotated with timestamps.
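
For illustration, one extracted artifact might have the shape below; the field names are hypothetical, since the real keys are determined by the output schemas in config.yaml:

# Hypothetical shape of one extracted artifact; illustrative only.
artifact = {
    "artifact_type": "table",
    "page": 3,
    "timestamp": "2025-06-01T12:00:00Z",
    "data": {
        "columns": ["Quarter", "Revenue"],
        "rows": [["Q1", "1.2M"], ["Q2", "1.4M"]],
    },
}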

CSV Export

Tables and charts can be exported as individual CSV files for easy analysis in spreadsheet applications.

Excel Export

Multiple tables can be combined into a single Excel workbook with separate tabs for each table.

Database Storage

Extracted data can be stored in SQLite databases with optional vector indexing for semantic search.
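
For semantic search over the vector index, a minimal sketch using the Chroma client is shown below; the collection name "artifacts" is an assumption, so inspect the store to find the real one:

import chromadb

# Open the persistent store created during extraction
# (path matches database.vector_db_path in config.yaml).
client = chromadb.PersistentClient(path="chroma.db")

# "artifacts" is a hypothetical collection name; check
# client.list_collections() to discover the actual one.
collection = client.get_collection("artifacts")

# Fetch the five entries most semantically similar to the query.
results = collection.query(query_texts=["quarterly revenue by region"], n_results=5)
print(results["documents"])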

API Usage

Programmatic Usage

from src.structured_extraction import ArtifactExtractor
from src.utils import PDFUtils

# Extract pages from PDF
pages = PDFUtils.extract_pages("document.pdf")

# Process specific pages
for page in pages[10:20]:  # Process pages 10-19
    artifacts = ArtifactExtractor.from_image(
        page["image_path"],
        ["text", "tables"]
    )
    # Custom processing of artifacts...

Single Image Processing

from src.structured_extraction import ArtifactExtractor

# Extract from a single image
artifacts = ArtifactExtractor.from_image(
    "path/to/image.png",
    ["text", "tables", "images"]
)

Architecture

Core Components

  1. RequestBuilder: Builds inference requests for LLMs with image and text content
  2. ArtifactExtractor: Extracts structured data from documents using configurable prompts
  3. PDFUtils: Handles PDF processing and page extraction as images
  4. InferenceUtils: Manages LLM inference with support for vLLM and OpenAI-compatible APIs
  5. JSONUtils: Handles JSON extraction and validation from LLM responses
  6. ImageUtils: Utility functions for image encoding and processing

Data Flow

  1. PDFs are converted to images (one per page) using PyMuPDF
  2. Images are processed by the LLM to extract structured data based on configured prompts
  3. Structured data is saved in various formats (JSON, CSV, SQL, etc.)
  4. Optional vector indexing for semantic search capabilities
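
As a minimal sketch of step 1 using PyMuPDF directly (this approximates what PDFUtils.extract_pages does, not the tool's exact implementation):

import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
for i, page in enumerate(doc):
    # Render each page to a PNG; higher DPI gives the model more detail to work with.
    pix = page.get_pixmap(dpi=150)
    pix.save(f"page_{i}.png")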

Supported Artifact Types

  • text: Main text content, titles, captions, and other textual elements
  • tables: Structured tabular data with proper formatting
  • images: Image descriptions, captions, and metadata
  • charts: Chart data extraction with structured format including axes, data points, and metadata

Extending the Tool

Adding New Artifact Types

  1. Add a new artifact type configuration in config.yaml:
artifacts:
  my_new_artifact:
    prompts:
      system: "Your system prompt here..."
      user: "Your user prompt with {schema} placeholder..."
    output_schema: {
      # Your JSON schema here
    }
    use_json_decoding: true
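
Once configured, the new type should be requestable from the CLI like the built-in ones (this assumes the artifact-type argument accepts any key defined under artifacts):

python src/structured_extraction.py path/to/document.pdf my_new_artifact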

Customizing Extraction Logic

The extraction logic is modular and can be customized by:

  1. Modifying prompts in the config.yaml file
  2. Adjusting output schemas to capture different data structures
  3. Extending the ArtifactExtractor class for specialized extraction needs (see the sketch below)
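
A hedged sketch of option 3, assuming ArtifactExtractor.from_image behaves as shown in the API Usage examples above; the subclass name and the dict-based filtering are hypothetical:

from src.structured_extraction import ArtifactExtractor

class TableOnlyExtractor(ArtifactExtractor):
    """Hypothetical subclass that post-filters extraction results."""

    @classmethod
    def from_image(cls, image_path, artifact_types):
        # Delegate to the base extractor, then apply custom filtering.
        artifacts = super().from_image(image_path, artifact_types)
        # Assumes each artifact is a dict with an "artifact_type" key (hypothetical).
        return [a for a in artifacts if a.get("artifact_type") == "tables"]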

Using Different Models

The tool supports two backends:

  1. openai-compat: Any API compatible with the OpenAI API format (including the Llama API)
  2. offline-vllm: Local inference using vLLM for self-hosted deployments

Best Practices

  1. Model Selection: Use larger models for complex documents or when high accuracy is required
  2. Prompt Engineering: Adjust prompts in config.yaml for your specific document types
  3. Output Schema: Define precise schemas to guide the model's extraction process

Troubleshooting

Common Issues

  • Model capacity errors: Reduce max_completion_tokens in config.yaml or switch to a model with a larger context window
  • Extraction quality issues: Adjust prompts or output schemas
  • Configuration errors: Verify model paths and API credentials in config.yaml