# DocumentLens: Rich Document Parsing with LLMs

A powerful, LLM-based tool for extracting structured data from rich documents (PDFs) with Llama models.

## Overview

This tool uses Llama models to extract text, tables, images, and charts from PDFs, converting unstructured document data into structured, machine-readable formats. It supports:

- **Text extraction**: Extract and structure main text, titles, captions, etc.
- **Table extraction**: Convert complex tables into structured data formats
- **Image extraction**: Extract images with contextual descriptions and captions
- **Chart extraction**: Convert charts and graphs into structured JSON data
- **Multiple output formats**: JSON, CSV, Excel, and SQL database storage
- **Vector search capabilities**: Semantic search across extracted content

The tool is designed to handle complex documents with high accuracy and provides flexible configuration options to tailor extraction tasks to specific needs.

## Project Structure

```
structured_parser/
├── src/
│   ├── structured_extraction.py  # Main entry point and extraction logic
│   ├── utils.py                  # Utility functions and classes
│   ├── typedicts.py              # Type definitions
│   ├── json_to_table.py          # Database integration functions
│   └── config.yaml               # Configuration file
├── pdfs/                         # Sample PDFs and extraction results
├── README.md                     # This file
├── CONTRIBUTING.md               # Contribution guidelines
└── requirements.txt              # Python dependencies
```

## Installation

### Prerequisites

- Python 3.9+
- [Optional] Local GPU for offline inference

### Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/meta-llama/llama-cookbook.git
   cd llama-cookbook
   ```

2. Install project-specific dependencies:

   ```bash
   cd end-to-end-use-cases/structured_parser
   pip install -r requirements.txt
   ```

3. Configure the tool: set up your API key, the model used for inference, etc. See the [Configuration](#configuration) section.

## Usage

### Extract text from a PDF

```bash
python src/structured_extraction.py path/to/document.pdf text
```

### Extract charts and tables, and save them as CSV files

```bash
python src/structured_extraction.py path/to/document.pdf charts,tables --save_tables_as_csv
```

### Process a directory of PDFs and export tables to Excel

```bash
python src/structured_extraction.py path/to/pdf_directory text,tables --export_excel
```

### Extract all artifact types, saving to the database and as Excel sheets

```bash
python src/structured_extraction.py path/to/document.pdf text,tables,images,charts --save_to_db --export_excel
```

## Configuration

The tool is configured via `src/config.yaml`. Key configuration options include:

### Model Configuration

```yaml
model:
  backend: openai-compat  # [offline-vllm, openai-compat]
  # For openai-compat
  base_url: "https://api.llama.com/compat/v1"
  api_key: "YOUR_API_KEY"
  model_id: "Llama-4-Maverick-17B-128E-Instruct-FP8"
  # For offline-vllm
  path: "/path/to/checkpoint"
  tensor_parallel_size: 4
  max_model_len: 32000
  max_num_seqs: 32
```

### Inference Parameters

```yaml
extraction_inference:
  temperature: 0.2
  top_p: 0.9
  max_completion_tokens: 32000
  seed: 42
```

### Database Configuration

```yaml
database:
  sql_db_path: "sqlite3.db"
  vector_db_path: "chroma.db"
```

### Artifact Configuration

The tool includes configurable prompts and output schemas for each artifact type (text, tables, images, charts). These can be modified in `config.yaml` to customize extraction behavior for specific document types.
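As an illustration, a customized `tables` entry might look like the sketch below, following the same `prompts` / `output_schema` / `use_json_decoding` layout shown under [Extending the Tool](#extending-the-tool) later in this README. The prompt wording and schema fields here are illustrative, not the shipped defaults:

```yaml
artifacts:
  tables:
    prompts:
      system: "You are an expert at reading tables in document images."
      user: "Extract every table on this page as JSON matching this schema: {schema}"
    output_schema:
      {
        "tables": [
          {
            "title": "string",
            "headers": ["string"],
            "rows": [["string"]]
          }
        ]
      }
    use_json_decoding: true
```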
## Output Formats

### JSON Output

The primary output format includes all extracted artifacts in a structured JSON format with timestamps.

### CSV Export

Tables and charts can be exported as individual CSV files for easy analysis in spreadsheet applications.

### Excel Export

Multiple tables can be combined into a single Excel workbook with separate tabs for each table.

### Database Storage

Extracted data can be stored in SQLite databases with optional vector indexing for semantic search.

## API Usage

### Programmatic Usage

```python
from src.structured_extraction import ArtifactExtractor
from src.utils import PDFUtils

# Extract pages from PDF
pages = PDFUtils.extract_pages("document.pdf")

# Process specific pages
for page in pages[10:20]:  # Pages at indices 10-19
    artifacts = ArtifactExtractor.from_image(
        page["image_path"], ["text", "tables"]
    )
    # Custom processing of artifacts...
```

### Single Image Processing

```python
from src.structured_extraction import ArtifactExtractor

# Extract from a single image
artifacts = ArtifactExtractor.from_image(
    "path/to/image.png", ["text", "tables", "images"]
)
```

## Architecture

### Core Components

1. **RequestBuilder**: Builds inference requests for LLMs with image and text content
2. **ArtifactExtractor**: Extracts structured data from documents using configurable prompts
3. **PDFUtils**: Handles PDF processing and page extraction as images
4. **InferenceUtils**: Manages LLM inference with support for vLLM and OpenAI-compatible APIs
5. **JSONUtils**: Handles JSON extraction and validation from LLM responses
6. **ImageUtils**: Utility functions for image encoding and processing

### Data Flow

1. PDFs are converted to images (one per page) using PyMuPDF
2. Images are processed by the LLM to extract structured data based on configured prompts
3. Structured data is saved in various formats (JSON, CSV, SQL, etc.)
4. Optionally, the structured data is vector-indexed to enable semantic search

### Supported Artifact Types

- **text**: Main text content, titles, captions, and other textual elements
- **tables**: Structured tabular data with proper formatting
- **images**: Image descriptions, captions, and metadata
- **charts**: Chart data extraction with structured format including axes, data points, and metadata

## Extending the Tool

### Adding New Artifact Types

Add a new artifact type configuration in `config.yaml`:

```yaml
artifacts:
  my_new_artifact:
    prompts:
      system: "Your system prompt here..."
      user: "Your user prompt with {schema} placeholder..."
    output_schema:
      {
        # Your JSON schema here
      }
    use_json_decoding: true
```

### Customizing Extraction Logic

The extraction logic is modular and can be customized by:

1. Modifying prompts in the `config.yaml` file
2. Adjusting output schemas to capture different data structures
3. Extending the `ArtifactExtractor` class for specialized extraction needs

### Using Different Models

The tool supports two backends:

1. **openai-compat**: Any API compatible with the OpenAI API format (including the Llama API)
2. **offline-vllm**: Local inference using vLLM for self-hosted deployments

## Best Practices

1. **Model Selection**: Use larger models for complex documents or when high accuracy is required
2. **Prompt Engineering**: Adjust prompts in `config.yaml` for your specific document types
3. **Output Schema**: Define precise schemas to guide the model's extraction process
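To illustrate the last point, a schema that pins down field names and types leaves the model less room to improvise. The snippet below is a hypothetical chart schema, not the one shipped in `config.yaml`:

```json
{
  "charts": [
    {
      "title": "string",
      "chart_type": "bar | line | pie | scatter",
      "x_axis_label": "string",
      "y_axis_label": "string",
      "data_points": [{"label": "string", "value": "number"}]
    }
  ]
}
```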
## Troubleshooting

### Common Issues

- **Model capacity errors**: Reduce `max_completion_tokens` or use a model with a longer context window
- **Extraction quality issues**: Adjust prompts or output schemas in `config.yaml`
- **Configuration errors**: Verify model paths and API credentials in `config.yaml`
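For configuration errors with the `openai-compat` backend, one quick sanity check is to list the models the endpoint serves using the standard `openai` Python client. This is a minimal sketch, using the placeholder `base_url` and `api_key` values from the configuration example above:

```python
from openai import OpenAI

# Placeholder values; substitute the base_url and api_key from src/config.yaml.
client = OpenAI(base_url="https://api.llama.com/compat/v1", api_key="YOUR_API_KEY")

# If the URL or credentials are wrong, this call fails fast with a clear error.
for model in client.models.list():
    print(model.id)
```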