DocumentLens: Rich Document Parsing with LLMs

A powerful LLM-based tool for extracting structured data from rich documents (PDFs) using Llama models.

Overview

This tool uses Llama models to extract text, tables, and images from PDFs, converting unstructured document data into structured, machine-readable formats. It supports:

  • Text extraction: Extract and structure main text, titles, captions, etc.
  • Table extraction: Convert complex tables into structured data formats
  • Image extraction: Extract images with contextual descriptions and captions
  • Multiple output formats: JSON, CSV, Excel, and SQL database storage
  • Vector search capabilities: Semantic search across extracted content

The tool is designed to handle complex documents with high accuracy and provides flexible configuration options to tailor extraction tasks to specific needs.

Installation

Prerequisites

  • Python 3.9+
  • [Optional] Local GPU for offline inference

Setup

  1. Clone the repository
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure the tool (see Configuration section)

Quick Start

Extract text from a PDF:

python src/structured_extraction.py path/to/document.pdf --text

Extract text and tables, and save tables as CSV files:

python src/structured_extraction.py path/to/document.pdf --text --tables --save_tables_as_csv

Process a directory of PDFs and export tables to Excel:

python src/structured_extraction.py path/to/pdf_directory --text --tables --export_excel

Configuration

The tool is configured via config.yaml. Key configuration options include:

Model Configuration

model:
  backend: openai-compat  # [offline-vllm, openai-compat]

  # For openai-compat
  base_url: "https://api.llama.com/compat/v1"
  api_key: "YOUR_API_KEY"
  model_id: "Llama-4-Maverick-17B-128E-Instruct-FP8"

  # For offline-vllm
  path: "/path/to/checkpoint"
  tensor_parallel_size: 4
  max_model_len: 32000
  max_num_seqs: 32

Inference Parameters

extraction_inference:
  temperature: 0.2
  top_p: 0.9
  max_completion_tokens: 17000
  seed: 42

Artifact Configuration

The tool includes configurable prompts and output schemas for each artifact type (text, tables, images). These can be modified in the config.yaml file.
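
For illustration, a text artifact entry could be shaped like this (a hypothetical sketch; the actual prompts and schema ship with config.yaml):

artifacts:
  text:
    prompts:
      system: "You are a careful document parsing assistant..."
      user: "Extract all text artifacts from this page as JSON matching this schema: {schema}"
    output_schema: {
      # JSON schema describing the expected text artifacts
    }
    use_json_decoding: true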

Architecture

Core Components

  1. RequestBuilder: Builds inference requests for LLMs
  2. ArtifactExtractor: Extracts structured data from documents
  3. DatabaseManager: Manages SQL database operations
  4. VectorIndexManager: Handles vector indexing and search

Data Flow

  1. PDFs are converted to images (one per page)
  2. Images are processed by the LLM to extract structured data
  3. Structured data is saved in various formats (JSON, CSV, SQL, etc.)
  4. Optional vector indexing for semantic search capabilities
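
Programmatically, the same flow looks roughly like this (a minimal sketch built from the pieces shown under Advanced Use Cases; it assumes ArtifactExtractor.from_image returns a list of artifact records):

from src.structured_extraction import ArtifactExtractor
from src.utils import PDFUtils
import json

# 1. Render each PDF page to an image
pages = PDFUtils.extract_pages("document.pdf")

# 2. Run the LLM over each page image to extract structured artifacts
all_artifacts = []
for page in pages:
    all_artifacts.extend(
        ArtifactExtractor.from_image(page["image_path"], ["text", "tables"])
    )

# 3. Persist the structured output (JSON here; CSV/SQL are also supported)
with open("document_artifacts.json", "w") as f:
    json.dump(all_artifacts, f, indent=2)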

Extending the Tool

Adding New Artifact Types

  1. Add a new artifact type configuration in config.yaml:
artifacts:
  my_new_artifact:
    prompts:
      system: "Your system prompt here..."
      user: "Your user prompt with {schema} placeholder..."
    output_schema: {
      # Your JSON schema here
    }
    use_json_decoding: true
  2. Update the command-line interface in structured_extraction.py to include your new artifact type.
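
The CLI change can be as small as one new flag, assuming structured_extraction.py builds its interface with argparse (a hypothetical sketch; adapt to however the parser is actually defined):

# Hypothetical flag; adapt to the parser defined in structured_extraction.py
parser.add_argument(
    "--my_new_artifact",
    action="store_true",
    help="Extract the my_new_artifact type defined in config.yaml",
)

# ...and include it when assembling the list of artifact types to extract:
artifact_types = []
if args.my_new_artifact:
    artifact_types.append("my_new_artifact")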

Customizing Extraction Logic

The extraction logic is modular and can be customized by:

  1. Modifying prompts in the config.yaml file
  2. Adjusting output schemas to capture different data structures
  3. Extending the ArtifactExtractor class for specialized extraction needs
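
For instance, a specialized extractor could subclass ArtifactExtractor and post-process its output (a minimal sketch; the from_image signature is inferred from the example under Advanced Use Cases, and the dict-shaped artifact records and filtering logic are assumptions):

from src.structured_extraction import ArtifactExtractor

class InvoiceExtractor(ArtifactExtractor):
    """Illustrative subclass: post-filters extracted artifacts."""

    @classmethod
    def from_image(cls, image_path, artifact_types):
        artifacts = super().from_image(image_path, artifact_types)
        # Hypothetical post-processing step: keep only table artifacts
        return [a for a in artifacts if a.get("artifact_type") == "table"]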

Using Different Models

The tool supports two backends:

  1. openai-compat: Any API compatible with the OpenAI API format (including Llama API)
  2. offline-vllm: Local inference using VLLM for self-hosted deployments

Database Integration

SQL Database

The tool can store extracted data in an SQLite database:

python src/structured_extraction.py path/to/document.pdf --text --tables --save_to_db
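
The result is a plain SQLite database, so it can be inspected with standard tooling (a sketch; the database filename below is an assumption, use the path from your configuration):

import sqlite3

conn = sqlite3.connect("documentlens.db")  # assumed filename; use your configured path
# List the tables created by the extraction run
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)
conn.close()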

Vector Search

When save_to_db is enabled and a vector database path is configured, the tool also indexes extracted content for semantic search:

from src.json_to_sql import VectorIndexManager

# Search for relevant content
results = VectorIndexManager.knn_query("What is the revenue growth?", "chroma.db")
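
The exact shape of results depends on the vector store; assuming a Chroma-style response (parallel lists of documents and distances), inspecting the top hits might look like:

# Assumed Chroma-style layout: results["documents"][0] and results["distances"][0]
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"distance={dist:.3f}  text={doc[:80]}")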

Best Practices

  1. Model Selection: Use larger models for complex documents or when high accuracy is required
  2. Prompt Engineering: Adjust prompts in config.yaml for your specific document types
  3. Output Schema: Define precise schemas to guide the model's extraction process
  4. Batch Processing: Use directory processing for efficiently handling multiple documents
  5. Performance Tuning: Adjust inference parameters based on your accuracy vs. speed requirements

Limitations

  • PDF rendering quality affects extraction accuracy
  • Complex multi-column layouts may require specialized prompts
  • Very large tables might be truncated due to token limitations

Advanced Use Cases

Custom Processing Pipelines

The tool's components can be used programmatically for custom pipelines:

from src.structured_extraction import ArtifactExtractor
from src.utils import PDFUtils

# Extract pages from PDF
pages = PDFUtils.extract_pages("document.pdf")

# Process specific pages
for page in pages[10:20]:  # pages at indices 10-19 (the 11th-20th pages)
    artifacts = ArtifactExtractor.from_image(page["image_path"], ["text", "tables"])
    # Custom processing of artifacts...

Export to Other Systems

Extracted data can be exported to various systems:

  • SQL databases: Using flatten_json_to_sql
  • CSV files: Using json_to_csv
  • Excel workbooks: Using export_csvs_to_excel_tabs
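
A sketch of how these helpers might be wired together (import paths and signatures are assumptions based on the helper names; check the src/ modules for the actual interfaces):

# Import paths and signatures below are assumptions based on the helper names;
# check the src/ modules for the actual interfaces.
from src.json_to_sql import flatten_json_to_sql, json_to_csv, export_csvs_to_excel_tabs

flatten_json_to_sql("document_artifacts.json", "documentlens.db")
json_to_csv("document_artifacts.json", "tables_out/")
export_csvs_to_excel_tabs("tables_out/", "document_tables.xlsx")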

Troubleshooting

  • Model capacity errors: Reduce max tokens or use a larger model
  • Extraction quality issues: Adjust prompts or output schemas
  • Performance issues: Use batch processing or adjust tensor parallelism

Contributing

Contributions to improve the tool are welcome! Areas for improvement include:

  • Additional output formats
  • Improved table extraction for complex layouts
  • Support for more document types beyond PDFs
  • Optimization for specific document domains

License

[License information here]