A powerful, LLM-based tool for extracting structured data from rich documents (PDFs) with Llama models.
This tool uses Llama models to extract text, tables, images, and charts from PDFs, converting unstructured document data into structured, machine-readable formats.
The tool is designed to handle complex documents with high accuracy and provides flexible configuration options to tailor extraction tasks to specific needs.
```
structured_parser/
├── src/
│   ├── structured_extraction.py  # Main entry point and extraction logic
│   ├── utils.py                  # Utility functions and classes
│   ├── typedicts.py              # Type definitions
│   ├── json_to_table.py          # Database integration functions
│   └── config.yaml               # Configuration file
├── pdfs/                         # Sample PDFs and extraction results
├── README.md                     # This file
├── CONTRIBUTING.md               # Contribution guidelines
└── requirements.txt              # Python dependencies
```
```bash
git clone https://github.com/meta-llama/llama-cookbook.git
cd llama-cookbook
```

```bash
cd end-to-end-use-cases/structured_parser
pip install -r requirements.txt
```
(Note: set up your API key and inference model in `src/config.yaml` before running; see the configuration options below.)
```bash
# Extract text from a single PDF
python src/structured_extraction.py path/to/document.pdf text

# Extract charts and tables, saving each table as a CSV file
python src/structured_extraction.py path/to/document.pdf charts,tables --save_tables_as_csv

# Process a directory of PDFs and export tables to an Excel workbook
python src/structured_extraction.py path/to/pdf_directory text,tables --export_excel

# Extract all artifact types, store them in the database, and export to Excel
python src/structured_extraction.py path/to/document.pdf text,tables,images,charts --save_to_db --export_excel
```
The tool is configured via `src/config.yaml`. Key configuration options include:
```yaml
model:
  backend: openai-compat  # [offline-vllm, openai-compat]

  # For openai-compat
  base_url: "https://api.llama.com/compat/v1"
  api_key: "YOUR_API_KEY"
  model_id: "Llama-4-Maverick-17B-128E-Instruct-FP8"

  # For offline-vllm
  path: "/path/to/checkpoint"
  tensor_parallel_size: 4
  max_model_len: 32000
  max_num_seqs: 32

extraction_inference:
  temperature: 0.2
  top_p: 0.9
  max_completion_tokens: 32000
  seed: 42

database:
  sql_db_path: "sqlite3.db"
  vector_db_path: "chroma.db"
```
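As a quick sanity check before running an extraction, the config can be loaded and inspected with PyYAML. This is a sketch, not part of the tool; it assumes `pyyaml` is installed (it inlines a trimmed copy of the options above rather than reading the file):

```python
import yaml  # PyYAML; assumed to be available via requirements.txt

# A trimmed, inlined copy of the options above, for illustration only.
CONFIG = """
model:
  backend: openai-compat
  base_url: "https://api.llama.com/compat/v1"
extraction_inference:
  temperature: 0.2
"""

config = yaml.safe_load(CONFIG)

# Catch a misconfigured backend early, before any inference is attempted.
assert config["model"]["backend"] in ("offline-vllm", "openai-compat")
print(config["model"]["base_url"])  # → https://api.llama.com/compat/v1
```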
The tool includes configurable prompts and output schemas for each artifact type (text, tables, images, charts). These can be modified in the `config.yaml` file to customize extraction behavior for specific document types.
- **JSON**: the primary output format includes all extracted artifacts in a structured JSON format with timestamps.
- **CSV**: tables and charts can be exported as individual CSV files for easy analysis in spreadsheet applications.
- **Excel**: multiple tables can be combined into a single Excel workbook with separate tabs for each table.
- **SQLite**: extracted data can be stored in SQLite databases with optional vector indexing for semantic search.
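To illustrate the database path, here is a minimal sketch of storing and querying one extracted artifact with Python's standard `sqlite3` module. The table layout below is hypothetical; the actual schema used by `--save_to_db` is created in `src/json_to_table.py`:

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical single-table layout, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifacts (doc TEXT, kind TEXT, payload TEXT, extracted_at TEXT)"
)

# Store one extracted table as a JSON payload with a timestamp.
artifact = {"rows": [["Q1", 10], ["Q2", 12]]}
conn.execute(
    "INSERT INTO artifacts VALUES (?, ?, ?, ?)",
    ("report.pdf", "table", json.dumps(artifact),
     datetime.now(timezone.utc).isoformat()),
)

# Query it back and decode the JSON payload.
kind, payload = conn.execute(
    "SELECT kind, payload FROM artifacts WHERE doc = ?", ("report.pdf",)
).fetchone()
print(kind, json.loads(payload)["rows"][0])  # → table ['Q1', 10]
```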
```python
from src.structured_extraction import ArtifactExtractor
from src.utils import PDFUtils

# Extract pages from PDF
pages = PDFUtils.extract_pages("document.pdf")

# Process specific pages
for page in pages[10:20]:  # Process pages 10-19
    artifacts = ArtifactExtractor.from_image(
        page["image_path"],
        ["text", "tables"],
    )
    # Custom processing of artifacts...
```
```python
from src.structured_extraction import ArtifactExtractor

# Extract from a single image
artifacts = ArtifactExtractor.from_image(
    "path/to/image.png",
    ["text", "tables", "images"],
)
```
New artifact types can be defined under `artifacts:` in `config.yaml`:

```yaml
artifacts:
  my_new_artifact:
    prompts:
      system: "Your system prompt here..."
      user: "Your user prompt with {schema} placeholder..."
    output_schema: {
      # Your JSON schema here
    }
    use_json_decoding: true
```
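The `{schema}` placeholder in the user prompt is presumably filled with the artifact's `output_schema` at request time. A rough sketch of that substitution, using hypothetical values (the actual prompt rendering lives in the tool's source):

```python
import json

# Hypothetical artifact definition mirroring the config fragment above.
output_schema = {"type": "object", "properties": {"title": {"type": "string"}}}
user_prompt = "Extract my_new_artifact as JSON matching this schema: {schema}"

# Substitute the JSON schema into the {schema} placeholder.
rendered = user_prompt.format(schema=json.dumps(output_schema))
print(rendered)
```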
The extraction logic is modular and can be customized by:

- Modifying the prompts and output schemas in the `config.yaml` file
- Extending the `ArtifactExtractor` class for specialized extraction needs

The tool supports two backends, `openai-compat` (hosted, OpenAI-compatible APIs) and `offline-vllm` (local inference with vLLM); tune the prompts and schemas in `config.yaml` for your specific document types.