A powerful, LLM-based tool for extracting structured data from rich documents (PDFs) with Llama models.
This tool uses Llama models to extract text, tables, and images from PDFs, converting unstructured document data into structured, machine-readable formats. It supports:

- Text extraction from document pages
- Table extraction, with optional export to CSV files or Excel workbooks
- Image extraction
- Storage of extracted data in SQLite, with optional vector indexing for semantic search
The tool is designed to handle complex documents with high accuracy and provides flexible configuration options to tailor extraction tasks to specific needs.
Install the dependencies:

```bash
pip install -r requirements.txt
```
Extract text from a PDF:

```bash
python src/structured_extraction.py path/to/document.pdf --text
```
Extract text and tables, and save tables as CSV files:

```bash
python src/structured_extraction.py path/to/document.pdf --text --tables --save_tables_as_csv
```
Process a directory of PDFs and export tables to Excel:

```bash
python src/structured_extraction.py path/to/pdf_directory --text --tables --export_excel
```
The tool is configured via `config.yaml`. Key configuration options include:
```yaml
model:
  backend: openai-compat  # [offline-vllm, openai-compat]

  # For openai-compat
  base_url: "https://api.llama.com/compat/v1"
  api_key: "YOUR_API_KEY"
  model_id: "Llama-4-Maverick-17B-128E-Instruct-FP8"

  # For offline-vllm
  path: "/path/to/checkpoint"
  tensor_parallel_size: 4
  max_model_len: 32000
  max_num_seqs: 32

extraction_inference:
  temperature: 0.2
  top_p: 0.9
  max_completion_tokens: 17000
  seed: 42
```
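To run inference locally instead, set `backend: offline-vllm` and fill in `path` (a local model checkpoint), `tensor_parallel_size`, `max_model_len`, and `max_num_seqs`; presumably only the fields for the selected backend are read, so the `openai-compat` settings can be left as-is.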
The tool includes configurable prompts and output schemas for each artifact type (text, tables, images). These can be modified in the `config.yaml` file.
To add a new artifact type, define it under the `artifacts` section of `config.yaml`:

```yaml
artifacts:
  my_new_artifact:
    prompts:
      system: "Your system prompt here..."
      user: "Your user prompt with {schema} placeholder..."
    output_schema: {
      # Your JSON schema here
    }
    use_json_decoding: true
```
Then update `structured_extraction.py` to include your new artifact type; see the sketch after the lists below.

The extraction logic is modular and can be customized by:

- Editing the prompts and output schemas in the `config.yaml` file
- Extending the `ArtifactExtractor` class for specialized extraction needs

The tool supports two backends:

- `openai-compat`: sends requests to any OpenAI-compatible API endpoint, such as the Llama API shown in the configuration above
- `offline-vllm`: runs inference locally from a model checkpoint using vLLM
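Once registered, the new artifact should be requestable by the same key used in `config.yaml`. A minimal sketch, assuming `ArtifactExtractor.from_image` accepts any configured artifact key (the file path and artifact name are placeholders):

```python
from src.structured_extraction import ArtifactExtractor
from src.utils import PDFUtils

# Render the PDF to page images, then request the custom artifact type
# by the key it was given under `artifacts:` in config.yaml.
pages = PDFUtils.extract_pages("document.pdf")
artifacts = ArtifactExtractor.from_image(pages[0]["image_path"], ["my_new_artifact"])
print(artifacts)
```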
The tool can store extracted data in an SQLite database:

```bash
python src/structured_extraction.py path/to/document.pdf --text --tables --save_to_db
```
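To check what was written, the database can be inspected with Python's built-in `sqlite3` module. A minimal sketch; the database filename is a placeholder, since the actual path depends on your configuration:

```python
import sqlite3

# List the tables the tool created; the schema depends on the extracted artifacts.
conn = sqlite3.connect("artifacts.db")  # placeholder path
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)
conn.close()
```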
When `save_to_db` is enabled and a vector database path is configured, the tool also indexes extracted content for semantic search:
```python
from src.json_to_sql import VectorIndexManager

# Search for relevant content
results = VectorIndexManager.knn_query("What is the revenue growth?", "chroma.db")
```
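The default `chroma.db` filename suggests a Chroma-backed index, but the exact shape of `results` depends on the vector store, so print it once to see which fields (documents, distances, IDs) it exposes.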
For best results, tailor the prompts and schemas in `config.yaml` to your specific document types.

The tool's components can be used programmatically for custom pipelines:
```python
from src.structured_extraction import ArtifactExtractor
from src.utils import PDFUtils

# Extract pages from PDF
pages = PDFUtils.extract_pages("document.pdf")

# Process specific pages
for page in pages[10:20]:  # Process pages 10-19
    artifacts = ArtifactExtractor.from_image(page["image_path"], ["text", "tables"])
    # Custom processing of artifacts...
```
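Building on that loop, here is a minimal sketch of persisting the results for downstream use. It assumes the returned artifacts are JSON-serializable; `default=str` guards against values that are not:

```python
import json

from src.structured_extraction import ArtifactExtractor
from src.utils import PDFUtils

pages = PDFUtils.extract_pages("document.pdf")
all_artifacts = [
    ArtifactExtractor.from_image(page["image_path"], ["text", "tables"])
    for page in pages[10:20]
]

# Write everything to one file for downstream processing.
with open("artifacts.json", "w", encoding="utf-8") as f:
    json.dump(all_artifacts, f, indent=2, default=str)
```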
Extracted data can be exported to various systems using helpers in the `src` package:

- `flatten_json_to_sql`: flattens extracted JSON and writes it to a SQL database
- `json_to_csv`: converts extracted JSON into CSV files
- `export_csvs_to_excel_tabs`: collects CSV files into a single Excel workbook, one tab per file
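These appear to be the same helpers behind the `--save_to_db`, `--save_tables_as_csv`, and `--export_excel` flags shown earlier. A rough sketch of calling one directly; the module path and argument names are assumptions, so check the definitions under `src/` for the real signatures:

```python
# Hypothetical usage: argument names are illustrative, not the actual API.
from src.json_to_sql import flatten_json_to_sql

flatten_json_to_sql("artifacts.json", "artifacts.db")
```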
Contributions to improve the tool are welcome!
[License information here]