This guide walks you through setting up and using the Structured Document Parser tool to extract text, tables, and images from PDF documents.
First, install the dependencies:

```bash
pip install -r requirements.txt
```
Edit the `src/config.yaml` file to configure the tool:
```yaml
# Choose your inference backend
model:
  backend: openai-compat  # Use "offline-vllm" for local inference

  # If using openai-compat
  base_url: "https://api.llama.com/compat/v1"
  api_key: "YOUR_API_KEY"
  model_id: "Llama-4-Maverick-17B-128E-Instruct-FP8"  # Or your preferred model
```
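If you want to double-check your edits before running the tool, you can parse the file with PyYAML. This is just an illustrative sanity check, not part of the tool's own loading code:

```python
import yaml

# Load the edited config and print the model settings to confirm the file parses
with open("src/config.yaml") as f:
    config = yaml.safe_load(f)

print(config["model"]["backend"])
print(config["model"]["model_id"])
```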
To extract text from a single PDF, run:

```bash
python src/structured_extraction.py path/to/document.pdf --text
```

This will extract the text and save the results in the `extracted` directory.

To extract both text and tables:

```bash
python src/structured_extraction.py path/to/document.pdf --text --tables
```
To extract text, tables, and images:

```bash
python src/structured_extraction.py path/to/document.pdf --text --tables --images
```
To process every PDF in a directory:

```bash
python src/structured_extraction.py path/to/pdf_directory --text --tables
```
To save extracted tables as CSV files:

```bash
python src/structured_extraction.py path/to/document.pdf --tables --save_tables_as_csv
```

Tables will be saved as individual CSV files in `extracted/tables_TIMESTAMP/`.
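You can then load the exported tables with pandas; the glob pattern below is illustrative, since the timestamped directory name will differ on each run:

```python
import glob

import pandas as pd

# Read each exported table back into a DataFrame (directory name is illustrative)
for csv_path in sorted(glob.glob("extracted/tables_*/*.csv")):
    df = pd.read_csv(csv_path)
    print(csv_path, df.shape)
```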
To export tables to Excel:

```bash
python src/structured_extraction.py path/to/document.pdf --tables --export_excel
```

Tables will be combined into a single Excel file with multiple sheets.
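The combined workbook can likewise be read back with pandas, one DataFrame per sheet. The file name below is a placeholder for whatever the export produces:

```python
import pandas as pd

# sheet_name=None loads every sheet into a dict of {sheet_name: DataFrame}
sheets = pd.read_excel("extracted/tables.xlsx", sheet_name=None)
for name, df in sheets.items():
    print(name, df.shape)
```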
To store results in a database:

```bash
python src/structured_extraction.py path/to/document.pdf --text --tables --save_to_db
```

Extracted content will be stored in an SQLite database for structured querying.
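As a quick sanity check, you can inspect the database with Python's standard-library `sqlite3` module. The file and table names below are the ones used later in this guide; adjust them if your setup differs:

```python
import sqlite3

# Count stored artifacts by type
conn = sqlite3.connect("sqlite3.db")
rows = conn.execute(
    "SELECT artifact_type, COUNT(*) FROM document_artifacts GROUP BY artifact_type"
).fetchall()
conn.close()
print(rows)
```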
You can also use the extraction API directly from Python:

```python
from src.structured_extraction import ArtifactExtractor
from src.utils import PDFUtils

# Extract pages from a PDF
pdf_pages = PDFUtils.extract_pages("document.pdf")

# Process each page
for page in pdf_pages:
    # Extract text
    text_artifacts = ArtifactExtractor.from_image(page["image_path"], ["text"])

    # Or extract multiple artifact types
    all_artifacts = ArtifactExtractor.from_image(
        page["image_path"], ["text", "tables", "images"]
    )

    # Process the extracted artifacts
    print(all_artifacts)
```
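Continuing from the snippet above, you could collect the per-page results and write them to a single JSON file. This is a minimal sketch that assumes the values returned by `from_image` are JSON-serializable:

```python
import json

# Collect artifacts from every page and write them to one JSON file
results = []
for page in pdf_pages:
    artifacts = ArtifactExtractor.from_image(
        page["image_path"], ["text", "tables", "images"]
    )
    results.append({"page": page["image_path"], "artifacts": artifacts})

with open("extracted_artifacts.json", "w") as f:
    json.dump(results, f, indent=2, default=str)  # default=str guards non-serializable values
```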
Once content has been saved to the database (via `--save_to_db`), you can query it with SQL:

```python
from src.json_to_sql import DatabaseManager

# Query all text artifacts
text_df = DatabaseManager.sql_query(
    "sqlite3.db",
    "SELECT * FROM document_artifacts WHERE artifact_type = 'text'"
)

# Query tables containing specific content
revenue_tables = DatabaseManager.sql_query(
    "sqlite3.db",
    "SELECT * FROM document_artifacts WHERE artifact_type = 'table' AND table_info LIKE '%revenue%'"
)
```
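Assuming `sql_query` returns a pandas DataFrame (as the `_df` naming suggests), the results can be inspected the usual way; this follow-up is illustrative only:

```python
# Continuing from the queries above; assumes pandas DataFrames are returned
print(text_df.shape)
print(text_df.head())
print(f"{len(revenue_tables)} table(s) mention 'revenue'")
```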
You can also run semantic (nearest-neighbour) searches against the vector index:

```python
from src.json_to_sql import VectorIndexManager

# Search for relevant content
results = VectorIndexManager.knn_query(
    "What is the revenue growth for Q2?",
    "chroma.db",
    n_results=5
)

# Display results
for i, (doc_id, distance, content) in enumerate(zip(
    results['ids'], results['distances'], results['documents']
)):
    print(f"Result {i+1} (similarity: {1-distance:.2f}):")
    print(content[:200] + "...\n")
```
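If you only want close matches, you can filter on the returned distances. A minimal sketch, assuming the result lists shown above and an illustrative threshold value:

```python
# Keep only results below a chosen distance threshold (value is illustrative)
MAX_DISTANCE = 0.5
close_matches = [
    (doc_id, content)
    for doc_id, distance, content in zip(
        results["ids"], results["distances"], results["documents"]
    )
    if distance < MAX_DISTANCE
]
print(f"{len(close_matches)} result(s) within distance {MAX_DISTANCE}")
```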
Edit the prompts in `src/config.yaml` to improve extraction for your specific document types:
````yaml
artifacts:
  text:
    prompts:
      system: "You are an OCR expert. Your task is to extract all text sections..."
      user: "TARGET SCHEMA:\n```json\n{schema}\n```"
````
To add a custom artifact type, define it in `src/config.yaml`:

```yaml
artifacts:
  my_custom_type:
    prompts:
      system: "Your custom system prompt..."
      user: "Your custom user prompt with {schema} placeholder..."
    output_schema: {
      # Your schema definition here
    }
    use_json_decoding: true
```
Then expose the new type in `main()` in `src/structured_extraction.py`:

```python
def main(
    target_path: str,
    text: bool = True,
    tables: bool = False,
    images: bool = False,
    my_custom_type: bool = False,  # Add your type here
    save_to_db: bool = False,
    ...
):
    # Update artifact types logic
    to_extract = []
    if text:
        to_extract.append("text")
    if tables:
        to_extract.append("tables")
    if images:
        to_extract.append("images")
    if my_custom_type:
        to_extract.append("my_custom_type")  # Add your type here
```
If the LLM responses aren't being parsed correctly, check the `use_json_decoding` setting in `config.yaml` (set it to `true` for more reliable parsing).

If you encounter database errors, run `DatabaseManager.create_artifact_table()` to reinitialize the table schema.

If PDF extraction quality is poor, check how pages are rendered to images in `PDFUtils.extract_pages()`.