A powerful, LLM-based tool for extracting structured data from rich documents (PDFs) with Llama models.
This tool uses Llama models to extract text, tables, and images from PDFs, converting unstructured document data into structured, machine-readable formats. It supports:
The tool is designed to handle complex documents with high accuracy and provides flexible configuration options to tailor extraction tasks to specific needs.
git clone https://github.com/meta-llama/llama-cookbook.git
cd llama-cookbook
pip install -r requirements.txt
Install project specific dependencies:
cd end-to-end-use-cases/structured_parser
pip install -r requirements.txt
(Note: Setup API Key, Model for inferencing, etc.)
python src/structured_extraction.py path/to/document.pdf --text
python src/structured_extraction.py path/to/document.pdf --text --tables --save_tables_as_csv
python src/structured_extraction.py path/to/pdf_directory --text --tables --export_excel
The tool is configured via config.yaml. Key configuration options include:
model:
  backend: openai-compat  # [offline-vllm, openai-compat]
  # For openai-compat
  base_url: "https://api.llama.com/compat/v1"
  api_key: "YOUR_API_KEY"
  model_id: "Llama-4-Maverick-17B-128E-Instruct-FP8"
  # For offline-vllm
  path: "/path/to/checkpoint"
  tensor_parallel_size: 4
  max_model_len: 32000
  max_num_seqs: 32
extraction_inference:
  temperature: 0.2
  top_p: 0.9
  max_completion_tokens: 17000
  seed: 42
The tool includes configurable prompts and output schemas for each artifact type (text, tables, images). These can be modified in the config.yaml file.
config.yaml:artifacts:
  my_new_artifact:
    prompts:
      system: "Your system prompt here..."
      user: "Your user prompt with {schema} placeholder..."
    output_schema: {
      # Your JSON schema here
    }
    use_json_decoding: true
structured_extraction.py to include your new artifact type.The extraction logic is modular and can be customized by:
config.yaml fileArtifactExtractor class for specialized extraction needsThe tool supports two backends:
The tool can store extracted data in an SQLite database:
python src/structured_extraction.py path/to/document.pdf --text --tables --save_to_db
When save_to_db is enabled and a vector database path is configured, the tool also indexes extracted content for semantic search:
from src.json_to_sql import VectorIndexManager
# Search for relevant content
results = VectorIndexManager.knn_query("What is the revenue growth?", "chroma.db")
config.yaml for your specific document typesThe tool's components can be used programmatically for custom pipelines:
from src.structured_extraction import ArtifactExtractor
from src.utils import PDFUtils
# Extract pages from PDF
pages = PDFUtils.extract_pages("document.pdf")
# Process specific pages
for page in pages[10:20]:  # Process pages 10-19
    artifacts = ArtifactExtractor.from_image(page["image_path"], ["text", "tables"])
    # Custom processing of artifacts...
Extracted data can be exported to various systems:
flatten_json_to_sqljson_to_csvexport_csvs_to_excel_tabsContributions to improve the tool are welcome! Areas for improvement include:
[License information here]