## Overview

This tool uses Llama models to extract text, tables, images, and charts from PDFs, converting unstructured document data into structured, machine-readable formats. It supports:

- **Text extraction**: Extract and structure main text, titles, captions, etc.
- **Table extraction**: Convert complex tables into structured data formats
- **Image extraction**: Extract images with contextual descriptions and captions
- **Chart extraction**: Convert charts and graphs into structured JSON data
- **Multiple output formats**: JSON, CSV, Excel, and SQL database storage
- **Vector search capabilities**: Semantic search across extracted content

The tool is designed to handle complex documents with high accuracy and provides flexible configuration options to tailor extraction tasks to specific needs.

## Project Structure

```
structured_parser/
├── src/
│   ├── structured_extraction.py   # Main entry point and extraction logic
│   ├── utils.py                    # Utility functions and classes
│   ├── typedicts.py                # Type definitions
│   ├── json_to_table.py            # Database integration functions
│   └── config.yaml                 # Configuration file
├── pdfs/                           # Sample PDFs and extraction results
├── README.md                       # This file
├── CONTRIBUTING.md                 # Contribution guidelines
└── requirements.txt                # Python dependencies
```

## Installation

### Prerequisites

- Python 3
- Python packages listed in `requirements.txt` (installed below)
- Access to a Llama model, via an OpenAI-compatible API or a local VLLM deployment

### Setup

1. Clone the repository:

```bash
git clone https://github.com/meta-llama/llama-cookbook.git
cd llama-cookbook
```

2. Install project-specific dependencies:

```bash
cd end-to-end-use-cases/structured_parser
pip install -r requirements.txt
```

## Quick Start

### Configure the tool (see [Configuration](#configuration) section)

(Note: set up the API key, the model used for inference, etc.)

### Extract text from a PDF:

```bash
python src/structured_extraction.py path/to/document.pdf text
```

### Extract charts and tables, and save them as CSV files:

```bash
python src/structured_extraction.py path/to/document.pdf charts,tables --save_tables_as_csv
```

### Process a directory of PDFs and export tables to Excel:

```bash
python src/structured_extraction.py path/to/pdf_directory text,tables --export_excel
```

### Extract all artifact types and save to database and as Excel sheets:

```bash
python src/structured_extraction.py path/to/document.pdf text,tables,images,charts --save_to_db --export_excel
```

|
|
|
|
## Configuration
|
|
## Configuration
|
|
|
|
|
|
-The tool is configured via `config.yaml`. Key configuration options include:
|
|
|
|
|
|
+The tool is configured via `src/config.yaml`. Key configuration options include:
|
|
|
|
|
|
### Model Configuration

```yaml
extraction_inference:
  temperature: 0.2
  top_p: 0.9
  max_completion_tokens: 32000
  seed: 42
```

### Database Configuration

```yaml
database:
  sql_db_path: "sqlite3.db"
  vector_db_path: "chroma.db"
```

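A quick way to sanity-check the SQL database after a run with `--save_to_db` is Python's built-in `sqlite3` module. This is a minimal sketch assuming the default `sqlite3.db` path above; the table layout is defined by the tool, so discover the tables first:

```python
import sqlite3

# Open the database written by --save_to_db (default path from config.yaml)
conn = sqlite3.connect("sqlite3.db")

# List the tables before querying, since the schema is tool-defined
tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)
conn.close()
```
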
### Artifact Configuration

The tool includes configurable prompts and output schemas for each artifact type (text, tables, images, charts). These can be modified in the `config.yaml` file to customize extraction behavior for specific document types.

## Output Formats

### JSON Output

The primary output format includes all extracted artifacts in a structured JSON format with timestamps.

### CSV Export

Tables and charts can be exported as individual CSV files for easy analysis in spreadsheet applications.

### Excel Export

Multiple tables can be combined into a single Excel workbook with separate tabs for each table.

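As a rough sketch of driving these exports programmatically: `json_to_csv` and `export_csvs_to_excel_tabs` are helper names used by this project, but the argument shapes below are assumptions; check their definitions in `src/json_to_table.py` before use.

```python
# Hypothetical call shapes for the repo's export helpers; the real
# signatures live in src/json_to_table.py and may differ.
from src.json_to_table import json_to_csv, export_csvs_to_excel_tabs

json_to_csv("document_artifacts.json", "out_tables/")    # one CSV per table (assumed)
export_csvs_to_excel_tabs("out_tables/", "tables.xlsx")  # one tab per CSV (assumed)
```
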
### Database Storage

Extracted data can be stored in SQLite databases with optional vector indexing for semantic search.

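When a vector database path is configured, `--save_to_db` also indexes extracted content for semantic search, which can then be queried through `VectorIndexManager`. This snippet follows an earlier revision of this README; the import path is an assumption (this repo's tree shows `json_to_table.py`), so verify the module name:

```python
# Module path taken from an earlier revision of this README; confirm it
# against the actual file in src/ before use.
from src.json_to_sql import VectorIndexManager

# Semantic search over content indexed during extraction
results = VectorIndexManager.knn_query("What is the revenue growth?", "chroma.db")
```
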
## API Usage

### Programmatic Usage

```python
from src.structured_extraction import ArtifactExtractor
from src.utils import PDFUtils

# Extract pages from PDF
pages = PDFUtils.extract_pages("document.pdf")

# Process specific pages
for page in pages[10:20]:  # Process pages 10-19
    artifacts = ArtifactExtractor.from_image(
        page["image_path"],
        ["text", "tables"]
    )
    # Custom processing of artifacts...
```

### Single Image Processing

```python
from src.structured_extraction import ArtifactExtractor

# Extract from a single image
artifacts = ArtifactExtractor.from_image(
    "path/to/image.png",
    ["text", "tables", "images"]
)
```

## Architecture

### Core Components

1. **RequestBuilder**: Builds inference requests for LLMs with image and text content
2. **ArtifactExtractor**: Extracts structured data from documents using configurable prompts
3. **PDFUtils**: Handles PDF processing and page extraction as images
4. **InferenceUtils**: Manages LLM inference with support for VLLM and OpenAI-compatible APIs
5. **JSONUtils**: Handles JSON extraction and validation from LLM responses
6. **ImageUtils**: Utility functions for image encoding and processing

### Data Flow

1. PDFs are converted to images (one per page) using PyMuPDF (see the sketch below)
2. Images are processed by the LLM to extract structured data based on configured prompts
3. Structured data is saved in various formats (JSON, CSV, SQL, etc.)
4. Optional vector indexing for semantic search capabilities

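A minimal sketch of step 1, assuming PyMuPDF's standard rasterization API; the DPI and file naming are illustrative, not the tool's actual settings:

```python
import fitz  # PyMuPDF

# Render each page of the PDF to its own PNG image
doc = fitz.open("document.pdf")
for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=150)  # illustrative DPI, not the tool's setting
    pix.save(f"page_{i}.png")
doc.close()
```
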
### Supported Artifact Types

- **text**: Main text content, titles, captions, and other textual elements
- **tables**: Structured tabular data with proper formatting
- **images**: Image descriptions, captions, and metadata
- **charts**: Chart data extracted in a structured format, including axes, data points, and metadata (see the example below)

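Any subset of these types can be requested per call, using the same `ArtifactExtractor.from_image` API shown under API Usage; for example, charts only:

```python
from src.structured_extraction import ArtifactExtractor

# Request a single artifact type from a rendered page image
artifacts = ArtifactExtractor.from_image("path/to/page.png", ["charts"])
```
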
## Extending the Tool

### Adding New Artifact Types

To add a new artifact type, define its prompt and output schema under the `artifacts:` section of `config.yaml`. An entry can also enable JSON-constrained decoding for its output:

```yaml
  use_json_decoding: true
```

### Customizing Extraction Logic

The tool supports two backends:

1. **openai-compat**: Any API compatible with the OpenAI API format (including the Llama API); see the sketch below
2. **offline-vllm**: Local inference using VLLM for self-hosted deployments

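For orientation, a minimal sketch of what an "OpenAI-compatible" request looks like, independent of this tool; the endpoint URL, API key, and model name are placeholders for whatever you configure in `config.yaml`:

```python
from openai import OpenAI

# Placeholder endpoint and model; substitute your provider's values.
client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_API_KEY")
response = client.chat.completions.create(
    model="your-llama-model",
    messages=[{"role": "user", "content": "Extract the title from this page."}],
)
print(response.choices[0].message.content)
```
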
## Best Practices

1. **Model Selection**: Use larger models for complex documents or when high accuracy is required
2. **Prompt Engineering**: Adjust prompts in `config.yaml` for your specific document types
3. **Output Schema**: Define precise schemas to guide the model's extraction process

## Troubleshooting

### Common Issues

- **Model capacity errors**: Reduce max tokens or use a larger model
- **Extraction quality issues**: Adjust prompts or output schemas
- **Configuration errors**: Verify model paths and API credentials in `config.yaml`
|