
Add support for extracting charts (#986)

varunfb 1 week ago
parent
commit
784e63f183

+ 107 - 93
end-to-end-use-cases/structured_parser/README.md

@@ -4,16 +4,35 @@ A powerful, LLM-based tool for extracting structured data from rich documents (P
 
 ## Overview
 
-This tool uses Llama models to extract text, tables, and images from PDFs, converting unstructured document data into structured, machine-readable formats. It supports:
+This tool uses Llama models to extract text, tables, images, and charts from PDFs, converting unstructured document data into structured, machine-readable formats. It supports:
 
 - **Text extraction**: Extract and structure main text, titles, captions, etc.
 - **Table extraction**: Convert complex tables into structured data formats
 - **Image extraction**: Extract images with contextual descriptions and captions
+- **Chart extraction**: Convert charts and graphs into structured JSON data
 - **Multiple output formats**: JSON, CSV, Excel, and SQL database storage
 - **Vector search capabilities**: Semantic search across extracted content
 
 The tool is designed to handle complex documents with high accuracy and provides flexible configuration options to tailor extraction tasks to specific needs.
 
+
+## Project Structure
+
+```
+structured_parser/
+├── src/
+│   ├── structured_extraction.py  # Main entry point and extraction logic
+│   ├── utils.py                  # Utility functions and classes
+│   ├── typedicts.py             # Type definitions
+│   ├── json_to_table.py         # Database integration functions
+│   └── config.yaml              # Configuration file
+├── pdfs/                        # Sample PDFs and extraction results
+├── README.md                    # This file
+├── CONTRIBUTING.md              # Contribution guidelines
+└── requirements.txt             # Python dependencies
+```
+
+
 ## Installation
 
 ### Prerequisites
@@ -24,25 +43,17 @@ The tool is designed to handle complex documents with high accuracy and provides
 ### Setup
 
 1. Clone the repository
-2. Install dependencies:
 
 ```bash
 git clone https://github.com/meta-llama/llama-cookbook.git
-```
-```bash
 cd llama-cookbook
 ```
-```bash
-pip install -r requirements.txt
-```
+
 2. Install project specific dependencies:
 ```bash
 cd end-to-end-use-cases/structured_parser
-```
-```bash
 pip install -r requirements.txt
 ```
-## Quick Start
 
 ### Configure the tool (see [Configuration](#Configuration) section)
 (Note: set up the API key, the inference model, etc.)
@@ -50,24 +61,30 @@ pip install -r requirements.txt
 ### Extract text from a PDF:
 
 ```bash
-python src/structured_extraction.py path/to/document.pdf --text
+python src/structured_extraction.py path/to/document.pdf text
 ```
 
-### Extract text and tables, and save tables as CSV files:
+### Extract charts and tables, and save them as CSV files:
 
 ```bash
-python src/structured_extraction.py path/to/document.pdf --text --tables --save_tables_as_csv
+python src/structured_extraction.py path/to/document.pdf charts,tables --save_tables_as_csv
 ```
 
 ### Process a directory of PDFs and export tables to Excel:
 
 ```bash
-python src/structured_extraction.py path/to/pdf_directory --text --tables --export_excel
+python src/structured_extraction.py path/to/pdf_directory text,tables --export_excel
+```
+
+### Extract all artifact types and save to database and as Excel sheets:
+
+```bash
+python src/structured_extraction.py path/to/document.pdf text,tables,images,charts --save_to_db --export_excel
 ```
 
 ## Configuration
 
-The tool is configured via `config.yaml`. Key configuration options include:
+The tool is configured via `src/config.yaml`. Key configuration options include:
 
 ### Model Configuration
 
@@ -93,30 +110,93 @@ model:
 extraction_inference:
   temperature: 0.2
   top_p: 0.9
-  max_completion_tokens: 17000
+  max_completion_tokens: 32000
   seed: 42
 ```
 
+### Database Configuration
+
+```yaml
+database:
+  sql_db_path: "sqlite3.db"
+  vector_db_path: "chroma.db"
+```
+
 ### Artifact Configuration
 
-The tool includes configurable prompts and output schemas for each artifact type (text, tables, images). These can be modified in the `config.yaml` file.
+The tool includes configurable prompts and output schemas for each artifact type (text, tables, images, charts). These can be modified in the `config.yaml` file to customize extraction behavior for specific document types.
+
+## Output Formats
+
+### JSON Output
+The primary output format includes all extracted artifacts in a structured JSON format with timestamps.
+
+### CSV Export
+Tables and charts can be exported as individual CSV files for easy analysis in spreadsheet applications.
+
+### Excel Export
+Multiple tables can be combined into a single Excel workbook with separate tabs for each table.
+
+### Database Storage
+Extracted data can be stored in SQLite databases with optional vector indexing for semantic search.
+
+## API Usage
+
+### Programmatic Usage
+
+```python
+from src.structured_extraction import ArtifactExtractor
+from src.utils import PDFUtils
+
+# Extract pages from PDF
+pages = PDFUtils.extract_pages("document.pdf")
+
+# Process specific pages
+for page in pages[10:20]:  # Process pages 10-19
+    artifacts = ArtifactExtractor.from_image(
+        page["image_path"],
+        ["text", "tables"]
+    )
+    # Custom processing of artifacts...
+```
+
+### Single Image Processing
+
+```python
+from src.structured_extraction import ArtifactExtractor
+
+# Extract from a single image
+artifacts = ArtifactExtractor.from_image(
+    "path/to/image.png",
+    ["text", "tables", "images"]
+)
+```
 
 ## Architecture
 
 ### Core Components
 
-1. **RequestBuilder**: Builds inference requests for LLMs
-2. **ArtifactExtractor**: Extracts structured data from documents
-3. **DatabaseManager**: Manages SQL database operations
-4. **VectorIndexManager**: Handles vector indexing and search
+1. **RequestBuilder**: Builds inference requests for LLMs with image and text content
+2. **ArtifactExtractor**: Extracts structured data from documents using configurable prompts
+3. **PDFUtils**: Handles PDF processing and page extraction as images
+4. **InferenceUtils**: Manages LLM inference with support for VLLM and OpenAI-compatible APIs
+5. **JSONUtils**: Handles JSON extraction and validation from LLM responses
+6. **ImageUtils**: Utility functions for image encoding and processing
 
 ### Data Flow
 
-1. PDFs are converted to images (one per page)
-2. Images are processed by the LLM to extract structured data
+1. PDFs are converted to images (one per page) using PyMuPDF
+2. Images are processed by the LLM to extract structured data based on configured prompts
 3. Structured data is saved in various formats (JSON, CSV, SQL, etc.)
 4. Optional vector indexing for semantic search capabilities
 
+### Supported Artifact Types
+
+- **text**: Main text content, titles, captions, and other textual elements
+- **tables**: Structured tabular data with proper formatting
+- **images**: Image descriptions, captions, and metadata
+- **charts**: Chart data extraction with structured format including axes, data points, and metadata
+
 ## Extending the Tool
 
 ### Adding New Artifact Types
@@ -135,7 +215,7 @@ artifacts:
     use_json_decoding: true
 ```
 
-2. Update the command-line interface in `structured_extraction.py` to include your new artifact type.
+
 
 ### Customizing Extraction Logic
 
@@ -152,83 +232,17 @@ The tool supports two backends:
 1. **openai-compat**: Any API compatible with the OpenAI API format (including Llama API)
 2. **offline-vllm**: Local inference using VLLM for self-hosted deployments
 
-## Database Integration
-
-### SQL Database
-
-The tool can store extracted data in an SQLite database:
-
-```bash
-python src/structured_extraction.py path/to/document.pdf --text --tables --save_to_db
-```
-
-### Vector Search
-
-When `save_to_db` is enabled and a vector database path is configured, the tool also indexes extracted content for semantic search:
-
-```python
-from src.json_to_sql import VectorIndexManager
-
-# Search for relevant content
-results = VectorIndexManager.knn_query("What is the revenue growth?", "chroma.db")
-```
-
 ## Best Practices
 
 1. **Model Selection**: Use larger models for complex documents or when high accuracy is required
 2. **Prompt Engineering**: Adjust prompts in `config.yaml` for your specific document types
 3. **Output Schema**: Define precise schemas to guide the model's extraction process
-4. **Batch Processing**: Use directory processing for efficiently handling multiple documents
-5. **Performance Tuning**: Adjust inference parameters based on your accuracy vs. speed requirements
-
-## Limitations
-
-- PDF rendering quality affects extraction accuracy
-- Complex multi-column layouts may require specialized prompts
-- Very large tables might be truncated due to token limitations
-
-## Advanced Use Cases
-
-### Custom Processing Pipelines
 
-The tool's components can be used programmatically for custom pipelines:
-
-```python
-from src.structured_extraction import ArtifactExtractor
-from src.utils import PDFUtils
-
-# Extract pages from PDF
-pages = PDFUtils.extract_pages("document.pdf")
-
-# Process specific pages
-for page in pages[10:20]:  # Process pages 10-19
-    artifacts = ArtifactExtractor.from_image(page["image_path"], ["text", "tables"])
-    # Custom processing of artifacts...
-```
-
-### Export to Other Systems
-
-Extracted data can be exported to various systems:
-
-- **SQL databases**: Using `flatten_json_to_sql`
-- **CSV files**: Using `json_to_csv`
-- **Excel workbooks**: Using `export_csvs_to_excel_tabs`
 
 ## Troubleshooting
 
+### Common Issues
+
 - **Model capacity errors**: Reduce max tokens or use a larger model
 - **Extraction quality issues**: Adjust prompts or output schemas
-- **Performance issues**: Use batch processing or adjust tensor parallelism
-
-## Contributing
-
-Contributions to improve the tool are welcome! Areas for improvement include:
-
-- Additional output formats
-- Improved table extraction for complex layouts
-- Support for more document types beyond PDFs
-- Optimization for specific document domains
-
-## License
-
-[License information here]
+- **Configuration errors**: Verify model paths and API credentials in config.yaml

File diff is too large
+ 40 - 8
end-to-end-use-cases/structured_parser/src/config.yaml


+ 3 - 5
end-to-end-use-cases/structured_parser/src/json_to_sql.py

@@ -100,8 +100,7 @@ class DatabaseManager:
                 cursor.execute("DROP TABLE IF EXISTS document_artifacts")
 
                 # Create table with schema
-                cursor.execute(
-                    """
+                cursor.execute("""
                 CREATE TABLE IF NOT EXISTS document_artifacts (
                     id INTEGER PRIMARY KEY AUTOINCREMENT,
                     doc_path TEXT,
@@ -125,8 +124,7 @@ class DatabaseManager:
                     image_caption TEXT,
                     image_type TEXT
                 )
-                """
-                )
+                """)
 
                 # Create indexes for common queries
                 cursor.execute(
@@ -520,7 +518,7 @@ I require 2 things from you:
 2. A succinct filename for this table based on the data contents.
 
 You should only respond with a JSON, no preamble required. Your JSON response should follow this format:
-{"csv_table": <str of table>, "filename": <filename to save table>}"""
+{"csv_table": <str of table>, "filename": <filename to save table>}. Your CSV string should be for a single table that can be loaded into Pandas."""
 
     user_prompt = f"data:\n{json.dumps(data)}"
     if info:

+ 38 - 47
end-to-end-use-cases/structured_parser/src/structured_extraction.py

@@ -15,7 +15,7 @@ from typing import Any, Dict, List, Optional, Tuple, Union
 
 import fire
 
-from json_to_sql import flatten_json_to_sql, json_to_csv
+from json_to_table import flatten_json_to_sql, json_to_csv
 from tqdm import tqdm
 from typedicts import ArtifactCollection, ExtractedPage, InferenceRequest
 
@@ -35,7 +35,17 @@ SUPPORTED_BACKENDS = ["offline-vllm", "openai-compat"]
 SUPPORTED_FILE_TYPES = [".pdf"]
 
 
-def setup_logger(logfile, verbose=False):
+def setup_logger(logfile: str, verbose: bool = False) -> logging.Logger:
+    """
+    Set up a logger for the application with file and optional console output.
+
+    Args:
+        logfile: Path to the log file
+        verbose: If True, also log to console
+
+    Returns:
+        Configured logger instance
+    """
     # Create a logger
     logger = logging.getLogger(__name__)
     logger.setLevel(logging.DEBUG)
@@ -306,33 +316,6 @@ class ArtifactExtractor:
         return pdf_pages
 
 
-def get_artifact_types(text: bool, tables: bool, images: bool) -> List[str]:
-    """
-    Determine which artifact types to extract based on flags.
-
-    Args:
-        text: Whether to extract text
-        tables: Whether to extract tables
-        images: Whether to extract images
-
-    Returns:
-        List of artifact types to extract
-
-    Raises:
-        ValueError: If no artifact types are specified
-    """
-    to_extract = []
-    if text:
-        to_extract.append("text")
-    if tables:
-        to_extract.append("tables")
-    if images:
-        to_extract.append("images")
-    if not to_extract:
-        raise ValueError("No artifact types specified for extraction.")
-    return to_extract
-
-
 def get_target_files(target_path: str) -> List[Path]:
     """
     Get list of files to process.
@@ -423,8 +406,10 @@ def save_results(
         logger.error(f"Failed to write output file: {e}")
 
     if save_tables_as_csv or export_excel:
-        tables = sum([x["artifacts"]["tables"] for x in data], [])
-        for tab in tables:
+        tables_charts = sum([x["artifacts"]["tables"] for x in data], []) + sum(
+            [x["artifacts"]["charts"] for x in data], []
+        )
+        for tab in tables_charts:
             # llm: convert each table to a csv string
             csv_string, filename = json_to_csv(tab)
             outfile = output_dir / f"tables_{timestamp}" / filename
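
Review note on the hunk above: `x["artifacts"]["charts"]` raises `KeyError` for any page whose extraction did not include charts (or tables), which seems possible because `ArtifactCollection` is declared with `total=False`. A minimal sketch of a more tolerant collector follows; the helper name `collect_tabular_artifacts` is hypothetical and not part of this commit. It keeps the same ordering as the original expression (all tables first, then all charts).

```python
from itertools import chain
from typing import Any, Dict, List, Sequence


def collect_tabular_artifacts(
    data: List[Dict[str, Any]],
    keys: Sequence[str] = ("tables", "charts"),
) -> List[Dict[str, Any]]:
    """Gather table-like artifacts from every page, skipping missing keys.

    Hypothetical helper: mirrors the ordering of the committed code
    (every table across all pages, followed by every chart).
    """
    return list(
        chain.from_iterable(
            # `or []` also covers a key that is present but set to None
            page.get("artifacts", {}).get(key) or []
            for key in keys
            for page in data
        )
    )
```

Using `chain.from_iterable` also avoids the repeated list copying that `sum(list_of_lists, [])` performs, which may matter for large documents.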
@@ -456,31 +441,37 @@ def save_results(
 
 def main(
     target_path: str,
-    text: bool = True,
-    tables: bool = False,
-    images: bool = False,
+    artifacts: str,
     save_to_db: bool = False,
-    save_tables_as_csv: bool = False,
+    save_tables_as_csv: bool = True,
     export_excel: bool = False,
 ) -> None:
     """
-    Extract artifacts from PDF files and optionally save to SQL and vector databases.
+    Extract structured data from PDF documents using LLM-powered extraction.
+
+    Processes PDFs to extract text, tables, images, and charts as structured JSON.
+    Outputs are saved to timestamped files and optionally to databases.
 
     Args:
-        target_path: Path to a PDF file or directory containing PDF files
-        text: Whether to extract text
-        tables: Whether to extract tables
-        images: Whether to extract images
-        save_to_sql: Whether to save extracted artifacts to SQL database
-        save_to_vector: Whether to index extracted artifacts in vector database
-        log_file: Optional path to a log file to write logs to
+        target_path: PDF file or directory path to process
+        artifacts: Comma-separated artifact types (e.g. "text,tables,images,charts")
+        save_to_db: Save to SQL/vector databases if True
+        save_tables_as_csv: Export tables as individual CSV files if True
+        export_excel: Combine all tables into single Excel workbook if True
+
+    Output:
+        - JSON file with all extracted artifacts
+        - CSV files for each table (if save_tables_as_csv=True)
+        - Excel workbook with all tables (if export_excel=True)
+        - Database records (if save_to_db=True)
 
     Raises:
-        ValueError: If no artifact types are specified or the file type is unsupported
-        FileNotFoundError: If the target path doesn't exist
+        ValueError: Invalid artifact types or unsupported file format
+        FileNotFoundError: Target path does not exist
     """
-    # Get artifact types to extract
-    artifact_types = get_artifact_types(text, tables, images)
+    ALLOWED_ARTIFACTS = list(config["artifacts"].keys())
+    artifact_types = [x for x in artifacts if x in ALLOWED_ARTIFACTS]
+    print("Extracting artifacts: ", artifact_types, "\n")
 
     # Get files to process
     targets = get_target_files(target_path)
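
Review note on the hunk above: `artifacts` is annotated as `str`, yet the comprehension iterates over it directly. That works when the CLI layer passes a tuple or list (python-fire appears to do this for comma-separated values such as `text,tables`), but a single value such as `text` would be iterated character by character and produce an empty list. Unknown names are also dropped silently, although the docstring promises a `ValueError`. A minimal sketch of a more defensive parser, assuming a hypothetical helper `parse_artifact_types` that is not part of this commit:

```python
from typing import Iterable, List, Sequence, Union


def parse_artifact_types(
    artifacts: Union[str, Iterable[str]],
    allowed: Sequence[str],
) -> List[str]:
    """Normalize a comma-separated string or a sequence into a validated list.

    Hypothetical helper; `allowed` would be list(config["artifacts"].keys()).
    """
    if isinstance(artifacts, str):
        requested = artifacts.split(",")
    else:
        requested = [str(item) for item in artifacts]

    # Drop surrounding whitespace and empty entries such as a trailing comma
    cleaned = [name.strip() for name in requested if name.strip()]

    unknown = [name for name in cleaned if name not in allowed]
    if unknown:
        raise ValueError(
            f"Unsupported artifact types: {unknown}; allowed: {list(allowed)}"
        )
    if not cleaned:
        raise ValueError("No artifact types specified for extraction.")
    return cleaned
```

With such a helper, the two lines in `main` could become a single call, and the `ValueError` described in the docstring would actually be raised for invalid input.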

+ 109 - 34
end-to-end-use-cases/structured_parser/src/typedicts.py

@@ -1,68 +1,143 @@
+"""
+Type definitions for structured document extraction.
+
+This module provides TypedDict classes for type safety and better IDE support
+when working with document extraction data structures.
+"""
+
 from typing import Any, Dict, List, Optional, TypedDict, Union
 
 from vllm import SamplingParams
 
 
 class MessageContent(TypedDict):
-    """Type definition for message content in LLM requests."""
+    """
+    Type definition for message content in LLM requests.
+
+    Supports both text and image content types for multimodal LLM interactions.
+    """
 
-    type: str
-    text: Optional[str] = None
-    image_url: Optional[Dict[str, str]] = None
+    type: str  # Content type: "text" or "image_url"
+    text: Optional[str]  # Text content for text type
+    image_url: Optional[Dict[str, str]]  # Image URL data for image_url type
 
 
 class Message(TypedDict):
-    """Type definition for a message in a LLM inference request."""
+    """
+    Type definition for a message in an LLM inference request.
 
-    role: str
-    content: Union[str, List[MessageContent]]
+    Represents a single message in a conversation with role and content.
+    """
+
+    role: str  # Message role: "system", "user", or "assistant"
+    content: Union[
+        str, List[MessageContent]
+    ]  # Message content as string or multimodal list
 
 
 class InferenceRequest(TypedDict, total=False):
-    """Type definition for LLM inference request."""
+    """
+    Type definition for LLM inference request parameters.
+
+    Contains all parameters needed for LLM inference including model settings,
+    messages, and generation parameters.
+    """
 
-    model: str
-    messages: List[Message]
-    temperature: float
-    top_p: float
-    max_completion_tokens: int
-    seed: int
-    response_format: Optional[Dict[str, Any]]
+    model: str  # Model identifier
+    messages: List[Message]  # Conversation messages
+    temperature: float  # Sampling temperature (0.0-1.0)
+    top_p: float  # Nucleus sampling parameter
+    max_completion_tokens: int  # Maximum tokens to generate
+    seed: int  # Random seed for reproducibility
+    response_format: Optional[Dict[str, Any]]  # Optional structured output format
 
 
 class VLLMInferenceRequest(TypedDict):
-    """Type definition for VLLM inference request format."""
+    """
+    Type definition for VLLM inference request format.
 
-    messages: List[List[Message]]
-    sampling_params: Union[SamplingParams, List[SamplingParams]]
+    Batch format specifically for VLLM engine processing multiple requests
+    with corresponding sampling parameters.
+    """
+
+    messages: List[List[Message]]  # Batch of message sequences
+    sampling_params: Union[
+        SamplingParams, List[SamplingParams]
+    ]  # VLLM sampling parameters
 
 
 class TextArtifact(TypedDict):
-    content: str
-    notes: Optional[str] = None
+    """
+    Type definition for extracted text artifacts.
+
+    Represents text content extracted from documents with optional metadata.
+    """
+
+    content: str  # Main text content
+    notes: Optional[str]  # Additional notes or observations about the text
 
 
 class ImageArtifact(TypedDict, total=False):
-    description: str
-    caption: str
-    image_type: str
-    position_top: Optional[str] = None
-    position_left: Optional[str] = None
+    """
+    Type definition for extracted image artifacts.
+
+    Represents images, charts, and visual elements extracted from documents
+    with positional and descriptive metadata.
+    """
+
+    description: str  # Detailed description of the image
+    caption: str  # Caption or label associated with the image
+    image_type: str  # Type of image (e.g., 'photograph', 'chart', 'diagram')
+    position_top: Optional[str]  # Approximate vertical position
+    position_left: Optional[str]  # Approximate horizontal position
+
+
+class ChartArtifact(TypedDict, total=False):
+    """
+    Type definition for extracted chart artifacts.
+
+    Represents charts and graphs with structured data and metadata.
+    """
+
+    chart_type: str  # Type of chart (e.g., 'bar', 'line', 'pie')
+    description: str  # Detailed description of the chart
+    caption: str  # Caption or title of the chart
+    data: Dict[str, Any]  # Structured chart data
 
 
 class TableArtifact(TypedDict, total=False):
-    table_contents: dict
-    table_info: str
+    """
+    Type definition for extracted table artifacts.
+
+    Represents tabular data with structured contents and descriptive information.
+    """
+
+    table_contents: Dict[str, Any]  # Structured table data
+    table_info: str  # Descriptive information about the table
 
 
 class ArtifactCollection(TypedDict, total=False):
-    text: TextArtifact
-    images: List[ImageArtifact]
-    tables: List[TableArtifact]
+    """
+    Type definition for a collection of extracted artifacts from a document page.
+
+    Groups all types of artifacts extracted from a single document page.
+    """
+
+    text: List[TextArtifact]  # Text artifacts from the page
+    images: List[ImageArtifact]  # Image artifacts from the page
+    tables: List[TableArtifact]  # Table artifacts from the page
+    charts: List[ChartArtifact]  # Chart artifacts from the page
 
 
 class ExtractedPage(TypedDict):
-    doc_path: str
-    image_path: str
-    page_num: int
-    artifacts: ArtifactCollection
+    """
+    Type definition for a complete extracted document page.
+
+    Represents a single page from a document with its metadata and all
+    extracted artifacts.
+    """
+
+    doc_path: str  # Path to the source document
+    image_path: str  # Path to the page image file
+    page_num: int  # Page number (0-indexed)
+    artifacts: ArtifactCollection  # All artifacts extracted from this page
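
Review note: to illustrate how the new `charts` field fits into the page structure, here is a small self-contained sketch. It redeclares trimmed copies of the TypedDicts from the diff (so it runs without importing `vllm`), and both the sample values and the helper `chart_captions` are made up for illustration.

```python
from typing import Any, Dict, List, TypedDict


class ChartArtifact(TypedDict, total=False):
    chart_type: str
    description: str
    caption: str
    data: Dict[str, Any]


class ArtifactCollection(TypedDict, total=False):
    # Trimmed copy: only the field this sketch needs
    charts: List[ChartArtifact]


class ExtractedPage(TypedDict):
    doc_path: str
    image_path: str
    page_num: int
    artifacts: ArtifactCollection


def chart_captions(pages: List[ExtractedPage]) -> List[str]:
    """Return the caption of every chart, tolerating pages without charts."""
    return [
        chart.get("caption", "")
        for page in pages
        for chart in page["artifacts"].get("charts", [])
    ]


# Illustrative data only; not taken from a real extraction run
example_pages: List[ExtractedPage] = [
    {
        "doc_path": "report.pdf",
        "image_path": "report_page_0.png",
        "page_num": 0,
        "artifacts": {
            "charts": [
                {
                    "chart_type": "bar",
                    "caption": "Quarterly revenue",
                    "data": {"Q1": 10, "Q2": 12},
                }
            ]
        },
    },
    {
        "doc_path": "report.pdf",
        "image_path": "report_page_1.png",
        "page_num": 1,
        "artifacts": {},  # valid because ArtifactCollection uses total=False
    },
]
```

Because `ArtifactCollection` and `ChartArtifact` use `total=False`, consumers should read keys with `.get(...)` rather than indexing, as the helper does.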