# Getting Started with Structured Document Parser

This guide walks you through setting up and using the Structured Document Parser tool to extract text, tables, and images from PDF documents.

## Setup

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Configure the Tool

Edit `src/config.yaml` to choose an inference backend and model:

```yaml
# Choose your inference backend
model:
  backend: openai-compat  # Use "offline-vllm" for local inference

  # If using openai-compat
  base_url: "https://api.llama.com/compat/v1"
  api_key: "YOUR_API_KEY"
  model_id: "Llama-4-Maverick-17B-128E-Instruct-FP8"  # Or your preferred model
```

## Basic Usage Examples

### Extract Text from a PDF

```bash
python src/structured_extraction.py path/to/document.pdf --text
```

This will:

1. Convert each PDF page to an image
2. Run LLM inference to extract text
3. Save the extracted text as JSON in the `extracted` directory

### Extract Text and Tables

```bash
python src/structured_extraction.py path/to/document.pdf --text --tables
```

### Extract All Types of Content

```bash
python src/structured_extraction.py path/to/document.pdf --text --tables --images
```

### Process Multiple PDFs

Point the script at a directory to process every PDF inside it:

```bash
python src/structured_extraction.py path/to/pdf_directory --text --tables
```

## Working with Extraction Results

### Export Tables to CSV

```bash
python src/structured_extraction.py path/to/document.pdf --tables --save_tables_as_csv
```

Tables will be saved as individual CSV files in `extracted/tables_TIMESTAMP/`.

### Export Tables to Excel

```bash
python src/structured_extraction.py path/to/document.pdf --tables --export_excel
```

Tables will be combined into a single Excel file with multiple sheets.

### Save to Database

```bash
python src/structured_extraction.py path/to/document.pdf --text --tables --save_to_db
```

Extracted content will be stored in an SQLite database for structured querying.

## Python API Examples

### Extract Content Programmatically

```python
from src.structured_extraction import ArtifactExtractor
from src.utils import PDFUtils

# Extract pages from a PDF
pdf_pages = PDFUtils.extract_pages("document.pdf")

# Process each page
for page in pdf_pages:
    # Extract text
    text_artifacts = ArtifactExtractor.from_image(
        page["image_path"], ["text"]
    )

    # Or extract multiple artifact types at once
    all_artifacts = ArtifactExtractor.from_image(
        page["image_path"], ["text", "tables", "images"]
    )

    # Process the extracted artifacts
    print(all_artifacts)
```

### Query the Database

```python
from src.json_to_sql import DatabaseManager

# Query all text artifacts
text_df = DatabaseManager.sql_query(
    "sqlite3.db",
    "SELECT * FROM document_artifacts WHERE artifact_type = 'text'"
)

# Query tables containing specific content
revenue_tables = DatabaseManager.sql_query(
    "sqlite3.db",
    "SELECT * FROM document_artifacts WHERE artifact_type = 'table' AND table_info LIKE '%revenue%'"
)
```

### Semantic Search

```python
from src.json_to_sql import VectorIndexManager

# Search for relevant content
results = VectorIndexManager.knn_query(
    "What is the revenue growth for Q2?",
    "chroma.db",
    n_results=5
)

# Display results
for i, (doc_id, distance, content) in enumerate(zip(
    results['ids'], results['distances'], results['documents']
)):
    print(f"Result {i+1} (similarity: {1-distance:.2f}):")
    print(content[:200] + "...\n")
```

## Customizing Extraction

### Modify Prompts

Edit the prompts in `src/config.yaml` to improve extraction for your specific document types:

```yaml
artifacts:
  text:
    prompts:
      system: "You are an OCR expert. Your task is to extract all text sections..."
      user: "TARGET SCHEMA:\n```json\n{schema}\n```"
```

### Add a Custom Artifact Type

1. Add configuration to `src/config.yaml`:

```yaml
artifacts:
  my_custom_type:
    prompts:
      system: "Your custom system prompt..."
      user: "Your custom user prompt with {schema} placeholder..."
    output_schema: {
      # Your schema definition here
    }
    use_json_decoding: true
```

2. Update the CLI in `src/structured_extraction.py`:

```python
def main(
    target_path: str,
    text: bool = True,
    tables: bool = False,
    images: bool = False,
    my_custom_type: bool = False,  # Add your type here
    save_to_db: bool = False,
    ...
):
    # Update the artifact types logic
    to_extract = []
    if text:
        to_extract.append("text")
    if tables:
        to_extract.append("tables")
    if images:
        to_extract.append("images")
    if my_custom_type:
        to_extract.append("my_custom_type")  # Add your type here
```

## Troubleshooting

### LLM Response Format Issues

If LLM responses aren't being parsed correctly:

1. Check your output schema in `config.yaml`
2. Set `use_json_decoding` to `true` for more reliable parsing
3. Try a larger model, or reduce the complexity of the extraction

### Database Issues

If you encounter database errors:

1. Ensure SQLite is properly installed
2. Check database file permissions
3. Use `DatabaseManager.create_artifact_table()` to reinitialize the table schema

### PDF Rendering Issues

If PDF extraction quality is poor:

1. Try raising the DPI setting in `PDFUtils.extract_pages()` (see the sketch after this list)
2. For complex layouts, split extraction into smaller chunks (e.g., per section)
3. Consider pre-processing PDFs with OCR tools to improve the text layer quality
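For item 1, a minimal sketch is shown below. Note that the `dpi` keyword is an assumption; check the actual signature of `PDFUtils.extract_pages()` in `src/utils.py`, as the parameter may be named differently or set in `config.yaml` instead:

```python
from src.structured_extraction import ArtifactExtractor
from src.utils import PDFUtils

# Assumption: extract_pages accepts a `dpi` keyword; verify in src/utils.py.
# Higher DPI yields sharper page images (and usually better OCR), at the
# cost of larger inputs and slower LLM inference.
pdf_pages = PDFUtils.extract_pages("document.pdf", dpi=300)

for page in pdf_pages:
    artifacts = ArtifactExtractor.from_image(page["image_path"], ["text"])
```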
## Next Steps

- Try extracting from different types of documents
- Adjust prompts and schemas for your specific use cases
- Explore the vector search capabilities for semantic document queries
- Integrate with your existing document processing workflows (see the sketch below)
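As a starting point for integration, the pieces above compose into a single script. Here is a minimal end-to-end sketch, assuming `ArtifactExtractor.from_image` returns JSON-serializable artifacts (consistent with the CLI saving its output as JSON):

```python
import json

from src.structured_extraction import ArtifactExtractor
from src.utils import PDFUtils

# Rasterize the PDF, then extract text and tables from every page
pdf_pages = PDFUtils.extract_pages("report.pdf")
artifacts = [
    ArtifactExtractor.from_image(page["image_path"], ["text", "tables"])
    for page in pdf_pages
]

# Hand the results to the rest of your pipeline (here, a plain JSON dump)
with open("report_artifacts.json", "w") as f:
    json.dump(artifacts, f, indent=2)
```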