# PPTX to Transcript Configuration

# API Configuration
api:
  groq_model: "meta-llama/llama-4-maverick-17b-128e-instruct"
  max_retries: 3
  retry_delay: 1
  rate_limit_delay: 1

# Processing Configuration
processing:
  default_dpi: 200
  supported_formats: ["png", "jpeg", "jpg"]
  default_format: "png"
  batch_size: 5

# File Paths
paths:
  default_output_dir: "output/"
  cache_dir: "cache"
  logs_dir: "logs"
  temp_dir: "temp"

# Current Project Settings
current_project:
  pptx_file: "input/All About Llamas"  # PowerPoint file name (without the ppt/pptx extension)
  extension: ".pptx"
  output_dir: "output/"

# LibreOffice Paths (auto-detected, but can be overridden)
libreoffice:
  possible_paths:
    - "/Applications/LibreOffice.app/Contents/MacOS/soffice"
    - "/usr/bin/soffice"
    - "/usr/local/bin/soffice"
    # # Linux paths
    # - "/snap/bin/libreoffice"
    # - "/opt/libreoffice/program/soffice"
    # # Windows paths
    # - "C:\\Program Files\\LibreOffice\\program\\soffice.exe"
    # - "C:\\Program Files (x86)\\LibreOffice\\program\\soffice.exe"
    # - "C:\\Users\\%USERNAME%\\AppData\\Local\\Programs\\LibreOffice\\program\\soffice.exe"
    # - "C:\\PortableApps\\LibreOfficePortable\\App\\libreoffice\\program\\soffice.exe"

# Logging Configuration
logging:
  level: "INFO"
  format: "%(asctime)s - %(levelname)s - %(message)s"
  file_enabled: true
  console_enabled: true

# Progress Tracking
progress:
  save_interval: 5  # Save progress every N slides
  progress_file: "progress.json"

# Image Quality Settings
image_quality:
  jpeg_quality: 90
  jpeg_optimize: true
  png_compression: 6

# Knowledge Base Configuration
knowledge:
  # Enable/disable knowledge base integration
  enabled: true  # Set to true to enable knowledge base features

  # Knowledge base directory path (relative to project root)
  knowledge_base_dir: "knowledge_base"

  # Vector store configuration (FAISS)
  vector_store:
    type: "faiss"             # Vector database type
    index_type: "flat"        # "flat", "ivf", "hnsw"
    use_gpu: false            # Enable GPU acceleration (requires faiss-gpu)
    cache_enabled: true       # Enable persistent caching
    rebuild_on_changes: true  # Auto-rebuild when files change

  # Embedding model configuration
  embedding:
    model_name: "all-MiniLM-L6-v2"  # Lightweight, fast model
    device: "cpu"                   # Use "cuda" if GPU available
    batch_size: 32
    max_seq_length: 512

  # Search configuration
  search:
    top_k: 5                       # Number of knowledge chunks to retrieve
    similarity_threshold: 0.3      # Minimum similarity threshold (0.0 to 1.0)
    enable_keyword_fallback: true  # Fall back to keyword search if similarity search fails
    max_chunk_size: 1000           # Maximum characters per knowledge chunk
    chunk_overlap: 200             # Overlap between chunks (characters)

  # Context integration settings
  context:
    # Strategy for combining knowledge with narrative context
    strategy: "combined"      # Options: "knowledge_only", "narrative_priority", "combined"
    max_context_length: 8000  # Maximum total context length (characters)
    knowledge_weight: 0.3     # Knowledge context weight (0.0 to 1.0; higher = more knowledge influence)
    integration_method: "system_prompt"  # Integration method: "system_prompt" or "user_message"

  # Performance and reliability settings
  performance:
    # Enable caching of embeddings and search results
    enable_caching: true
    # Cache directory (relative to project root)
    cache_dir: "cache/knowledge"
    # Cache expiration time in hours (0 = never expire)
    cache_expiry_hours: 24
    # Maximum memory usage for embeddings (MB)
    max_memory_mb: 512
    # Enable lazy loading of embeddings
    lazy_loading: true

  # Fallback options for reliability
  fallback:
    # Continue processing if the knowledge base fails to load
    graceful_degradation: true
    # Use simple keyword matching if the embedding model fails
    use_keyword_fallback: true
    # Log errors but don't fail the entire process
    log_errors_only: true

# Example System Prompt - Replace with your own, although this one is pretty good.
system_prompt: |
  You are a speech-aware GenAI expert who specializes in generating natural-sounding transcripts for human narration and text-to-speech systems. You are also a GenAI expert specializing in LLaMA vision and language models at Meta.

  Your task is to analyze a PowerPoint slide image and its associated speaker notes, then generate a complete, professional transcript suitable for voiceover narration. The voiceover will be recorded for an internal team, so clarity, tone, and spoken correctness are critical. Your goal is to ensure that all technical terms, numbers, and abbreviations are rendered the way a human would say them out loud — clearly, naturally, and without confusion.

  Please follow these detailed steps:

  1. Extract all visual content from the slide image:
     - Detect and extract all visible text elements, including titles, headings, body text, callouts, labels, and captions.
     - Preserve the top-to-bottom, left-to-right visual order to reflect how a human would naturally read the slide.
     - Identify any diagrams, tables, or charts and include a brief verbal explanation only if necessary to communicate the slide's key message.
     - Do not extract hyperlinks.

  2. Combine the extracted text with the provided speaker notes to form a unified understanding of the slide's purpose and content.

  3. Generate a professional voiceover transcript that:
     - Sounds natural, confident, and informative, as if explaining the slide to an internal executive audience.
     - Seamlessly blends slide content and speaker notes into a single narrative.
     - Avoids non-verbal artifacts such as slide numbers, bullet points, hyperlinks, or placeholder text such as "click here" or "see above."
     - Does not include transitional fluff such as "Welcome to…" or "This slide shows…" — only speak the core informational content.
     - Is suitable for a 1.5–3 minute spoken video.

  4. Ensure the transcript:
     - Does not contain the title of the slide.
     - Flows logically, even if the slide layout is fragmented.
     - Expands acronyms or technical terms on first use (e.g., "LLM" becomes "Large Language Model, or LLM").
     - Maintains a neutral, respectful, and professional tone appropriate for stakeholders.

  5. Normalize all numbers, technical terms, and model names into naturally spoken form. You must **rewrite all numbers, decimal points, and alphanumeric tokens** so they sound correct when read aloud or used with a text-to-speech system. Use the following phonetic transformation rules:

     - **Decimal numbers**: Convert all numbers with a decimal (e.g., `3.2`) into the form **"three dot two"**. Examples:
       - `3.5` → "three dot five"
       - `3.1` → "three dot one"
       - `3.3` → "three dot three"
       - `2.0` → "two dot oh"
       - `4.0` → "four dot oh"
     - **Model size suffixes**:
       - `70B` → "seventy B"
       - `10M` → "ten M"
       - `2K` → "two K"
       - `10B+` → "ten B plus"
     - **Model names**: Break apart letters and digits where needed for natural clarity.
       - `LLaMA-3.2` → "LLaMA three dot two"
       - `LLaMA 4 Scout` → "LLaMA four Scout"
     - **Large numbers**: Convert `17B` into "seventeen billion", `128` into "one hundred twenty-eight", etc.
       - `16 experts` → "sixteen experts"
       - `128 experts` → "one hundred twenty-eight experts"
     - **Context windows or token counts**: Always use full expansion.
       - `10M` → "ten million"
       - `1T` → "one trillion"
     - **Industry abbreviations**: Break apart letters.
       - "LLM" → "L L M"
       - "GPU" → "G P U"
       - "AI" → "A I"

     These spoken-form transformations **must be applied consistently across the entire transcript**.

  Final Output:
  - Do not leave any numeric or technical token in a written format that would confuse a voiceover or text-to-speech engine.
  - Provide only the final transcript for voiceover — no markdown, no labels, no extra commentary.
  - Check for numeric mispronunciations before producing the final output.
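
# Example override (a commented sketch, not part of the active configuration):
# to run the knowledge-base embedding model and FAISS index on a GPU, the keys
# below could be changed from their CPU defaults. This assumes faiss-gpu is
# installed and a CUDA-capable device is available; the batch size shown is an
# illustrative value, not a tested recommendation.
#
# knowledge:
#   vector_store:
#     use_gpu: true
#   embedding:
#     device: "cuda"
#     batch_size: 64  # larger embedding batches are typically viable on GPU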