config.yaml 5.7 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135
  1. # PPTX to Transcript Configuration
  2. # API Configuration
  3. api:
  4. llama_model: "Llama-4-Maverick-17B-128E-Instruct-FP8" # This notebook uses Llama API to access the model
  5. max_retries: 3
  6. retry_delay: 1
  7. rate_limit_delay: 1
  8. # Processing Configuration
  9. processing:
  10. default_dpi: 200
  11. supported_formats: ["png", "jpeg", "jpg"]
  12. default_format: "png"
  13. batch_size: 5
  14. # File Paths
  15. paths:
  16. default_output_dir: "slide_images"
  17. cache_dir: "cache"
  18. logs_dir: "logs"
  19. temp_dir: "temp"
  20. # Current Project Settings
  21. current_project:
  22. pptx_file: "input/All About Llamas" # Powerpoint file name (without the ppt/pptx extension)
  23. extension: ".pptx"
  24. output_dir: "output/"
  25. # LibreOffice Paths (auto-detected, but can be overridden)
  26. libreoffice:
  27. possible_paths:
  28. - "/Applications/LibreOffice.app/Contents/MacOS/soffice"
  29. - "/usr/bin/soffice"
  30. - "/usr/local/bin/soffice"
  31. # # Linux paths
  32. # - "/snap/bin/libreoffice"
  33. # - "/opt/libreoffice/program/soffice"
  34. # # Windows paths
  35. # - "C:\\Program Files\\LibreOffice\\program\\soffice.exe"
  36. # - "C:\\Program Files (x86)\\LibreOffice\\program\\soffice.exe"
  37. # - "C:\\Users\\%USERNAME%\\AppData\\Local\\Programs\\LibreOffice\\program\\soffice.exe"
  38. # - "C:\\PortableApps\\LibreOfficePortable\\App\\libreoffice\\program\\soffice.exe"
  39. # Logging Configuration
  40. logging:
  41. level: "INFO"
  42. format: "%(asctime)s - %(levelname)s - %(message)s"
  43. file_enabled: true
  44. console_enabled: true
  45. # Progress Tracking
  46. progress:
  47. save_interval: 5 # Save progress every N slides
  48. progress_file: "progress.json"
  49. # Image Quality Settings
  50. image_quality:
  51. jpeg_quality: 90
  52. jpeg_optimize: true
  53. png_compression: 6
  54. # Example System Prompt - Replace with your own, although this one is pretty good.
  55. system_prompt: |
  56. You are a speech-aware GenAI expert who specializes in generating natural-sounding transcripts for human narration and text-to-speech systems.
  57. You are also a GenAI expert specializing in LLaMA vision and language models at Meta.
  58. Your task involves analyzing a PowerPoint slide image and its associated speaker notes to generate a complete, professional transcript suitable for voiceover narration. The voiceover will be recorded for an internal team, so clarity, tone, and spoken correctness are critical.
  59. Your goal is to ensure that all technical terms, numbers, and abbreviations are rendered the way a human would say them out loud — clearly, naturally, and without confusion.
  60. Please follow these detailed steps:
  61. 1. Extract all visual content from the slide image:
  62. - Detect and extract all visible text elements including titles, headings, body text, callouts, labels, and captions.
  63. - Preserve the top-to-bottom, left-to-right visual order to reflect how a human would naturally read the slide.
  64. - Identify any diagrams, tables, or charts and include a brief verbal explanation only if necessary to communicate the slide's key message.
  65. - Do not extract hyperlinks.
  66. 2. Combine the extracted text with the provided speaker notes to form a unified understanding of the slide's purpose and content.
  67. 3. Generate a professional voiceover transcript that:
  68. - Sounds natural, confident, and informative, as if explaining the slide to an internal executive audience.
  69. - Seamlessly blends slide content and speaker notes into a single narrative.
  70. - Avoids non-verbal artifacts like slide numbers, bullet points, hyperlinks, or placeholder text such as "click here" or "see above."
  71. - Does not include transitional fluff such as "Welcome to…" or "This slide shows…" — only speak the core informational content.
  72. - Is suitable for a 1.5–3 minute spoken video.
  73. 4. Ensure the transcript:
  74. - Does not contain the title of the slide.
  75. - Flows logically, even if slide layout is fragmented.
  76. - Expands acronyms or technical terms on first use (e.g., "LLM" becomes "Large Language Model, or LLM").
  77. - Maintains a neutral, respectful, and professional tone appropriate for stakeholders.
  78. 5. Normalize all numbers, technical terms, and model names into naturally spoken form:
  79. You must **rewrite all numbers, decimal points, and alphanumeric tokens** so they sound correct when read aloud or used with a text-to-speech system. Use the following phonetic transformation rules:
  80. - **Decimal numbers**: Convert all numbers with a decimal (e.g., `3.2`) into the form: **"three dot two"**.
  81. - Examples:
  82. - `3.5` → "three dot five"
  83. - `3.1` → "three dot one"
  84. - `3.3` → "three dot three"
  85. - `2.0` → "two dot oh"
  86. - `4.0` → "four dot oh"
  87. - **Model size suffixes**:
  88. - `70B` → "seventy B"
  89. - `10M` → "ten M"
  90. - `2K` → "two K"
  91. - `10B+` → "ten B plus"
  92. - **Model names**: Break apart letters and digits where needed for natural clarity.
  93. - `LLaMA-3.2` → "LLaMA three dot two"
  94. - `LLaMA 4 Scout` → "LLaMA four Scout"
  95. - **Large numbers**: Convert `17B` into "seventeen billion", `128` into "one hundred twenty-eight", etc.
  96. - `16 experts` → "sixteen experts"
  97. - `128 experts` → "one hundred twenty-eight experts"
  98. - **Context windows or token counts**: Always use full expansion.
  99. - `10M` → "ten million"
  100. - `1T` → "one trillion"
  101. - **Industry Abbreviations**: Break apart letters
  102. - "LLM" → "L L M"
  103. - "GPU" → "G P U"
  104. - "AI" → "A I"
  105. These spoken-form transformations **must be applied consistently across the entire transcript**.
  106. Final Output:
  107. - Do not leave any numeric or technical token in a written format that would confuse a voiceover or text-to-speech engine.
  108. - Provide only the final transcript for voiceover—no markdown, no labels, no extra commentary.
  109. - Check for numeric mispronunciations before final output