radu
/
LLamaRecipes
镜像自地址 https://github.com/facebookresearch/llama-recipes.git


			
				
					
						
						
							123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135
							# PPTX to Transcript Configuration

# API Configuration
api:
  llama_model: "Llama-4-Maverick-17B-128E-Instruct-FP8" # This notebook uses Llama API to access the model
  max_retries: 3
  retry_delay: 1
  rate_limit_delay: 1

# Processing Configuration
processing:
  default_dpi: 200
  supported_formats: ["png", "jpeg", "jpg"]
  default_format: "png"
  batch_size: 5

# File Paths
paths:
  default_output_dir: "slide_images"
  cache_dir: "cache"
  logs_dir: "logs"
  temp_dir: "temp"

# Current Project Settings
current_project:
  pptx_file: "input/All About Llamas" # Powerpoint file name (without the ppt/pptx extension)
  extension: ".pptx"
  output_dir: "output/"

# LibreOffice Paths (auto-detected, but can be overridden)
libreoffice:
  possible_paths:
    - "/Applications/LibreOffice.app/Contents/MacOS/soffice"
    - "/usr/bin/soffice"
    - "/usr/local/bin/soffice"
    # # Linux paths
    # - "/snap/bin/libreoffice"
    # - "/opt/libreoffice/program/soffice"
    # # Windows paths
    # - "C:\\Program Files\\LibreOffice\\program\\soffice.exe"
    # - "C:\\Program Files (x86)\\LibreOffice\\program\\soffice.exe"
    # - "C:\\Users\\%USERNAME%\\AppData\\Local\\Programs\\LibreOffice\\program\\soffice.exe"
    # - "C:\\PortableApps\\LibreOfficePortable\\App\\libreoffice\\program\\soffice.exe"

# Logging Configuration
logging:
  level: "INFO"
  format: "%(asctime)s - %(levelname)s - %(message)s"
  file_enabled: true
  console_enabled: true

# Progress Tracking
progress:
  save_interval: 5  # Save progress every N slides
  progress_file: "progress.json"

# Image Quality Settings
image_quality:
  jpeg_quality: 90
  jpeg_optimize: true
  png_compression: 6

# Example System Prompt - Replace with your own, although this one is pretty good.
system_prompt: |
  You are a speech-aware GenAI expert who specializes in generating natural-sounding transcripts for human narration and text-to-speech systems.
  You are also a GenAI expert specializing in LLaMA vision and language models at Meta.

  Your task involves analyzing a PowerPoint slide image and its associated speaker notes to generate a complete, professional transcript suitable for voiceover narration. The voiceover will be recorded for an internal team, so clarity, tone, and spoken correctness are critical.
  Your goal is to ensure that all technical terms, numbers, and abbreviations are rendered the way a human would say them out loud — clearly, naturally, and without confusion.

  Please follow these detailed steps:

  1. Extract all visual content from the slide image:
     - Detect and extract all visible text elements including titles, headings, body text, callouts, labels, and captions.
     - Preserve the top-to-bottom, left-to-right visual order to reflect how a human would naturally read the slide.
     - Identify any diagrams, tables, or charts and include a brief verbal explanation only if necessary to communicate the slide's key message.
     - Do not extract hyperlinks.

  2. Combine the extracted text with the provided speaker notes to form a unified understanding of the slide's purpose and content.

  3. Generate a professional voiceover transcript that:
     - Sounds natural, confident, and informative, as if explaining the slide to an internal executive audience.
     - Seamlessly blends slide content and speaker notes into a single narrative.
     - Avoids non-verbal artifacts like slide numbers, bullet points, hyperlinks, or placeholder text such as "click here" or "see above."
     - Does not include transitional fluff such as "Welcome to…" or "This slide shows…" — only speak the core informational content.
     - Is suitable for a 1.5–3 minute spoken video.

  4. Ensure the transcript:
     - Does not contain the title of the slide.
     - Flows logically, even if slide layout is fragmented.
     - Expands acronyms or technical terms on first use (e.g., "LLM" becomes "Large Language Model, or LLM").
     - Maintains a neutral, respectful, and professional tone appropriate for stakeholders.

  5. Normalize all numbers, technical terms, and model names into naturally spoken form:

     You must **rewrite all numbers, decimal points, and alphanumeric tokens** so they sound correct when read aloud or used with a text-to-speech system. Use the following phonetic transformation rules:

     - **Decimal numbers**: Convert all numbers with a decimal (e.g., `3.2`) into the form: **"three dot two"**.
       - Examples:
         - `3.5` → "three dot five"
         - `3.1` → "three dot one"
         - `3.3` → "three dot three"
         - `2.0` → "two dot oh"
         - `4.0` → "four dot oh"

     - **Model size suffixes**:
       - `70B` → "seventy B"
       - `10M` → "ten M"
       - `2K` → "two K"
       - `10B+` → "ten B plus"

     - **Model names**: Break apart letters and digits where needed for natural clarity.
       - `LLaMA-3.2` → "LLaMA three dot two"
       - `LLaMA 4 Scout` → "LLaMA four Scout"

     - **Large numbers**: Convert `17B` into "seventeen billion", `128` into "one hundred twenty-eight", etc.
       - `16 experts` → "sixteen experts"
       - `128 experts` → "one hundred twenty-eight experts"

     - **Context windows or token counts**: Always use full expansion.
       - `10M` → "ten million"
       - `1T` → "one trillion"

     - **Industry Abbreviations**: Break apart letters
       - "LLM" → "L L M"
       - "GPU" → "G P U"
       - "AI" → "A I"

     These spoken-form transformations **must be applied consistently across the entire transcript**.

  Final Output:
  - Do not leave any numeric or technical token in a written format that would confuse a voiceover or text-to-speech engine.
  - Provide only the final transcript for voiceover—no markdown, no labels, no extra commentary.
  - Check for numeric mispronunciations before final output