# PowerPoint to Voiceover Transcript

RAG-powered AI system that converts PowerPoint presentations into professional voiceover transcripts using Llama 4 Maverick.

Transform PowerPoint slides into natural-sounding voiceover scripts optimized for human narration and text-to-speech systems.
Use cases include:

- Technical documentation: API walkthroughs, system architecture presentations
## Quick Start

### Installation (uv, recommended)

```bash
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and set up the project
git clone <repository-url>
cd powerpoint-to-voiceover-transcript
uv sync

# Activate the environment
source .venv/bin/activate      # macOS/Linux
# or: .venv\Scripts\activate   # Windows
```
Alternatively, install with pip:

```bash
git clone <repository-url>
cd powerpoint-to-voiceover-transcript
pip install -e .
```
### Configuration

```bash
# Copy the environment template
cp .env.example .env

# Edit .env and add your API key
echo "GROQ_API_KEY=your_api_key_here" >> .env
```

Then configure your presentation in `config.yaml` by updating the `pptx_file` path to point at your presentation.
### Run a workflow

```bash
# Basic workflow with narrative continuity
jupyter notebook narrative_continuity_workflow.ipynb

# Advanced workflow with knowledge enhancement
jupyter notebook knowledge_enhanced_workflow.ipynb
```
### Python API

```python
from src.core.pptx_processor import pptx_to_images_and_notes
from src.processors.unified_transcript_generator import UnifiedTranscriptProcessor

# Extract slide images and speaker notes
result = pptx_to_images_and_notes("input/presentation.pptx", "output/")

# Generate transcripts with full enhancement
processor = UnifiedTranscriptProcessor(
    use_narrative=True,
    context_window_size=5,
    enable_knowledge=True,
)

transcripts = processor.process_slides_dataframe(
    result["notes_df"],
    "output/",
    save_context=True,
)

# Save results
transcripts.to_csv("output/transcripts.csv", index=False)
```
## Knowledge Base

Add markdown files to the `knowledge_base/` directory:

```bash
printf "# Product Overview\nOur product...\n" > knowledge_base/product.md
printf "# Technical Terms\nAPI means...\n" > knowledge_base/glossary.md
```

The system automatically chunks these files, embeds them, and retrieves the most relevant passages while generating each slide's transcript.
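Conceptually, this retrieval is a cosine-similarity search over embedded chunks. A minimal numpy sketch of what the `top_k` / `similarity_threshold` settings control (the real system uses FAISS with `all-MiniLM-L6-v2` embeddings; the toy vectors and helper below are purely illustrative):

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=5, threshold=0.3):
    """Return up to k chunks whose cosine similarity to the query
    meets the threshold (mirrors search.top_k / similarity_threshold)."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = m @ q                          # cosine similarity per chunk
    order = np.argsort(sims)[::-1][:k]    # best-first, capped at k
    return [(chunks[i], float(sims[i])) for i in order if sims[i] >= threshold]

# Toy example: three chunk embeddings in a 3-d space
chunks = ["llama diet", "llama habitat", "api glossary"]
vecs = np.array([[1.0, 0.1, 0.0], [0.9, 0.3, 0.1], [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.0, 0.0])
results = top_k_chunks(query, vecs, chunks, k=2)
```

The FLAT index in the config performs exactly this exhaustive comparison; IVF and HNSW trade some recall for speed on larger knowledge bases.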
## Configuration (`config.yaml`)

```yaml
# API Configuration
api:
  groq_model: "meta-llama/llama-4-maverick-17b-128e-instruct"
  max_retries: 3
  retry_delay: 1

# Processing Configuration
processing:
  default_dpi: 200
  default_format: "png"
  batch_size: 5

# Current Project Settings
current_project:
  pptx_file: "input/All About Llamas"
  extension: ".pptx"
  output_dir: "output/"

# Knowledge Base Configuration
knowledge:
  # Core settings
  enabled: true
  knowledge_base_dir: "knowledge_base"

  # FAISS Vector Store
  vector_store:
    index_type: "flat"         # "flat", "ivf", "hnsw"
    use_gpu: false             # Enable GPU acceleration
    cache_enabled: true        # Persistent caching

  # Embedding Configuration
  embedding:
    model_name: "all-MiniLM-L6-v2"  # Lightweight, fast model
    device: "cpu"                   # Use "cuda" if a GPU is available

  # Search Configuration
  search:
    top_k: 5                   # Number of chunks to retrieve
    similarity_threshold: 0.3  # Minimum similarity score
    max_chunk_size: 1000       # Maximum characters per chunk

  # Context Integration
  context:
    strategy: "combined"       # "knowledge_only", "narrative_priority", "combined"
    knowledge_weight: 0.3      # Knowledge influence (0.0-1.0)
    integration_method: "system_prompt"

  # Performance Settings
  performance:
    enable_caching: true       # Cache embeddings and search results
    max_memory_mb: 512         # Maximum memory for embeddings
    lazy_loading: true         # Load embeddings on demand

# File Paths
paths:
  cache_dir: "cache"
  logs_dir: "logs"
  temp_dir: "temp"

# Logging
logging:
  level: "INFO"
  file_enabled: true
  console_enabled: true
```
## Environment Variables (`.env`)

```bash
# Required
GROQ_API_KEY=your_groq_api_key_here
```
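A quick way to fail fast when the key is missing, before any slides are processed (illustrative helper, not part of the project's API):

```python
import os

def require_groq_key() -> str:
    """Return GROQ_API_KEY from the environment, or raise a clear error
    pointing at the .env setup step."""
    key = os.environ.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY is not set; copy .env.example to .env and add your key."
        )
    return key
```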
### Context Strategies

| Strategy | Configuration | Best For |
|---|---|---|
| Knowledge Only | `strategy: "knowledge_only"` | Technical docs, specifications |
| Narrative Priority | `strategy: "narrative_priority"` | Storytelling, educational content |
| Combined | `strategy: "combined"` | Most presentations (recommended) |
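Since `integration_method` is `"system_prompt"`, each strategy effectively decides how the narrative and knowledge context are merged into one prompt block. A hedged sketch of that decision (function name and prompt wording are hypothetical; the project's actual prompt assembly may differ):

```python
def build_system_context(narrative: str, knowledge: str,
                         strategy: str = "combined") -> str:
    """Illustrative merge of narrative and knowledge context per strategy."""
    if strategy == "knowledge_only":
        # Ignore narrative continuity entirely
        return knowledge
    if strategy == "narrative_priority":
        # Narrative leads; knowledge is only a trailing reference
        return f"{narrative}\n\nBackground reference:\n{knowledge}"
    # "combined" (recommended): both sources, narrative first;
    # knowledge_weight in config scales how strongly the knowledge
    # section is emphasized in the real system prompt.
    return f"{narrative}\n\nRelevant knowledge:\n{knowledge}"
```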
### Performance Tuning

For speed:

```yaml
processing:
  default_dpi: 150
knowledge:
  search:
    top_k: 3
  performance:
    max_memory_mb: 256
```

For quality:

```yaml
knowledge:
  search:
    top_k: 7
    similarity_threshold: 0.2
  context:
    knowledge_weight: 0.4
```
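Partial fragments like these only make sense if they are merged recursively into the base config, so untouched keys keep their defaults. A small stdlib-only sketch of that merge (the project may load and override its config differently):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base, so a partial
    YAML fragment can patch the full config without clobbering siblings."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"knowledge": {"search": {"top_k": 5, "similarity_threshold": 0.3}}}
speed = {"knowledge": {"search": {"top_k": 3}}}
cfg = deep_merge(base, speed)
# top_k is overridden; similarity_threshold keeps its default
```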
## Architecture

The system follows a modular three-stage pipeline:

```text
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   PowerPoint    │───▶│     Content     │───▶│    Knowledge    │
│      File       │    │   Extraction    │    │    Retrieval    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                       │
                                                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     Output      │◀───│   Transcript    │◀───│ LLM Processing  │
│      Files      │    │   Generation    │    │  Vision & Text  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
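The stages above can be sketched as plain functions composed in order (names, signatures, and the keyword-overlap "retrieval" below are illustrative stand-ins, not the project's actual API — the real stage 2 uses vector search and stage 3 calls the Groq vision model):

```python
from dataclasses import dataclass, field

@dataclass
class SlideJob:
    """Minimal carrier for one slide moving through the pipeline."""
    notes: str
    image_path: str
    knowledge: list = field(default_factory=list)
    transcript: str = ""

def extract(notes: str, image_path: str) -> SlideJob:        # stage 1
    return SlideJob(notes=notes, image_path=image_path)

def retrieve(job: SlideJob, kb: list) -> SlideJob:           # stage 2 (toy match)
    job.knowledge = [doc for doc in kb
                     if any(word in doc for word in job.notes.split())]
    return job

def generate(job: SlideJob) -> SlideJob:                     # stage 3 (LLM stubbed)
    job.transcript = f"Narration for: {job.notes}"
    return job

job = generate(retrieve(extract("llama diet basics", "slide-001.png"),
                        ["llamas eat hay", "api glossary"]))
```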
### Components

- **Content Extraction** (`pptx_processor.py`): slide and notes extraction via `python-pptx`
- **Knowledge Retrieval** (`markdown_knowledge.py`): markdown knowledge base search
- **LLM Processing** (`groq_client.py` + `unified_transcript_generator.py`): transcript generation via the `groq` client

### Project Structure

```text
├── README.md                           # This guide
├── config.yaml                         # Configuration
├── pyproject.toml                      # Dependencies
├── knowledge_enhanced_workflow.ipynb   # Advanced notebook
├── narrative_continuity_workflow.ipynb # Basic notebook
├── input/                              # PowerPoint files
│   └── All About Llamas.pptx           # Example presentation
├── output/                             # Generated transcripts and images
├── knowledge_base/                     # Domain knowledge (.md files)
│   ├── llama diet.md
│   └── llamas.md
└── src/
    ├── config/settings.py              # Configuration management
    ├── core/                           # Core processing
    │   ├── pptx_processor.py           # PPTX extraction
    │   ├── groq_client.py              # Groq API client
    │   └── image_processing.py         # Image encoding
    ├── knowledge/                      # Knowledge management
    │   ├── faiss_knowledge.py          # FAISS vector search
    │   └── context_manager.py          # Context integration
    ├── processors/                     # Main processing
    │   └── unified_transcript_generator.py
    └── utils/                          # Utilities
        ├── visualization.py            # Slide display
        └── transcript_display.py       # Transcript formatting
```
## Processing Your Own Presentation

1. Add your PowerPoint:

   ```bash
   cp your_presentation.pptx input/
   ```

2. Update `config.yaml`:

   ```yaml
   current_project:
     pptx_file: "input/your_presentation"
   ```

3. Add domain knowledge (optional):

   ```bash
   printf "# Your Domain Info\n...\n" > knowledge_base/domain.md
   ```

4. Run processing:

   ```bash
   jupyter notebook knowledge_enhanced_workflow.ipynb
   ```
## Troubleshooting

| Issue | Solution |
|---|---|
| LibreOffice not found | `brew install --cask libreoffice` (macOS) |
| API key error | Set the `GROQ_API_KEY` environment variable |
| Memory issues | Reduce `context_window_size` and `top_k` in config |
| Slow processing | Lower `default_dpi` or disable knowledge: `enabled: false` |
| Knowledge base not loading | Check that `.md` files exist: `ls knowledge_base/*.md` |
## Output

After processing, check the `output/` directory for:

- Slide images: `slide-001.png`, `slide-002.png`, etc.
- Transcripts: `*_transcripts.csv` and `*_transcripts.json`
- `narrative_context/` (if narrative mode enabled)
- `knowledge_base_stats.json` (if knowledge enabled)