@@ -27,8 +27,6 @@ TODO: Add TT links

TODO: Supply requirements.txt file here instead

-### Dependencies
-
```bash
# Install all dependencies at once
pip install PyPDF2 python-docx beautifulsoup4 requests python-pptx yt-dlp youtube-transcript-api
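+
+# The TODO above asks for a requirements.txt; a hypothetical equivalent of the
+# one-liner above would be a file with one package per line (versions unpinned here):
+cat > requirements.txt <<'EOF'
+PyPDF2
+python-docx
+beautifulsoup4
+requests
+python-pptx
+yt-dlp
+youtube-transcript-api
+EOF
+pip install -r requirements.txt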
@@ -47,34 +45,17 @@ You can run these steps separately or combined (parsing + QA generation):

```bash
# STEP 1: PARSING - Extract text from documents
-
-# Parse a PDF (outputs to data/output/document.txt)
python src/main.py docs/report.pdf
-
-# Parse a website
-python src/main.py https://en.wikipedia.org/wiki/Artificial_intelligence
-
-# Get YouTube video transcripts
+python src/main.py URL  # works for web pages and YouTube; quote the URL if it contains special shell characters
python src/main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
-
-# Custom output location
python src/main.py docs/presentation.pptx -o my_training_data/
-
-# Specify the output filename
python src/main.py docs/contract.docx -n legal_text_001.txt
+```

-# Use verbose mode for debugging
-python src/main.py weird_file.pdf -v
-
-# COMBINED WORKFLOW - Parse and generate QA pairs in one step
-
-# Set your API key first
+```bash
+# Combined workflow: parse and generate QA pairs in one step
export CEREBRAS_API_KEY="your_key_here"
-
-# Parse a document and generate QA pairs automatically
python src/main.py docs/report.pdf --generate-qa
-
-# Parse with custom QA settings
python src/main.py docs/report.pdf --generate-qa --qa-pairs 50 --qa-threshold 8.0 --qa-model "llama-3.1-70b"
```
@@ -92,7 +73,7 @@ All outputs are saved as UTF-8 txt files in `data/output/` unless otherwise set.
- PDF extraction works best with digital PDFs, not scanned documents
- All parsers include error handling to gracefully manage parsing failures

-## 📁 Project Layout
+## Project Structure

```
.
@@ -118,89 +99,29 @@ All outputs are saved as UTF-8 txt files in `data/output/` unless otherwise set.
│   ├── main.py            # CLI entry point
│   └── generate_qa.py     # Creates Q&A pairs from text
│
-└── README.md              # You are here
+└── README.md
```

-## 🤖 Generate QA Pairs
+## Generating QA Pairs Separately

-After parsing your documents, transform them into high-quality QA pairs for LLM fine-tuning using the Cerebras API:
+If you want to run the QA pair generation step on its own:

```bash
-# Set your API key first
export CEREBRAS_API_KEY="your_key_here"

-# Generate QA pairs in 3 steps:
-# 1. Generate document summary
-# 2. Create question-answer pairs from content
-# 3. Rate and filter pairs based on quality
python src/generate_qa.py docs/report.pdf
-
-# Customize the generation
python src/generate_qa.py docs/report.pdf --num-pairs 30 --threshold 7.0
-
-# Skip parsing if you already have text
python src/generate_qa.py docs/report.pdf --text-file data/output/report.txt
-
-# Save output to a specific directory
python src/generate_qa.py docs/report.pdf --output-dir training_data/
-
-# Use a different model
python src/generate_qa.py docs/report.pdf --model llama-3.1-70b-instruct
```

-### 🔄 How It Works
-
-The QA generation pipeline follows these steps:
-
-1. **Document Parsing**: The document is converted to plain text using our parsers
-2. **Summary Generation**: The LLM creates a comprehensive summary of the document
-3. **QA Pair Creation**: The text is split into chunks, and QA pairs are generated from each
-4. **Quality Evaluation**: Each pair is rated on a 1-10 scale for relevance and quality
-5. **Filtering**: Only pairs above your quality threshold (default: 7.0) are kept
-6. **Format Conversion**: The final pairs are formatted for LLM fine-tuning
-
-### 📊 Output Format
-
-The script outputs a JSON file with:
-
-```jsonc
-{
-  "summary": "Comprehensive document summary...",
-
-  "qa_pairs": [
-    // All generated pairs
-    {"question": "What is X?", "answer": "X is..."}
-  ],
-
-  "filtered_pairs": [
-    // Only high-quality pairs
-    {"question": "What is X?", "answer": "X is...", "rating": 9}
-  ],
-
-  "conversations": [
-    // Ready-to-use conversation format for fine-tuning
-    [
-      {"role": "system", "content": "You are a helpful AI assistant..."},
-      {"role": "user", "content": "What is X?"},
-      {"role": "assistant", "content": "X is..."}
-    ]
-  ],
-
-  "metrics": {
-    "total": 25,             // Total generated pairs
-    "filtered": 19,          // Pairs that passed quality check
-    "retention_rate": 0.76,  // Percentage kept
-    "avg_score": 7.8         // Average quality score
-  }
-}
-```
-
## Known bugs/sharp edges:

- PDFs: Some PDFs are scanned images and need OCR. This is homework for users :) (see the OCR sketch below)
- YouTube: We assume videos have captions; if they don't, that's another task for readers :)
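+
+Scanned PDFs are out of scope for the built-in parser; below is a minimal, illustrative OCR fallback. It assumes the third-party `pdf2image` and `pytesseract` packages (plus the poppler and tesseract system tools) are installed; none of these ship with this project.
+
+```python
+# Hypothetical OCR fallback for scanned PDFs (not part of this repo).
+from pdf2image import convert_from_path  # renders PDF pages as PIL images (requires poppler)
+import pytesseract                       # Python wrapper for the tesseract OCR engine
+
+
+def ocr_pdf(pdf_path: str) -> str:
+    """Return the text of a scanned PDF by OCR-ing each rendered page."""
+    pages = convert_from_path(pdf_path)
+    return "\n".join(pytesseract.image_to_string(page) for page in pages)
+
+
+print(ocr_pdf("docs/scanned_report.pdf")[:500])  # hypothetical path; preview the first 500 characters
+```
+
+Once the recovered text is saved to disk, it can be passed to `src/generate_qa.py` via the `--text-file` option shown above.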

-## 🧠 System Architecture
+## Architecture

Here's how the document processing and QA generation pipeline works:
@@ -226,12 +147,9 @@ graph TD
N & O & P --> Q[JSON Output]
Q --> R[QA Pairs for Fine-tuning]
- Q --> S[Summarization {WIP}]
- Q --> T[DPO Fine-tuning {WIP}]
- Q --> U[Alpaca Format {WIP}]
```

-### 📄 Module Dependencies
+### Module Flow

- **main.py**: Entry point for document parsing
  - Imports parsers from `src/parsers/`
@@ -251,12 +169,6 @@ graph TD
- **src/parsers/**: Document format-specific parsers
  - Each parser implements `.parse()` and `.save()` methods
  - All inherit a common interface pattern for consistency (see the sketch below)
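+
+A minimal sketch of that shared interface (illustrative only; the actual class names and signatures in `src/parsers/` may differ):
+
+```python
+from pathlib import Path
+
+
+class BaseParser:
+    """Hypothetical common interface each format-specific parser follows."""
+
+    def __init__(self, source: str):
+        self.source = source  # file path or URL to extract text from
+        self.text: str = ""   # plain text produced by parse()
+
+    def parse(self) -> str:
+        """Extract plain text from the source; implemented per format."""
+        raise NotImplementedError
+
+    def save(self, output_path: str) -> Path:
+        """Write the extracted text to a UTF-8 .txt file and return its path."""
+        out = Path(output_path)
+        out.parent.mkdir(parents=True, exist_ok=True)
+        out.write_text(self.text, encoding="utf-8")
+        return out
+```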
-
-### 🚧 Work In Progress Features
-
-- **Summarization**: Create document summaries suitable for retrieval and semantic search
-- **DPO Fine-tuning**: Direct Preference Optimization format for better instruction following
-- **Alpaca Format**: Convert QA pairs to Alpaca instruction format for compatibility with more training pipelines
--------