
Update ReadMe.MD

Sanyam Bhutani 3 months ago
parent commit 135259d5ed
1 changed file with 98 additions and 10 deletions

end-to-end-use-cases/data-tool/ReadMe.MD (+98 -10)

@@ -27,6 +27,8 @@ TODO: Add TT links
 
 TODO: Supply requirements.txt file here instead
 
+### Dependencies
+
 ```bash
 # Install all dependencies at once
 pip install PyPDF2 python-docx beautifulsoup4 requests python-pptx yt-dlp youtube-transcript-api
@@ -45,17 +47,34 @@ You can run these steps separately or combined (parsing + QA generation):
 
 ```bash
 # STEP 1: PARSING - Extract text from documents
+
+# Parse a PDF (outputs to data/output/report.txt by default)
 python src/main.py docs/report.pdf
-python src/main.py URL (NO QOUTES)
+
+# Parse a website
+python src/main.py https://en.wikipedia.org/wiki/Artificial_intelligence
+
+# Get YouTube video transcripts
 python src/main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
+
+# Custom output location
 python src/main.py docs/presentation.pptx -o my_training_data/
+
+# Specify the output filename
 python src/main.py docs/contract.docx -n legal_text_001.txt
-```
 
-```bash
-#Entire logic together
+# Use verbose mode for debugging
+python src/main.py weird_file.pdf -v
+
+# COMBINED WORKFLOW - Parse and generate QA pairs in one step
+
+# Set your API key first
 export CEREBRAS_API_KEY="your_key_here"
+
+# Parse a document and generate QA pairs automatically
 python src/main.py docs/report.pdf --generate-qa
+
+# Parse with custom QA settings
 python src/main.py docs/report.pdf --generate-qa --qa-pairs 50 --qa-threshold 8.0 --qa-model "llama-3.1-70b"
 ```
 
@@ -73,7 +92,7 @@ All outputs are saved as UTF-8 txt files in `data/output/` unless otherwise set.
 - PDF extraction works best with digital PDFs, not scanned documents
 - All parsers include error handling to gracefully manage parsing failures
 
-## Structure
+## 📁 Project Layout
 
 ```
 .
@@ -99,29 +118,89 @@ All outputs are saved as UTF-8 txt files in `data/output/` unless otherwise set.
 │   ├── main.py        # CLI entry point
 │   └── generate_qa.py # Creates Q&A pairs from text
-└── README.md
+└── README.md          # You are here
 ```
 
-## QA Pairs Seperate
+## 🤖 Generate QA Pairs
 
-If you want to seperately just run QA pair logic:
+After parsing your documents, transform them into high-quality QA pairs for LLM fine-tuning using the Cerebras API:
 
 ```bash
+# Set your API key first
 export CEREBRAS_API_KEY="your_key_here"
 
+# Generate QA pairs in 3 steps:
+# 1. Generate document summary 
+# 2. Create question-answer pairs from content
+# 3. Rate and filter pairs based on quality
 python src/generate_qa.py docs/report.pdf
+
+# Customize the generation
 python src/generate_qa.py docs/report.pdf --num-pairs 30 --threshold 7.0
+
+# Skip parsing if you already have text
 python src/generate_qa.py docs/report.pdf --text-file data/output/report.txt
+
+# Save output to a specific directory
 python src/generate_qa.py docs/report.pdf --output-dir training_data/
+
+# Use a different model
 python src/generate_qa.py docs/report.pdf --model llama-3.1-70b-instruct
 ```
 
+### 🔄 How It Works
+
+The QA generation pipeline follows these steps (steps 3 through 5 are sketched in code after the list):
+
+1. **Document Parsing**: The document is converted to plain text using our parsers
+2. **Summary Generation**: The LLM creates a comprehensive summary of the document
+3. **QA Pair Creation**: The text is split into chunks, and QA pairs are generated from each
+4. **Quality Evaluation**: Each pair is rated on a 1-10 scale for relevance and quality
+5. **Filtering**: Only pairs above your quality threshold (default: 7.0) are kept
+6. **Format Conversion**: The final pairs are formatted for LLM fine-tuning
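+
+To make this concrete, here is a minimal sketch of steps 3 through 5 (chunking, rating, and threshold filtering). The chunk size and helper names are illustrative assumptions, not the actual logic in `src/generate_qa.py`:
+
+```python
+# Illustrative sketch only -- chunk_text and filter_pairs are hypothetical
+# stand-ins for the real chunking and filtering in src/generate_qa.py.
+
+def chunk_text(text: str, size: int = 4000) -> list[str]:
+    """Split the document into fixed-size chunks (assumed strategy)."""
+    return [text[i:i + size] for i in range(0, len(text), size)]
+
+def filter_pairs(pairs: list[dict], threshold: float = 7.0) -> list[dict]:
+    """Keep only pairs whose LLM-assigned rating meets the threshold."""
+    return [p for p in pairs if p.get("rating", 0) >= threshold]
+```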
+
+### 📊 Output Format
+
+The script outputs a JSON file with:
+
+```jsonc
+{
+  "summary": "Comprehensive document summary...",
+  
+  "qa_pairs": [
+    // All generated pairs
+    {"question": "What is X?", "answer": "X is..."}
+  ],
+  
+  "filtered_pairs": [
+    // Only high-quality pairs
+    {"question": "What is X?", "answer": "X is...", "rating": 9}
+  ],
+  
+  "conversations": [
+    // Ready-to-use conversation format for fine-tuning
+    [
+      {"role": "system", "content": "You are a helpful AI assistant..."},
+      {"role": "user", "content": "What is X?"},
+      {"role": "assistant", "content": "X is..."}
+    ]
+  ],
+  
+  "metrics": {
+    "total": 25,              // Total generated pairs
+    "filtered": 19,           // Pairs that passed quality check
+    "retention_rate": 0.76,   // Percentage kept
+    "avg_score": 7.8          // Average quality score
+  }
+}
+```
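+
+Once the JSON is written, the `conversations` list can be fed straight into a chat-style fine-tuning pipeline. A minimal loading sketch (the output filename here is an assumption; check your `--output-dir`):
+
+```python
+import json
+
+# Hypothetical path -- the real filename depends on your input and --output-dir.
+with open("data/output/report_qa.json", encoding="utf-8") as f:
+    result = json.load(f)
+
+m = result["metrics"]
+print(f"Kept {m['filtered']} of {m['total']} pairs (avg score {m['avg_score']})")
+
+# Each conversation is a list of {role, content} messages.
+for message in result["conversations"][0]:
+    print(message["role"], "->", message["content"][:60])
+```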
+
 ## Known bugs/sharp edges:
 
 - PDFs: Some PDFs are scanned images and need OCR. This is homework for users :) (one common fallback is sketched below)
 - YouTube: We assume videos have captions; if they don't, that's another task for readers :)
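+
+A common OCR fallback for scanned PDFs (not part of this tool; assumes `pdf2image`, `pytesseract`, and the Tesseract binary are installed):
+
+```python
+# Not part of this repo -- a typical DIY OCR fallback for scanned PDFs.
+from pdf2image import convert_from_path
+import pytesseract
+
+pages = convert_from_path("docs/scanned.pdf")  # rasterize each page
+text = "\n".join(pytesseract.image_to_string(p) for p in pages)
+
+with open("data/output/scanned.txt", "w", encoding="utf-8") as f:
+    f.write(text)  # matches the tool's UTF-8 txt convention
+```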
 
-## Mind map
+## 🧠 System Architecture
 
 Here's how the document processing and QA generation pipeline works:
 
@@ -147,9 +226,12 @@ graph TD
     
     N & O & P --> Q[JSON Output]
     Q --> R[QA Pairs for Fine-tuning]
+    Q --> S["Summarization (WIP)"]
+    Q --> T["DPO Fine-tuning (WIP)"]
+    Q --> U["Alpaca Format (WIP)"]
 ```
 
-### Module flow:
+### 📄 Module Dependencies
 
 - **main.py**: Entry point for document parsing
   - Imports parsers from `src/parsers/`
@@ -169,6 +251,12 @@ graph TD
 - **src/parsers/**: Document format-specific parsers
   - Each parser implements `.parse()` and `.save()` methods
   - All inherit a common interface pattern for consistency (sketched below)
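+
+A hypothetical sketch of that shared interface (names are illustrative; the real classes live in `src/parsers/`):
+
+```python
+# Illustrative only -- see src/parsers/ for the actual parser classes.
+from abc import ABC, abstractmethod
+
+class BaseParser(ABC):
+    @abstractmethod
+    def parse(self, source: str) -> str:
+        """Extract plain text from a file path or URL."""
+
+    def save(self, text: str, path: str) -> None:
+        """Write extracted text as UTF-8, matching the tool's output convention."""
+        with open(path, "w", encoding="utf-8") as f:
+            f.write(text)
+```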
+  
+### 🚧 Work In Progress Features
+
+- **Summarization**: Create document summaries suitable for retrieval and semantic search
+- **DPO Fine-tuning**: Direct Preference Optimization format for better instruction following
+- **Alpaca Format**: Convert QA pairs to Alpaca instruction format for compatibility with more training pipelines (see the sketch below)
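+
+For reference, the standard Alpaca record is `{"instruction", "input", "output"}`; a sketch of the planned conversion (not shipped code):
+
+```python
+def to_alpaca(pairs: list[dict]) -> list[dict]:
+    """Map filtered QA pairs onto the standard Alpaca record shape."""
+    return [
+        {
+            "instruction": p["question"],
+            "input": "",            # no extra context beyond the question
+            "output": p["answer"],
+        }
+        for p in pairs
+    ]
+```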
 
 --------