@@ -1,4 +1,174 @@
-## WIP
+# Data Prep Toolkit
+
+If you are working on fine-tuning a Large Language Model, the biggest effort is usually preparing the dataset.
+
+## What does this tool do?
+
+This tool contains a set of utility functions that make it easy to prepare a dataset and load it into torchtune:
+- Parse files: Convert any supported file type (listed below) to txt
+- Convert to prompt-response pairs: Turn the txt file into prompt-response pairs
+- Filter: Filter out low-quality prompt-response pairs
+- Fine-tune: Use the pre-defined configs to fine-tune an LLM using torchtune
+
+TODO: Add TT links
+
+## Parsers:
+
+(WIP) We support the following file formats for parsing:
+
+- **PDF** - Extracts text from most PDF files (text-based, not scanned)
+- **HTML/Web** - Pulls content from local HTML files or any webpage URL
+- **YouTube** - Grabs video transcripts (does not transcribe video)
+- **Word (DOCX)** - Extracts text, tables, and structure from Word docs (not images)
+- **PowerPoint (PPTX)** - Pulls text from slides, notes, and tables
+- **TXT** - For when you already have plain text (just copies it)
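Under the hood, `src/main.py` routes each input to the right parser. As a rough sketch of that dispatch (the names and logic here are hypothetical illustrations; the real routing lives in `src/main.py` and may differ):

```python
from pathlib import Path

# Hypothetical extension-to-parser table; illustrative only.
PARSERS = {
    ".pdf": "pdf_parser",
    ".html": "html_parser",
    ".docx": "docx_parser",
    ".pptx": "ppt_parser",
    ".txt": "txt_parser",
}

def pick_parser(source: str) -> str:
    """Return a parser name for a file path or URL."""
    if source.startswith(("http://", "https://")):
        # YouTube links get the transcript parser, other URLs the HTML parser
        if "youtube.com" in source or "youtu.be" in source:
            return "youtube_parser"
        return "html_parser"
    suffix = Path(source).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f"Unsupported file type: {suffix}")
    return PARSERS[suffix]
```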
+
+## Installation:
+
+TODO: Supply requirements.txt file here instead
+
+### Dependencies
+
+```bash
+# Install all dependencies at once
+pip install PyPDF2 python-docx beautifulsoup4 requests python-pptx yt-dlp youtube-transcript-api
+```
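Until the requirements.txt file lands, the same dependency list (taken straight from the pip command above, with no pinned versions) can be captured as one:

```text
PyPDF2
python-docx
beautifulsoup4
requests
python-pptx
yt-dlp
youtube-transcript-api
```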
+
+### Steps to run:
+
+TODO: Add links here
+
+1. Clone TODO: Link
+2. Install TODO: req.txt
+3. Run the parser on your files
+4. Create prompt-response pairs
+5. Filter out low-quality pairs
+6. Load the dataset into torchtune
+
+## How to use:
+
+```bash
+# Parse a PDF (outputs to data/output/document.txt)
+python src/main.py docs/report.pdf
+
+# Parse a website
+python src/main.py https://en.wikipedia.org/wiki/Artificial_intelligence
+
+# Get YouTube video transcripts
+python src/main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
+
+# Custom output location
+python src/main.py docs/presentation.pptx -o my_training_data/
+
+# Specify the output filename
+python src/main.py docs/contract.docx -n legal_text_001.txt
+
+# Use verbose mode for debugging
+python src/main.py weird_file.pdf -v
+```
+
+All outputs are saved as UTF-8 txt files in `data/output/` unless otherwise set.
+
+### Rough edges:
+
+- Use quotes around YouTube URLs to avoid shell issues with the `&` character
+- HTML parser works with both local files and web URLs
+  - Enhanced with session persistence and retry mechanisms
+  - Some sites with strong bot protection may still block access
+- YouTube parser automatically extracts both manual and auto-generated captions
+  - Prioritizes manual captions when available for better quality
+  - Formats transcripts cleanly with proper text structure
+- PDF extraction works best with digital PDFs, not scanned documents
+- All parsers include error handling to gracefully manage parsing failures
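The retry behavior mentioned above can be pictured with a generic helper; this is a simplified stdlib-only sketch for illustration, not the parsers' actual implementation:

```python
import time

def with_retries(fn, attempts=3, delay=0.0):
    """Call fn(), retrying on any exception; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)  # back off before the next try
```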
+
+## 📁 Project Layout
+
+```
+.
+├── data/                     # Where docs live
+│   ├── pdf/                  # PDF documents
+│   ├── html/                 # HTML files
+│   ├── youtube/              # YouTube transcript stuff
+│   ├── docx/                 # Word documents
+│   ├── ppt/                  # PowerPoint slides
+│   ├── txt/                  # Plain text
+│   └── output/               # Where the magic happens (output)
+│
+├── src/                      # The code that makes it tick
+│   ├── parsers/              # All our parser implementations
+│   │   ├── pdf_parser.py     # PDF -> text
+│   │   ├── html_parser.py    # HTML/web -> text
+│   │   ├── youtube_parser.py # YouTube -> text
+│   │   ├── docx_parser.py    # Word -> text
+│   │   ├── ppt_parser.py     # PowerPoint -> text
+│   │   ├── txt_parser.py     # Text -> text (not much to do here)
+│   │   └── __init__.py
+│   ├── __init__.py
+│   ├── main.py               # CLI entry point
+│   └── generate_qa.py        # Creates Q&A pairs from text
+│
+└── README.md                 # You are here
+```
+
+## 🤖 Generate Q&A Pairs for Fine-tuning
+
+Want to turn your documents into training data? After parsing, use the `generate_qa.py` script to create question-answer pairs using the Cerebras LLM API:
+
+```bash
+# Set your API key first
+export CEREBRAS_API_KEY="your_key_here"
+
+# Generate QA pairs from a document in three steps:
+# 1. Summarize the document
+# 2. Generate question-answer pairs
+# 3. Evaluate & filter based on relevance
+python src/generate_qa.py docs/report.pdf
+
+# Customize the generation
+python src/generate_qa.py docs/report.pdf --num-pairs 30 --threshold 7.0
+
+# Skip parsing if you already have text
+python src/generate_qa.py docs/report.pdf --text-file data/output/report.txt
+
+# Save output to a specific directory
+python src/generate_qa.py docs/report.pdf --output-dir training_data/
+```
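The evaluate-and-filter step keeps only pairs whose relevance score meets `--threshold`. A minimal sketch of that filtering, assuming each pair has already been scored by the LLM (the pair/score shapes here are hypothetical, not the script's actual data model):

```python
def filter_pairs(scored_pairs, threshold=7.0):
    """Keep pairs from (pair, score) entries whose score meets the threshold."""
    return [pair for pair, score in scored_pairs if score >= threshold]
```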
+
+### 📊 Sample Output
+
+The script outputs a JSON file with:
+- A comprehensive summary of the document
+- Question-answer pairs in a format ready for fine-tuning
+- Quality metrics about the generated pairs
+
+```jsonc
+{
+  "summary": "This document explains the principles of deep learning...",
+
+  "qa_pairs": [
+    {"from": "user", "value": "What are the key components of a neural network?"},
+    {"from": "assistant", "value": "The key components of a neural network include..."},
+    // More Q&A pairs...
+  ],
+
+  "metrics": {
+    "initial_pairs": 25,            // Total generated pairs
+    "filtered_pairs": 19,           // Pairs that passed quality check
+    "retention_rate": 0.76,         // Fraction of pairs kept
+    "average_relevance_score": 7.8  // Average quality score
+  }
+}
+```
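You can sanity-check a result file by recomputing the retention rate from the metrics. A small sketch, assuming the layout above with the `//` comments removed (plain `json` rejects comments):

```python
import json

# Example result mirroring the sample layout above, minus the jsonc comments.
result = json.loads("""
{
  "qa_pairs": [
    {"from": "user", "value": "What are the key components of a neural network?"},
    {"from": "assistant", "value": "The key components include..."}
  ],
  "metrics": {"initial_pairs": 25, "filtered_pairs": 19, "retention_rate": 0.76}
}
""")

m = result["metrics"]
# Retention rate should equal filtered / initial (19 / 25 = 0.76)
assert m["filtered_pairs"] / m["initial_pairs"] == m["retention_rate"]
```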
+
+## Known bugs/sharp edges:
+
+- PDFs: Some PDFs are scanned images and need OCR. This is left as homework for users :)
+- YouTube: We assume videos have captions; if they don't, that's another task for readers :)
+
+
+--------
+
+## WIP BFCL FT:

 ### Instructions to run:

@@ -35,4 +205,4 @@ Setup:
 - configs: Has the config prompts for creating synthetic data using `3.3`
 - data_prep/scripts: This is what you would like to run to prepare your datasets for annotation
 - scripts/annotation-inference: Script for generating synthetic datasets -> Use the vllm script for inference
-- fine-tuning: configs for FT using TorchTune
+- fine-tuning: configs for FT using TorchTune
|