Update ReadMe.MD

Sanyam Bhutani 1 month ago
commit 0ffc25e4a2
1 file changed, 172 insertions and 2 deletions

+ 172 - 2
end-to-end-use-cases/data-tool/ReadMe.MD

@@ -1,4 +1,174 @@
-## WIP
+# Data Prep Toolkit
+
+If you are working on fine-tuning a Large Language Model, the biggest effort is usually preparing the dataset. 
+
+## What does this tool do?
+
+This tool contains a set of utility functions that make it easy to prepare a dataset and load it into torchtune:
+- Parse files: Convert any supported file type (list below) to txt
+- Convert to prompt-response pairs: Convert the txt file to prompt-response pairs
+- Filter: Filter out low-quality prompt-response pairs
+- Fine-tune: Use the pre-defined configs to fine-tune an LLM with torchtune
+
+TODO: Add TT links
+
+## Parsers:
+
+(WIP) We support the following file formats for parsing:
+
+- **PDF** - Extracts text from most PDF files (text-based, not scanned)
+- **HTML/Web** - Pulls content from local HTML files or any webpage URL
+- **YouTube** - Grabs video transcripts (does not transcribe video)
+- **Word (DOCX)** - Extracts text, tables, and structure from Word docs (not images)
+- **PowerPoint (PPTX)** - Pulls text from slides, notes, and tables
+- **TXT** - For when you already have plain text (just copies it)
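+
+As a rough illustration of what a parser here does, below is a minimal PDF-to-text sketch built on PyPDF2 (installed in the dependency step below); the function name and output formatting are illustrative, not the toolkit's actual API:
+
+```python
+from PyPDF2 import PdfReader
+
+def pdf_to_text(pdf_path: str) -> str:
+    """Illustrative sketch: extract text from a digital (non-scanned) PDF."""
+    reader = PdfReader(pdf_path)
+    # extract_text() can return None for pages with no extractable text
+    pages = [page.extract_text() or "" for page in reader.pages]
+    return "\n\n".join(pages)
+```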
+
+## Installation:
+
+TODO: Supply requirements.txt file here instead
+
+### Dependencies
+
+```bash
+# Install all dependencies at once
+pip install PyPDF2 python-docx beautifulsoup4 requests python-pptx yt-dlp youtube-transcript-api
+```
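+
+Note that several of these packages import under a different name than the one used on PyPI (for example, `python-docx` imports as `docx`). An optional sanity check that the environment is ready:
+
+```python
+# The pip names above map to these import names; this only confirms everything installed.
+import PyPDF2, docx, bs4, requests, pptx, yt_dlp, youtube_transcript_api
+print("All parser dependencies import cleanly.")
+```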
+
+### Steps to run:
+
+TODO: Add links here
+
+1. Clone TODO: Link
+2. Install TODO: req.txt
+3. Run the parser on your files
+4. Create prompt responses
+5. Filter low quality
+6. Load to TT 
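+
+Steps 3-5 can also be scripted end to end. A rough sketch using the CLI entry points documented below (the file paths are the example ones reused throughout this README):
+
+```python
+# Rough sketch of steps 3-5; paths match the examples used later in this README.
+import subprocess
+
+# 3. Parse the source document to plain text (lands in data/output/ by default)
+subprocess.run(["python", "src/main.py", "docs/report.pdf"], check=True)
+
+# 4-5. Generate prompt-response pairs and filter out the low-quality ones
+subprocess.run(["python", "src/generate_qa.py", "docs/report.pdf",
+                "--text-file", "data/output/report.txt"], check=True)
+
+# 6. Point your torchtune config at the generated JSON (torchtune links TBD above)
+```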
+
+## How to use:
+
+```bash
+# Parse a PDF (outputs to data/output/document.txt)
+python src/main.py docs/report.pdf
+
+# Parse a website
+python src/main.py https://en.wikipedia.org/wiki/Artificial_intelligence
+
+# Get YouTube video transcripts
+python src/main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
+
+# Custom output location
+python src/main.py docs/presentation.pptx -o my_training_data/
+
+# Specify the output filename
+python src/main.py docs/contract.docx -n legal_text_001.txt
+
+# Use verbose mode for debugging
+python src/main.py weird_file.pdf -v
+```
+
+All outputs are saved as UTF-8 txt files in `data/output/` unless you pick a different location with `-o`.
+
+### Rough edges:
+
+- Use quotes around YouTube URLs to avoid shell issues with the `&` character
+- HTML parser works with both local files and web URLs
+  - Enhanced with session persistence and retry mechanisms
+  - Some sites with strong bot protection may still block access
+- YouTube parser automatically extracts both manual and auto-generated captions
+  - Prioritizes manual captions when available for better quality
+  - Formats transcripts cleanly with proper text structure
+- PDF extraction works best with digital PDFs, not scanned documents
+- All parsers include error handling to gracefully manage parsing failures
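+
+For the curious, the "session persistence and retry mechanisms" mentioned above amount to something like the following sketch with `requests` and BeautifulSoup (not the parser's exact code):
+
+```python
+import requests
+from requests.adapters import HTTPAdapter
+from urllib3.util.retry import Retry
+from bs4 import BeautifulSoup
+
+def fetch_page_text(url: str, retries: int = 3) -> str:
+    """Illustrative sketch: fetch a page with a persistent session plus retries, then strip tags."""
+    session = requests.Session()
+    retry = Retry(total=retries, backoff_factor=0.5,
+                  status_forcelist=[429, 500, 502, 503, 504])
+    session.mount("https://", HTTPAdapter(max_retries=retry))
+    session.mount("http://", HTTPAdapter(max_retries=retry))
+    response = session.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
+    response.raise_for_status()
+    soup = BeautifulSoup(response.text, "html.parser")
+    for tag in soup(["script", "style"]):
+        tag.decompose()  # drop non-content elements before extracting text
+    return soup.get_text(separator="\n", strip=True)
+```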
+
+## 📁 Project Layout
+
+```
+.
+├── data/              # Where docs live
+│   ├── pdf/           # PDF documents 
+│   ├── html/          # HTML files
+│   ├── youtube/       # YouTube transcript stuff
+│   ├── docx/          # Word documents
+│   ├── ppt/           # PowerPoint slides
+│   ├── txt/           # Plain text
+│   └── output/        # Where the magic happens (output)
+│
+├── src/               # The code that makes it tick
+│   ├── parsers/       # All our parser implementations
+│   │   ├── pdf_parser.py     # PDF -> text
+│   │   ├── html_parser.py    # HTML/web -> text
+│   │   ├── youtube_parser.py # YouTube -> text
+│   │   ├── docx_parser.py    # Word -> text
+│   │   ├── ppt_parser.py     # PowerPoint -> text
+│   │   ├── txt_parser.py     # Text -> text (not much to do here)
+│   │   └── __init__.py
+│   ├── __init__.py
+│   ├── main.py        # CLI entry point
+│   └── generate_qa.py # Creates Q&A pairs from text
+│
+└── README.md          # You are here
+```
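+
+Since every parser produces plain text, the main job of `main.py` is picking the right parser for a given input. A hypothetical routing sketch (the mapping is inferred from the layout above, not copied from the code):
+
+```python
+# Hypothetical dispatch: choose a parser module based on the input's extension or URL shape.
+from pathlib import Path
+
+PARSER_BY_SUFFIX = {
+    ".pdf": "pdf_parser", ".html": "html_parser", ".docx": "docx_parser",
+    ".pptx": "ppt_parser", ".txt": "txt_parser",
+}
+
+def pick_parser(source: str) -> str:
+    if "youtube.com" in source or "youtu.be" in source:
+        return "youtube_parser"
+    if source.startswith(("http://", "https://")):
+        return "html_parser"
+    return PARSER_BY_SUFFIX.get(Path(source).suffix.lower(), "txt_parser")
+```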
+
+## 🤖 Generate Q&A Pairs for Fine-tuning
+
+Want to turn your documents into training data? After parsing, use the `generate_qa.py` script to create question-answer pairs using the Cerebras LLM API:
+
+```bash
+# Set your API key first
+export CEREBRAS_API_KEY="your_key_here"
+
+# Generate QA pairs from a document in three steps:
+# 1. Summarize the document
+# 2. Generate question-answer pairs
+# 3. Evaluate & filter based on relevance
+python src/generate_qa.py docs/report.pdf
+
+# Customize the generation
+python src/generate_qa.py docs/report.pdf --num-pairs 30 --threshold 7.0
+
+# Skip parsing if you already have text
+python src/generate_qa.py docs/report.pdf --text-file data/output/report.txt
+
+# Save output to a specific directory
+python src/generate_qa.py docs/report.pdf --output-dir training_data/
+```
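+
+The `--threshold` flag drives the evaluate-and-filter step: each generated pair receives a relevance score, and only pairs at or above the threshold are kept. Conceptually it works like this sketch (not `generate_qa.py`'s actual code; the 0-10 score scale is an assumption based on the sample metrics below):
+
+```python
+# Conceptual sketch of the filter step: keep pairs whose relevance score meets the threshold.
+def filter_pairs(scored_pairs, threshold=7.0):
+    """scored_pairs: list of (qa_pair, relevance_score); scores assumed to be on a 0-10 scale."""
+    kept = [pair for pair, score in scored_pairs if score >= threshold]
+    metrics = {
+        "initial_pairs": len(scored_pairs),
+        "filtered_pairs": len(kept),
+        "retention_rate": round(len(kept) / len(scored_pairs), 2) if scored_pairs else 0.0,
+    }
+    return kept, metrics
+```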
+
+### 📊 Sample Output
+
+The script outputs a JSON file with:
+- A comprehensive summary of the document
+- Question-answer pairs in a format ready for fine-tuning
+- Quality metrics about the generated pairs
+
+```jsonc
+{
+  "summary": "This document explains the principles of deep learning...",
+  
+  "qa_pairs": [
+    {"from": "user", "value": "What are the key components of a neural network?"},
+    {"from": "assistant", "value": "The key components of a neural network include..."},
+    // More Q&A pairs...
+  ],
+  
+  "metrics": {
+    "initial_pairs": 25,        // Total generated pairs
+    "filtered_pairs": 19,       // Pairs that passed quality check
+    "retention_rate": 0.76,     // Percentage kept
+    "average_relevance_score": 7.8  // Average quality score
+  }
+}
+```
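+
+Because the `qa_pairs` use a conversation-style `from`/`value` layout, turning the file into prompt-response records for fine-tuning is a few lines of JSON handling. A small loading sketch (the file path is a placeholder; use whatever `generate_qa.py` wrote for your document):
+
+```python
+# Group the from/value turns into prompt-response records; the path below is just an example.
+import json
+
+with open("training_data/report_qa.json", encoding="utf-8") as f:
+    data = json.load(f)
+
+turns = data["qa_pairs"]
+pairs = [
+    {"prompt": turns[i]["value"], "response": turns[i + 1]["value"]}
+    for i in range(0, len(turns) - 1, 2)
+    if turns[i]["from"] == "user" and turns[i + 1]["from"] == "assistant"
+]
+print(f"{len(pairs)} prompt-response pairs ready for fine-tuning")
+```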
+
+## Known bugs/sharp edges:
+
+- PDFs: Some PDFs are scanned images and need OCR. This is left as homework for users :) (see the sketch below)
+- YouTube: We assume videos have captions; if they don't, that's another task for readers :)
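+
+If you do need to handle scanned PDFs, one common route is to rasterize each page and OCR it. A sketch using pdf2image and pytesseract (neither is part of this toolkit's dependencies, and both need the poppler and Tesseract system packages installed):
+
+```python
+# Optional OCR fallback for scanned PDFs; requires pdf2image and pytesseract,
+# plus the poppler and Tesseract binaries on the system. Not part of this toolkit.
+from pdf2image import convert_from_path
+import pytesseract
+
+def ocr_pdf(pdf_path: str) -> str:
+    pages = convert_from_path(pdf_path)  # rasterize each page to a PIL image
+    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)
+```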
+
+
+--------
+
+## WIP BFCL FT:
 
 ### Instructions to run:
 
@@ -35,4 +205,4 @@ Setup:
 - configs: Has the config prompts for creating synthetic data using `3.3`
 - data_prep/scripts: This is what you would like to run to prepare your datasets for annotation
 - scripts/annotation-inference: Script for generating synthetic datasets -> Use the vllm script for inference
-- fine-tuning: configs for FT using TorchTune
+- fine-tuning: configs for FT using TorchTune