# Data Prep Toolkit

If you are working on fine-tuning a Large Language Model, the biggest effort is usually preparing the dataset.

## What does this tool do?

This tool contains a set of utility functions that make it easy to load a dataset into torchtune, with helpers to:

- **Parse files**: Convert any supported file type (list below) to txt
- **Convert to prompt-response pairs**: Turn the txt files into prompt-response pairs
- **Filter**: Remove low-quality prompt-response pairs
- **Fine-tune**: Use the pre-defined configs to fine-tune an LLM using torchtune

TODO: Add TT links

## Parsers: (WIP)

We support the following file formats for parsing:

- **PDF** - Extracts text from most PDF files (text-based, not scanned)
- **HTML/Web** - Pulls content from local HTML files or any webpage URL
- **YouTube** - Grabs video transcripts (does not transcribe video)
- **Word (DOCX)** - Extracts text, tables, and structure from Word docs (not images)
- **PowerPoint (PPTX)** - Pulls text from slides, notes, and tables
- **TXT** - For when you already have plain text (just copies it)

## Installation:

TODO: Supply requirements.txt file here instead

### Dependencies

```bash
# Install all dependencies at once
pip install PyPDF2 python-docx beautifulsoup4 requests python-pptx yt-dlp youtube-transcript-api
```

### Steps to run:

TODO: Add links here

1. Clone the repo (TODO: link)
2. Install dependencies (TODO: requirements.txt)
3. Run the parser on your files
4. Create prompt-response pairs
5. Filter out low-quality pairs
6. Load the dataset into torchtune

## How to use:

```bash
# Parse a PDF (outputs to data/output/document.txt)
python src/main.py docs/report.pdf

# Parse a website
python src/main.py https://en.wikipedia.org/wiki/Artificial_intelligence

# Get YouTube video transcripts
python src/main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Custom output location
python src/main.py docs/presentation.pptx -o my_training_data/

# Specify the output filename
python src/main.py docs/contract.docx -n legal_text_001.txt

# Use verbose mode for debugging
python src/main.py weird_file.pdf -v
```

All outputs are saved as UTF-8 `.txt` files in `data/output/` unless otherwise set.

### Rough edges:

- Use quotes around YouTube URLs to avoid shell issues with the `&` character
- HTML parser works with both local files and web URLs
  - Enhanced with session persistence and retry mechanisms
  - Some sites with strong bot protection may still block access
- YouTube parser automatically extracts both manual and auto-generated captions
  - Prioritizes manual captions when available for better quality
  - Formats transcripts cleanly with proper text structure
- PDF extraction works best with digital PDFs, not scanned documents
- All parsers include error handling to gracefully manage parsing failures
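For reference, the sketch below shows roughly what text-based PDF extraction looks like using PyPDF2 (one of the dependencies listed above) and how a UTF-8 output file ends up in `data/output/`. This is only an illustration of the approach; the toolkit's own `pdf_parser.py` may work differently, and the `pdf_to_txt` helper name is made up for this example.

```python
# Minimal sketch of text-based PDF extraction with PyPDF2.
# Not the toolkit's actual pdf_parser.py; just the underlying idea.
from pathlib import Path

from PyPDF2 import PdfReader


def pdf_to_txt(pdf_path: str, output_dir: str = "data/output") -> Path:
    """Extract text from a digital (non-scanned) PDF and save it as UTF-8 .txt."""
    reader = PdfReader(pdf_path)
    # Scanned PDFs have no embedded text layer, so extract_text() returns
    # little or nothing for them; those need OCR instead.
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    out_path = Path(output_dir) / (Path(pdf_path).stem + ".txt")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return out_path


if __name__ == "__main__":
    print(f"Wrote {pdf_to_txt('docs/report.pdf')}")
```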
## 📁 Project Layout

```
.
├── data/                      # Where docs live
│   ├── pdf/                   # PDF documents
│   ├── html/                  # HTML files
│   ├── youtube/               # YouTube transcript stuff
│   ├── docx/                  # Word documents
│   ├── ppt/                   # PowerPoint slides
│   ├── txt/                   # Plain text
│   └── output/                # Where the magic happens (output)
│
├── src/                       # The code that makes it tick
│   ├── parsers/               # All our parser implementations
│   │   ├── pdf_parser.py      # PDF -> text
│   │   ├── html_parser.py     # HTML/web -> text
│   │   ├── youtube_parser.py  # YouTube -> text
│   │   ├── docx_parser.py     # Word -> text
│   │   ├── ppt_parser.py      # PowerPoint -> text
│   │   ├── txt_parser.py      # Text -> text (not much to do here)
│   │   └── __init__.py
│   ├── __init__.py
│   ├── main.py                # CLI entry point
│   └── generate_qa.py         # Creates Q&A pairs from text
│
└── README.md                  # You are here
```

## Generate QA:

After parsing your documents, the next step is to convert them into QA pairs. Use the `generate_qa.py` script to create QA pairs with the Cerebras LLM API:

```bash
# Set your API key first
export CEREBRAS_API_KEY="your_key_here"

# This happens in 3 steps:
# 1. Summarize the doc
# 2. Generate QA
# 3. Evaluate & filter based on relevance
python src/generate_qa.py docs/report.pdf

# Customize the generation
python src/generate_qa.py docs/report.pdf --num-pairs 30 --threshold 7.0

# Skip parsing if you already have text
python src/generate_qa.py docs/report.pdf --text-file data/output/report.txt

# Save output to a specific directory
python src/generate_qa.py docs/report.pdf --output-dir training_data/
```

## Known bugs/sharp edges:

- PDFs: Some PDFs are scanned images and need OCR. This is homework for users :)
- YouTube: We assume videos have captions; if they don't, that's another task for readers :)

--------

## WIP BFCL FT:

### Instructions to run:

Time to set up: ~20-30 minutes

- Grab your Hugging Face and wandb.ai API keys; you will need both
- Steps to install:

```bash
conda create -n test-ft python=3.10
conda activate test-ft
pip install --pre torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip install --pre torchtune --extra-index-url https://download.pytorch.org/whl/nightly/cpu --no-cache-dir
pip install transformers datasets wandb
pip install "huggingface_hub[cli]"
huggingface-cli login
wandb login
git clone https://github.com/meta-llama/llama-cookbook/
cd llama-cookbook/
git checkout data-tool
cd end-to-end-use-cases/data-tool/scripts/finetuning
tune download meta-llama/Meta-Llama-3.1-70B-Instruct --output-dir /tmp/Meta-Llama-3.1-70B-Instruct --ignore-patterns "original/consolidated*"
tune run --nproc_per_node 8 full_finetune_distributed --config ft-config.yaml
```

The end goal for this effort is to serve as a fine-tuning data preparation kit.

## Current status:

Currently (WIP), I'm evaluating the idea of improving tool-calling datasets.

Setup:

- configs: Has the config prompts for creating synthetic data using `3.3`
- data_prep/scripts: Run these to prepare your datasets for annotation
- scripts/annotation-inference: Script for generating synthetic datasets -> use the vllm script for inference (see the sketch below)
- fine-tuning: Configs for fine-tuning using torchtune
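As a rough illustration of the annotation-inference step, batched offline generation with vLLM looks roughly like the sketch below. The model ID, prompts, and sampling settings here are placeholders, not necessarily what `scripts/annotation-inference` actually uses; check the script in the repo for the real invocation.

```python
# Rough sketch of batched synthetic-data generation with vLLM's offline API.
# Model ID, prompts, and sampling settings are placeholders.
from vllm import LLM, SamplingParams

prompts = [
    "Generate a tool-calling example for a weather API.",
    "Generate a tool-calling example for a calendar API.",
]

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

# Loads the model once, then batches all prompts through it.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```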