If you are working on fine-tuning a Large Language Model, the biggest effort is usually preparing the dataset.
This tool provides utility functions that make it easier to load a dataset into torchtune, with helpers:
TODO: Add TT links
(WIP) We support the following file formats for parsing: PDF, HTML/web pages, YouTube transcripts, Word (DOCX), PowerPoint, and plain text (TXT).
TODO: Supply requirements.txt file here instead
# Install all dependencies at once
pip install PyPDF2 python-docx beautifulsoup4 requests python-pptx yt-dlp youtube-transcript-api
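Several of these packages are imported under a different name than the one used with pip, so a failed install can be hard to spot. The following is a minimal sketch of a check, not part of the tool itself; the helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec

# pip package name -> import name (they differ for several packages)
REQUIRED = {
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "beautifulsoup4": "bs4",
    "requests": "requests",
    "python-pptx": "pptx",
    "yt-dlp": "yt_dlp",
    "youtube-transcript-api": "youtube_transcript_api",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All parser dependencies are installed.")
```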
You can run these steps separately or combined (parsing + QA generation):
# STEP 1: PARSING - Extract text from documents
python src/main.py docs/report.pdf
python src/main.py URL  # pass the URL without quotes
python src/main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
python src/main.py docs/presentation.pptx -o my_training_data/
python src/main.py docs/contract.docx -n legal_text_001.txt
# Entire pipeline together (parsing + QA generation)
export CEREBRAS_API_KEY="your_key_here"
python src/main.py docs/report.pdf --generate-qa
python src/main.py docs/report.pdf --generate-qa --qa-pairs 50 --qa-threshold 8.0 --qa-model "llama-3.1-70b"
All outputs are saved as UTF-8 .txt files in data/output/ unless another location is specified.
& character.
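The `-o` and `-n` flags in the examples above suggest that the output directory and file name can each be overridden. A minimal sketch of how such a CLI might resolve the output path follows. The function names, the `dest` names, and the default "source stem plus .txt" naming rule are assumptions for illustration, not the tool's confirmed behavior.

```python
import argparse
from pathlib import Path

DEFAULT_OUTPUT_DIR = Path("data/output")  # default location stated in this README


def build_arg_parser():
    """CLI with the two output flags used in the examples (-o and -n)."""
    parser = argparse.ArgumentParser(description="Parse a document into plain text")
    parser.add_argument("source", help="path to a local document")
    parser.add_argument("-o", dest="output_dir", default=None, help="output directory")
    parser.add_argument("-n", dest="name", default=None, help="output file name")
    return parser


def resolve_output_path(source, output_dir=None, name=None):
    """Combine the optional overrides with the defaults into one path."""
    directory = Path(output_dir) if output_dir else DEFAULT_OUTPUT_DIR
    if name is None:
        # Assumption: the default file name is the source's stem plus ".txt".
        # URLs would need separate handling; this sketch covers local files only.
        name = Path(source).stem + ".txt"
    return directory / name


def save_text(text, path):
    """Write the extracted text as UTF-8, creating the directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

With these assumptions, `docs/report.pdf` with no flags would map to `data/output/report.txt`.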
├── data/              # Where docs live
│   ├── pdf/           # PDF documents 
│   ├── html/          # HTML files
│   ├── youtube/       # YouTube transcript stuff
│   ├── docx/          # Word documents
│   ├── ppt/           # PowerPoint slides
│   ├── txt/           # Plain text
│   └── output/        # Where the magic happens (output)
│
├── src/               # The code that makes it tick
│   ├── parsers/       # All our parser implementations
│   │   ├── pdf_parser.py     # PDF -> text
│   │   ├── html_parser.py    # HTML/web -> text
│   │   ├── youtube_parser.py # YouTube -> text
│   │   ├── docx_parser.py    # Word -> text
│   │   ├── ppt_parser.py     # PowerPoint -> text
│   │   ├── txt_parser.py     # Text -> text (not much to do here)
│   │   └── __init__.py
│   ├── __init__.py
│   ├── main.py        # CLI entry point
│   └── generate_qa.py # Creates Q&A pairs from text
│
└── README.md
To run only the QA pair generation separately:
export CEREBRAS_API_KEY="your_key_here"
python src/generate_qa.py docs/report.pdf
python src/generate_qa.py docs/report.pdf --num-pairs 30 --threshold 7.0
python src/generate_qa.py docs/report.pdf --text-file data/output/report.txt
python src/generate_qa.py docs/report.pdf --output-dir training_data/
python src/generate_qa.py docs/report.pdf --model llama-3.1-70b-instruct
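The `--threshold` flag implies that each generated pair receives a quality rating and that low-rated pairs are dropped. A minimal sketch of that filtering step is shown here. The `QAPair` class, its field names, and the "keep ratings at or above the threshold" rule are assumptions; the README does not state the rating scale or whether the comparison is inclusive.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str
    rating: float  # hypothetical quality score assigned by a rating step


def filter_qa_pairs(pairs, threshold=7.0):
    """Keep pairs rated at or above the threshold, highest rated first."""
    kept = [pair for pair in pairs if pair.rating >= threshold]
    return sorted(kept, key=lambda pair: pair.rating, reverse=True)
```

Sorting by rating is a convenience for manual review of the best pairs; it does not affect which pairs are kept.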
Here's how the document processing and QA generation pipeline works:
graph TD
    A[Document/URL] --> B[main.py]
    B --> C[src/parsers]
    C --> D[PDF Parser]
    C --> E[HTML Parser]
    C --> F[YouTube Parser]
    C --> G[DOCX Parser]
    C --> H[PPT Parser]
    C --> I[TXT Parser]
    D & E & F & G & H & I --> J[Extracted Text]
    AAA[Taxonomy] --> BBB[Llama]
    AAB[Target Domain] --> BBB[Llama]
    AAC[Seed Examples] --> BBB[Llama]
    BBB --> M[Data Generation]
    
    J --> K[Generate_QA]
    J --> KKK[Synthetic Reasoning]
    J --> KK[DPO Formatting]
    J --> KLL[GRPO Formatting]
    J --> KL[Summarization Formatting]
    J --> KM[Alpaca Format]
    K --> M[Data Generation]
    KKK --> M[Data Generation]
    KK --> M[Data Generation]
    KLL --> M[Data Generation]
    KL --> M[Data Generation]
    KM --> M[Data Generation]
    
    M --> N[Document Summary]
    M --> O[Generate QA Pairs]
    M --> P[Rate & Filter QA Pairs]
    
    N & O & P --> Q[JSON Output]
    Q --> R[QA Pairs for Fine-tuning]
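The first half of the diagram, from input to extracted text, amounts to picking a parser based on the input type. A minimal sketch of that dispatch step follows, together with a stand-in parser that has the `.parse()` and `.save()` methods this README mentions. The `detect_kind` function, the extension table, and the URL rules are assumptions for illustration, not the actual logic in `main.py`.

```python
from pathlib import Path

# Hypothetical mapping from file extension to parser kind
EXTENSION_TO_KIND = {
    ".pdf": "pdf",
    ".html": "html",
    ".htm": "html",
    ".docx": "docx",
    ".pptx": "ppt",
    ".txt": "txt",
}


def detect_kind(source):
    """Decide which parser should handle a file path or URL."""
    lowered = source.lower()
    if lowered.startswith(("http://", "https://")):
        # Assumption: YouTube links go to the YouTube parser, other URLs to HTML
        if "youtube.com" in lowered or "youtu.be" in lowered:
            return "youtube"
        return "html"
    suffix = Path(lowered).suffix
    if suffix not in EXTENSION_TO_KIND:
        raise ValueError(f"Unsupported input: {source}")
    return EXTENSION_TO_KIND[suffix]


class TxtParser:
    """Stand-in for a parser exposing .parse() and .save()."""

    def parse(self, source):
        # Plain text needs no conversion, only decoding
        return Path(source).read_text(encoding="utf-8")

    def save(self, text, output_path):
        output_path = Path(output_path)
        output_path.parent.mkdir(parents=True, exist_ok=True)
        output_path.write_text(text, encoding="utf-8")
        return output_path
```

Keeping the dispatch in a single table means a new format only needs one new entry and one new parser class.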
main.py: Entry point for document parsing. It uses the parsers in src/parsers/ and calls generate_qa.py when the --generate-qa flag is used.
generate_qa.py: Creates QA pairs from parsed text. It uses QAGenerator from src/utils/ and can also be run from main.py.
src/utils/qa_generator.py: Core QA generation logic
src/parsers/: Document format-specific parsers
Each parser implements .parse() and .save() methods.
Time to set up: ~20-30 minutes:
conda create -n test-ft python=3.10
conda activate test-ft
pip install --pre torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip install --pre torchtune --extra-index-url https://download.pytorch.org/whl/nightly/cpu --no-cache-dir
pip install transformers datasets wandb
pip install -U "huggingface_hub[cli]"
huggingface-cli login
wandb login
git clone https://github.com/meta-llama/llama-cookbook/
cd llama-cookbook/
git checkout data-tool
cd end-to-end-use-cases/data-tool/scripts/finetuning
tune download meta-llama/Meta-Llama-3.1-70B-Instruct --output-dir /tmp/Meta-Llama-3.1-70B-Instruct --ignore-patterns "original/consolidated*"
tune run --nproc_per_node 8 full_finetune_distributed --config ft-config.yaml
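Before a fine-tuning run, the generated QA pairs need to be in a format the training config can read. The pipeline diagram lists an Alpaca format as one target, so a minimal conversion sketch is given here. It assumes the QA output is a JSON list of objects with `question` and `answer` keys; that schema and both function names are hypothetical and not confirmed by this README.

```python
import json


def qa_to_alpaca(qa_pairs):
    """Map question/answer dicts to Alpaca-style instruction records."""
    return [
        {"instruction": pair["question"], "input": "", "output": pair["answer"]}
        for pair in qa_pairs
    ]


def convert_file(src_path, dst_path):
    """Read a QA JSON file and write the Alpaca-style version next to it."""
    with open(src_path, encoding="utf-8") as src:
        qa_pairs = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(qa_to_alpaca(qa_pairs), dst, ensure_ascii=False, indent=2)
```

The `input` field is left empty because a QA pair carries no separate context; if the source passage should be included, it could go there instead.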
The end goal of this effort is to serve as a fine-tuning data preparation kit.
Currently (WIP), I'm evaluating ideas for improving tool-calling datasets.
Setup:
3.3