If you are working on fine-tuning a Large Language Model, the biggest effort is usually preparing the dataset. This tool contains a set of utility functions that make it easy to prepare a dataset and load it into torchtune, with helpers:
TODO: Add TT links
(WIP) We support the following file formats for parsing: PDF, DOCX, PPTX, HTML / web URLs, YouTube transcripts, and plain text (TXT).
TODO: Supply requirements.txt file here instead
# Install all dependencies at once
pip install PyPDF2 python-docx beautifulsoup4 requests python-pptx yt-dlp youtube-transcript-api
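There is no requirements.txt in the repo yet (see the TODO above); a minimal one mirroring the packages listed would look like this, with versions left unpinned as an assumption:

# requirements.txt (hypothetical; mirrors the pip install above)
PyPDF2
python-docx
beautifulsoup4
requests
python-pptx
yt-dlp
youtube-transcript-api

# Then install with:
pip install -r requirements.txt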
You can run these steps separately or combined (parsing + QA generation):
# STEP 1: PARSING - Extract text from documents
# Parse a PDF (outputs to data/output/document.txt)
python src/main.py docs/report.pdf
# Parse a website
python src/main.py https://en.wikipedia.org/wiki/Artificial_intelligence
# Get YouTube video transcripts
python src/main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
# Custom output location
python src/main.py docs/presentation.pptx -o my_training_data/
# Specify the output filename
python src/main.py docs/contract.docx -n legal_text_001.txt
# Use verbose mode for debugging
python src/main.py weird_file.pdf -v
# COMBINED WORKFLOW - Parse and generate QA pairs in one step
# Set your API key first
export CEREBRAS_API_KEY="your_key_here"
# Parse a document and generate QA pairs automatically
python src/main.py docs/report.pdf --generate-qa
# Parse with custom QA settings
python src/main.py docs/report.pdf --generate-qa --qa-pairs 50 --qa-threshold 8.0 --qa-model "llama-3.1-70b"
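If you have a whole folder of documents, you can drive the same CLI from a small shell loop; the docs/ path and the number of pairs below are just example values:

# Parse every PDF in docs/ and generate QA pairs for each one (example paths/values)
export CEREBRAS_API_KEY="your_key_here"
for f in docs/*.pdf; do
    python src/main.py "$f" --generate-qa --qa-pairs 25
done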
All outputs are saved as UTF-8 .txt files in data/output/ unless otherwise set.
├── data/                      # Where docs live
│   ├── pdf/                   # PDF documents
│   ├── html/                  # HTML files
│   ├── youtube/               # YouTube transcripts
│   ├── docx/                  # Word documents
│   ├── ppt/                   # PowerPoint slides
│   ├── txt/                   # Plain text
│   └── output/                # Where the magic happens (output)
│
├── src/                       # The code that makes it tick
│   ├── parsers/               # All our parser implementations
│   │   ├── pdf_parser.py      # PDF -> text
│   │   ├── html_parser.py     # HTML/web -> text
│   │   ├── youtube_parser.py  # YouTube -> text
│   │   ├── docx_parser.py     # Word -> text
│   │   ├── ppt_parser.py      # PowerPoint -> text
│   │   ├── txt_parser.py      # Text -> text (not much to do here)
│   │   └── __init__.py
│   ├── __init__.py
│   ├── main.py                # CLI entry point
│   └── generate_qa.py         # Creates Q&A pairs from text
│
└── README.md                  # You are here
After parsing your documents, transform them into high-quality QA pairs for LLM fine-tuning using the Cerebras API:
# Set your API key first
export CEREBRAS_API_KEY="your_key_here"
# Generate QA pairs in 3 steps:
# 1. Generate document summary
# 2. Create question-answer pairs from content
# 3. Rate and filter pairs based on quality
python src/generate_qa.py docs/report.pdf
# Customize the generation
python src/generate_qa.py docs/report.pdf --num-pairs 30 --threshold 7.0
# Skip parsing if you already have text
python src/generate_qa.py docs/report.pdf --text-file data/output/report.txt
# Save output to a specific directory
python src/generate_qa.py docs/report.pdf --output-dir training_data/
# Use a different model
python src/generate_qa.py docs/report.pdf --model llama-3.1-70b-instruct
The QA generation pipeline follows three steps: it first generates a document summary, then creates question-answer pairs from the content, and finally rates and filters the pairs based on quality.
The script outputs a JSON file with:
{
"summary": "Comprehensive document summary...",
"qa_pairs": [
// All generated pairs
{"question": "What is X?", "answer": "X is..."}
],
"filtered_pairs": [
// Only high-quality pairs
{"question": "What is X?", "answer": "X is...", "rating": 9}
],
"conversations": [
// Ready-to-use conversation format for fine-tuning
[
{"role": "system", "content": "You are a helpful AI assistant..."},
{"role": "user", "content": "What is X?"},
{"role": "assistant", "content": "X is..."}
]
],
"metrics": {
"total": 25, // Total generated pairs
"filtered": 19, // Pairs that passed quality check
"retention_rate": 0.76, // Percentage kept
"avg_score": 7.8 // Average quality score
}
}
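The conversations field is already in a chat-style role/content format. If your fine-tuning stack expects one conversation per line (JSONL), a small standard-library script like the one below converts the output; the file names are placeholders, and the script itself is a sketch rather than part of this repo:

# convert_to_messages.py (hypothetical helper, not shipped with the tool)
# Usage: python convert_to_messages.py <qa_output.json> <conversations.jsonl>
import json
import sys

src, dst = sys.argv[1], sys.argv[2]

with open(src, encoding="utf-8") as f:
    data = json.load(f)

# Write one {"messages": [...]} object per line for each generated conversation.
with open(dst, "w", encoding="utf-8") as f:
    for conversation in data["conversations"]:
        f.write(json.dumps({"messages": conversation}, ensure_ascii=False) + "\n")

print(f"Wrote {len(data['conversations'])} conversations to {dst}")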
Here's how the document processing and QA generation pipeline works:
graph TD
A[Document/URL] --> B[main.py]
B --> C[src/parsers]
C --> D[PDF Parser]
C --> E[HTML Parser]
C --> F[YouTube Parser]
C --> G[DOCX Parser]
C --> H[PPT Parser]
C --> I[TXT Parser]
D & E & F & G & H & I --> J[Extracted Text]
J --> K[generate_qa.py]
K --> L[src/utils]
L --> M[QAGenerator]
M --> N[Document Summary]
M --> O[Generate QA Pairs]
M --> P[Rate & Filter QA Pairs]
N & O & P --> Q[JSON Output]
Q --> R[QA Pairs for Fine-tuning]
Q --> S["Summarization {WIP}"]
Q --> T["DPO Fine-tuning {WIP}"]
Q --> U["Alpaca Format {WIP}"]
main.py: Entry point for document parsing; dispatches to the parsers in src/parsers/ and calls generate_qa.py when using the --generate-qa flag
generate_qa.py: Creates QA pairs from parsed text using the QAGenerator from src/utils/; can also be invoked from main.py
src/utils/qa_generator.py: Core QA generation logic
src/parsers/: Document format-specific parsers, each exposing .parse() and .save() methods
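To illustrate that interface (this is a sketch, not the repo's actual code; only the .parse() and .save() method names come from this README), a parser can be pictured roughly like this:

# Hypothetical parser sketch; the class name and signatures are assumptions.
class TXTParser:
    def parse(self, path: str) -> str:
        # Read the source document and return its plain-text content.
        with open(path, encoding="utf-8") as f:
            return f.read()

    def save(self, text: str, output_path: str) -> None:
        # Write the extracted text as a UTF-8 .txt file, matching the tool's output.
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(text)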
Time to setup: ~20-30 minutes:
conda create -n test-ft python=3.10
conda activate test-ft
pip install --pre torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip install --pre torchtune --extra-index-url https://download.pytorch.org/whl/nightly/cpu --no-cache-dir
pip install transformers datasets wandb
pip install -U "huggingface_hub[cli]"
huggingface-cli login
wandb login
git clone https://github.com/meta-llama/llama-cookbook/
cd llama-cookbook/
git checkout data-tool
cd end-to-end-use-cases/data-tool/scripts/finetuning
tune download meta-llama/Meta-Llama-3.1-70B-Instruct --output-dir /tmp/Meta-Llama-3.1-70B-Instruct --ignore-patterns "original/consolidated*"
tune run --nproc_per_node 8 full_finetune_distributed --config ft-config.yaml
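If you want the run to train directly on the conversations generated by this tool, the dataset block of ft-config.yaml might look roughly like the sketch below. This assumes a recent torchtune version with torchtune.datasets.chat_dataset and the JSONL produced by the conversion snippet earlier; the path is a placeholder, and this is not the config shipped with the repo:

# ft-config.yaml (dataset section only; a sketch, not the shipped config)
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  data_files: data/output/report_conversations.jsonl  # placeholder path
  conversation_column: messages
  conversation_style: openai
  split: train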
The end goal for this effort is to serve as a fine-tuning data preparation kit.
Currently (WIP), I'm evaluating extending the tool to improve tool-calling datasets.