This recipe demonstrates how to fine-tune the Llama 3.2 11B Vision model on a synthetic W-2 tax form dataset for structured data extraction. The tutorial compares LoRA (Low-Rank Adaptation) with full parameter fine-tuning, evaluating their trade-offs in accuracy, memory consumption, and compute requirements.
W-2 extraction accuracy on the held-out test set, comparing the bf16 baseline, LoRA, full parameter fine-tuning (FPFT), and an int4-quantized FPFT checkpoint:

| Benchmark | 11B bf16 (Baseline) | LoRA | FPFT | FPFT int4 |
|---|---|---|---|---|
| W2 extraction acc | 58 | 72 | 97 | 96 |
General-capability benchmarks:

| Benchmark | 11B bf16 (Baseline) | LoRA | FPFT | FPFT int4 |
|---|---|---|---|---|
| bfclv3 | 39.87 | 39.87 | 39.85 | 34.67 |
| docvqa | 86.88 | 85.08 | 86.3 | 78.95 |
| gpqa-cot-diamond | 27.78 | 27.78 | 26 | 28 |
| ifeval | 74.79 | 74.78 | 74.54 | 74.42 |
| mmlu-pro-cot | 48.43 | 48.13 | 48.33 | 46.14 |
Additional benchmarks run with lm-eval (gsm8k and chartqa), baseline vs. FPFT:

| Benchmark | 11B bf16 (Baseline) | FPFT |
|---|---|---|
| gsm8k_cot_llama_strict | 85.29 | 85.29 |
| gsm8k_cot_llama_flexible | 85.44 | 85.44 |
| chartqa llama exact | 0 | 0 |
| chartqa llama relaxed | 34.16 | 35.58 |
| chartqa llama anywhere | 43.53 | 46.52 |
git clone git@github.com:meta-llama/llama-cookbook.git
cd llama-cookbook/getting-started/finetuning/vision
conda create -n image-ft python=3.12 -y
conda activate image-ft
pip install torchtune torch torchvision torchaudio torchao bitsandbytes transformers==4.51.1 accelerate vllm==0.9.2 lm_eval wandb
Install torchtune nightly for the latest vision model support:
pip install --pre --upgrade torchtune --extra-index-url https://download.pytorch.org/whl/nightly/cpu
Important: Log in to your HuggingFace account to download the model and datasets:
huggingface-cli login
The dataset contains 2,000 examples of synthetic W-2 forms with three splits: train (1,800), test (100), and validation (100). For this use case, we found that a smaller training split (30% train, 70% test) provided sufficient improvement while leaving more examples for a more comprehensive evaluation.
Run the preparation script:
python prepare_w2_dataset.py --train-ratio 0.3
This creates a new dataset directory: fake_w2_us_tax_form_dataset_train30_test70
Configuration Note: If you change the train ratio, update the dataset.data_files.train path in the corresponding YAML configuration files.
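To sanity-check the prepared split before training, you can inspect a few records. The sketch below assumes the preparation script saves the splits with `datasets.save_to_disk` (consistent with the `--dataset .../test` paths used later); if it writes plain JSON files instead, load them with the `json` module and adjust the path accordingly:

```python
# Quick inspection of the prepared test split (assumption: splits were written
# with datasets.save_to_disk; the field names are whatever the script produced).
from datasets import load_from_disk

ds = load_from_disk("fake_w2_us_tax_form_dataset_train30_test70/test")
print(ds)            # row count and column names
print(ds[0].keys())  # e.g. the form image and the target JSON string
```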
Download the base Llama 3.2 11B Vision model:
tune download meta-llama/Llama-3.2-11B-Vision-Instruct --output-dir Llama-3.2-11B-Vision-Instruct
This downloads to the expected directory structure used in the provided YAML files. If you change the directory, update these keys in the configuration files:
- checkpointer.checkpoint_dir
- tokenizer.path

Before fine-tuning, establish a baseline by evaluating the pre-trained model on the test set.
For single GPU (H100):
CUDA_VISIBLE_DEVICES="0" python -m vllm.entrypoints.openai.api_server --model Llama-3.2-11B-Vision-Instruct/ --port 8001 --max-model-len 65000 --max-num-seqs 10
For multi-GPU setup:
CUDA_VISIBLE_DEVICES="0,1" python -m vllm.entrypoints.openai.api_server --model Llama-3.2-11B-Vision-Instruct/ --port 8001 --max-model-len 65000 --tensor-parallel-size 2 --max-num-seqs 10
python evaluate.py --server_url http://localhost:8001/v1 --model Llama-3.2-11B-Vision-Instruct/ --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
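For reference, each evaluation request is an ordinary OpenAI-style chat completion against the vLLM server. The sketch below is not evaluate.py's actual code; the prompt, image handling, and response parsing are simplified, and `w2_sample.png` is a hypothetical file name:

```python
# Hedged sketch of a single structured-extraction request against the local
# vLLM server (evaluate.py's real prompt and parsing logic differ).
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

with open("w2_sample.png", "rb") as f:  # hypothetical W-2 image
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Llama-3.2-11B-Vision-Instruct/",
    temperature=0,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract every field from this W-2 form as JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```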
The repository includes two pre-configured YAML files:
- 11B_full_w2.yaml: Full parameter fine-tuning configuration
- 11B_lora_w2.yaml: LoRA fine-tuning configuration

Key differences:
- Full Parameter Fine-tuning: updates all model weights, which yields the largest gain on the W-2 task but has the highest memory and compute requirements (the recipe recommends the memory-efficient PagedAdamW8bit optimizer here).
- LoRA Fine-tuning: freezes the base weights and trains only low-rank adapter matrices, which greatly reduces memory use at some cost in task accuracy.
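To make that trade-off concrete, here is a toy sketch of the LoRA idea itself; this is not torchtune's implementation, and the dimensions, rank, and scaling are illustrative only:

```python
# Toy LoRA linear layer: a frozen pretrained weight plus a small trainable
# low-rank update (illustrative only; torchtune's LoRA modules differ).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(4096, 4096)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable:,} of {total:,}")  # ~65K of ~16.8M at rank 8
```

Only the two adapter matrices receive gradients, which is why LoRA needs far less optimizer-state memory than full parameter fine-tuning.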
Before training, update the WandB entity in your YAML files:
metric_logger:
_component_: torchtune.training.metric_logging.WandBLogger
project: llama3_2_w2_extraction
entity: <your_wandb_entity> # Update this
Full Parameter Fine-tuning:
tune run full_finetune_single_device --config 11B_full_w2.yaml
LoRA Fine-tuning:
tune run lora_finetune_single_device --config 11B_lora_w2.yaml
Note: The VQA dataset component in torchtune is pre-configured to handle the multimodal format, eliminating the need for custom preprocessors.
Start a vLLM server with your fine-tuned model:
For LoRA model:
CUDA_VISIBLE_DEVICES="0,1" python -m vllm.entrypoints.openai.api_server --model ./outputs/Llama-3.2-11B-Instruct-w2-lora/epoch_4/ --port 8001 --max-model-len 65000 --tensor-parallel-size 2 --max-num-seqs 10
For full fine-tuned model:
CUDA_VISIBLE_DEVICES="0,1" python -m vllm.entrypoints.openai.api_server --model ./outputs/Llama-3.2-11B-Instruct-w2-full/epoch_4/ --port 8001 --max-model-len 65000 --tensor-parallel-size 2
python evaluate.py --server_url http://localhost:8001/v1 --model ./outputs/Llama-3.2-11B-Instruct-w2-full/epoch_4/ --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
To evaluate larger models, such as Llama 4 Maverick, we leverage cloud providers that serve them, like together.ai. Any OpenAI-compatible provider should work out of the box with our evaluation script, since it uses the OpenAI SDK.
TOGETHER_API_KEY=<your_api_key> python evaluate.py --server_url https://api.together.xyz/v1 --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
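This works because switching providers only changes how the OpenAI client is constructed. The snippet below is a simplified illustration, not the script's actual code:

```python
# Same OpenAI SDK client, different endpoint: local vLLM vs. a hosted provider
# (illustrative only; evaluate.py's internals may differ).
import os
from openai import OpenAI

local_client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
hosted_client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)
```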
Install llama-verifications for standard benchmarks:
pip install llama-verifications
Run benchmark evaluation:
uvx llama-verifications run-benchmarks \
--benchmarks mmlu-pro-cot,gpqa-cot-diamond,bfclv3,docvqa \
--provider http://localhost:8001/v1 \
--model <model_served_name> \
--continue-on-failure \
--max-parallel-generations 100
For additional benchmarks using lm-eval:
With vLLM backend:
CUDA_VISIBLE_DEVICES=0,1 lm_eval --model vllm \
--model_args pretrained=<model_path>,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9 \
--tasks gsm8k_cot_llama \
--batch_size auto \
--seed 4242
With transformers backend:
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch -m lm_eval --model hf-multimodal \
--model_args pretrained=<model_path> \
--tasks chartqa_llama_90 \
--batch_size 16 \
--seed 4242 \
--log_samples \
--output_path results
You can benchmark against the Llama API for comparison:
LLAMA_API_KEY="<your_api_key>" python evaluate.py \
--server_url https://api.llama.com/compa/v1 \
--limit 100 \
--model Llama-4-Maverick-17B-128E-Instruct-FP8 \
--structured \
--max_workers 50 \
--dataset fake_w2_us_tax_form_dataset_train30_test70/test
Data Preparation: Ensure your dataset format matches the expected structure. The preparation script handles common formatting issues.
Configuration Management: Always update paths in YAML files when changing directory structures.
Memory Management: Use the PagedAdamW8bit optimizer for full parameter fine-tuning to reduce memory usage (see the sketch below).
Evaluation Strategy: Evaluate both task-specific and general capabilities to understand trade-offs.
Monitoring: Use WandB for comprehensive training monitoring and comparison.
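As a reference for the PagedAdamW8bit recommendation above: in the recipe the optimizer is selected via the YAML config like the other training options, but the standalone sketch below shows the underlying bitsandbytes API, with a small linear layer standing in for the full model:

```python
# Standalone illustration of the paged 8-bit AdamW optimizer from bitsandbytes
# (in the recipe the optimizer is set in the YAML config; this toy layer is
# only a stand-in for the 11B model).
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-5)

loss = model(torch.randn(8, 4096, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```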
CUDA Out of Memory: Reduce batch size, enable gradient checkpointing, or use LoRA instead of full fine-tuning.
Dataset Path Errors: Verify that dataset paths in YAML files match your actual directory structure.
Model Download Issues: Ensure you're logged into HuggingFace and have access to Llama models.
vLLM Server Connection: Check that the server is running and accessible on the specified port.