# Fine-tuning Llama 3.2 11B Vision for Structured Data Extraction
This recipe demonstrates how to fine-tune the Llama 3.2 11B Vision model on a synthetic W-2 tax form dataset for structured data extraction. The tutorial compares LoRA (Low-Rank Adaptation) and full parameter fine-tuning approaches, evaluating their trade-offs in terms of accuracy, memory consumption, and computational requirements.
## Objectives
- Showcase how to fine-tune and evaluate on a specific extraction use case
- Demonstrate custom benchmarking for structured output tasks
- Compare trade-offs between LoRA and Full Parameter Fine-tuning on both task-specific and general benchmarks
- Provide guidance on data preparation, training configuration, and evaluation methodologies
## Summary
This tutorial lays out the end-to-end process of preparing a sample dataset, benchmarking the baseline model, fine-tuning, and re-evaluating the models after the fine-tuning run. It also demonstrates that full parameter fine-tuning (FPFT) is both feasible and yields better results for specialized vision tasks with small datasets. In torchtune, memory-optimized FPFT challenges the conventional wisdom that LoRA is always more memory efficient, with the caveat that the default "full" configuration does not train the decoder layers.
> **Note:** We tested if this model would improve in other image to json extraction tasks, but we did not observe an improvement in those. Further testing is required to understand what type and diversity of data is required to improve the performance on all json extraction tasks.
## Results
We present the results of the selected benchmarks across relevant domains. The focus was to add a new capability to the model without degrading existing ones, validated by checking that tool calling, visual document and chart understanding, instruction following, and general knowledge remain unchanged.
### Task-Specific Performance (W2 Extraction)
| Benchmark | 11B bf16 (Baseline) | LoRA | FPFT | FPFT int4 |
|---|---|---|---|---|
| W2 extraction acc (%) | 58 | 72 | 98 | 96 |
### General Benchmark Performance (llama-verifications)
| Benchmark | 11B bf16 (Baseline) | LoRA | FPFT | FPFT int4 |
|---|---|---|---|---|
| bfclv3 | 39.87 | 39.87 | 39.85 | 34.67 |
| docvqa | 86.88 | 85.08 | 86.3 | 78.95 |
| gpqa-cot-diamond | 27.78 | 27.78 | 26 | 28 |
| ifeval | 74.79 | 74.78 | 74.54 | 74.42 |
| mmlu-pro-cot | 48.43 | 48.13 | 48.33 | 46.14 |
### LM Evaluation Harness Results
| Benchmark | 11B bf16 (Baseline) | FPFT |
|---|---|---|
| gsm8k_cot_llama_strict | 85.29 | 85.29 |
| gsm8k_cot_llama_flexible | 85.44 | 85.44 |
| chartqa llama exact | 0 | 0 |
| chartqa llama relaxed | 34.16 | 35.58 |
| chartqa llama anywhere | 43.53 | 46.52 |
## Prerequisites
- CUDA-compatible GPU with at least 40GB VRAM
- HuggingFace account with access to Llama models
- Python 3.10+
## Setup
### Environment Creation
```bash
git clone git@github.com:meta-llama/llama-cookbook.git
cd llama-cookbook/getting-started/finetuning/vision
conda create -n image-ft python=3.12 -y
conda activate image-ft
```
### Dependencies Installation
```bash
pip install torchtune torch torchvision torchaudio torchao bitsandbytes transformers==4.51.1 accelerate vllm==0.9.2 lm_eval wandb
```
Install torchtune nightly for the latest vision model support:
```bash
pip install --pre --upgrade torchtune --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```
**Important**: Log in to your HuggingFace account to download the model and datasets:
```bash
huggingface-cli login
```
## Dataset Preparation
The dataset contains 2,000 examples of synthetic W-2 forms with three splits: train (1,800), test (100), and validation (100). For this use case, we found that fewer training examples (30% train, 70% test) provided sufficient improvement while allowing for more comprehensive evaluation.
The preparation script:
1. Reshuffles the train/test splits according to the specified ratio
2. Removes unnecessary JSON structure wrappers from ground truth
3. Adds standardized prompts for training
```bash
python prepare_w2_dataset.py --train-ratio 0.3
```
This creates a new dataset directory: `fake_w2_us_tax_form_dataset_train30_test70`
> **Note**: If you change the train ratio, update the `dataset.data_files.train` path in the corresponding YAML configuration files.
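For reference, here is a minimal sketch of the per-example transformation the script performs. The exact column names, wrapper key, and prompt wording are assumptions for illustration; the authoritative logic lives in `prepare_w2_dataset.py`:

```python
import json

# Hypothetical prompt text; the actual wording is defined in prepare_w2_dataset.py.
PROMPT = "Extract all fields from this W-2 form and return them as a JSON object."

def prepare_example(raw_example: dict) -> dict:
    """Strip the outer JSON wrapper from the ground truth and attach a fixed prompt.

    `raw_example` is assumed to contain an image plus a JSON-encoded ground truth;
    the real column names in the dataset may differ.
    """
    ground_truth = json.loads(raw_example["ground_truth"])
    # Drop a hypothetical wrapper such as {"gt_parse": {...}} and keep the inner fields.
    fields = ground_truth.get("gt_parse", ground_truth)
    return {
        "image": raw_example["image"],
        "input": PROMPT,               # question shown to the model
        "output": json.dumps(fields),  # target JSON the model should produce
    }
```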
## Model Download
Download the base Llama 3.2 11B Vision model:
```bash
tune download meta-llama/Llama-3.2-11B-Vision-Instruct --output-dir Llama-3.2-11B-Vision-Instruct
```
This downloads to the expected directory structure used in the provided YAML files. If you change the directory, update these keys in the configuration files:
- `checkpointer.checkpoint_dir`
- `tokenizer.path`
## Baseline Evaluation
Before fine-tuning, establish a baseline by evaluating the pre-trained model on the test set.
### Start vLLM Server
```bash
python -m vllm.entrypoints.openai.api_server --model Llama-3.2-11B-Vision-Instruct/ --port 8001 --max-model-len 9000 --max-num-seqs 100 --served-model-name 11B-base
```
For multi-GPU setup:
```bash
CUDA_VISIBLE_DEVICES="0,1" python -m vllm.entrypoints.openai.api_server --model Llama-3.2-11B-Vision-Instruct/ --port 8001 --max-model-len 9000 --tensor-parallel-size 2 --max-num-seqs 100 --served-model-name 11B-base
```
### Run Baseline Evaluation
```bash
python evaluate.py --server_url http://localhost:8001/v1 --model 11B-base --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
```
This should yield an accuracy of around 58%, as shown in the results tables above.
> **Note:** Adjust `max-model-len` and `max-num-seqs` to fit your hardware, especially `max-num-seqs`, as vLLM will OOM with Llama 3.2 multimodal models if it is not set.
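Conceptually, the evaluation sends each test image to the OpenAI-compatible endpoint and compares the returned JSON to the ground truth field by field. The sketch below illustrates the idea with the OpenAI SDK; the request construction, prompt, and scoring details are assumptions rather than the exact implementation in `evaluate.py`:

```python
import base64
import json
from openai import OpenAI

# vLLM does not check the API key by default, so any placeholder works.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="none")

def extract_w2(image_path: str, model: str = "11B-base") -> dict:
    """Ask the served model to return the W-2 fields as JSON for one image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all fields from this W-2 form as JSON."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    # Assumes the model returns bare JSON in the message content.
    return json.loads(response.choices[0].message.content)

def field_accuracy(predicted: dict, expected: dict) -> float:
    """Fraction of ground-truth fields reproduced exactly by the model."""
    matches = sum(1 for k, v in expected.items() if str(predicted.get(k)) == str(v))
    return matches / max(len(expected), 1)
```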
### Cloud Evaluation Setup
To evaluate bigger models, such as Llama 4 Maverick, we leverage cloud providers that serve these models, such as together.ai. Any OpenAI-compatible provider should work out of the box with our evaluation script, as it uses the OpenAI SDK.
```bash
TOGETHER_API_KEY= python evaluate.py --server_url https://api.together.xyz/v1 --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
```
In our testing, Maverick scored around 67% accuracy out of the box.
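Because the script relies only on the OpenAI SDK, pointing it at a hosted provider is just a matter of changing the base URL and API key. A small sketch (provider URL and model name taken from the command above, everything else illustrative):

```python
import os
from openai import OpenAI

# Same client pattern as for the local vLLM server, just a different endpoint and key.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response.choices[0].message.content)
```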
## Fine-tuning
### Configuration Overview
The repository includes two pre-configured YAML files:
- `11B_full_w2.yaml`: Full parameter fine-tuning configuration
- `11B_lora_w2.yaml`: LoRA fine-tuning configuration
Key differences:
**Full Parameter Fine-tuning:**
- Trains the encoder and fusion layers, leaving the decoder frozen
- Better performance on the target use case
- Learning rate: 2e-5
- Optimizer: PagedAdamW8bit for memory efficiency
**LoRA Fine-tuning:**
- Trains only low-rank adapters, applied to the encoder and fusion layers; the decoder is frozen
- Learning rate: 1e-4
- LoRA rank: 8, alpha: 16
In our case, using torchtune, there were no significant memory gains from using LoRA. The GPU footprint was below 24GB with `batch_size: 1`, `compile: true`, `optimizer_in_bwd: True`, and `gradient_accumulation_steps: 1`. Training time for 5 epochs on a single H100 was around 40 minutes.
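For intuition, LoRA keeps the original weight frozen and learns a low-rank update scaled by alpha/rank; the sketch below shows the math for the rank 8, alpha 16 setting used here (illustrative only, not torchtune's implementation):

```python
import torch

d_out, d_in, rank, alpha = 4096, 4096, 8, 16

W = torch.randn(d_out, d_in)        # frozen pretrained weight
A = torch.randn(rank, d_in) * 0.01  # trainable, initialized small
B = torch.zeros(d_out, rank)        # trainable, initialized to zero

# Effective forward pass: W x + (alpha / rank) * B A x
x = torch.randn(d_in)
y = W @ x + (alpha / rank) * (B @ (A @ x))

# Only A and B receive gradients, so the trainable parameter count per adapted
# layer is rank * (d_in + d_out) instead of d_in * d_out.
print(rank * (d_in + d_out), "trainable vs", d_in * d_out, "frozen")
```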
### WandB Configuration
Before training, update the WandB entity in your YAML files and make sure you are logged in to your account:
```yaml
metric_logger:
_component_: torchtune.training.metric_logging.WandBLogger
project: llama3_2_w2_extraction
entity: # Update this
```
### Training Commands
**Full Parameter Fine-tuning:**
```bash
tune run full_finetune_single_device --config 11B_full_w2.yaml
```
**LoRA Fine-tuning:**
```bash
tune run lora_finetune_single_device --config 11B_lora_w2.yaml
```
Each configuration writes its checkpoints to a different `output_dir`, which is used further down in the evaluation commands.
> **Note**: The VQA dataset component in torchtune is pre-configured to handle the multimodal format, eliminating the need for custom preprocessors. The `prepare_w2_dataset.py` script adapts the inputs, including the prompt, to this format.
## Model Evaluation
### Local Evaluation Setup
Start a vLLM server with your fine-tuned model:
**For LoRA model:**
```bash
python -m vllm.entrypoints.openai.api_server --model ./outputs/Llama-3.2-11B-Instruct-w2-lora/epoch_4/ --port 8001 --max-model-len 9000 --max-num-seqs 100 --served-model-name 11B-lora
```
**For full fine-tuned model:**
```bash
python -m vllm.entrypoints.openai.api_server --model ./outputs/Llama-3.2-11B-Instruct-w2-full/epoch_4/ --port 8001 --max-model-len 9000 --max-num-seqs 100 --served-model-name 11B-full
```
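Before running the full evaluation, you can quickly confirm the server is up and exposing the expected model name (a small sketch using the OpenAI SDK; the name printed should match whatever you passed to `--served-model-name`):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="none")

# List the models the vLLM server exposes; expect "11B-lora" or "11B-full" here.
for model in client.models.list().data:
    print(model.id)
```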
### Task-Specific Evaluation
```bash
python evaluate.py --server_url http://localhost:8001/v1 --model 11B-full --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
```
### General Benchmark Evaluation
Install llama-verifications for standard benchmarks:
```bash
pip install llama-verifications
```
Run benchmark evaluation:
```bash
uvx llama-verifications run-benchmarks \
--benchmarks mmlu-pro-cot,gpqa-cot-diamond,bfclv3,docvqa \
--provider http://localhost:8001/v1 \
--model \
--continue-on-failure \
--max-parallel-generations 100
```
### LM Evaluation Harness
For additional benchmarks using lm-eval:
**With vLLM backend:**
```bash
CUDA_VISIBLE_DEVICES=0,1 lm_eval --model vllm \
--model_args pretrained=,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9 \
--tasks gsm8k_cot_llama \
--batch_size auto \
--seed 4242
```
**With transformers backend:**
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch -m lm_eval --model hf-multimodal \
--model_args pretrained= \
--tasks chartqa_llama_90 \
--batch_size 16 \
--seed 4242 \
--log_samples \
--output_path results
```
## Performance
### Peak memory
### Loss curve graph
## References
- [Torchtune Documentation](https://pytorch.org/torchtune/)
- [vLLM Documentation](https://vllm.readthedocs.io/)
- [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [Synthetic Data Kit](https://github.com/meta-llama/synthetic-data-kit)