Adding all packages to pip install

Beto de Paola, 3 weeks ago
Commit d33139246a
1 changed file with 24 additions and 13 deletions

getting-started/finetuning/vision/README.md (+24 −13)

@@ -1,15 +1,15 @@
-# Fine-tuning Llama 3.2 11B Vision for Structured Document Extraction
+# Fine-tuning Llama 3.2 11B Vision for Structured Data Extraction
 
-This recipe demonstrates how to fine-tune Llama 3.2 11B Vision model on a synthetic W-2 tax form dataset for structured information extraction. The tutorial compares LoRA (Low-Rank Adaptation) and full parameter fine-tuning approaches, evaluating their trade-offs in terms of accuracy, memory consumption, and computational requirements.
+This recipe demonstrates how to fine-tune Llama 3.2 11B Vision model on a synthetic W-2 tax form dataset for structured data extraction. The tutorial compares LoRA (Low-Rank Adaptation) and full parameter fine-tuning approaches, evaluating their trade-offs in terms of accuracy, memory consumption, and computational requirements.
 
 ## Objectives
-- Showcase how to fine-tune and evaluate on a specific document extraction use case
+- Showcase how to fine-tune and evaluate on a specific extraction use case
 - Demonstrate custom benchmarking for structured output tasks
 - Compare trade-offs between LoRA and Full Parameter Fine-tuning on both task-specific and general benchmarks
 - Provide guidance on data preparation, training configuration, and evaluation methodologies
 
 ## Prerequisites
-- CUDA-compatible GPU with at least 40GB VRAM (H100 recommended)
+- CUDA-compatible GPU with at least 40GB VRAM
 - HuggingFace account with access to Llama models
 - Python 3.10+
 
@@ -19,13 +19,13 @@ This recipe demonstrates how to fine-tune Llama 3.2 11B Vision model on a synthe
 ```bash
 git clone git@github.com:meta-llama/llama-cookbook.git
 cd llama-cookbook/getting-started/finetuning/vision
-conda create -n image-ft python=3.10 -y
+conda create -n image-ft python=3.12 -y
 conda activate image-ft
 ```
 
 ### Dependencies Installation
 ```bash
-pip install -r requirements.txt
+pip install torchtune torch torchvision torchaudio torchao bitsandbytes transformers==4.51.1 accelerate vllm==0.9.2 lm_eval wandb
 ```
 
 Install torchtune nightly for the latest vision model support:
@@ -45,7 +45,7 @@ The dataset contains 2,000 examples of synthetic W-2 forms with three splits: tr
 The preparation script:
 1. Reshuffles the train/test splits according to the specified ratio
 2. Removes unnecessary JSON structure wrappers from ground truth
-3. Adds standardized prompts for training consistency
+3. Adds standardized prompts for training
 
 ```bash
 python prepare_w2_dataset.py --train-ratio 0.3
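
The implementation of prepare_w2_dataset.py is not shown in this diff. As a rough, hypothetical sketch of the re-split in step 1 above, pooling and re-splitting could be done with the HuggingFace `datasets` library along these lines (the function name and flow are illustrative assumptions, not the script's actual code):

```python
# Hypothetical sketch of step 1 (re-splitting by --train-ratio); the actual
# prepare_w2_dataset.py may differ. Assumes the HuggingFace `datasets` library.
from datasets import DatasetDict, concatenate_datasets, load_dataset

def resplit(dataset_name: str, train_ratio: float, seed: int = 42) -> DatasetDict:
    ds = load_dataset(dataset_name)
    # Pool all existing splits, shuffle, then re-split by the requested ratio.
    pooled = concatenate_datasets([ds[split] for split in ds]).shuffle(seed=seed)
    split = pooled.train_test_split(train_size=train_ratio, seed=seed)
    return DatasetDict(train=split["train"], test=split["test"])

# e.g. resplit("<w2_dataset_name>", train_ratio=0.3)
```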
@@ -83,7 +83,7 @@ CUDA_VISIBLE_DEVICES="0,1" python -m vllm.entrypoints.openai.api_server --model
 
 ### Run Baseline Evaluation
 ```bash
-python evaluate.py --server_url http://localhost:8001 --model Llama-3.2-11B-Vision-Instruct/ --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 200
+python evaluate.py --server_url http://localhost:8001/v1 --model Llama-3.2-11B-Vision-Instruct/ --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
 ```
 
 ## Fine-tuning
@@ -141,17 +141,26 @@ Start a vLLM server with your fine-tuned model:
 
 **For LoRA model:**
 ```bash
-CUDA_VISIBLE_DEVICES="0,1" python -m vllm.entrypoints.openai.api_server --model ./outputs/Llama-3.2-11B-Instruct-w2-lora/epoch_4/ --port 8003 --max-model-len 128000 --tensor-parallel-size 2
+CUDA_VISIBLE_DEVICES="0,1" python -m vllm.entrypoints.openai.api_server --model ./outputs/Llama-3.2-11B-Instruct-w2-lora/epoch_4/ --port 8001 --max-model-len 65000 --tensor-parallel-size 2 --max-num-seqs 10
 ```
 
 **For full fine-tuned model:**
 ```bash
-CUDA_VISIBLE_DEVICES="0,1" python -m vllm.entrypoints.openai.api_server --model ./outputs/Llama-3.2-11B-Instruct-w2-full/epoch_4/ --port 8003 --max-model-len 128000 --tensor-parallel-size 2
+CUDA_VISIBLE_DEVICES="0,1" python -m vllm.entrypoints.openai.api_server --model ./outputs/Llama-3.2-11B-Instruct-w2-full/epoch_4/ --port 8001 --max-model-len 65000 --tensor-parallel-size 2
 ```
 
 ### Task-Specific Evaluation
 ```bash
-python evaluate.py --server_url http://localhost:8003 --model <model_path> --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 200
+python evaluate.py --server_url http://localhost:8001/v1 --model ./outputs/Llama-3.2-11B-Instruct-w2-full/epoch_4/ --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
+```
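
With `--structured`, the evaluation compares the model's JSON output field by field against the W-2 ground truth. The exact metric lives in evaluate.py; a minimal, hypothetical field-level exact-match score could look like this:

```python
# Hypothetical field-level exact-match metric for structured extraction;
# the scoring actually implemented in evaluate.py may differ.
import json

def field_accuracy(prediction: str, ground_truth: dict) -> float:
    try:
        predicted = json.loads(prediction)
    except json.JSONDecodeError:
        return 0.0  # unparseable output scores zero
    hits = sum(
        str(predicted.get(key, "")).strip() == str(value).strip()
        for key, value in ground_truth.items()
    )
    return hits / len(ground_truth)
```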
+
+
+### Cloud Evaluation Setup
+
+To evaluate bigger models, such as Llama 4 Maverick, we leverage cloud providers that serve these models, such as together.ai. Any OpenAI-compatible provider should work out of the box with our evaluation script, as it uses the OpenAI SDK.
+
+```bash
+TOGETHER_API_KEY=<your_api_key> python evaluate.py --server_url https://api.together.xyz/v1 --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8  --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
 ```
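
Because evaluate.py talks to the endpoint through the OpenAI SDK, pointing it at a hosted provider is just a matter of changing `--server_url` and the API key. A minimal sketch of that client pattern (the actual request construction in evaluate.py may differ):

```python
# Minimal sketch of calling an OpenAI-compatible endpoint (here together.ai)
# via the OpenAI SDK; evaluate.py's actual request logic may differ.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",   # any OpenAI-compatible --server_url
    api_key=os.environ["TOGETHER_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the W-2 fields as JSON."},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_W2_IMAGE>"}},
        ],
    }],
)
print(response.choices[0].message.content)
```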
 
 ### General Benchmark Evaluation
@@ -191,7 +200,9 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch -m lm_eval --model hf-multimodal
     --tasks chartqa_llama_90 \
     --batch_size 16 \
     --seed 4242 \
-    --log_samples
+    --log_samples \
+    --output_path results
+
 ```
 
 ## Results
@@ -348,7 +359,7 @@ You can benchmark against the Llama API for comparison:
 
 ```bash
 LLAMA_API_KEY="<your_api_key>" python evaluate.py \
-  --server_url https://api.llama.com/compat \
+  --server_url https://api.llama.com/compat/v1 \
   --limit 100 \
   --model Llama-4-Maverick-17B-128E-Instruct-FP8 \
   --structured \