@@ -170,7 +170,7 @@ python prepare_w2_dataset.py --train-ratio 0.3
This creates a new dataset directory: `fake_w2_us_tax_form_dataset_train30_test70`
-**Configuration Note**: If you change the train ratio, update the `dataset.data_files.train` path in the corresponding YAML configuration files.
+> **Note**: If you change the train ratio, update the `dataset.data_files.train` path in the corresponding YAML configuration files.
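+
+For reference, the dataset entry in those configs follows the torchtune VQA dataset pattern mentioned later in this guide. The sketch below is illustrative only; the exact component, source and file names come from the provided `11B_full_w2.yaml` / `11B_lora_w2.yaml`, not from this snippet. The point is simply where `dataset.data_files.train` lives:
+
+```yaml
+# Illustrative sketch only; check the provided configs for the real values.
+dataset:
+  _component_: torchtune.datasets.multimodal.vqa_dataset   # assumed component name
+  data_files:
+    train: fake_w2_us_tax_form_dataset_train30_test70/train   # update if you change the split
+```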
## Model Download
@@ -188,21 +188,36 @@ This downloads to the expected directory structure used in the provided YAML fil
Before fine-tuning, establish a baseline by evaluating the pre-trained model on the test set.
### Start vLLM Server
-For single GPU (H100):
+For a single GPU:
```bash
-CUDA_VISIBLE_DEVICES="0" python -m vllm.entrypoints.openai.api_server --model Llama-3.2-11B-Vision-Instruct/ --port 8001 --max-model-len 65000 --max-num-seqs 10
+python -m vllm.entrypoints.openai.api_server --model Llama-3.2-11B-Vision-Instruct/ --port 8001 --max-model-len 9000 --max-num-seqs 100 --served-model-name 11B-base
```
For multi-GPU setup:
```bash
-CUDA_VISIBLE_DEVICES="0,1" python -m vllm.entrypoints.openai.api_server --model Llama-3.2-11B-Vision-Instruct/ --port 8001 --max-model-len 65000 --tensor-parallel-size 2 --max-num-seqs 10
+CUDA_VISIBLE_DEVICES="0,1" python -m vllm.entrypoints.openai.api_server --model Llama-3.2-11B-Vision-Instruct/ --port 8001 --max-model-len 9000 --tensor-parallel-size 2 --max-num-seqs 100 --served-model-name 11B-base
```
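+
+Once a server is up, you can optionally confirm it is serving the model under the expected name (the value passed to `--served-model-name`, which is also what `evaluate.py` takes as `--model`). This is just a sanity check, not a required step:
+
+```bash
+# List the models exposed by the vLLM OpenAI-compatible server; the id should be "11B-base".
+curl http://localhost:8001/v1/models
+```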
### Run Baseline Evaluation
```bash
-python evaluate.py --server_url http://localhost:8001/v1 --model Llama-3.2-11B-Vision-Instruct/ --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
+python evaluate.py --server_url http://localhost:8001/v1 --model 11B-base --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
+```
+
+This should give an accuracy of around 58%, as shown in the table at the top of this document.
+> **Note:** Adjust `max-model-len` and `max-num-seqs` to fit your hardware, especially `max-num-seqs`: vLLM can run out of memory with the Llama 3.2 multimodal models if it is not set.
+
+### Cloud Evaluation Setup
+
+To evaluate bigger models, such as Llama 4 Maverick, we leverage cloud providers that serve them, like together.ai. Any OpenAI-compatible provider should work out of the box with our evaluation script, as it uses the OpenAI SDK.
+
+```bash
+TOGETHER_API_KEY=<your_api_key> python evaluate.py --server_url https://api.together.xyz/v1 --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
```
+In our testing, Maverick scored around 67% accuracy out of the box.
+
## Fine-tuning
### Configuration Overview
@@ -215,20 +230,21 @@ Key differences:
**Full Parameter Fine-tuning:**
- Trains encoder and fusion layers, leaving decoder frozen
-- Higher memory requirements but potentially better performance
+- Better performance on the target use case
- Learning rate: 2e-5
- Optimizer: PagedAdamW8bit for memory efficiency
**LoRA Fine-tuning:**
- Trains only low-rank adapters on the encoder and fusion layers; the decoder is frozen
-- Significantly lower memory requirements
- Learning rate: 1e-4
- LoRA rank: 8, alpha: 16
- Frozen decoder with LoRA on encoder and fusion layers
+
+In our case, using torchtune, there were no significant memory gains from using LoRA. The GPU footprint stayed below 24GB with `batch_size: 1`, `compile: true`, `optimizer_in_bwd: True` and `gradient_accumulation_steps: 1`, and training for 5 epochs on a single H100 took around 40 minutes.
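+
+If you want to experiment with these settings, torchtune also accepts `key=value` overrides on the command line when launching the recipes shown below, so you can try different values without editing the YAML. A minimal sketch, assuming these keys exist in the provided config under the same names:
+
+```bash
+# Override memory-related fields of the full fine-tuning config at launch time (illustrative).
+tune run full_finetune_single_device --config 11B_full_w2.yaml \
+  batch_size=1 gradient_accumulation_steps=1 optimizer_in_bwd=True compile=True
+```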
### WandB Configuration
-Before training, update the WandB entity in your YAML files:
+Before training, update the WandB entity in your YAML files and make sure you are logged in to your account (e.g. via `wandb login`):
```yaml
metric_logger:
  _component_: torchtune.training.metric_logging.WandBLogger
@@ -248,7 +264,9 @@ tune run full_finetune_single_device --config 11B_full_w2.yaml
tune run lora_finetune_single_device --config 11B_lora_w2.yaml
```
-**Note**: The VQA dataset component in torchtune is pre-configured to handle the multimodal format, eliminating the need for custom preprocessors.
+This will write checkpoints to a separate `output_dir` for each configuration; those paths are used in the evaluation commands further down.
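+
+Before serving a checkpoint, it can be worth confirming that it landed where the evaluation commands below expect it. A quick check (the per-epoch directory layout here is an assumption based on the 5-epoch run described above; adjust to whatever your `output_dir` actually contains):
+
+```bash
+# Expect one subdirectory per saved epoch, e.g. epoch_0 ... epoch_4.
+ls ./outputs/Llama-3.2-11B-Instruct-w2-full/
+```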
+
+> **Note**: The VQA dataset component in torchtune is pre-configured to handle the multimodal format, eliminating the need for custom preprocessors. The `prepare_w2_dataset.py` script adapts the prompts in the dataset to this format.
## Model Evaluation
@@ -258,27 +276,20 @@ Start a vLLM server with your fine-tuned model:
**For LoRA model:**
```bash
-CUDA_VISIBLE_DEVICES="0,1" python -m vllm.entrypoints.openai.api_server --model ./outputs/Llama-3.2-11B-Instruct-w2-lora/epoch_4/ --port 8001 --max-model-len 65000 --tensor-parallel-size 2 --max-num-seqs 10
+python -m vllm.entrypoints.openai.api_server --model ./outputs/Llama-3.2-11B-Instruct-w2-lora/epoch_4/ --port 8001 --max-model-len 9000 --max-num-seqs 100 --served-model-name 11B-lora
```
**For full fine-tuned model:**
```bash
-CUDA_VISIBLE_DEVICES="0,1" python -m vllm.entrypoints.openai.api_server --model ./outputs/Llama-3.2-11B-Instruct-w2-full/epoch_4/ --port 8001 --max-model-len 65000 --tensor-parallel-size 2
+python -m vllm.entrypoints.openai.api_server --model ./outputs/Llama-3.2-11B-Instruct-w2-full/epoch_4/ --port 8001 --max-model-len 9000 --max-num-seqs 100 --served-model-name 11B-full
```
### Task-Specific Evaluation
```bash
-python evaluate.py --server_url http://localhost:8001/v1 --model ./outputs/Llama-3.2-11B-Instruct-w2-full/epoch_4/ --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
+python evaluate.py --server_url http://localhost:8001/v1 --model 11B-full --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
```
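+
+To score the LoRA run instead, point the same command at the name the LoRA server was started with (`--served-model-name 11B-lora` above); everything else stays the same:
+
+```bash
+python evaluate.py --server_url http://localhost:8001/v1 --model 11B-lora --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
+```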
-### Cloud Evaluation Setup
-
-To evaluate bigger models, like Llama 4 Maverick, we leverage cloud providers that server these models, like together.ai. Any OpenAI compatible provider should work out of the box with our evaluation script, as it uses the OpenAI SDK.
-
-```bash
-TOGETHER_API_KEY=<your_api_key> python evaluate.py --server_url https://api.together.xyz/v1 --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --structured --dataset fake_w2_us_tax_form_dataset_train30_test70/test --limit 100 --max_workers 25
-```
### General Benchmark Evaluation
@@ -322,74 +333,16 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch -m lm_eval --model hf-multimodal
```
+## Performance
-
-## Key Findings
-
-### Task-Specific Performance
-- **Full Parameter Fine-tuning** achieved the best task-specific performance (97% accuracy on W2 extraction)
-- **LoRA fine-tuning** provided substantial improvement (72% vs 58% baseline) with significantly lower resource requirements
-- Both approaches showed dramatic improvement over the baseline for the specific task
-
-### General Capability Preservation
-- **LoRA fine-tuning** preserved general capabilities better, showing minimal degradation on standard benchmarks
-- **Full Parameter fine-tuning** showed minimal degradation on industry benchmarks, making it the preferred choice for this small dataset and FT results. With a larger dataset, as the original split of 80/10/10 and more steps, we do see additional degradation on the benchmarks.
-- Both methods maintained strong performance on mathematical reasoning tasks (gsm8k)
-
-### Resource Efficiency
-- **LoRA** requires significantly less GPU memory and training time
-- **Full Parameter fine-tuning** requires more resources but achieves better task-specific performance
-
-## Performance Graphs
-
-### Performance
-
-#### Peak memory
+### Peak memory

<img src="peak_memory.png" width="600" alt="Peak Memory Usage">

-#### Loss curve graph
+### Loss curve graph

<img src="loss_curve.png" width="600" alt="Loss curve">

-## Comparison with Llama API
-
-You can benchmark against the Llama API for comparison:
-
-```bash
-LLAMA_API_KEY="<your_api_key>" python evaluate.py \
- --server_url https://api.llama.com/compa/v1 \
- --limit 100 \
- --model Llama-4-Maverick-17B-128E-Instruct-FP8 \
- --structured \
- --max_workers 50 \
- --dataset fake_w2_us_tax_form_dataset_train30_test70/test
-```
-
-## Best Practices
-
-1. **Data Preparation**: Ensure your dataset format matches the expected structure. The preparation script handles common formatting issues.
-
-2. **Configuration Management**: Always update paths in YAML files when changing directory structures.
-
-3. **Memory Management**: Use PagedAdamW8bit optimizer for full parameter fine-tuning to reduce memory usage.
-
-4. **Evaluation Strategy**: Evaluate both task-specific and general capabilities to understand trade-offs.
-
-5. **Monitoring**: Use WandB for comprehensive training monitoring and comparison.
-
-## Troubleshooting
-
-### Common Issues
-
-1. **CUDA Out of Memory**: Reduce batch size, enable gradient checkpointing, or use LoRA instead of full fine-tuning.
-
-2. **Dataset Path Errors**: Verify that dataset paths in YAML files match your actual directory structure.
-
-3. **Model Download Issues**: Ensure you're logged into HuggingFace and have access to Llama models.
-
-4. **vLLM Server Connection**: Check that the server is running and accessible on the specified port.
-
## References
- [Torchtune Documentation](https://pytorch.org/torchtune/)