This recipe walks you through fine-tuning a Llama 3.2 vision model on the OCR VQA task using the OCRVQA dataset.
Disclaimer: Our vision models already have strong OCR ability, so the OCRVQA dataset is used here only to demonstrate the steps required to fine-tune them with llama-recipes.
We provide an example script, ocrvqa_dataset.py, which loads the OCRVQA dataset through its get_custom_dataset function and provides an OCRVQADataCollator class to process the image dataset.
For full finetuning with FSDP, run the following command:
  torchrun --nnodes 1 --nproc_per_node 4  recipes/quickstart/finetuning/finetuning.py --enable_fsdp --lr 1e-5  --num_epochs 3 --batch_size_training 2 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned  --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/quickstart/finetuning/datasets/ocrvqa_dataset.py"  --run_validation True --batching_strategy padding
For LoRA finetuning with FSDP, run the following command:
  torchrun --nnodes 1 --nproc_per_node 4  recipes/quickstart/finetuning/finetuning.py --enable_fsdp --lr 1e-5  --num_epochs 3 --batch_size_training 2 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned  --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/quickstart/finetuning/datasets/ocrvqa_dataset.py"  --run_validation True --batching_strategy padding  --use_peft --peft_method lora
For finetuning with the LLM frozen, using FSDP, run the following command:
  torchrun --nnodes 1 --nproc_per_node 4  recipes/quickstart/finetuning/finetuning.py --enable_fsdp --lr 1e-5  --num_epochs 3 --batch_size_training 2 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned  --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/quickstart/finetuning/datasets/ocrvqa_dataset.py"  --run_validation True --batching_strategy padding --freeze_LLM_only True
Note: --batching_strategy padding is required because the vision model does not work with the packing strategy.
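To make the difference between the two strategies concrete, here is a minimal pure-Python sketch. It is an illustration only, not the llama-recipes implementation: `pad_batch` and `pack_batch` are hypothetical helper names, and the token ids are made up. One plausible reading of the note above is that packing merges tokens from different samples into a single sequence, which would separate text from the image it belongs to, whereas padding keeps each sample intact.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one and build attention masks.

    Hypothetical helper for illustration; each sample stays a separate row.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


def pack_batch(sequences, chunk_size):
    """Concatenate all samples and cut them into fixed-size chunks.

    Simplified assumption: a trailing chunk shorter than chunk_size is dropped.
    """
    flat = [tok for seq in sequences for tok in seq]
    return [
        flat[i:i + chunk_size]
        for i in range(0, len(flat) - chunk_size + 1, chunk_size)
    ]


samples = [[1, 2, 3], [4, 5], [6]]  # made-up token ids for three samples

ids, mask = pad_batch(samples)
print(ids)   # [[1, 2, 3], [4, 5, 0], [6, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0], [1, 0, 0]]

# With packing, the single chunk mixes tokens from the first two samples.
print(pack_batch(samples, chunk_size=4))  # [[1, 2, 3, 4]]
```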
For more details about the finetuning configurations, please read the finetuning readme.
For local inference with the fine-tuned checkpoint, please read the "Inference with FSDP checkpoints" section, which explains how to convert the FSDP weights into a consolidated Hugging Face formatted model.
In order to use a custom dataset, please follow the steps below:
1. Create a new dataset Python file under the recipes/quickstart/finetuning/datasets folder.
2. In this file, define a get_custom_dataset(dataset_config, processor, split, split_ratio=0.9) function that handles the data loading.
3. In this file, define a get_data_collator(processor) function that returns a custom data collator that can be used by the PyTorch DataLoader.
4. The custom data collator class must have a __call__(self, samples) function that converts the image and text samples into the actual inputs that the vision model expects.
5. Run the torchrun command from the section above, changing --custom_dataset.file to the new dataset Python file and adjusting the learning rate accordingly.
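The required pieces of a custom dataset file can be sketched as follows. This is a structural skeleton under stated assumptions, not the contents of ocrvqa_dataset.py: the in-memory samples, the MyDataCollator class name, and StubProcessor are all hypothetical placeholders. A real file would load actual images and text in get_custom_dataset and call the model's real processor (with whatever arguments it needs) inside __call__.

```python
def get_custom_dataset(dataset_config, processor, split, split_ratio=0.9):
    """Return the samples for the requested split."""
    # Placeholder data: a real file would load images and text here.
    samples = [
        {"image": f"image_{i}.png", "question": f"question {i}", "answer": f"answer {i}"}
        for i in range(10)
    ]
    # Keep the first split_ratio share for training, the rest for evaluation.
    cut = int(len(samples) * split_ratio)
    return samples[:cut] if split == "train" else samples[cut:]


class MyDataCollator:
    """Hypothetical collator: turns a list of samples into model inputs."""

    def __init__(self, processor):
        self.processor = processor

    def __call__(self, samples):
        # Gather the text and image of every sample in the batch, then let
        # the processor build the batched inputs.
        texts = [f"{s['question']} {s['answer']}" for s in samples]
        images = [s["image"] for s in samples]
        return self.processor(images=images, text=texts)


def get_data_collator(processor):
    """Return the collator instance that the data loader will call."""
    return MyDataCollator(processor)


class StubProcessor:
    """Stand-in for the model's processor, used only for a dry run."""

    def __call__(self, images, text):
        return {"num_images": len(images), "text": text}


# Dry run without any model: check the split sizes and one collated batch.
train = get_custom_dataset(None, StubProcessor(), "train")
collator = get_data_collator(StubProcessor())
print(len(train))              # 9
print(collator(train[:2]))
```

Exercising the file with a stub like this before launching torchrun can surface errors in the split logic or the collator's field names without loading the model.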