This recipe walks you through fine-tuning a Llama 3.2 vision model on the OCR VQA task using the OCRVQA dataset.
Disclaimer: Because our vision models already have very good OCR ability, we use the OCRVQA dataset here only to demonstrate the steps required for fine-tuning our vision models with llama-recipes.
We provide an example script, ocrvqa_dataset.py, that loads the OCRVQA dataset through its get_custom_dataset function and provides an OCRVQADataCollator class to process the image dataset.
For full finetuning with FSDP, we can run the following code:
  torchrun --nnodes 1 --nproc_per_node 4  recipes/quickstart/finetuning/finetuning.py --enable_fsdp --lr 1e-5  --num_epochs 3 --batch_size_training 2 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned  --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/quickstart/finetuning/datasets/ocrvqa_dataset.py"  --run_validation True --batching_strategy padding
For LoRA finetuning with FSDP, we can run the following code:
  torchrun --nnodes 1 --nproc_per_node 4  recipes/quickstart/finetuning/finetuning.py --enable_fsdp --lr 1e-5  --num_epochs 3 --batch_size_training 2 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned  --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/quickstart/finetuning/datasets/ocrvqa_dataset.py"  --run_validation True --batching_strategy padding  --use_peft --peft_method lora
Note: --batching_strategy padding is required because the vision models do not work with the packing strategy.
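To make the difference between the two batching strategies concrete, here is a minimal sketch on plain token lists. It is an illustration only, not the llama-recipes implementation; the function names pad_batch and pack_sequences are hypothetical.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def pack_sequences(sequences, chunk_size):
    """Concatenate all sequences and cut the stream into fixed-size chunks."""
    stream = [tok for seq in sequences for tok in seq]
    # Drop the trailing remainder that does not fill a whole chunk.
    usable = len(stream) - len(stream) % chunk_size
    return [stream[i:i + chunk_size] for i in range(0, usable, chunk_size)]


batch = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
padded = pad_batch(batch)                     # one row per original sample
packed = pack_sequences(batch, chunk_size=4)  # rows mix tokens across samples
```

With padding, each row still corresponds to exactly one sample. With packing, a row can contain tokens from several samples, so the per-sample boundaries are lost.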
For more details about the finetuning configurations, please read the finetuning readme.
In order to use a custom dataset, please follow the steps below:
1. Create a new dataset Python file under the recipes/quickstart/finetuning/datasets folder.
2. In this file, define a get_custom_dataset(dataset_config, processor, split, split_ratio=0.9) function that handles the data loading.
3. In this file, define a get_data_collator(processor) function that returns a custom data collator that can be used by the PyTorch DataLoader.
4. The custom data collator class must have a __call__(self, samples) function that converts the image and text samples into the actual inputs that the vision model expects.
5. Run the torchrun command from the section above, changing --custom_dataset.file to the new dataset Python file and adjusting the learning rate accordingly.
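The shape of such a dataset file can be sketched as follows. This is a hypothetical skeleton, not the contents of ocrvqa_dataset.py: the in-memory samples, the field names (image, question, answer), the ExampleVisionCollator class, and the assumption that the processor is callable with images and text keyword arguments are all placeholders for illustration.

```python
def get_custom_dataset(dataset_config, processor, split, split_ratio=0.9):
    """Return the samples for the requested split."""
    # Placeholder data; a real file would load images and text here.
    samples = [
        {"image": f"image_{i}", "question": f"q{i}", "answer": f"a{i}"}
        for i in range(10)
    ]
    cut = int(len(samples) * split_ratio)
    # Assumption: any split other than "train" is the evaluation split.
    return samples[:cut] if split == "train" else samples[cut:]


class ExampleVisionCollator:
    """Hypothetical collator: turns raw samples into model inputs."""

    def __init__(self, processor):
        self.processor = processor

    def __call__(self, samples):
        # One list of images per sample, plus one text string per sample.
        images = [[sample["image"]] for sample in samples]
        texts = [f"{sample['question']} {sample['answer']}" for sample in samples]
        # Assumption: the processor accepts `images` and `text` keyword
        # arguments and returns the batch that the model consumes.
        return self.processor(images=images, text=texts)


def get_data_collator(processor):
    """Return the collator instance that the data loader will call."""
    return ExampleVisionCollator(processor)
```

A real collator would typically also build the prompt format and the training labels that the vision model expects; those details depend on the model and are left out of this sketch.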