This recipe steps you through finetuning a Llama 3.2 vision model on the VQA task using the `the_cauldron` dataset.
## Model Architecture

Our Meta Llama 3.2 11B and 90B models consist of two main components: (1) an image encoder and (2) an image adapter.
[Model Architecture PICTURE]
We added a new processor class that handles both the image processing and the text tokenization. A processor example looks like this:
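Below is a minimal sketch (not part of the recipe code) of how such a processor turns an image plus a chat prompt into model inputs. It assumes the Hugging Face `transformers` library and the `meta-llama/Llama-3.2-11B-Vision-Instruct` checkpoint; the image path and prompt text are placeholders:

```python
# Hypothetical usage sketch of the vision processor; "example.jpg" and the prompt
# are placeholders, and the checkpoint is assumed to be accessible on the Hub.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

image = Image.open("example.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What does this image show?"},
        ],
    },
]

# Render the conversation into the model's prompt format, inserting the special
# image token where the image belongs.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# The processor tokenizes the text and preprocesses the image in one call,
# returning the tensors the vision model expects (input_ids, pixel_values, ...).
inputs = processor(image, prompt, return_tensors="pt")
print(inputs.keys())
```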
For full finetuning with FSDP, we can run the following command:
```bash
torchrun --nnodes 1 --nproc_per_node 4 recipes/quickstart/finetuning/finetuning.py --enable_fsdp --lr 1e-5 --context_length 8192 --num_epochs 3 --batch_size_training 2 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/quickstart/finetuning/datasets/vqa_dataset.py" --run_validation True --batching_strategy padding
```
For LoRA finetuning with FSDP, we can run the following command:
```bash
torchrun --nnodes 1 --nproc_per_node 4 recipes/quickstart/finetuning/finetuning.py --enable_fsdp --lr 1e-5 --context_length 8192 --num_epochs 3 --batch_size_training 2 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/quickstart/finetuning/datasets/vqa_dataset.py" --run_validation True --batching_strategy padding --use_peft --peft_method lora
```
Note: `--batching_strategy padding` is required because the vision model does not work with the packing method.
For more details about the finetuning configurations, please read the finetuning readme.
To use a custom dataset, create a new dataset python file under the `recipes/quickstart/finetuning/datasets` folder (this is the file passed via `--custom_dataset.file`). That file must define:

1. a `get_custom_dataset(dataset_config, processor, split, split_ratio=0.9)` function that handles the data loading,
2. a `get_data_collator(processor)` function that returns a custom data collator usable by the PyTorch DataLoader,
3. a `__call__(self, samples)` method on that collator that converts the image and text samples into the actual inputs the vision model expects.
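As an illustration, here is a hedged skeleton of such a dataset file. It is not the recipe's `vqa_dataset.py`: the `the_cauldron` subset name (`ocrvqa`) and the `images`/`texts` sample fields are assumptions, and the label masking is kept deliberately simple.

```python
# Hypothetical skeleton of a custom vision dataset file for the finetuning script.
# The dataset subset and field names are illustrative assumptions, not the recipe's
# exact implementation.
import copy

from datasets import load_dataset


def get_custom_dataset(dataset_config, processor, split, split_ratio=0.9):
    # Load one subset of the_cauldron and split it into train/test locally.
    dataset = load_dataset("HuggingFaceM4/the_cauldron", name="ocrvqa", split="train")
    dataset = dataset.train_test_split(test_size=1 - split_ratio, seed=42)
    return dataset[split]


class VQADataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.processor.tokenizer.padding_side = "right"

    def __call__(self, samples):
        texts, images = [], []
        for sample in samples:
            image = sample["images"][0]
            turn = sample["texts"][0]
            messages = [
                {"role": "user", "content": [
                    {"type": "image"},
                    {"type": "text", "text": turn["user"]},
                ]},
                {"role": "assistant", "content": [
                    {"type": "text", "text": turn["assistant"]},
                ]},
            ]
            texts.append(self.processor.apply_chat_template(messages))
            images.append(image)

        # Tokenize the text and preprocess the images in one call, padding the
        # batch to the longest sample.
        batch = self.processor(images, texts, padding=True, return_tensors="pt")

        # Use the input ids as labels, masking padding tokens out of the loss.
        labels = copy.deepcopy(batch["input_ids"])
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch


def get_data_collator(processor):
    return VQADataCollator(processor)
```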