
Fine-Tuning Tutorial for Llama4 Models with torchtune

This tutorial shows how to perform fine-tuning on Llama4 models using torchtune.

Prerequisites

  1. Llama 4 models are sizeable, so we recommend using 8 x H100 GPUs with 80GB of memory each for fine-tuning. If you are using GPUs with 40GB of memory, please consider enabling CPU memory offloading. If you are leasing a VM from a cloud provider for fine-tuning, ensure you have NVIDIA drivers installed:

wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run
sudo sh cuda_12.1.0_530.30.02_linux.run

Verify that the GPUs are visible with the nvidia-smi command.
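
For a quick sanity check that the drivers and all GPUs are visible, a query like the following works (the fields shown are just an illustration of what to look for):

# Confirm driver visibility and per-GPU memory
nvidia-smi --query-gpu=index,name,memory.total --format=csv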

  2. We need torchtune to perform LoRA fine-tuning. Let's set up the environment and install the PyTorch nightly build; Llama4 LoRA fine-tuning also requires building torchtune from source.

a. Create a Python 3.10 environment (recommended):

conda create -n py310 python=3.10 -y
conda activate py310

b. Install PyTorch nightly and TorchTune:

pip install --force-reinstall --pre torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu126
git clone https://github.com/pytorch/torchtune.git
cd torchtune
git checkout 5d51c25cedfb6ba7b00e03cb2fef4f9cdb7baebd
pip install -e .
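
As a quick sanity check after installation, you can confirm that the nightly PyTorch build imports and that the tune CLI is on your path (a minimal check, not part of the official setup):

# Verify the installed PyTorch version and that the torchtune CLI resolves
python -c "import torch; print(torch.__version__)"
tune --help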

c. Depending on your environment, you may also need to install these packages:

pip install importlib_metadata
pip install torchvision
pip install torchao

  3. We also need a Hugging Face access token (HF_TOKEN) to download the model; please follow the instructions here to get your own token. You will also need to request access to the Llama4 models from here.
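
For the download step below, it is convenient to export the token as an environment variable; the value shown is only a placeholder:

# Replace the placeholder with your actual Hugging Face token
export HF_TOKEN=<your-hf-token>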

Steps

  1. Download Llama4 Weights

We will use meta-llama/Llama-4-Scout-17B-16E-Instruct as an example here. Replace $HF_TOKEN with your Hugging Face token (or export it as shown above):

tune download meta-llama/Llama-4-Scout-17B-16E-Instruct --output-dir /tmp/Llama-4-Scout-17B-16E-Instruct --hf-token $HF_TOKEN

Alternatively, you can use huggingface-cli to log in and then download the model weights.

huggingface-cli login --token $HF_TOKEN
tune download meta-llama/Llama-4-Scout-17B-16E-Instruct --output-dir /tmp/Llama-4-Scout-17B-16E-Instruct

This retrieves the model weights and tokenizer from Hugging Face.
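
If you want to confirm the download succeeded, you can list the output directory used above:

# The directory should contain the model weight and tokenizer files
ls -lh /tmp/Llama-4-Scout-17B-16E-Instruct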

  2. Run LoRA Fine-Tuning for Llama4

To run LoRA fine-tuning, use the following command:

tune run --nproc_per_node 8 lora_finetune_distributed --config llama4/scout_17B_16E_lora

This will run LoRA fine-tuning on the Llama4 model with 8 GPUs. The config llama4/scout_17B_16E_lora specifies the model, tokenizer, and training parameters. This command will also download the alpaca_dataset selected in the config file. Please refer to the Datasets section for more details.
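
If you prefer editing a config file over passing command-line overrides, the torchtune CLI can copy the built-in configs to a local file (the destination filename below is just an example):

# Copy the built-in Scout LoRA config to a local file you can edit
tune cp llama4/scout_17B_16E_lora ./my_scout_lora.yaml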

You can customize the training process by adding command line overrides. For example, to optimize training efficiency and memory usage:

  tune run --nproc_per_node 8 lora_finetune_distributed --config llama4/scout_17B_16E_lora batch_size=4 dataset.packed=True tokenizer.max_seq_len=2048 fsdp_cpu_offload=True

These arguments control:

  • batch_size=4: Sets the number of examples processed in each training step
  • dataset.packed=True: Packs multiple sequences into a single example to maximize GPU utilization
  • tokenizer.max_seq_len=2048: Sets maximum sequence length to 2048 tokens
  • fsdp_cpu_offload=True: Enables CPU memory offloading to prevent out-of-memory errors, but it slows down the fine-tuning process significantly (especially important if you are using GPUs with 40GB of memory)

Please check this YAML for all the possible configs to override. To learn more about the YAML config, please refer to the YAML config documentation.
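
As a sketch of how these pieces compose, the command below assumes you copied the config to ./my_scout_lora.yaml as shown earlier and redirects outputs to a custom directory; double-check the output_dir key against the YAML you are actually using:

# Run from a local config copy and redirect checkpoints/logs (illustrative)
tune run --nproc_per_node 8 lora_finetune_distributed --config ./my_scout_lora.yaml output_dir=/tmp/llama4_lora_output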

  3. Run Full Parameter Fine-Tuning for Llama4

To run full parameter fine-tuning, use the following command:

  tune run --nproc_per_node 8 full_finetune_distributed --config llama4/scout_17B_16E_full batch_size=4 dataset.packed=True tokenizer.max_seq_len=2048

This command will run full fine-tuning on a single node; torchtune uses CPU offloading by default to avoid out-of-memory (OOM) errors. Please check this YAML for all the possible configs to override.

Alternatively, if you want to run on multiple nodes to avoid the slowdown caused by CPU offloading, please modify this Slurm script.
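
As a rough sketch of what such a script can look like (node count, rendezvous port, and SBATCH fields are placeholders, and it assumes tune run forwards torchrun flags such as --nnodes and --rdzv_endpoint; treat the referenced Slurm script as the authoritative version):

#!/bin/bash
#SBATCH --job-name=llama4-full-ft
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Use the first node in the allocation as the rendezvous host (port is arbitrary)
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun tune run --nnodes 2 --nproc_per_node 8 \
    --rdzv_backend c10d --rdzv_endpoint "${head_node}:29500" \
    full_finetune_distributed --config llama4/scout_17B_16E_full \
    batch_size=4 dataset.packed=True tokenizer.max_seq_len=2048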