
# Fine-Tuning Tutorial for Llama4 Models with torchtune

This tutorial shows how to fine-tune Llama4 models using torchtune.

## Prerequisites

1. We need torchtune to perform LoRA fine-tuning. Currently, Llama4 LoRA fine-tuning requires installing torchtune from source along with a PyTorch nightly build (a quick sanity-check sketch follows this list):

    ```bash
    pip install --force-reinstall --pre torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu126
    git clone https://github.com/pytorch/torchtune.git
    cd torchtune
    git checkout 5d51c25cedfb6ba7b00e03cb2fef4f9cdb7baebd
    pip install -e .
    ```
    
2. We also need a Hugging Face access token (HF_TOKEN) to download the model; please follow the instructions here to get your own token. You will also need to request access to the Llama4 models here.
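Once both prerequisites are in place, a quick check along the following lines confirms the install and sets up the token. This is a minimal sketch: the token value is a placeholder, and the exact output will vary with your environment.

```bash
# Confirm the nightly PyTorch build and the source install of torchtune
python -c "import torch, torchtune; print(torch.__version__)"

# List built-in recipes and configs; the llama4 entries should appear
tune ls

# Export your Hugging Face token (placeholder value shown)
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```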

## Steps

1. Download Llama4 Weights

We will use meta-llama/Llama-4-Scout-17B-16E-Instruct as an example here. Replace $HF_TOKEN with your own Hugging Face token:

```bash
tune download meta-llama/Llama-4-Scout-17B-16E-Instruct --output-dir /tmp/Llama-4-Scout-17B-16E-Instruct --hf-token $HF_TOKEN
```

Alternatively, you can use huggingface-cli to log in first and then download the model weights:

```bash
huggingface-cli login --token $HF_TOKEN
tune download meta-llama/Llama-4-Scout-17B-16E-Instruct --output-dir /tmp/Llama-4-Scout-17B-16E-Instruct
```

This retrieves the model weights and tokenizer from Hugging Face.
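To confirm the download landed where expected, you can list the output directory. The exact file names depend on the model revision, so treat this as a rough check:

```bash
# Expect safetensors weight shards plus tokenizer and config files
ls -lh /tmp/Llama-4-Scout-17B-16E-Instruct
```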

2. Run LoRA Fine-Tuning for Llama4

To run LoRA fine-tuning, use the following command:

```bash
tune run --nproc_per_node 8 lora_finetune_distributed --config llama4/scout_17B_16E_lora
```

This will run LoRA fine-tuning on the Llama4 model with 8 GPUs. The llama4/scout_17B_16E_lora config specifies the model, tokenizer, and training parameters. This command will also download the alpaca_dataset selected in the config file. Please refer to the Datasets section for more details.
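If you want to see exactly what the config pins down, torchtune can copy any built-in config to a local file for inspection. The grep keys below reflect typical torchtune config sections and are illustrative rather than a guaranteed schema:

```bash
# Copy the built-in config into the working directory and skim its top-level keys
tune cp llama4/scout_17B_16E_lora ./scout_17B_16E_lora.yaml
grep -E '^(model|tokenizer|dataset|optimizer|batch_size|epochs)' ./scout_17B_16E_lora.yaml
```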

You can add specific overrides through the command line. For example, to use a larger batch_size:

```bash
tune run --nproc_per_node 8 lora_finetune_distributed --config llama4/scout_17B_16E_lora batch_size=4 dataset.packed=True tokenizer.max_seq_len=2048 fsdp_cpu_offload=True
```

dataset.packed=True and tokenizer.max_seq_len=2048 are additional overrides for the dataset and tokenizer settings. By default, lora_finetune_distributed does not use CPU offloading, so setting fsdp_cpu_offload=True enables it to avoid out-of-memory (OOM) errors. Please check this yaml for all the possible configs to override. To learn more about the YAML config, please refer to the YAML config documentation.
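As an alternative to stacking overrides on the command line, you can persist them in the local copy made earlier and point --config at that file. The sed one-liner is just one illustrative way to edit the value, assuming batch_size is a top-level key in the config:

```bash
# Persist the override in the copied YAML, then run against the local file
sed -i 's/^batch_size:.*/batch_size: 4/' ./scout_17B_16E_lora.yaml
tune run --nproc_per_node 8 lora_finetune_distributed --config ./scout_17B_16E_lora.yaml
```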

3. Run Full Parameter Fine-Tuning for Llama4

To run full parameter fine-tuning, use the following command:

```bash
tune run --nproc_per_node 8 full_finetune_distributed --config llama4/scout_17B_16E_full batch_size=4 dataset.packed=True tokenizer.max_seq_len=2048
```

This command will run full fine-tuning on a single node; torchtune enables CPU offloading by default here to avoid out-of-memory (OOM) errors. Please check this yaml for all the possible configs to override.

Alternatively, if you want to run on multiple nodes to avoid the potential slowdown from CPU offloading, please modify this Slurm script. A rough sketch of such a script is shown below.
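For orientation, here is a minimal sketch of what a two-node launch can look like, assuming a Slurm cluster with 8 GPUs per node. The script shipped in the torchtune repo is the authoritative version; details such as ports, partitions, and rendezvous flags will differ on your cluster.

```bash
#!/bin/bash
#SBATCH --job-name=llama4-full-ft
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Use the first allocated node as the rendezvous host (port choice is arbitrary)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# tune run forwards these launcher flags to torchrun
srun tune run \
    --nnodes 2 --nproc_per_node 8 \
    --rdzv_backend c10d --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
    full_finetune_distributed --config llama4/scout_17B_16E_full
```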