This tutorial shows how to perform fine-tuning on Llama4 models using torchtune.
Verify that your GPUs are visible with the nvidia-smi command.
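For example, running the command below prints the detected GPUs, driver version, and current memory usage (the exact output depends on your hardware and driver):
nvidia-smi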
a. Create a Python 3.10 environment (recommended):
conda create -n py310 python=3.10 -y
conda activate py310
b. Install PyTorch nightly and TorchTune:
pip install --force-reinstall --pre torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu126
git clone https://github.com/pytorch/torchtune.git
cd torchtune
git checkout 5d51c25cedfb6ba7b00e03cb2fef4f9cdb7baebd
pip install -e .
c. Depending on your environment, you may also need to install the following packages:
pip install importlib_metadata
pip install torchvision
pip install torchao
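To sanity-check the installation, you can confirm that the nightly PyTorch build imports correctly and list the recipes bundled with your torchtune checkout using the tune CLI:
python -c "import torch; print(torch.__version__)"
tune ls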
We will use meta-llama/Llama-4-Scout-17B-16E-Instruct
as an example here. Replace $HF_TOKEN with your Hugging Face access token (or export it as an environment variable beforehand):
tune download meta-llama/Llama-4-Scout-17B-16E-Instruct --output-dir /tmp/Llama-4-Scout-17B-16E-Instruct --hf-token $HF_TOKEN
Alternatively, you can use huggingface-cli to log in and then download the model weights.
huggingface-cli login --token $HF_TOKEN
tune download meta-llama/Llama-4-Scout-17B-16E-Instruct --output-dir /tmp/Llama-4-Scout-17B-16E-Instruct
This retrieves the model weights and tokenizer files from Hugging Face.
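As a quick check, you can list the output directory to confirm that the weight shards and tokenizer files were downloaded (exact file names depend on the model revision):
ls /tmp/Llama-4-Scout-17B-16E-Instruct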
To run LoRA fine-tuning, use the following command:
tune run --nproc_per_node 8 lora_finetune_distributed --config llama4/scout_17B_16E_lora
This will run LoRA fine-tuning on the Llama4 model with 8 GPUs. The config llama4/scout_17B_16E_lora specifies the model, tokenizer, and training parameters. This command will also download the alpaca_dataset
as selected in the config file. Please refer to the Datasets section for more details.
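If you want to fine-tune on a different dataset, you can swap the dataset builder with a command-line override. The example below assumes torchtune's built-in alpaca_cleaned_dataset is available in the version you installed:
tune run --nproc_per_node 8 lora_finetune_distributed --config llama4/scout_17B_16E_lora dataset._component_=torchtune.datasets.alpaca_cleaned_dataset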
You can customize the training process by adding command line overrides. For example, to optimize training efficiency and memory usage:
tune run --nproc_per_node 8 lora_finetune_distributed --config llama4/scout_17B_16E_lora batch_size=4 dataset.packed=True tokenizer.max_seq_len=2048 fsdp_cpu_offload=True
These arguments control:
batch_size=4: Sets the number of examples processed in each training step.
dataset.packed=True: Packs multiple sequences into a single example to maximize GPU utilization.
tokenizer.max_seq_len=2048: Sets the maximum sequence length to 2048 tokens.
fsdp_cpu_offload=True: Enables CPU memory offloading to prevent out-of-memory errors, but slows down the fine-tuning process significantly (especially important if you are using GPUs with 40GB of memory).
Please check this yaml for all the possible configs to override. To learn more about the YAML config, please refer to the YAML config documentation.
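If you prefer editing a config file instead of stacking command-line overrides, you can copy the built-in config to a local file with tune cp and point --config at it; the local file name below is just an example:
tune cp llama4/scout_17B_16E_lora ./my_scout_17B_16E_lora.yaml
tune run --nproc_per_node 8 lora_finetune_distributed --config ./my_scout_17B_16E_lora.yaml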
To run full parameter fine-tuning, use the following command:
tune run --nproc_per_node 8 full_finetune_distributed --config llama4/scout_17B_16E_full batch_size=4 dataset.packed=True tokenizer.max_seq_len=2048
This command will run a full fine-tuning on a single node, as torchtune by default uses CPU offloading to avoid out-of-memory (OOM) errors. Please check this yaml for all the possible configs to override.
Alternatively, if you want to run on multiple nodes to avoid the possible slowdown from CPU offloading, please modify this slurm script.
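As a rough sketch only, a minimal two-node SLURM batch script could look like the following; the node count, GPU count, and rendezvous port are placeholders, and it assumes tune run forwards the torchrun-style flags (--nnodes, --rdzv_backend, --rdzv_endpoint) to torchrun as in torchtune's multi-node recipes:
#!/bin/bash
#SBATCH --job-name=llama4_full_ft
#SBATCH --nodes=2                # placeholder node count
#SBATCH --ntasks-per-node=1      # one launcher per node
#SBATCH --gpus-per-node=8
# Use the first allocated node as the rendezvous host.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
srun tune run --nnodes 2 --nproc_per_node 8 \
    --rdzv_backend c10d --rdzv_endpoint "$head_node:29500" \
    full_finetune_distributed --config llama4/scout_17B_16E_full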