@@ -1,5 +1,5 @@
# Fine-tuning with Multi GPU
-This recipe steps you through how to finetune a Llama 2 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on multiple GPUs in a single or across multiple nodes.
+This recipe steps you through how to finetune a Meta Llama 3 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on multiple GPUs in a single node or across multiple nodes.
## Requirements
@@ -9,7 +9,7 @@ We will also need 2 packages:
1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning.
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](./LLM_finetuning_overview.md#2-full-partial-parameter-finetuning).
-> [!NOTE]
+> [!NOTE]
> The llama-recipes package will install PyTorch 2.0.1. If you want to use FSDP with PEFT for multi-GPU finetuning, please install the PyTorch nightlies ([details](../../README.md#pytorch-nightlies)).
>
> INT8 quantization is not currently supported in FSDP
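
Purely as an illustration of the note above, installing a PyTorch nightly build typically looks like the command below; the CUDA suffix (`cu118`) is an assumption, so follow the linked README section for the exact steps for your setup.

```bash
# Illustrative only: install a PyTorch nightly wheel for CUDA 11.8.
# Adjust the index URL (e.g. the CUDA version) to match your environment.
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118
```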
@@ -23,14 +23,14 @@ Get access to a machine with multiple GPUs (in this case we tested with 4 A100 a
<details open>
<summary>Single-node Multi-GPU</summary>
- torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
+ torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
</details>
<details>
<summary>Multi-node Multi-GPU</summary>
Here we use a Slurm script to schedule the job over multiple nodes.
-
+
# Change the number of nodes and GPUs per node in the script before running.
sbatch ./multi_node.slurm
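
The contents of `multi_node.slurm` are not reproduced here. Purely as an illustrative sketch, a script of this kind usually combines `#SBATCH` resource directives with `srun` launching one `torchrun` process per node; the node counts, port, and rendezvous settings below are assumptions, and the actual script in the repo may differ.

```bash
#!/bin/bash
# Illustrative sketch only -- not the repo's multi_node.slurm.
#SBATCH --job-name=llama-finetune
#SBATCH --nodes=2                # keep in sync with --nnodes below
#SBATCH --ntasks-per-node=1      # one torchrun launcher per node
#SBATCH --gpus-per-node=8

# Use the first allocated node as the rendezvous host for torchrun.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun --nnodes 2 --nproc_per_node 8 \
  --rdzv_id "$SLURM_JOB_ID" --rdzv_backend c10d --rdzv_endpoint "$head_node:29500" \
  finetuning.py --enable_fsdp --use_peft --peft_method lora \
  --model_name /patht_of_model_folder/8B --output_dir Path/to/save/PEFT/model
```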
@@ -49,7 +49,7 @@ The args used in the command above are:
If you are interested in running full parameter finetuning without using PEFT methods, please use the following command. Make sure to change `nproc_per_node` to the number of GPUs available to you. This has been tested with `BF16` on 8x A100 40GB GPUs.
```bash
-torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
+torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
```
### Using less CPU memory (FSDP on 70B model)
@@ -79,23 +79,23 @@ To run with each of the datasets set the `dataset` flag in the command as shown
```bash
# grammar_dataset
-torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
# alpaca_dataset
-torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
# samsum_dataset
-torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
```
## [TIP] Slow interconnect between nodes?
-In case you are dealing with slower interconnect network between nodes, to reduce the communication overhead you can make use of `--hsdp` flag.
+In case you are dealing with a slower interconnect between nodes, you can use the `--hsdp` flag to reduce the communication overhead.
HSDP (Hybrid Sharded Data Parallel) defines a hybrid sharding strategy: FSDP is applied within each group of `sharding_group_size` GPUs, which can be the minimum number of GPUs your model fits on, and DDP is used between the model replicas specified by `replica_group_size`.
@@ -106,6 +106,3 @@ This will require to set the Sharding strategy in [fsdp config](../../src/llama_
torchrun --nnodes 4 --nproc_per_node 8 ./finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --hsdp --sharding_group_size n --replica_group_size world_size/n
```
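
To make the sizing concrete, here is one hypothetical choice of values for the 4-node, 8-GPU-per-node layout used above; whether the model actually fits on 8 GPUs depends on your hardware and settings, so treat the numbers as an assumption.

```bash
# Hypothetical sizing: 4 nodes x 8 GPUs gives a world size of 32.
# If the model fits on 8 GPUs, shard within each node and replicate across nodes:
#   sharding_group_size = 8          (FSDP within a node)
#   replica_group_size  = 32 / 8 = 4 (DDP across the four replicas)
torchrun --nnodes 4 --nproc_per_node 8 ./finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 \
  --model_name /patht_of_model_folder/70B --batch_size_training 1 \
  --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned \
  --hsdp --sharding_group_size 8 --replica_group_size 4
```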
-
-
-