@@ -1,5 +1,5 @@
# Fine-tuning with Multi GPU
-This recipe steps you through how to finetune a Llama 2 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on multiple GPUs in a single or across multiple nodes.
+This recipe steps you through how to finetune a Meta Llama 3 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on multiple GPUs in a single node or across multiple nodes.
## Requirements
@@ -9,7 +9,7 @@ We will also need 2 packages:
1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning.
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](./LLM_finetuning_overview.md#2-full-partial-parameter-finetuning).
-> [!NOTE]
+> [!NOTE]
> The llama-recipes package will install PyTorch 2.0.1. If you want to use FSDP with PEFT for multi-GPU finetuning, please install the PyTorch nightlies ([details](../../README.md#pytorch-nightlies)).
>
> INT8 quantization is not currently supported in FSDP
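
Purely as an illustration of the note above, installing a PyTorch nightly build typically looks like the command below; the CUDA suffix (`cu118`) is an assumption, so follow the linked README section for the exact steps for your setup.

```bash
# Illustrative only: install a PyTorch nightly wheel for CUDA 11.8.
# Adjust the index URL (e.g. the CUDA version) to match your environment.
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118
```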
@@ -23,14 +23,14 @@ Get access to a machine with multiple GPUs (in this case we tested with 4 A100 a
<details open>
<summary>Single-node Multi-GPU</summary>
- torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
+ torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
</details>
<details>
<summary>Multi-node Multi-GPU</summary>
Here we use a Slurm script to schedule the job over multiple nodes.
-
+
# Change the number of nodes and GPUs per node in the script before running.
sbatch ./multi_node.slurm
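
The contents of `multi_node.slurm` are not reproduced here. Purely as an illustrative sketch, a script of this kind usually combines `#SBATCH` resource directives with `srun` launching one `torchrun` process per node; the node counts, port, and rendezvous settings below are assumptions, and the actual script in the repo may differ.

```bash
#!/bin/bash
# Illustrative sketch only -- not the repo's multi_node.slurm.
#SBATCH --job-name=llama-finetune
#SBATCH --nodes=2                # keep in sync with --nnodes below
#SBATCH --ntasks-per-node=1      # one torchrun launcher per node
#SBATCH --gpus-per-node=8

# Use the first allocated node as the rendezvous host for torchrun.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun --nnodes 2 --nproc_per_node 8 \
  --rdzv_id "$SLURM_JOB_ID" --rdzv_backend c10d --rdzv_endpoint "$head_node:29500" \
  finetuning.py --enable_fsdp --use_peft --peft_method lora \
  --model_name /patht_of_model_folder/8B --output_dir Path/to/save/PEFT/model
```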
@@ -49,7 +49,7 @@ The args used in the command above are:
If you are interested in running full parameter finetuning without using PEFT methods, please use the following command. Make sure to change `nproc_per_node` to the number of GPUs available to you. This has been tested with `BF16` on 8x A100 40GB GPUs.
```bash
-torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
+torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
```
### Using less CPU memory (FSDP on 70B model)
@@ -79,23 +79,23 @@ To run with each of the datasets set the `dataset` flag in the command as shown
```bash
# grammar_dataset
-torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
# alpaca_dataset
-torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
# samsum_dataset
-torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
+torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name /patht_of_model_folder/8B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
```
## [TIP] Slow interconnect between nodes?
-In case you are dealing with slower interconnect network between nodes, to reduce the communication overhead you can make use of `--hsdp` flag.
+In case you are dealing with a slower interconnect between nodes, you can use the `--hsdp` flag to reduce the communication overhead.
HSDP (Hybrid Sharded Data Parallel) defines a hybrid sharding strategy: FSDP is applied within each group of `sharding_group_size` GPUs, which can be the minimum number of GPUs your model fits on, and DDP is used between the model replicas specified by `replica_group_size`.
@@ -106,6 +106,3 @@ This will require to set the Sharding strategy in [fsdp config](../../src/llama_
torchrun --nnodes 4 --nproc_per_node 8 ./finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --hsdp --sharding_group_size n --replica_group_size world_size/n
```
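
To make the sizing concrete, here is one hypothetical choice of values for the 4-node, 8-GPU-per-node layout used above; whether the model actually fits on 8 GPUs depends on your hardware and settings, so treat the numbers as an assumption.

```bash
# Hypothetical sizing: 4 nodes x 8 GPUs gives a world size of 32.
# If the model fits on 8 GPUs, shard within each node and replicate across nodes:
#   sharding_group_size = 8          (FSDP within a node)
#   replica_group_size  = 32 / 8 = 4 (DDP across the four replicas)
torchrun --nnodes 4 --nproc_per_node 8 ./finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 \
  --model_name /patht_of_model_folder/70B --batch_size_training 1 \
  --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned \
  --hsdp --sharding_group_size 8 --replica_group_size 4
```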
-
-
-