If you are running full parameter fine-tuning on the 70B model, you can enable `low_cpu_fsdp` mode as follows:

```bash
torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
```

**Multi GPU multi node**:

Here we use a Slurm script to schedule the training job across multiple nodes.

```bash
sbatch recipes/quickstart/finetuning/multi_node.slurm
# Change the number of nodes and GPUs per node in the script before running.
```

To fine-tune the Meta Llama 405B model with LoRA on 32x H100 80 GB GPUs, we need to combine 4-bit quantization (QLoRA) with FSDP.
We can achieve this by adding the following environment variables to the Slurm script (before the srun command at the bottom):

```bash
export FSDP_CPU_RAM_EFFICIENT_LOADING=1
export ACCELERATE_USE_FSDP=1
```

Then we need to replace the bottom srun command with the following:

```bash
srun torchrun --nproc_per_node 8 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 ./finetuning.py --enable_fsdp --use_peft --peft_method lora --quantization 4bit --quantization_config.quant_type nf4 --mixed_precision False --low_cpu_fsdp
```
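
The command above assumes `$head_node_ip` has been defined earlier in the Slurm script. If you adapt or write your own script, a common way to derive it (a sketch following the usual PyTorch-on-Slurm pattern, not necessarily the exact contents of `multi_node.slurm`) is:

```bash
# Pick the first node in the allocation as the rendezvous host.
nodes=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
head_node=${nodes[0]}
# Resolve its IP from inside the job; all torchrun ranks will
# rendezvous at $head_node_ip:29500.
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
```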

Do not forget to adjust the number of nodes, ntasks and gpus-per-task at the top of the script.
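
As a quick sanity check when adjusting those values, the product of nodes and GPUs per node must equal the total GPU count you are targeting; for the 405B example above that is 32. The values below are illustrative (4 nodes with 8 GPUs each), not taken from the script:

```bash
# Keep these in sync with #SBATCH --nodes and the --nproc_per_node flag above.
nnodes=4
gpus_per_node=8
world_size=$((nnodes * gpus_per_node))
echo "world size: ${world_size} ranks (one per GPU)"   # 4 x 8 = 32
```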

## Running with different datasets
Currently, three open-source datasets are supported; they can be found in the [Datasets config file](../../../src/llama_recipes/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)).
 |