The provided fine tuning script allows you to select between three datasets by passing the `dataset` argument:
* [samsum_dataset](https://huggingface.co/datasets/samsum) contains about 16k messenger-like conversations with summaries.
* [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1/) contains about 88k messages from assistant-style conversations.
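
For example, assuming the dataset names above double as values for the `--dataset` flag (as `custom_dataset` does below), fine tuning on the samsum dataset would be started like this:

```
python -m llama_recipes.finetuning --dataset "samsum_dataset" [TRAINING PARAMETERS]
```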
				  
## Batching Strategies

Llama-recipes supports two strategies to batch requests together.
The default setting is `packing` which concatenates the tokenized samples into long sequences filling up the context length of the model.
This is the most compute efficient variant as it avoids any padding and all sequences have the same length.
Samples at the boundary of the context length are truncated and the remainder of the cut sequence is used as the start of the next long sequence.
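
A minimal sketch of this packing procedure, operating on plain Python lists of token ids (illustrative only, not the actual llama-recipes implementation):

```
# Illustrative sketch of the `packing` strategy.
# `samples` is a list of tokenized samples, each a list of token ids.
def pack_samples(samples, context_length):
    packed, buffer = [], []
    for sample in samples:
        buffer.extend(sample)
        # Emit full-length sequences; the truncated remainder of a sample
        # stays in the buffer and starts the next long sequence.
        while len(buffer) >= context_length:
            packed.append(buffer[:context_length])
            buffer = buffer[context_length:]
    return packed
```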
If the amount of training data is small this procedure might introduce a lot of noise into the training data which can hurt the prediction performance of the fine-tuned model.
Therefore, we also support a `padding` strategy which does not introduce the additional noise due to truncated sequences.
This strategy tries to minimize the efficiency loss by batching samples of similar length together so only minimal padding is necessary.
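
A rough sketch of this length-grouped batching (again illustrative, not the actual implementation):

```
# Illustrative sketch of the `padding` strategy.
def length_grouped_batches(samples, batch_size, pad_token_id=0):
    # Sorting by length puts samples of similar size into the same batch.
    by_length = sorted(samples, key=len)
    batches = []
    for i in range(0, len(by_length), batch_size):
        batch = by_length[i:i + batch_size]
        longest = max(len(s) for s in batch)
        # Pad each sample only up to the longest sample in its batch.
        batches.append([s + [pad_token_id] * (longest - len(s)) for s in batch])
    return batches
```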
The batching strategy can be selected through the command line parameter `--batching_strategy [packing]/[padding]`.
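
For example, switching to the padding strategy would look like this:

```
python -m llama_recipes.finetuning --batching_strategy "padding" [TRAINING PARAMETERS]
```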
## Using custom datasets
				  
The list of available datasets in llama-recipes is intended to give users a quick start on training their Llama model.
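
To supply a custom dataset, a single .py file exposing a loader function is passed to the training script; judging from the parameters described below, its signature presumably looks like this:

```
def get_custom_dataset(dataset_config, tokenizer, split):
```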
The `dataset_config` in the above signature will be an instance of the llama_recipes dataset configuration.
The split signals whether to return the training or validation dataset.
The default function name is `get_custom_dataset` but this can be changed as described below.
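
A minimal sketch of what such a custom dataset file might contain (the samsum data and column names here are placeholders for your own data, not the contents of the shipped `examples/custom_dataset.py`):

```
# Illustrative custom dataset module.
import datasets

def get_custom_dataset(dataset_config, tokenizer, split):
    # `split` selects the training or validation portion of the data.
    dataset = datasets.load_dataset("samsum", split=split)

    def tokenize(sample):
        # Concatenate input and target text and tokenize it.
        return tokenizer(sample["dialogue"] + sample["summary"])

    # Drop the original text columns, keeping only the tokenizer output.
    return dataset.map(tokenize, remove_columns=list(dataset.features))
```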
In order to start training with the custom dataset we need to set the `--dataset` as well as the `--custom_dataset.file` parameter.
```
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py" [TRAINING PARAMETERS]
```
The function used to load the dataset can be changed by appending its name to the file name, separated by a `:`:
```
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py:get_foo" [TRAINING PARAMETERS]
```
This will call the function `get_foo` instead of `get_custom_dataset` when retrieving the dataset.
### Adding new dataset
Each dataset has a corresponding configuration (dataclass) in [configs/datasets.py](../src/llama_recipes/configs/datasets.py) which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
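
As an illustration, such a dataclass might look like the following sketch (the exact field names in configs/datasets.py may differ):

```
from dataclasses import dataclass

@dataclass
class samsum_dataset:
    dataset: str = "samsum_dataset"    # name used with the --dataset flag
    train_split: str = "train"         # name of the training split
    test_split: str = "validation"     # name of the validation split
```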
Additionally, there is a preprocessing function for each dataset in the [datasets](../src/llama_recipes/datasets) folder.