The provided fine tuning script allows you to select between three datasets by passing the dataset arg to the llama_finetuning.py script. The current options are grammar_dataset, alpaca_dataset and samsum_dataset. Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).
text-davinci-003.
The list of available datasets can easily be extended with custom datasets by following these instructions.
Each dataset has a corresponding configuration (dataclass) in configs/dataset.py which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
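As a rough illustration, such a configuration could look like the sketch below. The class name `custom_dataset` and all field names and default values are hypothetical, chosen only to mirror the description above; the actual definitions live in configs/dataset.py.

```python
from dataclasses import dataclass, asdict


@dataclass
class custom_dataset:
    # Name used to select the dataset (hypothetical value).
    dataset: str = "custom_dataset"
    # Names of the training and validation splits (assumed field names).
    train_split: str = "train"
    test_split: str = "validation"
    # Optional parameter, e.g. a path to local data files (assumed field name).
    data_path: str = ""


# Override only the optional parameter; the other fields keep their defaults.
config = custom_dataset(data_path="data/my_corpus.json")
print(asdict(config))
```

Keeping the configuration in a dataclass means every dataset exposes the same small set of fields, and optional parameters get explicit defaults.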
Additionally, there is a preprocessing function for each dataset in the ft_datasets folder.
The returned data of the dataset needs to be consumable by the forward method of the fine-tuned model by calling model(**data).
For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.
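To make the expected format concrete, here is a minimal pure-Python sketch of one sample. The vocabulary, `toy_tokenize` and `build_sample` are hypothetical stand-ins: a real preprocessing function would use the model's tokenizer and return tensors rather than lists. Setting the labels of padded positions to -100 is an assumption based on the default ignore index of PyTorch's cross-entropy loss.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; a real setup would use the model's tokenizer.
VOCAB = {"<pad>": 0, "<unk>": 1, "hello": 2, "world": 3}
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def toy_tokenize(text: str) -> List[int]:
    """Map whitespace-separated words to ids (stand-in for a real tokenizer)."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def build_sample(text: str, max_length: int) -> Dict[str, List[int]]:
    """Build one sample with the three fields a CausalLM forward pass usually expects."""
    ids = toy_tokenize(text)[:max_length]
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [VOCAB["<pad>"]] * pad,
        # 1 marks real tokens, 0 marks padding.
        "attention_mask": [1] * len(ids) + [0] * pad,
        # Labels mirror the inputs; padded positions are excluded from the loss.
        "labels": ids + [IGNORE_INDEX] * pad,
    }


sample = build_sample("Hello world again", max_length=5)
print(sample)
```

Once the lists are converted to tensors with a batch dimension, a dictionary of this shape is what a call like model(**data) would unpack into keyword arguments.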
To add a custom dataset the following steps need to be performed.
Below we list other datasets that can be used for fine tuning, along with their main use cases.
English quotes (2508 examples): multi-label text classification, text generation
More information on evaluation datasets can be found in HELM.