
Update docs

Matthias Reso, 9 months ago
commit d9edc0ca08

+ 4 - 4
docs/single_gpu.md

@@ -21,7 +21,7 @@ Get access to a machine with one GPU or if using a multi-GPU machine please make
 
 ```bash
 
-python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization --use_fp16 --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization 8bit --use_fp16 --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 
 ```
 The args used in the command above are:
@@ -52,16 +52,16 @@ to run with each of the datasets set the `dataset` flag in the command as shown
 ```bash
 # grammar_dataset
 
-python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization  --dataset grammar_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization 8bit --dataset grammar_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 
 # alpaca_dataset
 
-python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization  --dataset alpaca_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization 8bit --dataset alpaca_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 
 
 # samsum_dataset
 
-python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization  --dataset samsum_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m llama_recipes.finetuning  --use_peft --peft_method lora --quantization 8bit --dataset samsum_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 
 ```
 

+ 2 - 2
recipes/quickstart/finetuning/README.md

@@ -54,7 +54,7 @@ It lets us specify the training settings for everything from `model_name` to `da
     output_dir: str = "PATH/to/save/PEFT/model"
     freeze_layers: bool = False
     num_freeze_layers: int = 1
-    quantization: bool = False
+    quantization: str = None
     one_gpu: bool = False
     save_model: bool = True
     dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
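With `quantization` now a string (`"8bit"` or `"4bit"`) rather than a boolean, the flag value selects the precision. As a rough illustration only (a hypothetical helper, not the repo's actual code), such a flag might map to a `transformers` `BitsAndBytesConfig` like this:

```python
# Hypothetical sketch: translating a string quantization flag into a
# bitsandbytes config for model loading. Not the repo's actual code.
from transformers import BitsAndBytesConfig

def build_quant_config(quantization: str):
    if quantization == "8bit":
        return BitsAndBytesConfig(load_in_8bit=True)
    if quantization == "4bit":
        return BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
    return None  # no flag given: load the model unquantized
```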
@@ -101,7 +101,7 @@ It lets us specify the training settings for everything from `model_name` to `da
 You can enable [W&B](https://wandb.ai/) experiment tracking by using the `use_wandb` flag as shown below. You can change the project name, entity, and other `wandb.init` arguments in `wandb_config`.
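For illustration, here is a minimal sketch (all values are placeholders, not repo defaults) of what the `wandb_config` fields feed into at run time:

```python
# Hypothetical sketch: the use_wandb path amounts to a wandb.init call
# whose arguments come from wandb_config. Values here are placeholders.
import wandb

run = wandb.init(
    project="llama_recipes",         # placeholder project name
    entity=None,                     # your W&B user or team, if any
    config={"peft_method": "lora"},  # training settings worth logging
)
run.log({"train/loss": 0.0})         # metrics are then logged per step
run.finish()
```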
 
 ```bash
-python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model --use_wandb
+python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization 8bit --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model --use_wandb
 ```
 You'll be able to access a dedicated project or run link on [wandb.ai](https://wandb.ai) and see a dashboard like the one below.
 <div style="display: flex;">

+ 6 - 6
recipes/quickstart/finetuning/singlegpu_finetuning.md

@@ -15,17 +15,17 @@ To run fine-tuning on a single GPU, we will make use of two packages:
 
 ## How to run it?
 
-**NOTE** To run the fine-tuning with `QLORA`, make sure to set `--peft_method lora` and `--quantization int4`.
+**NOTE** To run the fine-tuning with `QLORA`, make sure to set `--peft_method lora` and `--quantization 4bit --quantization_config.quant_type nf4`.
 
 
 ```bash
-FSDP_CPU_RAM_EFFICIENT_LOADING=1 python finetuning.py  --use_peft --peft_method lora --quantization int8 --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+FSDP_CPU_RAM_EFFICIENT_LOADING=1 python finetuning.py  --use_peft --peft_method lora --quantization 8bit --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 ```
 The args used in the command above are:
 
 * `--use_peft` boolean flag to enable PEFT methods in the script
 * `--peft_method` to specify the PEFT method; here we use `lora`, other options are `llama_adapter` and `prefix`.
-* `--quantization` string flag to enable int8 or int4 quantization
+* `--quantization` string flag to enable 8bit or 4bit quantization
 
 > [!NOTE]
 > In case you are using a multi-GPU machine, please make sure to make only one of them visible using `export CUDA_VISIBLE_DEVICES=GPU:id`.
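To make the QLORA note above concrete, here is a minimal sketch (illustrative only; the model path and LoRA hyperparameters are placeholders, not repo defaults) of 4-bit NF4 loading combined with LoRA adapters:

```python
# Illustrative QLORA setup: 4-bit NF4 quantized base weights plus LoRA
# adapters via peft. Path and hyperparameters are placeholders.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "/path_of_model_folder/8B", quantization_config=bnb_config
)
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=32, task_type="CAUSAL_LM"))
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```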
@@ -51,16 +51,16 @@ to run with each of the datasets set the `dataset` flag in the command as shown
 ```bash
 # grammar_dataset
 
-python -m finetuning.py  --use_peft --peft_method lora --quantization int8 --dataset grammar_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m finetuning.py  --use_peft --peft_method lora --quantization 8bit --dataset grammar_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 
 # alpaca_dataset
 
-python -m finetuning.py  --use_peft --peft_method lora --quantization int8  --dataset alpaca_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m finetuning.py  --use_peft --peft_method lora --quantization 8bit  --dataset alpaca_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 
 
 # samsum_dataset
 
-python -m finetuning.py  --use_peft --peft_method lora --quantization int8  --dataset samsum_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
+python -m finetuning.py  --use_peft --peft_method lora --quantization 8bit  --dataset samsum_dataset --model_name /path_of_model_folder/8B --output_dir Path/to/save/PEFT/model
 
 ```
 

+ 2 - 2
recipes/quickstart/inference/local_inference/README.md

@@ -46,7 +46,7 @@ Padding would be required for batch inference. In this [example](inference.
 The inference folder also includes a chat completion example that adds built-in safety features in fine-tuned models to the prompt tokens. To run the example:
 
 ```bash
-python chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file chat_completion/chats.json  --quantization --use_auditnlg
+python chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file chat_completion/chats.json  --quantization 8bit --use_auditnlg
 
 ```
 
@@ -55,7 +55,7 @@ python chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --pro
 Setting `use_fast_kernels` will enable the use of Flash Attention or Xformers memory-efficient kernels based on the hardware being used. This can speed up inference for batched inputs. This has been enabled in the `optimum` library from Hugging Face as a one-liner API; please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
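For reference, a minimal sketch of the `optimum` one-liner referenced above (the model path is a placeholder):

```python
# Sketch of the optimum BetterTransformer one-liner, which swaps in fused,
# memory-efficient attention kernels where the hardware supports them.
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("PATH/TO/MODEL/7B/")
model = BetterTransformer.transform(model)  # returns the converted model
```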
 
 ```bash
-python chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file chat_completion/chats.json  --quantization --use_auditnlg --use_fast_kernels
+python chat_completion/chat_completion.py --model_name "PATH/TO/MODEL/7B/" --prompt_file chat_completion/chats.json  --quantization 8bit --use_auditnlg --use_fast_kernels
 
 python inference.py --model_name <training_config.output_dir> --peft_model <training_config.output_dir> --prompt_file <test_prompt_file> --use_auditnlg --use_fast_kernels
 

+ 3 - 6
recipes/responsible_ai/llama_guard/README.md

@@ -2,7 +2,7 @@
 <!-- markdown-link-check-disable -->
 Meta Llama Guard is a language model that provides input and output guardrails for LLM inference. For more details and model cards, please visit the main repository for each model: [Meta Llama Guard](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard) and [Meta Llama Guard 2](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard2).
 
-This folder contains an example file to run inference with a locally hosted model, either using the Hugging Face Hub or a local path. 
+This folder contains an example file to run inference with a locally hosted model, either using the Hugging Face Hub or a local path.
 
 ## Requirements
 1. Access to Llama Guard model weights on Hugging Face. To get access, follow the steps described [here](https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard#download)
@@ -10,7 +10,7 @@ This folder contains an example file to run inference with a locally hosted mode
 
 
 ## Llama Guard inference script
-For testing, you can add User or User/Agent interactions into the prompts list and then run the script to verify the results. When the conversation has one or more Agent responses, it's considered of type agent. 
+For testing, you can add User or User/Agent interactions into the prompts list and then run the script to verify the results. When the conversation has one or more Agent responses, it's considered of type agent.
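As a rough illustration (the entry shape and labels below are assumptions, not the script's exact types), the prompts list could look like:

```python
# Hypothetical shape of the prompts list: each entry is one conversation.
# A lone user turn is of type "User"; once an agent response is present,
# the conversation is evaluated as type "Agent".
prompts = [
    (["<insert user prompt here>"], "User"),
    (["<insert user prompt here>",
      "<insert agent response here>"], "Agent"),
]
```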
 
 
 ```
@@ -66,7 +66,4 @@ In this case, the default categories are applied by the tokenizer, using the `ap
 
 Use this command for testing with a quantized Llama model, modifying the values accordingly:
 
-`python examples/inference.py --model_name <path_to_regular_llama_model> --prompt_file <path_to_prompt_file> --quantization --enable_llamaguard_content_safety`
-
-
-
+`python examples/inference.py --model_name <path_to_regular_llama_model> --prompt_file <path_to_prompt_file> --quantization 8bit --enable_llamaguard_content_safety`