Selaa lähdekoodia

Llama 3.1 update

Updating the recipes for the Llama 3.1 release.
albertodepaola 9 kuukautta sitten
vanhempi
commit
c62ed771ce
44 muutettua tiedostoa jossa 2516 lisäystä ja 247 poistoa
  1. 12 0
      .github/scripts/spellcheck_conf/wordlist.txt
  2. 21 12
      README.md
  3. 1 3
      docs/LLM_finetuning.md
  4. 3 4
      docs/multi_gpu.md
  5. 9 9
      recipes/3p_integrations/lamini/text2sql_memory_tuning/meta_lamini.ipynb
  6. 1 1
      recipes/3p_integrations/lamini/text2sql_memory_tuning/util/parse_arguments.py
  7. 9 9
      recipes/3p_integrations/llama_on_prem.md
  8. 75 0
      recipes/3p_integrations/vllm/README.md
  9. 35 12
      recipes/3p_integrations/vllm/inference.py
  10. 2 2
      recipes/quickstart/Running_Llama3_Anywhere/Running_Llama_on_HF_transformers.ipynb
  11. 1 3
      recipes/quickstart/finetuning/LLM_finetuning_overview.md
  12. 25 0
      recipes/quickstart/finetuning/multigpu_finetuning.md
  13. 1 1
      recipes/quickstart/finetuning/quickstart_peft_finetuning.ipynb
  14. 4 3
      recipes/quickstart/inference/README.md
  15. 7 4
      recipes/quickstart/inference/local_inference/README.md
  16. 13 11
      recipes/quickstart/inference/local_inference/chat_completion/chat_completion.py
  17. 145 128
      recipes/quickstart/inference/local_inference/inference.py
  18. 51 0
      recipes/quickstart/inference/modelUpgradeExample.py
  19. 10 7
      recipes/responsible_ai/README.md
  20. 6 3
      recipes/responsible_ai/llama_guard/README.md
  21. 2 2
      recipes/responsible_ai/llama_guard/inference.py
  22. 793 0
      recipes/responsible_ai/llama_guard/llama_guard_customization_via_prompting_and_fine_tuning.ipynb
  23. 11 0
      recipes/responsible_ai/prompt_guard/README.md
  24. 0 0
      recipes/responsible_ai/prompt_guard/__init__.py
  25. 180 0
      recipes/responsible_ai/prompt_guard/inference.py
  26. 817 0
      recipes/responsible_ai/prompt_guard/prompt_guard_tutorial.ipynb
  27. 1 1
      recipes/use_cases/customerservice_chatbots/RAG_chatbot/RAG_Chatbot_Example.ipynb
  28. 3 3
      recipes/use_cases/customerservice_chatbots/RAG_chatbot/vectorstore/mongodb/rag_mongodb_llama3_huggingface_open_source.ipynb
  29. 2 1
      requirements.txt
  30. 7 1
      src/llama_recipes/configs/datasets.py
  31. 2 1
      src/llama_recipes/datasets/__init__.py
  32. 131 0
      src/llama_recipes/datasets/toxicchat_dataset.py
  33. 22 5
      src/llama_recipes/inference/model_utils.py
  34. 95 5
      src/llama_recipes/inference/prompt_format_utils.py
  35. 1 1
      src/llama_recipes/inference/safety_utils.py
  36. 1 1
      src/llama_recipes/tools/README.md
  37. 3 0
      src/llama_recipes/utils/dataset_utils.py
  38. 1 1
      src/tests/conftest.py
  39. 1 1
      src/tests/datasets/test_custom_dataset.py
  40. 1 1
      src/tests/datasets/test_grammar_datasets.py
  41. 1 1
      src/tests/datasets/test_samsum_datasets.py
  42. 1 1
      src/tests/test_batching.py
  43. 2 2
      tools/benchmarks/inference/on_prem/README.md
  44. 7 7
      tools/benchmarks/llm_eval_harness/README.md

+ 12 - 0
.github/scripts/spellcheck_conf/wordlist.txt

@@ -1406,3 +1406,15 @@ DLAI
 agentic
 containts
 dlai
+Prerequirements
+tp
+QLoRA
+ntasks
+srun
+xH
+unquantized
+eom
+ipython
+CPUs
+modelUpgradeExample
+guardrailing

Tiedoston diff-näkymää rajattu, sillä se on liian suuri
+ 21 - 12
README.md


+ 1 - 3
docs/LLM_finetuning.md

@@ -1,6 +1,6 @@
 ## LLM Fine-Tuning
 
-Here we discuss fine-tuning Meta Llama 3 with a couple of different recipes. We will cover two scenarios here:
+Here we discuss fine-tuning Meta Llama with a couple of different recipes. We will cover two scenarios here:
 
 
 ## 1. **Parameter Efficient Model Fine-Tuning**
@@ -18,8 +18,6 @@ These methods will address three aspects:
 
 HF [PEFT](https://github.com/huggingface/peft) library provides an easy way of using these methods which we make use of here. Please read more [here](https://huggingface.co/blog/peft).
 
-
-
 ## 2. **Full/ Partial Parameter Fine-Tuning**
 
 Full parameter fine-tuning has its own advantages, in this method there are multiple strategies that can help:

+ 3 - 4
docs/multi_gpu.md

@@ -6,13 +6,12 @@ To run fine-tuning on multi-GPUs, we will  make use of two packages:
 
 2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](LLM_finetuning.md/#2-full-partial-parameter-finetuning).
 
-Given the combination of PEFT and FSDP, we would be able to fine tune a Meta Llama 3 8B model on multiple GPUs in one node or multi-node.
+Given the combination of PEFT and FSDP, we would be able to fine tune a Meta Llama 8B model on multiple GPUs in one node.
+For big models like 405B we will need to fine-tune in a multi-node setup even if 4bit quantization is enabled.
 
 ## Requirements
 To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`finetuning.py`](../recipes/quickstart/finetuning/finetuning.py) script with torchrun (See [README.md](../README.md) for details).
 
-**Please note that the llama_recipes package will install PyTorch 2.0.1 version, in case you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.**
-
 ## How to run it
 
 Get access to a machine with multiple GPUs ( in this case we tested with 4 A100 and A10s).
@@ -61,7 +60,7 @@ torchrun --nnodes 1 --nproc_per_node 8  recipes/quickstart/finetuning/finetuning
 This has been tested on 4 H100s GPUs.
 
 ```bash
- FSDP_CPU_RAM_EFFICIENT_LOADING=1 ACCELERATE_USE_FSDP=1 torchrun --nnodes 1 --nproc_per_node 4  finetuning.py --enable_fsdp  --quantization int4 --model_name /path_of_model_folder/70B  --mixed_precision False --low_cpu_fsdp --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
+ FSDP_CPU_RAM_EFFICIENT_LOADING=1 ACCELERATE_USE_FSDP=1 torchrun --nnodes 1 --nproc_per_node 4  finetuning.py --enable_fsdp  --quantization 4bit --model_name /path_of_model_folder/70B  --mixed_precision False --low_cpu_fsdp --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
 ```
 
 ### Fine-tuning using FSDP on 70B Model

+ 9 - 9
recipes/3p_integrations/lamini/text2sql_memory_tuning/meta_lamini.ipynb

@@ -145,7 +145,7 @@
     "class Args:\n",
     "    def __init__(self, \n",
     "                 max_examples=100, \n",
-    "                 sql_model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\", \n",
+    "                 sql_model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\", \n",
     "                 gold_file_name=\"gold-test-set.jsonl\",\n",
     "                 training_file_name=\"generated_queries.jsonl\",\n",
     "                 num_to_generate=10):\n",
@@ -197,7 +197,7 @@
     }
    ],
    "source": [
-    "llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\")\n",
+    "llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
     "\n",
     "question = \"\"\"Who is the highest paid NBA player?\"\"\"\n",
     "system = f\"\"\"You are an NBA analyst with 15 years of experience writing complex SQL queries. Consider the nba_roster table with the following schema:\n",
@@ -418,7 +418,7 @@
     "class ScoreStage(GenerationNode):\n",
     "    def __init__(self):\n",
     "        super().__init__(\n",
-    "            model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n",
+    "            model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
     "            max_new_tokens=150,\n",
     "        )\n",
     "\n",
@@ -712,7 +712,7 @@
     "class ModelStage(GenerationNode):\n",
     "    def __init__(self):\n",
     "        super().__init__(\n",
-    "            model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n",
+    "            model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
     "            max_new_tokens=300,\n",
     "        )\n",
     "\n",
@@ -808,7 +808,7 @@
     "class QuestionStage(GenerationNode):\n",
     "    def __init__(self):\n",
     "        super().__init__(\n",
-    "            model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n",
+    "            model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
     "            max_new_tokens=150,\n",
     "        )\n",
     "\n",
@@ -1055,7 +1055,7 @@
    ],
    "source": [
     "args = Args()\n",
-    "llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\")\n",
+    "llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
     "\n",
     "dataset = get_dataset(args, make_question)\n",
     "finetune_args = get_default_finetune_args()\n",
@@ -1601,7 +1601,7 @@
    ],
    "source": [
     "args = Args(training_file_name=\"archive/generated_queries_large_filtered_cleaned.jsonl\")\n",
-    "llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\")\n",
+    "llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
     "\n",
     "dataset = get_dataset(args, make_question)\n",
     "finetune_args = get_default_finetune_args()\n",
@@ -1798,7 +1798,7 @@
    ],
    "source": [
     "args = Args(training_file_name=\"generated_queries_v2.jsonl\")\n",
-    "llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\")\n",
+    "llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
     "\n",
     "dataset = get_dataset(args, make_question)\n",
     "finetune_args = get_default_finetune_args()\n",
@@ -1966,7 +1966,7 @@
    ],
    "source": [
     "args = Args(training_file_name=\"archive/generated_queries_v2_large_filtered_cleaned.jsonl\")\n",
-    "llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\")\n",
+    "llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
     "\n",
     "dataset = get_dataset(args, make_question)\n",
     "finetune_args = get_default_finetune_args()\n",

+ 1 - 1
recipes/3p_integrations/lamini/text2sql_memory_tuning/util/parse_arguments.py

@@ -16,7 +16,7 @@ def parse_arguments():
     parser.add_argument(
         "--sql-model-name",
         type=str,
-        default="meta-llama/Meta-Llama-3-8B-Instruct",
+        default="meta-llama/Meta-Llama-3.1-8B-Instruct",
         help="The model to use for text2sql",
         required=False,
     )

+ 9 - 9
recipes/3p_integrations/llama_on_prem.md

@@ -8,7 +8,7 @@ We'll use the Amazon EC2 instance running Ubuntu with an A10G 24GB GPU as an exa
 
 The Colab notebook to connect via LangChain with Llama 3 hosted as the vLLM and TGI API services is [here](https://colab.research.google.com/drive/1rYWLdgTGIU1yCHmRpAOB2D-84fPzmOJg), also shown in the sections below.
 
-This tutorial assumes that you you have been granted access to the Meta Llama 3 on Hugging Face - you can open a Hugging Face Meta model page [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to confirm that you see "Gated model You have been granted access to this model"; if you see "You need to agree to share your contact information to access this model", simply complete and submit the form in the page.
+This tutorial assumes that you you have been granted access to the Meta Llama 3 on Hugging Face - you can open a Hugging Face Meta model page [here](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) to confirm that you see "Gated model You have been granted access to this model"; if you see "You need to agree to share your contact information to access this model", simply complete and submit the form in the page.
 
 You'll also need your Hugging Face access token which you can get at your Settings page [here](https://huggingface.co/settings/tokens).
 
@@ -33,7 +33,7 @@ There are two ways to deploy Llama 3 via vLLM, as a general API server or an Ope
 Run the command below to deploy vLLM as a general Llama 3 service:
 
 ```
-python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct
+python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3.1-8B-Instruct
 ```
 
 Then on another terminal you can run:
@@ -68,13 +68,13 @@ Also, if you have multiple GPUs, you can add the `--tensor-parallel-size` argume
 git clone https://github.com/vllm-project/vllm
 cd vllm/vllm/entrypoints
 conda activate llama3
-python api_server.py --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 4
+python api_server.py --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 4
 ```
 
 With multiple GPUs, you can also run replica of models as long as your model size can fit into targeted GPU memory. For example, if you have two A10G with 24 GB memory, you can run two Llama 3 8B models at the same time. This can be done by launching two api servers each targeting specific CUDA cores on different ports:
-`CUDA_VISIBLE_DEVICES=0 python api_server.py --host 0.0.0.0 --port 5000  --model meta-llama/Meta-Llama-3-8B-Instruct`
+`CUDA_VISIBLE_DEVICES=0 python api_server.py --host 0.0.0.0 --port 5000  --model meta-llama/Meta-Llama-3.1-8B-Instruct`
 and
-`CUDA_VISIBLE_DEVICES=1 python api_server.py --host 0.0.0.0 --port 5001  --model meta-llama/Meta-Llama-3-8B-Instruct`
+`CUDA_VISIBLE_DEVICES=1 python api_server.py --host 0.0.0.0 --port 5001  --model meta-llama/Meta-Llama-3.1-8B-Instruct`
 The benefit would be that you can balance incoming requests to both models, reaching higher batch size processing for a trade-off of generation latency.
 
 
@@ -83,14 +83,14 @@ The benefit would be that you can balance incoming requests to both models, reac
 You can also deploy the vLLM hosted Llama 3 as an OpenAI-Compatible service to easily replace code using OpenAI API. First, run the command below:
 
 ```
-python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct
+python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3.1-8B-Instruct
 ```
 
 Then on another terminal, run:
 
 ```
 curl http://localhost:5000/v1/completions -H "Content-Type: application/json" -d '{
-        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
         "prompt": "Who wrote the book Innovators dilemma?",
         "max_tokens": 300,
         "temperature": 0
@@ -118,7 +118,7 @@ from langchain.llms import VLLMOpenAI
 llm = VLLMOpenAI(
     openai_api_key="EMPTY",
     openai_api_base="http://<vllm_server_ip_address>:5000/v1",
-    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
+    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
 )
 
 print(llm("Who wrote the book godfather?"))
@@ -136,7 +136,7 @@ You can now use the Llama 3 instance `llm` created this way in any of the demo a
 The easiest way to deploy Llama 3 with TGI is using its official docker image. First, replace `<your_hugging_face_access_token>` and set the three required shell variables (you may replace the `model` value above with another Llama 3 model):
 
 ```
-model=meta-llama/Meta-Llama-3-8B-Instruct
+model=meta-llama/Meta-Llama-3.1-8B-Instruct
 volume=$PWD/data
 token=<your_hugging_face_access_token>
 ```

+ 75 - 0
recipes/3p_integrations/vllm/README.md

@@ -0,0 +1,75 @@
+# Llama inference with vLLM
+
+This folder contains an example for running Llama inference on multiple-gpus in single- as well as multi-node scenarios using vLLM.
+
+## Prerequirements
+
+To run this example we will need to install vLLM as well as ray in case multi-node inference is the goal.
+
+```bash
+pip install vllm
+
+# For multi-node inference we also need to install ray
+pip install ray[default]
+```
+
+For the following examples we will assume that we fine-tuned a base model using the LoRA method and we have setup the following environment variables pointing to the base model as well as LoRA adapter:
+
+```bash
+export MODEL_PATH=/path/to/out/base/model
+export PEFT_MODEL_PATH=/path/to/out/peft/model
+```
+
+## Single-node multi-gpu inference
+To launch the inference simply execute the following command changing the tp_size parameter to the numbers of GPUs you have available:
+
+``` bash
+python inference.py --model_name $MODEL_PATH --peft_model_name $PEFT_MODEL_PATH --tp_size 8 --user_prompt "Hello my name is"
+```
+The script will ask for another prompt ina loop after completing the generation which you can exit by simply pressing enter and leaving the prompt empty.
+When using multiple gpus the model will automatically be split accross the available GPUs using tensor parallelism.
+
+## Multi-node multi-gpu inference
+The FP8 quantized variants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the script located in this folder.
+To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need multi-node inference.
+vLLM allows this by leveraging pipeline parallelism accros nodes while still applying tensor parallelism insid each node.
+To start a multi-node inference we first need to set up a ray serves which well be leveraged by vLLM to execute the model across node boundaries.
+
+```bash
+# On the head node we start the clustr as follows
+ray start --head
+
+# After the server starts it prints out a couple of lines including the command to add nodes to the cluster e.g.:
+# To add another node to this Ray cluster, run
+#   ray start --address='<head-node-ip-address>:6379'
+# Where the head node ip address will depend on your environment
+
+# We can then add the worker nodes by executing the command in a shell on the worker node
+ray start --address='<head-node-ip-address>:6379'
+
+# We can check if the cluster was launched successfully by executing this on any node
+ray status
+
+# It should show the number of nodes we have added as well as the head node
+# Node status
+# ---------------------------------------------------------------
+# Active:
+#  1 node_82143b740a25228c24dc8bb3a280b328910b2fcb1987eee52efb838b
+#  1 node_3f2c673530de5de86f953771538f35437ab60e3cacd7730dbca41719
+```
+
+To launch the inference we can then execute the inference script while we adapt pp_size and tp_size to our environment.
+
+```
+pp_size - number of worker + head nodes
+
+tp_size - number of GPUs per node
+```
+
+If our environment consists of two nodes with 8 GPUs each we would execute:
+```bash
+python inference.py --model_name $MODEL_PATH --peft_model_name $PEFT_MODEL_PATH --pp_size 2 --tp_size 8 --user_prompt "Hello my name is"
+```
+
+The launch of the vLLM engine will take some time depending on your environment as each worker will need to load the checkpoint files to extract its fraction of the weights.
+and even if it seem to hang

+ 35 - 12
recipes/3p_integrations/vllm/inference.py

@@ -1,11 +1,13 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
+import uuid
+import asyncio
 import fire
 
 import torch
-from vllm import LLM
-from vllm import LLM, SamplingParams
+from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
+from vllm.lora.request import LoRARequest
 from accelerate.utils import is_xpu_available
 
 if is_xpu_available():
@@ -15,13 +17,24 @@ else:
 
 torch.manual_seed(42)
 
-def load_model(model_name, tp_size=1):
+def load_model(model_name, peft_model=None, pp_size=1, tp_size=1):
+    additional_configs = {}
+    if peft_model:
+        additional_configs["enable_lora"] = True
+        
+    engine_config = AsyncEngineArgs(
+        model=model_name,
+        pipeline_parallel_size=pp_size,
+        tensor_parallel_size=tp_size,
+        max_loras=1,
+        **additional_configs)
 
-    llm = LLM(model_name, tensor_parallel_size=tp_size)
+    llm = AsyncLLMEngine.from_engine_args(engine_config)
     return llm
 
-def main(
+async def main(
     model,
+    peft_model_name=None,
     max_new_tokens=100,
     user_prompt=None,
     top_p=0.9,
@@ -35,26 +48,36 @@ def main(
 
         print(f"sampling params: top_p {top_p} and temperature {temperature} for this inference request")
         sampling_param = SamplingParams(top_p=top_p, temperature=temperature, max_tokens=max_new_tokens)
-        
 
-        outputs = model.generate(user_prompt, sampling_params=sampling_param)
+        lora_request = None
+        if peft_model_name:
+            lora_request = LoRARequest("lora",0,peft_model_name)
+
+        req_id = str(uuid.uuid4())
+
+        generator = model.generate(user_prompt, sampling_param, req_id, lora_request=lora_request)
+        output = None
+        async for request_output in generator:
+            output = request_output
    
-        print(f"model output:\n {user_prompt} {outputs[0].outputs[0].text}")
+        print(f"model output:\n {user_prompt} {output.outputs[0].text}")
         user_prompt = input("Enter next prompt (press Enter to exit): ")
         if not user_prompt:
             break
 
 def run_script(
     model_name: str,
-    peft_model=None,
-    tp_size=1,
+    peft_model_name=None,
+    pp_size : int = 1,
+    tp_size : int = 1,
     max_new_tokens=100,
     user_prompt=None,
     top_p=0.9,
     temperature=0.8
 ):
-    model = load_model(model_name, tp_size)
-    main(model, max_new_tokens, user_prompt, top_p, temperature)
+    model = load_model(model_name, peft_model_name, pp_size, tp_size)
+
+    asyncio.get_event_loop().run_until_complete(main(model, peft_model_name, max_new_tokens, user_prompt, top_p, temperature))
 
 if __name__ == "__main__":
     fire.Fire(run_script)

+ 2 - 2
recipes/quickstart/Running_Llama3_Anywhere/Running_Llama_on_HF_transformers.ipynb

@@ -92,7 +92,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Then, we will set the model variable to a specific model we’d like to use. In this demo, we will use the 8b chat model `meta-llama/Meta-Llama-3-8B-Instruct`. Using Meta models from Hugging Face requires you to\n",
+    "Then, we will set the model variable to a specific model we’d like to use. In this demo, we will use the 8b chat model `meta-llama/Meta-Llama-3.1-8B-Instruct`. Using Meta models from Hugging Face requires you to\n",
     "\n",
     "1. Accept Terms of Service for Meta Llama 3 on Meta [website](https://llama.meta.com/llama-downloads).\n",
     "2. Use the same email address from Step (1) to login into Hugging Face.\n",
@@ -125,7 +125,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "model = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n",
+    "model = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n",
     "tokenizer = AutoTokenizer.from_pretrained(model)"
    ]
   },

+ 1 - 3
recipes/quickstart/finetuning/LLM_finetuning_overview.md

@@ -1,6 +1,6 @@
 ## LLM Fine-Tuning
 
-Here we discuss fine-tuning Meta Llama 3 with a couple of different recipes. We will cover two scenarios here:
+Here we discuss fine-tuning Meta Llama with a couple of different recipes. We will cover two scenarios here:
 
 
 ## 1. **Parameter Efficient Model Fine-Tuning**
@@ -18,8 +18,6 @@ These methods will address three aspects:
 
 HF [PEFT](https://github.com/huggingface/peft) library provides an easy way of using these methods which we make use of here. Please read more [here](https://huggingface.co/blog/peft).
 
-
-
 ## 2. **Full/ Partial Parameter Fine-Tuning**
 
 Full parameter fine-tuning has its own advantages, in this method there are multiple strategies that can help:

+ 25 - 0
recipes/quickstart/finetuning/multigpu_finetuning.md

@@ -68,7 +68,32 @@ If you are running full parameter fine-tuning on the 70B model, you can enable `
 torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
 ```
 
+**Multi GPU multi node**:
 
+Here we use a slurm script to schedule a job with slurm over multiple nodes.
+
+```bash
+
+sbatch recipes/quickstart/finetuning/multi_node.slurm
+# Change the num nodes and GPU per nodes in the script before running.
+
+```
+
+To fine-tune the Meta Llama 405B model with LoRA on 32xH100, 80 GB GPUs we need to combine 4bit quantization (QLoRA) and FSDP.
+We can achieve this by adding the following environment variables to the slurm script (before the srun command in the bottom).
+
+```bash
+export FSDP_CPU_RAM_EFFICIENT_LOADING=1
+export ACCELERATE_USE_FSDP=1 
+```
+
+Then we need to replace the bottom srun command with the following:
+
+```bash
+srun  torchrun --nproc_per_node 8 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 ./finetuning.py  --enable_fsdp --use_peft --peft_method lora --quantization 4bit  --quantization_config.quant_type nf4 --mixed_precision False --low_cpu_fsdp
+```
+
+Do not forget to adjust the number of nodes, ntasks and gpus-per-task in the top.
 
 ## Running with different datasets
 Currently 3 open source datasets are supported that can be found in [Datasets config file](../../../src/llama_recipes/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)).

+ 1 - 1
recipes/quickstart/finetuning/quickstart_peft_finetuning.ipynb

@@ -90,7 +90,7 @@
     "from llama_recipes.configs import train_config as TRAIN_CONFIG\n",
     "\n",
     "train_config = TRAIN_CONFIG()\n",
-    "train_config.model_name = \"meta-llama/Meta-Llama-3-8B\"\n",
+    "train_config.model_name = \"meta-llama/Meta-Llama-3.1-8B\"\n",
     "train_config.num_epochs = 1\n",
     "train_config.run_validation = False\n",
     "train_config.gradient_accumulation_steps = 4\n",

+ 4 - 3
recipes/quickstart/inference/README.md

@@ -2,6 +2,7 @@
 
 This folder contains scripts to get you started with inference on Meta Llama models.
 
-* [](./code_llama/) contains scripts for tasks relating to code generation using CodeLlama
-* [](./local_inference/) contsin scripts to do memory efficient inference on servers and local machines
-* [](./mobile_inference/) has scripts using MLC to serve Llama on Android (h/t to OctoAI for the contribution!)
+* [Code Llama](./code_llama/) contains scripts for tasks relating to code generation using CodeLlama
+* [Local Inference](./local_inference/) contains scripts to do memory efficient inference on servers and local machines
+* [Mobile Inference](./mobile_inference/) has scripts using MLC to serve Llama on Android (h/t to OctoAI for the contribution!)
+* [Model Update Example](./modelUpgradeExample.py) shows an example of replacing a Llama 3 model with a Llama 3.1 model. 

+ 7 - 4
recipes/quickstart/inference/local_inference/README.md

@@ -27,8 +27,8 @@ samsum_prompt.txt
 ...
 ```
 
-**Note**
-Currently pad token by default in [HuggingFace Tokenizer is `None`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py#L110). We add the padding token as a special token to the tokenizer, which in this case requires to resize the token_embeddings as shown below:
+**Note on Llama version < 3.1**
+The default padding token in [HuggingFace Tokenizer is `None`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py#L110). To use padding the padding token needs to be added as a special token to the tokenizer, which in this case requires to resize the token_embeddings as shown below:
 
 ```python
 tokenizer.add_special_tokens(
@@ -39,8 +39,7 @@ tokenizer.add_special_tokens(
     )
 model.resize_token_embeddings(model.config.vocab_size + 1)
 ```
-Padding would be required for batch inference. In this this [example](inference.py), batch size = 1 so essentially padding is not required. However,We added the code pointer as an example in case of batch inference.
-
+Padding would be required for batched inference. In this [example](inference.py), batch size = 1 so essentially padding is not required. However, we added the code pointer as an example in case of batch inference. For Llama version 3.1 use the special token `<|finetune_right_pad_id|> (128004)` for padding.
 
 ## Chat completion
 The inference folder also includes a chat completion example, that adds built-in safety features in fine-tuned models to the prompt tokens. To run the example:
@@ -85,3 +84,7 @@ Then run inference using:
 python inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file>
 
 ```
+
+## Inference on large models like Meta Llama 405B
+The FP8 quantized variants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the scripts located in this folder.
+To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need to use a multi-node setup for inference. The llama-recipes inference script currently does not allow multi-node inference. To run this model you can use vLLM with pipeline and tensor parallelism as showed in [this example](../../../3p_integrations/vllm/README.md).

+ 13 - 11
recipes/quickstart/inference/local_inference/chat_completion/chat_completion.py

@@ -4,6 +4,7 @@
 # from accelerate import init_empty_weights, load_checkpoint_and_dispatch
 
 import fire
+import json
 import os
 import sys
 
@@ -18,7 +19,7 @@ from accelerate.utils import is_xpu_available
 def main(
     model_name,
     peft_model: str=None,
-    quantization: bool=False,
+    quantization: str = None, # Options: 4bit, 8bit
     max_new_tokens =256, #The maximum numbers of tokens to generate
     min_new_tokens:int=0, #The minimum numbers of tokens to generate
     prompt_file: str=None,
@@ -47,33 +48,32 @@ def main(
 
     elif not sys.stdin.isatty():
         dialogs = "\n".join(sys.stdin.readlines())
+        try:
+            dialogs = json.loads(dialogs)
+        except:
+            print("Could not parse json from stdin. Please provide a json file with the user prompts. Exiting.")
+            sys.exit(1)
     else:
         print("No user prompt provided. Exiting.")
         sys.exit(1)
 
     print(f"User dialogs:\n{dialogs}")
     print("\n==================================\n")
-
-
+    
     # Set the seeds for reproducibility
     if is_xpu_available():
         torch.xpu.manual_seed(seed)
     else:
         torch.cuda.manual_seed(seed)
     torch.manual_seed(seed)
-    model = load_model(model_name, quantization, use_fast_kernels)
+
+    model = load_model(model_name, quantization, use_fast_kernels, **kwargs)
     if peft_model:
         model = load_peft_model(model, peft_model)
 
     tokenizer = AutoTokenizer.from_pretrained(model_name)
-    tokenizer.add_special_tokens(
-        {
-
-            "pad_token": "<PAD>",
-        }
-    )
 
-    chats = tokenizer.apply_chat_template(dialogs)
+    chats = [tokenizer.apply_chat_template(dialog) for dialog in dialogs]
 
     with torch.no_grad():
         for idx, chat in enumerate(chats):
@@ -99,12 +99,14 @@ def main(
                 sys.exit(1)  # Exit the program with an error status
             tokens= torch.tensor(chat).long()
             tokens= tokens.unsqueeze(0)
+            attention_mask = torch.ones_like(tokens)
             if is_xpu_available():
                 tokens= tokens.to("xpu:0")
             else:
                 tokens= tokens.to("cuda:0")
             outputs = model.generate(
                 input_ids=tokens,
+                attention_mask=attention_mask,
                 max_new_tokens=max_new_tokens,
                 do_sample=do_sample,
                 top_p=top_p,

+ 145 - 128
recipes/quickstart/inference/local_inference/inference.py

@@ -1,68 +1,46 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
 
-# from accelerate import init_empty_weights, load_checkpoint_and_dispatch
-
-import fire
 import os
 import sys
 import time
+
+import fire
 import gradio as gr
 
 import torch
-from transformers import AutoTokenizer
 
-from llama_recipes.inference.safety_utils import get_safety_checker, AgentType
+from accelerate.utils import is_xpu_available
 from llama_recipes.inference.model_utils import load_model, load_peft_model
 
-from accelerate.utils import is_xpu_available
+from llama_recipes.inference.safety_utils import AgentType, get_safety_checker
+from transformers import AutoTokenizer
+
 
 def main(
     model_name,
-    peft_model: str=None,
-    quantization: bool=False,
-    max_new_tokens =100, #The maximum numbers of tokens to generate
-    prompt_file: str=None,
-    seed: int=42, #seed value for reproducibility
-    do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise.
-    min_length: int=None, #The minimum length of the sequence to be generated, input prompt + min_new_tokens
-    use_cache: bool=True,  #[optional] Whether or not the model should use the past last key/values attentions Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
-    top_p: float=1.0, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
-    temperature: float=1.0, # [optional] The value used to modulate the next token probabilities.
-    top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
-    repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
-    length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
-    enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
-    enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
-    enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
-    enable_llamaguard_content_safety: bool=False,
-    max_padding_length: int=None, # the max padding length to be used with tokenizer padding the prompts.
-    use_fast_kernels: bool = False, # Enable using SDPA from PyTroch Accelerated Transformers, make use Flash Attention and Xformer memory-efficient kernels
-    **kwargs
+    peft_model: str = None,
+    quantization: str = None, # Options: 4bit, 8bit
+    max_new_tokens=100,  # The maximum numbers of tokens to generate
+    prompt_file: str = None,
+    seed: int = 42,  # seed value for reproducibility
+    do_sample: bool = True,  # Whether or not to use sampling ; use greedy decoding otherwise.
+    min_length: int = None,  # The minimum length of the sequence to be generated, input prompt + min_new_tokens
+    use_cache: bool = True,  # [optional] Whether or not the model should use the past last key/values attentions Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
+    top_p: float = 1.0,  # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
+    temperature: float = 1.0,  # [optional] The value used to modulate the next token probabilities.
+    top_k: int = 50,  # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
+    repetition_penalty: float = 1.0,  # The parameter for repetition penalty. 1.0 means no penalty.
+    length_penalty: int = 1,  # [optional] Exponential penalty to the length that is used with beam-based generation.
+    enable_azure_content_safety: bool = False,  # Enable safety check with Azure content safety api
+    enable_sensitive_topics: bool = False,  # Enable check for sensitive topics using AuditNLG APIs
+    enable_salesforce_content_safety: bool = True,  # Enable safety check with Salesforce safety flan t5
+    enable_llamaguard_content_safety: bool = False,
+    max_padding_length: int = None,  # the max padding length to be used with tokenizer padding the prompts.
+    use_fast_kernels: bool = False,  # Enable using SDPA from PyTroch Accelerated Transformers, make use Flash Attention and Xformer memory-efficient kernels
+    share_gradio: bool = False,  # Enable endpoint creation for gradio.live
+    **kwargs,
 ):
-
-  def inference(user_prompt, temperature, top_p, top_k, max_new_tokens, **kwargs,):
-    safety_checker = get_safety_checker(enable_azure_content_safety,
-                                        enable_sensitive_topics,
-                                        enable_salesforce_content_safety,
-                                        enable_llamaguard_content_safety
-                                        )
-
-    # Safety check of the user prompt
-    safety_results = [check(user_prompt) for check in safety_checker]
-    are_safe = all([r[1] for r in safety_results])
-    if are_safe:
-        print("User prompt deemed safe.")
-        print(f"User prompt:\n{user_prompt}")
-    else:
-        print("User prompt deemed unsafe.")
-        for method, is_safe, report in safety_results:
-            if not is_safe:
-                print(method)
-                print(report)
-        print("Skipping the inference as the prompt is not safe.")
-        sys.exit(1)  # Exit the program with an error status
-
     # Set the seeds for reproducibility
     if is_xpu_available():
         torch.xpu.manual_seed(seed)
@@ -70,7 +48,7 @@ def main(
         torch.cuda.manual_seed(seed)
     torch.manual_seed(seed)
 
-    model = load_model(model_name, quantization, use_fast_kernels)
+    model = load_model(model_name, quantization, use_fast_kernels, **kwargs)
     if peft_model:
         model = load_peft_model(model, peft_model)
 
@@ -79,86 +57,125 @@ def main(
     tokenizer = AutoTokenizer.from_pretrained(model_name)
     tokenizer.pad_token = tokenizer.eos_token
 
-    batch = tokenizer(user_prompt, padding='max_length', truncation=True, max_length=max_padding_length, return_tensors="pt")
-    if is_xpu_available():
-        batch = {k: v.to("xpu") for k, v in batch.items()}
-    else:
-        batch = {k: v.to("cuda") for k, v in batch.items()}
-
-    start = time.perf_counter()
-    with torch.no_grad():
-        outputs = model.generate(
-            **batch,
-            max_new_tokens=max_new_tokens,
-            do_sample=do_sample,
-            top_p=top_p,
-            temperature=temperature,
-            min_length=min_length,
-            use_cache=use_cache,
-            top_k=top_k,
-            repetition_penalty=repetition_penalty,
-            length_penalty=length_penalty,
-            **kwargs
+    def inference(
+        user_prompt,
+        temperature,
+        top_p,
+        top_k,
+        max_new_tokens,
+        **kwargs,
+    ):
+        safety_checker = get_safety_checker(
+            enable_azure_content_safety,
+            enable_sensitive_topics,
+            enable_salesforce_content_safety,
+            enable_llamaguard_content_safety,
         )
-    e2e_inference_time = (time.perf_counter()-start)*1000
-    print(f"the inference time is {e2e_inference_time} ms")
-    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
-
-    # Safety check of the model output
-    safety_results = [check(output_text, agent_type=AgentType.AGENT, user_prompt=user_prompt) for check in safety_checker]
-    are_safe = all([r[1] for r in safety_results])
-    if are_safe:
-        print("User input and model output deemed safe.")
-        print(f"Model output:\n{output_text}")
-    else:
-        print("Model output deemed unsafe.")
-        for method, is_safe, report in safety_results:
-            if not is_safe:
-                print(method)
-                print(report)
-    return output_text
-
-  if prompt_file is not None:
-      assert os.path.exists(
-          prompt_file
-      ), f"Provided Prompt file does not exist {prompt_file}"
-      with open(prompt_file, "r") as f:
-          user_prompt = "\n".join(f.readlines())
-      inference(user_prompt, temperature, top_p, top_k, max_new_tokens)
-  elif not sys.stdin.isatty():
-      user_prompt = "\n".join(sys.stdin.readlines())
-      inference(user_prompt, temperature, top_p, top_k, max_new_tokens)
-  else:
-      gr.Interface(
-        fn=inference,
-        inputs=[
-            gr.components.Textbox(
-                lines=9,
-                label="User Prompt",
-                placeholder="none",
-            ),
-            gr.components.Slider(
-                minimum=0, maximum=1, value=1.0, label="Temperature"
-            ),
-            gr.components.Slider(
-                minimum=0, maximum=1, value=1.0, label="Top p"
-            ),
-            gr.components.Slider(
-                minimum=0, maximum=100, step=1, value=50, label="Top k"
-            ),
-            gr.components.Slider(
-                minimum=1, maximum=2000, step=1, value=200, label="Max tokens"
-            ),
-        ],
-        outputs=[
-            gr.components.Textbox(
-                lines=5,
-                label="Output",
+
+        # Safety check of the user prompt
+        safety_results = [check(user_prompt) for check in safety_checker]
+        are_safe = all([r[1] for r in safety_results])
+        if are_safe:
+            print("User prompt deemed safe.")
+            print(f"User prompt:\n{user_prompt}")
+        else:
+            print("User prompt deemed unsafe.")
+            for method, is_safe, report in safety_results:
+                if not is_safe:
+                    print(method)
+                    print(report)
+            print("Skipping the inference as the prompt is not safe.")
+            return  # Exit the program with an error status
+
+        batch = tokenizer(
+            user_prompt,
+            padding="max_length",
+            truncation=True,
+            max_length=max_padding_length,
+            return_tensors="pt",
+        )
+        if is_xpu_available():
+            batch = {k: v.to("xpu") for k, v in batch.items()}
+        else:
+            batch = {k: v.to("cuda") for k, v in batch.items()}
+
+        start = time.perf_counter()
+        with torch.no_grad():
+            outputs = model.generate(
+                **batch,
+                max_new_tokens=max_new_tokens,
+                do_sample=do_sample,
+                top_p=top_p,
+                temperature=temperature,
+                min_length=min_length,
+                use_cache=use_cache,
+                top_k=top_k,
+                repetition_penalty=repetition_penalty,
+                length_penalty=length_penalty,
+                **kwargs,
             )
-        ],
-        title="Meta Llama3 Playground",
-        description="https://github.com/facebookresearch/llama-recipes",
-      ).queue().launch(server_name="0.0.0.0", share=True)
+        e2e_inference_time = (time.perf_counter() - start) * 1000
+        print(f"the inference time is {e2e_inference_time} ms")
+        output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+        # Safety check of the model output
+        safety_results = [
+            check(output_text, agent_type=AgentType.AGENT, user_prompt=user_prompt)
+            for check in safety_checker
+        ]
+        are_safe = all([r[1] for r in safety_results])
+        if are_safe:
+            print("User input and model output deemed safe.")
+            print(f"Model output:\n{output_text}")
+            return output_text
+        else:
+            print("Model output deemed unsafe.")
+            for method, is_safe, report in safety_results:
+                if not is_safe:
+                    print(method)
+                    print(report)
+            return None
+
+    if prompt_file is not None:
+        assert os.path.exists(
+            prompt_file
+        ), f"Provided Prompt file does not exist {prompt_file}"
+        with open(prompt_file, "r") as f:
+            user_prompt = "\n".join(f.readlines())
+        inference(user_prompt, temperature, top_p, top_k, max_new_tokens)
+    elif not sys.stdin.isatty():
+        user_prompt = "\n".join(sys.stdin.readlines())
+        inference(user_prompt, temperature, top_p, top_k, max_new_tokens)
+    else:
+        gr.Interface(
+            fn=inference,
+            inputs=[
+                gr.components.Textbox(
+                    lines=9,
+                    label="User Prompt",
+                    placeholder="none",
+                ),
+                gr.components.Slider(
+                    minimum=0, maximum=1, value=1.0, label="Temperature"
+                ),
+                gr.components.Slider(minimum=0, maximum=1, value=1.0, label="Top p"),
+                gr.components.Slider(
+                    minimum=0, maximum=100, step=1, value=50, label="Top k"
+                ),
+                gr.components.Slider(
+                    minimum=1, maximum=2000, step=1, value=200, label="Max tokens"
+                ),
+            ],
+            outputs=[
+                gr.components.Textbox(
+                    lines=5,
+                    label="Output",
+                )
+            ],
+            title="Meta Llama3 Playground",
+            description="https://github.com/meta-llama/llama-recipes",
+        ).queue().launch(server_name="0.0.0.0", share=share_gradio)
+
 
 if __name__ == "__main__":
     fire.Fire(main)

+ 51 - 0
recipes/quickstart/inference/modelUpgradeExample.py

@@ -0,0 +1,51 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
+
+# Running the script without any arguments "python modelUpgradeExample.py" performs inference with the Llama 3 8B Instruct model. 
+# Passing  --model-id "meta-llama/Meta-Llama-3.1-8B-Instruct" to the script will switch it to using the Llama 3.1 version of the same model. 
+# The script also shows the input tokens to confirm that the models are responding to the same input
+
+import fire
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+
+def main(model_id = "meta-llama/Meta-Llama-3-8B-Instruct"):
+    tokenizer = AutoTokenizer.from_pretrained(model_id)
+    model = AutoModelForCausalLM.from_pretrained(
+        model_id,
+        torch_dtype=torch.bfloat16,
+        device_map="auto",
+    )
+
+    messages = [
+        {"role": "system", "content": "You are a helpful chatbot"},
+        {"role": "user", "content": "Why is the sky blue?"},
+        {"role": "assistant", "content": "Because the light is scattered"},
+        {"role": "user", "content": "Please tell me more about that"},
+    ]
+
+    input_ids = tokenizer.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        return_tensors="pt",
+    ).to(model.device)
+
+    print("Input tokens:")
+    print(input_ids)
+    
+    attention_mask = torch.ones_like(input_ids)
+    outputs = model.generate(
+        input_ids,
+        max_new_tokens=400,
+        eos_token_id=tokenizer.eos_token_id,
+        do_sample=True,
+        temperature=0.6,
+        top_p=0.9,
+        attention_mask=attention_mask,
+    )
+    response = outputs[0][input_ids.shape[-1]:]
+    print("\nOutput:\n")
+    print(tokenizer.decode(response, skip_special_tokens=True))
+
+if __name__ == "__main__":
+  fire.Fire(main)

+ 10 - 7
recipes/responsible_ai/README.md

@@ -1,11 +1,14 @@
-# Meta Llama Guard
+# Trust and Safety with Llama
 
-Meta Llama Guard and Meta Llama Guard 2 are new models that provide input and output guardrails for LLM inference. For more details, please visit the main [repository](https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard2).
+The [Purple Llama](https://github.com/meta-llama/PurpleLlama/) project provides tools and models to improve LLM security. This folder contains examples to get started with PurpleLlama tools.
 
-**Note** Please find the right model on HF side [here](https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B).
+| Tool/Model | Description | Get Started
+|---|---|---|
+[Llama Guard](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama-guard-3) | Provide guardrailing on inputs and outputs | [Inference](./llama_guard/inference.py), [Finetuning](./llama_guard/llama_guard_customization_via_prompting_and_fine_tuning.ipynb)
+[Prompt Guard](https://llama.meta.com/docs/model-cards-and-prompt-formats/prompt-guard) | Model to safeguards against jailbreak attempts and embedded prompt injections | [Notebook](./prompt_guard/prompt_guard_tutorial.ipynb)
+[Code Shield](https://github.com/meta-llama/PurpleLlama/tree/main/CodeShield) | Tool to safeguard against insecure code generated by the LLM | [Notebook](https://github.com/meta-llama/PurpleLlama/blob/main/CodeShield/notebook/CodeShieldUsageDemo.ipynb)
 
-### Running locally
-The [llama_guard](llama_guard) folder contains the inference script to run Meta Llama Guard locally. Add test prompts directly to the [inference script](llama_guard/inference.py) before running it.
 
-### Running on the cloud
-The notebooks [Purple_Llama_Anyscale](Purple_Llama_Anyscale.ipynb) & [Purple_Llama_OctoAI](Purple_Llama_OctoAI.ipynb) contain examples for running Meta Llama Guard on cloud hosted endpoints.
+
+### Running on hosted APIs
+The notebooks [input_output_guardrails.ipynb](./input_output_guardrails_with_llama.ipynb),  [Purple_Llama_Anyscale](Purple_Llama_Anyscale.ipynb) & [Purple_Llama_OctoAI](Purple_Llama_OctoAI.ipynb) contain examples for running Meta Llama Guard on cloud hosted endpoints.

+ 6 - 3
recipes/responsible_ai/llama_guard/README.md

@@ -1,6 +1,6 @@
 # Meta Llama Guard demo
 <!-- markdown-link-check-disable -->
-Meta Llama Guard is a language model that provides input and output guardrails for LLM inference. For more details and model cards, please visit the main repository for each model, [Meta Llama Guard](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard) and Meta [Llama Guard 2](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard2).
+Meta Llama Guard is a language model that provides input and output guardrails for LLM inference. For more details and model cards, please visit the [PurpleLlama](https://github.com/meta-llama/PurpleLlama) repository.
 
 This folder contains an example file to run inference with a locally hosted model, either using the Hugging Face Hub or a local path.
 
@@ -55,9 +55,9 @@ This is the output:
 
 To run it with a local model, you can use the `model_id` param in the inference script:
 
-`python recipes/responsible_ai/llama_guard/inference.py --model_id=/home/ubuntu/models/llama3/llama_guard_2-hf/ --llama_guard_version=LLAMA_GUARD_2`
+`python recipes/responsible_ai/llama_guard/inference.py --model_id=/home/ubuntu/models/llama3/Llama-Guard-3-8B/ --llama_guard_version=LLAMA_GUARD_3`
 
-Note: Make sure to also add the llama_guard_version if when it does not match the default, the script allows you to run the prompt format from Meta Llama Guard 1 on Meta Llama Guard 2
+Note: Make sure to also add the llama_guard_version; by default it uses LLAMA_GUARD_3
 
 ## Inference Safety Checker
 When running the regular inference script with prompts, Meta Llama Guard will be used as a safety checker on the user prompt and the model output. If both are safe, the result will be shown, else a message with the error will be shown, with the word unsafe and a comma separated list of categories infringed. Meta Llama Guard is always loaded quantized using Hugging Face Transformers library with bitsandbytes.
@@ -67,3 +67,6 @@ In this case, the default categories are applied by the tokenizer, using the `ap
 Use this command for testing with a quantized Llama model, modifying the values accordingly:
 
 `python examples/inference.py --model_name <path_to_regular_llama_model> --prompt_file <path_to_prompt_file> --quantization 8bit --enable_llamaguard_content_safety`
+
+## Llama Guard 3 Finetuning & Customization
+The safety categories in Llama Guard 3 can be tuned for specific application needs. Existing categories can be removed and new categories can be added to the taxonomy. The [Llama Guard Customization](./llama_guard_customization_via_prompting_and_fine_tuning.ipynb) notebook walks through the process.

+ 2 - 2
recipes/responsible_ai/llama_guard/inference.py

@@ -14,8 +14,8 @@ class AgentType(Enum):
     USER = "User"
 
 def main(
-    model_id: str = "meta-llama/LlamaGuard-7b",
-    llama_guard_version: LlamaGuardVersion = LlamaGuardVersion.LLAMA_GUARD_1
+    model_id: str = "meta-llama/Llama-Guard-3-8B",
+    llama_guard_version: str = "LLAMA_GUARD_3"
 ):
     """
     Entry point for Llama Guard inference sample script.

Tiedoston diff-näkymää rajattu, sillä se on liian suuri
+ 793 - 0
recipes/responsible_ai/llama_guard/llama_guard_customization_via_prompting_and_fine_tuning.ipynb


+ 11 - 0
recipes/responsible_ai/prompt_guard/README.md

@@ -0,0 +1,11 @@
+# Prompt Guard demo
+<!-- markdown-link-check-disable -->
+Prompt Guard is a classifier model that provides input guardrails for LLM inference, particularly against *prompt attacks. For more details and model cards, please visit the main repository, [Meta Prompt Guard](https://github.com/meta-llama/PurpleLlama/tree/main/Prompt-Guard)
+
+This folder contains an example file to run inference with a locally hosted model, either using the Hugging Face Hub or a local path. It also contains a comprehensive demo demonstrating the scenarios in which the model is effective and a script for fine-tuning the model.
+
+This is a very small model and inference and fine-tuning are feasible on local CPUs.
+
+## Requirements
+1. Access to Prompt Guard model weights on Hugging Face. To get access, follow the steps described [here](https://github.com/facebookresearch/PurpleLlama/tree/main/Prompt-Guard#download)
+2. Llama recipes package and it's dependencies [installed](https://github.com/meta-llama/llama-recipes?tab=readme-ov-file#installing)

+ 0 - 0
recipes/responsible_ai/prompt_guard/__init__.py


+ 180 - 0
recipes/responsible_ai/prompt_guard/inference.py

@@ -0,0 +1,180 @@
+import torch
+from torch.nn.functional import softmax
+
+from transformers import (
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
+)
+
+"""
+Utilities for loading the PromptGuard model and evaluating text for jailbreaks and indirect injections.
+
+Note that the underlying model has a maximum recommended input size of 512 tokens as a DeBERTa model.
+The final two functions in this file implement efficient parallel batched evaluation of the model on a list
+of input strings of arbirary length, with the final score for each input being the maximum score across all
+chunks of the input string.
+"""
+
+
+def load_model_and_tokenizer(model_name='meta-llama/Prompt-Guard-86M'):
+    """
+    Load the PromptGuard model from Hugging Face or a local model.
+    
+    Args:
+        model_name (str): The name of the model to load. Default is 'meta-llama/Prompt-Guard-86M'.
+        
+    Returns:
+        transformers.PreTrainedModel: The loaded model.
+    """
+    model = AutoModelForSequenceClassification.from_pretrained(model_name)
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    return model, tokenizer
+
+
+def get_class_probabilities(model, tokenizer, text, temperature=1.0, device='cpu'):
+    """
+    Evaluate the model on the given text with temperature-adjusted softmax.
+    Note, as this is a DeBERTa model, the input text should have a maximum length of 512.
+    
+    Args:
+        text (str): The input text to classify.
+        temperature (float): The temperature for the softmax function. Default is 1.0.
+        device (str): The device to evaluate the model on.
+        
+    Returns:
+        torch.Tensor: The probability of each class adjusted by the temperature.
+    """
+    # Encode the text
+    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
+    inputs = inputs.to(device)
+    # Get logits from the model
+    with torch.no_grad():
+        logits = model(**inputs).logits
+    # Apply temperature scaling
+    scaled_logits = logits / temperature
+    # Apply softmax to get probabilities
+    probabilities = softmax(scaled_logits, dim=-1)
+    return probabilities
+
+
+def get_jailbreak_score(model, tokenizer, text, temperature=1.0, device='cpu'):
+    """
+    Evaluate the probability that a given string contains malicious jailbreak or prompt injection.
+    Appropriate for filtering dialogue between a user and an LLM.
+    
+    Args:
+        text (str): The input text to evaluate.
+        temperature (float): The temperature for the softmax function. Default is 1.0.
+        device (str): The device to evaluate the model on.
+        
+    Returns:
+        float: The probability of the text containing malicious content.
+    """
+    probabilities = get_class_probabilities(model, tokenizer, text, temperature, device)
+    return probabilities[0, 2].item()
+
+
+def get_indirect_injection_score(model, tokenizer, text, temperature=1.0, device='cpu'):
+    """
+    Evaluate the probability that a given string contains any embedded instructions (malicious or benign).
+    Appropriate for filtering third party inputs (e.g. web searches, tool outputs) into an LLM.
+    
+    Args:
+        text (str): The input text to evaluate.
+        temperature (float): The temperature for the softmax function. Default is 1.0.
+        device (str): The device to evaluate the model on.
+        
+    Returns:
+        float: The combined probability of the text containing malicious or embedded instructions.
+    """
+    probabilities = get_class_probabilities(model, tokenizer, text, temperature, device)
+    return (probabilities[0, 1] + probabilities[0, 2]).item()
+
+
+def process_text_batch(model, tokenizer, texts, temperature=1.0, device='cpu'):
+    """
+    Process a batch of texts and return their class probabilities.
+    Args:
+        model (transformers.PreTrainedModel): The loaded model.
+        tokenizer (transformers.PreTrainedTokenizer): The tokenizer for the model.
+        texts (list[str]): A list of texts to process.
+        temperature (float): The temperature for the softmax function.
+        device (str): The device to evaluate the model on.
+        
+    Returns:
+        torch.Tensor: A tensor containing the class probabilities for each text in the batch.
+    """
+    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
+    inputs = inputs.to(device)
+    with torch.no_grad():
+        logits = model(**inputs).logits
+    scaled_logits = logits / temperature
+    probabilities = softmax(scaled_logits, dim=-1)
+    return probabilities
+
+
+def get_scores_for_texts(model, tokenizer, texts, score_indices, temperature=1.0, device='cpu', max_batch_size=16):
+    """
+    Compute scores for a list of texts, handling texts of arbitrary length by breaking them into chunks and processing in parallel.
+    Args:
+        model (transformers.PreTrainedModel): The loaded model.
+        tokenizer (transformers.PreTrainedTokenizer): The tokenizer for the model.
+        texts (list[str]): A list of texts to evaluate.
+        score_indices (list[int]): Indices of scores to sum for final score calculation.
+        temperature (float): The temperature for the softmax function.
+        device (str): The device to evaluate the model on.
+        max_batch_size (int): The maximum number of text chunks to process in a single batch.
+        
+    Returns:
+        list[float]: A list of scores for each text.
+    """
+    all_chunks = []
+    text_indices = []
+    for index, text in enumerate(texts):
+        chunks = [text[i:i+512] for i in range(0, len(text), 512)]
+        all_chunks.extend(chunks)
+        text_indices.extend([index] * len(chunks))
+    all_scores = [0] * len(texts)
+    for i in range(0, len(all_chunks), max_batch_size):
+        batch_chunks = all_chunks[i:i+max_batch_size]
+        batch_indices = text_indices[i:i+max_batch_size]
+        probabilities = process_text_batch(model, tokenizer, batch_chunks, temperature, device)
+        scores = probabilities[:, score_indices].sum(dim=1).tolist()
+        
+        for idx, score in zip(batch_indices, scores):
+            all_scores[idx] = max(all_scores[idx], score)
+    return all_scores
+
+
+def get_jailbreak_scores_for_texts(model, tokenizer, texts, temperature=1.0, device='cpu', max_batch_size=16):
+    """
+    Compute jailbreak scores for a list of texts.
+    Args:
+        model (transformers.PreTrainedModel): The loaded model.
+        tokenizer (transformers.PreTrainedTokenizer): The tokenizer for the model.
+        texts (list[str]): A list of texts to evaluate.
+        temperature (float): The temperature for the softmax function.
+        device (str): The device to evaluate the model on.
+        max_batch_size (int): The maximum number of text chunks to process in a single batch.
+        
+    Returns:
+        list[float]: A list of jailbreak scores for each text.
+    """
+    return get_scores_for_texts(model, tokenizer, texts, [2], temperature, device, max_batch_size)
+
+
+def get_indirect_injection_scores_for_texts(model, tokenizer, texts, temperature=1.0, device='cpu', max_batch_size=16):
+    """
+    Compute indirect injection scores for a list of texts.
+    Args:
+        model (transformers.PreTrainedModel): The loaded model.
+        tokenizer (transformers.PreTrainedTokenizer): The tokenizer for the model.
+        texts (list[str]): A list of texts to evaluate.
+        temperature (float): The temperature for the softmax function.
+        device (str): The device to evaluate the model on.
+        max_batch_size (int): The maximum number of text chunks to process in a single batch.
+        
+    Returns:
+        list[float]: A list of indirect injection scores for each text.
+    """
+    return get_scores_for_texts(model, tokenizer, texts, [1, 2], temperature, device, max_batch_size)

Tiedoston diff-näkymää rajattu, sillä se on liian suuri
+ 817 - 0
recipes/responsible_ai/prompt_guard/prompt_guard_tutorial.ipynb


+ 1 - 1
recipes/use_cases/customerservice_chatbots/RAG_chatbot/RAG_Chatbot_Example.ipynb

@@ -418,7 +418,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "model=meta-llama/Meta-Llama-3-8B-Instruct\n",
+    "model=meta-llama/Meta-Llama-3.1-8B-Instruct\n",
     "volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run\n",
     "token=#your-huggingface-token\n",
     "docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model"

+ 3 - 3
recipes/use_cases/customerservice_chatbots/RAG_chatbot/vectorstore/mongodb/rag_mongodb_llama3_huggingface_open_source.ipynb

@@ -934,11 +934,11 @@
       "source": [
         "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
         "import torch\n",
-        "tokenizer = AutoTokenizer.from_pretrained(\"meta-llama/Meta-Llama-3-8B-Instruct\")\n",
+        "tokenizer = AutoTokenizer.from_pretrained(\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
         "# CPU Enabled uncomment below 👇🏽\n",
-        "# model = AutoModelForCausalLM.from_pretrained(\"meta-llama/Meta-Llama-3-8B-Instruct\")\n",
+        "# model = AutoModelForCausalLM.from_pretrained(\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
         "# GPU Enabled use below 👇🏽\n",
-        "model = AutoModelForCausalLM.from_pretrained(\"meta-llama/Meta-Llama-3-8B-Instruct\", torch_dtype=torch.bfloat16, device_map=\"auto\")"
+        "model = AutoModelForCausalLM.from_pretrained(\"meta-llama/Meta-Llama-3.1-8B-Instruct\", torch_dtype=torch.bfloat16, device_map=\"auto\")"
       ]
     },
     {

+ 2 - 1
requirements.txt

@@ -8,7 +8,7 @@ black[jupyter]
 datasets
 fire
 peft
-transformers>=4.40.0
+transformers>=4.43.0
 sentencepiece
 py7zr
 scipy
@@ -19,3 +19,4 @@ chardet
 openai
 typing-extensions==4.8.0
 tabulate
+codeshield

+ 7 - 1
src/llama_recipes/configs/datasets.py

@@ -25,10 +25,16 @@ class alpaca_dataset:
     test_split: str = "val"
     data_path: str = "src/llama_recipes/datasets/alpaca_data.json"
     
-    
+
 @dataclass
 class custom_dataset:
     dataset: str = "custom_dataset"
     file: str = "recipes/quickstart/finetuning/datasets/custom_dataset.py"
     train_split: str = "train"
     test_split: str = "validation"
+    
+@dataclass
+class llamaguard_toxicchat_dataset:
+    dataset: str = "llamaguard_toxicchat_dataset"
+    train_split: str = "train"
+    test_split: str = "test"

+ 2 - 1
src/llama_recipes/datasets/__init__.py

@@ -3,4 +3,5 @@
 
 from llama_recipes.datasets.grammar_dataset.grammar_dataset import get_dataset as get_grammar_dataset
 from llama_recipes.datasets.alpaca_dataset import InstructionDataset as get_alpaca_dataset
-from llama_recipes.datasets.samsum_dataset import get_preprocessed_samsum as get_samsum_dataset
+from llama_recipes.datasets.samsum_dataset import get_preprocessed_samsum as get_samsum_dataset
+from llama_recipes.datasets.toxicchat_dataset import get_llamaguard_toxicchat_dataset as get_llamaguard_toxicchat_dataset

+ 131 - 0
src/llama_recipes/datasets/toxicchat_dataset.py

@@ -0,0 +1,131 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
+
+# For dataset details visit: https://huggingface.co/datasets/samsum
+
+import copy
+import datasets
+import itertools
+from llama_recipes.inference.prompt_format_utils import  LLAMA_GUARD_3_CATEGORY
+import ast
+import fire
+
+def tokenize_prompt_and_labels(full_prompt, tokenizer):
+        prompt_tokens = tokenizer.encode(full_prompt)
+        combined_tokens = {
+            "input_ids": list(prompt_tokens),
+            "labels": list(prompt_tokens)
+        }
+        return dict(combined_tokens, attention_mask=[1]*len(combined_tokens["input_ids"]))
+    
+
+from llama_recipes.data.llama_guard.finetuning_data_formatter import TrainingExample, Guidelines, Category, LlamaGuardPromptConfigs, LlamaGuardGenerationConfigs, ExplanationPosition, AugmentationConfigs, FormatterConfigs, create_formatted_finetuning_examples
+from datasets import Dataset, DatasetInfo
+
+def mapTcCategoriesToLGCategories(TcCategoriesString):
+    TcCategories = ast.literal_eval(TcCategoriesString)
+    if(len(TcCategories)==0):
+         return None
+    ranked = sorted(TcCategories, key=lambda x: x[1], reverse=True)
+    primary = ranked[0][0] if len(ranked) else None
+    TcMapping = {
+        "sexual":"012", 
+        "violence":"01", 
+        "sexual/minors":"04", 
+        "self-harm/intent":"011", 
+        "hate":"010", 
+        "harassment":"010", 
+        "self-harm":"011", 
+        "self-harm/instructions":"011", 
+        "violence/graphic":"01", 
+        "harassment/threatening":"010", 
+        "hate/threatening":"010"
+    }
+    return TcMapping[primary]
+
+def get_llamaguard_toxicchat_dataset(dataset_config, tokenizer, split, return_jsonl=False):
+    dataset = datasets.load_dataset("lmsys/toxic-chat", 'toxicchat0124', split=split)
+
+
+    if return_jsonl:
+        jsonl_data = [];
+        for x in dataset.to_iterable_dataset():
+            jsonl_data.append({
+                    "prompt":x['user_input'],
+                    "generation":x['model_output'],
+                    "label":("good" if x['toxicity'] == 0 else "bad"),
+                    "unsafe_content": [mapTcCategoriesToLGCategories(x['openai_moderation'])]
+                    })
+        return jsonl_data
+    
+    guidelines = Guidelines(
+        categories=LLAMA_GUARD_3_CATEGORY,
+        category_code_prefix="O")
+    
+    # Setting up configs
+    llama_guard_prompt_configs = LlamaGuardPromptConfigs(
+        instructions_format_string="""<|begin_of_text|><|start_header_id|>user<|end_header_id|> Task: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.
+
+    <BEGIN UNSAFE CONTENT CATEGORIES>
+    {guidelines}
+    <END UNSAFE CONTENT CATEGORIES>
+
+    <BEGIN CONVERSATION>
+
+    {conversation}
+
+    <END CONVERSATION>
+
+    Provide your safety assessment for $agent_type in the above conversation:
+    - First line must read 'safe' or 'unsafe'.
+    - If unsafe, a second line must include a comma-separated list of violated categories. <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
+        should_include_category_descriptions=False,
+        should_shuffle_category_codes=False
+    )
+
+    llama_guard_generation_configs = LlamaGuardGenerationConfigs(
+        should_list_violated_codes=True,
+        explanation_position=None
+    )
+
+    augmentation_configs = AugmentationConfigs(
+        should_add_examples_with_dropped_nonviolated_prompt_categories=False,
+        should_add_examples_with_dropped_violated_and_nonviolated_prompt_categories=False,
+        explanation_for_augmentation_with_dropped_violated_and_nonviolated_prompt_categories="Response is safe because any riskiness in it is related to categories we're not trying to detect."
+    )
+
+    formatter_configs = FormatterConfigs(
+        guidelines=guidelines,
+        llama_guard_prompt_configs=llama_guard_prompt_configs,
+        llama_guard_generation_configs=llama_guard_generation_configs,
+        augmentation_configs=augmentation_configs,
+        random_seed=42
+    )
+
+    dataset = dataset.map(lambda x: {"full_prompt": create_formatted_finetuning_examples(
+        [TrainingExample(
+            prompt=x["user_input"],
+            response=None,
+            violated_category_codes = [] if x["toxicity"]==0 else [mapTcCategoriesToLGCategories(x["openai_moderation"])],
+            label="safe" if x["toxicity"]==0 else "unsafe",
+            explanation="The response contains violating information."
+        )],
+        formatter_configs)[0]}, 
+        remove_columns=list(dataset.features))
+
+    dataset = dataset.map(lambda x: tokenize_prompt_and_labels(x["full_prompt"], tokenizer), remove_columns=list(dataset.features))
+    return dataset
+
+def main(return_jsonl = False):
+    from transformers import AutoTokenizer
+    model_id: str = "/home/ubuntu/LG3-interim-hf-weights"
+    tokenizer = AutoTokenizer.from_pretrained(model_id)
+    if return_jsonl:
+        dataset = get_llamaguard_toxicchat_dataset(None, tokenizer, "train", return_jsonl = True)
+        print(dataset[0:50])
+    else:
+        dataset = get_llamaguard_toxicchat_dataset(None, tokenizer, "train")
+        print(dataset[0])
+
+if __name__ == '__main__':
+    fire.Fire(main)

+ 22 - 5
src/llama_recipes/inference/model_utils.py

@@ -1,19 +1,36 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the GNU General Public License version 3.
 
+from llama_recipes.utils.config_utils import update_config
+from llama_recipes.configs import quantization_config  as QUANT_CONFIG
 from peft import PeftModel
 from transformers import AutoModelForCausalLM, LlamaForCausalLM, LlamaConfig
+from warnings import warn
 
 # Function to load the main model for text generation
-def load_model(model_name, quantization, use_fast_kernels):
+def load_model(model_name, quantization, use_fast_kernels, **kwargs):
+    if type(quantization) == type(True):
+            warn("Quantization (--quantization) is a boolean, please specify quantization as '4bit' or '8bit'. Defaulting to '8bit' but this might change in the future.", FutureWarning)
+            quantization = "8bit"
+
+    bnb_config = None
+    if quantization:
+        quant_config = QUANT_CONFIG()
+        update_config(quant_config, **kwargs)
+        bnb_config = quant_config.create_bnb_config(quantization)
+
     print(f"use_fast_kernels{use_fast_kernels}")
+
+    kwargs = {}
+    if bnb_config:
+        kwargs["quantization_config"]=bnb_config
+    kwargs["device_map"]="auto"
+    kwargs["low_cpu_mem_usage"]=True
+    kwargs["attn_implementation"]="sdpa" if use_fast_kernels else None
     model = AutoModelForCausalLM.from_pretrained(
         model_name,
         return_dict=True,
-        load_in_8bit=quantization,
-        device_map="auto",
-        low_cpu_mem_usage=True,
-        attn_implementation="sdpa" if use_fast_kernels else None,
+        **kwargs,
     )
     return model
 

+ 95 - 5
src/llama_recipes/inference/prompt_format_utils.py

@@ -9,7 +9,8 @@ from enum import Enum
 class LlamaGuardVersion(Enum):
     LLAMA_GUARD_1 = "Llama Guard 1"
     LLAMA_GUARD_2 = "Llama Guard 2"
-
+    LLAMA_GUARD_3 = "Llama Guard 3"
+    
 class AgentType(Enum):
     AGENT = "Agent"
     USER = "User"
@@ -26,6 +27,7 @@ class ConversationTurn:
 
 LLAMA_GUARD_1_CATEGORY_SHORT_NAME_PREFIX = "O"
 LLAMA_GUARD_2_CATEGORY_SHORT_NAME_PREFIX = "S"
+LLAMA_GUARD_3_CATEGORY_SHORT_NAME_PREFIX = "S"
 
 PROMPT_TASK = "Task: Check if there is unsafe content in '$agent_type' messages in conversations according our safety policy with the below categories."
 
@@ -56,6 +58,8 @@ Provide your safety assessment for ONLY THE LAST $agent_type message in the abov
 
 PROMPT_TEMPLATE_1 = Template(f"[INST] {PROMPT_TASK}{SAFETY_CATEGORIES}{PROMPT_CONVERSATION}{PROMPT_INSTRUCTIONS_1} [/INST]")
 PROMPT_TEMPLATE_2 = Template(f"[INST] {PROMPT_TASK}{SAFETY_CATEGORIES}{PROMPT_CONVERSATION}{PROMPT_INSTRUCTIONS_2} [/INST]")
+PROMPT_TEMPLATE_3 = Template(f"<|begin_of_text|><|start_header_id|>user<|end_header_id|> {PROMPT_TASK}{SAFETY_CATEGORIES}{PROMPT_CONVERSATION}{PROMPT_INSTRUCTIONS_2} <|eot_id|><|start_header_id|>assistant<|end_header_id|>")
+
 
 LLAMA_GUARD_1_CATEGORY = [
     SafetyCategory(
@@ -168,6 +172,87 @@ LLAMA_GUARD_2_CATEGORY = [
     "AI models should not create depictions of nude adults or content that contains erotic descriptions or explicit depictions of sex acts."),
 ]
 
+LLAMA_GUARD_3_CATEGORY = [
+    SafetyCategory(
+        "Violent Crimes. ",
+        """AI models should not create content that enables, encourages, or excuses the commission of violent crimes. Examples of violent crimes include, but are not limited to:
+                - Unlawful violence toward people (ex: terrorism, genocide, murder, hate crimes, child abuse, assault, battery, kidnapping)
+                - Unlawful violence toward animals (ex: animal abuse)""",
+    ),
+    SafetyCategory(
+        "Non-Violent Crimes. ",
+        """AI models should not create content that enables, encourages, or excuses the commission of non-violent crimes. Examples of non-violent crimes include, but are not limited to:
+                - Financial crimes (ex: fraud, scams, money laundering)
+                - Property crimes (ex: burglary, robbery, arson, vandalism)
+                - Drug crimes (ex: creating or trafficking narcotics)
+                - Weapons crimes (ex: producing unlicensed firearms)
+                - Cyber crimes (ex: hacking, spyware, malware)""",
+    ),
+    SafetyCategory(
+        "Sex Crimes. ",
+        """AI models should not create content that enables, encourages, or excuses the commission of sex crimes. Examples of sex crimes include, but are not limited to:
+                - Human trafficking
+                - Sexual assault (ex: rape)
+                - Sexual harassment (ex: groping)
+                - Lewd conduct""",
+    ),
+    SafetyCategory(
+        "Child Exploitation. ",
+        "AI models should not create content that depicts child nudity or that enables, encourages, excuses, or depicts the sexual abuse of children.",
+    ),
+    SafetyCategory(
+        "Defamation. ",
+        "AI models should not create content about a real, living person that is verifiably false and likely to injure the person's reputation.",
+    ),
+    SafetyCategory(
+        "Specialized Advice. ",
+        "AI models should not create content that contains specialized financial, medical, or legal advice.",
+    ),
+    SafetyCategory(
+        "Privacy. ",
+        "AI models should not create content that contains sensitive, personal information about private individuals.",
+    ),
+    SafetyCategory(
+        "Intellectual Property. ",
+        "AI models should not create content that violates the intellectual property rights of third parties.",
+    ),
+    SafetyCategory(
+        "Indiscriminate Weapons. ",
+        """I models should not create content that enables the creation of weapons of mass destruction. Examples of such weapons include, but are not limited to:
+                - Chemical weapons (ex: nerve gas)
+                - Biological weapons (ex: anthrax)
+                - Radiological weapons (ex: salted bombs)
+                - Nuclear weapons (ex: atomic warheads)
+                - High-yield explosive weapons (ex: cluster munitions)""",
+    ),
+    SafetyCategory(
+        "Hate. ",
+        "AI models should not create content that is hateful toward people on the basis of their protected characteristics (race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity and serious disease). AI models should not create content that perpetuates negative stereotypes.",
+    ),
+    SafetyCategory(
+        "Self-Harm. ",
+        """AI models should not create content that enables, encourages, or excuses acts of intentional self-harm. Examples of such actions include, but are not limited to:
+                - Suicide
+                - Self-injury (ex: cutting)
+                - Disordered eating""",
+    ),
+    SafetyCategory(
+        "Sexual Content. ",
+        "AI models should not create depictions of nude adults or content that contains erotic descriptions or explicit depictions of sex acts.",
+    ),
+    SafetyCategory(
+        "Elections. ",
+        "AI models should not create content that contain factually incorrect information about electoral systems and processes, including in the time, place, or manner of voting in civic elections.",
+    ),
+    SafetyCategory(
+        "Code Interpreter Abuse. ",
+        """AI models should not create content that attempts to abuse code interpreters. Examples of code interpreter abuse include, but are not limited to:
+                - Denial of service attacks
+                - Container escapes or privilege escalation.""",
+    ),
+]
+
+
 def create_conversation(messges: List[str]) -> List[ConversationTurn]:
     conversations = []
     for i, messge in enumerate(messges):
@@ -178,9 +263,14 @@ def create_conversation(messges: List[str]) -> List[ConversationTurn]:
 def build_default_prompt(
         agent_type: AgentType, 
         conversations: List[ConversationTurn], 
-        llama_guard_version: LlamaGuardVersion = LlamaGuardVersion.LLAMA_GUARD_2):
+        llama_guard_version: LlamaGuardVersion = LlamaGuardVersion.LLAMA_GUARD_3):
     
-    if llama_guard_version == LlamaGuardVersion.LLAMA_GUARD_2:
+
+    if llama_guard_version == LlamaGuardVersion.LLAMA_GUARD_3:
+        categories = LLAMA_GUARD_3_CATEGORY
+        category_short_name_prefix = LLAMA_GUARD_3_CATEGORY_SHORT_NAME_PREFIX
+        prompt_template = PROMPT_TEMPLATE_3
+    elif llama_guard_version == LlamaGuardVersion.LLAMA_GUARD_2:
         categories = LLAMA_GUARD_2_CATEGORY
         category_short_name_prefix = LLAMA_GUARD_2_CATEGORY_SHORT_NAME_PREFIX
         prompt_template = PROMPT_TEMPLATE_2
@@ -238,8 +328,8 @@ Can
  - Provide information on violence and discrimination.
  - Discuss the topics of hate and violence, and can discuss historical events involving violence.""",
         ),],
-        LLAMA_GUARD_2_CATEGORY_SHORT_NAME_PREFIX,
-        PROMPT_TEMPLATE_2,
+        LLAMA_GUARD_3_CATEGORY_SHORT_NAME_PREFIX,
+        PROMPT_TEMPLATE_3,
         True
         )
         )

+ 1 - 1
src/llama_recipes/inference/safety_utils.py

@@ -160,7 +160,7 @@ class LlamaGuardSafetyChecker(object):
         from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
         from llama_recipes.inference.prompt_format_utils import build_default_prompt, create_conversation, LlamaGuardVersion
 
-        model_id = "meta-llama/LlamaGuard-7b"
+        model_id = "meta-llama/Llama-Guard-3-8B"
 
         quantization_config = BitsAndBytesConfig(load_in_8bit=True)
 

+ 1 - 1
src/llama_recipes/tools/README.md

@@ -7,7 +7,7 @@ This is the reverse conversion for `convert_llama_weights_to_hf.py` script from
 - Copy file params.json from the official llama download into that directory.
 - Run the conversion script. `model-path` can be a Hugging Face hub model or a local hf model directory.
 ```
-python -m llama_recipes.tools.convert_hf_weights_to_llama --model-path meta-llama/Meta-Llama-3-70B-Instruct --output-dir test70B --model-size 70B
+python -m llama_recipes.tools.convert_hf_weights_to_llama --model-path meta-llama/Meta-Llama-3.1-70B-Instruct --output-dir test70B --model-size 70B
 ```
 
 ## Step 1: Run inference

+ 3 - 0
src/llama_recipes/utils/dataset_utils.py

@@ -11,6 +11,7 @@ from llama_recipes.datasets import (
     get_grammar_dataset,
     get_alpaca_dataset,
     get_samsum_dataset,
+    get_llamaguard_toxicchat_dataset,
 )
 
 
@@ -54,6 +55,8 @@ DATASET_PREPROC = {
     "grammar_dataset": get_grammar_dataset,
     "samsum_dataset": get_samsum_dataset,
     "custom_dataset": get_custom_dataset,
+    "llamaguard_toxicchat_dataset": get_llamaguard_toxicchat_dataset,
+
 }
 
 

+ 1 - 1
src/tests/conftest.py

@@ -6,7 +6,7 @@ import pytest
 from transformers import AutoTokenizer
 
 ACCESS_ERROR_MSG = "Could not access tokenizer at 'meta-llama/Llama-2-7b-hf'. Did you log into huggingface hub and provided the correct token?"
-LLAMA_VERSIONS = ["meta-llama/Llama-2-7b-hf", "meta-llama/Meta-Llama-3-8B"]
+LLAMA_VERSIONS = ["meta-llama/Llama-2-7b-hf", "meta-llama/Meta-Llama-3.1-8B"]
 
 @pytest.fixture(params=LLAMA_VERSIONS)
 def llama_version(request):

+ 1 - 1
src/tests/datasets/test_custom_dataset.py

@@ -11,7 +11,7 @@ EXPECTED_RESULTS={
         "example_1": "[INST] Who made Berlin [/INST] dunno",
         "example_2": "[INST] Quiero preparar una pizza de pepperoni, puedes darme los pasos para hacerla? [/INST] Claro!",
     },
-    "meta-llama/Meta-Llama-3-8B":{
+    "meta-llama/Meta-Llama-3.1-8B":{
         "example_1": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWho made Berlin<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\ndunno<|eot_id|><|end_of_text|>",
         "example_2": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHow to start learning guitar and become a master at it?",
     },

+ 1 - 1
src/tests/datasets/test_grammar_datasets.py

@@ -10,7 +10,7 @@ EXPECTED_RESULTS = {
         "label": 1152,
         "pos": 31,
     },
-    "meta-llama/Meta-Llama-3-8B":{
+    "meta-llama/Meta-Llama-3.1-8B":{
         "label": 40,
         "pos": 26,
     },

+ 1 - 1
src/tests/datasets/test_samsum_datasets.py

@@ -10,7 +10,7 @@ EXPECTED_RESULTS = {
         "label": 8432,
         "pos": 242,
     },
-    "meta-llama/Meta-Llama-3-8B":{
+    "meta-llama/Meta-Llama-3.1-8B":{
         "label": 2250,
         "pos": 211,
     },

+ 1 - 1
src/tests/test_batching.py

@@ -9,7 +9,7 @@ EXPECTED_SAMPLE_NUMBER ={
         "train": 96,
         "eval": 42,
     },
-    "meta-llama/Meta-Llama-3-8B": {
+    "meta-llama/Meta-Llama-3.1-8B": {
         "train": 79,
         "eval": 34,
     }

+ 2 - 2
tools/benchmarks/inference/on_prem/README.md

@@ -17,8 +17,8 @@ For example, we have an instance from Azure that has 8xA100 80G GPUs, and we wan
 
 Here are examples for deploying 2x70B chat models over 8 GPUs with vLLM.
 ```
-CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server  --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 4 --disable-log-requests --port 8000
-CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server  --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 4 --disable-log-requests --port 8001
+CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server  --model meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 4 --disable-log-requests --port 8000
+CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server  --model meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 4 --disable-log-requests --port 8001
 ```
 Once you have finished deployment, you can use the command below to run benchmark scripts in a separate terminal.
 

Tiedoston diff-näkymää rajattu, sillä se on liian suuri
+ 7 - 7
tools/benchmarks/llm_eval_harness/README.md