1 yıl önce · 49d6af9e4d
--- a/.github/scripts/spellcheck_conf/wordlist.txt
+++ b/.github/scripts/spellcheck_conf/wordlist.txt
@@ -1400,6 +1400,19 @@ sqlite
 
				 customerservice
			
 
				 fn
			
 
				 ExecuTorch
			
 
				+LLMScore
			
 
				+RecursiveCharacterTextSplitter
			
 
				+TPD
			
 
				+TPM
			
 
				+Tianjun
			
 
				+Zhang
			
 
				+distractor
			
 
				+distractors
			
 
				+frac
			
 
				+numRefusal
			
 
				+totalQA
			
 
				+DirectoryLoader
			
 
				+SitemapLoader
			
 
				 nf
			
 
				 quant
			
 
				 DLAI
			
@@ -1418,3 +1431,23 @@ ipython
 
				 CPUs
			
 
				 modelUpgradeExample
			
 
				 guardrailing
			
 
				+MaaS
			
 
				+MFU
			
 
				+BBH
			
 
				+GPQA
			
 
				+IFEVAL
			
 
				+IFeval
			
 
				+bos
			
 
				+gpqa
			
 
				+ifeval
			
 
				+lighteval
			
 
				+sqrt
			
 
				+wis
			
 
				+evals
			
 
				+mmlu
			
 
				+parsers
			
 
				+reproducibility
			
 
				+openhathi
			
 
				+sarvam
			
 
				+subtask
			
 
				+acc
			
--- a/README.md
+++ b/README.md
--- a/recipes/3p_integrations/azure/Azure
+++ b/recipes/3p_integrations/azure/Azure
@@ -0,0 +1,494 @@
 
				+{
			
 
				+  "cells": [
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "# Use Azure API with Llama 3.1\n",
			
 
				+        "\n",
			
 
				+        "This notebook shows examples of how to use Llama 3.1 APIs offered by Microsoft Azure. We will cover:  \n",
			
 
				+        "* HTTP requests API usage for Llama 3.1 instruct models in CLI\n",
			
 
				+        "* HTTP requests API usage for Llama 3.1 instruct models in Python\n",
			
 
				+        "* Plug the APIs into LangChain\n",
			
 
				+        "* Wire the model with Gradio to build a simple chatbot with memory\n",
			
 
				+        "\n",
			
 
				+        "\n"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "## Prerequisite\n",
			
 
				+        "\n",
			
 
				+        "Before we start building with Azure Llama 3.1 APIs, there are certain steps we need to take to deploy the models:\n",
			
 
				+        "\n",
			
 
				+        "* Register for a valid Azure account with subscription [here](https://azure.microsoft.com/en-us/free/search/?ef_id=_k_CjwKCAiA-P-rBhBEEiwAQEXhH5OHAJLhzzcNsuxwpa5c9EJFcuAjeh6EvZw4afirjbWXXWkiZXmU2hoC5GoQAvD_BwE_k_&OCID=AIDcmm5edswduu_SEM__k_CjwKCAiA-P-rBhBEEiwAQEXhH5OHAJLhzzcNsuxwpa5c9EJFcuAjeh6EvZw4afirjbWXXWkiZXmU2hoC5GoQAvD_BwE_k_&gad_source=1&gclid=CjwKCAiA-P-rBhBEEiwAQEXhH5OHAJLhzzcNsuxwpa5c9EJFcuAjeh6EvZw4afirjbWXXWkiZXmU2hoC5GoQAvD_BwE)\n",
			
 
				+        "* Take a quick look on what is the [Azure AI Studio](https://learn.microsoft.com/en-us/azure/ai-studio/what-is-ai-studio?tabs=home) and navigate to the website from the link in the article\n",
			
 
				+        "* Follow the demos in the article to create a project and [resource](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-portal) group.\n",
			
 
				+        "* For Llama 3.1 instruct models from Model catalog, click Deploy in the model page and select \"Serverless API with Azure AI Content Safety\". Once deployed successfully, you should be assigned for an API endpoint and a security key for inference.\n",
			
 
				+        "* For Llama 3.1 pretrained models, Azure currently only support manual deployment under regular subscription. This means you will need to acquire a virtual machine with managed compute resource. We won't cover it here in this tutorial.\n",
			
 
				+        "\n",
			
 
				+        "For more information, you should consult Azure's official documentation [here](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-llama?tabs=azure-studio) for model deployment and inference."
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "## HTTP Requests API Usage in CLI\n",
			
 
				+        "\n",
			
 
				+        "### Basics\n",
			
 
				+        "\n",
			
 
				+        "The usage and schema of the API are identical to Llama 3 API hosted on Azure.\n",
			
 
				+        "\n",
			
 
				+        "For using the REST API, You will need to have an Endpoint url and Authentication Key associated with that endpoint.  \n",
			
 
				+        "This can be acquired from previous steps.  \n",
			
 
				+        "\n",
			
 
				+        "In this chat completion example for instruct model, we use a simple curl call for illustration. There are three major components:  \n",
			
 
				+        "\n",
			
 
				+        "* The `host-url` is your endpoint url with completion schema. \n",
			
 
				+        "* The `headers` defines the content type as well as your api key. \n",
			
 
				+        "* The `payload` or `data`, which is your prompt detail and model hyper parameters."
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "The `host-url` needs to be `/v1/chat/completions` and the request payload to include roles in conversations. Here is a sample payload:  \n",
			
 
				+        "\n",
			
 
				+        "```\n",
			
 
				+        "{ \n",
			
 
				+        "  \"messages\": [ \n",
			
 
				+        "    { \n",
			
 
				+        "      \"content\": \"You are a helpful assistant.\", \n",
			
 
				+        "      \"role\": \"system\" \n",
			
 
				+        "},  \n",
			
 
				+        "    { \n",
			
 
				+        "      \"content\": \"Hello!\", \n",
			
 
				+        "      \"role\": \"user\" \n",
			
 
				+        "    } \n",
			
 
				+        "  ], \n",
			
 
				+        "  \"max_tokens\": 50, \n",
			
 
				+        "} \n",
			
 
				+        "```\n",
			
 
				+        "\n",
			
 
				+        "Here is a sample curl call for chat completion"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "code",
			
 
				+      "execution_count": null,
			
 
				+      "metadata": {},
			
 
				+      "outputs": [],
			
 
				+      "source": [
			
 
				+        "!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{\"messages\":[{\"content\":\"You are a helpful assistant.\",\"role\":\"system\"},{\"content\":\"What is good about Wuhan?\",\"role\":\"user\"}], \"max_tokens\": 50}'"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "### Streaming\n",
			
 
				+        "\n",
			
 
				+        "One fantastic feature the API offers is the streaming capability.  \n",
			
 
				+        "Streaming allows the generated tokens to be sent as data-only server-sent events whenever they become available.  \n",
			
 
				+        "This is extremely important for interactive applications such as chatbots, so the user is always engaged.  \n",
			
 
				+        "\n",
			
 
				+        "To use streaming, simply set `\"stream\":true` as part of the request payload.  \n",
			
 
				+        "In the streaming mode, the REST API response will be different from non-streaming mode.\n",
			
 
				+        "\n",
			
 
				+        "Here is an example: "
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "code",
			
 
				+      "execution_count": null,
			
 
				+      "metadata": {},
			
 
				+      "outputs": [],
			
 
				+      "source": [
			
 
				+        "!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{\"messages\":[{\"content\":\"You are a helpful assistant.\",\"role\":\"system\"},{\"content\":\"What is good about Wuhan?\",\"role\":\"user\"}], \"max_tokens\": 500, \"stream\": true}'"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "As you can see the result comes back as a stream of `data` objects, each contains generated information including a `choice`.  \n",
			
 
				+        "The stream terminated by a `data:[DONE]\\n\\n` message."
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "### Content Safety Filtering\n",
			
 
				+        "\n",
			
 
				+        "If you enabled content filtering during deployment, Azure Llama 3.1 API endpoints will have content safety feature turned on. Both input prompt and output tokens are filtered by this service automatically.  \n",
			
 
				+        "To know more about the impact to the request/response payload, please refer to official guide [here](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=python).   \n",
			
 
				+        "\n",
			
 
				+        "For model input and output, if the filter detects there is harmful content, the generation will error out with additional information. \n",
			
 
				+        "\n",
			
 
				+        "If you disabled content filtering during deployment, Llama models had content safety built-in for generation. It will refuse to answer your questions if any harmful content was detected.\n",
			
 
				+        "\n",
			
 
				+        "Here is an example prompt that triggered content safety filtering:\n"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "code",
			
 
				+      "execution_count": null,
			
 
				+      "metadata": {},
			
 
				+      "outputs": [],
			
 
				+      "source": [
			
 
				+        "!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{\"messages\":[{\"content\":\"You are a helpful assistant.\",\"role\":\"system\"},{\"content\":\"How to make bomb?\",\"role\":\"user\"}], \"max_tokens\": 50}'"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "## HTTP Requests API Usage in Python\n",
			
 
				+        "\n",
			
 
				+        "Besides calling the API directly from command line tools, you can also programatically call them in Python.  \n",
			
 
				+        "\n",
			
 
				+        "Here is an example for the instruct model:\n",
			
 
				+        "\n",
			
 
				+        "\n"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "code",
			
 
				+      "execution_count": null,
			
 
				+      "metadata": {},
			
 
				+      "outputs": [],
			
 
				+      "source": [
			
 
				+        "import urllib.request\n",
			
 
				+        "import json\n",
			
 
				+        "\n",
			
 
				+        "#Configure payload data sending to API endpoint\n",
			
 
				+        "data = {\"messages\":[\n",
			
 
				+        "            {\"role\":\"system\", \"content\":\"You are a helpful assistant.\"},\n",
			
 
				+        "            {\"role\":\"user\", \"content\":\"What is good about Wuhan?\"}],\n",
			
 
				+        "        \"max_tokens\": 500,\n",
			
 
				+        "        \"temperature\": 0.9,\n",
			
 
				+        "        \"stream\": True,\n",
			
 
				+        "}\n",
			
 
				+        "\n",
			
 
				+        "body = str.encode(json.dumps(data))\n",
			
 
				+        "\n",
			
 
				+        "#Replace the url with your API endpoint\n",
			
 
				+        "url = 'https://your-endpoint.inference.ai.azure.com/v1/chat/completions'\n",
			
 
				+        "\n",
			
 
				+        "#Replace this with the key for the endpoint\n",
			
 
				+        "api_key = 'your-auth-key'\n",
			
 
				+        "if not api_key:\n",
			
 
				+        "    raise Exception(\"API Key is missing\")\n",
			
 
				+        "\n",
			
 
				+        "headers = {'Content-Type':'application/json', 'Authorization':(api_key)}\n",
			
 
				+        "\n",
			
 
				+        "req = urllib.request.Request(url, body, headers)\n",
			
 
				+        "\n",
			
 
				+        "try:\n",
			
 
				+        "    response = urllib.request.urlopen(req)\n",
			
 
				+        "    result = response.read()\n",
			
 
				+        "    print(result)\n",
			
 
				+        "except urllib.error.HTTPError as error:\n",
			
 
				+        "    print(\"The request failed with status code: \" + str(error.code))\n",
			
 
				+        "    # Print the headers - they include the requert ID and the timestamp, which are useful for debugging the failure\n",
			
 
				+        "    print(error.info())\n",
			
 
				+        "    print(error.read().decode(\"utf8\", 'ignore'))\n"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "However in this example, the streamed data content returns back as a single payload. It didn't stream as a serial of data events as we wished. To build true streaming capabilities utilizing the API endpoint, we will utilize the [`requests`](https://requests.readthedocs.io/en/latest/) library instead."
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "### Streaming in Python\n",
			
 
				+        "\n",
			
 
				+        "`Requests` library is a simple HTTP library for Python built with [`urllib3`](https://github.com/urllib3/urllib3). It automatically maintains the keep-alive and HTTP connection pooling. With the `Session` class, we can easily stream the result from our API calls.  \n",
			
 
				+        "\n",
			
 
				+        "Here is a quick example:"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "code",
			
 
				+      "execution_count": null,
			
 
				+      "metadata": {},
			
 
				+      "outputs": [],
			
 
				+      "source": [
			
 
				+        "import json\n",
			
 
				+        "import requests\n",
			
 
				+        "\n",
			
 
				+        "data = {\"messages\":[\n",
			
 
				+        "            {\"role\":\"system\", \"content\":\"You are a helpful assistant.\"},\n",
			
 
				+        "            {\"role\":\"user\", \"content\":\"What is good about Wuhan?\"}],\n",
			
 
				+        "        \"max_tokens\": 500,\n",
			
 
				+        "        \"temperature\": 0.9,\n",
			
 
				+        "        \"stream\": True\n",
			
 
				+        "}\n",
			
 
				+        "\n",
			
 
				+        "\n",
			
 
				+        "def post_stream(url):\n",
			
 
				+        "    s = requests.Session()\n",
			
 
				+        "    api_key = \"your-auth-key\"\n",
			
 
				+        "    headers = {'Content-Type':'application/json', 'Authorization':(api_key)}\n",
			
 
				+        "\n",
			
 
				+        "    with s.post(url, data=json.dumps(data), headers=headers, stream=True) as resp:\n",
			
 
				+        "        print(resp.status_code)\n",
			
 
				+        "        for line in resp.iter_lines():\n",
			
 
				+        "            if line:\n",
			
 
				+        "                print(line)\n",
			
 
				+        "\n",
			
 
				+        "\n",
			
 
				+        "url = \"https://your-endpoint.inference.ai.azure.com/v1/chat/completions\"\n",
			
 
				+        "post_stream(url)"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "## Use Llama 3.1 API with LangChain\n",
			
 
				+        "\n",
			
 
				+        "In this section, we will demonstrate how to use Llama 3.1 APIs with LangChain, one of the most popular framework to accelerate building your AI product.  \n",
			
 
				+        "One common solution here is to create your customized LLM instance, so you can add it to various chains to complete different tasks.  \n",
			
 
				+        "In this example, we will use the `AzureMLChatOnlineEndpoint` class LangChain provides to build a customized LLM instance. This particular class is designed to take in Azure endpoint and API keys as inputs and wire it with HTTP calls. So the underlying of it is very similar to how we used `urllib.request` library to send RESTful calls in previous examples to the Azure Endpoint.   \n",
			
 
				+        "\n",
			
 
				+        "First, let's install dependencies: \n",
			
 
				+        "\n"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "code",
			
 
				+      "execution_count": null,
			
 
				+      "metadata": {},
			
 
				+      "outputs": [],
			
 
				+      "source": [
			
 
				+        "pip install langchain"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "Once all dependencies are installed, you can directly create a `llm` instance based on `AzureMLChatOnlineEndpoint` as follows:  "
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "code",
			
 
				+      "execution_count": null,
			
 
				+      "metadata": {},
			
 
				+      "outputs": [],
			
 
				+      "source": [
			
 
				+        "from langchain_community.chat_models.azureml_endpoint import (\n",
			
 
				+        "    AzureMLEndpointApiType,\n",
			
 
				+        "    CustomOpenAIChatContentFormatter,\n",
			
 
				+        "    AzureMLChatOnlineEndpoint,\n",
			
 
				+        ")\n",
			
 
				+        "\n",
			
 
				+        "from langchain_core.messages import HumanMessage\n",
			
 
				+        "\n",
			
 
				+        "llm = AzureMLChatOnlineEndpoint(\n",
			
 
				+        "    endpoint_api_key=\"your-auth-key\",\n",
			
 
				+        "    endpoint_url=\"https://your-endpoint.inference.ai.azure.com/v1/chat/completions\",\n",
			
 
				+        "    endpoint_api_type=AzureMLEndpointApiType.serverless,\n",
			
 
				+        "    model_kwargs={\"temperature\": 0.6, \"max_tokens\": 256, \"top_p\": 0.9},\n",
			
 
				+        "    content_formatter=CustomOpenAIChatContentFormatter(),\n",
			
 
				+        ")"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "However, you might wonder what is the `CustomOpenAIChatContentFormatter` in the context when creating the `llm` instance?   \n",
			
 
				+        "The `CustomOpenAIChatContentFormatter` is a [handler class](https://python.langchain.com/docs/integrations/llms/azure_ml#content-formatter) for transforming the request and response of an AzureML endpoint to match with required schema. Since there are various models in the Azure model catalog, each of which needs to handle the data accordingly.  \n",
			
 
				+        "In our case, we can use the default `CustomOpenAIChatContentFormatter` which can handle Llama model schemas. If you need to have special handlings, you can customize this specific class. \n",
			
 
				+        "\n",
			
 
				+        "Once you have the `llm` ready, you can simple inference it by:"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "code",
			
 
				+      "execution_count": null,
			
 
				+      "metadata": {},
			
 
				+      "outputs": [],
			
 
				+      "source": [
			
 
				+        "response = llm.invoke([HumanMessage(content=\"What is good about Wuhan?\")])\n",
			
 
				+        "response"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "Here is an example that you can create a translator chain with the `llm` instance and translate English to French:"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "code",
			
 
				+      "execution_count": null,
			
 
				+      "metadata": {},
			
 
				+      "outputs": [],
			
 
				+      "source": [
			
 
				+        "from langchain.chains import LLMChain\n",
			
 
				+        "from langchain.prompts import PromptTemplate\n",
			
 
				+        "\n",
			
 
				+        "template = \"\"\"\n",
			
 
				+        "You are a Translator. Translate the following content from {input_language} to {output_language} and reply with only the translated result.\n",
			
 
				+        "{input_content}\n",
			
 
				+        "\"\"\"\n",
			
 
				+        "\n",
			
 
				+        "translator_chain = LLMChain(\n",
			
 
				+        "    llm = llm,\n",
			
 
				+        "    prompt = PromptTemplate(\n",
			
 
				+        "            template=template,\n",
			
 
				+        "            input_variables=[\"input_language\", \"output_language\", \"input_content\"],\n",
			
 
				+        "        ),\n",
			
 
				+        ")\n",
			
 
				+        "\n",
			
 
				+        "print(translator_chain.run(input_language=\"English\", output_language=\"French\", input_content=\"What is good about Wuhan?\"))\n"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "## Build a chatbot with Llama 3.1 API\n",
			
 
				+        "\n",
			
 
				+        "In this section, we will build a simple chatbot using Azure Llama 3.1 API, LangChain and [Gradio](https://www.gradio.app/)'s `ChatInterface` with memory capability.\n",
			
 
				+        "\n",
			
 
				+        "Gradio is a framework to help demo your machine learning model with a web interface. We also have a dedicated Gradio chatbot [example](https://github.com/meta-llama/llama-recipes/blob/main/recipes/use_cases/customerservice_chatbots/RAG_chatbot/RAG_Chatbot_Example.ipynb) built with Llama 3 on-premises with RAG.   \n",
			
 
				+        "\n",
			
 
				+        "First, let's install Gradio dependencies.\n"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "code",
			
 
				+      "execution_count": null,
			
 
				+      "metadata": {},
			
 
				+      "outputs": [],
			
 
				+      "source": [
			
 
				+        "pip install gradio==4.39.0"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "Let's use `AzureMLChatOnlineEndpoint` class from the previous example.  \n",
			
 
				+        "In this example, we have three major components:  \n",
			
 
				+        "1. Chatbot UI hosted as web interface by Gradio. These are the UI logics that render our model predictions.\n",
			
 
				+        "2. Model itself, which is the core component that ingests prompts and returns an answer back.\n",
			
 
				+        "3. Memory component, which stores previous conversation context. In this example, we will use [conversation window buffer](https://python.langchain.com/docs/modules/memory/types/buffer_window) which logs context in certain time window in the past. \n",
			
 
				+        "\n",
			
 
				+        "All of them are chained together using LangChain."
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "code",
			
 
				+      "execution_count": null,
			
 
				+      "metadata": {},
			
 
				+      "outputs": [],
			
 
				+      "source": [
			
 
				+        "import gradio as gr\n",
			
 
				+        "import langchain\n",
			
 
				+        "from langchain.chains import ConversationChain\n",
			
 
				+        "from langchain.prompts import PromptTemplate\n",
			
 
				+        "from langchain.memory import ConversationBufferWindowMemory\n",
			
 
				+        "from langchain_core.messages import HumanMessage\n",
			
 
				+        "from langchain_community.chat_models.azureml_endpoint import (\n",
			
 
				+        "    AzureMLEndpointApiType,\n",
			
 
				+        "    CustomOpenAIChatContentFormatter,\n",
			
 
				+        "    AzureMLChatOnlineEndpoint,\n",
			
 
				+        ")\n",
			
 
				+        "\n",
			
 
				+        "llm = AzureMLChatOnlineEndpoint(\n",
			
 
				+        "    endpoint_api_key=\"your-auth-key\",\n",
			
 
				+        "    endpoint_url=\"https://your-endpoint.inference.ai.azure.com/v1/chat/completions\",\n",
			
 
				+        "    endpoint_api_type=AzureMLEndpointApiType.serverless,\n",
			
 
				+        "    model_kwargs={\"temperature\": 0.6, \"max_tokens\": 256, \"top_p\": 0.9},\n",
			
 
				+        "    content_formatter=CustomOpenAIChatContentFormatter(),\n",
			
 
				+        ")\n",
			
 
				+        "\n",
			
 
				+        "langchain.debug=True\n",
			
 
				+        "\n",
			
 
				+        "#Create memory\n",
			
 
				+        "memory = ConversationBufferWindowMemory(llm=llm, k=5, memory_key=\"chat_history\", ai_prefix=\"Assistant\", human_prefix=\"User\")\n",
			
 
				+        "\n",
			
 
				+        "#Create input prompt template with chat history for chaining\n",
			
 
				+        "INPUT_TEMPLATE = \"\"\"Current conversation:\n",
			
 
				+        "{chat_history}\n",
			
 
				+        "\n",
			
 
				+        "User question:{input}\"\"\"\n",
			
 
				+        "\n",
			
 
				+        "conversation_prompt_template = PromptTemplate(\n",
			
 
				+        "    input_variables=[\"chat_history\", \"input\"], template=INPUT_TEMPLATE\n",
			
 
				+        ")\n",
			
 
				+        "\n",
			
 
				+        "conversation_chain_with_memory = ConversationChain(\n",
			
 
				+        "    llm = llm,\n",
			
 
				+        "    prompt = conversation_prompt_template,\n",
			
 
				+        "    verbose = True,\n",
			
 
				+        "    memory = memory,\n",
			
 
				+        ")\n",
			
 
				+        "\n",
			
 
				+        "#Prediction\n",
			
 
				+        "def predict(message, history):\n",
			
 
				+        "    history_format = []\n",
			
 
				+        "    for user, assistant in history:\n",
			
 
				+        "        history_format.append({\"role\": \"user\", \"content\": user })\n",
			
 
				+        "        history_format.append({\"role\": \"assistant\", \"content\":assistant})\n",
			
 
				+        "    history_format.append({\"role\": \"user\", \"content\": message})\n",
			
 
				+        "    response = conversation_chain_with_memory.run(input=message)\n",
			
 
				+        "    return response\n",
			
 
				+        "\n",
			
 
				+        "#Launch Gradio chatbot interface\n",
			
 
				+        "gr.ChatInterface(predict).launch()"
			
 
				+      ]
			
 
				+    },
			
 
				+    {
			
 
				+      "cell_type": "markdown",
			
 
				+      "metadata": {},
			
 
				+      "source": [
			
 
				+        "After successfully executing the code above, a chat interface should appear as the interactive output or you can open the localhost url in your selected browser window. You can see how amazing it is to build a AI chatbot just in few lines of code.\n",
			
 
				+        "\n",
			
 
				+        "This concludes our tutorial and examples. Here are some additional reference:  \n",
			
 
				+        "* [Fine-tune Llama](https://learn.microsoft.com/azure/ai-studio/how-to/fine-tune-model-llama)\n",
			
 
				+        "* [Plan and manage costs (marketplace)](https://learn.microsoft.com/azure/ai-studio/how-to/costs-plan-manage#monitor-costs-for-models-offered-through-the-azure-marketplace)\n"
			
 
				+      ]
			
 
				+    }
			
 
				+  ],
			
 
				+  "metadata": {
			
 
				+    "fileHeader": "",
			
 
				+    "fileUid": "599e1edd-cd59-4e55-823f-17157fc07b18",
			
 
				+    "isAdHoc": false,
			
 
				+    "kernelspec": {
			
 
				+      "display_name": "Python 3",
			
 
				+      "language": "python",
			
 
				+      "name": "python3"
			
 
				+    },
			
 
				+    "language_info": {
			
 
				+      "codemirror_mode": {
			
 
				+        "name": "ipython",
			
 
				+        "version": 3
			
 
				+      },
			
 
				+      "file_extension": ".py",
			
 
				+      "mimetype": "text/x-python",
			
 
				+      "name": "python",
			
 
				+      "nbconvert_exporter": "python",
			
 
				+      "pygments_lexer": "ipython3",
			
 
				+      "version": "3.9.6"
			
 
				+    }
			
 
				+  },
			
 
				+  "nbformat": 4,
			
 
				+  "nbformat_minor": 2
			
 
				+}
			
--- a/recipes/3p_integrations/azure/README.md
+++ b/recipes/3p_integrations/azure/README.md
@@ -1,5 +1,2 @@
 
				-In this folder, we show various examples in a notebook for running Llama model inference on Azure's serverless API offerings. We will cover:  
			
 
				-* HTTP requests API usage for Llama 3 instruct models in CLI
			
 
				-* HTTP requests API usage for Llama 3 instruct models in Python
			
 
				-* Plug the APIs into LangChain
			
 
				-* Wire the model with Gradio to build a simple chatbot with memory
			
 
				+In this folder, we show various recipes for Llama models working with Azure AI services. This includes:
			
 
				+* Examples for running Llama model inference on Azure's serverless API offerings (aka. MaaS)
			
--- a/recipes/3p_integrations/azure/azure_api_example.ipynb
+++ b/recipes/3p_integrations/azure/azure_api_example.ipynb
@@ -1,532 +0,0 @@
 
				-{
			
 
				- "cells": [
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "# Use Azure API with Llama 3\n",
			
 
				-    "\n",
			
 
				-    "This notebook shows examples of how to use Llama 3 APIs offered by Microsoft Azure. We will cover:  \n",
			
 
				-    "* HTTP requests API usage for Llama 3 instruct models in CLI\n",
			
 
				-    "* HTTP requests API usage for Llama 3 instruct models in Python\n",
			
 
				-    "* Plug the APIs into LangChain\n",
			
 
				-    "* Wire the model with Gradio to build a simple chatbot with memory\n",
			
 
				-    "\n",
			
 
				-    "\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## Prerequisite\n",
			
 
				-    "\n",
			
 
				-    "Before we start building with Azure Llama 3 APIs, there are certain steps we need to take to deploy the models:\n",
			
 
				-    "\n",
			
 
				-    "* Register for a valid Azure account with subscription [here](https://azure.microsoft.com/en-us/free/search/?ef_id=_k_CjwKCAiA-P-rBhBEEiwAQEXhH5OHAJLhzzcNsuxwpa5c9EJFcuAjeh6EvZw4afirjbWXXWkiZXmU2hoC5GoQAvD_BwE_k_&OCID=AIDcmm5edswduu_SEM__k_CjwKCAiA-P-rBhBEEiwAQEXhH5OHAJLhzzcNsuxwpa5c9EJFcuAjeh6EvZw4afirjbWXXWkiZXmU2hoC5GoQAvD_BwE_k_&gad_source=1&gclid=CjwKCAiA-P-rBhBEEiwAQEXhH5OHAJLhzzcNsuxwpa5c9EJFcuAjeh6EvZw4afirjbWXXWkiZXmU2hoC5GoQAvD_BwE)\n",
			
 
				-    "* Take a quick look on what is the [Azure AI Studio](https://learn.microsoft.com/en-us/azure/ai-studio/what-is-ai-studio?tabs=home) and navigate to the website from the link in the article\n",
			
 
				-    "* Follow the demos in the article to create a project and [resource](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-portal) group, or you can also follow the guide [here](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-llama?tabs=azure-studio)\n",
			
 
				-    "* For Llama 3 instruct models from Model catalog, click Deploy in the model page and select \"Pay-as-you-go\". Once deployed successfully, you should be assigned for an API endpoint and a security key for inference.\n",
			
 
				-    "* For Llama 3 pretrained models, Azure currently only support manual deployment under regular subscription. We are working with them to bring \"Pay-as-you-go\" for pretrained models.\n",
			
 
				-    "\n",
			
 
				-    "For more information, you should consult Azure's official documentation [here](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-llama?tabs=azure-studio) for model deployment and inference."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## HTTP Requests API Usage in CLI\n",
			
 
				-    "\n",
			
 
				-    "### Basics\n",
			
 
				-    "\n",
			
 
				-    "The usage and schema of the API are identical to Llama 3 API hosted on Azure.\n",
			
 
				-    "\n",
			
 
				-    "For using the REST API, You will need to have an Endpoint url and Authentication Key associated with that endpoint.  \n",
			
 
				-    "This can be acquired from previous steps.  \n",
			
 
				-    "\n",
			
 
				-    "In this chat completion example for instruct model, we use a simple curl call for illustration. There are three major components:  \n",
			
 
				-    "\n",
			
 
				-    "* The `host-url` is your endpoint url with completion schema. \n",
			
 
				-    "* The `headers` defines the content type as well as your api key. \n",
			
 
				-    "* The `payload` or `data`, which is your prompt detail and model hyper parameters."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "The `host-url` needs to be `/v1/chat/completions` and the request payload to include roles in conversations. Here is a sample payload:  \n",
			
 
				-    "\n",
			
 
				-    "```\n",
			
 
				-    "{ \n",
			
 
				-    "  \"messages\": [ \n",
			
 
				-    "    { \n",
			
 
				-    "      \"content\": \"You are a helpful assistant.\", \n",
			
 
				-    "      \"role\": \"system\" \n",
			
 
				-    "},  \n",
			
 
				-    "    { \n",
			
 
				-    "      \"content\": \"Hello!\", \n",
			
 
				-    "      \"role\": \"user\" \n",
			
 
				-    "    } \n",
			
 
				-    "  ], \n",
			
 
				-    "  \"max_tokens\": 50, \n",
			
 
				-    "} \n",
			
 
				-    "```\n",
			
 
				-    "\n",
			
 
				-    "Here is a sample curl call for chat completion"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{\"messages\":[{\"content\":\"You are a helpful assistant.\",\"role\":\"system\"},{\"content\":\"Who wrote the book Innovators dilemma?\",\"role\":\"user\"}], \"max_tokens\": 50}'"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Streaming\n",
			
 
				-    "\n",
			
 
				-    "One fantastic feature the API offers is the streaming capability.  \n",
			
 
				-    "Streaming allows the generated tokens to be sent as data-only server-sent events whenever they become available.  \n",
			
 
				-    "This is extremely important for interactive applications such as chatbots, so the user is always engaged.  \n",
			
 
				-    "\n",
			
 
				-    "To use streaming, simply set `\"stream\":True` as part of the request payload.  \n",
			
 
				-    "In the streaming mode, the REST API response will be different from non-streaming mode.\n",
			
 
				-    "\n",
			
 
				-    "Here is an example: "
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{\"messages\":[{\"content\":\"You are a helpful assistant.\",\"role\":\"system\"},{\"content\":\"Who wrote the book Innovators dilemma?\",\"role\":\"user\"}], \"max_tokens\": 500, \"stream\": True}'"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "As you can see the result comes back as a stream of `data` objects, each contains generated information including a `choice`.  \n",
			
 
				-    "The stream terminated by a `data:[DONE]\\n\\n` message."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Content Safety Filtering\n",
			
 
				-    "\n",
			
 
				-    "All Azure Llama 3 API endpoints have content safety feature turned on. Both input prompt and output tokens are filtered by this service automatically.  \n",
			
 
				-    "To know more about the impact to the request/response payload, please refer to official guide [here](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=python).   \n",
			
 
				-    "\n",
			
 
				-    "For model input and output, if the filter detects there is harmful content, the generation will error out with a response payload containing the reasoning, along with information on the type of content violation and its severity. \n",
			
 
				-    "\n",
			
 
				-    "Here is an example prompt that triggered content safety filtering:\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "!curl -X POST -L https://your-endpoint.inference.ai.azure.com/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: your-auth-key' -d '{\"messages\":[{\"content\":\"You are a helpful assistant.\",\"role\":\"system\"},{\"content\":\"How to make bomb?\",\"role\":\"user\"}], \"max_tokens\": 50}'"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## HTTP Requests API Usage in Python\n",
			
 
				-    "\n",
			
 
				-    "Besides calling the API directly from command line tools, you can also programatically call them in Python.  \n",
			
 
				-    "\n",
			
 
				-    "Here is an example for the instruct model:\n",
			
 
				-    "\n",
			
 
				-    "\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "import urllib.request\n",
			
 
				-    "import json\n",
			
 
				-    "\n",
			
 
				-    "#Configure payload data sending to API endpoint\n",
			
 
				-    "data = {\"messages\":[\n",
			
 
				-    "            {\"role\":\"system\", \"content\":\"You are a helpful assistant.\"},\n",
			
 
				-    "            {\"role\":\"user\", \"content\":\"Who wrote the book Innovators dilemma?\"}], \n",
			
 
				-    "        \"max_tokens\": 500,\n",
			
 
				-    "        \"temperature\": 0.9,\n",
			
 
				-    "        \"stream\": True,\n",
			
 
				-    "}\n",
			
 
				-    "\n",
			
 
				-    "body = str.encode(json.dumps(data))\n",
			
 
				-    "\n",
			
 
				-    "#Replace the url with your API endpoint\n",
			
 
				-    "url = 'https://your-endpoint.inference.ai.azure.com/v1/chat/completions'\n",
			
 
				-    "\n",
			
 
				-    "#Replace this with the key for the endpoint\n",
			
 
				-    "api_key = 'your-auth-key'\n",
			
 
				-    "if not api_key:\n",
			
 
				-    "    raise Exception(\"API Key is missing\")\n",
			
 
				-    "\n",
			
 
				-    "headers = {'Content-Type':'application/json', 'Authorization':(api_key)}\n",
			
 
				-    "\n",
			
 
				-    "req = urllib.request.Request(url, body, headers)\n",
			
 
				-    "\n",
			
 
				-    "try:\n",
			
 
				-    "    response = urllib.request.urlopen(req)\n",
			
 
				-    "    result = response.read()\n",
			
 
				-    "    print(result)\n",
			
 
				-    "except urllib.error.HTTPError as error:\n",
			
 
				-    "    print(\"The request failed with status code: \" + str(error.code))\n",
			
 
				-    "    # Print the headers - they include the requert ID and the timestamp, which are useful for debugging the failure\n",
			
 
				-    "    print(error.info())\n",
			
 
				-    "    print(error.read().decode(\"utf8\", 'ignore'))\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "However in this example, the streamed data content returns back as a single payload. It didn't stream as a serial of data events as we wished. To build true streaming capabilities utilizing the API endpoint, we will utilize the [`requests`](https://requests.readthedocs.io/en/latest/) library instead."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "### Streaming in Python\n",
			
 
				-    "\n",
			
 
				-    "`Requests` library is a simple HTTP library for Python built with [`urllib3`](https://github.com/urllib3/urllib3). It automatically maintains the keep-alive and HTTP connection pooling. With the `Session` class, we can easily stream the result from our API calls.  \n",
			
 
				-    "\n",
			
 
				-    "Here is a quick example:"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "import json\n",
			
 
				-    "import requests\n",
			
 
				-    "\n",
			
 
				-    "data = {\"messages\":[\n",
			
 
				-    "            {\"role\":\"system\", \"content\":\"You are a helpful assistant.\"},\n",
			
 
				-    "            {\"role\":\"user\", \"content\":\"Who wrote the book Innovators dilemma?\"}],\n",
			
 
				-    "        \"max_tokens\": 500,\n",
			
 
				-    "        \"temperature\": 0.9,\n",
			
 
				-    "        \"stream\": True\n",
			
 
				-    "}\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "def post_stream(url):\n",
			
 
				-    "    s = requests.Session()\n",
			
 
				-    "    api_key = \"your-auth-key\"\n",
			
 
				-    "    headers = {'Content-Type':'application/json', 'Authorization':(api_key)}\n",
			
 
				-    "\n",
			
 
				-    "    with s.post(url, data=json.dumps(data), headers=headers, stream=True) as resp:\n",
			
 
				-    "        print(resp.status_code)\n",
			
 
				-    "        for line in resp.iter_lines():\n",
			
 
				-    "            if line:\n",
			
 
				-    "                print(line)\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "url = \"https://your-endpoint.inference.ai.azure.com/v1/chat/completions\"\n",
			
 
				-    "post_stream(url)"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## Use Llama 3 API with LangChain\n",
			
 
				-    "\n",
			
 
				-    "In this section, we will demonstrate how to use Llama 3 APIs with LangChain, one of the most popular framework to accelerate building your AI product.  \n",
			
 
				-    "One common solution here is to create your customized LLM instance, so you can add it to various chains to complete different tasks.  \n",
			
 
				-    "In this example, we will use the `AzureMLOnlineEndpoint` class LangChain provides to build a customized LLM instance. This particular class is designed to take in Azure endpoint and API keys as inputs and wire it with HTTP calls. So the underlying of it is very similar to how we used `urllib.request` library to send RESTful calls in previous examples to the Azure Endpoint.   \n",
			
 
				-    "\n",
			
 
				-    "First, let's install dependencies: \n",
			
 
				-    "\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "pip install langchain"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "Once all dependencies are installed, you can directly create a `llm` instance based on `AzureMLOnlineEndpoint` as follows:  "
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "from langchain.llms.azureml_endpoint import AzureMLOnlineEndpoint, ContentFormatterBase\n",
			
 
				-    "from typing import Dict\n",
			
 
				-    "import json\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "class AzureLlamaAPIContentFormatter(ContentFormatterBase):\n",
			
 
				-    "#Content formatter for Llama 3 API for Azure MaaS\n",
			
 
				-    "\n",
			
 
				-    "    def format_request_payload(self, prompt: str, model_kwargs: Dict) -> bytes:\n",
			
 
				-    "        #Formats the request according to the chosen api\n",
			
 
				-    "        prompt = ContentFormatterBase.escape_special_characters(prompt)\n",
			
 
				-    "        request_payload_dict = {\n",
			
 
				-    "                \"messages\": [\n",
			
 
				-    "                    {\"role\":\"system\", \"content\":\"You are a helpful assistant\"},\n",
			
 
				-    "                    {\"role\":\"user\", \"content\":f\"{prompt}\"}\n",
			
 
				-    "                    ]               \n",
			
 
				-    "            }\n",
			
 
				-    "        #Add model parameters as part of the dict\n",
			
 
				-    "        request_payload_dict.update(model_kwargs)\n",
			
 
				-    "        request_payload = json.dumps(request_payload_dict)\n",
			
 
				-    "        return str.encode(request_payload)\n",
			
 
				-    "\n",
			
 
				-    "    def format_response_payload(self, output: bytes) -> str:\n",
			
 
				-    "        #Formats response\n",
			
 
				-    "        return json.loads(output)[\"choices\"][0][\"message\"][\"content\"]\n",
			
 
				-    "\n",
			
 
				-    "\n",
			
 
				-    "content_formatter = AzureLlamaAPIContentFormatter()\n",
			
 
				-    "\n",
			
 
				-    "llm = AzureMLOnlineEndpoint(\n",
			
 
				-    "    endpoint_api_key=\"your-auth-key\",\n",
			
 
				-    "    endpoint_url=\"https://your-endpoint.inference.ai.azure.com/v1/chat/completions\",\n",
			
 
				-    "    model_kwargs={\"temperature\": 0.6, \"max_tokens\": 512, \"top_p\": 0.9},\n",
			
 
				-    "    content_formatter=content_formatter,\n",
			
 
				-    ")"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "However, you might wonder what is the `content_formatter` in the context when creating the `llm` instance?   \n",
			
 
				-    "The `content_formatter` parameter is a [handler class](https://python.langchain.com/docs/integrations/llms/azure_ml#content-formatter) for transforming the request and response of an AzureML endpoint to match with required schema. Since there are various models in the Azure model catalog, each of which needs to handle the data accordingly.  \n",
			
 
				-    "In our case, all current formatters provided by Langchain including `LLamaContentFormatter` don't follow the schema. So we created our own customized formatter called `AzureLlamaAPIContentFormatter` to handle the input and output data.  \n",
			
 
				-    "\n",
			
 
				-    "Once you have the `llm` ready, you can simple inference it by:"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "print(llm(\"Who wrote the book Innovators dilemma?\"))"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "Here is an example that you can create a translator chain with the `llm` instance and translate English to French:"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "from langchain.chains import LLMChain\n",
			
 
				-    "from langchain.prompts import PromptTemplate\n",
			
 
				-    "\n",
			
 
				-    "template = \"\"\"\n",
			
 
				-    "You are a Translator. Translate the following content from {input_language} to {output_language} and reply with only the translated result.\n",
			
 
				-    "{input_content}\n",
			
 
				-    "\"\"\"\n",
			
 
				-    "\n",
			
 
				-    "translator_chain = LLMChain(\n",
			
 
				-    "    llm = llm,\n",
			
 
				-    "    prompt = PromptTemplate(\n",
			
 
				-    "            template=template,\n",
			
 
				-    "            input_variables=[\"input_language\", \"output_language\", \"input_content\"],\n",
			
 
				-    "        ),\n",
			
 
				-    ")\n",
			
 
				-    "\n",
			
 
				-    "print(translator_chain.run(input_language=\"English\", output_language=\"French\", input_content=\"Who wrote the book Innovators dilemma?\"))\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "## Build a chatbot with Llama 3 API\n",
			
 
				-    "\n",
			
 
				-    "In this section, we will build a simple chatbot using Azure Llama 3 API, LangChain and [Gradio](https://www.gradio.app/)'s `ChatInterface` with memory capability.\n",
			
 
				-    "\n",
			
 
				-    "Gradio is a framework to help demo your machine learning model with a web interface. We also have a dedicated Gradio chatbot [example](https://github.com/meta-llama/llama-recipes/blob/main/recipes/use_cases/chatbots/RAG_chatbot/RAG_Chatbot_Example.ipynb) built with Llama 3 on-premises with RAG.   \n",
			
 
				-    "\n",
			
 
				-    "First, let's install Gradio dependencies.\n"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "\n",
			
 
				-    "pip install gradio"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "Let's use `AzureMLOnlineEndpoint` class from the previous example.  \n",
			
 
				-    "In this example, we have three major components:  \n",
			
 
				-    "1. Chatbot UI hosted as web interface by Gradio. These are the UI logics that render our model predictions.\n",
			
 
				-    "2. Model itself, which is the core component that ingests prompts and returns an answer back.\n",
			
 
				-    "3. Memory component, which stores previous conversation context. In this example, we will use [conversation window buffer](https://python.langchain.com/docs/modules/memory/types/buffer_window) which logs context in certain time window in the past. \n",
			
 
				-    "\n",
			
 
				-    "All of them are chained together using LangChain."
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "code",
			
 
				-   "execution_count": null,
			
 
				-   "metadata": {},
			
 
				-   "outputs": [],
			
 
				-   "source": [
			
 
				-    "import gradio as gr\n",
			
 
				-    "from langchain.chains import ConversationChain\n",
			
 
				-    "from langchain.prompts import PromptTemplate\n",
			
 
				-    "from langchain.llms.azureml_endpoint import AzureMLOnlineEndpoint, ContentFormatterBase\n",
			
 
				-    "from langchain.memory import ConversationBufferWindowMemory\n",
			
 
				-    "\n",
			
 
				-    "import langchain\n",
			
 
				-    "from typing import Dict\n",
			
 
				-    "import json\n",
			
 
				-    "\n",
			
 
				-    "langchain.debug=True\n",
			
 
				-    "\n",
			
 
				-    "class AzureLlamaAPIContentFormatter(ContentFormatterBase):\n",
			
 
				-    "#Content formatter for Llama 3 API for Azure MaaS\n",
			
 
				-    "\n",
			
 
				-    "    def format_request_payload(self, prompt: str, model_kwargs: Dict) -> bytes:\n",
			
 
				-    "        #Formats the request according to the chosen api\n",
			
 
				-    "        prompt = ContentFormatterBase.escape_special_characters(prompt)\n",
			
 
				-    "\n",
			
 
				-    "        #Note how we instructed the model with system prompts. Past conversation can be past as in system prompt as well\n",
			
 
				-    "        request_payload_dict = {\n",
			
 
				-    "                \"messages\": [\n",
			
 
				-    "                    {\"role\":\"system\", \"content\":\"The following is a conversation between a user and you. Answer the user question based on the conversation. Provide your answer only\"},\n",
			
 
				-    "                    {\"role\":\"user\", \"content\":f\"{prompt}\"}\n",
			
 
				-    "                    ]               \n",
			
 
				-    "            }\n",
			
 
				-    "        request_payload_dict.update(model_kwargs)\n",
			
 
				-    "        request_payload = json.dumps(request_payload_dict)\n",
			
 
				-    "        return str.encode(request_payload)\n",
			
 
				-    "\n",
			
 
				-    "    def format_response_payload(self, output: bytes) -> str:\n",
			
 
				-    "        #Formats response\n",
			
 
				-    "        return json.loads(output)[\"choices\"][0][\"message\"][\"content\"]\n",
			
 
				-    "\n",
			
 
				-    "#Create content fomartter\n",
			
 
				-    "content_formatter = AzureLlamaAPIContentFormatter()\n",
			
 
				-    "\n",
			
 
				-    "#Create llm instance\n",
			
 
				-    "llm = AzureMLOnlineEndpoint(\n",
			
 
				-    "    endpoint_api_key=\"your-auth-key\",\n",
			
 
				-    "    endpoint_url=\"https://your-endpoint.inference.ai.azure.com/v1/chat/completions\",\n",
			
 
				-    "    model_kwargs={\"temperature\": 0.6, \"max_tokens\": 128, \"top_p\": 0.9},\n",
			
 
				-    "    content_formatter=content_formatter,\n",
			
 
				-    ")\n",
			
 
				-    "\n",
			
 
				-    "#Create memory\n",
			
 
				-    "memory = ConversationBufferWindowMemory(llm=llm, k=5, memory_key=\"chat_history\", ai_prefix=\"Assistant\", human_prefix=\"User\")\n",
			
 
				-    "\n",
			
 
				-    "#Create input prompt template with chat history for chaining\n",
			
 
				-    "INPUT_TEMPLATE = \"\"\"Current conversation:\n",
			
 
				-    "{chat_history}\n",
			
 
				-    "\n",
			
 
				-    "User question:{input}\"\"\"\n",
			
 
				-    "\n",
			
 
				-    "conversation_prompt_template = PromptTemplate(\n",
			
 
				-    "    input_variables=[\"chat_history\", \"input\"], template=INPUT_TEMPLATE\n",
			
 
				-    ")\n",
			
 
				-    "\n",
			
 
				-    "conversation_chain_with_memory = ConversationChain(\n",
			
 
				-    "    llm = llm,\n",
			
 
				-    "    prompt = conversation_prompt_template,\n",
			
 
				-    "    verbose = True,\n",
			
 
				-    "    memory = memory,\n",
			
 
				-    ")\n",
			
 
				-    "\n",
			
 
				-    "#Prediction\n",
			
 
				-    "def predict(message, history):\n",
			
 
				-    "    history_format = []\n",
			
 
				-    "    for user, assistant in history:\n",
			
 
				-    "        history_format.append({\"role\": \"user\", \"content\": user })\n",
			
 
				-    "        history_format.append({\"role\": \"assistant\", \"content\":assistant})\n",
			
 
				-    "    history_format.append({\"role\": \"user\", \"content\": message})\n",
			
 
				-    "    response = conversation_chain_with_memory.run(input=message)\n",
			
 
				-    "    return response\n",
			
 
				-    "\n",
			
 
				-    "#Launch Gradio chatbot interface\n",
			
 
				-    "gr.ChatInterface(predict).launch()"
			
 
				-   ]
			
 
				-  },
			
 
				-  {
			
 
				-   "cell_type": "markdown",
			
 
				-   "metadata": {},
			
 
				-   "source": [
			
 
				-    "After successfully executing the code above, a chat interface should appear as the interactive output or you can open the localhost url in your selected browser window.  \n",
			
 
				-    "\n",
			
 
				-    "This concludes our tutorial and examples. Here are some additional reference:  \n",
			
 
				-    "* [Fine-tune Llama](https://learn.microsoft.com/azure/ai-studio/how-to/fine-tune-model-llama)\n",
			
 
				-    "* [Plan and manage costs (marketplace)](https://learn.microsoft.com/azure/ai-studio/how-to/costs-plan-manage#monitor-costs-for-models-offered-through-the-azure-marketplace)\n"
			
 
				-   ]
			
 
				-  }
			
 
				- ],
			
 
				- "metadata": {
			
 
				-  "kernelspec": {
			
 
				-   "display_name": "Python 3",
			
 
				-   "language": "python",
			
 
				-   "name": "python3"
			
 
				-  },
			
 
				-  "language_info": {
			
 
				-   "codemirror_mode": {
			
 
				-    "name": "ipython",
			
 
				-    "version": 3
			
 
				-   },
			
 
				-   "file_extension": ".py",
			
 
				-   "mimetype": "text/x-python",
			
 
				-   "name": "python",
			
 
				-   "nbconvert_exporter": "python",
			
 
				-   "pygments_lexer": "ipython3",
			
 
				-   "version": "3.9.6"
			
 
				-  }
			
 
				- },
			
 
				- "nbformat": 4,
			
 
				- "nbformat_minor": 2
			
 
				-}
			
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -4,8 +4,8 @@ This folder contains examples organized by topic:
 
				 
			
 
				 | Subfolder | Description |
			
 
				 |---|---|
			
 
				-[quickstart](./quickstart)|The "Hello World" of using Llama 3, start here if you are new to using Llama 3
			
 
				-[use_cases](./use_cases)|Scripts showing common applications of Llama 3
			
 
				-[3p_integrations](./3p_integrations)|Partner-owned folder showing Meta Llama 3 usage along with third-party tools 
			
 
				+[quickstart](./quickstart)|The "Hello World" of using Llama, start here if you are new to using Llama
			
 
				+[use_cases](./use_cases)|Scripts showing common applications of Llama
			
 
				+[3p_integrations](./3p_integrations)|Partner-owned folder showing Llama usage along with third-party tools
			
 
				 [responsible_ai](./responsible_ai)|Scripts to use PurpleLlama for safeguarding model outputs
			
 
				-[experimental](./experimental)|Meta Llama implementations of experimental LLM techniques
			
 
				+[experimental](./experimental)| Llama implementations of experimental LLM techniques
			
--- a/recipes/quickstart/Getting_to_know_Llama.ipynb
+++ b/recipes/quickstart/Getting_to_know_Llama.ipynb
@@ -15,8 +15,8 @@
 
				     "id": "LERqQn5v8-ak"
			
 
				    },
			
 
				    "source": [
			
 
				-    "# **Getting to know Llama 3: Everything you need to start building**\n",
			
 
				-    "Our goal in this session is to provide a guided tour of Llama 3 with comparison with Llama 2, including understanding different Llama 3 models, how and where to access them, Generative AI and Chatbot architectures, prompt engineering, RAG (Retrieval Augmented Generation), Fine-tuning and more. All this is implemented with a starter code for you to take it and use it in your Llama 3 projects."
			
 
				+    "# **Getting to know Llama 3.1: Everything you need to start building**\n",
			
 
				+    "Our goal in this session is to provide a guided tour of Llama 3.1 with comparison with Llama 2, including understanding different Llama 3.1 models, how and where to access them, Generative AI and Chatbot architectures, prompt engineering, RAG (Retrieval Augmented Generation), Fine-tuning and more. All this is implemented with a starter code for you to take it and use it in your Llama 3.1 projects."
			
 
				    ]
			
 
				   },
			
 
				   {
			
@@ -113,6 +113,20 @@
 
				     "      llama-3-70b --> llama-3-70b-instruct\n",
			
 
				     "      classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;\n",
			
 
				     "  \"\"\")\n",
			
 
				+    "  \n",
			
 
				+    "def llama3_1_family():\n",
			
 
				+    "  mm(\"\"\"\n",
			
 
				+    "  graph LR;\n",
			
 
				+    "      llama-3-1 --> llama-3-8b\n",
			
 
				+    "      llama-3-1 --> llama-3-70b\n",
			
 
				+    "      llama-3-1 --> llama-3-4050b\n",
			
 
				+    "      llama-3-1-8b --> llama-3-1-8b\n",
			
 
				+    "      llama-3-1-8b --> llama-3-1-8b-instruct\n",
			
 
				+    "      llama-3-1-70b --> llama-3-1-70b\n",
			
 
				+    "      llama-3-1-70b --> llama-3-1-70b-instruct\n",
			
 
				+    "      llama-3-1-405b --> llama-3-1-405b-instruct\n",
			
 
				+    "      classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;\n",
			
 
				+    "  \"\"\")\n",
			
 
				     "\n",
			
 
				     "import ipywidgets as widgets\n",
			
 
				     "from IPython.display import display, Markdown\n",
			
@@ -184,7 +198,7 @@
 
				     "id": "i4Np_l_KtIno"
			
 
				    },
			
 
				    "source": [
			
 
				-    "### **1 - Understanding Llama 3**"
			
 
				+    "### **1 - Understanding Llama 3.1**"
			
 
				    ]
			
 
				   },
			
 
				   {
			
@@ -193,13 +207,13 @@
 
				     "id": "PGPSI3M5PGTi"
			
 
				    },
			
 
				    "source": [
			
 
				-    "### **1.1 - What is Llama 3?**\n",
			
 
				+    "### **1.1 - What is Llama 3.1?**\n",
			
 
				     "\n",
			
 
				     "* State of the art (SOTA), Open Source LLM\n",
			
 
				-    "* 8B, 70B - base and instruct models\n",
			
 
				+    "* 8B, 70B, 405B - base and instruct models\n",
			
 
				     "* Choosing model: Size, Quality, Cost, Speed\n",
			
 
				     "* Pretrained + Chat\n",
			
 
				-    "* [Meta Llama 3 Blog](https://ai.meta.com/blog/meta-llama-3/)\n",
			
 
				+    "* [Meta Llama 3.1 Blog](https://ai.meta.com/blog/meta-llama-3-1/)\n",
			
 
				     "* [Getting Started with Meta Llama](https://llama.meta.com/docs/get-started)"
			
 
				    ]
			
 
				   },
			
@@ -239,12 +253,21 @@
 
				    ]
			
 
				   },
			
 
				   {
			
 
				+   "cell_type": "code",
			
 
				+   "execution_count": null,
			
 
				+   "metadata": {},
			
 
				+   "outputs": [],
			
 
				+   "source": [
			
 
				+    "llama3_1_family()"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				    "cell_type": "markdown",
			
 
				    "metadata": {
			
 
				     "id": "aYeHVVh45bdT"
			
 
				    },
			
 
				    "source": [
			
 
				-    "### **1.2 - Accessing Llama 3**\n",
			
 
				+    "### **1.2 - Accessing Llama 3.1**\n",
			
 
				     "* Download + Self Host (i.e. [download Llama](https://ai.meta.com/resources/models-and-libraries/llama-downloads))\n",
			
 
				     "* Hosted API Platform (e.g. [Groq](https://console.groq.com/), [Replicate](https://replicate.com/meta/meta-llama-3-8b-instruct), [Together](https://api.together.xyz/playground/language/meta-llama/Llama-3-8b-hf), [Anyscale](https://app.endpoints.anyscale.com/playground))\n",
			
 
				     "\n",
			
@@ -258,7 +281,7 @@
 
				     "id": "kBuSay8vtzL4"
			
 
				    },
			
 
				    "source": [
			
 
				-    "### **1.3 - Use Cases of Llama 3**\n",
			
 
				+    "### **1.3 - Use Cases of Llama 3.1**\n",
			
 
				     "* Content Generation\n",
			
 
				     "* Summarization\n",
			
 
				     "* General Chatbots\n",
			
@@ -943,7 +966,7 @@
 
				     "import bs4\n",
			
 
				     "\n",
			
 
				     "# Step 1: Load the document from a web url\n",
			
 
				-    "loader = WebBaseLoader([\"https://huggingface.co/blog/llama3\"])\n",
			
 
				+    "loader = WebBaseLoader([\"https://huggingface.co/blog/llama31\"])\n",
			
 
				     "documents = loader.load()\n",
			
 
				     "\n",
			
 
				     "# Step 2: Split the document into chunks with a specified chunk size\n",
			
@@ -1013,8 +1036,8 @@
 
				    "source": [
			
 
				     "# This time your previous question and answer will be included as a chat history which will enable the ability\n",
			
 
				     "# to ask follow up questions.\n",
			
 
				-    "chat_history = [(query, result[\"answer\"])]\n",
			
 
				     "query = \"What two sizes?\"\n",
			
 
				+    "chat_history = [(query, result[\"answer\"])]\n",
			
 
				     "result = chain({\"question\": query, \"chat_history\": chat_history})\n",
			
 
				     "md(result['answer'])"
			
 
				    ]
			
@@ -1079,7 +1102,7 @@
 
				    },
			
 
				    "source": [
			
 
				     "#### **Resources**\n",
			
 
				-    "- [Meta Llama 3 Blog](https://ai.meta.com/blog/meta-llama-3/)\n",
			
 
				+    "- [Meta Llama 3.1 Blog](https://ai.meta.com/blog/meta-llama-3-1/)\n",
			
 
				     "- [Getting Started with Meta Llama](https://llama.meta.com/docs/get-started)\n",
			
 
				     "- [Llama 3 repo](https://github.com/meta-llama/llama3)\n",
			
 
				     "- [Llama 3 model card](https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md)\n",
			
@@ -1088,6 +1111,11 @@
 
				     "- [Acceptable Use Policy](https://ai.meta.com/llama/use-policy/)\n",
			
 
				     "\n"
			
 
				    ]
			
 
				+  },
			
 
				+  {
			
 
				+   "cell_type": "markdown",
			
 
				+   "metadata": {},
			
 
				+   "source": []
			
 
				   }
			
 
				  ],
			
 
				  "metadata": {
			
--- a/recipes/quickstart/Prompt_Engineering_with_Llama_3.ipynb
+++ b/recipes/quickstart/Prompt_Engineering_with_Llama_3.ipynb
@@ -7,11 +7,11 @@
 
				    "source": [
			
 
				     "<a href=\"https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/quickstart/Prompt_Engineering_with_Llama_3.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
			
 
				     "\n",
			
 
				-    "# Prompt Engineering with Llama 3\n",
			
 
				+    "# Prompt Engineering with Llama 3.1\n",
			
 
				     "\n",
			
 
				     "Prompt engineering is using natural language to produce a desired response from a large language model (LLM).\n",
			
 
				     "\n",
			
 
				-    "This interactive guide covers prompt engineering & best practices with Llama 3."
			
 
				+    "This interactive guide covers prompt engineering & best practices with Llama 3.1."
			
 
				    ]
			
 
				   },
			
 
				   {
			
@@ -45,6 +45,15 @@
 
				     "\n",
			
 
				     "Llama models come in varying parameter sizes. The smaller models are cheaper to deploy and run; the larger models are more capable.\n",
			
 
				     "\n",
			
 
				+    "#### Llama 3.1\n",
			
 
				+    "1. `llama-3.1-8b` - base pretrained 8 billion parameter model\n",
			
 
				+    "1. `llama-3.1-70b` - base pretrained 70 billion parameter model\n",
			
 
				+    "1. `llama-3.1-405b` - base pretrained 405 billion parameter model\n",
			
 
				+    "1. `llama-3.1-8b-instruct` - instruction fine-tuned 8 billion parameter model\n",
			
 
				+    "1. `llama-3.1-70b-instruct` - instruction fine-tuned 70 billion parameter model\n",
			
 
				+    "1. `llama-3.1-405b-instruct` - instruction fine-tuned 405 billion parameter model (flagship)\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				     "#### Llama 3\n",
			
 
				     "1. `llama-3-8b` - base pretrained 8 billion parameter model\n",
			
 
				     "1. `llama-3-70b` - base pretrained 70 billion parameter model\n",
			
@@ -133,7 +142,7 @@
 
				     "\n",
			
 
				     "Tokens matter most when you consider API pricing and internal behavior (ex. hyperparameters).\n",
			
 
				     "\n",
			
 
				-    "Each model has a maximum context length that your prompt cannot exceed. That's 8K tokens for Llama 3, 4K for Llama 2, and 100K for Code Llama. \n"
			
 
				+    "Each model has a maximum context length that your prompt cannot exceed. That's 128k tokens for Llama 3.1, 4K for Llama 2, and 100K for Code Llama.\n"
			
 
				    ]
			
 
				   },
			
 
				   {
			
@@ -143,7 +152,7 @@
 
				    "source": [
			
 
				     "## Notebook Setup\n",
			
 
				     "\n",
			
 
				-    "The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 3 chat using [Grok](https://console.groq.com/playground?model=llama3-70b-8192).\n",
			
 
				+    "The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 3.1 chat using [Grok](https://console.groq.com/playground?model=llama3-70b-8192).\n",
			
 
				     "\n",
			
 
				     "To install prerequisites run:"
			
 
				    ]
			
@@ -171,8 +180,9 @@
 
				     "# Get a free API key from https://console.groq.com/keys\n",
			
 
				     "os.environ[\"GROQ_API_KEY\"] = \"YOUR_GROQ_API_KEY\"\n",
			
 
				     "\n",
			
 
				-    "LLAMA3_70B_INSTRUCT = \"llama3-70b-8192\"\n",
			
 
				-    "LLAMA3_8B_INSTRUCT = \"llama3-8b-8192\"\n",
			
 
				+    "LLAMA3_405B_INSTRUCT = \"llama-3.1-405b-reasoning\" # Note: Groq currently only gives access here to paying customers for 405B model\n",
			
 
				+    "LLAMA3_70B_INSTRUCT = \"llama-3.1-70b-versatile\"\n",
			
 
				+    "LLAMA3_8B_INSTRUCT = \"llama3.1-8b-instant\"\n",
			
 
				     "\n",
			
 
				     "DEFAULT_MODEL = LLAMA3_70B_INSTRUCT\n",
			
 
				     "\n",
			
@@ -225,7 +235,7 @@
 
				    "source": [
			
 
				     "### Completion APIs\n",
			
 
				     "\n",
			
 
				-    "Let's try Llama 3!"
			
 
				+    "Let's try Llama 3.1!"
			
 
				    ]
			
 
				   },
			
 
				   {
			
@@ -488,7 +498,7 @@
 
				     "\n",
			
 
				     "Simply adding a phrase encouraging step-by-step thinking \"significantly improves the ability of large language models to perform complex reasoning\" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called \"CoT\" or \"Chain-of-Thought\" prompting.\n",
			
 
				     "\n",
			
 
				-    "Llama 3 now reasons step-by-step naturally without the addition of the phrase. This section remains for completeness."
			
 
				+    "Llama 3.1 now reasons step-by-step naturally without the addition of the phrase. This section remains for completeness."
			
 
				    ]
			
 
				   },
			
 
				   {
			
@@ -704,7 +714,7 @@
 
				    "source": [
			
 
				     "### Limiting Extraneous Tokens\n",
			
 
				     "\n",
			
 
				-    "A common struggle with Llama 2 is getting output without extraneous tokens (ex. \"Sure! Here's more information on...\"), even if explicit instructions are given to Llama 2 to be concise and no preamble. Llama 3 can better follow instructions.\n",
			
 
				+    "A common struggle with Llama 2 is getting output without extraneous tokens (ex. \"Sure! Here's more information on...\"), even if explicit instructions are given to Llama 2 to be concise and no preamble. Llama 3.x can better follow instructions.\n",
			
 
				     "\n",
			
 
				     "Check out this improvement that combines a role, rules and restrictions, explicit instructions, and an example:"
			
 
				    ]
			
--- a/recipes/quickstart/RAG/hello_llama_cloud.ipynb
+++ b/recipes/quickstart/RAG/hello_llama_cloud.ipynb
--- a/recipes/quickstart/README.md
+++ b/recipes/quickstart/README.md
@@ -2,28 +2,8 @@
 
				 
			
 
				 If you are new to developing with Meta Llama models, this is where you should start. This folder contains introductory-level notebooks across different techniques relating to Meta Llama.
			
 
				 
			
 
				-* The [Running_Llama3_Anywhere](./Running_Llama3_Anywhere/) notebooks demonstrate how to run Llama inference across Linux, Mac and Windows platforms using the appropriate tooling.
			
 
				-* The [Prompt_Engineering_with_Llama_3](./Prompt_Engineering_with_Llama_3.ipynb) notebook showcases the various ways to elicit appropriate outputs from Llama. Take this notebook for a spin to get a feel for how Llama responds to different inputs and generation parameters.
			
 
				+* The [Running_Llama_Anywhere](./Running_Llama3_Anywhere/) notebooks demonstrate how to run Llama inference across Linux, Mac and Windows platforms using the appropriate tooling.
			
 
				+* The [Prompt_Engineering_with_Llama](./Prompt_Engineering_with_Llama_3.ipynb) notebook showcases the various ways to elicit appropriate outputs from Llama. Take this notebook for a spin to get a feel for how Llama responds to different inputs and generation parameters.
			
 
				 * The [inference](./inference/) folder contains scripts to deploy Llama for inference on server and mobile. See also [3p_integrations/vllm](../3p_integrations/vllm/) and [3p_integrations/tgi](../3p_integrations/tgi/) for hosting Llama on open-source model servers.
			
 
				-* The [RAG](./RAG/) folder contains a simple Retrieval-Augmented Generation application using Llama 3.
			
 
				-* The [finetuning](./finetuning/) folder contains resources to help you finetune Llama 3 on your custom datasets, for both single- and multi-GPU setups. The scripts use the native llama-recipes finetuning code found in [finetuning.py](../../src/llama_recipes/finetuning.py) which supports these features:
			
 
				-
			
 
				-| Feature                                        |   |
			
 
				-| ---------------------------------------------- | - |
			
 
				-| HF support for finetuning                      | ✅ |
			
 
				-| Deferred initialization ( meta init)           | ✅ |
			
 
				-| HF support for inference                       | ✅ |
			
 
				-| Low CPU mode for multi GPU                     | ✅ |
			
 
				-| Mixed precision                                | ✅ |
			
 
				-| Single node quantization                       | ✅ |
			
 
				-| Flash attention                                | ✅ |
			
 
				-| PEFT                                           | ✅ |
			
 
				-| Activation checkpointing FSDP                  | ✅ |
			
 
				-| Hybrid Sharded Data Parallel (HSDP)            | ✅ |
			
 
				-| Dataset packing & padding                      | ✅ |
			
 
				-| BF16 Optimizer ( Pure BF16)                    | ✅ |
			
 
				-| Profiling & MFU tracking                       | ✅ |
			
 
				-| Gradient accumulation                          | ✅ |
			
 
				-| CPU offloading                                 | ✅ |
			
 
				-| FSDP checkpoint conversion to HF for inference | ✅ |
			
 
				-| W&B experiment tracker                         | ✅ |
			
 
				+* The [RAG](./RAG/) folder contains a simple Retrieval-Augmented Generation application using Llama.
			
 
				+* The [finetuning](./finetuning/) folder contains resources to help you finetune Llama on your custom datasets, for both single- and multi-GPU setups. The scripts use the native llama-recipes finetuning code found in [finetuning.py](../../src/llama_recipes/finetuning.py) which supports these features:
			
--- a/recipes/quickstart/Running_Llama3_Anywhere/Running_Llama_on_HF_transformers.ipynb
+++ b/recipes/quickstart/Running_Llama3_Anywhere/Running_Llama_on_HF_transformers.ipynb
@@ -4,8 +4,8 @@
 
				    "cell_type": "markdown",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				-    "## Running Meta Llama 3 on Google Colab using Hugging Face transformers library\n",
			
 
				-    "This notebook goes over how you can set up and run Llama 3 using Hugging Face transformers library\n",
			
 
				+    "## Running Meta Llama 3.1 on Google Colab using Hugging Face transformers library\n",
			
 
				+    "This notebook goes over how you can set up and run Llama 3.1 using Hugging Face transformers library\n",
			
 
				     "<a href=\"https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/quickstart/Running_Llama2_Anywhere/Running_Llama_on_HF_transformers.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
			
 
				    ]
			
 
				   },
			
@@ -14,7 +14,7 @@
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "### Steps at a glance:\n",
			
 
				-    "This demo showcases how to run the example with already converted Llama 3 weights on [Hugging Face](https://huggingface.co/meta-llama). Please Note: To use the downloads on Hugging Face, you must first request a download as shown in the steps below making sure that you are using the same email address as your Hugging Face account.\n",
			
 
				+    "This demo showcases how to run the example with already converted Llama 3.1 weights on [Hugging Face](https://huggingface.co/meta-llama). Please Note: To use the downloads on Hugging Face, you must first request a download as shown in the steps below making sure that you are using the same email address as your Hugging Face account.\n",
			
 
				     "\n",
			
 
				     "To use already converted weights, start here:\n",
			
 
				     "1. Request download of model weights from the Llama website\n",
			
@@ -45,7 +45,7 @@
 
				     "Request download of model weights from the Llama website\n",
			
 
				     "Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download models”. \n",
			
 
				     "\n",
			
 
				-    "Fill  the required information, select the models “Meta Llama 3” and accept the terms & conditions. You will receive a URL in your email in a short time."
			
 
				+    "Fill  the required information, select the models “Meta Llama 3.1” and accept the terms & conditions. You will receive a URL in your email in a short time."
			
 
				    ]
			
 
				   },
			
 
				   {
			
@@ -94,7 +94,7 @@
 
				    "source": [
			
 
				     "Then, we will set the model variable to a specific model we’d like to use. In this demo, we will use the 8b chat model `meta-llama/Meta-Llama-3.1-8B-Instruct`. Using Meta models from Hugging Face requires you to\n",
			
 
				     "\n",
			
 
				-    "1. Accept Terms of Service for Meta Llama 3 on Meta [website](https://llama.meta.com/llama-downloads).\n",
			
 
				+    "1. Accept Terms of Service for Meta Llama 3.1 on Meta [website](https://llama.meta.com/llama-downloads).\n",
			
 
				     "2. Use the same email address from Step (1) to login into Hugging Face.\n",
			
 
				     "\n",
			
 
				     "Follow the instructions on this Hugging Face page to login from your [terminal](https://huggingface.co/docs/huggingface_hub/en/quick-start). "
			
@@ -208,7 +208,7 @@
 
				     "#### 2. Clone the llama repo and get the weights\n",
			
 
				     "Git clone the [Meta Llama 3 repo](https://github.com/meta-llama/llama3). Run the `download.sh` script and follow the instructions. This will download the model checkpoints and tokenizer.\n",
			
 
				     "\n",
			
 
				-    "This example demonstrates a Meta Llama 3 model with 8B-instruct parameters, but the steps we follow would be similar for other llama models, as well as for other parameter models."
			
 
				+    "This example demonstrates a Meta Llama 3.1 model with 8B-instruct parameters, but the steps we follow would be similar for other llama models, as well as for other parameter models."
			
 
				    ]
			
 
				   },
			
 
				   {
			
@@ -223,7 +223,7 @@
 
				     "* `cd transformers`\n",
			
 
				     "* `pip install -e .`\n",
			
 
				     "* `pip install torch tiktoken blobfile accelerate`\n",
			
 
				-    "* `python3 src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ${path_to_meta_downloaded_model} --output_dir ${path_to_save_converted_hf_model} --model_size 8B --llama_version 3`"
			
 
				+    "* `python3 src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ${path_to_meta_downloaded_model} --output_dir ${path_to_save_converted_hf_model} --model_size 8B --llama_version 3.1`"
			
 
				    ]
			
 
				   },
			
 
				   {
			
@@ -233,7 +233,7 @@
 
				     "\n",
			
 
				     "#### 4. Prepare the script\n",
			
 
				     "Import the following necessary modules in your script: \n",
			
 
				-    "* `AutoModel` is the Llama 2 model class\n",
			
 
				+    "* `AutoModel` is the Llama 3 model class\n",
			
 
				     "* `AutoTokenizer` prepares your prompt for the model to process\n",
			
 
				     "* `pipeline` is an abstraction to generate model outputs"
			
 
				    ]
			
--- a/recipes/quickstart/Running_Llama3_Anywhere/Running_Llama_on_Mac_Windows_Linux.ipynb
+++ b/recipes/quickstart/Running_Llama3_Anywhere/Running_Llama_on_Mac_Windows_Linux.ipynb
@@ -5,7 +5,7 @@
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "## Running Llama 3 on Mac, Windows or Linux\n",
			
 
				-    "This notebook goes over how you can set up and run Llama 3 locally on a Mac, Windows or Linux using [Ollama](https://ollama.com/)."
			
 
				+    "This notebook goes over how you can set up and run Llama 3.1 locally on a Mac, Windows or Linux using [Ollama](https://ollama.com/)."
			
 
				    ]
			
 
				   },
			
 
				   {
			
@@ -14,9 +14,9 @@
 
				    "source": [
			
 
				     "### Steps at a glance:\n",
			
 
				     "1. Download and install Ollama.\n",
			
 
				-    "2. Download and test run Llama 3.\n",
			
 
				-    "3. Use local Llama 3 via Python.\n",
			
 
				-    "4. Use local Llama 3 via LangChain.\n"
			
 
				+    "2. Download and test run Llama 3.1\n",
			
 
				+    "3. Use local Llama 3.1 via Python.\n",
			
 
				+    "4. Use local Llama 3.1 via LangChain.\n"
			
 
				    ]
			
 
				   },
			
 
				   {
			
@@ -36,16 +36,16 @@
 
				    "source": [
			
 
				     "#### 2. Download and test run Llama 3\n",
			
 
				     "\n",
			
 
				-    "On a terminal or console, run `ollama pull llama3` to download the Llama 3 8b chat model, in the 4-bit quantized format with size about 4.7 GB.\n",
			
 
				+    "On a terminal or console, run `ollama pull llama3.1` to download the Llama 3.1 8b chat model, in the 4-bit quantized format with size about 4.7 GB.\n",
			
 
				     "\n",
			
 
				-    "Run `ollama pull llama3:70b` to download the Llama 3 70b chat model, also in the 4-bit quantized format with size 39GB.\n",
			
 
				+    "Run `ollama pull llama3.1:70b` to download the Llama 3.1 70b chat model, also in the 4-bit quantized format with size 39GB.\n",
			
 
				     "\n",
			
 
				-    "Then you can run `ollama run llama3` and ask Llama 3 questions such as \"who wrote the book godfather?\" or \"who wrote the book godfather? answer in one sentence.\" You can also try `ollama run llama3:70b`, but the inference speed will most likely be too slow - for example, on an Apple M1 Pro with 32GB RAM, it takes over 10 seconds to generate one token using Llama 3 70b chat (vs over 10 tokens per second with Llama 3 8b chat).\n",
			
 
				+    "Then you can run `ollama run llama3.1` and ask Llama 3.1 questions such as \"who wrote the book godfather?\" or \"who wrote the book godfather? answer in one sentence.\" You can also try `ollama run llama3.1:70b`, but the inference speed will most likely be too slow - for example, on an Apple M1 Pro with 32GB RAM, it takes over 10 seconds to generate one token using Llama 3.1 70b chat (vs over 10 tokens per second with Llama 3.1 8b chat).\n",
			
 
				     "\n",
			
 
				-    "You can also run the following command to test Llama 3 8b chat:\n",
			
 
				+    "You can also run the following command to test Llama 3.1 8b chat:\n",
			
 
				     "```\n",
			
 
				     " curl http://localhost:11434/api/chat -d '{\n",
			
 
				-    "  \"model\": \"llama3\",\n",
			
 
				+    "  \"model\": \"llama3.1\",\n",
			
 
				     "  \"messages\": [\n",
			
 
				     "    {\n",
			
 
				     "      \"role\": \"user\",\n",
			
@@ -63,7 +63,7 @@
 
				    "cell_type": "markdown",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				-    "#### 3. Use local Llama 3 via Python\n",
			
 
				+    "#### 3. Use local Llama 3.1 via Python\n",
			
 
				     "\n",
			
 
				     "The Python code below is the port of the curl command above."
			
 
				    ]
			
@@ -81,7 +81,7 @@
 
				     "\n",
			
 
				     "def llama3(prompt):\n",
			
 
				     "    data = {\n",
			
 
				-    "        \"model\": \"llama3\",\n",
			
 
				+    "        \"model\": \"llama3.1\",\n",
			
 
				     "        \"messages\": [\n",
			
 
				     "            {\n",
			
 
				     "              \"role\": \"user\",\n",
			
@@ -114,7 +114,7 @@
 
				    "cell_type": "markdown",
			
 
				    "metadata": {},
			
 
				    "source": [
			
 
				-    "#### 4. Use local Llama 3 via LangChain\n",
			
 
				+    "#### 4. Use local Llama 3.1 via LangChain\n",
			
 
				     "\n",
			
 
				     "Code below use LangChain with Ollama to query Llama 3 running locally. For a more advanced example of using local Llama 3 with LangChain and agent-powered RAG, see [this](https://github.com/langchain-ai/langgraph/blob/main/examples/rag/langgraph_rag_agent_llama3_local.ipynb)."
			
 
				    ]
			
@@ -136,7 +136,7 @@
 
				    "source": [
			
 
				     "from langchain_community.chat_models import ChatOllama\n",
			
 
				     "\n",
			
 
				-    "llm = ChatOllama(model=\"llama3\", temperature=0)\n",
			
 
				+    "llm = ChatOllama(model=\"llama3.1\", temperature=0)\n",
			
 
				     "response = llm.invoke(\"who wrote the book godfather?\")\n",
			
 
				     "print(response.content)\n"
			
 
				    ]
			
--- a/recipes/quickstart/finetuning/README.md
+++ b/recipes/quickstart/finetuning/README.md
@@ -27,8 +27,8 @@ It lets us specify the training settings for everything from `model_name` to `da
 
				 ```python
			
 
				     model_name: str="PATH/to/Model"
			
 
				     tokenizer_name: str=None
			
 
				-    enable_fsdp: bool=False
			
 
				-    low_cpu_fsdp: bool=False
			
 
				+    enable_fsdp: bool=False # shards model parameters, optimizer states and gradients across DDP ranks
			
 
				+    low_cpu_fsdp: bool=False # saves cpu memory by loading pretrained model on rank0 only
			
 
				     run_validation: bool=True
			
 
				     batch_size_training: int=4
			
 
				     batching_strategy: str="packing" #alternative: padding
			
@@ -42,14 +42,14 @@ It lets us specify the training settings for everything from `model_name` to `da
 
				     num_workers_dataloader: int=1
			
 
				     lr: float=1e-4
			
 
				     weight_decay: float=0.0
			
 
				-    gamma: float= 0.85
			
 
				+    gamma: float= 0.85 # multiplicatively decay the learning rate by gamma after each epoch
			
 
				     seed: int=42
			
 
				     use_fp16: bool=False
			
 
				     mixed_precision: bool=True
			
 
				     val_batch_size: int=1
			
 
				     dataset = "samsum_dataset"
			
 
				     peft_method: str = "lora" # None, llama_adapter (Caution: llama_adapter is currently not supported with FSDP)
			
 
				-    use_peft: bool=False
			
 
				+    use_peft: bool=False # use parameter efficient fine tuning
			
 
				     from_peft_checkpoint: str="" # if not empty and use_peft=True, will load the peft checkpoint and resume the fine-tuning on that checkpoint
			
 
				     output_dir: str = "PATH/to/save/PEFT/model"
			
 
				     freeze_layers: bool = False
			
--- a/recipes/quickstart/finetuning/datasets/raft_dataset.py
+++ b/recipes/quickstart/finetuning/datasets/raft_dataset.py
@@ -0,0 +1,97 @@
 
				+# Copyright (c) Meta Platforms, Inc. and affiliates.
			
 
				+# This software may be used and distributed according to the terms of the Llama 3 Community License Agreement.
			
 
				+
			
 
				+
			
 
				+import copy
			
 
				+from datasets import load_dataset
			
 
				+import itertools
			
 
				+
			
 
				+# check system prompt token seq or user prompt token seq is in the current token list
			
 
				+def check_header(targets,seq):
			
 
				+    for i in range(len(seq)-3):
			
 
				+        if seq[i:i+3] in targets:
			
 
				+            return True
			
 
				+    return False
			
 
				+def replace_target(target,seq):
			
 
				+    for i in range(len(seq)-3):
			
 
				+        if seq[i:i+3] == target:
			
 
				+            seq[i],seq[i+1],seq[i+2] = -100,-100,-100
			
 
				+    return seq
			
 
				+def tokenize_dialog(dialog, tokenizer):
			
 
				+    # If vocab size is above 128000, use the chat template to generate the tokens as it is from Llama 3 family models
			
 
				+    if tokenizer.vocab_size >= 128000:
			
 
				+        dialog_tokens = tokenizer.apply_chat_template(dialog)
			
 
				+        eot_indices = [i for i,n in enumerate(dialog_tokens) if n == 128009]
			
 
				+        labels = copy.copy(dialog_tokens)
			
 
				+        last_idx = 0
			
 
				+        # system prompt header "<|start_header_id|>system<|end_header_id|>" has been tokenized to [128006, 9125, 128007]
			
 
				+        # user prompt header "<|start_header_id|>user<|end_header_id|>" has been tokenized to [128006, 882, 128007]
			
 
				+        prompt_header_seqs = [[128006, 9125, 128007],[128006, 882, 128007]]
			
 
				+        for n, idx in enumerate(eot_indices):
			
 
				+            current_seq = labels[last_idx:idx+1]
			
 
				+            if check_header(prompt_header_seqs,current_seq):
			
 
				+                # found prompt header, indicating that this seq should be masked
			
 
				+                labels[last_idx:idx+1] = [-100] * (idx-last_idx+1)
			
 
				+            else:
			
 
				+                last_idx = idx
			
 
				+        # Lastly mask all the assistant header prompt <|start_header_id|>assistant<|end_header_id|>, which has been tokenized to [128006, 78191, 128007]
			
 
				+        assistant_header_seq = [128006, 78191, 128007]
			
 
				+        labels = replace_target(assistant_header_seq,labels)
			
 
				+        dialog_tokens = [dialog_tokens]
			
 
				+        labels_tokens = [labels]
			
 
				+    else:
			
 
				+        raise Exception("This raft_dataset only supports Llama 3 family models, please make sure the tokenizer is from Llama 3 family models.")
			
 
				+
			
 
				+    combined_tokens = {
			
 
				+        "input_ids": list(itertools.chain(*(t for t in dialog_tokens))),
			
 
				+        "labels": list(itertools.chain(*(t for t in labels_tokens))),
			
 
				+    }
			
 
				+
			
 
				+    return dict(combined_tokens, attention_mask=[1]*len(combined_tokens["input_ids"]))
			
 
				+def raft_tokenize(q_a_pair, tokenizer):
			
 
				+    end_tag = "</DOCUMENT>"
			
 
				+    # find the last end_tag in the instruction, the rest is the question
			
 
				+    try:
			
 
				+        index =q_a_pair["instruction"].rindex(end_tag)+len(end_tag)
			
 
				+    except ValueError:
			
 
				+        print(q_a_pair["instruction"])
			
 
				+        raise Exception("The instruction does not contain the end tag <\/DOCUMENT>")
			
 
				+    # all the lines after end_tag are the question
			
 
				+    question = q_a_pair["instruction"][index:].strip()
			
 
				+    # all the lines before end_tag are the context
			
 
				+    documents = q_a_pair["instruction"][:index].strip() 
			
 
				+    # output is the label
			
 
				+    answer = q_a_pair["output"]
			
 
				+    system_prompt = "You are a helpful chatbot who can provide an answer to every questions from the user given a relevant context."
			
 
				+    user_prompt = """
			
 
				+        Question: {question}\nContext: {context}\n
			
 
				+        Answer this question using the information given by multiple documents in the context above. Here are the things to pay attention to:
			
 
				+        - The context contains many documents, each document starts with <DOCUMENT> and ends </DOCUMENT>.
			
 
				+        - First provide step-by-step reasoning on how to answer the question.
			
 
				+        - In the reasoning, if you need to copy paste some sentences from the context, include them in ##begin_quote## and ##end_quote##. This would mean that things outside of ##begin_quote## and ##end_quote## are not directly copy paste from the context.
			
 
				+        - End your response with final answer in the form <ANSWER>: $answer, the answer should less than 60 words.
			
 
				+        You MUST begin your final answer with the tag "<ANSWER>:".
			
 
				+    """.format(question=question, context=documents)
			
 
				+
			
 
				+    chat = [
			
 
				+    {"role": "system", "content": system_prompt},
			
 
				+    {"role": "user", "content": user_prompt},
			
 
				+    {"role": "assistant", "content": answer}
			
 
				+    ]
			
 
				+    return tokenize_dialog(chat, tokenizer)
			
 
				+
			
 
				+
			
 
				+def get_custom_dataset(dataset_config, tokenizer, split, split_ratio=0.9):
			
 
				+    # load_dataset will return DatasetDict that contains all the data in the train set
			
 
				+    dataset_dict = load_dataset('json', data_files=dataset_config.data_path)
			
 
				+    dataset = dataset_dict['train']
			
 
				+    dataset = dataset.train_test_split(test_size=1-split_ratio, shuffle=True, seed=42)
			
 
				+
			
 
				+    dataset = dataset[split].map(lambda sample: {
			
 
				+        "instruction": sample["instruction"],
			
 
				+        "output": sample["cot_answer"],
			
 
				+        },
			
 
				+        batched=True,
			
 
				+    )
			
 
				+    dataset = dataset.map(lambda x: raft_tokenize(x, tokenizer))
			
 
				+    return dataset
			
--- a/recipes/responsible_ai/llama_guard/llama_guard_customization_via_prompting_and_fine_tuning.ipynb
+++ b/recipes/responsible_ai/llama_guard/llama_guard_customization_via_prompting_and_fine_tuning.ipynb
@@ -15,7 +15,7 @@
 
				    "source": [
			
 
				     "# Llama Guard 3 Customization: Taxonomy Customization, Zero/Few-shot prompting, Evaluation and Fine Tuning \n",
			
 
				     "\n",
			
 
				-    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/responsible_ai/llama_guard/llama_guard_customization_via_prompting_changes_and_fine_tuning.ipynb\">\n",
			
 
				+    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/responsible_ai/llama_guard/llama_guard_customization_via_prompting_and_fine_tuning.ipynb\">\n",
			
 
				     "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
			
 
				     "</a>\n",
			
 
				     "\n",
			
--- a/recipes/responsible_ai/prompt_guard/inference.py
+++ b/recipes/responsible_ai/prompt_guard/inference.py
@@ -31,7 +31,45 @@ def load_model_and_tokenizer(model_name='meta-llama/Prompt-Guard-86M'):
 
				     return model, tokenizer
			
 
				 
			
 
				 
			
 
				-def get_class_probabilities(model, tokenizer, text, temperature=1.0, device='cpu'):
			
 
				+def preprocess_text_for_promptguard(text: str, tokenizer) -> str:
			
 
				+    """
			
 
				+    Preprocess the text by removing spaces that break apart larger tokens.
			
 
				+    This hotfixes a workaround to PromptGuard, where spaces can be inserted into a string
			
 
				+    to allow the string to be classified as benign.
			
 
				+
			
 
				+    Args:
			
 
				+        text (str): The input text to preprocess.
			
 
				+        tokenizer (transformers.PreTrainedTokenizer): The tokenizer for the model.
			
 
				+
			
 
				+    Returns:
			
 
				+        str: The preprocessed text.
			
 
				+    """
			
 
				+
			
 
				+    try:
			
 
				+        cleaned_text = ''
			
 
				+        index_map = []
			
 
				+        for i, char in enumerate(text):
			
 
				+            if not char.isspace():
			
 
				+                cleaned_text += char
			
 
				+                index_map.append(i)
			
 
				+        tokens = tokenizer.tokenize(cleaned_text)
			
 
				+        result = []
			
 
				+        last_end = 0
			
 
				+        for token in tokens:
			
 
				+            token_str = tokenizer.convert_tokens_to_string([token])
			
 
				+            start = cleaned_text.index(token_str, last_end)
			
 
				+            end = start + len(token_str)
			
 
				+            original_start = index_map[start]
			
 
				+            if original_start > 0 and text[original_start - 1].isspace():
			
 
				+                result.append(' ')
			
 
				+            result.append(token_str)
			
 
				+            last_end = end
			
 
				+        return ''.join(result)
			
 
				+    except Exception:
			
 
				+        return text
			
 
				+
			
 
				+
			
 
				+def get_class_probabilities(model, tokenizer, text, temperature=1.0, device='cpu', preprocess=True):
			
 
				     """
			
 
				     Evaluate the model on the given text with temperature-adjusted softmax.
			
 
				     Note, as this is a DeBERTa model, the input text should have a maximum length of 512.
			
@@ -44,6 +82,8 @@ def get_class_probabilities(model, tokenizer, text, temperature=1.0, device='cpu
 
				     Returns:
			
 
				         torch.Tensor: The probability of each class adjusted by the temperature.
			
 
				     """
			
 
				+    if preprocess:
			
 
				+        text = preprocess_text_for_promptguard(text, tokenizer)
			
 
				     # Encode the text
			
 
				     inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
			
 
				     inputs = inputs.to(device)
			
@@ -57,7 +97,7 @@ def get_class_probabilities(model, tokenizer, text, temperature=1.0, device='cpu
 
				     return probabilities
			
 
				 
			
 
				 
			
 
				-def get_jailbreak_score(model, tokenizer, text, temperature=1.0, device='cpu'):
			
 
				+def get_jailbreak_score(model, tokenizer, text, temperature=1.0, device='cpu', preprocess=True):
			
 
				     """
			
 
				     Evaluate the probability that a given string contains malicious jailbreak or prompt injection.
			
 
				     Appropriate for filtering dialogue between a user and an LLM.
			
@@ -70,11 +110,11 @@ def get_jailbreak_score(model, tokenizer, text, temperature=1.0, device='cpu'):
 
				     Returns:
			
 
				         float: The probability of the text containing malicious content.
			
 
				     """
			
 
				-    probabilities = get_class_probabilities(model, tokenizer, text, temperature, device)
			
 
				+    probabilities = get_class_probabilities(model, tokenizer, text, temperature, device, preprocess)
			
 
				     return probabilities[0, 2].item()
			
 
				 
			
 
				 
			
 
				-def get_indirect_injection_score(model, tokenizer, text, temperature=1.0, device='cpu'):
			
 
				+def get_indirect_injection_score(model, tokenizer, text, temperature=1.0, device='cpu', preprocess=True):
			
 
				     """
			
 
				     Evaluate the probability that a given string contains any embedded instructions (malicious or benign).
			
 
				     Appropriate for filtering third party inputs (e.g. web searches, tool outputs) into an LLM.
			
@@ -87,11 +127,11 @@ def get_indirect_injection_score(model, tokenizer, text, temperature=1.0, device
 
				     Returns:
			
 
				         float: The combined probability of the text containing malicious or embedded instructions.
			
 
				     """
			
 
				-    probabilities = get_class_probabilities(model, tokenizer, text, temperature, device)
			
 
				+    probabilities = get_class_probabilities(model, tokenizer, text, temperature, device, preprocess)
			
 
				     return (probabilities[0, 1] + probabilities[0, 2]).item()
			
 
				 
			
 
				 
			
 
				-def process_text_batch(model, tokenizer, texts, temperature=1.0, device='cpu'):
			
 
				+def process_text_batch(model, tokenizer, texts, temperature=1.0, device='cpu', preprocess=True):
			
 
				     """
			
 
				     Process a batch of texts and return their class probabilities.
			
 
				     Args:
			
@@ -104,6 +144,8 @@ def process_text_batch(model, tokenizer, texts, temperature=1.0, device='cpu'):
 
				     Returns:
			
 
				         torch.Tensor: A tensor containing the class probabilities for each text in the batch.
			
 
				     """
			
 
				+    if preprocess:
			
 
				+        texts = [preprocess_text_for_promptguard(text, tokenizer) for text in texts]
			
 
				     inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
			
 
				     inputs = inputs.to(device)
			
 
				     with torch.no_grad():
			
@@ -113,7 +155,7 @@ def process_text_batch(model, tokenizer, texts, temperature=1.0, device='cpu'):
 
				     return probabilities
			
 
				 
			
 
				 
			
 
				-def get_scores_for_texts(model, tokenizer, texts, score_indices, temperature=1.0, device='cpu', max_batch_size=16):
			
 
				+def get_scores_for_texts(model, tokenizer, texts, score_indices, temperature=1.0, device='cpu', max_batch_size=16, preprocess=True):
			
 
				     """
			
 
				     Compute scores for a list of texts, handling texts of arbitrary length by breaking them into chunks and processing in parallel.
			
 
				     Args:
			
@@ -138,7 +180,7 @@ def get_scores_for_texts(model, tokenizer, texts, score_indices, temperature=1.0
 
				     for i in range(0, len(all_chunks), max_batch_size):
			
 
				         batch_chunks = all_chunks[i:i+max_batch_size]
			
 
				         batch_indices = text_indices[i:i+max_batch_size]
			
 
				-        probabilities = process_text_batch(model, tokenizer, batch_chunks, temperature, device)
			
 
				+        probabilities = process_text_batch(model, tokenizer, batch_chunks, temperature, device, preprocess)
			
 
				         scores = probabilities[:, score_indices].sum(dim=1).tolist()
			
 
				         
			
 
				         for idx, score in zip(batch_indices, scores):
			
@@ -146,7 +188,7 @@ def get_scores_for_texts(model, tokenizer, texts, score_indices, temperature=1.0
 
				     return all_scores
			
 
				 
			
 
				 
			
 
				-def get_jailbreak_scores_for_texts(model, tokenizer, texts, temperature=1.0, device='cpu', max_batch_size=16):
			
 
				+def get_jailbreak_scores_for_texts(model, tokenizer, texts, temperature=1.0, device='cpu', max_batch_size=16, preprocess=True):
			
 
				     """
			
 
				     Compute jailbreak scores for a list of texts.
			
 
				     Args:
			
@@ -160,10 +202,10 @@ def get_jailbreak_scores_for_texts(model, tokenizer, texts, temperature=1.0, dev
 
				     Returns:
			
 
				         list[float]: A list of jailbreak scores for each text.
			
 
				     """
			
 
				-    return get_scores_for_texts(model, tokenizer, texts, [2], temperature, device, max_batch_size)
			
 
				+    return get_scores_for_texts(model, tokenizer, texts, [2], temperature, device, max_batch_size, preprocess)
			
 
				 
			
 
				 
			
 
				-def get_indirect_injection_scores_for_texts(model, tokenizer, texts, temperature=1.0, device='cpu', max_batch_size=16):
			
 
				+def get_indirect_injection_scores_for_texts(model, tokenizer, texts, temperature=1.0, device='cpu', max_batch_size=16, preprocess=True):
			
 
				     """
			
 
				     Compute indirect injection scores for a list of texts.
			
 
				     Args:
			
@@ -177,4 +219,4 @@ def get_indirect_injection_scores_for_texts(model, tokenizer, texts, temperature
 
				     Returns:
			
 
				         list[float]: A list of indirect injection scores for each text.
			
 
				     """
			
 
				-    return get_scores_for_texts(model, tokenizer, texts, [1, 2], temperature, device, max_batch_size)
			
 
				+    return get_scores_for_texts(model, tokenizer, texts, [1, 2], temperature, device, max_batch_size, preprocess)
			
--- a/recipes/responsible_ai/prompt_guard/prompt_guard_tutorial.ipynb
+++ b/recipes/responsible_ai/prompt_guard/prompt_guard_tutorial.ipynb
--- a/recipes/use_cases/end2end-recipes/RAFT-Chatbot/README.md
+++ b/recipes/use_cases/end2end-recipes/RAFT-Chatbot/README.md
--- a/recipes/use_cases/end2end-recipes/RAFT-Chatbot/config.py
+++ b/recipes/use_cases/end2end-recipes/RAFT-Chatbot/config.py
@@ -0,0 +1,10 @@
 
				+# Copyright (c) Meta Platforms, Inc. and affiliates.
			
 
				+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
			
 
				+
			
 
				+import yaml
			
 
				+
			
 
				+def load_config(config_path: str = "./config.yaml"):
			
 
				+    # Read the YAML configuration file
			
 
				+    with open(config_path, "r") as file:
			
 
				+        config = yaml.safe_load(file)
			
 
				+    return config
			
--- a/recipes/use_cases/end2end-recipes/RAFT-Chatbot/eval_llama.json
+++ b/recipes/use_cases/end2end-recipes/RAFT-Chatbot/eval_llama.json
--- a/recipes/use_cases/end2end-recipes/RAFT-Chatbot/format.py
+++ b/recipes/use_cases/end2end-recipes/RAFT-Chatbot/format.py
@@ -0,0 +1,174 @@
 
				+# file copied from https://github.com/ShishirPatil/gorilla/blob/main/raft/format.py
			
 
				+from abc import ABC, abstractmethod
			
 
				+import argparse
			
 
				+from datasets import Dataset, load_dataset
			
 
				+from typing import Dict, Literal, Any, get_args
			
 
				+
			
 
				+"""
			
 
				+This file allows to convert raw HuggingFace Datasets into files suitable to fine tune completion and chat models.
			
 
				+"""
			
 
				+
			
 
				+OutputDatasetType = Literal["parquet", "jsonl"]
			
 
				+outputDatasetTypes = list(get_args(OutputDatasetType))
			
 
				+
			
 
				+InputDatasetType = Literal["arrow", "jsonl"]
			
 
				+inputDatasetTypes = list(get_args(InputDatasetType))
			
 
				+
			
 
				+DatasetFormat = Literal["hf", "completion", "chat"]
			
 
				+datasetFormats = list(get_args(DatasetFormat))
			
 
				+
			
 
				+def get_args() -> argparse.Namespace:
			
 
				+    """
			
 
				+    Parses and returns the arguments specified by the user's command
			
 
				+    """
			
 
				+    parser = argparse.ArgumentParser()
			
 
				+
			
 
				+    parser.add_argument("--input", type=str, required=True, help="Input HuggingFace dataset file")
			
 
				+    parser.add_argument("--input-type", type=str, default="arrow", help="Format of the input dataset. Defaults to arrow.", choices=inputDatasetTypes)
			
 
				+    parser.add_argument("--output", type=str, required=True, help="Output file")
			
 
				+    parser.add_argument("--output-format", type=str, required=True, help="Format to convert the dataset to", choices=datasetFormats)
			
 
				+    parser.add_argument("--output-type", type=str, default="jsonl", help="Type to export the dataset to. Defaults to jsonl.", choices=outputDatasetTypes)
			
 
				+    parser.add_argument("--output-chat-system-prompt", type=str, help="The system prompt to use when the output format is chat")
			
 
				+
			
 
				+    args = parser.parse_args()
			
 
				+    return args
			
 
				+
			
 
				+class DatasetFormatter(ABC):
			
 
				+    """
			
 
				+    Base class for dataset formatters. Formatters rename columns, remove and add 
			
 
				+    columns to match the expected target format structure. HF, Chat or Completion models file formats.
			
 
				+    https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
			
 
				+    """
			
 
				+    @abstractmethod
			
 
				+    def format(self, ds: Dataset, params: Dict[str, str]) -> Dataset:
			
 
				+        pass
			
 
				+
			
 
				+class DatasetExporter(ABC):
			
 
				+    """
			
 
				+    Base class for dataset exporters. Exporters export dataset to different file types, JSONL, Parquet, ...
			
 
				+    """
			
 
				+    @abstractmethod
			
 
				+    def export(self, ds: Dataset, output_path: str):
			
 
				+        pass
			
 
				+
			
 
				+class DatasetConverter():
			
 
				+    """
			
 
				+    Entry point class. It resolves which DatasetFormatter and which DatasetExporter to use and runs them.
			
 
				+    """
			
 
				+    formats: Dict[DatasetFormat, DatasetFormatter]
			
 
				+    exporters: Dict[OutputDatasetType, Any]
			
 
				+
			
 
				+    def __init__(self) -> None:
			
 
				+        self.formats = {
			
 
				+            "hf": HuggingFaceDatasetFormatter(),
			
 
				+            "completion": OpenAiCompletionDatasetFormatter(),
			
 
				+            "chat": OpenAiChatDatasetFormatter()
			
 
				+        }
			
 
				+        self.exporters = {
			
 
				+            "parquet": ParquetDatasetExporter(),
			
 
				+            "jsonl": JsonlDatasetExporter()
			
 
				+        }
			
 
				+
			
 
				+    def convert(self, ds: Dataset, format: DatasetFormat, output_path: str, output_type: OutputDatasetType, params: Dict[str, str]):
			
 
				+        if not format in self.formats:
			
 
				+            raise Exception(f"Output Format {format} is not supported, pleased select one of {self.formats.keys()}")
			
 
				+        
			
 
				+        if not output_type in self.exporters:
			
 
				+            raise Exception(f"Output Type {output_type} is not supported, pleased select one of {self.exporters.keys()}")
			
 
				+
			
 
				+        formatter = self.formats[format]
			
 
				+        newds = formatter.format(ds, params)
			
 
				+        exporter = self.exporters[output_type]
			
 
				+        exporter.export(newds, output_path)
			
 
				+
			
 
				+class HuggingFaceDatasetFormatter(DatasetFormatter):
			
 
				+    """
			
 
				+    Returns the HuggingFace Dataset as is
			
 
				+    """
			
 
				+    def format(self, ds: Dataset, params: Dict[str, str]) -> Dataset:
			
 
				+        return ds
			
 
				+
			
 
				+def _remove_all_columns_but(ds: Dataset, keep_columns) -> Dataset:
			
 
				+    """
			
 
				+    HF Dataset doesn't have a way to copy only specific columns of a Dataset so this help
			
 
				+    removes all columns but the ones specified.
			
 
				+    """
			
 
				+    remove_columns = list(ds.column_names)
			
 
				+    for keep in keep_columns:
			
 
				+        remove_columns.remove(keep)
			
 
				+    ds = ds.remove_columns(remove_columns)
			
 
				+    return ds
			
 
				+
			
 
				+class OpenAiCompletionDatasetFormatter(DatasetFormatter):
			
 
				+    """
			
 
				+    Returns the Dataset in the OpenAI Completion Fine-tuning file format with two fields "prompt" and "completion".
			
 
				+    https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
			
 
				+    """
			
 
				+    def format(self, ds: Dataset, params: Dict[str, str]) -> Dataset:
			
 
				+        newds = ds.rename_columns({'question': 'prompt', 'cot_answer': 'completion'})
			
 
				+        return _remove_all_columns_but(newds, ['prompt', 'completion'])
			
 
				+
			
 
				+class OpenAiChatDatasetFormatter(OpenAiCompletionDatasetFormatter):
			
 
				+    """
			
 
				+    Returns the Dataset in the OpenAI Chat Fine-tuning file format with one field "messages".
			
 
				+    https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
			
 
				+    """
			
 
				+    def format(self, ds: Dataset, params: Dict[str, str]) -> Dataset:
			
 
				+        newds = super().format(ds, params)
			
 
				+
			
 
				+        def format_messages(row):
			
 
				+            messages = []
			
 
				+            if 'system_prompt' in params:
			
 
				+                system_prompt = params['system_prompt']
			
 
				+                messages.append({ "role": "system", "content": system_prompt})
			
 
				+            messages.extend([{ "role": "user", "content": row['prompt']}, { "role": "assistant", "content": row['completion']}])
			
 
				+            chat_row = {"messages": messages}
			
 
				+            return chat_row
			
 
				+
			
 
				+        newds = newds.map(format_messages)
			
 
				+        return _remove_all_columns_but(newds, ['messages'])
			
 
				+
			
 
				+def append_extension(path: str, extension: str) -> str:
			
 
				+    suffix = "." + extension
			
 
				+    if not path.endswith(suffix):
			
 
				+        path = path + suffix
			
 
				+    return path
			
 
				+
			
 
				+
			
 
				+class JsonlDatasetExporter(DatasetExporter):
			
 
				+    """
			
 
				+    Exports the Dataset to a JSONL file
			
 
				+    """
			
 
				+
			
 
				+    def export(self, ds: Dataset, output_path: str):
			
 
				+        ds.to_json(append_extension(output_path, "jsonl"))
			
 
				+
			
 
				+
			
 
				+class ParquetDatasetExporter(DatasetExporter):
			
 
				+    """
			
 
				+    Exports the Dataset to a Parquet file
			
 
				+    """
			
 
				+
			
 
				+    def export(self, ds: Dataset, output_path: str):
			
 
				+        ds.to_parquet(append_extension(output_path, "parquet"))
			
 
				+
			
 
				+
			
 
				+def main():
			
 
				+    """
			
 
				+    When raft.py is executed from the command line.
			
 
				+    """
			
 
				+    args = get_args()
			
 
				+    ds = load_dataset(args.input_type, data_files={"train": args.input})['train']
			
 
				+    formatter = DatasetConverter()
			
 
				+
			
 
				+    if args.output_chat_system_prompt and args.output_format != "chat":
			
 
				+        raise Exception("Parameter --output-chat-system-prompt can only be used with --output-format chat")
			
 
				+
			
 
				+    format_params = {}
			
 
				+    if args.output_chat_system_prompt:
			
 
				+        format_params['system_prompt'] = args.output_chat_system_prompt
			
 
				+
			
 
				+    formatter.convert(ds=ds, format=args.output_format, output_path=args.output, output_type=args.output_type, params=format_params)
			
 
				+
			
 
				+if __name__ == "__main__":
			
 
				+    main()
			
--- a/recipes/use_cases/end2end-recipes/RAFT-Chatbot/images/Answers_Precision.png
+++ b/recipes/use_cases/end2end-recipes/RAFT-Chatbot/images/Answers_Precision.png
--- a/recipes/use_cases/end2end-recipes/RAFT-Chatbot/images/LLM_score_comparison.png
+++ b/recipes/use_cases/end2end-recipes/RAFT-Chatbot/images/LLM_score_comparison.png
--- a/recipes/use_cases/end2end-recipes/RAFT-Chatbot/images/Num_of_refusal_comparison.png
+++ b/recipes/use_cases/end2end-recipes/RAFT-Chatbot/images/Num_of_refusal_comparison.png
--- a/recipes/use_cases/end2end-recipes/RAFT-Chatbot/images/RAFT.png
+++ b/recipes/use_cases/end2end-recipes/RAFT-Chatbot/images/RAFT.png
--- a/recipes/use_cases/end2end-recipes/RAFT-Chatbot/raft.py
+++ b/recipes/use_cases/end2end-recipes/RAFT-Chatbot/raft.py
@@ -0,0 +1,89 @@
 
				+import logging
			
 
				+import os
			
 
				+import argparse
			
 
				+from raft_utils import generate_questions, add_chunk_to_dataset
			
 
				+from format import DatasetConverter, datasetFormats, outputDatasetTypes
			
 
				+from config import load_config
			
 
				+
			
 
				+logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
			
 
				+
			
 
				+def main(api_config):
			
 
				+    ds = None
			
 
				+    try:
			
 
				+        logging.info("Starting to generate question pair.")
			
 
				+        # Generate questions as list for each chunk
			
 
				+        chunk_questions_zip = generate_questions(api_config)
			
 
				+        if not chunk_questions_zip:
			
 
				+            logging.warning("No questions generated from text. Please check the api_config or model configuration.")
			
 
				+            return
			
 
				+        logging.info(f"Successfully generated {sum([len(q) for c,q in chunk_questions_zip])} question/answer pairs.")
			
 
				+        ds = add_chunk_to_dataset(chunk_questions_zip,api_config)
			
 
				+        ds.save_to_disk(args.output)
			
 
				+        logging.info(f"Data successfully written to {api_config['output']}. Process completed.")
			
 
				+        formatter = DatasetConverter()
			
 
				+
			
 
				+        # Extract format specific params
			
 
				+        format_params = {}
			
 
				+        formatter.convert(ds=ds, format=args.output_format, output_path=args.output+"raft", output_type=args.output_type, params=format_params)
			
 
				+    except Exception as e:
			
 
				+        logging.error(f"An unexpected error occurred during the process: {e}",exc_info=True)
			
 
				+
			
 
				+def parse_arguments():
			
 
				+    # Define command line arguments for the script
			
 
				+    parser = argparse.ArgumentParser(
			
 
				+        description="Generate RAFT question/answer/context pairs from documentation."
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        "-t", "--questions_per_chunk",
			
 
				+        type=int,
			
 
				+        default=4,
			
 
				+        help="Specify the number of question pairs to generate per chunk."
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        "-m", "--model",
			
 
				+        default="meta-llama/Meta-Llama-3-70B-Instruct",
			
 
				+        help="Select the model to use for generation."
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        "-c", "--config_path",
			
 
				+        default="./raft.yaml",
			
 
				+        help="Set the configuration file path that has system prompt along with language, dataset path and number of questions."
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        "-u", "--endpoint_url",
			
 
				+        default="http://localhost:8001/v1",
			
 
				+        type=str,
			
 
				+        help="LLM API url for generating question/answer pairs."
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        "-k", "--api_key",
			
 
				+        default="EMPTY",
			
 
				+        type=str,
			
 
				+        help="LLM API key for generating question/answer pairs."
			
 
				+    )
			
 
				+    parser.add_argument("--chunk_size", type=int, default=1000, help="The size of each chunk in number of tokens")
			
 
				+    parser.add_argument("-o","--output", type=str, default="./output/", help="The path at which to save the dataset")
			
 
				+    parser.add_argument("--output-format", type=str, default="hf", help="Format to convert the dataset to. Defaults to hf.", choices=datasetFormats)
			
 
				+    parser.add_argument("--output-type", type=str, default="jsonl", help="Type to export the dataset to. Defaults to jsonl.", choices=outputDatasetTypes)
			
 
				+    return parser.parse_args()
			
 
				+
			
 
				+if __name__ == "__main__":
			
 
				+    logging.info("Initializing the process and loading configuration...")
			
 
				+    args = parse_arguments()
			
 
				+
			
 
				+    api_config = load_config(args.config_path)
			
 
				+    api_config["questions_per_chunk"] = args.questions_per_chunk
			
 
				+    api_config["model"] = args.model
			
 
				+    api_config["chunk_size"] = args.chunk_size
			
 
				+    api_config["endpoint_url"] = args.endpoint_url
			
 
				+    api_config["output"] = args.output
			
 
				+    api_config["api_key"] = args.api_key
			
 
				+    # if OPENAI_API_KEY is defined in the system environment, use it as the API key
			
 
				+    if os.environ.get('API_KEY') is not None:
			
 
				+        api_config["api_key"] = os.environ["API_KEY"]
			
 
				+    logging.info(f"Configuration loaded. Generating {args.questions_per_chunk} question per chunk using model '{args.model}'.")
			
 
				+    logging.info(f"Chunk size: {args.chunk_size}.")
			
 
				+    logging.info(f"num_distract_docs: {api_config['num_distract_docs']}, refusal_probability: {api_config['refusal_probability']}")
			
 
				+    logging.info(f"Will use endpoint_url: {args.endpoint_url}.")
			
 
				+    logging.info(f"Output will be written to {args.output}.")
			
 
				+    main(api_config)
			
--- a/recipes/use_cases/end2end-recipes/RAFT-Chatbot/raft.yaml
+++ b/recipes/use_cases/end2end-recipes/RAFT-Chatbot/raft.yaml
@@ -0,0 +1,51 @@
 
				+COT_prompt_template: >
			
 
				+  <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful chatbot who can provide an answer to every questions from the user given a relevant context.<|eot_id|>
			
 
				+  <|start_header_id|>user<|end_header_id|>
			
 
				+  Question: {question}\nContext: {context}\n
			
 
				+  Answer this question using the information given by multiple documents in the context above. Here are the things to pay attention to:
			
 
				+  - The context contains many documents, each document starts with <DOCUMENT> and ends </DOCUMENT>.
			
 
				+  - First provide step-by-step reasoning on how to answer the question.
			
 
				+  - In the reasoning, if you need to copy paste some sentences from the context, include them in ##begin_quote## and ##end_quote##. This would mean that things outside of ##begin_quote## and ##end_quote## are not directly copy paste from the context.
			
 
				+  - End your response with final answer in the form <ANSWER>: $answer, the answer should less than 60 words.
			
 
				+  You MUST begin your final answer with the tag "<ANSWER>:". <|eot_id|><|start_header_id|>assistant<|end_header_id|>
			
 
				+
			
 
				+question_prompt_template: >
			
 
				+  <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a synthetic question-answer pair generator. Given a chunk of context about
			
 
				+  some topic(s), generate {num_questions} example questions a user could ask and would be answered
			
 
				+  using information from the chunk. For example, if the given context was a Wikipedia
			
 
				+  paragraph about the United States, an example question could be 'How many states are
			
 
				+  in the United States?
			
 
				+  Your questions should be formulated in the same style as questions that users could ask in a search engine.
			
 
				+  This means that your questions MUST NOT mention something like "according to the passage" or "context".
			
 
				+  The questions should be able to be answered in 60 words or less. Include only the questions in your response.<|eot_id|>
			
 
				+  <|start_header_id|>user<|end_header_id|>
			
 
				+  Context: {context}\n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
			
 
				+
			
 
				+# question_prompt_template: >
			
 
				+#   <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a language model skilled in creating quiz questions.
			
 
				+#   You will be provided with a document,
			
 
				+#   read it and please generate factoid question and answer pairs that are most likely be asked by a user of Llama language models
			
 
				+#   which includes LLama, Llama2, Meta Llama3, Code Llama, Meta Llama Guard 1,	Meta Llama Guard 2
			
 
				+#   Your factoid questions should be answerable with a specific, concise piece of factual information from the context.
			
 
				+#   Your factoid questions should be formulated in the same style as questions users could ask in a search engine.
			
 
				+#   This means that your factoid questions MUST NOT mention something like "according to the passage" or "context".
			
 
				+#   please make sure you follow those rules:
			
 
				+#   1. Generate {num_questions} question answer pairs, you can generate less answer if there is nothing related to
			
 
				+#   model, training, fine-tuning and evaluation details of Llama language models,
			
 
				+#   2. The questions can be answered based *solely* on the given passage.
			
 
				+#   3. Avoid asking questions with similar meaning.
			
 
				+#   4. Never use any abbreviation.
			
 
				+#   5. The questions should be able to be answered in 60 words or less. Include only the questions in your response. <|eot_id|>
			
 
				+#   <|start_header_id|>user<|end_header_id|>
			
 
				+#   Context: {context}\n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
			
 
				+data_dir: "./data"
			
 
				+
			
 
				+xml_path: ""
			
 
				+
			
 
				+chunk_size: 1000
			
 
				+
			
 
				+questions_per_chunk: 5
			
 
				+
			
 
				+num_distract_docs: 4 # number of distracting documents to add to each chunk
			
 
				+
			
 
				+refusal_probability: 0.05 # probability of related documents to be added to each chunk
			
--- a/recipes/use_cases/end2end-recipes/RAFT-Chatbot/raft_eval.py
+++ b/recipes/use_cases/end2end-recipes/RAFT-Chatbot/raft_eval.py
@@ -0,0 +1,336 @@
 
				+# Copyright (c) Meta Platforms, Inc. and affiliates.
			
 
				+# This software may be used and distributed according to the terms of the Llama 3 Community License Agreement.
			
 
				+import logging
			
 
				+import evaluate
			
 
				+import argparse
			
 
				+from config import load_config
			
 
				+import json
			
 
				+from langchain_openai import ChatOpenAI
			
 
				+from langchain_community.embeddings import HuggingFaceEmbeddings
			
 
				+from langchain_community.vectorstores import FAISS
			
 
				+from langchain.text_splitter import RecursiveCharacterTextSplitter
			
 
				+from langchain_community.vectorstores.utils import DistanceStrategy
			
 
				+from datetime import datetime
			
 
				+from langchain_community.document_loaders import DirectoryLoader
			
 
				+import re
			
 
				+import string
			
 
				+import pandas as pd 
			
 
				+
			
 
				+
			
 
				+def generate_answers_model_only(model_name,question_list,api_url="http://localhost:8000/v1",key="EMPTY"):
			
 
				+        # Use langchain to load the documents from data directory
			
 
				+    # Load the RAFT model
			
 
				+
			
 
				+    llm = ChatOpenAI(
			
 
				+        openai_api_key=key,
			
 
				+        openai_api_base=api_url,
			
 
				+        model_name=model_name,
			
 
				+        temperature=0.0,
			
 
				+        max_tokens=1000
			
 
				+        )
			
 
				+
			
 
				+    all_tasks = [api_config['eval_prompt_template'].format(question=question) for question in question_list]
			
 
				+    generated_answers = llm.batch(all_tasks)
			
 
				+    generated_answers = [ item.content for item in generated_answers]
			
 
				+    if len(generated_answers) == 0:
			
 
				+        logging.error("No model answers generated. Please check the input context or model configuration in ",model_name)
			
 
				+        return []
			
 
				+    return clean_text_list(generated_answers)
			
 
				+def format_docs_raft(docs):
			
 
				+    context = ""
			
 
				+    for doc in docs:
			
 
				+        context += "\n<DOCUMENT>" + str(doc.page_content) + "</DOCUMENT>\n"
			
 
				+    return context
			
 
				+def build_retriever(api_config,embedding_model_name,retrieved_docs_num=5):
			
 
				+    # Use langchain to load the documents from data directory
			
 
				+    loader = DirectoryLoader(api_config['data_dir'])
			
 
				+    docs = loader.load()
			
 
				+    # Split the document into chunks with a specified chunk size
			
 
				+    text_splitter = RecursiveCharacterTextSplitter(chunk_size=api_config["chunk_size"],chunk_overlap=int(api_config["chunk_size"] / 10),separators= ["----------","\n\n", "\n", " ", ""],strip_whitespace=True)
			
 
				+    docs_processed = text_splitter.split_documents(docs)
			
 
				+    # Remove duplicates
			
 
				+    unique_texts = {}
			
 
				+    docs_processed_unique = []
			
 
				+    for doc in docs_processed:
			
 
				+        if doc.page_content not in unique_texts:
			
 
				+            unique_texts[doc.page_content] = True
			
 
				+            docs_processed_unique.append(doc)
			
 
				+    logging.info(f"Total number of docs_processed used by vectorstore: {len(docs_processed_unique)}")
			
 
				+    # Store the document into a vector store with a specific embedding model
			
 
				+    embedding_model = HuggingFaceEmbeddings(
			
 
				+        model_name=embedding_model_name,
			
 
				+        model_kwargs={"device": "cuda"},
			
 
				+        encode_kwargs={"normalize_embeddings": True},  # Set `True` for cosine similarity
			
 
				+    )
			
 
				+    vectorstore = FAISS.from_documents(docs_processed_unique, embedding_model, distance_strategy=DistanceStrategy.COSINE)
			
 
				+    retriever = vectorstore.as_retriever(
			
 
				+        search_kwargs={"k": retrieved_docs_num},
			
 
				+    )
			
 
				+    return retriever
			
 
				+def generate_answers_with_RAG(model_name, question_list,api_config,retriever,api_url_overwrite=None):
			
 
				+    api_url = api_config['model_endpoint_url']
			
 
				+    if api_url_overwrite:
			
 
				+        api_url = api_url_overwrite
			
 
				+    key = api_config['api_key']
			
 
				+    # Load the RAFT model
			
 
				+    llm = ChatOpenAI(
			
 
				+        openai_api_key=key,
			
 
				+        openai_api_base=api_url,
			
 
				+        model_name=model_name,
			
 
				+        temperature=0.0,
			
 
				+        max_tokens=1000
			
 
				+        )
			
 
				+    all_tasks = []
			
 
				+    for q in question_list:
			
 
				+        # retrive the top K documents
			
 
				+        retrieved_docs = retriever.invoke(q)        
			
 
				+        # format the documents into a string
			
 
				+        documents = format_docs_raft(retrieved_docs)
			
 
				+        # create a prompt
			
 
				+        text = api_config["RAG_prompt_template"].format(context=documents,question=q)
			
 
				+        all_tasks.append(text)
			
 
				+    generated_answers = llm.batch(all_tasks)
			
 
				+    generated_answers = [ item.content for item in generated_answers]
			
 
				+    if len(generated_answers) == 0:
			
 
				+        logging.error("No RAG answers generated. Please check the input context or model configuration in ",model_name)
			
 
				+        return []
			
 
				+    return clean_text_list(generated_answers)
			
 
				+def compute_rouge_score(generated : list, reference: list):
			
 
				+    rouge_score = evaluate.load('rouge')
			
 
				+    return rouge_score.compute(
			
 
				+        predictions=generated,
			
 
				+        references=reference,
			
 
				+        use_stemmer=True,
			
 
				+        use_aggregator=True
			
 
				+    )
			
 
				+def clean_text_list(text_list):
			
 
				+    result = []
			
 
				+    for text in text_list:
			
 
				+        # for raft model, the answer will started with <ANSWER>
			
 
				+        index = text.rfind("<ANSWER>")
			
 
				+        if index!= -1:
			
 
				+            text = text[index:]
			
 
				+            text = text.replace("</ANSWER>:","")
			
 
				+        text = text.replace("begin_quote","")
			
 
				+        text = text.replace("end_quote","")
			
 
				+        text = text.replace("##","")
			
 
				+        text = text.strip()
			
 
				+        result.append(text)
			
 
				+    return result
			
 
				+
			
 
				+def normalize_answer(s):
			
 
				+
			
 
				+    def remove_articles(text):
			
 
				+        return re.sub(r'\b(a|an|the)\b', ' ', text)
			
 
				+
			
 
				+    def white_space_fix(text):
			
 
				+        return ' '.join(text.split())
			
 
				+
			
 
				+    def remove_punc(text):
			
 
				+        exclude = set(string.punctuation)
			
 
				+        return ''.join(ch for ch in text if ch not in exclude)
			
 
				+
			
 
				+    def lower(text):
			
 
				+        return text.lower()
			
 
				+
			
 
				+    return white_space_fix(remove_articles(remove_punc(lower(s))))
			
 
				+def exact_match_score(prediction, ground_truth):
			
 
				+    """Computes EM score for a single prediction and ground truth answer."""
			
 
				+    num_match = 0
			
 
				+    assert len(prediction) == len(ground_truth), "Answer length does not match prediction length."
			
 
				+    assert(len(ground_truth) > 0)
			
 
				+    for idx, (pred,gold) in enumerate(zip(prediction, ground_truth)):
			
 
				+        if (normalize_answer(pred) == normalize_answer(gold)):
			
 
				+            num_match += 1
			
 
				+    return num_match/len(ground_truth)
			
 
				+def compute_judge_score(questions: list, generated : list, reference: list, api_config,api_url="http://localhost:8001/v1",key="EMPTY"):
			
 
				+    correct_num = 0
			
 
				+    model_name = "meta-llama/Meta-Llama-3-70B-Instruct"
			
 
				+    llm = ChatOpenAI(
			
 
				+        openai_api_key=key,
			
 
				+        openai_api_base=api_url,
			
 
				+        model_name=model_name,
			
 
				+        max_tokens=1000,
			
 
				+        temperature=0.0)
			
 
				+    all_tasks = []
			
 
				+    for question,prediction,gold in zip(questions, generated,reference):
			
 
				+        message = api_config['judge_prompt_template'].format(question=question,prediction=prediction,gold=gold)
			
 
				+        all_tasks.append(message)
			
 
				+    judge_responses = llm.batch(all_tasks)
			
 
				+    judge_responses = ["YES" in item.content for item in judge_responses]
			
 
				+    correct_num = sum(judge_responses)
			
 
				+    return correct_num/len(questions),judge_responses
			
 
				+def score_single(api_config,generated,reference,questions, run_exact_match=True,run_rouge=True, run_llm_as_judge=True):
			
 
				+    # set metric to default -1, means no metric is computed
			
 
				+    metric = {
			
 
				+        "Rouge_score": -1,
			
 
				+        "LLM_judge_score": -1,
			
 
				+        "Exact_match": -1
			
 
				+    }
			
 
				+    if run_rouge:
			
 
				+        rouge_score = compute_rouge_score(generated,reference)
			
 
				+        metric["Rouge_score"] = rouge_score
			
 
				+        print("Rouge_score:",rouge_score)
			
 
				+    if api_config["judge_endpoint_url"] and run_llm_as_judge:
			
 
				+        api_url = api_config["judge_endpoint_url"]
			
 
				+        LLM_judge_score,judge_responses = compute_judge_score(questions, generated, reference, api_config,api_url=api_url)
			
 
				+        metric["LLM_judge_score"] = LLM_judge_score
			
 
				+        metric["LLM_judge_responses"] = judge_responses
			
 
				+        print(f"LLM_judge_score: {LLM_judge_score}")
			
 
				+    if run_exact_match:
			
 
				+        exact_match = exact_match_score(generated,reference)
			
 
				+        print(f"Exact_match_percentage: {exact_match:.4f}")
			
 
				+        metric["Exact_match"] = exact_match
			
 
				+    return metric
			
 
				+def main(api_config):
			
 
				+    # Since the eval set is small, we can run the eval without async functions
			
 
				+    try:
			
 
				+        api_url = api_config["model_endpoint_url"]
			
 
				+        logging.info("Starting to generate answer given the eval set.")
			
 
				+        questions,groud_truth = [],[]
			
 
				+        if api_config["eval_file"].endswith(".parquet"):
			
 
				+            eval_file = pd.read_parquet(api_config["eval_file"],filters=[('source', '=', 'pt_discuss_forum')])
			
 
				+            for index, item in eval_file.iterrows():
			
 
				+                questions.append(item["question"]+"\nDetails:\n"+item["context"])
			
 
				+                groud_truth.append(item["answer"])
			
 
				+        else:
			
 
				+            with open(api_config["eval_file"]) as fp:
			
 
				+                eval_file = json.load(fp)
			
 
				+                for index, item in enumerate(eval_file):
			
 
				+                    questions.append(item["question"])
			
 
				+                    groud_truth.append(item["answer"])
			
 
				+        generated_answers = {}            
			
 
				+        # build retriver
			
 
				+        retriever = build_retriever(api_config,"sentence-transformers/multi-qa-mpnet-base-cos-v1",api_config["rag_topk"])
			
 
				+        # Generate answers for 8B models
			
 
				+        model_name = api_config["model_name"]
			
 
				+        generated_answers[model_name] = generate_answers_model_only(model_name,questions,api_url)
			
 
				+        generated_answers[model_name+"_RAG"] = generate_answers_with_RAG(model_name, questions,api_config,retriever)
			
 
				+        print("Finished generating answers for ", model_name)
			
 
				+        large_model_name = "meta-llama/Meta-Llama-3-70B-Instruct"
			
 
				+        large_api_url = api_config["judge_endpoint_url"]
			
 
				+        generated_answers["70B_Base"] = generate_answers_model_only(large_model_name,questions,large_api_url)
			
 
				+        generated_answers["70B_RAG"] = generate_answers_with_RAG(large_model_name, questions,api_config,retriever,large_api_url)
			
 
				+        print("Finished generating answers for ", large_model_name)
			
 
				+        logging.info(f"Successfully generated {len(generated_answers[model_name+'_RAG'])} answers for all models.")
			
 
				+        # for generate answer from each model, compute the score metric
			
 
				+        all_metrics = []
			
 
				+        output_file = api_config["output_log"]+str(datetime.now().strftime("%Y%m%d_%H%M%S"))
			
 
				+
			
 
				+        for model_name,model_answer in generated_answers.items():
			
 
				+            if len(model_answer) != len(groud_truth):
			
 
				+                print(f"The length of {model_name} answer is not equal to the length of ground truth.")
			
 
				+                continue
			
 
				+            metric = score_single(api_config,model_answer,groud_truth,questions)
			
 
				+            print(f"The eval result for {model_name} is: {metric}")
			
 
				+            with open(output_file,"a") as fp:
			
 
				+                fp.write(f"Eval_result for {model_name} \n")
			
 
				+                fp.write(f"Rouge_score: {metric['Rouge_score']} \n")
			
 
				+                fp.write(f"Exact_match_percentage: {metric['Exact_match']} \n")
			
 
				+                judge_responses = ["None"] * len(questions)
			
 
				+                if api_config["judge_endpoint_url"]:
			
 
				+                    fp.write(f"LLM_judge_score: {metric['LLM_judge_score']} \n")
			
 
				+                    judge_responses = metric["LLM_judge_responses"]
			
 
				+                    all_metrics.append((model_name,metric['LLM_judge_score'],metric["LLM_judge_responses"]))
			
 
				+                fp.write(f"QA details: \n")
			
 
				+                for item in zip(questions,model_answer,groud_truth,judge_responses):
			
 
				+                    fp.write(f"question: {item[0]} \n")
			
 
				+                    fp.write(f"generated_answers: {item[1]} \n")
			
 
				+                    fp.write(f"groud_truth: {item[2]} \n")
			
 
				+                    fp.write(f"LLM_judge_response: {item[3]} \n")
			
 
				+                    fp.write("\n")
			
 
				+                fp.write("\n------------------------------------\n")
			
 
				+        # Now we want to take a closer look at the questions that are not answered the same by all the models.
			
 
				+        judge_zip = list(zip(*[item[-1] for item in all_metrics]))
			
 
				+        model_names = [item[0] for item in all_metrics]
			
 
				+        with open(output_file,"a") as fp:
			
 
				+            for item in all_metrics:
			
 
				+                fp.write(f"Model_Name: {item[0]}, LLM_SCORE: {item[1]} \n")
			
 
				+            for idx,item in enumerate(judge_zip):
			
 
				+                # if all the responses are "YES", then we skip this question
			
 
				+                if sum(item) == len(item):
			
 
				+                    continue 
			
 
				+                else:
			
 
				+                    fp.write(f"Comparing interested question: {questions[idx]} \n")
			
 
				+                    fp.write(f"groud_truth: {groud_truth[idx]} \n")
			
 
				+                    for i in range(len(model_names)):
			
 
				+                        fp.write(f"{item[i]} {model_names[i]}_answers: {generated_answers[model_names[i]][idx]} \n")
			
 
				+                    fp.write("------------------------\n")
			
 
				+            fp.write(json.dumps(all_metrics))
			
 
				+        print("Finished evaluating the model.")
			
 
				+
			
 
				+
			
 
				+        logging.info(f"Eval successfully, the eval result is saved to {api_config['output_log']}.")
			
 
				+        # Saving the eval result to a log file
			
 
				+    except Exception as e:
			
 
				+        logging.error(f"An unexpected error occurred during the process: {e}",exc_info=True)
			
 
				+
			
 
				+def parse_arguments():
			
 
				+    # Define command line arguments for the script
			
 
				+    parser = argparse.ArgumentParser(
			
 
				+        description="Generate question/answer pairs from documentation."
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        "-m", "--model_name",
			
 
				+        default=None,
			
 
				+        help="Provide the model_name to use for evaluation. If not specified, the model_path in eval_config.yaml will be used."
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        "-c", "--config_path",
			
 
				+        default="raft_eval_config.yaml",
			
 
				+        help="Set the configuration file path that has system prompt along with language, evalset path."
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        "-d", "--data_dir",
			
 
				+        default=None,
			
 
				+        help="Provide the data folder path to build RAG for evaluation. If not specified, the data_dir in eval_config.yaml will be used."
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        "-u", "--model_endpoint_url",
			
 
				+        default="http://localhost:8000/v1",
			
 
				+        type=str,
			
 
				+        help="The raft model endpoint url for eval."
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        "-j", "--judge_endpoint_url",
			
 
				+        default=None,
			
 
				+        type=str,
			
 
				+        help="The large model endpoint url for judge as LLM."
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        "-o", "--output_log",
			
 
				+        default="./eval_result",
			
 
				+        help="save the eval result to a log file. Default is eval_result[timestamp].log"
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        "-k", "--api_key",
			
 
				+        default="EMPTY",
			
 
				+        type=str,
			
 
				+        help="LLM API key for generating question/answer pairs."
			
 
				+    )
			
 
				+    parser.add_argument(
			
 
				+        "-r", "--rag_topk",
			
 
				+        default=5,
			
 
				+        type=int,
			
 
				+        help="set the number of top k documents the RAG needs to retrive."
			
 
				+    )
			
 
				+    parser.add_argument("--chunk_size", type=int, default=1000, help="The character size of each chunk used in RAG")
			
 
				+    return parser.parse_args()
			
 
				+
			
 
				+if __name__ == "__main__":
			
 
				+    logging.info("Initializing the process and loading configuration...")
			
 
				+    args = parse_arguments()
			
 
				+    api_config = load_config(args.config_path)
			
 
				+    api_config["model_endpoint_url"] = args.model_endpoint_url
			
 
				+    if args.data_dir:
			
 
				+        api_config["data_dir"] = args.data_dir
			
 
				+    if args.model_name:
			
 
				+        api_config["model_name"] = args.model_name
			
 
				+    api_config["judge_endpoint_url"] = args.judge_endpoint_url
			
 
				+    api_config["output_log"] = args.output_log
			
 
				+    api_config["api_key"] = args.api_key
			
 
				+    api_config["chunk_size"] = args.chunk_size
			
 
				+    api_config["rag_topk"] = args.rag_topk
			
 
				+    if api_config["judge_endpoint_url"]:
			
 
				+        logging.info(f"The judge model url is: '{args.judge_endpoint_url}'.")
			
 
				+    main(api_config)
			
--- a/recipes/use_cases/end2end-recipes/RAFT-Chatbot/raft_eval_config.yaml
+++ b/recipes/use_cases/end2end-recipes/RAFT-Chatbot/raft_eval_config.yaml
@@ -0,0 +1,37 @@
 
				+eval_prompt_template: >
			
 
				+  <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a AI assistant that skilled in answering questions related to Llama language models,
			
 
				+  which includes LLama, Llama2, Meta Llama3, Code Llama, Meta Llama Guard 1,	Meta Llama Guard 2,
			
 
				+  Below is a question from a llama user, please the answer it with best of your knowledge,
			
 
				+  The returned answer should be no more than 60 words. Please return the answers in text directly without any special tokens.<|eot_id|>
			
 
				+  <|start_header_id|>user<|end_header_id|>
			
 
				+  Question:{question} \n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
			
 
				+judge_prompt_template: >
			
 
				+    <|begin_of_text|><|start_header_id|>system<|end_header_id|>You have been provided with a question, a teacher's answer and a student's answer below.
			
 
				+    Given that question, you need to score the how good the student answer is compare to
			
 
				+    the teacher's answer. If the student's answer is correct based on the teacher's answer, then return YES, else return NO.
			
 
				+    Here are the grade criterias to follow:
			
 
				+    1. Review it carefully to make sure that the keywords and numerical vaules are exactly the same.
			
 
				+    2. Ensure that the student answer does not contain any conflicting statements.
			
 
				+    3. It is OK if the student answer contains more information than the ground truth answer, as long as it is factually accurate relative to the  ground truth answer.
			
 
				+    YES means that the student's answer meets all of the criteria.
			
 
				+    NO means that the student's answer does not meet all of the criteria. This is the lowest possible score you can give.
			
 
				+    Only respond with "YES" or "NO", do not respond with anything else.<|eot_id|>
			
 
				+    <|start_header_id|>user<|end_header_id|>
			
 
				+    Question: {question} \n Teacher's Answer: {gold} \n Student's Answer: {prediction} <|eot_id|><|start_header_id|>assistant<|end_header_id|>
			
 
				+RAG_prompt_template: >
			
 
				+  <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful chatbot who can provide an answer to every questions from the user given a relevant context.<|eot_id|>
			
 
				+  <|start_header_id|>user<|end_header_id|>
			
 
				+  Question: {question}\nContext: {context}\n
			
 
				+  Answer this question using the information given by multiple documents in the context above. Here are the things to pay attention to:
			
 
				+  - The context contains many documents, each document starts with <DOCUMENT> and ends </DOCUMENT>.
			
 
				+  - First provide step-by-step reasoning on how to answer the question.
			
 
				+  - In the reasoning, if you need to copy paste some sentences from the context, include them in ##begin_quote## and ##end_quote##. This would mean that things outside of ##begin_quote## and ##end_quote## are not directly copy paste from the context.
			
 
				+  - End your response with final answer in the form <ANSWER>: $answer, the answer should less than 60 words.
			
 
				+  You MUST begin your final answer with the tag "<ANSWER>:". <|eot_id|><|start_header_id|>assistant<|end_header_id|>
			
 
				+eval_file: "./eval_llama.json"
			
 
				+
			
 
				+model_name: "raft-8b"
			
 
				+
			
 
				+data_dir: "./data"
			
 
				+
			
 
				+rag_topk: 5
			
--- a/recipes/use_cases/end2end-recipes/RAFT-Chatbot/raft_utils.py
+++ b/recipes/use_cases/end2end-recipes/RAFT-Chatbot/raft_utils.py
@@ -0,0 +1,245 @@
 
				+# Copyright (c) Meta Platforms, Inc. and affiliates.
			
 
				+# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
			
 
				+
			
 
				+import os
			
 
				+import logging
			
 
				+from langchain.text_splitter import RecursiveCharacterTextSplitter
			
 
				+from datasets import Dataset
			
 
				+import random
			
 
				+from langchain_community.document_loaders import SitemapLoader,DirectoryLoader
			
 
				+from bs4 import BeautifulSoup
			
 
				+from langchain_openai import ChatOpenAI
			
 
				+import copy
			
 
				+
			
 
				+
			
 
				+# Initialize logging
			
 
				+logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
			
 
				+def strip_str(s: str) -> str:
			
 
				+    """
			
 
				+    Helper function for helping format strings returned by GPT-4.
			
 
				+    """
			
 
				+    l, r = 0, len(s)-1
			
 
				+    beg_found = False
			
 
				+    for i in range(len(s)):
			
 
				+        if s[i].isalpha():
			
 
				+            if not beg_found:
			
 
				+                l = i
			
 
				+                beg_found = True
			
 
				+            else:
			
 
				+                r = i
			
 
				+    r += 2
			
 
				+    return s[l:min(r, len(s))]
			
 
				+def clean_documents(raw_text):
			
 
				+    all_lines = []
			
 
				+    for line in raw_text.split("\n"):
			
 
				+        line = line.strip()
			
 
				+        if len(line.split()) == 0:
			
 
				+            continue
			
 
				+        else:
			
 
				+            all_lines.append(line)
			
 
				+    result = " ".join(all_lines)
			
 
				+    return result
			
 
				+def clean_text(content: BeautifulSoup) -> str:
			
 
				+    # Find all 'nav' and 'header' elements in the BeautifulSoup object
			
 
				+    nav_elements = content.find_all("nav")
			
 
				+    header_elements = content.find_all("header")
			
 
				+    mydivs = content.find_all("div", {"role": "list"})
			
 
				+    # Remove each 'nav' and 'header' element from the BeautifulSoup object
			
 
				+    for element in nav_elements + header_elements+mydivs:
			
 
				+        element.decompose()
			
 
				+    raw_text = content.get_text("\n")
			
 
				+    return clean_documents(raw_text)
			
 
				+# Read
			
 
				+def read_file_content(xml_path: str, data_folder: str) -> str:
			
 
				+    if xml_path and data_folder:
			
 
				+        logging.info(f"Error: both xml_path and data_folder are provided, will only read from xml for now")
			
 
				+    if not xml_path and not data_folder:
			
 
				+        logging.info(f"Error: both xml_path and data_folder are not provided")
			
 
				+        return ""
			
 
				+    if xml_path:
			
 
				+        if not os.path.exists(xml_path):
			
 
				+            logging.info(f"Error: {xml_path} does not exist")
			
 
				+            return ""
			
 
				+        # Use langchain to load the documents from webpage links in the xml file
			
 
				+        sitemap_loader = SitemapLoader(web_path=xml_path,is_local=True,parsing_function=clean_text)
			
 
				+        sitemap_loader.requests_kwargs = {"verify": False}
			
 
				+        docs = sitemap_loader.load()
			
 
				+        return docs
			
 
				+    elif len(data_folder) != 0:
			
 
				+        if not os.path.exists(data_folder):
			
 
				+            logging.info(f"Error: {data_folder} does not exist")
			
 
				+            return ""
			
 
				+        # Use langchain to load the documents from data folder
			
 
				+        loader = DirectoryLoader(data_folder)
			
 
				+        docs = loader.load()
			
 
				+        return docs
			
 
				+
			
 
				+
			
 
				+
			
 
				+def get_chunks(
			
 
				+    docs: list,
			
 
				+    chunk_size: int = 1000,
			
 
				+    api_config: dict = None,
			
 
				+) -> list[str]:
			
 
				+    """
			
 
				+    Takes in a list of documents, breaks them down into chunks of size
			
 
				+    `chunk_size`, and returns the chunks.
			
 
				+    """
			
 
				+    chunks = []
			
 
				+    if  len(docs) == 0:
			
 
				+        raise TypeError("Can not get chunks from empty text")
			
 
				+    else:
			
 
				+        text_splitter = RecursiveCharacterTextSplitter(chunk_size=api_config["chunk_size"],chunk_overlap=int(api_config["chunk_size"] / 10),separators= ["----------","\n\n", "\n", " "],strip_whitespace=True)
			
 
				+        docs_processed = text_splitter.split_documents(docs)
			
 
				+        logging.info(f"Total number of docs_processed: {len(docs_processed)}")
			
 
				+        # Remove duplicates
			
 
				+        unique_texts = {}
			
 
				+        docs_processed_unique = []
			
 
				+        for doc in docs_processed:
			
 
				+            if doc.page_content not in unique_texts and len(doc.page_content) > 100 :
			
 
				+                unique_texts[doc.page_content] = True
			
 
				+                docs_processed_unique.append(doc)        
			
 
				+        chunks = [chunk.page_content for chunk in docs_processed_unique]
			
 
				+        logging.info(f"Total number of docs_processed_unique: {len(docs_processed_unique)}")
			
 
				+    return chunks
			
 
				+# read all the files in the data folder, then split them into chunks
			
 
				+# generate questions for each chunk and return zip of chunk and related questions list
			
 
				+def generate_questions(api_config):
			
 
				+    # get documents from the data folder or xml file
			
 
				+    api_url = api_config["endpoint_url"]
			
 
				+    key = api_config["api_key"]
			
 
				+    documents = read_file_content(api_config["xml_path"],api_config["data_dir"])
			
 
				+    if len(documents) == 0:
			
 
				+        logging.info(f"Error reading files, document_text is {len(documents)}")
			
 
				+    document_batches = get_chunks(documents,api_config["chunk_size"],api_config)
			
 
				+    # use OpenAI API protocol to hanlde the chat request, including local VLLM openai compatible server
			
 
				+    llm = ChatOpenAI(
			
 
				+        openai_api_key=key,
			
 
				+        openai_api_base=api_url,
			
 
				+        model_name=api_config["model"],
			
 
				+        temperature=0.0,
			
 
				+        max_tokens=500
			
 
				+        )
			
 
				+    all_tasks = [api_config['question_prompt_template'].format(num_questions=str(api_config['questions_per_chunk']),context=document) for document in document_batches]
			
 
				+    generated_answers = llm.batch(all_tasks)
			
 
				+    generated_answers = [ item.content for item in generated_answers]
			
 
				+    if len(generated_answers) == 0:
			
 
				+        logging.error("No model answers generated. Please check the input context or model configuration in ",api_config["model"])
			
 
				+        return []
			
 
				+    final_result = []
			
 
				+    for result in generated_answers:
			
 
				+        queries = result.split('\n')
			
 
				+        queries = [strip_str(q) for q in queries]
			
 
				+        queries = [q for q in queries if any(c.isalpha() for c in q)]
			
 
				+        if len(queries) > int(api_config['questions_per_chunk']):
			
 
				+            # As the model may have unrelated question at the begining of the result
			
 
				+            # if queries is more than questions_per_chunk, then we need to truncate it and only keep last questions_per_chunk lines
			
 
				+            queries = queries[-int(api_config['questions_per_chunk']):]
			
 
				+        final_result.append(queries)
			
 
				+    return list(zip(document_batches,final_result))
			
 
				+
			
 
				+# Generate COT answer for each question given the chunk context
			
 
				+def generate_COT(chunk_questions_zip,api_config) -> dict:
			
 
				+    all_tasks = []
			
 
				+    chunk_questions = []
			
 
				+    question_asked = set()
			
 
				+    for document_content,questions in chunk_questions_zip:
			
 
				+        for question in questions:
			
 
				+            question = question.strip()
			
 
				+            # avoid asking the same question twice
			
 
				+            if question not in question_asked:
			
 
				+                question_asked.add(question)
			
 
				+                prompt = api_config['COT_prompt_template'].format(question=question,context=str(document_content))
			
 
				+                all_tasks.append(prompt)
			
 
				+                chunk_questions.append((document_content,question))
			
 
				+    # use OpenAI API protocol to hanlde the chat request, including local VLLM openai compatible server
			
 
				+    llm = ChatOpenAI(
			
 
				+        openai_api_key=api_config["api_key"],
			
 
				+        openai_api_base=api_config["endpoint_url"],
			
 
				+        model_name=api_config["model"],
			
 
				+        temperature=0.0,
			
 
				+        max_tokens=500
			
 
				+        )
			
 
				+    generated_answers = llm.batch(all_tasks)
			
 
				+    generated_answers = [ item.content for item in generated_answers]
			
 
				+    COT_results = []
			
 
				+    # return a list of (chunk, question, generated_answer)
			
 
				+    for (chunk, question),generated_answer in zip(chunk_questions,generated_answers):
			
 
				+        COT_results.append((chunk,question,generated_answer))
			
 
				+    return COT_results
			
 
				+
			
 
				+def add_chunk_to_dataset(
			
 
				+    chunk_questions_zip: list,
			
 
				+    api_config: dict,
			
 
				+) -> None:
			
 
				+    """
			
 
				+    Given a chunk and related questions lists, create {Q, A, D} triplets and add them to the dataset.
			
 
				+    """
			
 
				+    num_distract = api_config["num_distract_docs"]
			
 
				+    p = api_config["refusal_probability"]
			
 
				+    chunks = [chunk for chunk, _ in chunk_questions_zip]
			
 
				+    COT_results = generate_COT(chunk_questions_zip,api_config)
			
 
				+    logging.info(f"COT generation completed, total num of COT results: {len(COT_results)}")
			
 
				+    completed,refusal= 0,0
			
 
				+    data_list = []
			
 
				+    for chunk, q , cot in COT_results:
			
 
				+        # The COT answer will be used as the label in the fine-tuning stage
			
 
				+
			
 
				+        datapt = {
			
 
				+            "id": None,
			
 
				+            "type": "general",
			
 
				+            "question": q,
			
 
				+            "context": None,
			
 
				+            "oracle_context": None,
			
 
				+            "cot_answer": cot
			
 
				+        }
			
 
				+        i = chunks.index(chunk)
			
 
				+        datapt["id"] = f"seed_task_{len(data_list)}"
			
 
				+        # add num_distract distractor docs
			
 
				+        docs = [chunk]
			
 
				+        indices = list(range(0, len(chunks)))
			
 
				+        indices.remove(i)
			
 
				+        for j in random.sample(indices, num_distract):
			
 
				+            docs.append(chunks[j])
			
 
				+        doc_copy = docs.copy()
			
 
				+        random.shuffle(docs)
			
 
				+        d = {
			
 
				+            "title": [],
			
 
				+            "sentences": []
			
 
				+        }
			
 
				+
			
 
				+        d["title"].append(["placeholder_title"]*(num_distract+1))
			
 
				+        d["sentences"].append(docs)
			
 
				+        datapt["context"] = d
			
 
				+        datapt["oracle_context"] = chunk
			
 
				+
			
 
				+        # construct model instruction
			
 
				+        context = ""
			
 
				+        for doc in docs:
			
 
				+            context += "<DOCUMENT>" + str(doc) + "</DOCUMENT>\n"
			
 
				+        context += q
			
 
				+        # This instruction will be used in the fine-tuning stage
			
 
				+        datapt["instruction"] = context
			
 
				+        datapt_copy = copy.deepcopy(datapt)
			
 
				+        # add to dataset
			
 
				+        data_list.append(datapt)
			
 
				+        # decides whether to add refusal example where the related documents are not provided
			
 
				+        refusal = random.uniform(0, 1) <= p
			
 
				+        if refusal:
			
 
				+            doc_copy[0] = chunks[random.sample(indices, 1)[0]]
			
 
				+            random.shuffle(doc_copy)
			
 
				+            refusl_context = ""
			
 
				+            for doc in doc_copy:
			
 
				+                refusl_context += "<DOCUMENT>" + str(doc) + "</DOCUMENT>\n"
			
 
				+            refusl_context += q
			
 
				+            # This instruction will be used in the fine-tuning stage
			
 
				+            datapt_copy["id"] = f"refusal_task_{len(data_list)}"
			
 
				+            datapt_copy["instruction"] = refusl_context
			
 
				+            datapt_copy["cot_answer"] = "Sorry, I don't know the answer to this question because related documents are not found. Please try again."
			
 
				+            data_list.append(datapt_copy)
			
 
				+            refusal += 1
			
 
				+        completed += 1
			
 
				+        if completed % 100 == 0:
			
 
				+            logging.info(f"refusal example added: {refusal}, total examples added: {completed}, total examples to be added: {len(COT_results)- completed}")
			
 
				+    ds = Dataset.from_list(data_list)
			
 
				+    return ds
			
--- a/recipes/use_cases/multilingual/README.md
+++ b/recipes/use_cases/multilingual/README.md
@@ -1,7 +1,8 @@
 
				 # Extending Llama to a new language
			
 
				 Authored by : Sarvam team
			
 
				 In this recipe, we will see how to add a new language to the Llama family of models. The steps are quite general and can be easily adapted to other models as well. Using this recipe, you should be able to replicate the findings of [OpenHathi](https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base).
			
 
				-Please read more about OpenHathi [here](https://www.sarvam.ai/blog/announcing-openhathi-series)
			
 
				+Please read more about OpenHathi [here](https://web.archive.org/web/20240418103408/https://www.sarvam.ai/blog/announcing-openhathi-series)
			
 
				+
			
 
				 ## Data
			
 
				 The original OpenHathi model uses a combination of [Sangraha](https://huggingface.co/datasets/ai4bharat/sangraha) and Wikipedia as its primary data sources. If the reader is interested in using these sources, they would also have to preprocess the data: clean, filter, and deduplicate. See [Setu](https://github.com/AI4Bharat/setu) for an easy way to do this at scale.
			
 
				 
			
--- a/requirements.txt
+++ b/requirements.txt
@@ -19,4 +19,13 @@ chardet
 
				 openai
			
 
				 typing-extensions==4.8.0
			
 
				 tabulate
			
 
				+evaluate
			
 
				+rouge_score
			
 
				+pyyaml==6.0.1
			
 
				+faiss-gpu
			
 
				+unstructured[pdf]
			
 
				+langchain_openai
			
 
				+langchain
			
 
				+langchain_community
			
 
				+sentence_transformers
			
 
				 codeshield
			
--- a/src/llama_recipes/configs/datasets.py
+++ b/src/llama_recipes/configs/datasets.py
@@ -3,28 +3,28 @@
 
				 
			
 
				 from dataclasses import dataclass
			
 
				 
			
 
				-    
			
 
				+
			
 
				 @dataclass
			
 
				 class samsum_dataset:
			
 
				     dataset: str =  "samsum_dataset"
			
 
				     train_split: str = "train"
			
 
				     test_split: str = "validation"
			
 
				-    
			
 
				-    
			
 
				+    trust_remote_code: bool = False
			
 
				+
			
 
				+
			
 
				 @dataclass
			
 
				 class grammar_dataset:
			
 
				     dataset: str = "grammar_dataset"
			
 
				-    train_split: str = "src/llama_recipes/datasets/grammar_dataset/gtrain_10k.csv" 
			
 
				+    train_split: str = "src/llama_recipes/datasets/grammar_dataset/gtrain_10k.csv"
			
 
				     test_split: str = "src/llama_recipes/datasets/grammar_dataset/grammar_validation.csv"
			
 
				 
			
 
				-    
			
 
				+
			
 
				 @dataclass
			
 
				 class alpaca_dataset:
			
 
				     dataset: str = "alpaca_dataset"
			
 
				     train_split: str = "train"
			
 
				     test_split: str = "val"
			
 
				     data_path: str = "src/llama_recipes/datasets/alpaca_data.json"
			
 
				-    
			
 
				 
			
 
				 @dataclass
			
 
				 class custom_dataset:
			
@@ -32,9 +32,10 @@ class custom_dataset:
 
				     file: str = "recipes/quickstart/finetuning/datasets/custom_dataset.py"
			
 
				     train_split: str = "train"
			
 
				     test_split: str = "validation"
			
 
				+    data_path: str = ""
			
 
				     
			
 
				 @dataclass
			
 
				 class llamaguard_toxicchat_dataset:
			
 
				     dataset: str = "llamaguard_toxicchat_dataset"
			
 
				     train_split: str = "train"
			
 
				-    test_split: str = "test"
			
 
				+    test_split: str = "test"
			
--- a/src/llama_recipes/configs/training.py
+++ b/src/llama_recipes/configs/training.py
@@ -8,8 +8,8 @@ from dataclasses import dataclass
 
				 class train_config:
			
 
				     model_name: str="PATH/to/Model"
			
 
				     tokenizer_name: str=None
			
 
				-    enable_fsdp: bool=False
			
 
				-    low_cpu_fsdp: bool=False
			
 
				+    enable_fsdp: bool=False # shards model parameters, optimizer states and gradients across DDP ranks
			
 
				+    low_cpu_fsdp: bool=False # saves cpu memory by loading pretrained model on rank0 only
			
 
				     run_validation: bool=True
			
 
				     batch_size_training: int=4
			
 
				     batching_strategy: str="packing" #alternative: padding
			
@@ -23,14 +23,14 @@ class train_config:
 
				     num_workers_dataloader: int=1
			
 
				     lr: float=1e-4
			
 
				     weight_decay: float=0.0
			
 
				-    gamma: float= 0.85
			
 
				+    gamma: float= 0.85 # multiplicatively decay the learning rate by gamma after each epoch
			
 
				     seed: int=42
			
 
				     use_fp16: bool=False
			
 
				     mixed_precision: bool=True
			
 
				     val_batch_size: int=1
			
 
				     dataset = "samsum_dataset"
			
 
				     peft_method: str = "lora" # None, llama_adapter (Caution: llama_adapter is currently not supported with FSDP)
			
 
				-    use_peft: bool=False
			
 
				+    use_peft: bool=False # use parameter efficient fine tuning
			
 
				     from_peft_checkpoint: str="" # if not empty and use_peft=True, will load the peft checkpoint and resume the fine-tuning on that checkpoint
			
 
				     output_dir: str = "PATH/to/save/PEFT/model"
			
 
				     freeze_layers: bool = False
			
--- a/src/llama_recipes/datasets/samsum_dataset.py
+++ b/src/llama_recipes/datasets/samsum_dataset.py
@@ -8,7 +8,9 @@ import datasets
 
				 
			
 
				 
			
 
				 def get_preprocessed_samsum(dataset_config, tokenizer, split):
			
 
				-    dataset = datasets.load_dataset("samsum", split=split)
			
 
				+    if not hasattr(dataset_config, "trust_remote_code") or not dataset_config.trust_remote_code:
			
 
				+        raise ValueError("The repository for samsum contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/samsum. To activate `trust_remote_code` option use this config: --samsum_dataset.trust_remote_code=True")
			
 
				+    dataset = datasets.load_dataset("samsum", split=split, trust_remote_code=dataset_config.trust_remote_code)
			
 
				 
			
 
				     prompt = (
			
 
				         f"Summarize this dialog:\n{{dialog}}\n---\nSummary:\n"
			
--- a/src/llama_recipes/model_checkpointing/__init__.py
+++ b/src/llama_recipes/model_checkpointing/__init__.py
@@ -4,6 +4,7 @@
 
				 from llama_recipes.model_checkpointing.checkpoint_handler import (
			
 
				     load_model_checkpoint,
			
 
				     save_model_checkpoint,
			
 
				+    save_peft_checkpoint,
			
 
				     load_optimizer_checkpoint,
			
 
				     save_optimizer_checkpoint,
			
 
				     save_model_and_optimizer_sharded,
			
--- a/src/llama_recipes/model_checkpointing/checkpoint_handler.py
+++ b/src/llama_recipes/model_checkpointing/checkpoint_handler.py
@@ -26,6 +26,7 @@ from torch.distributed.checkpoint.default_planner import (
 
				 )
			
 
				 
			
 
				 
			
 
				+from torch.distributed.checkpoint.state_dict import get_model_state_dict, StateDictOptions
			
 
				 from torch.distributed.fsdp.fully_sharded_data_parallel import StateDictType
			
 
				 import torch.distributed._shard.checkpoint as dist_cp
			
 
				 import torch.distributed as dist
			
@@ -264,4 +265,12 @@ def load_sharded_model_single_gpu(model,model_path):
 
				     model.load_state_dict(state_dict["model"])
			
 
				     
			
 
				     print(f"Sharded state checkpoint loaded from {model_path}")
			
 
				-    return model
			
 
				+    return model
			
 
				+
			
 
				+def save_peft_checkpoint(model, model_path):
			
 
				+    """save_pretrained peft model"""
			
 
				+
			
 
				+    options = StateDictOptions(full_state_dict=True, cpu_offload=True)
			
 
				+
			
 
				+    state_dict = get_model_state_dict(model, options=options)
			
 
				+    model.save_pretrained(model_path, state_dict=state_dict)
			
--- a/src/llama_recipes/utils/train_utils.py
+++ b/src/llama_recipes/utils/train_utils.py
@@ -20,7 +20,7 @@ from transformers import LlamaTokenizer
 
				 import json
			
 
				 
			
 
				 
			
 
				-from llama_recipes.model_checkpointing import save_model_checkpoint, save_model_and_optimizer_sharded, save_optimizer_checkpoint
			
 
				+from llama_recipes.model_checkpointing import save_model_checkpoint, save_model_and_optimizer_sharded, save_optimizer_checkpoint, save_peft_checkpoint
			
 
				 from llama_recipes.policies import fpSixteen,bfSixteen, get_llama_wrapper
			
 
				 from llama_recipes.utils.memory_utils import MemoryTrace
			
 
				 from accelerate.utils import is_xpu_available, is_ccl_available
			
@@ -235,7 +235,7 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
 
				                             print(f"we are about to save the PEFT modules")
			
 
				                     else:
			
 
				                         print(f"we are about to save the PEFT modules")
			
 
				-                    model.save_pretrained(train_config.output_dir)
			
 
				+                    save_peft_checkpoint(model, train_config.output_dir)
			
 
				                     if train_config.enable_fsdp:
			
 
				                         if rank==0:
			
 
				                             print(f"PEFT modules are saved in {train_config.output_dir} directory")
			
--- a/tools/benchmarks/llm_eval_harness/README.md
+++ b/tools/benchmarks/llm_eval_harness/README.md
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/eval_config.yaml
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/eval_config.yaml
@@ -0,0 +1,32 @@
 
				+model_name: "meta-llama/Meta-Llama-3.1-8B-Instruct" # The name of the model to evaluate. This must be a valid Meta Llama 3 based model name in the HuggingFace model hub."
			
 
				+
			
 
				+evals_dataset: "meta-llama/Meta-Llama-3.1-8B-Instruct-evals" # The name of the 3.1 evals dataset to evaluate, please make sure this eval dataset corresponds to the model loaded. This must be a valid Meta Llama 3.1 evals dataset name in the Llama 3.1 Evals collection.
			
 
				+# Must be one of the following ["meta-llama/Meta-Llama-3.1-8B-Instruct-evals","meta-llama/Meta-Llama-3.1-70B-Instruct-evals","meta-llama/Meta-Llama-3.1-405B-Instruct-evals","meta-llama/Meta-Llama-3.1-8B-evals","meta-llama/Meta-Llama-3.1-70B-evals","meta-llama/Meta-Llama-3.1-405B-evals"]
			
 
				+
			
 
				+tasks: "meta_instruct" # Available tasks for instruct model: "meta_math_hard", "meta_gpqa", "meta_mmlu_pro_instruct", "meta_ifeval"; or just use "meta_instruct" to run all of them.
			
 
				+# Available tasks for pretrain model: "meta_bbh", "meta_mmlu_pro_pretrain"; or just use "meta_pretrain" to run all of them.
			
 
				+
			
 
				+tensor_parallel_size: 1 # The VLLM argument that speicify the tensor parallel size for the model, eg how many GPUs to use for a model copy.
			
 
				+
			
 
				+data_parallel_size: 4 # The VLLM argument that speicify the data parallel size for the model, eg how copies of model will be used.
			
 
				+
			
 
				+gpu_memory_utilization: 0.9 #The VLLM argument that speicify gpu memory utilization, the rest will be reserved for KV cache.
			
 
				+
			
 
				+max_model_len: 8192 #The VLLM argument that speicify model max length, decrease this value only if GPU memory issue encountered. Please make sure the max_gen_toks in the yaml does not exceed this length.
			
 
				+
			
 
				+batch_size: "auto" # Batch size, can be 'auto', 'auto:N', or an integer. It is strongly recommend to use 'auto' for vllm to speed up the inference
			
 
				+
			
 
				+output_path: "eval_results" # the output folder to store all the eval results and samples.
			
 
				+
			
 
				+#limit: 12 # Limit number of examples per task, set 'null' to run all.
			
 
				+limit: null # Limit number of examples per task, set 'null' to run all.
			
 
				+
			
 
				+verbosity: "INFO" #Logging level: CRITICAL, ERROR, WARNING, INFO, DEBUG.
			
 
				+
			
 
				+log_samples: true # If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis.
			
 
				+
			
 
				+work_dir: ./work_dir # The work folder where the task template yaml files will be copied and modified, datasets will be downloaded for math_hard, ifeval.
			
 
				+
			
 
				+template_dir: ./meta_template #Path to the folder that contains all the meta task templates
			
 
				+
			
 
				+show_config: false # If True, shows the full config of all tasks at the end of the evaluation.
			
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/bbh/bbh_3shot_cot.yaml
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/bbh/bbh_3shot_cot.yaml
@@ -0,0 +1,28 @@
 
				+dataset_path: meta-llama/Meta-Llama-3.1-8B-evals
			
 
				+dataset_name: Meta-Llama-3.1-8B-evals__bbh__details
			
 
				+task: meta_bbh
			
 
				+output_type: generate_until
			
 
				+process_docs: !function utils.process_docs
			
 
				+test_split: latest
			
 
				+doc_to_text: !function utils.doc_to_text
			
 
				+doc_to_target: answer
			
 
				+filter_list:
			
 
				+  - name: "strict-match"
			
 
				+    filter:
			
 
				+      - function: "regex"
			
 
				+        regex_pattern: 'the answer is (.*?)\.'
			
 
				+      - function: "take_first"
			
 
				+generation_kwargs:
			
 
				+  until: "\n\nQ: "
			
 
				+  do_sample: false
			
 
				+  temperature: 0
			
 
				+  max_gen_toks: 512
			
 
				+num_fewshot: 0
			
 
				+metric_list:
			
 
				+  - metric: exact_match
			
 
				+    aggregation: mean
			
 
				+    higher_is_better: true
			
 
				+    ignore_case: true
			
 
				+    ignore_punctuation: true
			
 
				+metadata:
			
 
				+  version: 1.0
			
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/bbh/utils.py
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/bbh/utils.py
@@ -0,0 +1,21 @@
 
				+import random
			
 
				+import re
			
 
				+
			
 
				+import datasets
			
 
				+
			
 
				+
			
 
				+
			
 
				+def doc_to_text(doc: dict) -> str:
			
 
				+    return doc["input_final_prompts"][0]
			
 
				+
			
 
				+def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
			
 
				+    def _process_doc(doc: dict) -> dict:
			
 
				+        out_doc = {
			
 
				+            "problem": doc["input_question"],
			
 
				+            "answer": doc["input_correct_responses"][0],
			
 
				+        }
			
 
				+        return out_doc
			
 
				+    dataset = dataset.select_columns(["input_question", "input_correct_responses", "input_final_prompts", "is_correct","input_question_hash","output_prediction_text"])
			
 
				+    dataset = dataset.rename_column("is_correct","previously_is_correct")
			
 
				+    dataset = dataset.map(_process_doc)
			
 
				+    return dataset.map(_process_doc)
			
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/gpqa_cot/gpqa_0shot_cot.yaml
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/gpqa_cot/gpqa_0shot_cot.yaml
@@ -0,0 +1,29 @@
 
				+dataset_path: meta-llama/Meta-Llama-3.1-8B-Instruct-evals
			
 
				+dataset_name: Meta-Llama-3.1-8B-Instruct-evals__gpqa__details
			
 
				+task: meta_gpqa
			
 
				+output_type: generate_until
			
 
				+process_docs: !function utils.process_docs
			
 
				+test_split: latest
			
 
				+doc_to_text: !function utils.doc_to_text
			
 
				+doc_to_target: gold
			
 
				+filter_list:
			
 
				+  - name: "strict-match"
			
 
				+    filter:
			
 
				+      - function: "regex"
			
 
				+        group_select: -1
			
 
				+        regex_pattern: 'best answer is ([A-Z])'
			
 
				+      - function: "take_first"
			
 
				+generation_kwargs:
			
 
				+  until: []
			
 
				+  do_sample: false
			
 
				+  temperature: 0
			
 
				+  max_gen_toks: 2048
			
 
				+num_fewshot: 0
			
 
				+metric_list:
			
 
				+  - metric: exact_match
			
 
				+    aggregation: mean
			
 
				+    higher_is_better: true
			
 
				+    ignore_case: true
			
 
				+    ignore_punctuation: true
			
 
				+metadata:
			
 
				+  version: 1.0
			
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/gpqa_cot/utils.py
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/gpqa_cot/utils.py
@@ -0,0 +1,20 @@
 
				+import random
			
 
				+import re
			
 
				+
			
 
				+import datasets
			
 
				+
			
 
				+
			
 
				+
			
 
				+def doc_to_text(doc: dict) -> str:
			
 
				+    return doc["input_final_prompts"][0]
			
 
				+def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
			
 
				+    def _process_doc(doc: dict) -> dict:
			
 
				+        out_doc = {
			
 
				+            "problem": doc["input_question"],
			
 
				+            "gold": doc["input_correct_responses"][0],
			
 
				+        }
			
 
				+        return out_doc
			
 
				+    dataset = dataset.select_columns(["input_question", "input_correct_responses", "input_final_prompts", "is_correct","input_question_hash","input_choice_list","output_prediction_text"])
			
 
				+    dataset = dataset.rename_column("is_correct","previously_is_correct")
			
 
				+    dataset = dataset.map(_process_doc)
			
 
				+    return dataset.map(_process_doc)
			
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/ifeval/ifeval.yaml
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/ifeval/ifeval.yaml
@@ -0,0 +1,32 @@
 
				+task: meta_ifeval
			
 
				+dataset_path: parquet
			
 
				+dataset_kwargs:
			
 
				+  data_files: ./work_dir/joined_ifeval.parquet
			
 
				+output_type: generate_until
			
 
				+test_split: train
			
 
				+num_fewshot: 0
			
 
				+doc_to_text: prompt
			
 
				+doc_to_target: 0
			
 
				+generation_kwargs:
			
 
				+  until: []
			
 
				+  do_sample: false
			
 
				+  temperature: 0.0
			
 
				+  max_gen_toks: 1280
			
 
				+process_results: !function utils.process_results
			
 
				+metric_list:
			
 
				+  - metric: prompt_level_strict_acc
			
 
				+    aggregation: mean
			
 
				+    higher_is_better: true
			
 
				+  - metric: inst_level_strict_acc
			
 
				+    aggregation: !function utils.agg_inst_level_acc
			
 
				+    higher_is_better: true
			
 
				+  - metric: prompt_level_loose_acc
			
 
				+    aggregation: mean
			
 
				+    higher_is_better: true
			
 
				+  - metric: inst_level_loose_acc
			
 
				+    aggregation: !function utils.agg_inst_level_acc
			
 
				+    higher_is_better: true
			
 
				+metadata:
			
 
				+  version: 2.0
			
 
				+fewshot_config:
			
 
				+  sampler: first_n
			
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/ifeval/utils.py
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/ifeval/utils.py
@@ -0,0 +1,139 @@
 
				+import dataclasses
			
 
				+from typing import Dict, Optional, Union
			
 
				+
			
 
				+from lm_eval.tasks.ifeval import instructions_registry
			
 
				+
			
 
				+
			
 
				+@dataclasses.dataclass
			
 
				+class InputExample:
			
 
				+    key: int
			
 
				+    instruction_id_list: list[str]
			
 
				+    prompt: str
			
 
				+    kwargs: list[Dict[str, Optional[Union[str, int]]]]
			
 
				+
			
 
				+
			
 
				+@dataclasses.dataclass
			
 
				+class OutputExample:
			
 
				+    instruction_id_list: list[str]
			
 
				+    prompt: str
			
 
				+    response: str
			
 
				+    follow_all_instructions: bool
			
 
				+    follow_instruction_list: list[bool]
			
 
				+
			
 
				+
			
 
				+def test_instruction_following_strict(
			
 
				+    inp,
			
 
				+    response,
			
 
				+):
			
 
				+    """Tests response to see if instructions are followed."""
			
 
				+    instruction_list = inp.instruction_id_list
			
 
				+    is_following_list = []
			
 
				+
			
 
				+    for index, instruction_id in enumerate(instruction_list):
			
 
				+        instruction_cls = instructions_registry.INSTRUCTION_DICT[instruction_id]
			
 
				+        instruction = instruction_cls(instruction_id)
			
 
				+                
			
 
				+        # Remove None values from kwargs to avoid unexpected keyword argument errors in build_description method.
			
 
				+        kwargs = {k: v for k, v in inp.kwargs[index].items() if v}
			
 
				+        instruction.build_description(**kwargs)
			
 
				+        args = instruction.get_instruction_args()
			
 
				+        if args and "prompt" in args:
			
 
				+            instruction.build_description(prompt=inp.prompt)
			
 
				+
			
 
				+        if response.strip() and instruction.check_following(response):
			
 
				+            is_following_list.append(True)
			
 
				+        else:
			
 
				+            is_following_list.append(False)
			
 
				+
			
 
				+    return OutputExample(
			
 
				+        instruction_id_list=inp.instruction_id_list,
			
 
				+        prompt=inp.prompt,
			
 
				+        response=response,
			
 
				+        follow_all_instructions=all(is_following_list),
			
 
				+        follow_instruction_list=is_following_list,
			
 
				+    )
			
 
				+
			
 
				+
			
 
				+def test_instruction_following_loose(
			
 
				+    inp,
			
 
				+    response,
			
 
				+):
			
 
				+    """Tests response for an upper bound for following instructions."""
			
 
				+    r = response.split("\n")
			
 
				+    response_remove_first = "\n".join(r[1:]).strip()
			
 
				+    response_remove_last = "\n".join(r[:-1]).strip()
			
 
				+    response_remove_both = "\n".join(r[1:-1]).strip()
			
 
				+    revised_response = response.replace("*", "")
			
 
				+    revised_response_remove_first = response_remove_first.replace("*", "")
			
 
				+    revised_response_remove_last = response_remove_last.replace("*", "")
			
 
				+    revised_response_remove_both = response_remove_both.replace("*", "")
			
 
				+    all_responses = [
			
 
				+        response,
			
 
				+        revised_response,
			
 
				+        response_remove_first,
			
 
				+        response_remove_last,
			
 
				+        response_remove_both,
			
 
				+        revised_response_remove_first,
			
 
				+        revised_response_remove_last,
			
 
				+        revised_response_remove_both,
			
 
				+    ]
			
 
				+    instruction_list = inp.instruction_id_list
			
 
				+    is_following_list = []
			
 
				+
			
 
				+    for index, instruction_id in enumerate(instruction_list):
			
 
				+        instruction_cls = instructions_registry.INSTRUCTION_DICT[instruction_id]
			
 
				+        instruction = instruction_cls(instruction_id)
			
 
				+
			
 
				+        # Remove None values from kwargs to avoid unexpected keyword argument errors in build_description method.
			
 
				+        kwargs = {k: v for k, v in inp.kwargs[index].items() if v}
			
 
				+        instruction.build_description(**kwargs)
			
 
				+        args = instruction.get_instruction_args()
			
 
				+        if args and "prompt" in args:
			
 
				+            instruction.build_description(prompt=inp.prompt)
			
 
				+
			
 
				+        is_following = False
			
 
				+        for r in all_responses:
			
 
				+            if r.strip() and instruction.check_following(r):
			
 
				+                is_following = True
			
 
				+                break
			
 
				+
			
 
				+        is_following_list.append(is_following)
			
 
				+
			
 
				+    return OutputExample(
			
 
				+        instruction_id_list=inp.instruction_id_list,
			
 
				+        prompt=inp.prompt,
			
 
				+        response=response,
			
 
				+        follow_all_instructions=all(is_following_list),
			
 
				+        follow_instruction_list=is_following_list,
			
 
				+    )
			
 
				+
			
 
				+
			
 
				+def process_results(doc, results):
			
 
				+    new_kwargs = []
			
 
				+    for item in doc["kwargs"]:
			
 
				+        if item["nth_paragraph"]:
			
 
				+            item["nth_paragraph"] = int(item["nth_paragraph"])
			
 
				+        new_kwargs.append(item)
			
 
				+    inp = InputExample(
			
 
				+        key=doc["key"],
			
 
				+        instruction_id_list=doc["instruction_id_list"],
			
 
				+        prompt=doc["prompt"],
			
 
				+        kwargs=new_kwargs,
			
 
				+    )
			
 
				+    response = results[0]
			
 
				+
			
 
				+    out_strict = test_instruction_following_strict(inp, response)
			
 
				+    out_loose = test_instruction_following_loose(inp, response)
			
 
				+
			
 
				+    return {
			
 
				+        "prompt_level_strict_acc": out_strict.follow_all_instructions,
			
 
				+        "inst_level_strict_acc": out_strict.follow_instruction_list,
			
 
				+        "prompt_level_loose_acc": out_loose.follow_all_instructions,
			
 
				+        "inst_level_loose_acc": out_loose.follow_instruction_list,
			
 
				+    }
			
 
				+
			
 
				+
			
 
				+def agg_inst_level_acc(items):
			
 
				+    flat_items = [item for sublist in items for item in sublist]
			
 
				+    inst_level_acc = sum(flat_items) / len(flat_items)
			
 
				+    return inst_level_acc
			
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/math_hard/math_hard_0shot_cot.yaml
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/math_hard/math_hard_0shot_cot.yaml
@@ -0,0 +1,21 @@
 
				+dataset_path: parquet
			
 
				+dataset_kwargs:
			
 
				+  data_files: ./work_dir/joined_math.parquet
			
 
				+task: meta_math_hard
			
 
				+process_docs: !function utils.process_docs
			
 
				+output_type: generate_until
			
 
				+test_split: train
			
 
				+doc_to_text:  !function utils.doc_to_text
			
 
				+process_results: !function utils.process_results
			
 
				+doc_to_target: answer
			
 
				+generation_kwargs:
			
 
				+  until: []
			
 
				+  do_sample: false
			
 
				+  temperature: 0
			
 
				+  max_gen_toks: 5120
			
 
				+metric_list:
			
 
				+  - metric: exact_match
			
 
				+    aggregation: mean
			
 
				+    higher_is_better: true
			
 
				+metadata:
			
 
				+  version: 1.0
			
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/math_hard/utils.py
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/math_hard/utils.py
@@ -0,0 +1,268 @@
 
				+# Most of the code taken from https://github.com/EleutherAI/lm-evaluation-harness/blob/cddce0a148ec1710e2d60546c6f92727dd8a78fd/lm_eval/tasks/leaderboard/math/utils.py
			
 
				+import re
			
 
				+import signal
			
 
				+from typing import Dict, List, Optional
			
 
				+
			
 
				+import datasets
			
 
				+
			
 
				+from lm_eval.utils import eval_logger
			
 
				+
			
 
				+
			
 
				+try:
			
 
				+    import sympy
			
 
				+    from sympy.parsing.latex import parse_latex
			
 
				+except ModuleNotFoundError:
			
 
				+    raise ModuleNotFoundError(
			
 
				+        "`sympy` is required for generating translation task prompt templates. \
			
 
				+please install sympy via pip install lm-eval[math] or pip install -e .[math]",
			
 
				+    )
			
 
				+
			
 
				+# taken from
			
 
				+# https://github.com/wellecks/lm-evaluation-harness/blob/master/lm_eval/tasks/minerva_math.py
			
 
				+def doc_to_text(doc: dict) -> str:
			
 
				+    return doc["input_final_prompts"][0]
			
 
				+
			
 
				+def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
			
 
				+    def _process_doc(doc: dict) -> dict:
			
 
				+        out_doc = {
			
 
				+            "problem": doc["input_question"],
			
 
				+            "answer": normalize_final_answer(
			
 
				+                 remove_boxed(last_boxed_only_string(doc["solution"]))
			
 
				+            ),
			
 
				+            "meta_target": doc["input_correct_responses"]
			
 
				+        }
			
 
				+        return out_doc
			
 
				+    return dataset.map(_process_doc)
			
 
				+
			
 
				+
			
 
				+def process_results(doc: dict, results: List[str]) -> Dict[str, int]:
			
 
				+    candidates = results[0]
			
 
				+    last_boxed_string = last_boxed_only_string(candidates)
			
 
				+    if not last_boxed_string:
			
 
				+        # No boxed string found, so we can't evaluate
			
 
				+        return {"exact_match": 0}
			
 
				+    unnormalized_answer = remove_boxed(last_boxed_string)
			
 
				+    answer = normalize_final_answer(unnormalized_answer)
			
 
				+
			
 
				+    if answer.strip() == doc["answer"].strip() or is_equiv(answer, doc["answer"]):
			
 
				+        retval = 1
			
 
				+    else:
			
 
				+        retval = 0
			
 
				+
			
 
				+    results = {
			
 
				+        "exact_match": retval,
			
 
				+    }
			
 
				+    return results
			
 
				+
			
 
				+
			
 
				+def last_boxed_only_string(string: str) -> Optional[str]:
			
 
				+    idx = string.rfind("\\boxed")
			
 
				+    if "\\boxed " in string:
			
 
				+        return "\\boxed " + string.split("\\boxed ")[-1].split("$")[0]
			
 
				+    if idx < 0:
			
 
				+        idx = string.rfind("\\fbox")
			
 
				+        if idx < 0:
			
 
				+            return None
			
 
				+
			
 
				+    i = idx
			
 
				+    right_brace_idx = None
			
 
				+    num_left_braces_open = 0
			
 
				+    while i < len(string):
			
 
				+        if string[i] == "{":
			
 
				+            num_left_braces_open += 1
			
 
				+        if string[i] == "}":
			
 
				+            num_left_braces_open -= 1
			
 
				+            if num_left_braces_open == 0:
			
 
				+                right_brace_idx = i
			
 
				+                break
			
 
				+        i += 1
			
 
				+
			
 
				+    if right_brace_idx is None:
			
 
				+        retval = None
			
 
				+    else:
			
 
				+        retval = string[idx : right_brace_idx + 1]
			
 
				+
			
 
				+    return retval
			
 
				+
			
 
				+
			
 
				+def remove_boxed(s: str) -> str:
			
 
				+    if "\\boxed " in s:
			
 
				+        left = "\\boxed "
			
 
				+        assert s[: len(left)] == left
			
 
				+        return s[len(left) :]
			
 
				+
			
 
				+    left = "\\boxed{"
			
 
				+
			
 
				+    assert s[: len(left)] == left
			
 
				+    assert s[-1] == "}"
			
 
				+
			
 
				+    return s[len(left) : -1]
			
 
				+
			
 
				+
			
 
				+class timeout:
			
 
				+    def __init__(self, seconds=1, error_message="Timeout"):
			
 
				+        self.seconds = seconds
			
 
				+        self.error_message = error_message
			
 
				+
			
 
				+    def handle_timeout(self, signum, frame):
			
 
				+        raise TimeoutError(self.error_message)
			
 
				+
			
 
				+    def __enter__(self):
			
 
				+        signal.signal(signal.SIGALRM, self.handle_timeout)
			
 
				+        signal.alarm(self.seconds)
			
 
				+
			
 
				+    def __exit__(self, type, value, traceback):
			
 
				+        signal.alarm(0)
			
 
				+
			
 
				+
			
 
				+def is_equiv(x1: str, x2: str) -> bool:
			
 
				+    """
			
 
				+    x1 and x2 are normalized latex string
			
 
				+    """
			
 
				+    try:
			
 
				+        with timeout(seconds=5):
			
 
				+            try:
			
 
				+                parsed_x1 = parse_latex(x1)
			
 
				+                parsed_x2 = parse_latex(x2)
			
 
				+            except (
			
 
				+                sympy.parsing.latex.errors.LaTeXParsingError,
			
 
				+                sympy.SympifyError,
			
 
				+                TypeError,
			
 
				+            ):
			
 
				+                eval_logger.debug(f"couldn't parse one of {x1} or {x2}")
			
 
				+                return False
			
 
				+
			
 
				+            try:
			
 
				+                diff = parsed_x1 - parsed_x2
			
 
				+            except TypeError:
			
 
				+                eval_logger.debug(f"couldn't subtract {x1} and {x2}")
			
 
				+                return False
			
 
				+
			
 
				+            try:
			
 
				+                if sympy.simplify(diff) == 0:
			
 
				+                    return True
			
 
				+                else:
			
 
				+                    return False
			
 
				+            except ValueError:
			
 
				+                eval_logger.debug(
			
 
				+                    f"Had some trouble simplifying when comparing {x1} and {x2}"
			
 
				+                )
			
 
				+    except TimeoutError:
			
 
				+        eval_logger.debug(f"Timed out comparing {x1} and {x2}")
			
 
				+        return False
			
 
				+    except ImportError as e:
			
 
				+        eval_logger.error(e)
			
 
				+        raise
			
 
				+    except Exception as e:
			
 
				+        eval_logger.debug(f"Failed comparing {x1} and {x2} with {e}")
			
 
				+        return False
			
 
				+
			
 
				+
			
 
				+def get_unnormalized_answer(text: str) -> str:
			
 
				+    INVALID_ANSWER = "[invalidanswer]"
			
 
				+    end_seq = "I hope it is correct."
			
 
				+    text += end_seq
			
 
				+    match = re.search(
			
 
				+        r"Final Answer: The final answer is(.*?). I hope it is correct.",
			
 
				+        text,
			
 
				+    )
			
 
				+    if match:
			
 
				+        return match.group(1).strip()
			
 
				+    else:
			
 
				+        return INVALID_ANSWER
			
 
				+
			
 
				+
			
 
				+SUBSTITUTIONS = [
			
 
				+    ("an ", ""),
			
 
				+    ("a ", ""),
			
 
				+    (".$", "$"),
			
 
				+    ("\\$", ""),
			
 
				+    (r"\ ", ""),
			
 
				+    (" ", ""),
			
 
				+    ("mbox", "text"),
			
 
				+    (",\\text{and}", ","),
			
 
				+    ("\\text{and}", ","),
			
 
				+    ("\\text{m}", "\\text{}"),
			
 
				+]
			
 
				+REMOVED_EXPRESSIONS = [
			
 
				+    "square",
			
 
				+    "ways",
			
 
				+    "integers",
			
 
				+    "dollars",
			
 
				+    "mph",
			
 
				+    "inches",
			
 
				+    "ft",
			
 
				+    "hours",
			
 
				+    "km",
			
 
				+    "units",
			
 
				+    "\\ldots",
			
 
				+    "sue",
			
 
				+    "points",
			
 
				+    "feet",
			
 
				+    "minutes",
			
 
				+    "digits",
			
 
				+    "cents",
			
 
				+    "degrees",
			
 
				+    "cm",
			
 
				+    "gm",
			
 
				+    "pounds",
			
 
				+    "meters",
			
 
				+    "meals",
			
 
				+    "edges",
			
 
				+    "students",
			
 
				+    "childrentickets",
			
 
				+    "multiples",
			
 
				+    "\\text{s}",
			
 
				+    "\\text{.}",
			
 
				+    "\\text{\ns}",
			
 
				+    "\\text{}^2",
			
 
				+    "\\text{}^3",
			
 
				+    "\\text{\n}",
			
 
				+    "\\text{}",
			
 
				+    r"\mathrm{th}",
			
 
				+    r"^\circ",
			
 
				+    r"^{\circ}",
			
 
				+    r"\;",
			
 
				+    r",\!",
			
 
				+    "{,}",
			
 
				+    '"',
			
 
				+    "\\dots",
			
 
				+]
			
 
				+
			
 
				+
			
 
				+def normalize_final_answer(final_answer: str) -> str:
			
 
				+    """
			
 
				+    Normalize a final answer to a quantitative reasoning question.
			
 
				+
			
 
				+    Copied character for character from appendix D of Lewkowycz et al. (2022)
			
 
				+    """
			
 
				+    final_answer = final_answer.split("=")[-1]
			
 
				+
			
 
				+    for before, after in SUBSTITUTIONS:
			
 
				+        final_answer = final_answer.replace(before, after)
			
 
				+    for expr in REMOVED_EXPRESSIONS:
			
 
				+        final_answer = final_answer.replace(expr, "")
			
 
				+
			
 
				+    # Extract answer that is in LaTeX math, is bold,
			
 
				+    # is surrounded by a box, etc.
			
 
				+    final_answer = re.sub(r"(.*?)(\$)(.*?)(\$)(.*)", "$\\3$", final_answer)
			
 
				+    final_answer = re.sub(r"(\\text\{)(.*?)(\})", "\\2", final_answer)
			
 
				+    final_answer = re.sub(r"(\\textbf\{)(.*?)(\})", "\\2", final_answer)
			
 
				+    final_answer = re.sub(r"(\\overline\{)(.*?)(\})", "\\2", final_answer)
			
 
				+    final_answer = re.sub(r"(\\boxed\{)(.*)(\})", "\\2", final_answer)
			
 
				+
			
 
				+    # Normalize shorthand TeX:
			
 
				+    #  \fracab -> \frac{a}{b}
			
 
				+    #  \frac{abc}{bef} -> \frac{abc}{bef}
			
 
				+    #  \fracabc -> \frac{a}{b}c
			
 
				+    #  \sqrta -> \sqrt{a}
			
 
				+    #  \sqrtab -> sqrt{a}b
			
 
				+    final_answer = re.sub(r"(frac)([^{])(.)", "frac{\\2}{\\3}", final_answer)
			
 
				+    final_answer = re.sub(r"(sqrt)([^{])", "sqrt{\\2}", final_answer)
			
 
				+    final_answer = final_answer.replace("$", "")
			
 
				+
			
 
				+    # Normalize 100,000 -> 100000
			
 
				+    if final_answer.replace(",", "").isdigit():
			
 
				+        final_answer = final_answer.replace(",", "")
			
 
				+
			
 
				+    return final_answer
			
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/meta_instruct.yaml
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/meta_instruct.yaml
@@ -0,0 +1,6 @@
 
				+group: meta_instruct
			
 
				+task:
			
 
				+  - meta_ifeval
			
 
				+  - meta_math_hard
			
 
				+  - meta_gpqa
			
 
				+  - meta_mmlu_pro_instruct
			
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/meta_pretrain.yaml
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/meta_pretrain.yaml
@@ -0,0 +1,4 @@
 
				+group: meta_pretrain
			
 
				+task:
			
 
				+  - meta_bbh
			
 
				+  - meta_mmlu_pro_pretrain
			
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/mmlu_pro/mmlu_pro_5shot_cot_instruct.yaml
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/mmlu_pro/mmlu_pro_5shot_cot_instruct.yaml
@@ -0,0 +1,29 @@
 
				+task: meta_mmlu_pro_instruct
			
 
				+dataset_path: meta-llama/Meta-Llama-3.1-8B-Instruct-evals
			
 
				+dataset_name: Meta-Llama-3.1-8B-Instruct-evals__mmlu_pro__details
			
 
				+test_split: latest
			
 
				+output_type: generate_until
			
 
				+process_docs: !function utils.process_docs
			
 
				+doc_to_text: !function utils.doc_to_text
			
 
				+doc_to_target: gold
			
 
				+filter_list:
			
 
				+  - name: "strict-match"
			
 
				+    filter:
			
 
				+      - function: "regex"
			
 
				+        group_select: -1
			
 
				+        regex_pattern: 'best answer is ([A-Z])'
			
 
				+      - function: "take_first"
			
 
				+generation_kwargs:
			
 
				+  until: []
			
 
				+  do_sample: false
			
 
				+  temperature: 0
			
 
				+  max_gen_toks: 1024
			
 
				+num_fewshot: 0
			
 
				+metric_list:
			
 
				+  - metric: exact_match
			
 
				+    aggregation: mean
			
 
				+    higher_is_better: true
			
 
				+    ignore_case: true
			
 
				+    ignore_punctuation: true
			
 
				+metadata:
			
 
				+  version: 1.0
			
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/mmlu_pro/mmlu_pro_5shot_cot_pretrain.yaml
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/mmlu_pro/mmlu_pro_5shot_cot_pretrain.yaml
@@ -0,0 +1,28 @@
 
				+task: meta_mmlu_pro_pretrain
			
 
				+dataset_path: meta-llama/Meta-Llama-3.1-8B-evals
			
 
				+dataset_name: Meta-Llama-3.1-8B-evals__mmlu_pro__details
			
 
				+test_split: latest
			
 
				+output_type: generate_until
			
 
				+process_docs: !function utils.process_docs
			
 
				+doc_to_text: !function utils.doc_to_text
			
 
				+doc_to_target: gold
			
 
				+filter_list:
			
 
				+  - name: "strict-match"
			
 
				+    filter:
			
 
				+      - function: "regex"
			
 
				+        regex_pattern: 'answer is \(([A-Z])\)'
			
 
				+      - function: "take_first"
			
 
				+generation_kwargs:
			
 
				+  until: "\n\nQ: "
			
 
				+  do_sample: false
			
 
				+  temperature: 0
			
 
				+  max_gen_toks: 512
			
 
				+num_fewshot: 0
			
 
				+metric_list:
			
 
				+  - metric: exact_match
			
 
				+    aggregation: mean
			
 
				+    higher_is_better: true
			
 
				+    ignore_case: true
			
 
				+    ignore_punctuation: true
			
 
				+metadata:
			
 
				+  version: 1.0
			
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/mmlu_pro/utils.py
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/meta_template/mmlu_pro/utils.py
@@ -0,0 +1,21 @@
 
				+import string
			
 
				+
			
 
				+
			
 
				+import datasets
			
 
				+
			
 
				+
			
 
				+
			
 
				+def doc_to_text(doc: dict) -> str:
			
 
				+    return doc["input_final_prompts"][0]
			
 
				+
			
 
				+def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
			
 
				+    def _process_doc(doc: dict) -> dict:
			
 
				+        out_doc = {
			
 
				+            "problem": doc["input_question"],
			
 
				+            "gold": doc["input_correct_responses"][0],
			
 
				+        }
			
 
				+        return out_doc
			
 
				+    dataset = dataset.select_columns(["input_question", "input_correct_responses", "input_final_prompts", "is_correct","input_question_hash","input_choice_list","output_prediction_text"])
			
 
				+    dataset = dataset.rename_column("is_correct","previously_is_correct")
			
 
				+    dataset = dataset.map(_process_doc)
			
 
				+    return dataset.map(_process_doc)
			
--- a/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/prepare_meta_eval.py
+++ b/tools/benchmarks/llm_eval_harness/meta_eval_reproduce/prepare_meta_eval.py
@@ -0,0 +1,237 @@
 
				+# Copyright (c) Meta Platforms, Inc. and affiliates.
			
 
				+# This software may be used and distributed according to the terms of the Llama 3 Community License Agreement.
			
 
				+
			
 
				+import argparse
			
 
				+import errno
			
 
				+import shutil
			
 
				+import glob
			
 
				+import os
			
 
				+from pathlib import Path
			
 
				+import nltk
			
 
				+import yaml
			
 
				+from datasets import Dataset, load_dataset
			
 
				+
			
 
				+
			
 
				+# get the ifeval  from the evals dataset and join it with the original ifeval datasets
			
 
				+def get_ifeval_data(model_name, output_dir):
			
 
				+    print(f"preparing the ifeval data using {model_name}'s evals dataset")
			
 
				+    if model_name not in [
			
 
				+        "Meta-Llama-3.1-8B-Instruct",
			
 
				+        "Meta-Llama-3.1-70B-Instruct",
			
 
				+        "Meta-Llama-3.1-405B-Instruct",
			
 
				+    ]:
			
 
				+        raise ValueError(
			
 
				+            "Only Meta-Llama-3.1-8B-Instruct, Meta-Llama-3.1-70B-Instruct, Meta-Llama-3.1-405B-Instruct models are supported for IFEval"
			
 
				+        )
			
 
				+    original_dataset_name = "wis-k/instruction-following-eval"
			
 
				+    meta_dataset_name = f"meta-llama/{model_name}-evals"
			
 
				+    meta_data = load_dataset(
			
 
				+        meta_dataset_name,
			
 
				+        name=f"{model_name}-evals__ifeval__strict__details",
			
 
				+        split="latest",
			
 
				+    )
			
 
				+    ifeval_data = load_dataset(original_dataset_name, split="train")
			
 
				+    meta_data = meta_data.map(get_question)
			
 
				+    meta_df = meta_data.to_pandas()
			
 
				+    ifeval_df = ifeval_data.to_pandas()
			
 
				+    ifeval_df = ifeval_df.rename(columns={"prompt": "input_question"})
			
 
				+    # join the two datasets on the input_question column
			
 
				+    joined = meta_df.join(ifeval_df.set_index("input_question"), on="input_question")
			
 
				+    joined = joined.rename(columns={"input_final_prompts": "prompt"})
			
 
				+    joined = joined.rename(columns={"is_correct": "previous_is_correct"})
			
 
				+    joined = Dataset.from_pandas(joined)
			
 
				+    joined = joined.select_columns(
			
 
				+        [
			
 
				+            "input_question",
			
 
				+            "prompt",
			
 
				+            "previous_is_correct",
			
 
				+            "instruction_id_list",
			
 
				+            "kwargs",
			
 
				+            "output_prediction_text",
			
 
				+            "key",
			
 
				+        ]
			
 
				+    )
			
 
				+    joined.rename_column("output_prediction_text", "previous_output_prediction_text")
			
 
				+    joined.to_parquet(output_dir + "/joined_ifeval.parquet")
			
 
				+
			
 
				+
			
 
				+# get the math_hard data from the evals dataset and join it with the original math_hard dataset
			
 
				+def get_math_data(model_name, output_dir):
			
 
				+    print(f"preparing the math data using {model_name}'s evals dataset")
			
 
				+    if model_name not in [
			
 
				+        "Meta-Llama-3.1-8B-Instruct",
			
 
				+        "Meta-Llama-3.1-70B-Instruct",
			
 
				+        "Meta-Llama-3.1-405B-Instruct",
			
 
				+    ]:
			
 
				+        raise ValueError(
			
 
				+            "Only Meta-Llama-3.1-8B-Instruct, Meta-Llama-3.1-70B-Instruct, Meta-Llama-3.1-405B-Instruct models are supported for MATH_hard"
			
 
				+        )
			
 
				+    original_dataset_name = "lighteval/MATH-Hard"
			
 
				+    meta_dataset_name = f"meta-llama/{model_name}-evals"
			
 
				+    meta_data = load_dataset(
			
 
				+        meta_dataset_name,
			
 
				+        name=f"{model_name}-evals__math_hard__details",
			
 
				+        split="latest",
			
 
				+    )
			
 
				+    math_data = load_dataset(original_dataset_name, split="test")
			
 
				+    meta_df = meta_data.to_pandas()
			
 
				+    math_df = math_data.to_pandas()
			
 
				+    math_df = math_df.rename(columns={"problem": "input_question"})
			
 
				+    # join the two datasets on the input_question column
			
 
				+    joined = meta_df.join(math_df.set_index("input_question"), on="input_question")
			
 
				+    joined = Dataset.from_pandas(joined)
			
 
				+    joined = joined.select_columns(
			
 
				+        [
			
 
				+            "input_question",
			
 
				+            "input_correct_responses",
			
 
				+            "input_final_prompts",
			
 
				+            "is_correct",
			
 
				+            "solution",
			
 
				+            "output_prediction_text",
			
 
				+        ]
			
 
				+    )
			
 
				+    joined = joined.rename_column("is_correct", "previous_is_correct")
			
 
				+    joined = joined.rename_column(
			
 
				+        "output_prediction_text", "previous_output_prediction_text"
			
 
				+    )
			
 
				+
			
 
				+    joined.to_parquet(output_dir + "/joined_math.parquet")
			
 
				+
			
 
				+
			
 
				+# get the question from the ifeval dataset
			
 
				+def get_question(example):
			
 
				+    try:
			
 
				+        example["input_question"] = (
			
 
				+            eval(
			
 
				+                example["input_question"]
			
 
				+                .replace("null", "None")
			
 
				+                .replace("true", "True")
			
 
				+                .replace("false", "False")
			
 
				+            )["dialog"][0]["body"]
			
 
				+            .replace("Is it True that the first song", "Is it true that the first song")
			
 
				+            .replace("Is the following True", "Is the following true")
			
 
				+        )
			
 
				+        example["input_final_prompts"] = example["input_final_prompts"][0]
			
 
				+        return example
			
 
				+    except:
			
 
				+        print(example["input_question"])
			
 
				+        return
			
 
				+
			
 
				+
			
 
				+# change the yaml file to use the correct model name
			
 
				+def change_yaml(args, base_name):
			
 
				+    for yaml_file in glob.glob(args.template_dir + "**/*/*.yaml", recursive=True):
			
 
				+        with open(yaml_file, "r") as sources:
			
 
				+            lines = sources.readlines()
			
 
				+        output_path = yaml_file.replace(args.template_dir, args.work_dir)
			
 
				+        print(f"changing {yaml_file} to output_path: {output_path}")
			
 
				+        path = Path(output_path)
			
 
				+        yaml_dir = path.parent
			
 
				+        with open(output_path, "w") as output:
			
 
				+            for line in lines:
			
 
				+                output.write(
			
 
				+                    line.replace("Meta-Llama-3.1-8B", base_name).replace(
			
 
				+                        "WORK_DIR", str(yaml_dir)
			
 
				+                    )
			
 
				+                )
			
 
				+
			
 
				+
			
 
				+# copy the files and change the yaml file to use the correct model name
			
 
				+def copy_and_prepare(args):
			
 
				+    # nltk punkt_tab package is needed
			
 
				+    nltk.download('punkt_tab')
			
 
				+    if not os.path.exists(args.work_dir):
			
 
				+        # Copy the all files, including yaml files and python files, from template folder to the work folder
			
 
				+
			
 
				+        copy_dir(args.template_dir, args.work_dir)
			
 
				+    else:
			
 
				+        print("work_dir already exists, no need to copy files")
			
 
				+    # Use the template yaml to get the correct model name in work_dir yaml
			
 
				+    base_name = (
			
 
				+        args.evals_dataset.split("/")[-1].replace("-evals", "").replace("-Instruct", "")
			
 
				+    )
			
 
				+    change_yaml(args, base_name)
			
 
				+
			
 
				+
			
 
				+def parse_eval_args():
			
 
				+    parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
			
 
				+    parser.add_argument(
			
 
				+        "--config_path",
			
 
				+        type=str,
			
 
				+        default="./eval_config.yaml",
			
 
				+        help="the config yaml file that contains all the eval parameters",
			
 
				+    )
			
 
				+    return parser.parse_args()
			
 
				+
			
 
				+
			
 
				+def prepare_datasets(args):
			
 
				+    # Prepare the dataset for the IFeval and MATH_Hard tasks as we need to join the original dataset with the evals dataset by the actual questions.
			
 
				+    # model_name are derived from the evals_dataset name
			
 
				+    task_list = args.tasks.split(",")
			
 
				+    model_name = args.evals_dataset.split("/")[-1].replace("-evals", "")
			
 
				+    if "meta_instruct" in task_list:
			
 
				+        get_ifeval_data(model_name, args.work_dir)
			
 
				+
			
 
				+        get_math_data(model_name, args.work_dir)
			
 
				+    else:
			
 
				+        if "meta_ifeval" in task_list:
			
 
				+            get_ifeval_data(model_name, args.work_dir)
			
 
				+        if "meta_math_hard" in task_list:
			
 
				+            get_math_data(model_name, args.work_dir)
			
 
				+
			
 
				+
			
 
				+# copy the files from src to dst
			
 
				+def copy_dir(src, dst):
			
 
				+    try:
			
 
				+        shutil.copytree(src, dst)
			
 
				+    except OSError as exc:  # python >2.5
			
 
				+        if exc.errno in (errno.ENOTDIR, errno.EINVAL):
			
 
				+            shutil.copy(src, dst)
			
 
				+        else:
			
 
				+            raise
			
 
				+
			
 
				+
			
 
				+# load the config yaml file
			
 
				+def load_config(config_path: str = "./config.yaml"):
			
 
				+    # Read the YAML configuration file
			
 
				+    with open(config_path, "r") as file:
			
 
				+        config = yaml.safe_load(file)
			
 
				+    return config
			
 
				+
			
 
				+
			
 
				+if __name__ == "__main__":
			
 
				+    args = parse_eval_args()
			
 
				+    config = load_config(args.config_path)
			
 
				+    # Create VLLM model args
			
 
				+    for k, v in config.items():
			
 
				+        args.__setattr__(k, v)
			
 
				+    if not os.path.exists(args.template_dir):
			
 
				+        raise ValueError("The template_dir does not exist, please check the path")
			
 
				+    if args.evals_dataset not in [
			
 
				+        "meta-llama/Meta-Llama-3.1-8B-Instruct-evals",
			
 
				+        "meta-llama/Meta-Llama-3.1-70B-Instruct-evals",
			
 
				+        "meta-llama/Meta-Llama-3.1-405B-Instruct-evals",
			
 
				+        "meta-llama/Meta-Llama-3.1-8B-evals",
			
 
				+        "meta-llama/Meta-Llama-3.1-70B-evals",
			
 
				+        "meta-llama/Meta-Llama-3.1-405B-evals",
			
 
				+    ]:
			
 
				+        raise ValueError(
			
 
				+            "The evals dataset is not valid, please double check the name, must use the name in the Llama 3.1 Evals collection"
			
 
				+        )
			
 
				+    args.model_args = f"pretrained={args.model_name},tensor_parallel_size={args.tensor_parallel_size},dtype=auto,gpu_memory_utilization={args.gpu_memory_utilization},data_parallel_size={args.data_parallel_size},max_model_len={args.max_model_len},add_bos_token=True,seed=42"
			
 
				+    # Copy the all files from template folder to the work folder
			
 
				+    copy_and_prepare(args)
			
 
				+    # Prepare the datasets for the IFeval and MATH_Hard tasks as we need to join the original dataset
			
 
				+    prepare_datasets(args)
			
 
				+    print(
			
 
				+        f"prepration for the {args.model_name} using {args.evals_dataset} is done, all saved the work_dir: {args.work_dir}"
			
 
				+    )
			
 
				+    command_str = f"lm_eval --model vllm   --model_args {args.model_args} --tasks {args.tasks} --batch_size auto --output_path { args.output_path} --include_path {os.path.abspath(args.work_dir)} --seed 42 "
			
 
				+    if args.limit:
			
 
				+        command_str += f" --limit {args.limit}"
			
 
				+    if args.log_samples:
			
 
				+        command_str += " --log_samples "
			
 
				+    if args.show_config:
			
 
				+        command_str += " --show_config "
			
 
				+    print("please use the following command to run the meta reproduce evals:")
			
 
				+    print(command_str)