
rewrite readme

Sanyam Bhutani, 6 months ago
parent commit 83e353178c

File diff suppressed because it is too large
+ 39 - 0
README.md


+ 2 - 1
UPDATES.md

@@ -13,10 +13,11 @@ Nested Folders rename:
 - /dev_requirements.txt -> /src/dev_requirements.txt
 - /requirements.txt -> /src/requirements.txt
 - /tools -> /end-to-end-use-cases/benchmarks/ 
+- /recipes/experimental/long_context -> /end-to-end-use-cases/long_context
 
 
 Removed folders:
 - /flagged (Empty folder)
 - /recipes/quickstart/Running_Llama3_Anywhere (Redundant code)
-- /recipes/quickstart/codellama (deprecated model)
+- /recipes/quickstart/inference/codellama (deprecated model)
 

recipes/experimental/long_context/H2O/README.md → end-to-end-use-cases/long_context/H2O/README.md


recipes/experimental/long_context/H2O/data/summarization/cnn_dailymail.jsonl → end-to-end-use-cases/long_context/H2O/data/summarization/cnn_dailymail.jsonl


recipes/experimental/long_context/H2O/data/summarization/xsum.jsonl → end-to-end-use-cases/long_context/H2O/data/summarization/xsum.jsonl


recipes/experimental/long_context/H2O/requirements.txt → end-to-end-use-cases/long_context/H2O/requirements.txt


recipes/experimental/long_context/H2O/run_streaming.py → end-to-end-use-cases/long_context/H2O/run_streaming.py


recipes/experimental/long_context/H2O/run_summarization.py → end-to-end-use-cases/long_context/H2O/run_summarization.py


recipes/experimental/long_context/H2O/src/streaming.sh → end-to-end-use-cases/long_context/H2O/src/streaming.sh


recipes/experimental/long_context/H2O/utils/cache.py → end-to-end-use-cases/long_context/H2O/utils/cache.py


recipes/experimental/long_context/H2O/utils/llama.py → end-to-end-use-cases/long_context/H2O/utils/llama.py


recipes/experimental/long_context/H2O/utils/streaming.py → end-to-end-use-cases/long_context/H2O/utils/streaming.py
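
Together with the UPDATES.md changes above, these moves mean that links and imports written against the old layout will break. Below is a hypothetical migration helper (not part of this commit) that rewrites the recorded old paths to their new locations in local docs and scripts:

```python
from pathlib import Path

# Old -> new locations recorded in UPDATES.md above; extend as needed.
# This helper is hypothetical and not part of the repository.
RENAMES = {
    "recipes/experimental/long_context": "end-to-end-use-cases/long_context",
    "tools/": "end-to-end-use-cases/benchmarks/",
}

def rewrite_stale_paths(path: Path) -> None:
    """Rewrite stale repository paths in a local file, in place."""
    text = path.read_text(encoding="utf-8")
    for old, new in RENAMES.items():
        text = text.replace(old, new)
    path.write_text(text, encoding="utf-8")

# Example: update every Markdown file under the current directory.
for md_file in Path(".").rglob("*.md"):
    rewrite_stale_paths(md_file)
```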


+ 0 - 336
getting-started/Running_Llama3_Anywhere/Running_Llama_on_HF_transformers.ipynb

@@ -1,336 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Running Meta Llama 3.1 on Google Colab using Hugging Face transformers library\n",
-    "This notebook goes over how you can set up and run Llama 3.1 using Hugging Face transformers library\n",
-    "<a href=\"https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/quickstart/Running_Llama2_Anywhere/Running_Llama_on_HF_transformers.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Steps at a glance:\n",
-    "This demo showcases how to run the example with already converted Llama 3.1 weights on [Hugging Face](https://huggingface.co/meta-llama). Please Note: To use the downloads on Hugging Face, you must first request a download as shown in the steps below making sure that you are using the same email address as your Hugging Face account.\n",
-    "\n",
-    "To use already converted weights, start here:\n",
-    "1. Request download of model weights from the Llama website\n",
-    "2. Login to Hugging Face from your terminal using the same email address as (1). Follow the instructions [here](https://huggingface.co/docs/huggingface_hub/en/quick-start). \n",
-    "3. Run the example\n",
-    "\n",
-    "\n",
-    "Else, if you'd like to download the models locally and convert them to the HF format, follow the steps below to convert the weights:\n",
-    "1. Request download of model weights from the Llama website\n",
-    "2. Clone the llama repo and get the weights\n",
-    "3. Convert the model weights\n",
-    "4. Prepare the script\n",
-    "5. Run the example"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Using already converted weights"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### 1. Request download of model weights from the Llama website\n",
-    "Request download of model weights from the Llama website\n",
-    "Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download models”. \n",
-    "\n",
-    "Fill  the required information, select the models “Meta Llama 3.1” and accept the terms & conditions. You will receive a URL in your email in a short time."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### 2. Prepare the script\n",
-    "\n",
-    "We will install the Transformers library and Accelerate library for our demo.\n",
-    "\n",
-    "The `Transformers` library provides many models to perform tasks on texts such as classification, question answering, text generation, etc.\n",
-    "The `accelerate` library enables the same PyTorch code to be run across any distributed configuration of GPUs and CPUs.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!pip install transformers\n",
-    "!pip install accelerate"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Next, we will import AutoTokenizer, which is a class from the transformers library that automatically chooses the correct tokenizer for a given pre-trained model, import transformers library and torch for PyTorch.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from transformers import AutoTokenizer\n",
-    "import transformers\n",
-    "import torch"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Then, we will set the model variable to a specific model we’d like to use. In this demo, we will use the 8b chat model `meta-llama/Meta-Llama-3.1-8B-Instruct`. Using Meta models from Hugging Face requires you to\n",
-    "\n",
-    "1. Accept Terms of Service for Meta Llama 3.1 on Meta [website](https://llama.meta.com/llama-downloads).\n",
-    "2. Use the same email address from Step (1) to login into Hugging Face.\n",
-    "\n",
-    "Follow the instructions on this Hugging Face page to login from your [terminal](https://huggingface.co/docs/huggingface_hub/en/quick-start). "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "pip install --upgrade huggingface_hub"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from huggingface_hub import login\n",
-    "login()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "model = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n",
-    "tokenizer = AutoTokenizer.from_pretrained(model)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now, we will use the `from_pretrained` method of `AutoTokenizer` to create a tokenizer. This will download and cache the pre-trained tokenizer and return an instance of the appropriate tokenizer class.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "pipeline = transformers.pipeline(\n",
-    "\"text-generation\",\n",
-    "      model=model,\n",
-    "      torch_dtype=torch.float16,\n",
-    " device_map=\"auto\",\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### 3. Run the example\n",
-    "\n",
-    "Now, let’s create the pipeline for text generation. We’ll also set the device_map argument to `auto`, which means the pipeline will automatically use a GPU if one is available.\n",
-    "\n",
-    "Let’s also generate a text sequence based on the input that we provide. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "sequences = pipeline(\n",
-    "    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\\n',\n",
-    "    do_sample=True,\n",
-    "    top_k=10,\n",
-    "    num_return_sequences=1,\n",
-    "    eos_token_id=tokenizer.eos_token_id,\n",
-    "    truncation = True,\n",
-    "    max_length=400,\n",
-    ")\n",
-    "\n",
-    "for seq in sequences:\n",
-    "    print(f\"Result: {seq['generated_text']}\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "<br>\n",
-    "\n",
-    "### Downloading and converting weights to Hugging Face format"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### 1. Request download of model weights from the Llama website\n",
-    "Request download of model weights from the Llama website\n",
-    "Before you can run the model locally, you will need to get the model weights. To get the model weights, visit the [Llama website](https://llama.meta.com/) and click on “download models”. \n",
-    "\n",
-    "Fill  the required information, select the models \"Meta Llama 3\" and accept the terms & conditions. You will receive a URL in your email in a short time."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### 2. Clone the llama repo and get the weights\n",
-    "Git clone the [Meta Llama 3 repo](https://github.com/meta-llama/llama3). Run the `download.sh` script and follow the instructions. This will download the model checkpoints and tokenizer.\n",
-    "\n",
-    "This example demonstrates a Meta Llama 3.1 model with 8B-instruct parameters, but the steps we follow would be similar for other llama models, as well as for other parameter models."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### 3. Convert the model weights using Hugging Face transformer from source\n",
-    "\n",
-    "* `python3 -m venv hf-convertor`\n",
-    "* `source hf-convertor/bin/activate`\n",
-    "* `git clone https://github.com/huggingface/transformers.git`\n",
-    "* `cd transformers`\n",
-    "* `pip install -e .`\n",
-    "* `pip install torch tiktoken blobfile accelerate`\n",
-    "* `python3 src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ${path_to_meta_downloaded_model} --output_dir ${path_to_save_converted_hf_model} --model_size 8B --llama_version 3.1`"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "\n",
-    "#### 4. Prepare the script\n",
-    "Import the following necessary modules in your script: \n",
-    "* `AutoModel` is the Llama 3 model class\n",
-    "* `AutoTokenizer` prepares your prompt for the model to process\n",
-    "* `pipeline` is an abstraction to generate model outputs"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import torch\n",
-    "import transformers\n",
-    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
-    "\n",
-    "model_dir = \"${path_the_converted_hf_model}\"\n",
-    "model = AutoModelForCausalLM.from_pretrained(\n",
-    "        model_dir,\n",
-    "        device_map=\"auto\",\n",
-    "    )\n",
-    "tokenizer = AutoTokenizer.from_pretrained(model_dir)\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We need a way to use our model for inference. Pipeline allows us to specify which type of task the pipeline needs to run (`text-generation`), specify the model that the pipeline should use to make predictions (`model`), define the precision to use this model (`torch.float16`), device on which the pipeline should run (`device_map`)  among various other options. \n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "pipeline = transformers.pipeline(\n",
-    "    \"text-generation\",\n",
-    "    model=model,\n",
-    "    tokenizer=tokenizer,\n",
-    "    torch_dtype=torch.float16,\n",
-    "    device_map=\"auto\",\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now we have our pipeline defined, and we need to provide some text prompts as inputs to our pipeline to use when it runs to generate responses (`sequences`). The pipeline shown in the example below sets `do_sample` to True, which allows us to specify the decoding strategy we’d like to use to select the next token from the probability distribution over the entire vocabulary. In our example, we are using top_k sampling. \n",
-    "\n",
-    "By changing `max_length`, you can specify how long you’d like the generated response to be. \n",
-    "Setting the `num_return_sequences` parameter to greater than one will let you generate more than one output.\n",
-    "\n",
-    "In your script, add the following to provide input, and information on how to run the pipeline:\n",
-    "\n",
-    "\n",
-    "#### 5. Run the example"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "sequences = pipeline(\n",
-    "    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\\n',\n",
-    "    do_sample=True,\n",
-    "    top_k=10,\n",
-    "    num_return_sequences=1,\n",
-    "    eos_token_id=tokenizer.eos_token_id,\n",
-    "    max_length=400,\n",
-    ")\n",
-    "for seq in sequences:\n",
-    "    print(f\"{seq['generated_text']}\")\n"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.8.10"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
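
For reference, the quickstart path of the deleted notebook above (using already converted weights) condenses into a short standalone script. This is a minimal sketch assuming `transformers` and `accelerate` are installed and that you are logged in to Hugging Face with an account that has access to the gated `meta-llama/Meta-Llama-3.1-8B-Instruct` checkpoint:

```python
# Condensed from the deleted notebook's "already converted weights" flow.
import torch
import transformers
from transformers import AutoTokenizer

model = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",  # use a GPU automatically if one is available
)

sequences = pipeline(
    "I have tomatoes, basil and cheese at home. What can I cook for dinner?\n",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=400,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```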

+ 0 - 166
getting-started/Running_Llama3_Anywhere/Running_Llama_on_Mac_Windows_Linux.ipynb

@@ -1,166 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Running Llama 3 on Mac, Windows or Linux\n",
-    "This notebook goes over how you can set up and run Llama 3.1 locally on a Mac, Windows or Linux using [Ollama](https://ollama.com/)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Steps at a glance:\n",
-    "1. Download and install Ollama.\n",
-    "2. Download and test run Llama 3.1\n",
-    "3. Use local Llama 3.1 via Python.\n",
-    "4. Use local Llama 3.1 via LangChain.\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### 1. Download and install Ollama\n",
-    "\n",
-    "On Mac or Windows, go to the Ollama download page [here](https://ollama.com/download) and select your platform to download it, then double click the downloaded file to install Ollama.\n",
-    "\n",
-    "On Linux, you can simply run on a terminal `curl -fsSL https://ollama.com/install.sh | sh` to download and install Ollama."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### 2. Download and test run Llama 3\n",
-    "\n",
-    "On a terminal or console, run `ollama pull llama3.1` to download the Llama 3.1 8b chat model, in the 4-bit quantized format with size about 4.7 GB.\n",
-    "\n",
-    "Run `ollama pull llama3.1:70b` to download the Llama 3.1 70b chat model, also in the 4-bit quantized format with size 39GB.\n",
-    "\n",
-    "Then you can run `ollama run llama3.1` and ask Llama 3.1 questions such as \"who wrote the book godfather?\" or \"who wrote the book godfather? answer in one sentence.\" You can also try `ollama run llama3.1:70b`, but the inference speed will most likely be too slow - for example, on an Apple M1 Pro with 32GB RAM, it takes over 10 seconds to generate one token using Llama 3.1 70b chat (vs over 10 tokens per second with Llama 3.1 8b chat).\n",
-    "\n",
-    "You can also run the following command to test Llama 3.1 8b chat:\n",
-    "```\n",
-    " curl http://localhost:11434/api/chat -d '{\n",
-    "  \"model\": \"llama3.1\",\n",
-    "  \"messages\": [\n",
-    "    {\n",
-    "      \"role\": \"user\",\n",
-    "      \"content\": \"who wrote the book godfather?\"\n",
-    "    }\n",
-    "  ],\n",
-    "  \"stream\": false\n",
-    "}'\n",
-    "```\n",
-    "\n",
-    "The complete Ollama API doc is [here](https://github.com/ollama/ollama/blob/main/docs/api.md)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### 3. Use local Llama 3.1 via Python\n",
-    "\n",
-    "The Python code below is the port of the curl command above."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import requests\n",
-    "import json\n",
-    "\n",
-    "url = \"http://localhost:11434/api/chat\"\n",
-    "\n",
-    "def llama3(prompt):\n",
-    "    data = {\n",
-    "        \"model\": \"llama3.1\",\n",
-    "        \"messages\": [\n",
-    "            {\n",
-    "              \"role\": \"user\",\n",
-    "              \"content\": prompt\n",
-    "            }\n",
-    "        ],\n",
-    "        \"stream\": False\n",
-    "    }\n",
-    "    \n",
-    "    headers = {\n",
-    "        'Content-Type': 'application/json'\n",
-    "    }\n",
-    "    \n",
-    "    response = requests.post(url, headers=headers, json=data)\n",
-    "    \n",
-    "    return(response.json()['message']['content'])"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "response = llama3(\"who wrote the book godfather\")\n",
-    "print(response)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### 4. Use local Llama 3.1 via LangChain\n",
-    "\n",
-    "Code below use LangChain with Ollama to query Llama 3 running locally. For a more advanced example of using local Llama 3 with LangChain and agent-powered RAG, see [this](https://github.com/langchain-ai/langgraph/blob/main/examples/rag/langgraph_rag_agent_llama3_local.ipynb)."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!pip install langchain"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from langchain_community.chat_models import ChatOllama\n",
-    "\n",
-    "llm = ChatOllama(model=\"llama3.1\", temperature=0)\n",
-    "response = llm.invoke(\"who wrote the book godfather?\")\n",
-    "print(response.content)\n"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.11.9"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
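
The deleted notebook's Python port above requests a complete response (`"stream": false`). As a complementary sketch, the same local endpoint can be consumed incrementally; this assumes the newline-delimited JSON streaming format described in the Ollama API doc linked above, and a local Ollama server with `llama3.1` already pulled:

```python
import json
import requests

def llama3_stream(prompt, model="llama3.1", url="http://localhost:11434/api/chat"):
    """Stream a chat response chunk-by-chunk from a local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # ask Ollama for newline-delimited JSON chunks
    }
    with requests.post(url, json=payload, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            # each chunk carries a partial assistant message until "done" is true
            print(chunk.get("message", {}).get("content", ""), end="", flush=True)
            if chunk.get("done"):
                print()
                break

llama3_stream("who wrote the book godfather? answer in one sentence.")
```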

+ 0 - 39
getting-started/inference/code_llama/README.md

@@ -1,39 +0,0 @@
-# Code Llama
-
-Code Llama was released in three flavors: a base model that supports multiple programming languages, a Python fine-tuned model, and an instruction fine-tuned and aligned variation of Code Llama; please read more [here](https://ai.meta.com/blog/code-llama-large-language-model-coding/). Also note that the Python fine-tuned model and the 34B models are not trained on the infilling objective, hence they cannot be used for the infilling use case.
-
-Below you can find the scripts to run Code Llama, with two examples of code completion and infilling.
-
-**Note** Please find the right model on HF [here](https://huggingface.co/models?search=meta-llama%20codellama). 
-
-Make sure to install Transformers from source for now
-
-```bash
-
-pip install git+https://github.com/huggingface/transformers
-
-```
-
-To run the code completion example:
-
-```bash
-
-python code_completion_example.py --model_name MODEL_NAME  --prompt_file code_completion_prompt.txt --temperature 0.2 --top_p 0.9
-
-```
-
-To run the code infilling example:
-
-```bash
-
-python code_infilling_example.py --model_name MODEL_NAME --prompt_file code_infilling_prompt.txt --temperature 0.2 --top_p 0.9
-
-```
-To run the 70B Instruct model example run the following (you'll need to enter the system and user prompts to instruct the model):
-
-```bash
-
-python code_instruct_example.py --model_name codellama/CodeLlama-70b-Instruct-hf --temperature 0.2 --top_p 0.9
-
-```
-You can learn more about the chat prompt template [on HF](https://huggingface.co/meta-llama/CodeLlama-70b-Instruct-hf#chat-prompt) and [original Code Llama repository](https://github.com/meta-llama/codellama/blob/main/README.md#fine-tuned-instruction-models). HF tokenizer has already taken care of the chat template as shown in this example. 
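
The infilling example referenced above can also be run with plain `transformers`, without the repo's helper utilities. A minimal sketch, assuming the fast Code Llama tokenizer's `<FILL_ME>` handling and using the `codellama/CodeLlama-7b-hf` base checkpoint as an example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"  # base model; Python and 34B variants do not support infilling
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Same prompt as code_infilling_prompt.txt; the fast tokenizer splits it at <FILL_ME>.
prompt = '''def remove_non_ascii(s: str) -> str:
    """ <FILL_ME>
    return result
'''

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.2, top_p=0.9)

# Decode only the newly generated tokens and splice them back into the prompt.
filling = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(prompt.replace("<FILL_ME>", filling))
```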

+ 0 - 119
getting-started/inference/code_llama/code_completion_example.py

@@ -1,119 +0,0 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
-
-# from accelerate import init_empty_weights, load_checkpoint_and_dispatch
-
-import fire
-import os
-import sys
-import time
-
-import torch
-from transformers import AutoTokenizer
-
-from llama_recipes.inference.safety_utils import get_safety_checker
-from llama_recipes.inference.model_utils import load_model, load_peft_model
-
-
-def main(
-    model_name,
-    peft_model: str=None,
-    quantization: bool=False,
-    max_new_tokens =100, #The maximum number of tokens to generate
-    prompt_file: str=None,
-    seed: int=42, #seed value for reproducibility
-    do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise.
-    min_length: int=None, #The minimum length of the sequence to be generated, input prompt + min_new_tokens
-    use_cache: bool=True,  #[optional] Whether or not the model should use the past key/values attentions (if applicable to the model) to speed up decoding.
-    top_p: float=0.9, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
-    temperature: float=0.6, # [optional] The value used to modulate the next token probabilities.
-    top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
-    repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
-    length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation. 
-    enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
-    enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
-    enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
-    enable_llamaguard_content_safety: bool=False, # Enable safety check with Llama-Guard
-    use_fast_kernels: bool = True, # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformers memory-efficient kernels
-    **kwargs
-):
-    if prompt_file is not None:
-        assert os.path.exists(
-            prompt_file
-        ), f"Provided Prompt file does not exist {prompt_file}"
-        with open(prompt_file, "r") as f:
-            user_prompt = f.read()
-    else:
-        print("No user prompt provided. Exiting.")
-        sys.exit(1)
-    
-    # Set the seeds for reproducibility
-    torch.cuda.manual_seed(seed)
-    torch.manual_seed(seed)
-    
-    model = load_model(model_name, quantization, use_fast_kernels)
-    if peft_model:
-        model = load_peft_model(model, peft_model)
-
-    model.eval()
-    
-    tokenizer = AutoTokenizer.from_pretrained(model_name)
-    safety_checker = get_safety_checker(enable_azure_content_safety,
-                                        enable_sensitive_topics,
-                                        enable_salesforce_content_safety,
-                                        enable_llamaguard_content_safety,
-                                        )
-
-    # Safety check of the user prompt
-    safety_results = [check(user_prompt) for check in safety_checker]
-    are_safe = all([r[1] for r in safety_results])
-    if are_safe:
-        print("User prompt deemed safe.")
-        print(f"User prompt:\n{user_prompt}")
-    else:
-        print("User prompt deemed unsafe.")
-        for method, is_safe, report in safety_results:
-            if not is_safe:
-                print(method)
-                print(report)
-        print("Skipping the inference as the prompt is not safe.")
-        sys.exit(1)  # Exit the program with an error status
-        
-    batch = tokenizer(user_prompt, return_tensors="pt")
-
-    batch = {k: v.to("cuda") for k, v in batch.items()}
-    start = time.perf_counter()
-    with torch.no_grad():
-        outputs = model.generate(
-            **batch,
-            max_new_tokens=max_new_tokens,
-            do_sample=do_sample,
-            top_p=top_p,
-            temperature=temperature,
-            min_length=min_length,
-            use_cache=use_cache,
-            top_k=top_k,
-            repetition_penalty=repetition_penalty,
-            length_penalty=length_penalty,
-            **kwargs 
-        )
-    e2e_inference_time = (time.perf_counter()-start)*1000
-    print(f"the inference time is {e2e_inference_time} ms")
-    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
-    
-    # Safety check of the model output
-    safety_results = [check(output_text) for check in safety_checker]
-    are_safe = all([r[1] for r in safety_results])
-    if are_safe:
-        print("User input and model output deemed safe.")
-        print(f"Model output:\n{output_text}")
-    else:
-        print("Model output deemed unsafe.")
-        for method, is_safe, report in safety_results:
-            if not is_safe:
-                print(method)
-                print(report)
-                
-
-if __name__ == "__main__":
-    fire.Fire(main)

+ 0 - 7
getting-started/inference/code_llama/code_completion_prompt.txt

@@ -1,7 +0,0 @@
-import argparse
-
-def main(string: str):
-    print(string)
-    print(string[::-1])
-    
-if __name__ == "__main__":

+ 0 - 118
getting-started/inference/code_llama/code_infilling_example.py

@@ -1,118 +0,0 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
-
-# from accelerate import init_empty_weights, load_checkpoint_and_dispatch
-
-import fire
-import torch
-import os
-import sys
-import time
-
-from transformers import AutoTokenizer
-
-from llama_recipes.inference.safety_utils import get_safety_checker
-from llama_recipes.inference.model_utils import load_model, load_peft_model
-
-def main(
-    model_name,
-    peft_model: str=None,
-    quantization: bool=False,
-    max_new_tokens =100, #The maximum number of tokens to generate
-    prompt_file: str=None,
-    seed: int=42, #seed value for reproducibility
-    do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise.
-    min_length: int=None, #The minimum length of the sequence to be generated, input prompt + min_new_tokens
-    use_cache: bool=True,  #[optional] Whether or not the model should use the past key/values attentions (if applicable to the model) to speed up decoding.
-    top_p: float=0.9, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
-    temperature: float=0.6, # [optional] The value used to modulate the next token probabilities.
-    top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
-    repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
-    length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation. 
-    enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
-    enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
-    enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
-    enable_llamaguard_content_safety: bool=False, # Enable safety check with Llama-Guard
-    use_fast_kernels: bool = True, # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformers memory-efficient kernels
-    **kwargs
-):
-    if prompt_file is not None:
-        assert os.path.exists(
-            prompt_file
-        ), f"Provided Prompt file does not exist {prompt_file}"
-        with open(prompt_file, "r") as f:
-            user_prompt = f.read()
-    else:
-        print("No user prompt provided. Exiting.")
-        sys.exit(1)
-    # Set the seeds for reproducibility
-    torch.cuda.manual_seed(seed)
-    torch.manual_seed(seed)
-    
-    model = load_model(model_name, quantization, use_fast_kernels)
-    model.config.tp_size=1
-    if peft_model:
-        model = load_peft_model(model, peft_model)
-
-    model.eval()
-   
-    tokenizer = AutoTokenizer.from_pretrained(model_name)
-    
-    safety_checker = get_safety_checker(enable_azure_content_safety,
-                                        enable_sensitive_topics,
-                                        enable_salesforce_content_safety,
-                                        enable_llamaguard_content_safety,
-                                        )
-
-    # Safety check of the user prompt
-    safety_results = [check(user_prompt) for check in safety_checker]
-    are_safe = all([r[1] for r in safety_results])
-    if are_safe:
-        print("User prompt deemed safe.")
-        print(f"User prompt:\n{user_prompt}")
-    else:
-        print("User prompt deemed unsafe.")
-        for method, is_safe, report in safety_results:
-            if not is_safe:
-                print(method)
-                print(report)
-        print("Skipping the inference as the prompt is not safe.")
-        sys.exit(1)  # Exit the program with an error status
-        
-    batch = tokenizer(user_prompt, return_tensors="pt")
-    batch = {k: v.to("cuda") for k, v in batch.items()}
-    
-    start = time.perf_counter()
-    with torch.no_grad():
-        outputs = model.generate(
-            **batch,
-            max_new_tokens=max_new_tokens,
-            do_sample=do_sample,
-            top_p=top_p,
-            temperature=temperature,
-            min_length=min_length,
-            use_cache=use_cache,
-            top_k=top_k,
-            repetition_penalty=repetition_penalty,
-            length_penalty=length_penalty,
-            **kwargs 
-        )
-    e2e_inference_time = (time.perf_counter()-start)*1000
-    print(f"the inference time is {e2e_inference_time} ms")
-    filling = tokenizer.batch_decode(outputs[:, batch["input_ids"].shape[1]:], skip_special_tokens=True)[0]
-    # Safety check of the model output
-    safety_results = [check(filling) for check in safety_checker]
-    are_safe = all([r[1] for r in safety_results])
-    if are_safe:
-        print("User input and model output deemed safe.")
-        print(user_prompt.replace("<FILL_ME>", filling))
-    else:
-        print("Model output deemed unsafe.")
-        for method, is_safe, report in safety_results:
-            if not is_safe:
-                print(method)
-                print(report)
-                
-
-if __name__ == "__main__":
-    fire.Fire(main)

+ 0 - 3
getting-started/inference/code_llama/code_infilling_prompt.txt

@@ -1,3 +0,0 @@
-def remove_non_ascii(s: str) -> str:
-    """ <FILL_ME>
-    return result

+ 0 - 143
getting-started/inference/code_llama/code_instruct_example.py

@@ -1,143 +0,0 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
-
-import fire
-import os
-import sys
-import time
-
-import torch
-from transformers import AutoTokenizer
-
-from llama_recipes.inference.safety_utils import get_safety_checker
-from llama_recipes.inference.model_utils import load_model, load_peft_model
-
-
-def handle_safety_check(are_safe_user_prompt, user_prompt, safety_results_user_prompt, are_safe_system_prompt, system_prompt, safety_results_system_prompt):
-    """
-    Handles the output based on the safety check of both user and system prompts.
-
-    Parameters:
-    - are_safe_user_prompt (bool): Indicates whether the user prompt is safe.
-    - user_prompt (str): The user prompt that was checked for safety.
-    - safety_results_user_prompt (list of tuples): A list of tuples for the user prompt containing the method, safety status, and safety report.
-    - are_safe_system_prompt (bool): Indicates whether the system prompt is safe.
-    - system_prompt (str): The system prompt that was checked for safety.
-    - safety_results_system_prompt (list of tuples): A list of tuples for the system prompt containing the method, safety status, and safety report.
-    """
-    def print_safety_results(are_safe_prompt, prompt, safety_results, prompt_type="User"):
-        """
-        Prints the safety results for a prompt.
-
-        Parameters:
-        - are_safe_prompt (bool): Indicates whether the prompt is safe.
-        - prompt (str): The prompt that was checked for safety.
-        - safety_results (list of tuples): A list of tuples containing the method, safety status, and safety report.
-        - prompt_type (str): The type of prompt (User/System).
-        """
-        if are_safe_prompt:
-            print(f"{prompt_type} prompt deemed safe.")
-            print(f"{prompt_type} prompt:\n{prompt}")
-        else:
-            print(f"{prompt_type} prompt deemed unsafe.")
-            for method, is_safe, report in safety_results:
-                if not is_safe:
-                    print(method)
-                    print(report)
-            print(f"Skipping the inference as the {prompt_type.lower()} prompt is not safe.")
-            sys.exit(1)
-
-    # Check user prompt
-    print_safety_results(are_safe_user_prompt, user_prompt, safety_results_user_prompt, "User")
-    
-    # Check system prompt
-    print_safety_results(are_safe_system_prompt, system_prompt, safety_results_system_prompt, "System")
-
-def main(
-    model_name,
-    peft_model: str=None,
-    quantization: bool=False,
-    max_new_tokens =100, #The maximum number of tokens to generate
-    seed: int=42, #seed value for reproducibility
-    do_sample: bool=True, #Whether or not to use sampling ; use greedy decoding otherwise.
-    min_length: int=None, #The minimum length of the sequence to be generated, input prompt + min_new_tokens
-    use_cache: bool=False,  #[optional] Whether or not the model should use the past key/values attentions (if applicable to the model) to speed up decoding.
-    top_p: float=0.9, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
-    temperature: float=0.6, # [optional] The value used to modulate the next token probabilities.
-    top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
-    repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
-    length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation. 
-    enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
-    enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
-    enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
-    enable_llamaguard_content_safety: bool=False, # Enable safety check with Llama-Guard
-    use_fast_kernels: bool = True, # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformers memory-efficient kernels
-    **kwargs
-):
-    system_prompt = input("Please insert your system prompt: ")
-    user_prompt = input("Please insert your prompt: ")
-    chat = [
-   {"role": "system", "content": system_prompt},
-   {"role": "user", "content": user_prompt},
-    ]       
-    # Set the seeds for reproducibility
-    torch.cuda.manual_seed(seed)
-    torch.manual_seed(seed)
-    
-    model = load_model(model_name, quantization, use_fast_kernels)
-    if peft_model:
-        model = load_peft_model(model, peft_model)
-
-    model.eval()
-        
-    tokenizer = AutoTokenizer.from_pretrained(model_name)
-    safety_checker = get_safety_checker(enable_azure_content_safety,
-                                        enable_sensitive_topics,
-                                        enable_salesforce_content_safety,
-                                        enable_llamaguard_content_safety,
-                                        )
-
-    # Safety check of the user prompt
-    safety_results_user_prompt = [check(user_prompt) for check in safety_checker]
-    safety_results_system_prompt = [check(system_prompt) for check in safety_checker]
-    are_safe_user_prompt = all([r[1] for r in safety_results_user_prompt])
-    are_safe_system_prompt = all([r[1] for r in safety_results_system_prompt])
-    handle_safety_check(are_safe_user_prompt, user_prompt, safety_results_user_prompt, are_safe_system_prompt, system_prompt, safety_results_system_prompt)
-        
-    inputs = tokenizer.apply_chat_template(chat, return_tensors="pt").to("cuda")
-
-    start = time.perf_counter()
-    with torch.no_grad():
-        outputs = model.generate(
-            input_ids=inputs,
-            max_new_tokens=max_new_tokens,
-            do_sample=do_sample,
-            top_p=top_p,
-            temperature=temperature,
-            min_length=min_length,
-            use_cache=use_cache,
-            top_k=top_k,
-            repetition_penalty=repetition_penalty,
-            length_penalty=length_penalty,
-            **kwargs 
-        )
-    e2e_inference_time = (time.perf_counter()-start)*1000
-    print(f"the inference time is {e2e_inference_time} ms")
-    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
-    
-    # Safety check of the model output
-    safety_results = [check(output_text) for check in safety_checker]
-    are_safe = all([r[1] for r in safety_results])
-    if are_safe:
-        print("User input and model output deemed safe.")
-        print(f"Model output:\n{output_text}")
-    else:
-        print("Model output deemed unsafe.")
-        for method, is_safe, report in safety_results:
-            if not is_safe:
-                print(method)
-                print(report)
-                
-
-if __name__ == "__main__":
-    fire.Fire(main)

+ 0 - 51
getting-started/inference/modelUpgradeExample.py

@@ -1,51 +0,0 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
-
-# Running the script without any arguments "python modelUpgradeExample.py" performs inference with the Llama 3 8B Instruct model. 
-# Passing  --model-id "meta-llama/Meta-Llama-3.1-8B-Instruct" to the script will switch it to using the Llama 3.1 version of the same model. 
-# The script also shows the input tokens to confirm that the models are responding to the same input
-
-import fire
-from transformers import AutoTokenizer, AutoModelForCausalLM
-import torch
-
-def main(model_id = "meta-llama/Meta-Llama-3-8B-Instruct"):
-    tokenizer = AutoTokenizer.from_pretrained(model_id)
-    model = AutoModelForCausalLM.from_pretrained(
-        model_id,
-        torch_dtype=torch.bfloat16,
-        device_map="auto",
-    )
-
-    messages = [
-        {"role": "system", "content": "You are a helpful chatbot"},
-        {"role": "user", "content": "Why is the sky blue?"},
-        {"role": "assistant", "content": "Because the light is scattered"},
-        {"role": "user", "content": "Please tell me more about that"},
-    ]
-
-    input_ids = tokenizer.apply_chat_template(
-        messages,
-        add_generation_prompt=True,
-        return_tensors="pt",
-    ).to(model.device)
-
-    print("Input tokens:")
-    print(input_ids)
-    
-    attention_mask = torch.ones_like(input_ids)
-    outputs = model.generate(
-        input_ids,
-        max_new_tokens=400,
-        eos_token_id=tokenizer.eos_token_id,
-        do_sample=True,
-        temperature=0.6,
-        top_p=0.9,
-        attention_mask=attention_mask,
-    )
-    response = outputs[0][input_ids.shape[-1]:]
-    print("\nOutput:\n")
-    print(tokenizer.decode(response, skip_special_tokens=True))
-
-if __name__ == "__main__":
-  fire.Fire(main)

+ 0 - 11
recipes/README.md

@@ -1,11 +0,0 @@
-## Llama-Recipes
-
-This folder contains examples organized by topic:
-
-| Subfolder | Description |
-|---|---|
-[quickstart](./quickstart)|The "Hello World" of using Llama, start here if you are new to using Llama
-[use_cases](./use_cases)|Scripts showing common applications of Llama
-[3p_integrations](./3p_integrations)|Partner-owned folder showing Llama usage along with third-party tools
-[responsible_ai](./responsible_ai)|Scripts to use PurpleLlama for safeguarding model outputs
-[experimental](./experimental)| Llama implementations of experimental LLM techniques