File overview

[OctoAI model provider] Llama3 update (#494)

Hamid Shojanazeri 11 months ago
parent
commit
b2eec4f9b0

+ 89 - 109
recipes/llama_api_providers/OctoAI_API_examples/Getting_to_know_Llama.ipynb

@@ -6,8 +6,43 @@
     "id": "LERqQn5v8-ak"
    },
    "source": [
-    "# **Getting to know Llama 2: Everything you need to start building**\n",
-    "Our goal in this session is to provide a guided tour of Llama 2, including understanding different Llama 2 models, how and where to access them, Generative AI and Chatbot architectures, prompt engineering, RAG (Retrieval Augmented Generation), Fine-tuning and more. All this is implemented with a starter code for you to take it and use it in your Llama 2 projects."
+    "# **Getting to know Llama 3: Everything you need to start building**\n",
+    "Our goal in this session is to provide a guided tour of Llama 3, including understanding different Llama 3 models, how and where to access them, Generative AI and Chatbot architectures, prompt engineering, RAG (Retrieval Augmented Generation), Fine-tuning and more. All this is implemented with a starter code for you to take it and use it in your Llama 3 projects."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "h3YGMDJidHtH"
+   },
+   "source": [
+    "### **Install dependencies**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "VhN6hXwx7FCp"
+   },
+   "outputs": [],
+   "source": [
+    "# Install dependencies and initialize\n",
+    "%pip install \\\n",
+    "    langchain==0.1.19 \\\n",
+    "    matplotlib \\\n",
+    "    octoai-sdk==0.10.1 \\\n",
+    "    openai \\\n",
+    "    sentence_transformers \\\n",
+    "    pdf2image \\\n",
+    "    pdfminer \\\n",
+    "    pdfminer.six \\\n",
+    "    unstructured \\\n",
+    "    faiss-cpu \\\n",
+    "    pillow-heif \\\n",
+    "    opencv-python \\\n",
+    "    unstructured-inference \\\n",
+    "    pikepdf"
    ]
   },
   {
@@ -58,7 +93,7 @@
     "    A[Users] --> B(Applications e.g. mobile, web)\n",
     "    B --> |Hosted API|C(Platforms e.g. Custom, OctoAI, HuggingFace, Replicate)\n",
     "    B -- optional --> E(Frameworks e.g. LangChain)\n",
-    "    C-->|User Input|D[Llama 2]\n",
+    "    C-->|User Input|D[Llama 3]\n",
     "    D-->|Model Output|C\n",
     "    E --> C\n",
     "    classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;\n",
@@ -69,19 +104,15 @@
     "  flowchart TD\n",
     "    A[User Prompts] --> B(Frameworks e.g. LangChain)\n",
     "    B <--> |Database, Docs, XLS|C[fa:fa-database External Data]\n",
-    "    B -->|API|D[Llama 2]\n",
+    "    B -->|API|D[Llama 3]\n",
     "    classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;\n",
     "  \"\"\")\n",
     "\n",
-    "def llama2_family():\n",
+    "def llama3_family():\n",
     "  mm(\"\"\"\n",
     "  graph LR;\n",
-    "      llama-2 --> llama-2-7b\n",
-    "      llama-2 --> llama-2-13b\n",
-    "      llama-2 --> llama-2-70b\n",
-    "      llama-2-7b --> llama-2-7b-chat\n",
-    "      llama-2-13b --> llama-2-13b-chat\n",
-    "      llama-2-70b --> llama-2-70b-chat\n",
+    "      llama-3 --> llama-3-8b-instruct\n",
+    "      llama-3 --> llama-3-70b-instruct\n",
     "      classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;\n",
     "  \"\"\")\n",
     "\n",
@@ -91,7 +122,7 @@
     "    users --> apps\n",
     "    apps --> frameworks\n",
     "    frameworks --> platforms\n",
-    "    platforms --> Llama 2\n",
+    "    platforms --> Llama 3\n",
     "    classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;\n",
     "  \"\"\")\n",
     "\n",
@@ -115,8 +146,8 @@
     "  user --> prompt\n",
     "  prompt --> i_safety\n",
     "  i_safety --> context\n",
-    "  context --> Llama_2\n",
-    "  Llama_2 --> output\n",
+    "  context --> Llama_3\n",
+    "  Llama_3 --> output\n",
     "  output --> o_safety\n",
     "  i_safety --> memory\n",
     "  o_safety --> memory\n",
@@ -165,7 +196,7 @@
     "id": "i4Np_l_KtIno"
    },
    "source": [
-    "##**1 - Understanding Llama 2**"
+    "##**1 - Understanding Llama 3**"
    ]
   },
   {
@@ -174,14 +205,13 @@
     "id": "PGPSI3M5PGTi"
    },
    "source": [
-    "### **1.1 - What is Llama 2?**\n",
+    "### **1.1 - What is Llama 3?**\n",
     "\n",
     "* State of the art (SOTA), Open Source LLM\n",
-    "* 7B, 13B, 70B\n",
+    "* Llama 3 8B, 70B\n",
     "* Pretrained + Chat\n",
     "* Choosing model: Size, Quality, Cost, Speed\n",
-    "* [Research paper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)\n",
-    "\n",
+    "* [Llama 3 blog](https://ai.meta.com/blog/meta-llama-3/)\n",
     "* [Responsible use guide](https://ai.meta.com/llama/responsible-use-guide/)"
    ]
   },
@@ -208,7 +238,7 @@
    },
    "outputs": [],
    "source": [
-    "llama2_family()"
+    "llama3_family()"
    ]
   },
   {
@@ -217,11 +247,10 @@
     "id": "aYeHVVh45bdT"
    },
    "source": [
-    "###**1.2 - Accessing Llama 2**\n",
+    "###**1.2 - Accessing Llama 3**\n",
     "* Download + Self Host (on-premise)\n",
     "* Hosted API Platform (e.g. [OctoAI](https://octoai.cloud/), [Replicate](https://replicate.com/meta))\n",
-    "* Hosted Container Platform (e.g. [Azure](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/introducing-llama-2-on-azure/ba-p/3881233), [AWS](https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/), [GCP](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/139))\n",
-    "\n"
+    "* Hosted Container Platform (e.g. [Azure](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/introducing-llama-2-on-azure/ba-p/3881233), [AWS](https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/), [GCP](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/139))"
    ]
   },
   {
@@ -230,7 +259,7 @@
     "id": "kBuSay8vtzL4"
    },
    "source": [
-    "### **1.3 - Use Cases of Llama 2**\n",
+    "### **1.3 - Use Cases of Llama 3**\n",
     "* Content Generation\n",
     "* Chatbots\n",
     "* Summarization\n",
@@ -245,42 +274,9 @@
     "id": "sd54g0OHuqBY"
    },
    "source": [
-    "##**2 - Using Llama 2**\n",
+    "##**2 - Using Llama 3**\n",
     "\n",
-    "In this notebook, we are going to access [Llama 13b chat model](https://octoai.cloud/tools/text/chat?mode=demo&model=llama-2-13b-chat-fp16) using hosted API from OctoAI."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "id": "h3YGMDJidHtH"
-   },
-   "source": [
-    "### **2.1 - Install dependencies**"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "id": "VhN6hXwx7FCp"
-   },
-   "outputs": [],
-   "source": [
-    "# Install dependencies and initialize\n",
-    "%pip install -qU \\\n",
-    "    octoai-sdk \\\n",
-    "    langchain \\\n",
-    "    sentence_transformers \\\n",
-    "    pdf2image \\\n",
-    "    pdfminer \\\n",
-    "    pdfminer.six \\\n",
-    "    unstructured \\\n",
-    "    faiss-cpu \\\n",
-    "    pillow-heif \\\n",
-    "    opencv-python \\\n",
-    "    unstructured-inference \\\n",
-    "    pikepdf"
+    "In this notebook, we are going to access [Llama 3 8b instruct model](https://octoai.cloud/text/chat?model=meta-llama-3-8b-instruct&mode=api) using hosted API from OctoAI."
    ]
   },
   {
@@ -292,9 +288,9 @@
    "outputs": [],
    "source": [
     "# model on OctoAI platform that we will use for inferencing\n",
-    "# We will use llama 13b chat model hosted on OctoAI server ()\n",
+    "# We will use llama 3 8b instruct model hosted on OctoAI server\n",
     "\n",
-    "llama2_13b = \"llama-2-13b-chat-fp16\""
+    "llama3_8b = \"meta-llama-3-8b-instruct\""
    ]
   },
   {
@@ -326,21 +322,21 @@
    },
    "outputs": [],
    "source": [
-    "# we will use OctoAI's hosted API\n",
-    "from octoai.client import Client\n",
+    "# We will use OpenAI's APIs to talk to OctoAI's hosted model endpoint\n",
+    "from openai import OpenAI\n",
     "\n",
-    "client = Client(OCTOAI_API_TOKEN)\n",
+    "client = OpenAI(\n",
+    "   base_url = \"https://text.octoai.run/v1\",\n",
+    "   api_key = os.environ[\"OCTOAI_API_TOKEN\"]\n",
+    ")\n",
     "\n",
     "# text completion with input prompt\n",
     "def Completion(prompt):\n",
     "    output = client.chat.completions.create(\n",
     "        messages=[\n",
-    "            {\n",
-    "                \"role\": \"user\",\n",
-    "                \"content\": prompt\n",
-    "            }\n",
+    "            {\"role\": \"user\", \"content\": prompt}\n",
     "        ],\n",
-    "        model=\"llama-2-13b-chat-fp16\",\n",
+    "        model=llama3_8b,\n",
     "        max_tokens=1000\n",
     "    )\n",
     "    return output.choices[0].message.content\n",
@@ -349,16 +345,10 @@
     "def ChatCompletion(prompt, system_prompt=None):\n",
     "    output = client.chat.completions.create(\n",
     "        messages=[\n",
-    "            {\n",
-    "                \"role\": \"system\",\n",
-    "                \"content\": system_prompt\n",
-    "            },\n",
-    "            {\n",
-    "                \"role\": \"user\",\n",
-    "                \"content\": prompt\n",
-    "            }\n",
+    "            {\"role\": \"system\", \"content\": system_prompt},\n",
+    "            {\"role\": \"user\", \"content\": prompt}\n",
     "        ],\n",
-    "        model=\"llama-2-13b-chat-fp16\",\n",
+    "        model=llama3_8b,\n",
     "        max_tokens=1000\n",
     "    )\n",
     "    return output.choices[0].message.content"
@@ -370,7 +360,7 @@
     "id": "5Jxq0pmf6L73"
    },
    "source": [
-    "### **2.2 - Basic completion**"
+    "# **2.1 - Basic completion**"
    ]
   },
   {
@@ -391,7 +381,7 @@
     "id": "StccjUDh6W0Q"
    },
    "source": [
-    "### **2.3 - System prompts**\n"
+    "## **2.2 - System prompts**\n"
    ]
   },
   {
@@ -415,7 +405,7 @@
     "id": "Hp4GNa066pYy"
    },
    "source": [
-    "### **2.4 - Response formats**\n",
+    "### **2.3 - Response formats**\n",
     "* Can support different formatted outputs e.g. text, JSON, etc."
    ]
   },
@@ -483,7 +473,7 @@
     "\n",
     "* User Prompts\n",
     "* Input Safety\n",
-    "* Llama 2\n",
+    "* Llama 3\n",
     "* Output Safety\n",
     "\n",
     "* Memory & Context"
@@ -743,12 +733,9 @@
     "### **4.3 - Retrieval Augmented Generation (RAG)**\n",
     "* Prompt Eng Limitations - Knowledge cutoff & lack of specialized data\n",
     "\n",
-    "* Retrieval Augmented Generation(RAG) allows us to retrieve snippets of information from external data sources and augment it to the user's prompt to get tailored responses from Llama 2.\n",
-    "\n",
-    "For our demo, we are going to download an external PDF file from a URL and query against the content in the pdf file to get contextually relevant information back with the help of Llama!\n",
+    "* Retrieval Augmented Generation(RAG) allows us to retrieve snippets of information from external data sources and augment it to the user's prompt to get tailored responses from Llama 3.\n",
     "\n",
-    "\n",
-    "\n"
+    "For our demo, we are going to download an external PDF file from a URL and query against the content in the pdf file to get contextually relevant information back with the help of Llama!"
    ]
   },
   {
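The RAG description above pairs with the LangChain model setup shown in the next hunk. As a rough orientation, a minimal end-to-end sketch might look like the following; the PDF URL is a placeholder, `llama_model` is the OctoAIEndpoint created in the cell below, and the loader and splitter choices are assumptions consistent with the packages installed at the top of the notebook.

```python
# Minimal RAG sketch (assumptions: placeholder PDF URL; llama_model is the
# OctoAIEndpoint created in the next cell; langchain 0.1.x package layout).
from langchain.document_loaders import OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OctoAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Load and split the external document
docs = OnlinePDFLoader("https://example.com/some-document.pdf").load()  # placeholder URL
splits = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the splits with OctoAI embeddings and store them in FAISS
embeddings = OctoAIEmbeddings(endpoint_url="https://text.octoai.run/v1/embeddings")
vectordb = FAISS.from_documents(splits, embeddings)

# Retrieve relevant chunks and let Llama 3 answer with that context
qa_chain = RetrievalQA.from_chain_type(llm=llama_model, retriever=vectordb.as_retriever())
print(qa_chain({"query": "What is this document about?"})["result"])
```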
@@ -797,24 +784,16 @@
    "source": [
     "# langchain setup\n",
     "from langchain.llms.octoai_endpoint import OctoAIEndpoint\n",
-    "# Use the Llama 2 model hosted on OctoAI\n",
-    "# Temperature: Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic, 0.75 is a good starting value\n",
+    "\n",
+    "# Use the Llama 3 model hosted on OctoAI\n",
+    "# max_tokens: Maximum number of tokens to generate. A word is generally 2-3 tokens\n",
+    "# temperature: Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic, 0.75 is a good starting value\n",
     "# top_p: When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens\n",
-    "# max_new_tokens: Maximum number of tokens to generate. A word is generally 2-3 tokens\n",
     "llama_model = OctoAIEndpoint(\n",
-    "    endpoint_url=\"https://text.octoai.run/v1/chat/completions\",\n",
-    "    model_kwargs={\n",
-    "        \"model\": llama2_13b,\n",
-    "        \"messages\": [\n",
-    "            {\n",
-    "                \"role\": \"system\",\n",
-    "                \"content\": \"You are a helpful, respectful and honest assistant.\"\n",
-    "            }\n",
-    "        ],\n",
-    "        \"max_tokens\": 1000,\n",
-    "        \"top_p\": 1,\n",
-    "        \"temperature\": 0.75\n",
-    "    },\n",
+    "    model=llama3_8b,\n",
+    "    max_tokens=1000,\n",
+    "    temperature=0.75,\n",
+    "    top_p=1\n",
     ")"
    ]
   },
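A quick smoke test of the endpoint configured above could look like this; the question is illustrative, and `.invoke` is the standard LangChain LLM call in 0.1.x.

```python
# Hedged smoke test: OctoAIEndpoint is a LangChain LLM, so .invoke takes a plain string.
print(llama_model.invoke("In one sentence, what is retrieval augmented generation?"))
```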
@@ -973,10 +952,11 @@
    },
    "source": [
     "#### **Resources**\n",
-    "- [GitHub - Llama 2](https://github.com/facebookresearch/llama)\n",
-    "- [Github - LLama 2 Recipes](https://github.com/facebookresearch/llama-recipes)\n",
-    "- [Llama 2](https://ai.meta.com/llama/)\n",
-    "- [Research Paper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)\n",
+    "- [GitHub - Llama](https://github.com/facebookresearch/llama)\n",
+    "- [Github - LLama Recipes](https://github.com/facebookresearch/llama-recipes)\n",
+    "- [Llama](https://ai.meta.com/llama/)\n",
+    "- [Research Paper on Llama 2](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)\n",
+    "- [Llama 3 Page](https://ai.meta.com/blog/meta-llama-3/)\n",
     "- [Model Card](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)\n",
     "- [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/)\n",
     "- [Acceptable Use Policy](https://ai.meta.com/llama/use-policy/)\n",
@@ -992,9 +972,9 @@
    "source": [
     "#### **Authors & Contact**\n",
     "  * asangani@meta.com, [Amit Sangani | LinkedIn](https://www.linkedin.com/in/amitsangani/)\n",
-    "  * mohsena@meta.com, [Mohsen Agsen | LinkedIn](https://www.linkedin.com/in/mohsen-agsen-62a9791/)\n",
+    "  * mohsena@meta.com, [Mohsen Agsen | LinkedIn](https://www.linkedin.com/in/dr-thierry-moreau/)\n",
     "\n",
-    "Adapted to run on OctoAI by Thierry Moreau - tmoreau@octo.ai"
+    "Adapted to run on OctoAI and use Llama 3 by tmoreau@octo.ai [Thierry Moreay | LinkedIn]()"
    ]
   }
  ],

+ 24 - 34
recipes/llama_api_providers/OctoAI_API_examples/HelloLlamaCloud.ipynb

@@ -6,13 +6,12 @@
    "metadata": {},
    "source": [
     "## This demo app shows:\n",
-    "* How to run Llama2 in the cloud hosted on OctoAI\n",
+    "* How to run Llama 3 in the cloud hosted on OctoAI\n",
     "* How to use LangChain to ask Llama general questions and follow up questions\n",
-    "* How to use LangChain to load a recent PDF doc - the Llama2 paper pdf - and chat about it. This is the well known RAG (Retrieval Augmented Generation) method to let LLM such as Llama2 be able to answer questions about the data not publicly available when Llama2 was trained, or about your own data. RAG is one way to prevent LLM's hallucination\n",
-    "* You should also review the [HelloLlamaLocal](HelloLlamaLocal.ipynb) notebook for more information on RAG\n",
+    "* How to use LangChain to load a recent PDF doc - the Llama paper pdf - and chat about it. This is the well known RAG (Retrieval Augmented Generation) method to let LLM such as Llama be able to answer questions about your own data. RAG is one way to prevent LLM's hallucination\n",
     "\n",
     "**Note** We will be using OctoAI to run the examples here. You will need to first sign into [OctoAI](https://octoai.cloud/) with your Github or Google account, then create a free API token [here](https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token) that you can use for a while (a month or $10 in OctoAI credits, whichever one runs out first).\n",
-    "After the free trial ends, you will need to enter billing info to continue to use Llama2 hosted on OctoAI."
+    "After the free trial ends, you will need to enter billing info to continue to use Llama 3 hosted on OctoAI."
    ]
   },
   {
@@ -35,7 +34,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!pip install langchain octoai-sdk sentence-transformers chromadb pypdf"
+    "%pip install langchain==0.1.19 octoai-sdk==0.10.1 openai sentence-transformers chromadb pypdf"
    ]
   },
   {
@@ -57,15 +56,17 @@
    "id": "3e8870c1",
    "metadata": {},
    "source": [
-    "Next we call the Llama 2 model from OctoAI. In this example we will use the Llama 2 13b chat FP16 model. You can find more on Llama 2 models on the [OctoAI text generation solution page](https://octoai.cloud/tools/text).\n",
+    "Next we call the Llama 3 model from OctoAI. In this example we will use the Llama 3 8b instruct model. You can find more on Llama models on the [OctoAI text generation solution page](https://octoai.cloud/text).\n",
     "\n",
     "At the time of writing this notebook the following Llama models are available on OctoAI:\n",
-    "* llama-2-13b-chat\n",
-    "* llama-2-70b-chat\n",
+    "* meta-llama-3-8b-instruct\n",
+    "* meta-llama-3-70b-instruct\n",
     "* codellama-7b-instruct\n",
     "* codellama-13b-instruct\n",
     "* codellama-34b-instruct\n",
-    "* codellama-70b-instruct"
+    "* llama-2-13b-chat\n",
+    "* llama-2-70b-chat\n",
+    "* llamaguard-7b"
    ]
   },
   {
@@ -77,21 +78,11 @@
    "source": [
     "from langchain.llms.octoai_endpoint import OctoAIEndpoint\n",
     "\n",
-    "llama2_13b = \"llama-2-13b-chat-fp16\"\n",
+    "llama3_8b = \"meta-llama-3-8b-instruct\"\n",
     "llm = OctoAIEndpoint(\n",
-    "    endpoint_url=\"https://text.octoai.run/v1/chat/completions\",\n",
-    "    model_kwargs={\n",
-    "        \"model\": llama2_13b,\n",
-    "        \"messages\": [\n",
-    "            {\n",
-    "                \"role\": \"system\",\n",
-    "                \"content\": \"You are a helpful, respectful and honest assistant.\"\n",
-    "            }\n",
-    "        ],\n",
-    "        \"max_tokens\": 500,\n",
-    "        \"top_p\": 1,\n",
-    "        \"temperature\": 0.01\n",
-    "    },\n",
+    "    model=llama3_8b,\n",
+    "    max_tokens=500,\n",
+    "    temperature=0.01\n",
     ")"
    ]
   },
@@ -111,7 +102,7 @@
    "outputs": [],
    "source": [
     "question = \"who wrote the book Innovator's dilemma?\"\n",
-    "answer = llm(question)\n",
+    "answer = llm.invoke(question)\n",
     "print(answer)"
    ]
   },
@@ -134,7 +125,7 @@
    "source": [
     "# chat history not passed so Llama doesn't have the context and doesn't know this is more about the book\n",
     "followup = \"tell me more\"\n",
-    "followup_answer = llm(followup)\n",
+    "followup_answer = llm.invoke(followup)\n",
     "print(followup_answer)"
    ]
   },
@@ -162,7 +153,7 @@
     "memory = ConversationBufferMemory()\n",
     "conversation = ConversationChain(\n",
     "    llm=llm, \n",
-    "    memory = memory,\n",
+    "    memory=memory,\n",
     "    verbose=False\n",
     ")"
    ]
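With the memory wired in, the follow-up question from earlier now has the first exchange as context. A short usage sketch with illustrative inputs, assuming the `conversation` chain defined in the cell above:

```python
# Hedged usage sketch: conversation is the ConversationChain defined above;
# ConversationBufferMemory lets "tell me more" resolve against the first answer.
print(conversation.predict(input="who wrote the book Innovator's dilemma?"))
print(conversation.predict(input="tell me more"))
```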
@@ -208,11 +199,10 @@
    "id": "fc436163",
    "metadata": {},
    "source": [
-    "Next, let's explore using Llama 2 to answer questions using documents for context. \n",
-    "This gives us the ability to update Llama 2's knowledge thus giving it better context without needing to finetune. \n",
-    "For a more in-depth study of this, see the notebook on using Llama 2 locally [here](HelloLlamaLocal.ipynb)\n",
+    "Next, let's explore using Llama 3 to answer questions using documents for context. \n",
+    "This gives us the ability to update Llama 3's knowledge thus giving it better context without needing to finetune. \n",
     "\n",
-    "We will use the PyPDFLoader to load in a pdf, in this case, the Llama 2 paper."
+    "We will use the PyPDFLoader to load in a pdf, in this case, the Llama paper."
    ]
   },
   {
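The loading step itself is not shown in this hunk. A minimal sketch, assuming the Llama 2 paper PDF on arXiv as the source (pypdf is installed by the pip cell above):

```python
# Hedged sketch of the loading step; the arXiv URL for the Llama 2 paper is an assumption.
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://arxiv.org/pdf/2307.09288.pdf")
docs = loader.load()  # one Document per page
print(f"Loaded {len(docs)} pages")
```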
@@ -301,7 +291,7 @@
    "id": "54ad02d7",
    "metadata": {},
    "source": [
-    "We then use ` RetrievalQA` to retrieve the documents from the vector database and give the model more context on Llama 2, thereby increasing its knowledge.\n",
+    "We then use ` RetrievalQA` to retrieve the documents from the vector database and give the model more context on Llama, thereby increasing its knowledge.\n",
     "\n",
     "For each question, LangChain performs a semantic similarity search of it in the vector db, then passes the search results as the context to Llama to answer the question."
    ]
@@ -321,7 +311,7 @@
     "    retriever=vectordb.as_retriever()\n",
     ")\n",
     "\n",
-    "question = \"What is llama2?\"\n",
+    "question = \"What is llama?\"\n",
     "result = qa_chain({\"query\": question})\n",
     "print(result['result'])"
    ]
@@ -344,7 +334,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# no context passed so Llama2 doesn't have enough context to answer so it lets its imagination go wild\n",
+    "# no context passed so Llama doesn't have enough context to answer so it lets its imagination go wild\n",
     "result = qa_chain({\"query\": \"what are its use cases?\"})\n",
     "print(result['result'])"
    ]
@@ -376,7 +366,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# let's ask the original question \"What is llama2?\" again\n",
+    "# let's ask the original question \"What is llama?\" again\n",
     "result = chat_chain({\"question\": question, \"chat_history\": []})\n",
     "print(result['answer'])"
    ]

+ 67 - 143
recipes/llama_api_providers/OctoAI_API_examples/LiveData.ipynb

@@ -7,12 +7,12 @@
    "source": [
     "## This demo app shows:\n",
     "* How to use LlamaIndex, an open source library to help you build custom data augmented LLM applications\n",
-    "* How to ask Llama questions about recent live data via the You.com live search API and LlamaIndex\n",
+    "* How to ask Llama 3 questions about recent live data via the Tavily live search API\n",
     "\n",
-    "The LangChain package is used to facilitate the call to Llama2 hosted on OctoAI\n",
+    "The LangChain package is used to facilitate the call to Llama 3 hosted on OctoAI\n",
     "\n",
     "**Note** We will be using OctoAI to run the examples here. You will need to first sign into [OctoAI](https://octoai.cloud/) with your Github or Google account, then create a free API token [here](https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token) that you can use for a while (a month or $10 in OctoAI credits, whichever one runs out first).\n",
-    "After the free trial ends, you will need to enter billing info to continue to use Llama2 hosted on OctoAI."
+    "After the free trial ends, you will need to enter billing info to continue to use Llama3 hosted on OctoAI."
    ]
   },
   {
@@ -32,23 +32,13 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!pip install llama-index langchain"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "21fe3849",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# use ServiceContext to configure the LLM used and the custom embeddings\n",
-    "from llama_index import ServiceContext\n",
-    "\n",
-    "# VectorStoreIndex is used to index custom data \n",
-    "from llama_index import VectorStoreIndex\n",
-    "\n",
-    "from langchain.llms.octoai_endpoint import OctoAIEndpoint"
+    "!pip install llama-index \n",
+    "!pip install llama-index-core\n",
+    "!pip install llama-index-llms-octoai\n",
+    "!pip install llama-index-embeddings-octoai\n",
+    "!pip install octoai-sdk\n",
+    "!pip install tavily-python\n",
+    "!pip install replicate"
    ]
   },
   {
@@ -75,227 +65,161 @@
   },
   {
    "cell_type": "markdown",
-   "id": "f8ff812b",
-   "metadata": {},
-   "source": [
-    "In this example we will use the [YOU.com](https://you.com/) search engine to augment the LLM's responses.\n",
-    "To use the You.com Search API, you can email api@you.com to request an API key. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "75275628-5235-4b55-8033-601c76107528",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "YOUCOM_API_KEY = getpass()\n",
-    "os.environ[\"YOUCOM_API_KEY\"] = YOUCOM_API_KEY"
-   ]
-  },
-  {
-   "cell_type": "markdown",
    "id": "cb210c7c",
    "metadata": {},
    "source": [
-    "We then call the Llama 2 model from OctoAI.\n",
+    "We then call the Llama 3 model from OctoAI.\n",
     "\n",
-    "We will use the Llama 2 13b chat FP16 model. You can find more on Llama 2 models on the [OctoAI text generation solution page](https://octoai.cloud/tools/text).\n",
+    "We will use the Llama 3 8b instruct model. You can find more on Llama models on the [OctoAI text generation solution page](https://octoai.cloud/text).\n",
     "\n",
     "At the time of writing this notebook the following Llama models are available on OctoAI:\n",
-    "* llama-2-13b-chat\n",
-    "* llama-2-70b-chat\n",
+    "* meta-llama-3-8b-instruct\n",
+    "* meta-llama-3-70b-instruct\n",
     "* codellama-7b-instruct\n",
     "* codellama-13b-instruct\n",
     "* codellama-34b-instruct\n",
-    "* codellama-70b-instruct"
+    "* llama-2-13b-chat\n",
+    "* llama-2-70b-chat\n",
+    "* llamaguard-7b"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "c12fc2cb",
+   "id": "21fe3849",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# set llm to be using Llama2 hosted on OctoAI\n",
-    "llama2_13b = \"llama-2-13b-chat-fp16\"\n",
+    "# use ServiceContext to configure the LLM used and the custom embeddings\n",
+    "from llama_index.core import ServiceContext\n",
     "\n",
-    "llm = OctoAIEndpoint(\n",
-    "    endpoint_url=\"https://text.octoai.run/v1/chat/completions\",\n",
-    "    model_kwargs={\n",
-    "        \"model\": llama2_13b,\n",
-    "        \"messages\": [\n",
-    "            {\n",
-    "                \"role\": \"system\",\n",
-    "                \"content\": \"You are a helpful, respectful and honest assistant.\"\n",
-    "            }\n",
-    "        ],\n",
-    "        \"max_tokens\": 500,\n",
-    "        \"top_p\": 1,\n",
-    "        \"temperature\": 0.01\n",
-    "    },\n",
-    ")"
+    "# VectorStoreIndex is used to index custom data \n",
+    "from llama_index.core import VectorStoreIndex\n",
+    "\n",
+    "from llama_index.core import Settings, VectorStoreIndex\n",
+    "from llama_index.embeddings.octoai import OctoAIEmbedding\n",
+    "from llama_index.llms.octoai import OctoAI\n",
+    "\n",
+    "Settings.llm = OctoAI(\n",
+    "    model=\"meta-llama-3-8b-instruct\",\n",
+    "    token=OCTOAI_API_TOKEN,\n",
+    "    temperature=0.0,\n",
+    "    max_tokens=128,\n",
+    ")\n",
+    "\n",
+    "Settings.embed_model = OctoAIEmbedding(api_key=OCTOAI_API_TOKEN)"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "476d72da",
+   "id": "f8ff812b",
    "metadata": {},
    "source": [
-    "Using our api key we set up earlier, we make a request from YOU.com for live data on a particular topic."
+    "Next you will use the [Tavily](https://tavily.com/) search engine to augment the Llama 3's responses. To create a free trial Tavily Search API, sign in with your Google or Github account [here](https://app.tavily.com/sign-in)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "effc9656-b18d-4d24-a80b-6066564a838b",
+   "id": "75275628-5235-4b55-8033-601c76107528",
    "metadata": {},
    "outputs": [],
    "source": [
-    "import requests\n",
+    "from tavily import TavilyClient\n",
     "\n",
-    "query = \"Meta Connect\" # you can try other live data query about sports score, stock market and weather info \n",
-    "headers = {\"X-API-Key\": os.environ[\"YOUCOM_API_KEY\"]}\n",
-    "data = requests.get(\n",
-    "    f\"https://api.ydc-index.io/search?query={query}\",\n",
-    "    headers=headers,\n",
-    ").json()"
+    "TAVILY_API_KEY = getpass()\n",
+    "tavily = TavilyClient(api_key=TAVILY_API_KEY)"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "8bed3baf-742e-473c-ada1-4459012a8a2c",
+   "cell_type": "markdown",
+   "id": "476d72da",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "# check the query result in JSON\n",
-    "import json\n",
-    "\n",
-    "print(json.dumps(data, indent=2))"
+    "Do a live web search on \"Llama 3 fine-tuning\"."
    ]
   },
   {
-   "cell_type": "markdown",
-   "id": "b196e697",
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "effc9656-b18d-4d24-a80b-6066564a838b",
    "metadata": {},
+   "outputs": [],
    "source": [
-    "We then use the [`JSONLoader`](https://llamahub.ai/l/file-json) to extract the text from the returned data. The `JSONLoader` gives us the ability to load the data into LamaIndex.\n",
-    "In the next cell we show how to load the JSON result with key info stored as \"snippets\".\n",
-    "\n",
-    "However, you can also add the snippets in the query result to documents like below:\n",
-    "```python \n",
-    "from llama_index import Document\n",
-    "snippets = [snippet for hit in data[\"hits\"] for snippet in hit[\"snippets\"]]\n",
-    "documents = [Document(text=s) for s in snippets]\n",
-    "```\n",
-    "This can be handy if you just need to add a list of text strings to doc"
+    "response = tavily.search(query=\"Llama 3 fine-tuning\")\n",
+    "context = [{\"url\": obj[\"url\"], \"content\": obj[\"content\"]} for obj in response['results']]"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "7c40e73f-ca13-4f4a-a753-e613df3d389e",
+   "id": "6b5af98b-c26b-4fd7-8031-31ac4915cdac",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# one way to load the JSON result with key info stored as \"snippets\"\n",
-    "from llama_index import download_loader\n",
-    "\n",
-    "JsonDataReader = download_loader(\"JsonDataReader\")\n",
-    "loader = JsonDataReader()\n",
-    "documents = loader.load_data([hit[\"snippets\"] for hit in data[\"hits\"]])\n"
+    "context"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "8e5e3b4e",
+   "id": "0f4ea96b-bb00-4a1f-8bd2-7f15237415f6",
    "metadata": {},
    "source": [
-    "With the data set up, we create a vector store for the data and a query engine for it.\n",
-    "\n",
-    "For our embeddings we will use `OctoAIEmbeddings` whose default embedding model is GTE-Large. This model provides a good balance between speed and performance.\n",
-    "\n",
-    "For more info see https://octoai.cloud/tools/text/embeddings?mode=demo&model=thenlper%2Fgte-large. "
+    "Create documents based on the search results, index and save them to a vector store, then create a query engine."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "a5de3080-2c4b-479c-baba-793b3bee36ed",
+   "id": "7513ac70-155a-4d56-b326-0e8c2733ab99",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# use OctoAI embeddings \n",
-    "from langchain_community.embeddings import OctoAIEmbeddings\n",
-    "from llama_index.embeddings import LangchainEmbedding\n",
-    "\n",
-    "\n",
-    "embeddings = LangchainEmbedding(OctoAIEmbeddings(\n",
-    "    endpoint_url=\"https://text.octoai.run/v1/embeddings\"\n",
-    "))\n",
-    "print(embeddings)\n",
-    "\n",
-    "# create a ServiceContext instance to use Llama2 and custom embeddings\n",
-    "service_context = ServiceContext.from_defaults(llm=llm, chunk_size=800, chunk_overlap=20, embed_model=embeddings)\n",
+    "from llama_index.core import Document\n",
     "\n",
-    "# create vector store index from the documents created above\n",
-    "index = VectorStoreIndex.from_documents(documents, service_context=service_context)\n",
+    "documents = [Document(text=ct['content']) for ct in context]\n",
+    "index = VectorStoreIndex.from_documents(documents)\n",
     "\n",
-    "# create query engine from the index\n",
-    "query_engine = index.as_query_engine(streaming=False)"
+    "query_engine = index.as_query_engine(streaming=True)"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "2c4ea012",
+   "id": "df743c62-165c-4834-b1f1-7d7848a6815e",
    "metadata": {},
    "source": [
-    "We are now ready to ask Llama 2 a question about the live data using our query engine."
+    "You are now ready to ask Llama 3 questions about the live data using the query engine."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "de91a191-d0f2-498e-88dc-b2b43423e0e5",
+   "id": "b2fd905b-575a-45f1-88da-9b093caa232a",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# ask Llama2 a summary question about the search result\n",
     "response = query_engine.query(\"give me a summary\")\n",
-    "print(str(response))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "72814b20-06aa-4da8-b4dd-f0b0d74a2ea0",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# more questions\n",
-    "print(str(query_engine.query(\"what products were announced\")))"
+    "response.print_response_stream()"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "a65bc037-a689-476d-b529-0059a27bc949",
+   "id": "88c45380-1d00-46d5-80ac-0eff68fd1f8a",
    "metadata": {},
    "outputs": [],
    "source": [
-    "print(str(query_engine.query(\"tell me more about Meta AI assistant\")))"
+    "query_engine.query(\"what's the latest about Llama 3 fine-tuning?\").print_response_stream()"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "16a56542",
+   "id": "0fe54976-5345-4426-a6f0-dc3bfd45dac3",
    "metadata": {},
    "outputs": [],
    "source": [
-    "print(str(query_engine.query(\"what are Generative AI stickers\")))"
+    "query_engine.query(\"tell me more about Llama 3 fine-tuning\").print_response_stream()"
    ]
   }
  ],

+ 27 - 31
recipes/llama_api_providers/OctoAI_API_examples/Llama2_Gradio.ipynb

@@ -5,14 +5,14 @@
    "id": "47a9adb3",
    "metadata": {},
    "source": [
-    "## This demo app shows how to query Llama 2 using the Gradio UI.\n",
+    "## This demo app shows how to query Llama 3 using the Gradio UI.\n",
     "\n",
     "Since we are using OctoAI in this example, you'll need to obtain an OctoAI token:\n",
     "\n",
     "- You will need to first sign into [OctoAI](https://octoai.cloud/) with your Github or Google account\n",
     "- Then create a free API token [here](https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token) that you can use for a while (a month or $10 in OctoAI credits, whichever one runs out first)\n",
     "\n",
-    "**Note** After the free trial ends, you will need to enter billing info to continue to use Llama2 hosted on OctoAI.\n",
+    "**Note** After the free trial ends, you will need to enter billing info to continue to use Llama 3 hosted on OctoAI.\n",
     "\n",
     "To run this example:\n",
     "- Run the notebook\n",
@@ -22,8 +22,7 @@
     "In the notebook or a browser with URL http://127.0.0.1:7860 you should see a UI with your answer.\n",
     "\n",
     "Let's start by installing the necessary packages:\n",
-    "- langchain provides necessary RAG tools for this demo\n",
-    "- octoai-sdk allows us to use OctoAI Llama 2 endpoint\n",
+    "- openai for us to use its APIs to talk to the OctoAI endpoint\n",
     "- gradio is used for the UI elements\n",
     "\n",
     "And setting up the OctoAI token."
@@ -36,7 +35,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!pip install langchain octoai-sdk gradio"
+    "!pip install openai gradio"
    ]
   },
   {
@@ -60,37 +59,34 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "from langchain.schema import AIMessage, HumanMessage\n",
     "import gradio as gr\n",
-    "from langchain.llms.octoai_endpoint import OctoAIEndpoint\n",
+    "import openai\n",
     "\n",
-    "llama2_13b = \"llama-2-13b-chat-fp16\"\n",
-    "\n",
-    "llm = OctoAIEndpoint(\n",
-    "    endpoint_url=\"https://text.octoai.run/v1/chat/completions\",\n",
-    "    model_kwargs={\n",
-    "        \"model\": llama2_13b,\n",
-    "        \"messages\": [\n",
-    "            {\n",
-    "                \"role\": \"system\",\n",
-    "                \"content\": \"You are a helpful, respectful and honest assistant.\"\n",
-    "            }\n",
-    "        ],\n",
-    "        \"max_tokens\": 500,\n",
-    "        \"top_p\": 1,\n",
-    "        \"temperature\": 0.01\n",
-    "    },\n",
+    "# Init OctoAI client\n",
+    "client = openai.OpenAI(\n",
+    "    base_url=\"https://text.octoai.run/v1\",\n",
+    "    api_key=os.environ[\"OCTOAI_API_TOKEN\"]\n",
     ")\n",
     "\n",
-    "\n",
     "def predict(message, history):\n",
-    "    history_langchain_format = []\n",
-    "    for human, ai in history:\n",
-    "        history_langchain_format.append(HumanMessage(content=human))\n",
-    "        history_langchain_format.append(AIMessage(content=ai))\n",
-    "    history_langchain_format.append(HumanMessage(content=message))\n",
-    "    llm_response = llm(message, history_langchain_format)\n",
-    "    return llm_response.content\n",
+    "    history_openai_format = []\n",
+    "    for human, assistant in history:\n",
+    "        history_openai_format.append({\"role\": \"user\", \"content\": human})\n",
+    "        history_openai_format.append({\"role\": \"assistant\", \"content\": assistant})\n",
+    "    history_openai_format.append({\"role\": \"user\", \"content\": message})\n",
+    "\n",
+    "    response = client.chat.completions.create(\n",
+    "        model = 'meta-llama-3-70b-instruct',\n",
+    "        messages = history_openai_format,\n",
+    "        temperature = 0.0,\n",
+    "        stream = True\n",
+    "     )\n",
+    "\n",
+    "    partial_message = \"\"\n",
+    "    for chunk in response:\n",
+    "        if chunk.choices[0].delta.content is not None:\n",
+    "              partial_message = partial_message + chunk.choices[0].delta.content\n",
+    "              yield partial_message\n",
     "\n",
     "gr.ChatInterface(predict).launch()"
    ]

+ 23 - 29
recipes/llama_api_providers/OctoAI_API_examples/RAG_Chatbot_example/RAG_Chatbot_Example.ipynb

@@ -4,16 +4,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Building a Llama 2 chatbot with Retrieval Augmented Generation (RAG)\n",
+    "# Building a Llama 3 chatbot with Retrieval Augmented Generation (RAG)\n",
     "\n",
     "This notebook shows a complete example of how to build a Llama 2 chatbot hosted on your browser that can answer questions based on your own data. We'll cover:\n",
-    "* How to run Llama2 in the cloud hosted on OctoAI\n",
+    "* How to run Llama 3 in the cloud hosted on OctoAI\n",
     "* A chatbot example built with [Gradio](https://github.com/gradio-app/gradio) and wired to the server\n",
-    "* Adding RAG capability with Llama 2 specific knowledge based on our Getting Started [guide](https://ai.meta.com/llama/get-started/)\n",
+    "* Adding RAG capability with Llama 3 specific knowledge based on our Getting Started [guide](https://ai.meta.com/llama/get-started/)\n",
     "\n",
     "\n",
     "**Note** We will be using OctoAI to run the examples here. You will need to first sign into [OctoAI](https://octoai.cloud/) with your Github or Google account, then create a free API token [here](https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token) that you can use for a while (a month or $10 in OctoAI credits, whichever one runs out first).\n",
-    "After the free trial ends, you will need to enter billing info to continue to use Llama2 hosted on OctoAI."
+    "After the free trial ends, you will need to enter billing info to continue to use Llama 3 hosted on OctoAI."
    ]
   },
   {
@@ -51,14 +51,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## How to Develop a RAG Powered Llama 2 Chatbot\n",
+    "## How to Develop a RAG Powered Llama 3 Chatbot\n",
     "\n",
-    "The easiest way to develop RAG-powered Llama 2 chatbots is to use frameworks such as [**LangChain**](https://www.langchain.com/) and [**LlamaIndex**](https://www.llamaindex.ai/), two leading open-source frameworks for building LLM apps. Both offer convenient APIs for implementing RAG with Llama 2 including:\n",
+    "The easiest way to develop RAG-powered Llama 3 chatbots is to use frameworks such as [**LangChain**](https://www.langchain.com/) and [**LlamaIndex**](https://www.llamaindex.ai/), two leading open-source frameworks for building LLM apps. Both offer convenient APIs for implementing RAG with Llama 3 including:\n",
     "\n",
     "* Load and split documents\n",
     "* Embed and store document splits\n",
     "* Retrieve the relevant context based on the user query\n",
-    "* Call Llama 2 with query and context to generate the answer\n",
+    "* Call Llama 3 with query and context to generate the answer\n",
     "\n",
     "LangChain is a more general purpose and flexible framework for developing LLM apps with RAG capabilities, while LlamaIndex as a data framework focuses on connecting custom data sources to LLMs. The integration of the two may provide the best performant and effective solution to building real world RAG apps.\n",
     "In our example, for simplicifty, we will use LangChain alone with locally stored PDF data."
@@ -73,7 +73,7 @@
     "For this demo, we will be using the Gradio for chatbot UI, Text-generation-inference framework for model serving.\n",
     "For vector storage and similarity search, we will be using [FAISS](https://github.com/facebookresearch/faiss).\n",
     "In this example, we will be running everything in a AWS EC2 instance (i.e. [g5.2xlarge]( https://aws.amazon.com/ec2/instance-types/g5/)). g5.2xlarge features one A10G GPU. We recommend running this notebook with at least one GPU equivalent to A10G with at least 16GB video memory.\n",
-    "There are certain techniques to downsize the Llama 2 7B model, so it can fit into smaller GPUs. But it is out of scope here.\n",
+    "There are certain techniques to downsize the Llama 3 7B model, so it can fit into smaller GPUs. But it is out of scope here.\n",
     "\n",
     "First, let's install all dependencies with PIP. We also recommend you start a dedicated Conda environment for better package management.\n",
     "\n",
@@ -109,7 +109,7 @@
     "### Data Processing\n",
     "\n",
     "First run all the imports and define the path of the data and vector storage after processing.\n",
-    "For the data, we will be using a raw pdf crawled from Llama 2 Getting Started guide on [Meta AI website](https://ai.meta.com/llama/)."
+    "For the data, we will be using a raw pdf crawled from \"Llama 2 Getting Started\" guide on [Meta AI website](https://ai.meta.com/llama/)."
    ]
   },
   {
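The data-processing code that produces the FAISS index (loaded later in this notebook with `FAISS.load_local`) is not shown in this hunk. A hedged sketch of what that step could look like; the local PDF path and chunk sizes are assumptions.

```python
# Hedged sketch of the data-processing step that builds the FAISS index loaded
# later with FAISS.load_local; the PDF path and chunk sizes are assumptions.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OctoAIEmbeddings
from langchain.vectorstores import FAISS

DATA_PATH = "data/llama_getting_started.pdf"  # hypothetical local path
DB_FAISS_PATH = "vectorstore/db_faiss"

docs = PyPDFLoader(DATA_PATH).load()
splits = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)
embeddings = OctoAIEmbeddings(endpoint_url="https://text.octoai.run/v1/embeddings")
FAISS.from_documents(splits, embeddings).save_local(DB_FAISS_PATH)
```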
@@ -276,14 +276,12 @@
     "from langchain.prompts.prompt import PromptTemplate\n",
     "from anyio.from_thread import start_blocking_portal #For model callback streaming\n",
     "\n",
-    "# langchain.debug=True\n",
-    "\n",
-    "#vector db path\n",
+    "# Vector db path\n",
     "DB_FAISS_PATH = 'vectorstore/db_faiss'\n",
     "\n",
     "model_dict = {\n",
-    "    \"13-chat\" : \"llama-2-13b-chat-fp16\",\n",
-    "    \"70b-chat\" : \"llama-2-70b-chat-fp16\",\n",
+    "    \"8b-instruct\" : \"meta-llama-3-8b-instruct\",\n",
+    "    \"70b-instruct\" : \"meta-llama-3-70b-instruct\",\n",
     "}\n",
     "\n",
     "system_message = {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}"
@@ -303,22 +301,24 @@
    "outputs": [],
    "source": [
     "embeddings = OctoAIEmbeddings(endpoint_url=\"https://text.octoai.run/v1/embeddings\")\n",
-    "db = FAISS.load_local(DB_FAISS_PATH, embeddings)"
+    "db = FAISS.load_local(DB_FAISS_PATH, embeddings, allow_dangerous_deserialization=True)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Next we call the Llama 2 model from OctoAI. In this example we will use the Llama 2 13b chat FP16 model. You can find more on Llama 2 models on the [OctoAI text generation solution page](https://octoai.cloud/tools/text).\n",
+    "Next we call the Llama 3 model from OctoAI. In this example we will use the Llama 3 8b instruct model. You can find more on Llama models on the [OctoAI text generation solution page](https://octoai.cloud/text).\n",
     "\n",
     "At the time of writing this notebook the following Llama models are available on OctoAI:\n",
-    "* llama-2-13b-chat\n",
-    "* llama-2-70b-chat\n",
+    "* meta-llama-3-8b-instruct\n",
+    "* meta-llama-3-70b-instruct\n",
     "* codellama-7b-instruct\n",
     "* codellama-13b-instruct\n",
     "* codellama-34b-instruct\n",
-    "* codellama-70b-instruct"
+    "* llama-2-13b-chat\n",
+    "* llama-2-70b-chat\n",
+    "* llamaguard-7b"
    ]
   },
   {
@@ -329,16 +329,10 @@
    "source": [
     "from langchain.llms.octoai_endpoint import OctoAIEndpoint\n",
     "\n",
-    "llama2_13b = \"llama-2-13b-chat-fp16\"\n",
     "llm = OctoAIEndpoint(\n",
-    "    endpoint_url=\"https://text.octoai.run/v1/chat/completions\",\n",
-    "    model_kwargs={\n",
-    "        \"model\": llama2_13b,\n",
-    "        \"messages\": [system_message],\n",
-    "        \"max_tokens\": 500,\n",
-    "        \"top_p\": 1,\n",
-    "        \"temperature\": 0.01\n",
-    "    },\n",
+    "    model=model_dict[\"8b-instruct\"],\n",
+    "    max_tokens=500,\n",
+    "    temperature=0.01\n",
     ")"
    ]
   },
@@ -347,7 +341,7 @@
    "metadata": {},
    "source": [
     "Next, we define the retriever and template for our RetrivalQA chain. For each call of the RetrievalQA, LangChain performs a semantic similarity search of the query in the vector database, then passes the search results as the context to Llama to answer the query about the data stored in the verctor database.\n",
-    "Whereas for the template, this defines the format of the question along with context that we will be sent into Llama for generation. In general, Llama 2 has special prompt format to handle special tokens. In some cases, the serving framework might already have taken care of it. Otherwise, you will need to write customized template to properly handle that."
+    "Whereas for the template, this defines the format of the question along with context that we will be sent into Llama for generation. In general, Llama 3 has special prompt format to handle special tokens. In some cases, the serving framework might already have taken care of it. Otherwise, you will need to write customized template to properly handle that."
    ]
   },
   {
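A sketch of what such a customized template could look like, using Llama 3's chat special tokens. This is an illustration rather than the notebook's actual template; OctoAI's hosted endpoint may already apply the chat template server-side, in which case a plain text template is enough.

```python
# Hedged sketch of a Llama 3-style prompt template for a RetrievalQA chain.
# The special tokens follow Llama 3's chat format; if the serving framework
# already applies the chat template, a plain text template suffices.
from langchain.prompts import PromptTemplate

llama3_template = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant. Use the context to answer the question.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Context: {context}\n\nQuestion: {question}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
prompt = PromptTemplate(input_variables=["context", "question"], template=llama3_template)
```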

+ 2 - 2
recipes/llama_api_providers/OctoAI_API_examples/RAG_Chatbot_example/requirements.txt

@@ -1,7 +1,7 @@
 gradio==4.16.0
 pypdf==4.0.0
-langchain==0.1.7
+langchain==0.1.19
 sentence-transformers==2.2.2
 faiss-cpu==1.7.4
 text-generation==0.6.1
-octoai-sdk==0.8.3
+octoai-sdk==0.10.1

+ 79 - 126
recipes/llama_api_providers/OctoAI_API_examples/VideoSummary.ipynb

@@ -7,8 +7,8 @@
    "source": [
     "## This demo app shows:\n",
     "* How to use LangChain's YoutubeLoader to retrieve the caption in a YouTube video\n",
-    "* How to ask Llama to summarize the content (per the Llama's input size limit) of the video in a naive way using LangChain's stuff method\n",
-    "* How to bypass the limit of Llama's max input token size by using a more sophisticated way using LangChain's map_reduce and refine methods - see [here](https://python.langchain.com/docs/use_cases/summarization) for more info"
+    "* How to ask Llama 3 to summarize the content (per the Llama's input size limit) of the video in a naive way using LangChain's stuff method\n",
+    "* How to bypass the limit of Llama 3's max input token size by using a more sophisticated way using LangChain's map_reduce and refine methods - see [here](https://python.langchain.com/docs/use_cases/summarization) for more info"
    ]
   },
   {
@@ -22,7 +22,7 @@
     "- [tiktoken](https://github.com/openai/tiktoken) BytePair Encoding tokenizer\n",
     "- [pytube](https://pytube.io/en/latest/) Utility for downloading YouTube videos\n",
     "\n",
-    "**Note** This example uses OctoAI to host the Llama model. If you have not set up/or used OctoAI before, we suggest you take a look at the [HelloLlamaCloud](HelloLlamaCloud.ipynb) example for information on how to set up OctoAI before continuing with this example.\n",
+    "**Note** This example uses OctoAI to host the Llama 3 model. If you have not set up/or used OctoAI before, we suggest you take a look at the [HelloLlamaCloud](HelloLlamaCloud.ipynb) example for information on how to set up OctoAI before continuing with this example.\n",
     "If you do not want to use OctoAI, you will need to make some changes to this notebook as you go along."
    ]
   },
@@ -33,7 +33,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!pip install langchain octoai-sdk youtube-transcript-api tiktoken pytube"
+    "!pip install langchain==0.1.19 youtube-transcript-api tiktoken pytube"
    ]
   },
   {
@@ -41,7 +41,7 @@
    "id": "af3069b1",
    "metadata": {},
    "source": [
-    "Let's load the YouTube video transcript using the YoutubeLoader."
+    "Let's first load a long (2:47:16) YouTube video (Lex Fridman with Yann Lecun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI) transcript using the YoutubeLoader."
    ]
   },
   {
@@ -54,7 +54,7 @@
     "from langchain.document_loaders import YoutubeLoader\n",
     "\n",
     "loader = YoutubeLoader.from_youtube_url(\n",
-    "    \"https://www.youtube.com/watch?v=1k37OcjH7BM\", add_video_info=True\n",
+    "    \"https://www.youtube.com/watch?v=5t1vTLU7s40\", add_video_info=True\n",
     ")"
    ]
   },
@@ -85,17 +85,16 @@
    "id": "4af7cc16",
    "metadata": {},
    "source": [
-    "We are using OctoAI in this example to host our Llama 2 model so you will need to get a OctoAI token.\n",
+    "You should see 142689 returned for the doc character length, which is about 30k words or 40k tokens, beyond the 8k context length limit of Llama 3. You'll see how to summarize a text longer than the limit.\n",
+    "\n",
+    "**Note**: We are using OctoAI in this example to host our Llama 3 model so you will need to get a OctoAI token.\n",
     "\n",
     "To get the OctoAI token:\n",
     "\n",
     "- You will need to first sign in with OctoAI with your github account\n",
     "- Then create a free API token [here](https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token) that you can use for a while (a month or $10 in OctoAI credits, whichever one runs out first)\n",
     "\n",
-    "**Note** After the free trial ends, you will need to enter billing info to continue to use Llama2 hosted on OctoAI.\n",
-    "\n",
-    "Alternatively, you can run Llama locally. See:\n",
-    "- [HelloLlamaLocal](HelloLlamaLocal.ipynb) for further information on how to run Llama locally."
+    "After the free trial ends, you will need to enter billing info to continue to use Llama2 hosted on OctoAI."
    ]
   },
   {
@@ -118,17 +117,17 @@
    "id": "6b911efd",
    "metadata": {},
    "source": [
-    "Next we call the Llama 2 model from OctoAI. In this example we will use the Llama 2 13b chat FP16 model. You can find more on Llama 2 models on the [OctoAI text generation solution page](https://octoai.cloud/tools/text).\n",
+    "Next we call the Llama 3 model from OctoAI. In this example we will use the Llama 3 8b instruct model. You can find more on Llama models on the [OctoAI text generation solution page](https://octoai.cloud/text).\n",
     "\n",
     "At the time of writing this notebook the following Llama models are available on OctoAI:\n",
-    "* llama-2-13b-chat\n",
-    "* llama-2-70b-chat\n",
+    "* meta-llama-3-8b-instruct\n",
+    "* meta-llama-3-70b-instruct\n",
     "* codellama-7b-instruct\n",
     "* codellama-13b-instruct\n",
     "* codellama-34b-instruct\n",
-    "* codellama-70b-instruct\n",
-    "\n",
-    "If you using local Llama, just set llm accordingly - see the [HelloLlamaLocal notebook](HelloLlamaLocal.ipynb)"
+    "* llama-2-13b-chat\n",
+    "* llama-2-70b-chat\n",
+    "* llamaguard-7b"
    ]
   },
   {
@@ -140,21 +139,11 @@
    "source": [
     "from langchain.llms.octoai_endpoint import OctoAIEndpoint\n",
     "\n",
-    "llama2_13b = \"llama-2-13b-chat-fp16\"\n",
+    "llama3_8b = \"meta-llama-3-8b-instruct\"\n",
     "llm = OctoAIEndpoint(\n",
-    "    endpoint_url=\"https://text.octoai.run/v1/chat/completions\",\n",
-    "    model_kwargs={\n",
-    "        \"model\": llama2_13b,\n",
-    "        \"messages\": [\n",
-    "            {\n",
-    "                \"role\": \"system\",\n",
-    "                \"content\": \"You are a helpful, respectful and honest assistant.\"\n",
-    "            }\n",
-    "        ],\n",
-    "        \"max_tokens\": 500,\n",
-    "        \"top_p\": 1,\n",
-    "        \"temperature\": 0.01\n",
-    "    },\n",
+    "    model=llama3_8b,\n",
+    "    max_tokens=500,\n",
+    "    temperature=0.01\n",
     ")"
    ]
   },
@@ -163,7 +152,7 @@
    "id": "8e3baa56",
    "metadata": {},
    "source": [
-    "Once everything is set up, we prompt Llama 2 to summarize the first 4000 characters of the transcript for us."
+    "Once everything is set up, we prompt Llama 3 to summarize the first 4000 characters of the transcript for us."
    ]
   },
   {
@@ -173,90 +162,74 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "from langchain.prompts import ChatPromptTemplate\n",
+    "from langchain.prompts import PromptTemplate\n",
     "from langchain.chains import LLMChain\n",
-    "prompt = ChatPromptTemplate.from_template(\n",
-    "    \"Give me a summary of the text below: {text}?\"\n",
+    "\n",
+    "prompt_template = \"Give me a summary of the text below: {text}?\"\n",
+    "prompt = PromptTemplate(\n",
+    "    input_variables=[\"text\"], template=prompt_template\n",
     ")\n",
-    "chain = LLMChain(llm=llm, prompt=prompt)\n",
+    "chain = prompt | llm\n",
+    "\n",
     "# be careful of the input text length sent to LLM\n",
-    "text = docs[0].page_content[:4000]\n",
-    "summary = chain.run(text)\n",
-    "# this is the summary of the first 4000 characters of the video content\n",
+    "text = docs[0].page_content[:10000]\n",
+    "summary = chain.invoke(text)\n",
+    "\n",
+    "# Note: The context length of 8k tokens in Llama 3 is roughly 6000-7000 words or 32k characters\n",
     "print(summary)"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "8b684b29",
+   "id": "1ad1881a",
    "metadata": {},
    "source": [
-    "Next we try to summarize all the content of the transcript and we should get a `RuntimeError: Your input is too long. Max input length is 4096 tokens, but you supplied 5597 tokens.`."
+    "If you try the whole content which has over 142k characters, about 40k tokens, which exceeds the 8k limit, you'll get an empty result (OctoAI used to return an error \"BadRequestError: The token count (32704) of your prompt (32204) + your setting of `max_tokens` (500) cannot exceed this model's context length (8192).\")."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "88a2c17f",
+   "id": "61a088b7-cba2-4603-ba7c-f6673bfaa3cd",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# try to get a summary of the whole content\n",
+    "# this will generate an empty result because the input exceeds Llama 3's context length limit\n",
     "text = docs[0].page_content\n",
-    "summary = chain.run(text)\n",
+    "summary = llm.invoke(f\"Give me a summary of the text below: {text}.\")\n",
     "print(summary)"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "1ad1881a",
+   "id": "e112845f-de16-4c2f-8afe-6cca31f6fa38",
    "metadata": {},
    "source": [
+    "To fix this, you can use LangChain's load_summarize_chain method (detail [here](https://python.langchain.com/docs/use_cases/summarization)).\n",
     "\n",
-    "Let's try some workarounds to see if we can summarize the entire transcript without running into the `RuntimeError`.\n",
+    "First you'll create splits or sub-documents of the original content, then use the LangChain's `load_summarize_chain` with the `refine` or `map_reduce type`.\n",
     "\n",
-    "We will use the LangChain's `load_summarize_chain` and play around with the `chain_type`.\n"
+    "Because this may involve many calls to Llama 3, it'd be great to set up a quick free LangChain API key [here](https://smith.langchain.com/settings), run the following cell to set up necessary environment variables, and check the logs on [LangSmith](https://docs.smith.langchain.com/) during and after the run."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "9bfee2d3-3afe-41d9-8968-6450cc23f493",
+   "id": "55586a09-db53-4741-87d8-fdfb40d9f8cb",
    "metadata": {},
    "outputs": [],
    "source": [
-    "from langchain.chains.summarize import load_summarize_chain\n",
-    "# see https://python.langchain.com/docs/use_cases/summarization for more info\n",
-    "chain = load_summarize_chain(llm, chain_type=\"stuff\") # other supported methods are map_reduce and refine\n",
-    "chain.run(docs)\n",
-    "# same RuntimeError: Your input is too long. but stuff works for shorter text with input length <= 4096 tokens"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "682799a8-3846-41b1-a908-02ab5ac3ecee",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "chain = load_summarize_chain(llm, chain_type=\"refine\")\n",
-    "# still get the \"RuntimeError: Your input is too long. Max input length is 4096 tokens\"\n",
-    "chain.run(docs)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "aecf6328",
-   "metadata": {},
-   "source": [
-    "\n",
-    "Since the transcript is bigger than the model can handle, we can split the transcript into chunks instead and use the [`refine`](https://python.langchain.com/docs/modules/chains/document/refine) `chain_type` to iteratively create an answer."
+    "import os\n",
+    "os.environ[\"LANGCHAIN_API_KEY\"] = \"your_langchain_api_key\"\n",
+    "os.environ[\"LANGCHAIN_API_KEY\"] = \"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\"\n",
+    "os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n",
+    "os.environ[\"LANGCHAIN_PROJECT\"] = \"Video Summary with Llama 3\""
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "3be1236a-fe6a-4bf6-983f-0e72dde39fee",
+   "id": "9bfee2d3-3afe-41d9-8968-6450cc23f493",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -264,7 +237,7 @@
     "\n",
     "# we need to split the long input text\n",
     "text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n",
-    "    chunk_size=3000, chunk_overlap=0\n",
+    "    chunk_size=1000, chunk_overlap=0\n",
     ")\n",
     "split_docs = text_splitter.split_documents(docs)"
    ]
@@ -272,7 +245,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "12ae9e9d-3434-4a84-a298-f2b98de9ff01",
+   "id": "682799a8-3846-41b1-a908-02ab5ac3ecee",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -281,81 +254,61 @@
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "127f17fe-d5b7-43af-bd2f-2b47b076d0b1",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# now get the summary of the whole docs - the whole youtube content\n",
-    "chain = load_summarize_chain(llm, chain_type=\"refine\")\n",
-    "print(str(chain.run(split_docs)))"
-   ]
-  },
-  {
    "cell_type": "markdown",
-   "id": "c3976c92",
+   "id": "aecf6328",
    "metadata": {},
    "source": [
-    "You can also use [`map_reduce`](https://python.langchain.com/docs/modules/chains/document/map_reduce) `chain_type` to implement a map reduce like architecture while summarizing the documents."
+    "The `refine` type implements the following steps under the hood:\n",
+    "\n",
+    "1. Call Llama 3 on the first sub-document to generate a concise summary;\n",
+    "2. Loop over each subsequent sub-document, pass the previous summary with the current sub-document to generate a refined new summary;\n",
+    "3. Return the final summary generated on the final sub-document as the final answer - the summary of the whole content.\n",
+    "\n",
+    "An example prompt template for each call in step 2, which gets used under the hood by LangChain, is:\n",
+    "\n",
+    "```\n",
+    "Your job is to produce a final summary.\n",
+    "We have provided an existing summary up to a certain point:\n",
+    "<previous_summary>\n",
+    "Refine the existing summary (only if needed) with some more content below:\n",
+    "<new_content>\n",
+    "```\n",
+    "\n",
+    "**Note**: The following call will make 33 calls to Llama 3 and genereate the final summary in about 10 minutes."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "8991df49-8578-46de-8b30-cb2cd11e30f1",
+   "id": "3be1236a-fe6a-4bf6-983f-0e72dde39fee",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# another method is map_reduce\n",
-    "chain = load_summarize_chain(llm, chain_type=\"map_reduce\")\n",
-    "print(str(chain.run(split_docs)))"
+    "from langchain.chains.summarize import load_summarize_chain\n",
+    "\n",
+    "chain = load_summarize_chain(llm, chain_type=\"refine\")\n",
+    "print(chain.run(split_docs))"
    ]
   },
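+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you want to steer how the refinement happens, `load_summarize_chain` can also take custom prompts. The cell below is a minimal, illustrative sketch only: it assumes the `refine` chain type accepts `question_prompt` (used on the first sub-document) and `refine_prompt` (used on each refinement step, with `existing_answer` and `text` as its input variables), and the prompt wording here is ours, not LangChain's built-in template."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# A hedged sketch: customizing the refine chain's prompts.\n",
+    "# Assumes load_summarize_chain(chain_type=\"refine\") accepts question_prompt and refine_prompt.\n",
+    "from langchain.prompts import PromptTemplate  # repeated so this cell is self-contained\n",
+    "from langchain.chains.summarize import load_summarize_chain\n",
+    "\n",
+    "question_prompt = PromptTemplate(\n",
+    "    input_variables=[\"text\"],\n",
+    "    template=\"Write a concise summary of the following: {text}\",\n",
+    ")\n",
+    "refine_prompt = PromptTemplate(\n",
+    "    input_variables=[\"existing_answer\", \"text\"],\n",
+    "    template=(\n",
+    "        \"Your job is to produce a final summary. \"\n",
+    "        \"We have provided an existing summary up to a certain point: {existing_answer} \"\n",
+    "        \"Refine the existing summary (only if needed) with some more content below: {text}\"\n",
+    "    ),\n",
+    ")\n",
+    "\n",
+    "custom_refine_chain = load_summarize_chain(\n",
+    "    llm,\n",
+    "    chain_type=\"refine\",\n",
+    "    question_prompt=question_prompt,\n",
+    "    refine_prompt=refine_prompt,\n",
+    ")\n",
+    "# Uncomment to run; this repeats the ~33 Llama 3 calls above.\n",
+    "# print(custom_refine_chain.run(split_docs))"
+   ]
+  },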
   {
    "cell_type": "markdown",
-   "id": "77d580de",
+   "id": "752f2b71-5fd6-4a8a-ac09-371bce1db703",
    "metadata": {},
    "source": [
-    "To investigate further, let's turn on Langchain's debug mode on to get an idea of how many calls are made to the model and the details of the inputs and outputs.\n",
-    "We will then run our summary using the `stuff` and `refine` `chain_types` and take a look at our output."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "f2138911-d2b9-41f3-870f-9bc37e2043d9",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# to find how many calls to Llama have been made and the details of inputs and outputs of each call, set langchain to debug\n",
-    "import langchain\n",
-    "langchain.debug = True\n",
+    "You can also set `chain_type` to `map_reduce` to generate the summary of the entire content using the standard map and reduce method, which works behind the scene by first mapping each split document to a sub-summary via a call to LLM, then combines all those sub-summaries into a single final summary by yet another call to LLM.\n",
     "\n",
-    "# stuff method will cause the error in the end\n",
-    "chain = load_summarize_chain(llm, chain_type=\"stuff\")\n",
-    "chain.run(split_docs)"
+    "**Note**: The following call takes about 3 minutes and all the calls to Llama 3."
    ]
   },
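+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here is a rough manual sketch of those map and reduce steps. It is illustrative only: the prompts are ours rather than the ones LangChain actually uses, and it only processes the first few splits to keep the run short. The actual `map_reduce` chain call follows in the next cell."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# A rough manual sketch of map_reduce (illustrative prompts, first few splits only).\n",
+    "# Map: summarize each split document with its own Llama 3 call.\n",
+    "sub_summaries = [\n",
+    "    llm.invoke(f\"Write a concise summary of the following: {d.page_content}\")\n",
+    "    for d in split_docs[:3]\n",
+    "]\n",
+    "\n",
+    "# Reduce: combine the sub-summaries into a single summary with one more call.\n",
+    "combined = \"\\n\".join(str(s) for s in sub_summaries)\n",
+    "print(llm.invoke(f\"Combine these partial summaries into a single coherent summary: {combined}\"))"
+   ]
+  },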
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "60d1a531-ab48-45cc-a7de-59a14e18240d",
+   "id": "8991df49-8578-46de-8b30-cb2cd11e30f1",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# but refine works\n",
-    "chain = load_summarize_chain(llm, chain_type=\"refine\")\n",
-    "chain.run(split_docs)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "61ccd0fb-5cdb-43c4-afaf-05bc9f7cf959",
-   "metadata": {},
-   "source": [
-    "\n",
-    "As you can see, `stuff` fails because it tries to treat all the split documents as one and \"stuffs\" it into one prompt which leads to a much larger prompt than Llama 2 can handle while `refine` iteratively runs over the documents updating its answer as it goes."
+    "chain = load_summarize_chain(llm, chain_type=\"map_reduce\")\n",
+    "print(chain.run(split_docs))"
    ]
   }
  ],