
Nb-3 complete

Sanyam Bhutani 6 months ago
parent
commit
9926fb2456

+ 65 - 45
recipes/quickstart/Multi-Modal-RAG/notebooks/Part_3_RAG_Setup_and_Validation.ipynb

@@ -1,6 +1,27 @@
 {
  "cells": [
   {
+   "cell_type": "markdown",
+   "id": "1d5d6034",
+   "metadata": {},
+   "source": [
+    "## Part 3: Setting up RAG Example and Validating our Retrieval Pipeline\n",
+    "\n",
+    "We are ready for the finale, but let's recap:\n",
+    "\n",
+    "- We started with an example dataset of 5000 images\n",
+    "- In the first notebook, we cleaned this up for labelling and used the `Llama-3.2-11B` model to label the images\n",
+    "- In the second notebook, we cleaned up some hallucinations of the model and pre-processed the descriptions that were synthetically generated\n",
+    "\n",
+    "\n",
+    "Step 3 is to set up a RAG pipeline and profit.\n",
+    "\n",
+    "We will use [lance-db](https://lancedb.com) since it's open source and Llama and open source go well together 🤝\n",
+    "\n",
+    "We also love free stuff, and Llama partner [Together](https://www.together.ai) is hosting the 11B model for free. For our final demo, we will use their API and validate it in this example."
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": 12,
    "id": "ee4b18be-3bf0-4f7c-8ac5-ef68b7566750",
@@ -9,32 +30,17 @@
    },
    "outputs": [],
    "source": [
-    "!pip install lancedb rerankers together -q"
+    "#!pip install lancedb rerankers together -q"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": 14,
-   "id": "3f3d4a75-bc50-46e2-ac26-2e4c975cfa9d",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Cloning into 'MM-Demo'...\n",
-      "remote: Enumerating objects: 27, done.\u001b[K\n",
-      "remote: Counting objects: 100% (23/23), done.\u001b[K\n",
-      "remote: Compressing objects: 100% (23/23), done.\u001b[K\n",
-      "remote: Total 27 (delta 5), reused 0 (delta 0), pack-reused 4 (from 1)\u001b[K\n",
-      "Unpacking objects: 100% (27/27), 930.83 KiB | 4.70 MiB/s, done.\n"
-     ]
-    }
-   ],
+   "cell_type": "markdown",
+   "id": "38fce923",
+   "metadata": {},
    "source": [
-    "!git clone https://huggingface.co/datasets/Sanyam/MM-Demo"
+    "Since the outputs of LLMs are non-deterministic, we will use the uploaded CSVs from this dataset to get the same experience. \n",
+    "\n",
+    "In other words, the maintainers of llama-recipes don't want more complaints, so we will re-use these CSVs to avoid new GitHub issues."
    ]
   },
   {
@@ -68,7 +74,7 @@
     }
    ],
    "source": [
-    "!wget https://huggingface.co/datasets/Sanyam/MM-Demo/resolve/main/archive.zip?download=true -O archive.zip"
+    "#!wget https://huggingface.co/datasets/Sanyam/MM-Demo/resolve/main/archive.zip?download=true -O archive.zip"
    ]
   },
   {
@@ -89,7 +95,15 @@
     }
    ],
    "source": [
-    "!unzip archive.zip"
+    "#!unzip archive.zip"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "add7aeb6",
+   "metadata": {},
+   "source": [
+    "### Loading the Dataset and Creating Embeddings"
    ]
   },
   {
@@ -102,6 +116,11 @@
    "outputs": [],
    "source": [
     "import pandas as pd\n",
+    "import os\n",
+    "from together import Together\n",
+    "\n",
+    "os.environ[\"TOGETHER_API_KEY\"] = \"\"\n",
+    "client = Together(api_key=os.environ.get('TOGETHER_API_KEY'))\n",
     "\n",
     "df = pd.read_csv(\"./final_balanced_sample_dataset.csv\")"
    ]
@@ -253,6 +272,18 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "188acbd2",
+   "metadata": {},
+   "source": [
+    "#### Creating Embeddings: \n",
+    "\n",
+    "We will define a schema and use the `BAAI/bge-small-en-v1.5` model to create our vector embeddings. This is a first step; we can iterate with other embedding models in our final app later. \n",
+    "\n",
+    "For retrieval, we will embed the descriptions of all clothing items and search over these."
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": 37,
    "id": "efec4e5e-3371-435a-8b49-2a4b69d7776c",
@@ -551,7 +582,14 @@
    "id": "94652723-7813-48db-8089-cbee2ee8fc85",
    "metadata": {},
    "source": [
-    "### Pattern 1: text input"
+    "#### Approach 1/3: Using Text Search\n",
+    "\n",
+    "The approach we will take first is:\n",
+    "\n",
+    "- Upload an image and ask `3.2-11B-Instruct` from Together to describe the image\n",
+    "- We will then try to find similar images in the dataset\n",
+    "\n",
+    "Note: This is to validate that our pipeline is working; in the final demo, we prompt Llama to find complementary clothes."
    ]
   },
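The two steps above can be sketched end to end with the model call stubbed out. In the notebook, the image is sent to the 11B vision model via Together's API; `describe_image` below is a hypothetical stand-in so the flow runs offline:

```python
def describe_image(image_path):
    # Hypothetical stand-in for the Together API call that asks
    # the Llama 3.2 11B vision model to describe the uploaded image.
    return "red cotton summer dress with floral print"

def find_similar(description, catalog, k=2):
    # Toy similarity: rank catalog descriptions by word overlap with
    # the generated description (the notebook uses vector search here).
    query = set(description.lower().split())
    ranked = sorted(catalog,
                    key=lambda d: len(query & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

catalog = [
    "red cotton summer dress with floral print",
    "blue denim jacket with metal buttons",
    "black leather boots with zipper",
]
desc = describe_image("uploaded.jpg")
print(find_similar(desc, catalog))
```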
   {
@@ -571,12 +609,6 @@
     }
    ],
    "source": [
-    "import os\n",
-    "from together import Together\n",
-    "\n",
-    "os.environ[\"TOGETHER_API_KEY\"] = \"\"\n",
-    "client = Together(api_key=os.environ.get('TOGETHER_API_KEY'))\n",
-    "\n",
     "prompt = prompt + \"Please answer within 100 words concisely and only provide the answer, don't \" \\
     "            \"repeat any word from the input, start your response from the clothing items that can be paired with the input\"\n",
     "response = client.chat.completions.create(\n",
@@ -801,19 +833,10 @@
   },
   {
    "cell_type": "markdown",
-   "id": "4096bf87-a6fe-4ca2-bfe2-0381922dd6bd",
-   "metadata": {},
-   "source": [
-    "### Limitations of semantic search\n",
-    "-- "
-   ]
-  },
-  {
-   "cell_type": "markdown",
    "id": "908e8c15-9e42-4457-b60d-b0b1fdca7b23",
    "metadata": {},
    "source": [
-    "## Let's try Full-text search/ BM25"
+    "#### Approach 2/3: Using Full Text-Search"
    ]
   },
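BM25, the scoring function behind full-text search, can be sketched in a few lines. This is a toy illustration of the Okapi BM25 formula over the same example descriptions, not the notebook's actual index:

```python
import math
from collections import Counter

docs = [
    "red cotton summer dress with floral print",
    "blue denim jacket with metal buttons",
    "black leather boots with zipper",
]
tokenized = [d.lower().split() for d in docs]
avgdl = sum(len(t) for t in tokenized) / len(tokenized)
N = len(tokenized)

def bm25_score(query, doc_tokens, k1=1.5, b=0.75):
    # Okapi BM25: term frequency saturated by k1, length-normalized
    # by b, and weighted by inverse document frequency.
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query.lower().split():
        df = sum(1 for t in tokenized if term in t)
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        denom = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

scores = [bm25_score("denim jacket", t) for t in tokenized]
best = docs[max(range(N), key=scores.__getitem__)]
print(best)
```

Unlike vector search, BM25 only rewards exact term matches, which is why the notebook tries both and then combines them.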
   {
@@ -1072,10 +1095,7 @@
    "id": "6370d9aa-3262-4234-915a-0459db9144ad",
    "metadata": {},
    "source": [
-    "### Hybrid search\n",
-    " \n",
-    " \n",
-    "-- Let's use ColBert Reranker"
+    "#### Approach 3/3: Using Hybrid search"
    ]
   },
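Hybrid search merges the vector and full-text result lists into one ranking. One common fusion method is reciprocal rank fusion, sketched below; LanceDB can also apply a dedicated reranker at this step, which the final app may use instead:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Each ranking is a list of doc ids ordered best-first.
    # RRF score: sum over rankings of 1 / (k + rank), rank starting at 1.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Vector search and BM25 disagree; fusion rewards docs ranked well by both.
vector_hits = ["dress", "jacket", "boots"]
bm25_hits = ["jacket", "boots", "dress"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

The "jacket" item wins here because it places reasonably high in both lists, which is exactly the behaviour hybrid search is after.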
   {