|
@@ -1,6 +1,27 @@
|
|
|
{
|
|
|
"cells": [
|
|
|
{
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "1d5d6034",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "## Part 3: Setting up RAG Example and Validating our Retrieval Pipeline\n",
|
|
|
+ "\n",
|
|
|
+ "We are ready for the finale, but let's recap:\n",
|
|
|
+ "\n",
|
|
|
+ "- We started with an example dataset of 5000 images\n",
|
|
|
+ "- In the first notebook, we cleaned the dataset and used the `Llama-3.2-11B` model to label it\n",
|
|
|
+ "- In the second notebook, we cleaned up some of the model's hallucinations and pre-processed the synthetically generated descriptions\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "Step 3 is to set up a RAG pipeline and profit.\n",
|
|
|
+ "\n",
|
|
|
+ "We will use [LanceDB](https://lancedb.com) since it's open source, and Llama and open source go well together 🤝\n",
|
|
|
+ "\n",
|
|
|
+ "We also love free stuff, and Llama partner [Together](https://www.together.ai) is hosting the 11B model for free. Our final demo will use their API, so we validate it in this example as well."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 12,
|
|
|
"id": "ee4b18be-3bf0-4f7c-8ac5-ef68b7566750",
|
|
@@ -9,32 +30,17 @@
|
|
|
},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
- "!pip install lancedb rerankers together -q"
|
|
|
+ "#!pip install lancedb rerankers together -q"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
- "cell_type": "code",
|
|
|
- "execution_count": 14,
|
|
|
- "id": "3f3d4a75-bc50-46e2-ac26-2e4c975cfa9d",
|
|
|
- "metadata": {
|
|
|
- "tags": []
|
|
|
- },
|
|
|
- "outputs": [
|
|
|
- {
|
|
|
- "name": "stdout",
|
|
|
- "output_type": "stream",
|
|
|
- "text": [
|
|
|
- "Cloning into 'MM-Demo'...\n",
|
|
|
- "remote: Enumerating objects: 27, done.\u001b[K\n",
|
|
|
- "remote: Counting objects: 100% (23/23), done.\u001b[K\n",
|
|
|
- "remote: Compressing objects: 100% (23/23), done.\u001b[K\n",
|
|
|
- "remote: Total 27 (delta 5), reused 0 (delta 0), pack-reused 4 (from 1)\u001b[K\n",
|
|
|
- "Unpacking objects: 100% (27/27), 930.83 KiB | 4.70 MiB/s, done.\n"
|
|
|
- ]
|
|
|
- }
|
|
|
- ],
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "38fce923",
|
|
|
+ "metadata": {},
|
|
|
"source": [
|
|
|
- "!git clone https://huggingface.co/datasets/Sanyam/MM-Demo"
|
|
|
+ "Since the outputs of LLMs are non-deterministic, we will use the uploaded CSVs from this dataset so that everyone gets the same results.\n",
|
|
|
+ "\n",
|
|
|
+ "In other words, the maintainers of llama-recipes don't want more complaints, so we will re-use these CSVs to avoid new GitHub issues."
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -68,7 +74,7 @@
|
|
|
}
|
|
|
],
|
|
|
"source": [
|
|
|
- "!wget https://huggingface.co/datasets/Sanyam/MM-Demo/resolve/main/archive.zip?download=true -O archive.zip"
|
|
|
+ "#!wget https://huggingface.co/datasets/Sanyam/MM-Demo/resolve/main/archive.zip?download=true -O archive.zip"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -89,7 +95,15 @@
|
|
|
}
|
|
|
],
|
|
|
"source": [
|
|
|
- "!unzip archive.zip"
|
|
|
+ "#!unzip archive.zip"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "add7aeb6",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "### Loading the Dataset and Creating Embeddings"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -102,6 +116,11 @@
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
|
"import pandas as pd\n",
|
|
|
+ "import os\n",
|
|
|
+ "from together import Together\n",
|
|
|
+ "\n",
|
|
|
+ "os.environ[\"TOGETHER_API_KEY\"] = \"\"\n",
|
|
|
+ "client = Together(api_key=os.environ.get('TOGETHER_API_KEY'))\n",
|
|
|
"\n",
|
|
|
"df = pd.read_csv(\"./final_balanced_sample_dataset.csv\")"
|
|
|
]
|
|
@@ -253,6 +272,18 @@
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "188acbd2",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### Creating Embeddings: \n",
|
|
|
+ "\n",
|
|
|
+ "We will define a schema and use `BAAI/bge-small-en-v1.5` to create our vector embeddings. This is a first step; we can iterate with other embedding models in our final app later.\n",
|
|
|
+ "\n",
|
|
|
+ "For retrieval, we will embed the descriptions of all the clothes and search over those embeddings."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 37,
|
|
|
"id": "efec4e5e-3371-435a-8b49-2a4b69d7776c",
|
|
@@ -551,7 +582,14 @@
|
|
|
"id": "94652723-7813-48db-8089-cbee2ee8fc85",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "### Pattern 1: text input"
|
|
|
+ "#### Approach 1/3: Using Text Search\n",
|
|
|
+ "\n",
|
|
|
+ "The approach we will take first is:\n",
|
|
|
+ "\n",
|
|
|
+ "- Upload an image and ask `Llama-3.2-11B-Instruct`, hosted by Together, to describe it\n",
|
|
|
+ "- Use that description to find similar images in the dataset\n",
|
|
|
+ "\n",
|
|
|
+ "Note: This is to validate that our pipeline is working. In the final demo, we prompt Llama to find complementary clothes."
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -571,12 +609,6 @@
|
|
|
}
|
|
|
],
|
|
|
"source": [
|
|
|
- "import os\n",
|
|
|
- "from together import Together\n",
|
|
|
- "\n",
|
|
|
- "os.environ[\"TOGETHER_API_KEY\"] = \"\"\n",
|
|
|
- "client = Together(api_key=os.environ.get('TOGETHER_API_KEY'))\n",
|
|
|
- "\n",
|
|
|
"prompt = prompt + \"Please answer within 100 words consisely and only provide the answer, don't\" \\\n",
|
|
|
" \"repeat any word from the input, start your response from the clothing items that can be paired with the input\"\n",
|
|
|
"response = client.chat.completions.create(\n",
|
|
@@ -801,19 +833,10 @@
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "4096bf87-a6fe-4ca2-bfe2-0381922dd6bd",
|
|
|
- "metadata": {},
|
|
|
- "source": [
|
|
|
- "### Limitations of semantic search\n",
|
|
|
- "-- "
|
|
|
- ]
|
|
|
- },
|
|
|
- {
|
|
|
- "cell_type": "markdown",
|
|
|
"id": "908e8c15-9e42-4457-b60d-b0b1fdca7b23",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "## Let's try Full-text search/ BM25"
|
|
|
+ "#### Approach 2/3: Using Full-Text Search"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -1072,10 +1095,7 @@
|
|
|
"id": "6370d9aa-3262-4234-915a-0459db9144ad",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "### Hybrid search\n",
|
|
|
- " \n",
|
|
|
- " \n",
|
|
|
- "-- Let's use ColBert Reranker"
|
|
|
+ "#### Approach 3/3: Using Hybrid Search"
|
|
|
]
|
|
|
},
|
|
|
{
|