Sanyam Bhutani 7 months ago
parent
commit
fa2b2b6732

+ 112 - 4
recipes/quickstart/Multi-Modal-RAG/notebooks/Part_1_Data_Preperation.ipynb

@@ -66,7 +66,13 @@
    "id": "01fbc052-b633-4d7c-a6b8-e8b70c484697",
    "metadata": {},
    "source": [
-    "#### All the imports"
+    "#### All the imports\n",
+    "\n",
+    "We import all the libraries here. \n",
+    "\n",
+    "- PIL: For handling images to be passed to our Llama model\n",
+    "- Huggingface Tranformers: For running the model\n",
+    "- Concurrent Library: Because 405B suggested its useful for speedups and we want to look smart when doing OS stuff :) "
    ]
   },
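The list above can be sketched as a single import cell (a minimal sketch; the third-party imports are guarded here so the snippet degrades gracefully, while the notebook imports them directly):

```python
# Minimal sketch of the imports described above.
from concurrent.futures import ThreadPoolExecutor  # concurrency for speedups

try:
    from PIL import Image                   # image handling for the Llama model
    from transformers import AutoProcessor  # Hugging Face Transformers runner
except ImportError:
    Image = AutoProcessor = None  # required in the notebook, optional here
```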
   {
@@ -99,7 +105,11 @@
    "id": "544c6687-e174-4490-b221-4b3fbed080b3",
    "metadata": {},
    "source": [
-    "#### Clean Corrupt Images"
+    "#### Clean Corrupt Images\n",
+    "\n",
+    "Cleaning corruption is a task for AGI but we can handle the corrupt images in our dataset for now with some concurrency for fast checking. \n",
+    "\n",
+    "This takes a few moments so it might be a good idea to take a small break and socialise for a good change. "
    ]
   },
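A concurrent corruption check along these lines could look like the sketch below (`find_corrupt_images` and the flat folder layout are assumptions; Pillow's `verify()` is only a cheap integrity pass, not a full decode):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def is_corrupt(path):
    """Return the path if the image fails to open/verify, else None."""
    try:
        from PIL import Image  # imported lazily; Pillow assumed installed
        with Image.open(path) as img:
            img.verify()  # cheap integrity check, no full decode
        return None
    except Exception:
        return path

def find_corrupt_images(folder, workers=8):
    """Check every file in `folder` concurrently and collect the bad ones."""
    paths = [os.path.join(folder, f) for f in os.listdir(folder)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [p for p in pool.map(is_corrupt, paths) if p is not None]
```

Threads (rather than processes) are enough here because the work is dominated by file I/O.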
   {
@@ -180,6 +190,14 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "d339c0d1",
+   "metadata": {},
+   "source": [
+    "Let's load in the Meta-Data of the images and remove the rows with the corrupt images"
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": 7,
    "id": "05c65335-ad2f-4735-a25b-d75adb195113",
@@ -295,6 +313,14 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "cc899cf1",
+   "metadata": {},
+   "source": [
+    "We can now \"clean\" up the dataframe by subtracting the corrupt images."
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": 9,
    "id": "1f1e37bb-b625-44ac-b1bb-c2361b5edbf9",
@@ -340,7 +366,11 @@
     "jp-MarkdownHeadingCollapsed": true
    },
    "source": [
-    "## EDA"
+    "## EDA\n",
+    "\n",
+    "Now that we got rid of corruption we can proceed to building a great society with checking our dataset :) \n",
+    "\n",
+    "Let's start by double-checking any empty values"
    ]
   },
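Checking for empties typically reduces to a couple of pandas calls (a sketch on toy data, not the notebook's exact frame):

```python
import pandas as pd

# Toy frame with one missing value to illustrate the check.
df = pd.DataFrame({
    "label": ["shirt", None, "shoes"],
    "color": ["red", "blue", "green"],
})

missing_per_column = df.isnull().sum()  # count of empty values per column
df = df.dropna()                        # drop rows containing any empties
```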
   {
@@ -499,6 +529,16 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "c65411e6",
+   "metadata": {},
+   "source": [
+    "#### Understanding the Label Distribution \n",
+    "\n",
+    "The existing dataset comes with multi-labels, let's take a look at all categories:"
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": 15,
    "id": "fea1f2d8-48c4-4b0e-9790-3427c2517e4e",
@@ -570,6 +610,14 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "1cc50c67",
+   "metadata": {},
+   "source": [
+    "If we had more ~~prompts~~ time, this would be a fancier plot but for now let's take a look at the distribution skew to understand what's in our dataset:"
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": 17,
    "id": "14a86ee1-d419-495b-86b0-7ef193e81b4a",
@@ -598,6 +646,17 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "a0861297",
+   "metadata": {},
+   "source": [
+    "Let's start with some more cleanup:\n",
+    "\n",
+    "- Remove kids clothing since that is a smaller subset\n",
+    "- Let's use our lack of understanding of fashion to reduce categories and also make our lives with pre-processing easier"
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": 18,
    "id": "48a00d85-011d-4632-af7d-d34c8dee6a2c",
@@ -752,6 +811,14 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "c2793936",
+   "metadata": {},
+   "source": [
+    "For once, lack of fashion knowledge is useful-we can reduce our work by creating less categories. Nicely organised just like an coder's wardrobe"
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": 20,
    "id": "99115476-9862-4b92-83f4-dd0145e1ee86",
@@ -825,6 +892,14 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "a3e0061a",
+   "metadata": {},
+   "source": [
+    "This is the part that makes Thanos happy, we will balance our universe of clothes by randomly sampling."
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": 22,
    "id": "43b65158-1865-4535-bba0-610b32811c82",
@@ -934,7 +1009,15 @@
    "id": "5798ee82-e237-4dd4-8a07-7777694a8981",
    "metadata": {},
    "source": [
-    "## Synthetic Labelling using Llama 3.2"
+    "## Synthetic Labelling using Llama 3.2\n",
+    "\n",
+    "All the effort so far was to prepare our dataset for labelling. \n",
+    "\n",
+    "At this stage, we are ready to start labelling the images using Llama-3.2 models. We will use 11B here for testing. \n",
+    "\n",
+    "For our rich readers, we suggest testing 90B as an assignment. Although you will find that 11B is a great candidate for this model. \n",
+    "\n",
+    "Read more about the model capabilites [here](https://www.llama.com/docs/how-to-guides/vision-capabilities/)"
    ]
   },
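For orientation, the chat-style payload for a Llama 3.2 Vision request looks roughly like this. The commented-out calls and the model id follow Hugging Face Transformers conventions; treat them as a sketch, not the notebook's exact code:

```python
# Chat-style message layout for one image plus one instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this clothing item."},
        ],
    }
]

# With the real model you would then run something like:
# from transformers import AutoProcessor, MllamaForConditionalGeneration
# model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
# processor = AutoProcessor.from_pretrained(model_id)
# prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
```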
   {
@@ -981,6 +1064,14 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "2d97ec1b",
+   "metadata": {},
+   "source": [
+    "Feel free to randomly grab any example from the `ls` command above. This shirt is colorful enough for us to use-so we will go with the current example"
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": 27,
    "id": "8112f7bb-377c-4556-90a6-3e576321c152",
@@ -1028,6 +1119,23 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "f9d3d44f",
+   "metadata": {},
+   "source": [
+    "#### Labelling Prompt\n",
+    "\n",
+    "For anyone who feels strongly about Prompt Engineering-this section is for you. The drama in the first prompt stems from constant errors encountered when running the model. \n",
+    "\n",
+    "Suggested approach:\n",
+    "\n",
+    "- Run a simple prompt on an image\n",
+    "- See output and iterate\n",
+    "\n",
+    "After painfully trying this a few times, we learn that for some reason the model doesn't follow JSON formatting unless it's strongly urged. So we fix this with the dramatic prompt:"
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": 30,
    "id": "1de59227-6042-441b-a1f8-b19ce83f7c45",