nb2-final

Sanyam Bhutani 6 months ago
parent
commit
a013bc8091

+ 88 - 2
recipes/quickstart/Multi-Modal-RAG/notebooks/Part_2_Cleaning_Data_and_DB.ipynb

@@ -14,7 +14,13 @@
     "- Pre-processed categories to reduce complexity\n",
     "- Balanced categories by random sampling\n",
     "- Iterated and prompted 11B to label images\n",
-    "- Created Script to label images"
+    "- Created Script to label images\n",
+    "\n",
+    "Next steps:\n",
+    "\n",
+    "- Cleaning up Annotations produced from the previous step\n",
+    "- Re-balancing categories, since the model still hallucinates some new categories\n",
+    "- Final round of EDA before moving on to creating a RAG pipeline in Notebook 3"
    ]
   },
   {
@@ -23,7 +29,15 @@
    "metadata": {},
    "source": [
     "### Cleaning up Annotations\n",
-    "\n"
+    "\n",
+    "Hopefully you remember the prompt from the previous notebook. Regardless of the prompt engineering, we still have a few issues to deal with:\n",
+    "\n",
+    "- The model hallucinates categories\n",
+    "- We need to delete escape characters to handle the JSON formatting. Like most people, the author has a love-hate relationship with regex, but it works well for this. Another approach that works is using the `Llama-3.2-3B-Instruct` model for the cleanup. This is conveniently left as an exercise for the reader\n",
+    "- Refusals: Sometimes the model refuses to label the images, so we need to remove these examples\n",
+    "\n",
+    "\n",
+    "These are prompt engineering skill issues that you can improve by going back to notebook 1. For now, let's proceed:"
    ]
   },
   {
@@ -57,6 +71,14 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "c6b6d254",
+   "metadata": {},
+   "source": [
+    "List of CSV files produced by the multi-GPU run:"
+   ]
+  },
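A minimal sketch of collecting the per-GPU CSV files, assuming each GPU process wrote its own shard into one results directory. The directory, the `captions_gpu_*.csv` file names, and the column names are hypothetical and not taken from the notebook:

```python
import csv
import tempfile
from pathlib import Path


def list_csv_shards(results_dir):
    """Return the CSV files in results_dir, sorted by name for a stable order."""
    return sorted(Path(results_dir).glob("*.csv"))


# Demo in a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for gpu_id in range(2):
        shard = Path(tmp) / f"captions_gpu_{gpu_id}.csv"  # hypothetical file name
        with shard.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["Filename", "Description"])  # hypothetical columns
            writer.writerow([f"img_{gpu_id}.jpg", "{}"])
    shard_names = [p.name for p in list_csv_shards(tmp)]

print(shard_names)
```

Sorting the paths keeps the order of the shards reproducible between runs, which helps when the files are concatenated afterwards.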
+  {
    "cell_type": "code",
    "execution_count": 30,
    "id": "26be4145-dff1-4ece-8909-4346b253a799",
@@ -82,6 +104,18 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "493475b5",
+   "metadata": {},
+   "source": [
+    "#### Cleaning up captions:\n",
+    "\n",
+    "Hello Regex, our dark old friend! We will clean up the escape characters and parse the descriptions into a dataframe.\n",
+    "\n",
+    "Don't ask how we got the regex: only 405B knows."
+   ]
+  },
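A rough sketch of this kind of regex cleanup, not the notebook's actual expression. It assumes each raw response is a string that may contain literal `\n` and `\"` escape sequences around a JSON object; the `Title` and `Category` keys are hypothetical:

```python
import json
import re


def parse_description(raw):
    """Best-effort parse of one raw model response into a dict, else None."""
    if not isinstance(raw, str):
        return None
    # Turn literal backslash-n, backslash-r and backslash-t sequences into spaces.
    text = re.sub(r"\\[nrt]", " ", raw)
    # Unescape quotes that were escaped when the response was stored.
    text = text.replace('\\"', '"')
    # Grab the outermost {...} span, ignoring any chatter around it.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None  # for example, a refusal with no JSON at all
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


sample = r'Sure! {\n \"Title\": \"Red dress\", \"Category\": \"Dresses\"}'
parsed = parse_description(sample)
refusal = parse_description("I cannot help with that.")
print(parsed, refusal)
```

Returning `None` for anything that does not parse means refusals and malformed rows can be removed later with a single missing-value filter.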
+  {
    "cell_type": "code",
    "execution_count": 33,
    "id": "b93654ab-d6be-4737-af46-9073889ead45",
@@ -568,6 +602,14 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "092177e8",
+   "metadata": {},
+   "source": [
+    "Check the difference after cleanup:"
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": 40,
    "id": "fd13a94a-ed78-4bf1-b264-538610fbb302",
@@ -830,6 +872,14 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "48cd600f",
+   "metadata": {},
+   "source": [
+    "Let's drop the `NaN` examples and remove the `size` column. We were quite ambitious to add a size filter when we started building the RAG example; we drop it here and leave it as another assignment for the reader:"
+   ]
+  },
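For illustration, the two operations can be sketched with pandas on a toy dataframe. The `Filename` and `Category` columns and the values are made up; only the `size` column name comes from the text above:

```python
import pandas as pd

# Toy data: one row with a missing category, one with a missing size.
df = pd.DataFrame(
    {
        "Filename": ["a.jpg", "b.jpg", "c.jpg"],  # hypothetical column
        "Category": ["Tops", None, "Dresses"],    # hypothetical column
        "size": ["M", "L", None],
    }
)

# Drop the unused column first so its missing values do not cost us rows,
# then drop every row that still has a missing value.
cleaned = df.drop(columns=["size"]).dropna().reset_index(drop=True)
print(cleaned)
```

Dropping the column before calling `dropna()` is a choice made in this sketch: in the reverse order, the row for `c.jpg` would be discarded only because its `size` is missing.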
+  {
    "cell_type": "code",
    "execution_count": 43,
    "id": "41bcb1be-06a1-41b1-bba8-8a71eedb0b69",
@@ -1091,6 +1141,14 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "300839b7",
+   "metadata": {},
+   "source": [
+    "The model still hallucinates and goes off-track with some categories. Let's fix this by re-mapping them:"
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": 61,
    "id": "b0fde9df-9659-4339-8c75-037e86f89d45",
@@ -1201,6 +1259,14 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "e1dc4690",
+   "metadata": {},
+   "source": [
+    "We can also re-map the categories like so:"
+   ]
+  },
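One possible way to express such a re-mapping in pandas, shown on made-up labels. The allowed set, the mapping, and the `Other` bucket are assumptions for this sketch and are not the notebook's actual categories:

```python
import pandas as pd

ALLOWED = {"Tops", "Bottoms", "Dresses"}  # hypothetical target categories
CATEGORY_MAP = {  # hypothetical hallucinated label -> target category
    "T-Shirts": "Tops",
    "Blouses": "Tops",
    "Jeans": "Bottoms",
    "Gowns": "Dresses",
}

df = pd.DataFrame({"Category": ["Tops", "T-Shirts", "Jeans", "Gowns", "Spaceships"]})

# Fold known variants into their target label; unmapped values pass through.
df["Category"] = df["Category"].replace(CATEGORY_MAP)
# Anything still outside the allowed set goes into a catch-all bucket.
df["Category"] = df["Category"].where(df["Category"].isin(ALLOWED), "Other")
print(df["Category"].tolist())
```

An explicit mapping keeps every decision visible and easy to review. Whether leftover labels are bucketed, as here, or dropped depends on how many rows they cover.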
+  {
    "cell_type": "code",
    "execution_count": 69,
    "id": "6f105d26-9e4c-442f-8ba1-d2e9685e325e",
@@ -1425,6 +1491,14 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "08ce5180",
+   "metadata": {},
+   "source": [
+    "We can now re-sample and have a nice and balanced dataset:"
+   ]
+  },
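A small sketch of re-balancing by random sampling, assuming a cap on the number of rows per category. The `balance` helper, the cap, the seed, and the toy data are hypothetical:

```python
import pandas as pd

# Toy data with an imbalanced category distribution (5 / 3 / 1).
df = pd.DataFrame(
    {
        "Filename": [f"img_{i}.jpg" for i in range(9)],
        "Category": ["Tops"] * 5 + ["Bottoms"] * 3 + ["Dresses"] * 1,
    }
)


def balance(frame, column, per_class, seed=42):
    """Randomly keep at most per_class rows from each group of `column`."""
    parts = [
        group.sample(n=min(len(group), per_class), random_state=seed)
        for _, group in frame.groupby(column)
    ]
    return pd.concat(parts).reset_index(drop=True)


balanced = balance(df, "Category", per_class=2)
counts = balanced["Category"].value_counts().to_dict()
print(counts)
```

Capping with `min(len(group), per_class)` keeps small categories intact instead of raising an error when a group has fewer rows than the cap.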
+  {
    "cell_type": "code",
    "execution_count": 78,
    "id": "6e09e47a-6bef-4259-b6b7-51b3d677b1ff",
@@ -1706,6 +1780,18 @@
    ]
   },
   {
+   "cell_type": "markdown",
+   "id": "eede2e0c",
+   "metadata": {},
+   "source": [
+    "#### Next Step\n",
+    "\n",
+    "We have made a lot of progress! Our dataset is now ready to be embedded and used for our final step.\n",
+    "\n",
+    "The next part will be the easiest; however, we will still prompt engineer a bit."
+   ]
+  },
+  {
    "cell_type": "code",
    "execution_count": null,
    "id": "ee854540-3908-4428-a063-72c8997a2540",