1 year ago · a013bc8091
--- a/recipes/quickstart/Multi-Modal-RAG/notebooks/Part_2_Cleaning_Data_and_DB.ipynb
+++ b/recipes/quickstart/Multi-Modal-RAG/notebooks/Part_2_Cleaning_Data_and_DB.ipynb
@@ -14,7 +14,13 @@
 
				     "- Pre-processed categories to reduce complexity\n",
			
 
				     "- Balanced categories by random sampling\n",
			
 
				     "- Iterated and prompted 11B to label images\n",
			
 
				-    "- Created Script to label images"
			
 
				+    "- Created Script to label images\n",
			
 
				+    "\n",
			
 
				+    "Next steps:\n",
			
 
				+    "\n",
			
 
				+    "- Cleaning up Annotations produced from the previous step\n",
			
 
				+    "- Re-balancing categories, since the model still hallucinates some new categories\n",
			
 
				+    "- Final round of EDA before moving on to creating a RAG pipeline in Notebook 3"
			
 
				    ]
			
 
				   },
			
 
				   {
			
@@ -23,7 +29,15 @@
 
				    "metadata": {},
			
 
				    "source": [
			
 
				     "### Cleaning up Annotations\n",
			
 
				-    "\n"
			
 
				+    "\n",
			
 
				+    "Hopefully you remember the prompt from the previous notebook. Regardless of the prompt engineering, we still have a few issues to deal with: \n",
			
 
				+    "\n",
			
 
				+    "- The model hallucinates categories\n",
			
 
				+    "- We need to delete escape characters to handle the JSON formatting. Like most people, the author has a love-hate relationship with regex, but it works well for this. Another approach that works is using the `Llama-3.2-3B-Instruct` model for the cleanup. This is conveniently left as an exercise for the reader\n",
			
 
				+    "- Refusals: Sometimes the model refuses to label the images, so we need to remove these examples\n",
			
 
				+    "\n",
			
 
				+    "\n",
			
 
				+    "These are prompt engineering skill issues that you can improve by going back to Notebook 1. For now, let's proceed:"
			
 
				    ]
			
 
				   },
			
 
				   {
			
@@ -57,6 +71,14 @@
 
				    ]
			
 
				   },
			
 
				   {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "c6b6d254",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "List of CSV files produced from the multi-GPU run:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": 30,
			
 
				    "id": "26be4145-dff1-4ece-8909-4346b253a799",
			
@@ -82,6 +104,18 @@
 
				    ]
			
 
				   },
			
 
				   {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "493475b5",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "#### Cleaning up captions:\n",
			
 
				+    "\n",
			
 
				+    "Hello Regex, our dark old friend! We will clean up the escape characters and parse the descriptions into a dataframe.\n",
			
 
				+    "\n",
			
 
				+    "Don't ask how we got the regex; only 405B knows."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": 33,
			
 
				    "id": "b93654ab-d6be-4737-af46-9073889ead45",
			
@@ -568,6 +602,14 @@
 
				    ]
			
 
				   },
			
 
				   {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "092177e8",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Check the difference after the cleanup:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": 40,
			
 
				    "id": "fd13a94a-ed78-4bf1-b264-538610fbb302",
			
@@ -830,6 +872,14 @@
 
				    ]
			
 
				   },
			
 
				   {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "48cd600f",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "Let's drop the `NaN` examples and remove the `size` column. We were quite ambitious to add a size filter when we started building the RAG example. We drop it here, and adding it back is another exercise for the reader:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": 43,
			
 
				    "id": "41bcb1be-06a1-41b1-bba8-8a71eedb0b69",
			
@@ -1091,6 +1141,14 @@
 
				    ]
			
 
				   },
			
 
				   {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "300839b7",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "The model still hallucinates and goes off-track with some categories. Let's fix this by re-mapping them:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": 61,
			
 
				    "id": "b0fde9df-9659-4339-8c75-037e86f89d45",
			
@@ -1201,6 +1259,14 @@
 
				    ]
			
 
				   },
			
 
				   {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "e1dc4690",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "We can also re-map the categories like so:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": 69,
			
 
				    "id": "6f105d26-9e4c-442f-8ba1-d2e9685e325e",
			
@@ -1425,6 +1491,14 @@
 
				    ]
			
 
				   },
			
 
				   {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "08ce5180",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "We can now re-sample and have a nice and balanced dataset:"
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": 78,
			
 
				    "id": "6e09e47a-6bef-4259-b6b7-51b3d677b1ff",
			
@@ -1706,6 +1780,18 @@
 
				    ]
			
 
				   },
			
 
				   {
			
 
				+   "cell_type": "markdown",
			
 
				+   "id": "eede2e0c",
			
 
				+   "metadata": {},
			
 
				+   "source": [
			
 
				+    "#### Next Step\n",
			
 
				+    "\n",
			
 
				+    "We have made a lot of progress! Our dataset is now ready to be embedded and used in the final step. \n",
			
 
				+    "\n",
			
 
				+    "The next part will be the easiest; however, we will still do a bit of prompt engineering."
			
 
				+   ]
			
 
				+  },
			
 
				+  {
			
 
				    "cell_type": "code",
			
 
				    "execution_count": null,
			
 
				    "id": "ee854540-3908-4428-a063-72c8997a2540",