|
@@ -14,7 +14,13 @@
|
|
|
"- Pre-processed categories to reduce complexity\n",
|
|
|
"- Balanced categories by random sampling\n",
|
|
|
"- Iterated and prompted 11B to label images\n",
|
|
|
- "- Created Script to label images"
|
|
|
+ "- Created Script to label images\n",
|
|
|
+ "\n",
|
|
|
+ "Next steps:\n",
|
|
|
+ "\n",
|
|
|
+ "- Cleaning up the annotations produced in the previous step\n",
|
|
|
+ "- Re-balancing categories, since the model still hallucinates some new categories\n",
|
|
|
+ "- Final round of EDA before moving on to creating a RAG pipeline in Notebook 3"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -23,7 +29,15 @@
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"### Cleaning up Annotations\n",
|
|
|
- "\n"
|
|
|
+ "\n",
|
|
|
+ "Hopefully you remember the prompt from the previous notebook. Despite the prompt engineering, we still have a few issues to deal with:\n",
|
|
|
+ "\n",
|
|
|
+ "- The model hallucinates categories\n",
|
|
|
+ "- We need to delete escape characters to handle the JSON formatting. Like most people, the author has a love-hate relationship with regex, but it works pretty well for this. Another approach that works is using the `Llama-3.2-3B-Instruct` model for the cleanup. This is conveniently left as an exercise for the reader\n",
|
|
|
+ "- Refusals: Sometimes the model refuses to label the images, so we need to remove these examples\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "These are prompt engineering skill issues that you can improve by going back to notebook 1. For now, let's proceed:"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
@@ -57,6 +71,14 @@
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "c6b6d254",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "List of CSV files produced by the multi-GPU run:"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 30,
|
|
|
"id": "26be4145-dff1-4ece-8909-4346b253a799",
|
|
@@ -82,6 +104,18 @@
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "493475b5",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### Cleaning up captions:\n",
|
|
|
+ "\n",
|
|
|
+ "Hello Regex, our dark old friend! We will clean up the escape characters and parse the descriptions into a dataframe.\n",
|
|
|
+ "\n",
|
|
|
+ "Don't ask how we got the regex; only 405B knows."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 33,
|
|
|
"id": "b93654ab-d6be-4737-af46-9073889ead45",
|
|
@@ -568,6 +602,14 @@
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "092177e8",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Check the difference after cleanup:"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 40,
|
|
|
"id": "fd13a94a-ed78-4bf1-b264-538610fbb302",
|
|
@@ -830,6 +872,14 @@
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "48cd600f",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Let's drop the `NaN` examples and remove the `size` column. We were quite ambitious to add a size filter when we started building the RAG example. We are dropping it here and leaving it as another assignment for the reader:"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 43,
|
|
|
"id": "41bcb1be-06a1-41b1-bba8-8a71eedb0b69",
|
|
@@ -1091,6 +1141,14 @@
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "300839b7",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "The model still hallucinates and goes off-track with some categories. Let's fix this by re-mapping them:"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 61,
|
|
|
"id": "b0fde9df-9659-4339-8c75-037e86f89d45",
|
|
@@ -1201,6 +1259,14 @@
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "e1dc4690",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "We can also re-map the categories like so:"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 69,
|
|
|
"id": "6f105d26-9e4c-442f-8ba1-d2e9685e325e",
|
|
@@ -1425,6 +1491,14 @@
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "08ce5180",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "We can now re-sample and have a nice and balanced dataset:"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
"cell_type": "code",
|
|
|
"execution_count": 78,
|
|
|
"id": "6e09e47a-6bef-4259-b6b7-51b3d677b1ff",
|
|
@@ -1706,6 +1780,18 @@
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "eede2e0c",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "#### Next Step\n",
|
|
|
+ "\n",
|
|
|
+ "We have made a lot of progress! Our dataset is now ready to be embedded and used in our final step.\n",
|
|
|
+ "\n",
|
|
|
+ "The next part will be the easiest. However, we will still prompt engineer a bit."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
"cell_type": "code",
|
|
|
"execution_count": null,
|
|
|
"id": "ee854540-3908-4428-a063-72c8997a2540",
|