Sanyam Bhutani 51a5ae812d Update recipes/quickstart/Multi-Modal-RAG/notebooks/Part_1_Data_Preperation.ipynb		1 год назад
..
notebooks	51a5ae812d Update recipes/quickstart/Multi-Modal-RAG/notebooks/Part_1_Data_Preperation.ipynb	1 год назад
scripts	57ee65a174 Update label_script.py	1 год назад
README.md	c9df1be8df Update recipes/quickstart/Multi-Modal-RAG/README.md	1 год назад

End to End Tutorial on using Llama models for Multi-Modal RAG

Recipe Overview: Multi-Modal RAG using `Llama-3.2-11B` model:

This is a complete workshop on labelling images using the new Llama 3.2-Vision Models and performing RAG using the image caption capiblites of the model.

Data Labeling and Preparation: We start by downloading 5000 images of clothing items and labeling them using Llama-3.2-11B-Vision-Instruct model
Cleaning Labels: With the labels based on the notebook above, we will then clean the dataset and prepare it for RAG
Building Vector DB and RAG Pipeline: With the final clean dataset, we can use descriptions and 11B model to generate recommendations

Requirements:

Before we start:

Please grab your HF CLI Token from here
git clone this dataset inside the Multi-Modal-RAG folder: git clone https://huggingface.co/datasets/Sanyam/MM-Demo
Launch jupyter notebook inside this folder
We will also run two scripts after the notebooks
Make sure you grab a together.ai token here

Detailed Outline for running:

Order of running files, the notebook establish the method of approaching the problem. Once we establish it, we use the scripts to run the method end to end.

Notebook 1: Part_1_Data_Preperation.ipynb
Script: label_script.py
Notebook 2: Part_2_Cleaning_Data_and_DB.ipynb
Notebook 3: Part_3_RAG_Setup_and_Validation.ipynb
Script: final_demo.py

Here's the detailed outline:

Step 1: Data Prep and Synthetic Labeling:

Notebook for Step 1 and Script for Step 1

To run the script (remember to set n):

python scripts/label_script.py --hf_token "your_huggingface_token_here" \
    --input_path "../MM-Demo/images_compressed" \
    --output_path "../MM-Demo/output/" \
    --num_gpus N

The dataset consists of 5000 images with some meta-data.

The first half is preparing the dataset for labeling:

Clean/Remove corrupt images
EDA to understand existing distribution
Merging up categories of clothes to reduce complexity
Balancing dataset by randomly sampling images

Second Half consists of Labeling the dataset. Llama 3.2, 11B model can only process one image at a time:

We load a few images and test captioning
We run this pipeline on random images and iterate on the prompt till we feel the model is giving good outputs
Finally, we can create a script to label all 5000 images on multi-GPU

After running the script on the entire dataset, we have more data cleaning to perform.

Step 2: Cleaning up Synthetic Labels and preparing the dataset:

Notebook for Step 2

Even after our lengthy (apart from other things) prompt, the model still hallucinates categories and label, here is how we address this

Re-balance the dataset by mapping correct categories
Fix Descriptions so that we can create a CSV

Now, we are ready to try our vector db pipeline:

Step 3: Notebook 3: MM-RAG using lance-db to validate idea

Notebook for Step 3 and Final Demo Script

With the cleaned descriptions and dataset, we can now store these in a vector-db

You will note that we are not using the categorization from our model-this is by design to show how RAG can simplify a lot of things.

We create embeddings using the text description of our clothes
Use 11-B model to describe the uploaded image
Try to find similar or complimentary images based on the upload

We try the approach with different retrieval methods.

Finally, we can bring this all together in a Gradio App.

For running the script:

python scripts/final_demo.py \
    --images_folder "../MM-Demo/compressed_images" \
    --csv_path "../MM-Demo/final_balanced_sample_dataset.csv" \
    --table_path "~/.lancedb" \
    --api_key "your_together_api_key" \
    --default_model "BAAI/bge-large-en-v1.5" \
    --use_existing_table

Task: We can further improve the description prompt. You will notice sometimes the description starts with the title of the cloth which causes in retrieval of "similar" clothes instead of "complementary" items

Upload an image
11B model describes the image
We retrieve complementary clothes to wear based on the description
You can keep the loop going by chatting with the model

Resources used:

Credit and Thanks to List of models and resources used in the showcase:

Firstly, thanks to the author here for providing this dataset on which we base our exercise []()

Llama-3.2-11B-Vision-Instruct Model
Lance-db for vector database
[This Kaggle dataset]()
HF Dataset Since output of the model can be non-deterministic every time we run, we will use the uploaded dataset to give a universal experience
Together API for demo

README.md