# MultiModal Document RAG with ColQwen2 and Llama 3.2 90B Vision
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/MultiModal_RAG_with_Nvidia_Investor_Slide_Deck.ipynb)

## Hardware Requirements
*To make the notebook run faster, change the runtime type to T4 GPU:
`Runtime` -> `Change runtime type` -> `T4 GPU`*

## Introduction

In this notebook we will see how to use multimodal RAG to chat with Nvidia's investor slide deck from October 2023. The [slide deck](https://s201.q4cdn.com/141608511/files/doc_presentations/2023/Oct/01/ndr_presentation_oct_2023_final.pdf) is 39 pages long and combines text, visuals, tables, charts and annotations. The document structure and templates vary from page to page, which makes it quite difficult to RAG over using traditional methods.

We will be using a new multimodal approach!

<img src="images/Nvidia_collage.png" width="500">

## MultiModal RAG Workflow

[ColPali](https://arxiv.org/abs/2407.01449) is a new multimodal retrieval system that retrieves document pages directly from their images.

By directly encoding image patches, it eliminates the need for optical character recognition (OCR), or image captioning to extract text from PDFs.

We will use `byaldi`, a library from [AnswerAI](https://www.answer.ai/), that makes it easier to work with an upgraded version of ColPali, called ColQwen2, to embed and retrieve images of our PDF documents.

Retrieved pages will then be passed into the Llama-3.2 90B Vision model served via a [Together AI](https://www.together.ai/) inference endpoint for it to answer questions.

For a fuller explanation of how ColPali and the new Llama 3.2 Vision models work, check out the [blog post](https://www.together.ai/blog/multimodal-document-rag-with-llama-3-2-vision-and-colqwen2) connected to this notebook.

<img src="images/mmrag_only.png" width="600">

### Install relevant libraries

In [None]:
!pip install byaldi together pdf2image

In [None]:
!sudo apt-get install -y poppler-utils

In [None]:
import os

# Paste in your Together AI API key, or load it from the environment
api_key = os.environ.get("TOGETHER_API_KEY")

### Initialize the ColPali Model

In [None]:
import os
from pathlib import Path
from byaldi import RAGMultiModalModel

# Initialize RAGMultiModalModel
model = RAGMultiModalModel.from_pretrained("vidore/colqwen2-v0.1")

Verbosity is set to 1 (active). Pass verbose=0 to make quieter.



### The document we will be retrieving from is a 39-page Nvidia investor presentation from 2023: [Investor Presentation October 2023](https://s201.q4cdn.com/141608511/files/doc_presentations/2023/Oct/01/ndr_presentation_oct_2023_final.pdf)

In [None]:
# Download Nvidia's investor presentation and rename it
!wget https://s201.q4cdn.com/141608511/files/doc_presentations/2023/Oct/01/ndr_presentation_oct_2023_final.pdf
!mv ndr_presentation_oct_2023_final.pdf nvidia_presentation.pdf

--2024-10-04 14:34:23--  https://s201.q4cdn.com/141608511/files/doc_presentations/2023/Oct/01/ndr_presentation_oct_2023_final.pdf
Resolving s201.q4cdn.com (s201.q4cdn.com)... 68.70.205.3, 68.70.205.4, 68.70.205.1, ...
Connecting to s201.q4cdn.com (s201.q4cdn.com)|68.70.205.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8609256 (8.2M) [application/pdf]
Saving to: ‘ndr_presentation_oct_2023_final.pdf’


2024-10-04 14:34:24 (24.4 MB/s) - ‘ndr_presentation_oct_2023_final.pdf’ saved [8609256/8609256]



### Let's create the index that will store the embeddings for the page images.

Caution: the cell below takes about 5 minutes to index the whole PDF!

In [None]:
# Use ColQwen2 to index and store the presentation
index_name = "nvidia_index"
model.index(input_path=Path("nvidia_presentation.pdf"),
    index_name=index_name,
    store_collection_with_index=True, # Stores base64 images along with the vectors
    overwrite=True
)

Added page 1 of document 0 to index.
Added page 2 of document 0 to index.
Added page 3 of document 0 to index.
Added page 4 of document 0 to index.
Added page 5 of document 0 to index.
Added page 6 of document 0 to index.
Added page 7 of document 0 to index.
Added page 8 of document 0 to index.
Added page 9 of document 0 to index.
Added page 10 of document 0 to index.
Added page 11 of document 0 to index.
Added page 12 of document 0 to index.
Added page 13 of document 0 to index.
Added page 14 of document 0 to index.
Added page 15 of document 0 to index.
Added page 16 of document 0 to index.
Added page 17 of document 0 to index.
Added page 18 of document 0 to index.
Added page 19 of document 0 to index.
Added page 20 of document 0 to index.
Added page 21 of document 0 to index.
Added page 22 of document 0 to index.
Added page 23 of document 0 to index.
Added page 24 of document 0 to index.
Added page 25 of document 0 to index.
Added page 26 of document 0 to index.
Added page 27 of docu

{0: '/content/nvidia_presentation.pdf'}

### This concludes the PDF indexing phase. Everything below happens at query time.

<img src="images/colpali_arch.png" width="700">

### Let's query our indexed document.

Here the important thing to note is that the query is asking for details that are found on page 25 of the PDF!

In [None]:
# Let's query our index and retrieve the pages whose content has the highest similarity to the query

# The Data Centre revenue results are on page 25 - for context!
query = "What are the half year data centre renevue results and the 5 year CAGR for Nvidia data centre revenue?"
results = model.search(query, k=5)

print(f"Search results for '{query}':")
for result in results:
    print(f"Doc ID: {result.doc_id}, Page: {result.page_num}, Score: {result.score}")

print("Test completed successfully!")

Search results for 'What are the half year data centre renevue results and the 5 year CAGR for Nvidia data centre revenue?':
Doc ID: 0, Page: 25, Score: 25.875
Doc ID: 0, Page: 24, Score: 25.0
Doc ID: 0, Page: 28, Score: 23.75
Doc ID: 0, Page: 32, Score: 23.75
Doc ID: 0, Page: 31, Score: 23.75
Test completed successfully!


### Notice that ColQwen2 retrieves the correct page with the highest similarity score!

<img src="images/page_25.png" width="700">

### How does this work? What happens under the hood between the page images and the query tokens?

Retrieval performance comes from the interaction operation that scores each page of the document by comparing its image patch representations against the query text token representations.

Typically each image is resized and cut into patches of 16x16 pixels. These patches are then embedded into 128-dimensional vectors, which are stored and used to perform the MaxSim late-interaction operation between the image patches and the text tokens. ColPali is a multi-vector approach because it produces multiple vectors for each image or query: one vector per token, instead of a single vector for all tokens.

<img src="images/ColPaliMaxSim-1.png" width="700">
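To make the MaxSim scoring concrete, here is a minimal sketch in plain Python. It is an illustration of the late-interaction idea, not the code that `byaldi` or ColQwen2 actually runs: the function names are made up for this example, the embeddings are toy 2-dimensional vectors rather than 128-dimensional ones, and similarity is assumed to be a plain dot product.

```python
from typing import Dict, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction score for one page.

    For every query token vector, keep only its best-matching page patch
    (the max similarity), then sum those maxima over all query tokens.
    """
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy data (hypothetical): 2 query tokens, pages with 3 and 2 patches.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

scores: Dict[str, float] = {
    "page_a": maxsim_score(query_vecs, page_a),
    "page_b": maxsim_score(query_vecs, page_b),
}

# The page with the highest summed MaxSim score is ranked first.
best_page = max(scores, key=scores.get)
print(scores, best_page)
```

Each query token is free to match a different patch on the page, which is why a page can score well even when the relevant text and chart sit in different regions.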

The retrieval step takes about 180 ms.

In [None]:
%%timeit
model.search(query, k=5)

182 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Let's now pass the retrieved page to the Llama-3.2 90B Vision model

This model will read the question: `"What are the half year data centre renevue results and the 5 year CAGR for Nvidia data centre revenue?"`

It then takes in the retrieved page and produces an answer!

You can pass in a URL to the image of the retrieved page or a base64 encoded version of the image.
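As a rough sketch of those two options, the helper below builds the `image_url` content part used in the chat request from either kind of input. The helper name `image_part` and its `mime_type` parameter are hypothetical and exist only for this example; the dictionary shape mirrors the request used later in this notebook.

```python
def image_part(image: str, mime_type: str = "image/jpeg") -> dict:
    """Build an `image_url` content part from a URL or a base64 string.

    Strings that already look like a URL (http, https, or a data URL) are
    passed through unchanged. Anything else is assumed to be raw base64
    image data and is wrapped in a data URL with the given MIME type.
    """
    if image.startswith(("http://", "https://", "data:")):
        url = image
    else:
        url = f"data:{mime_type};base64,{image}"
    return {"type": "image_url", "image_url": {"url": url}}


# A hosted image and an inline base64 payload produce the same structure.
print(image_part("https://example.com/page_25.png"))
print(image_part("QUJD"))  # "QUJD" is placeholder base64, not a real image
```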

In [None]:
# Since we stored the collection along with the index we have the base64 images of all PDF pages as well!
model.search(query, k=1)

In [None]:
returned_page = model.search(query, k=1)[0].base64
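This notebook does not state which image format the stored base64 pages use. If you want to check before building a data URL, one option is to inspect the first decoded bytes. The sketch below is an assumption-laden helper (the name `guess_mime_type` is made up here) that recognizes only the standard PNG and JPEG file signatures.

```python
import base64


def guess_mime_type(b64_image: str) -> str:
    """Guess the MIME type of a base64-encoded image from its magic bytes.

    Only the first 24 base64 characters (18 bytes once decoded) are needed.
    Assumes the string is plain base64 with no data-URL prefix or newlines.
    """
    header = base64.b64decode(b64_image[:24])
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if header.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    return "application/octet-stream"  # unknown format


# Synthetic payloads that carry only the file signatures, for illustration.
fake_png = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"\x00" * 20).decode()
fake_jpeg = base64.b64encode(b"\xff\xd8\xff\xe0" + b"\x00" * 20).decode()
print(guess_mime_type(fake_png), guess_mime_type(fake_jpeg))
```

With the notebook's variables, `guess_mime_type(returned_page)` would tell you which MIME type to put in the data URL.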

## We'll use a [Together AI](https://www.together.ai/) inference endpoint to access the Llama-3.2 90B Vision model

In [None]:
import os
from together import Together

client = Together(api_key=api_key)

response = client.chat.completions.create(
  model="meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": query}, #query
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{returned_page}", #retrieved page image
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0].message.content)


The half-year data center revenue results for Nvidia are $14,607 million. The 5-year CAGR for Nvidia's data center revenue is 51%.


Here we can see that the combination of ColQwen2 as an image retriever and Llama-3.2 90B Vision is a powerful duo for multimodal RAG applications, especially with PDFs.

Not only was ColQwen2 able to retrieve the page that contained the answer, but Llama-3.2 90B Vision was also able to find exactly where on the page the answer was, ignoring all the irrelevant details!
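The retrieve-then-read flow can be summarized in a single function. This is a sketch under stated assumptions, not part of `byaldi` or the Together SDK: `answer_from_document` is a hypothetical helper, and the retriever and vision model are passed in as plain callables so the example runs on its own with stand-ins.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List


def answer_from_document(
    query: str,
    search: Callable[[str, int], List[Any]],
    ask_vision_model: Callable[[List[Dict[str, Any]]], str],
) -> str:
    """Retrieve the best-matching page image and ask a vision model about it."""
    results = search(query, 1)  # top-1 page for the query
    if not results:
        raise LookupError(f"No pages retrieved for query: {query!r}")
    page_b64 = results[0].base64  # base64 page image stored with the index
    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": query},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{page_b64}"}},
        ],
    }]
    return ask_vision_model(messages)


# Stand-ins (hypothetical) so the sketch runs without a GPU or an API key.
def fake_search(query: str, k: int) -> List[Any]:
    return [SimpleNamespace(doc_id=0, page_num=25, base64="QUJD")][:k]


def fake_vision_model(messages: List[Dict[str, Any]]) -> str:
    return f"received {len(messages[0]['content'])} content parts"


demo_answer = answer_from_document("example question", fake_search, fake_vision_model)
print(demo_answer)
```

With the objects from this notebook, `search` could be `lambda q, k: model.search(q, k=k)`, and `ask_vision_model` could wrap the `client.chat.completions.create(...)` call shown above and return `response.choices[0].message.content`.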

Voila! 🎉🎉

Learn more about Llama 3.2 Vision in the [docs](https://docs.together.ai/docs/vision-overview)!