
Contextual keywords generation for RAG using Llama-3.1 (#889)

Sanyam Bhutani 2 weeks ago
parent
commit
2a65dd2727

+ 7 - 0
.github/scripts/spellcheck_conf/wordlist.txt

@@ -1529,3 +1529,10 @@ DIFFLOG
 Dimitry
 Khorzov
 LinkedIn
+DEEPINFRA
+DeepInfra
+LLAMAPARSE
+LlamaParse
+ailabs
+jina
+jinaai

File diff suppressed because it is too large
+ 1646 - 0
end-to-end-use-cases/Contextual-Chunking-RAG/Example_FinancialReport_RAG.ipynb


+ 17 - 0
end-to-end-use-cases/Contextual-Chunking-RAG/README.md

@@ -0,0 +1,17 @@
+# Contextual keywords generation for RAG using Llama-3.1
+
+**Problem**: Independent chunking in traditional RAG systems leads to the loss of contextual information between chunks. This makes it difficult to retrieve relevant chunks when the context (e.g., the subject or entity being discussed) is not explicitly repeated within individual chunks.
+
+**Solution**: Generate keywords for each chunk to fill in the missing contextual information. These keywords (e.g., "BMW, X5, pricing") enrich the chunk with the necessary context, ensuring better retrieval accuracy. By embedding this enriched metadata, the system bridges gaps between related chunks, enabling effective query matching and accurate answer generation.
+
+[This article](https://medium.com/@ailabs/overcoming-independent-chunking-in-rag-systems-a-hybrid-approach-5d2c205b3732) explains the benefits of contextual chunking.
+
+**Note**: This method does not require a separate LLM call for each chunk, which makes it efficient.
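+
+For illustration, here is a minimal sketch of the enrichment step (the variable names and sample values are illustrative; the notebooks in this folder show the full pipeline):
+
+```
+# Prepend the generated keywords to the chunk text before computing its embedding,
+# so retrieval can match queries against the restored context.
+keywords = ["BMW X5", "pricing in France"]   # generated by Llama-3.1 for this chunk
+chunk = "Prices start at ..."                # original chunk text, subject not repeated
+enriched_chunk = "#" + ", ".join(keywords) + "\n" + chunk
+```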
+
+**Getting started**
+In this cookbook, we’ll use DeepInfra for Llama inference services, so be sure to obtain an API key from https://deepinfra.com/.
+You'll also need a LlamaParse API key to parse PDF files, which can be obtained from https://www.llamaindex.ai/.
+Additionally, we will use the "jinaai/jina-embeddings-v2-base-en" model from HuggingFace to generate text embeddings locally.
+Before getting started, update the <code>config.py</code> file as follows:
+    DEEPINFRA_API_KEY="<your_api_key>"
+    LLAMAPARSE_API_KEY="<your_api_key>"

+ 248 - 0
end-to-end-use-cases/Contextual-Chunking-RAG/Tutorial.ipynb

@@ -0,0 +1,248 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "Wt3pZ_J2tKeo"
+   },
+   "source": [
+    "# Tutorial\n",
+    "In this tutorial, we'll break a sample text document into chunks and generate contextual keywords for each one using Llama 3.1.\n",
+    "\n",
+    "Let's start by installing the required packages.\n",
+    "For Llama model inference, we use DeepInfra here, but you can use any inference service provider."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "LEz4Dfa9-i-Z"
+   },
+   "outputs": [],
+   "source": [
+    "#Install dependencies\n",
+    "!pip install tiktoken\n",
+    "!pip install openai\n",
+    "\n",
+    "from config import DEEPINFRA_API_KEY"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "mwW6s7EJm9ul"
+   },
+   "source": [
+    "First, obtain your document content. For this tutorial, the recommended document size ranges from 2,000 to 20,000 tokens."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "_Iooj2f76wXl"
+   },
+   "outputs": [],
+   "source": [
+    "document_content = \"\"\n",
+    "with open('./data/llama_article.txt', 'r') as file:\n",
+    "    document_content = file.read()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "q4IO9wIip8lV"
+   },
+   "source": [
+    "We will then split the document content into chunks of 300-1000 tokens."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "oek9g5Xom2XJ"
+   },
+   "outputs": [],
+   "source": [
+    "#split into chunks (simple way)\n",
+    "def split_into_chunks(content, chunk_size):\n",
+    "\timport tiktoken\n",
+    "\tenc = tiktoken.get_encoding(\"o200k_base\")\n",
+    "\ta = enc.encode(content)\n",
+    "\tleft, chunks = 0, []\n",
+    "\twhile left < len(a):\n",
+    "\t\tarr = a[left : left+chunk_size]\n",
+    "\t\tchunks.append(enc.decode(arr))\n",
+    "\t\tleft+=chunk_size\n",
+    "\treturn chunks\n",
+    "\n",
+    "chunks = split_into_chunks(document_content, 400)\n",
+    "\n",
+    "#generate chunked content\n",
+    "chunked_content = \"\"\n",
+    "for idx, text in enumerate(chunks):\n",
+    "  chunked_content+=f\"### Chunk {idx+1} ###\\n{text}\\n\\n\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "J4evTJGL83P-"
+   },
+   "source": [
+    "Now your `chunked_content` looks like this:\n",
+    "\n",
+    "```\n",
+    "### Chunk 1 ###\n",
+    "{chunk1}\n",
+    "\n",
+    "### Chunk 2 ###\n",
+    "{chunk2}\n",
+    "\n",
+    "..\n",
+    "```\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "iTBzN--tmIv-"
+   },
+   "source": [
+    "Next, generate contextual keywords to obtain a better chunk representation for embeddings. Here, we use DeepInfra servers for inference."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "2AqW8zgw9Ah2"
+   },
+   "outputs": [],
+   "source": [
+    "from openai import OpenAI\n",
+    "openai = OpenAI(api_key=DEEPINFRA_API_KEY, base_url=\"https://api.deepinfra.com/v1/openai\")\n",
+    "\n",
+    "def deepinfra_run(system_prompt, user_message):\n",
+    "\tchat_completion = openai.chat.completions.create(\n",
+    "\t    model=\"meta-llama/Meta-Llama-3.1-405B-Instruct\",\n",
+    "\t    messages=[{\"role\": \"system\", \"content\": system_prompt}, {\"role\": \"user\", \"content\": user_message}],\n",
+    "\t    max_tokens=4096\n",
+    "\t)\n",
+    "\treturn chat_completion.choices[0].message.content\n",
+    "\n",
+    "system_prompt = '''\n",
+    "Each chunk is separated as ### Chunk [id] ###. For each chunk, generate the keywords required to fully understand the chunk without needing to look at the previous chunks.\n",
+    "Don't just say \"List of services\", because it's unclear which services you are referring to. Make sure to cover all chunks.\n",
+    "Sample output:\n",
+    "Chunk 1: BMW X5, pricings in France\n",
+    "Chunk 2: BMW X5, discounts\n",
+    "'''\n",
+    "\n",
+    "keywords_st = deepinfra_run(system_prompt, chunked_content)\n",
+    "print(keywords_st)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "ucmJJjvuqpSU"
+   },
+   "source": [
+    "Next, we need to parse the generated keywords into an array.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 198
+    },
+    "id": "ikTy7nA8AFGz",
+    "outputId": "e8fba18d-a697-472e-b5cd-cb71f23c8a0d"
+   },
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "def parse_keywords(content):\n",
+    "    result = []\n",
+    "    lines = content.strip().split('\\n')\n",
+    "    current_chunk = None\n",
+    "    inline_pattern = re.compile(r'^\\s*[^#:]+\\s*:\\s*(.+)$')  # Matches lines like \"Chunk1: word1, word2\"\n",
+    "    section_pattern = re.compile(r'^###\\s*[^#]+\\s*###$')    # Matches lines like \"### Chunk1 ###\"\n",
+    "\n",
+    "    for line in lines:\n",
+    "        line = line.strip()\n",
+    "        if not line: continue\n",
+    "        inline_match = inline_pattern.match(line)\n",
+    "\n",
+    "        if inline_match:\n",
+    "            words_str = inline_match.group(1)\n",
+    "            words = [word.strip() for word in words_str.split(',') if word.strip()]\n",
+    "            result.append(words)\n",
+    "            continue\n",
+    "\n",
+    "        if section_pattern.match(line):\n",
+    "            if current_chunk:\n",
+    "                result.append(current_chunk)\n",
+    "                current_chunk = []\n",
+    "            continue\n",
+    "\n",
+    "        if current_chunk is not None:\n",
+    "            words = [word.strip() for word in line.split(',') if word.strip()]\n",
+    "            current_chunk.extend(words)\n",
+    "\n",
+    "    if current_chunk:\n",
+    "      result.append(current_chunk)\n",
+    "    return result\n",
+    "\n",
+    "\n",
+    "keywords = parse_keywords(keywords_st)\n",
+    "print(keywords)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "X1myu--0kUi0"
+   },
+   "source": [
+    "Now you can modify each chunk by prepending its generated keywords before embedding it. For example:\n",
+    "\n",
+    "```\n",
+    "enriched_chunk = \"#\" + \", \".join(keywords[0]) + \"\\n\" + chunks[0]\n",
+    "```"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}

+ 2 - 0
end-to-end-use-cases/Contextual-Chunking-RAG/config.py

@@ -0,0 +1,2 @@
+LLAMAPARSE_API_KEY=""
+DEEPINFRA_API_KEY=""

+ 61 - 0
end-to-end-use-cases/Contextual-Chunking-RAG/data/llama_article.txt

@@ -0,0 +1,61 @@
+Understanding the LLaMA Model: A Breakthrough in Large Language Models
+
+In recent years, large language models (LLMs) have revolutionized the field of natural language processing (NLP). Among them, Meta’s LLaMA (Large Language Model Meta AI) has emerged as a powerful, efficient, and open-weight model that provides high-quality text generation capabilities while being more accessible than proprietary alternatives. This article explores the architecture, capabilities, and applications of LLaMA, along with its significance in the AI landscape.
+1. Introduction to LLaMA
+
+LLaMA is a family of autoregressive transformer-based models designed by Meta AI. Unlike massive models like OpenAI’s GPT-4, which require extensive computational resources and are primarily closed-source, LLaMA aims to provide powerful language modeling in a more efficient and open format. The original LLaMA release included models ranging from 7 billion to 65 billion parameters, offering different levels of computational demand and performance.
+
+The second iteration, LLaMA 2, introduced in 2023, further improved efficiency, accuracy, and usability. LLaMA 2 models are available in 7B, 13B, and 65B parameter variants, with optimized training methodologies and increased alignment with human preferences.
+2. Architecture and Training
+
+LLaMA follows the transformer architecture, the foundation of most modern language models. Key architectural improvements and training strategies include:
+
+    Tokenization: LLaMA uses Byte Pair Encoding (BPE) for tokenization, ensuring better handling of various languages and token efficiency.
+    Efficient Training: Trained on a diverse dataset containing publicly available and licensed data, LLaMA reduces reliance on proprietary sources. The training process leverages a causal decoder-only transformer, meaning it predicts tokens autoregressively while attending to previous context.
+    Scaled Attention Mechanism: LLaMA incorporates Rotary Position Embeddings (RoPE) for efficient long-context understanding. This improves its ability to handle longer sequences compared to earlier models.
+    Memory Optimization: Unlike some larger models requiring thousands of GPUs for inference, LLaMA’s optimized weight distribution and efficient parameter scaling allow it to run on fewer computational resources while maintaining high performance.
+
+The training data includes code, technical documents, research papers, and general text, making LLaMA well-suited for various NLP tasks, from answering questions to generating detailed content.
+3. Performance and Benchmarks
+
+LLaMA models have demonstrated impressive performance across multiple benchmarks. The 65B variant outperforms GPT-3 (175B) on several standard NLP tasks while using significantly fewer parameters. Key benchmarking results include:
+
+    MMLU (Massive Multitask Language Understanding): LLaMA 2-65B achieves results comparable to GPT-4 in general knowledge and reasoning tasks.
+    ARC (AI2 Reasoning Challenge): LLaMA models show strong problem-solving capabilities, particularly in logic-based questions.
+    HellaSwag & PIQA: LLaMA performs well in commonsense reasoning, approaching human-level accuracy.
+    Code Generation: Though not primarily designed for coding, LLaMA exhibits notable competence in generating and completing programming code snippets.
+
+Despite being smaller than some competing models, LLaMA's efficiency enables it to achieve state-of-the-art performance per parameter count, making it a highly cost-effective solution.
+4. Applications of LLaMA
+
+The versatility of LLaMA enables a wide range of applications across industries, including:
+
+    Chatbots and Virtual Assistants: LLaMA powers intelligent conversational AI systems, providing human-like responses with improved contextual understanding.
+    Content Generation: From summarizing long documents to creating articles and reports, LLaMA is widely used for generating high-quality text.
+    Programming Assistance: Developers use LLaMA to generate code snippets, debug errors, and improve software development efficiency.
+    Scientific Research: The model helps researchers analyze papers, generate summaries, and assist in hypothesis generation.
+    Education and Tutoring: LLaMA aids in personalized learning, answering students’ queries and explaining complex topics interactively.
+
+Its open-weight availability also allows organizations to fine-tune the model on proprietary data, making it adaptable for specialized use cases such as medical AI, legal document analysis, and multilingual NLP tasks.
+5. Challenges and Limitations
+
+Despite its advantages, LLaMA faces several challenges:
+
+    Ethical Concerns: Like all LLMs, LLaMA can generate biased or misleading information. Efforts are ongoing to align the model with ethical AI principles.
+    Computational Costs: Although LLaMA is optimized for efficiency, larger variants still require significant GPU resources for fine-tuning and inference.
+    Context Length Limitations: While improved, LLaMA still has constraints on long-context reasoning compared to specialized extended-context models.
+    Security Risks: Open-weight models pose potential risks for misuse, such as generating harmful or deceptive content. Responsible deployment and monitoring are necessary.
+
+6. The Future of LLaMA
+
+Meta continues to refine the LLaMA model family, with research focused on improving alignment, reducing biases, and extending context understanding. Future iterations may include:
+
+    LLaMA 3 and Beyond: Expected advancements in parameter efficiency and multimodal capabilities.
+    Better Fine-Tuning Techniques: Enhancing adaptability for domain-specific applications.
+    Integration with Retrieval-Augmented Generation (RAG): Combining LLaMA with external knowledge sources for more accurate responses.
+    Edge Deployment: Efforts to make LLaMA smaller and faster for local AI applications without cloud dependence.
+
+As open-source AI research progresses, LLaMA remains a key player in democratizing access to powerful language models, enabling innovation across academia, business, and technology sectors.
+7. Conclusion
+
+LLaMA represents a significant step forward in making high-quality language models more accessible. By balancing efficiency, openness, and performance, it provides a compelling alternative to closed-source models like GPT-4. Whether for research, business applications, or general AI development, LLaMA offers a robust platform for advancing NLP capabilities while promoting transparency and innovation in AI.

+ 43 - 0
end-to-end-use-cases/Contextual-Chunking-RAG/embedding.py

@@ -0,0 +1,43 @@
+import torch
+from transformers import AutoTokenizer, AutoModel
+from llama_index.core.base.embeddings.base import BaseEmbedding
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+# Load tokenizer and model
+model_id = "jinaai/jina-embeddings-v2-base-en" #"jinaai/jina-embeddings-v3"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device)
+
+# Define function to generate embeddings
+def get_embedding(text):
+    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
+    with torch.no_grad():
+        outputs = model(**inputs)    
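+    # mean-pool the token embeddings over the sequence dimension to get a single vector per input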
+    return outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy() #.to(torch.float32)
+
+
+class LocalJinaEmbedding(BaseEmbedding):
+    def __init__(self):
+        super().__init__()
+
+    def _get_text_embedding(self, text):
+        return get_embedding(text).tolist()  # Ensure compatibility with LlamaIndex
+
+    def _get_query_embedding(self, query):
+        return get_embedding(query).tolist()
+    
+    async def _aget_query_embedding(self, query: str) -> list:
+        return get_embedding(query).tolist()
+
+
+
+def test(): #this did not produce reasonable results for some reason
+    #!pip install llama-index-embeddings-huggingface
+    from llama_index.embeddings.huggingface import HuggingFaceEmbedding 
+    embed_model = HuggingFaceEmbedding(model_name=model_id)
+
+
+if __name__=="__main__":
+	emb = get_embedding("hi there")
+	print(emb.shape)

+ 137 - 0
end-to-end-use-cases/Contextual-Chunking-RAG/helper.py

@@ -0,0 +1,137 @@
+import re
+import io
+import codecs
+import random
+from openai import OpenAI
+from config import DEEPINFRA_API_KEY
+
+openai = OpenAI(api_key=DEEPINFRA_API_KEY, base_url="https://api.deepinfra.com/v1/openai")
+#client = OpenAI(api_key=OPENAI_API_KEY)  # optional: define an OpenAI client (and an OPENAI_API_KEY) to use openai_run() below
+
+
+def file_put_contents(filename, st):
+	file = codecs.open(filename, "w", "utf-8")
+	file.write(st)
+	file.close()
+
+def file_get_contents(name):
+	f = io.open(name, mode="r", encoding="utf-8") #utf-8 | Windows-1252
+	return f.read()
+
+
+def openai_run(system_prompt, user_message):
+	messages = [{"role":"system", "content":system_prompt}, {"role":"user", "content":user_message}]    
+	completion = client.chat.completions.create(
+	  model="gpt-4o-mini", #"gpt-4o-2024-05-13",
+	  temperature=0,
+	  max_tokens=2000,
+	  messages=messages
+	)
+	message = completion.choices[0].message
+	return message.content    
+
+
+def deepinfra_run(system_prompt, user_message):
+	chat_completion = openai.chat.completions.create(
+		model="meta-llama/Meta-Llama-3.1-405B-Instruct",
+		messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": user_message}],
+		max_tokens=4096
+	)
+	return chat_completion.choices[0].message.content
+
+
+
+def get_llm_answer(chunks_content, user_message): #keywords + content
+	gp = "If the answer is not given below, say that you don't know it. Make sure to copy answers from the documents without changing them.\n" + chunks_content
+	answer = deepinfra_run(gp, user_message)
+	return answer
+
+
+
+def parse_keywords(content):
+	result = []
+	lines = content.strip().split('\n')
+	current_chunk = None
+	inline_pattern = re.compile(r'^\s*[^#:]+\s*:\s*(.+)$')  # Matches lines like "Chunk1: word1, word2"
+	#section_pattern = re.compile(r'^###\s*[^#]+\s*###$') #v1
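+	# v2 also matches header variants the model sometimes emits, e.g. "**Chunk 1**" or "** Chunk 3 **" (see temp() below)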
+	section_pattern = re.compile(r'[#\*]*\s*Chunk\s*\d+\s*[#\*]*') #v2
+ 
+	for line in lines:
+		line = line.strip()
+		if not line: continue
+		inline_match = inline_pattern.match(line)
+
+		if inline_pattern.match(line) and "Chunk" in line:			
+			words_str = inline_match.group(1)
+			words = [word.strip() for word in words_str.split(',') if word.strip()]
+			result.append(words)
+
+		elif section_pattern.match(line):			
+			if current_chunk: result.append(current_chunk)
+			current_chunk = []
+
+		elif current_chunk is not None: #section_pattern continuation
+			words = [word.strip() for word in line.split(',') if word.strip()]
+			current_chunk.extend(words)
+
+	if current_chunk: result.append(current_chunk)
+	return result
+
+
+
+def generate_contextual_keywords(chunked_content):
+	system_prompt = '''
+	Each chunk is separated as ### Chunk [id] ###. For each chunk, generate the keywords required to fully understand the chunk without needing to look at the previous chunks.
+	Don't just say "List of services", because it's unclear which services you are referring to. Make sure to cover all chunks.
+	Sample output:
+	Chunk 1: BMW X5, pricings in France
+	Chunk 2: BMW X5, discounts
+	'''
+	keywords_st = deepinfra_run(system_prompt, chunked_content)
+	print("Keywords_st:\n", keywords_st, "\n")
+	keywords = parse_keywords(keywords_st)    
+	return keywords
+
+
+def generate_questions_bychunk(chunks):
+	system_prompt = '''
+ Given a chunk from a document, generate 1-3 questions related to the chunk. Each question must be self-contained and not require additional context.
+ Example output:
+ 1. How do I open a new account?
+ 2. How much does a BMW X5 cost?
+	'''	
+	n = len(chunks)
+	indexes = [i for i in range(n)]
+	random.shuffle(indexes)
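+	# generate questions for a random subset of chunks: roughly one fifth of them, capped at 60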
+	for idx in indexes[: min(n//5, 60)]:
+		chunk  = chunks[idx]
+		text = "#"+(", ".join(chunk["keywords"]))+"\n"+chunk["content"]
+		out =  deepinfra_run(system_prompt, text) #anthropic_run(system_prompt, text)
+		question_pattern = re.compile(r'^\s*\d+\.\s+(.*)', re.MULTILINE)
+		questions = question_pattern.findall(out)
+		chunk["questions"] = questions
+		chunk["idx"] = idx
+	return chunks
+
+	
+
+def temp():
+	st = '''
+Here are the keywords for each chunk:
+
+**Chunk 1**
+3M, industrial and consumer products, electrical power transmission, renewable energy, infrastructure, Communication Markets Division, Germany
+
+### Chunk 2 ###
+3M, consumer retail, office supply products, home improvement products, Scotch brand, Post-it Products, Filtrete Filters, Thinsulate Insulation
+
+** Chunk 3 **
+3M, patents, trademarks, research and development, inventions, intellectual property, legal protection
+'''
+	print( parse_keywords(st) )
+
+
+
+if __name__=="__main__":
+	temp()
+