
detailed - march 5

1) kept only one pdf in data,
2) removed anthropic function call
AILabs 2 months ago
parent commit 551fa78b6d

+ 1 - 1
end-to-end-use-cases/Contextual-Chunking-RAG/Example_FinancialReport_RAG.ipynb

@@ -65,7 +65,7 @@
     "\n",
     "# use SimpleDirectoryReader to parse our file\n",
     "file_extractor = {\".pdf\": parser}\n",
-    "documents = SimpleDirectoryReader(input_files=['./data/ADOBE_2015_10K.pdf'], file_extractor=file_extractor).load_data()\n",
+    "documents = SimpleDirectoryReader(input_files=['./data/AMAZON_2015_10K.pdf'], file_extractor=file_extractor).load_data()\n",
     "print(\"pdf file pages:\", len(documents))"
    ]
   },
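For reference, a minimal sketch of how this loading step is typically wired up with LlamaIndex is shown below. The `parser` object is defined in a cell outside this hunk, so the LlamaParse configuration used here is an assumption, not the notebook's exact code.

```python
# Minimal sketch (assumed setup): parse the 10-K PDF with LlamaParse and
# load it through SimpleDirectoryReader, as in the hunk above.
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

# `parser` is defined elsewhere in the notebook; a LlamaParse instance is
# assumed here (requires LLAMA_CLOUD_API_KEY in the environment).
parser = LlamaParse(result_type="markdown")

# Route .pdf files through the parser when loading the document.
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    input_files=["./data/AMAZON_2015_10K.pdf"],
    file_extractor=file_extractor,
).load_data()
print("pdf file pages:", len(documents))
```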

+ 2 - 2
end-to-end-use-cases/Contextual-Chunking-RAG/Tutorial.ipynb

@@ -34,7 +34,7 @@
     "id": "mwW6s7EJm9ul"
    },
    "source": [
-    "First of all, get your document content. Preferred document size is between 2k and 20k"
+    "First, obtain your document content. For this tutorial, the recommended document size ranges from 2,000 to 20,000 tokens."
    ]
   },
   {
@@ -84,7 +84,7 @@
     "#generate chunked content\n",
     "chunked_content = \"\"\n",
     "for idx, text in enumerate(chunks):\n",
-    "  chunked_content+=f\"### Chunk {idx+1} ###\\n{text}\\n\\n\"\n"
+    "  chunked_content+=f\"### Chunk {idx+1} ###\\n{text}\\n\\n\""
    ]
   },
   {
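The loop in this hunk concatenates every chunk under a "### Chunk N ###" header so that keywords for all chunks can be requested in a single LLM call. A self-contained sketch of that pattern follows; the chunking strategy and the prompt are assumptions, not the tutorial's exact code.

```python
# Sketch of the pattern above: label each chunk, concatenate, and send one
# request covering all chunks. The fixed-size character split is a
# placeholder for whatever splitter the tutorial actually uses.
def build_chunked_content(document_text, chunk_size=1000):
    chunks = [document_text[i:i + chunk_size]
              for i in range(0, len(document_text), chunk_size)]
    chunked_content = ""
    for idx, text in enumerate(chunks):
        chunked_content += f"### Chunk {idx+1} ###\n{text}\n\n"
    return chunks, chunked_content

# Hypothetical single call (deepinfra_run is the helper diffed further down):
# system_prompt = "For each '### Chunk N ###' section, return 5 contextual keywords."
# keywords_text = deepinfra_run(system_prompt, chunked_content)
```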

BIN
end-to-end-use-cases/Contextual-Chunking-RAG/data/3M_2018_10K.pdf


BIN
end-to-end-use-cases/Contextual-Chunking-RAG/data/ADOBE_2015_10K.pdf


+ 0 - 13
end-to-end-use-cases/Contextual-Chunking-RAG/data/README.md

@@ -1,13 +0,0 @@
-# Contextual keywords generation for RAG using Llama-3.1
-
-**Problem**: Independent chunking in traditional RAG systems leads to the loss of contextual information between chunks. This makes it difficult for LLMs to retrieve relevant data when context (e.g., the subject or entity being discussed) is not explicitly repeated within individual chunks.
-
-**Solution**: Generate keywords for each chunk to fulfill missing contextual information. These keywords (e.g., "BMW, X5, pricing") enrich the chunk with necessary context, ensuring better retrieval accuracy. By embedding this enriched metadata, the system bridges gaps between related chunks, enabling effective query matching and accurate answer generation.
-
-[This article](https://medium.com/@ailabs/overcoming-independent-chunking-in-rag-systems-a-hybrid-approach-5d2c205b3732) explains benefits of contextual chunking.
-
-**Note** This method does not require calling LLM for each chunk separately, which makes it efficient.
-
-**Getting started**
-In this tutorial, we will use the https://deepinfra.com/ for inference services. So make sure to get API key from there. 
-Then create config.py file that contains "DEEPINFRA_API_LKE"
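The README being deleted here describes enriching each chunk with generated keywords before embedding. As a rough, hypothetical illustration of that step (the function and variable names below are not from the repo):

```python
# Hypothetical sketch: prepend the generated keywords to each chunk so the
# embedded text carries the context (e.g., "BMW, X5, pricing") that the
# chunk alone would lack.
def enrich_chunks(chunks, keywords_per_chunk):
    enriched = []
    for text, keywords in zip(chunks, keywords_per_chunk):
        enriched.append(f"Keywords: {', '.join(keywords)}\n\n{text}")
    return enriched
```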

Changes for this file are not shown because it is too large.
+ 0 - 150
end-to-end-use-cases/Contextual-Chunking-RAG/data/financebench_open_source.jsonl


+ 0 - 16
end-to-end-use-cases/Contextual-Chunking-RAG/helper.py

@@ -31,22 +31,6 @@ def openai_run(system_prompt, user_message):
 	return message.content    
 
 
-def anthropic_run(system_prompt, user_message):
-	import anthropic 
-	client = anthropic.Anthropic(  
-	api_key=ANTHROPIC_API_KEY,
-	)
-	message = client.messages.create(
-	model="claude-3-sonnet-20240229", #"claude-3-opus-20240229",
-	max_tokens=4096,
-	system=system_prompt,
-	messages=[
-	 {"role": "user", "content": user_message}
-	]
-	)
-	return message.content[0].text
-
-
 def deepinfra_run(system_prompt, user_message):
 	chat_completion = openai.chat.completions.create(
 		model="meta-llama/Meta-Llama-3.1-405B-Instruct",

Changes for this file are not shown because it is too large.
+ 0 - 1489
end-to-end-use-cases/Contextual-Chunking-RAG/results/keywords_top5.txt


Changes for this file are not shown because it is too large.
+ 0 - 1490
end-to-end-use-cases/Contextual-Chunking-RAG/results/nokeywords_top5