# Tutorial
In this tutorial, we'll break a sample text document into chunks and generate contextual keywords for each one using Llama 3.1.

Let's start by installing the required packages.
For Llama model inference, we use DeepInfra here, but you can use any inference service provider.

In [None]:
#Install dependencies
!pip install tiktoken
!pip install openai

from config import DEEPINFRA_API_KEY

First, obtain your document content. For this tutorial, the recommended document size ranges from 2,000 to 20,000 tokens.

In [None]:
document_content = ""
with open('./data/llama_article.txt', 'r') as file:
    document_content = file.read()


Next, split the document content into chunks of 300 to 1,000 tokens. The example below uses a chunk size of 400 tokens.

In [None]:
# Split into chunks (simple way): fixed-size, non-overlapping token windows
import tiktoken

def split_into_chunks(content, chunk_size):
    enc = tiktoken.get_encoding("o200k_base")
    tokens = enc.encode(content)
    left, chunks = 0, []
    while left < len(tokens):
        # Take the next chunk_size tokens and decode them back to text
        window = tokens[left : left + chunk_size]
        chunks.append(enc.decode(window))
        left += chunk_size
    return chunks

chunks = split_into_chunks(document_content, 400)

# Join the chunks into one string, each under its own "### Chunk N ###" header
chunked_content = ""
for idx, text in enumerate(chunks):
    chunked_content += f"### Chunk {idx+1} ###\n{text}\n\n"

Now your chunked_content looks like this:

```
### Chunk 1 ###
{chunk1}

### Chunk 2 ###
{chunk2}

..
```
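To make the chunk boundaries concrete, here is a minimal sketch of the same fixed-window logic on a tiny input. It assumes whitespace-separated words as a stand-in for tiktoken tokens, so it runs without any dependencies; `split_tokens` is a hypothetical helper, not part of the tutorial code.

```python
def split_tokens(tokens, chunk_size):
    """Split a token list into consecutive, non-overlapping windows."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# Assumption: words stand in for real tokenizer output.
tokens = "the quick brown fox jumps over the lazy dog".split()
windows = split_tokens(tokens, 4)

# Build the same "### Chunk N ###" layout used for chunked_content.
chunked = ""
for idx, window in enumerate(windows):
    chunked += f"### Chunk {idx + 1} ###\n{' '.join(window)}\n\n"

print(len(windows))  # 3 windows: 4 + 4 + 1 tokens
print(chunked)
```

Note that the last window is shorter than the others whenever the token count is not a multiple of the chunk size.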


Next, generate contextual keywords for each chunk, so that the text you embed represents the chunk better. Here, we use DeepInfra servers for inference.

In [None]:
from openai import OpenAI
openai = OpenAI(api_key=DEEPINFRA_API_KEY, base_url="https://api.deepinfra.com/v1/openai")

def deepinfra_run(system_prompt, user_message):
	chat_completion = openai.chat.completions.create(
	 model="meta-llama/Meta-Llama-3.1-405B-Instruct",
	 messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": user_message}],
	 max_tokens=4096
	)
	return chat_completion.choices[0].message.content

system_prompt = '''
Each chunk is separated as ### Chunk [id] ###. For each chunk generate keywords required to fully understand the chunk without any need for looking at the previous chunks.
Don't just say "List of services", because it's unclear which services you are referring to. Make sure to cover all chunks.
Sample output:
Chunk 1: BMW X5, pricings in France
Chunk 2: BMW X5, discounts
'''

keywords_st = deepinfra_run(system_prompt, chunked_content)
print(keywords_st)

Next, we need to parse the generated keywords into a list, with one list of keywords per chunk.


In [None]:
import re

def parse_keywords(content):
    """Parse the model output into a list of keyword lists, one per chunk."""
    result = []
    lines = content.strip().split('\n')
    current_chunk = []
    inline_pattern = re.compile(r'^\s*[^#:]+\s*:\s*(.+)$')  # Matches lines like "Chunk1: word1, word2"
    section_pattern = re.compile(r'^###\s*[^#]+\s*###$')  # Matches lines like "### Chunk1 ###"

    for line in lines:
        line = line.strip()
        if not line:
            continue
        inline_match = inline_pattern.match(line)

        # Inline form: the keywords follow the chunk label on the same line
        if inline_match:
            words_str = inline_match.group(1)
            words = [word.strip() for word in words_str.split(',') if word.strip()]
            result.append(words)
            continue

        # Section form: a header line closes the previous chunk and starts a new one
        if section_pattern.match(line):
            if current_chunk:
                result.append(current_chunk)
            current_chunk = []
            continue

        # Any other line holds keywords that belong to the current section
        words = [word.strip() for word in line.split(',') if word.strip()]
        current_chunk.extend(words)

    # Flush the last section, if any
    if current_chunk:
        result.append(current_chunk)
    return result


keywords = parse_keywords(keywords_st)
print(keywords)
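To see what the inline pattern captures, here is a minimal sketch that runs the same regular expression on the sample output from the system prompt. The final check is an assumption about how you might validate the result: the prompt asks the model to cover all chunks, but nothing guarantees that it does, so comparing the number of keyword lists with the number of chunks is a cheap safeguard. `num_chunks` and `is_complete` are hypothetical names.

```python
import re

# Same inline pattern as in parse_keywords.
inline_pattern = re.compile(r'^\s*[^#:]+\s*:\s*(.+)$')

sample_output = "Chunk 1: BMW X5, pricings in France\nChunk 2: BMW X5, discounts"

parsed = []
for line in sample_output.split("\n"):
    match = inline_pattern.match(line.strip())
    if match:
        # group(1) is everything after the first colon
        parsed.append([w.strip() for w in match.group(1).split(",") if w.strip()])

print(parsed)
# [['BMW X5', 'pricings in France'], ['BMW X5', 'discounts']]

# Hypothetical check: one non-empty keyword list per chunk.
num_chunks = 2  # stands in for len(chunks)
is_complete = len(parsed) == num_chunks and all(parsed)
print(is_complete)
```

Because the parser discards the chunk labels, a skipped chunk would shift every later keyword list by one position, which is why a count mismatch is worth catching before the keywords are attached to the chunks.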

Now you can modify the chunks using the generated keywords, for example by prepending each chunk's keywords to its text:

```
chunk1 = #{keywords1}\n{chunk1}
```
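This last step can be sketched as follows. It is a minimal example, assuming `chunks` is the list of chunk strings and `keywords` is the list of keyword lists from the parsing step; `add_keywords_to_chunks` is a hypothetical helper, and joining the keywords with a comma and a space is an assumption, since the format above does not specify a separator.

```python
def add_keywords_to_chunks(chunks, keywords):
    """Prepend each chunk's keywords as a '#'-prefixed first line."""
    if len(chunks) != len(keywords):
        # A mismatch means the model skipped or merged chunks.
        raise ValueError(
            f"Got {len(keywords)} keyword lists for {len(chunks)} chunks"
        )
    return [f"#{', '.join(kw)}\n{chunk}" for chunk, kw in zip(chunks, keywords)]

# Toy data standing in for the real chunks and parsed keywords.
example_chunks = ["The base price starts at ...", "Until the end of the month ..."]
example_keywords = [["BMW X5", "pricings in France"], ["BMW X5", "discounts"]]

enriched = add_keywords_to_chunks(example_chunks, example_keywords)
print(enriched[0])
```

The enriched strings, rather than the raw chunks, are then what you would pass to your embedding model.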