
# Train your own GPT compatible Tokenizer and obtain vocab.json & merges.txt
---

## Learning Objectives
- **The goal of this lab is to show you how to train your own GPT BPE tokenizer on your own raw text data.**
    - Train a GPT compatible tokenizer on your own text data in your own language:
        1. Option 1: load pretrained vocab and merge files, then fit them to the new corpus
        2. Option 2: train a GPT compatible tokenizer from scratch

This notebook walks through how to train your own GPT compatible tokenizer and obtain the vocab and merge files.
We will use HuggingFace's ByteLevel BPE tokenizer and trainer to complete this task.
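
As a rough illustration of the from-scratch option, the sketch below trains a byte-level BPE tokenizer with the `tokenizers` library's `ByteLevelBPETokenizer` and writes `vocab.json` and `merges.txt`. It is not the code of `trainGPTTokenizer.py`: the function name `train_gpt_bpe`, the `min_frequency=2` setting, and the `<|endoftext|>` special token are assumptions chosen for illustration.

```python
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer


def train_gpt_bpe(infile, out_dir, vocab_size=32000):
    """Train a byte-level BPE tokenizer and save vocab.json and merges.txt.

    Hypothetical helper: the special token and min_frequency are assumptions,
    not values read from the lab's script.
    """
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=[str(infile)],
        vocab_size=vocab_size,
        min_frequency=2,                   # assumed threshold for merging a pair
        show_progress=False,
        special_tokens=["<|endoftext|>"],  # assumed GPT-2 style end-of-text token
    )
    # save_model expects an existing directory
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved_files = tokenizer.save_model(str(out_dir))
    return tokenizer, saved_files
```

With the path variables defined later in this notebook, a call might look like `train_gpt_bpe(raw_text_path, output_trained_tokenizer_model_path, vocab_size=32000)`.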

--------------------------------------------------------------------------------------------------------------------
We need to install the [HuggingFace Tokenizers library](https://huggingface.co/transformers/installation.html) with `pip install tokenizers`.

In [1]:
!pip install tokenizers

Defaulting to user installation because normal site-packages is not writeable


In [3]:
raw_text_path='../dataset/SV/webnyheter2013.txt'
output_trained_tokenizer_model_path='../dataset/SV/32k/'
pretrained_gpt_dir='./Megatron-LM'
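
Before launching training, it can help to check that the input file exists and that the output directory is in place. The sketch below is one way to do that with the standard library; `prepare_paths` is a hypothetical helper and is not part of the lab's script.

```python
from pathlib import Path


def prepare_paths(raw_text_path, output_dir):
    """Check the raw text file exists and create the output directory if needed."""
    raw = Path(raw_text_path)
    if not raw.is_file():
        raise FileNotFoundError(f"raw text file not found: {raw}")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return raw, out
```

In this notebook it could be called as `prepare_paths(raw_text_path, output_trained_tokenizer_model_path)`.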

-------------------------------------------------------------------------------
## How to use the Python script below

    trainGPTTokenizer.py [-h]

    optional arguments:
      -h, --help            show this help message and exit
      --infile INFILE       path to the text files
      --bpe_path BPE_PATH   output GPTBPT path
      --load_pretrained     load pretrained GPT model
      --pretrained_gpt_dir PRETRAINED_GPT_DIR
                            path to pretrained gpt vocab and merge files, default None
      --incl_special_toks   load pretrained BPE model
      --vocab_size VOCAB_SIZE
                            specify the vocab_size when training HF GPTBPE for own language usually 16k/32k/48k/64k
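
The help text above suggests an `argparse` interface along the following lines. This is a reconstruction from the listed flags only, not the script's actual source: the argument types, the `store_true` actions, and the reworded help string for `--incl_special_toks` are assumptions.

```python
import argparse


def build_parser():
    """Rebuild a parser matching the flags shown in the help text (a sketch)."""
    parser = argparse.ArgumentParser(prog="trainGPTTokenizer.py")
    parser.add_argument("--infile", type=str, help="path to the text files")
    parser.add_argument("--bpe_path", type=str, help="output path for the trained BPE files")
    # Flags without a metavar in the help text are assumed to be boolean switches
    parser.add_argument("--load_pretrained", action="store_true",
                        help="load pretrained GPT vocab and merge files")
    parser.add_argument("--pretrained_gpt_dir", type=str, default=None,
                        help="path to pretrained gpt vocab and merge files, default None")
    parser.add_argument("--incl_special_toks", action="store_true",
                        help="include special tokens (assumed meaning)")
    parser.add_argument("--vocab_size", type=int,
                        help="target vocabulary size, usually 16k/32k/48k/64k")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Parsing the arguments used in the first training cell of this notebook would give `args.load_pretrained == True` and `args.vocab_size == 32000`.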

---
## Load pretrained vocab and merge files into the trainer, then train on the new text
#### Output should be similar to the following:
 
 loading gpt2bpe english vocab and merge 
 include minimal special token end of text 
 [00:00:00] Pre-processing files (914 Mo) ░░░░░░░░ 0%
 [00:00:02] Pre-processing files (914 Mo) ░░░░░░░░ 1%
 [00:00:05] Pre-processing files (914 Mo) ░░░░░░░░ 2%
 [00:00:07] Pre-processing files (914 Mo) ░░░░░░░░ 3%
 [00:00:10] Pre-processing files (914 Mo) ░░░░░░░░ 4%
 ....
 [00:00:19] Compute merges ███████░ 30080 / 32000
 [00:00:19] Compute merges ███████░ 31040 / 32000
 [00:00:19] Compute merges ████████ 31743 / 31743

 Trained vocab size: 32000
 saving trained BPE model to : ./Megatron-LM/dataset/EN/32k/
 model saved ! 

In [4]:
!python ./Megatron-LM/sv_utils/trainGPTTokenizer.py --infile $raw_text_path --bpe_path $output_trained_tokenizer_model_path --load_pretrained --pretrained_gpt_dir=$pretrained_gpt_dir --vocab_size 32000

loading gpt2bpe english vocab and merge 

include minimal special token end of text 
[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 0%
[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 1%
[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 2%
[00:00:01] Pre-processing files (136 Mo) ░░░░░░░░ 3%
[00:00:01] Pre-processing files (136 Mo) ░░░░░░░░ 4%
[00:00:01] Pre-processing files (136 Mo) ░░░░░░░░ 5%
[00:00:02] Pre-processing files (136 Mo) ░░░░░░░░ 6%
[00:00:02] Pre-processing files (136 Mo) ░░░░░░░░ 7%
[00:00:02] Pre-processing files (136 Mo) ░░░░░░░░ 8%
[00:00:03] Pre-processing files (136 Mo) ░░░░░░░░ 9%
[00:00:03] Pre-processing files (136 Mo) ░░░░░░░░ 10%
[00:00:04] Pre-processing files (136 Mo) ░░░░░░░░ 11%
[00:00:04] Pre-processing files (136 Mo) ░░░░░░░░ 12%
[00:00:04] Pre-processing files (136 Mo) █░░░░░░░ 13%

---
## Train completely from scratch on the raw text to obtain the vocab.json and merges.txt files
#### Output should be similar to the following:
 include minimal special token end of text 
 [00:00:00] Pre-processing files (914 Mo) ░░░░░░░░ 0%
 [00:00:02] Pre-processing files (914 Mo) ░░░░░░░░ 1%
 [00:00:05] Pre-processing files (914 Mo) ░░░░░░░░ 2%
 [00:00:07] Pre-processing files (914 Mo) ░░░░░░░░ 3%
 ...
 [00:00:18] Compute merges ███████░ 30400 / 32000
 [00:00:18] Compute merges ███████░ 31360 / 32000
 [00:00:19] Compute merges ████████ 31743 / 31743

 Trained vocab size: 32000
 saving trained BPE model to : ./Megatron-LM/dataset/EN/32k/
 model saved ! 
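
Once the model is saved, the two output files can be inspected with the standard library alone. The sketch below assumes the usual HuggingFace layout: `vocab.json` maps each token string to an integer id, and `merges.txt` holds one merge pair per line after a `#version` header. The helper name `summarize_bpe_files` is hypothetical. In the sample log, 31743 merges alongside a vocab size of 32000 would be consistent with 256 byte-level base symbols plus one special token (256 + 1 + 31743 = 32000).

```python
import json
from pathlib import Path


def summarize_bpe_files(model_dir):
    """Report vocab size and merge count for a saved byte-level BPE model."""
    model_dir = Path(model_dir)
    with open(model_dir / "vocab.json", encoding="utf-8") as f:
        vocab = json.load(f)  # token string -> integer id
    with open(model_dir / "merges.txt", encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    # Skip the header line and any empty trailing line
    merges = [ln for ln in lines if ln and not ln.startswith("#version")]
    return {
        "vocab_size": len(vocab),
        "num_merges": len(merges),
        # True when ids run 0..N-1 with no gaps or duplicates
        "ids_contiguous": sorted(vocab.values()) == list(range(len(vocab))),
    }
```

For example, `summarize_bpe_files(output_trained_tokenizer_model_path)` can be compared against the `Trained vocab size` line printed by the script.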


In [13]:
!python ./Megatron-LM/sv_utils/trainGPTTokenizer.py --infile $raw_text_path --bpe_path $output_trained_tokenizer_model_path --vocab_size 32000

include minimal special token end of text 
[00:00:00] Pre-processing files (914 Mo) ░░░░░░░░ 0%
[00:00:02] Pre-processing files (914 Mo) ░░░░░░░░ 1%
[00:00:05] Pre-processing files (914 Mo) ░░░░░░░░ 2%
[00:00:07] Pre-processing files (914 Mo) ░░░░░░░░ 3%
[00:00:10] Pre-processing files (914 Mo) ░░░░░░░░ 4%
[00:00:13] Pre-processing files (914 Mo) ░░░░░░░░ 5%
[00:00:15] Pre-processing files (914 Mo) ░░░░░░░░ 6%
[00:00:18] Pre-processing files (914 Mo) ░░░░░░░░ 7%
[00:00:21] Pre-processing files (914 Mo) ░░░░░░░░ 8%
[00:00:23] Pre-processing files (914 Mo) ░░░░░░░░ 9%
[00:00:26] Pre-processing files (914 Mo) ░░░░░░░░ 10%
[00:00:29] Pre-processing files (914 Mo) ░░░░░░░░ 11%
[00:00:31] Pre-processing files (914 Mo) ░░░░░░░░ 12%
[00:00:34] Pre-processing files (914 Mo) █░░░░░░░ 13%
[00:00:37] Pre-processing files (914 Mo) █░

---
## Up Next

[Customize the data preprocessing Python script and convert to mmap](./Day3-4_customize_process2mmap.ipynb)

## Back To Start Menu
[start menu](../Start_Here.ipynb)

-----


## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). 