{ "cells": [ { "cell_type": "markdown", "id": "rapid-arctic", "metadata": {}, "source": [ "# Train custom GPTBPE Tokenzer \n", "---\n", "\n", "## Learning Objectives\n", "\n", "In order to include the vocabulary of the local language, in this case it is Swedish, into GPTBPE tokenizer, we need to be able to train GPTBPE Tokenizer on local language raw text data. The trained GPTBPE Tokenizer will produce it's own vocab.json and merges.txt files which will be compatible with Megatron-LM's GPTBPE Tokenizer. \n", "\n", "Previously in `Lab2-1_acquiring_data.ipynb`, we have acquired our own Swedish raw text data extracted from data source språkbank.\n", "Therefore, the goal of this notebook, is to train our own GPTBPE Tokenizer on the Swedish raw text data obtained from `Lab2-1_acquiring_data.ipynb`.\n", "\n", "We can either choose to load a previously trained GPTBPE Tokenizer by providing the vocab.json and merges.txt files to the GPTBPE Tokenizer before training further with the raw text data, or we can choose to train a completely new GPTBPE Tokenizer from scratch.\n", "\n", "The two options are covered in this notebook :\n", "\n", " 1. option 1 - load from pretrained vocab and merge files, then continue training with the new raw text.\n", " 2. option 2 - train a GPT compatible tokenizer from scratch.\n", "\n", "\n", "We will use HuggingFace's Tokenizer library and the trainer function in order to train our own GPTBPE Tokenizer with our own raw text data.\n", "\n", "\n", "First, we will install the [HuggingFace Tokenizer library](https://huggingface.co/transformers/installation.html)" ] }, { "cell_type": "code", "execution_count": null, "id": "suspended-peace", "metadata": {}, "outputs": [], "source": [ "!pip install tokenizers" ] }, { "cell_type": "code", "execution_count": null, "id": "active-artwork", "metadata": {}, "outputs": [], "source": [ "raw_text_path='../dataset/SV/webnyheter2013.txt'\n", "output_trained_tokenizer_model_path='../dataset/SV/56k/'\n", "pretrained_gpt_dir='../dataset/EN/50k/'" ] }, { "cell_type": "markdown", "id": "average-boundary", "metadata": {}, "source": [ "A python script for training custom GPTBPE Tokenizer is provided for your convenience : \n", "\n", "To view the python script, click on [trainGPTTokenizer.py](./Megatron-LM/sv_utils/trainGPTTokenizer.py)\n", "\n", " trainGPTTokenizer.py [-h] \n", "\n", " optional arguments:\n", " -h, --help show this help message and exit\n", " --infile INFILE path to the text files\n", " --bpe_path BPE_PATH output GPTBPT path\n", " --load_pretrained load pretrained GPT model\n", " --pretrained_gpt_dir PRETRAINED_GPT_DIR\n", " path to pretrained gpt vocab and merge files, default None\n", " --incl_special_toks load pretrained BPE model\n", " --vocab_size VOCAB_SIZE\n", " specify the vocab_size when training HF GPTBPE for own language usually 16k/32k/48k/64k" ] }, { "cell_type": "markdown", "id": "harmful-grounds", "metadata": {}, "source": [ "1. option 1 - load from pretrained vocab and merge files, then continue training with the new raw text." 
] }, { "cell_type": "code", "execution_count": null, "id": "perceived-jerusalem", "metadata": {}, "outputs": [], "source": [ "!python ./Megatron-LM/sv_utils/trainGPTTokenizer.py --infile $raw_text_path --bpe_path $output_trained_tokenizer_model_path --load_pretrained --pretrained_gpt_dir=$pretrained_gpt_dir --vocab_size 56000" ] }, { "cell_type": "markdown", "id": "weird-index", "metadata": {}, "source": [ "Below is the expected outputs :\n", " \n", " [00:00:14] Compute merges ███████░ 51520 / 56000\n", " [00:00:14] Compute merges ███████░ 52640 / 56000\n", " [00:00:14] Compute merges ███████░ 53760 / 56000\n", " [00:00:14] Compute merges ███████░ 54880 / 56000\n", " [00:00:14] Compute merges ████████ 55743 / 55743\n", "\n", " Trained vocab size: 56000\n", " saving trained BPE model to : ../dataset/SV/56k/\n", " model saved ! \n", "\n", "\n", "\n", " testing ...\n", "\n", "\n", "\n", " ['ĠHar', 'ĠnÃ¥gon', 'Ġfunderat', 'ĠpÃ¥', 'Ġvarför', 'Ġman', 'Ġinte', 'ĠfÃ¥r', 'Ġinom', 'hu', 'ste', 'peratur', 'ens', 'Ġkurva', 'Ġsynlig', 'Ġi', 'Ġgraf', 'en', '?', 'ĠÃĦr', 'Ġdet', 'ĠnÃ¥gon', 'Ġsom', 'ĠfrÃ¥gat', 'ĠTher', 'm', 'ia', '?', 'ĠSkulle', 'Ġdet', 'Ġinte', 'Ġvara', 'Ġväsentligt', 'Ġatt', 'Ġkunna', 'Ġkolla', 'Ġhistor', 'iken', 'ĠpÃ¥', 'Ġden', 'ĠdÃ¥', 'Ġman', 'Ġskall', 'Ġställa', 'Ġin', 'Ġkurvan', '?']" ] }, { "cell_type": "code", "execution_count": null, "id": "emerging-music", "metadata": {}, "outputs": [], "source": [ "## verify merges.txt and vocab.json exist\n", "!ls ../dataset/SV/56k/" ] }, { "cell_type": "markdown", "id": "filled-blast", "metadata": {}, "source": [ "2. option 2 - train a GPT compatible tokenizer from scratch." ] }, { "cell_type": "code", "execution_count": null, "id": "uniform-complaint", "metadata": {}, "outputs": [], "source": [ "raw_text_path='../dataset/SV/webnyheter2013.txt'\n", "output_trained_tokenizer_model_path='../dataset/SV/32k/'" ] }, { "cell_type": "code", "execution_count": null, "id": "disabled-pencil", "metadata": {}, "outputs": [], "source": [ "!python ./Megatron-LM/sv_utils/trainGPTTokenizer.py --infile $raw_text_path --bpe_path $output_trained_tokenizer_model_path --vocab_size 32000" ] }, { "cell_type": "markdown", "id": "tender-magic", "metadata": {}, "source": [ "Below is the expected outputs :\n", " \n", " [00:00:11] Compute merges ███████░ 30720 / 32000\n", " [00:00:11] Compute merges ███████░ 31360 / 32000\n", " [00:00:12] Compute merges ████████ 31743 / 31743\n", "\n", " Trained vocab size: 32000\n", " saving trained BPE model to : ../dataset/SV/32k/\n", " model saved ! 
\n", "\n", "\n", "\n", " testing ...\n", "\n", "\n", "\n", " ['ĠHar', 'ĠnÃ¥gon', 'Ġfunderat', 'ĠpÃ¥', 'Ġvarför', 'Ġman', 'Ġinte', 'ĠfÃ¥r', 'Ġinom', 'hu', 'ste', 'peratur', 'ens', 'Ġkurva', 'Ġsynlig', 'Ġi', 'Ġgraf', 'en', '?', 'ĠÃĦr', 'Ġdet', 'ĠnÃ¥gon', 'Ġsom', 'ĠfrÃ¥gat', 'ĠTher', 'm', 'ia', '?', 'ĠSkulle', 'Ġdet', 'Ġinte', 'Ġvara', 'Ġväsentligt', 'Ġatt', 'Ġkunna', 'Ġkolla', 'Ġhistor', 'iken', 'ĠpÃ¥', 'Ġden', 'ĠdÃ¥', 'Ġman', 'Ġskall', 'Ġställa', 'Ġin', 'Ġkurvan', '?']" ] }, { "cell_type": "code", "execution_count": null, "id": "criminal-leadership", "metadata": {}, "outputs": [], "source": [ "## verify the merges.txt and vocab.json exist \n", "!ls ../dataset/SV/32k/" ] }, { "cell_type": "markdown", "id": "orange-alignment", "metadata": {}, "source": [ "--- \n", "## Links and Resources\n", "Don't forget to check out additional resources such as [HuggingFace Tokenizer Documentation](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) and [Train GPTBPE Tokenizer in your own langauge](https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171)." ] }, { "cell_type": "markdown", "id": "offshore-truck", "metadata": {}, "source": [ "-----\n", "##
" ] }, { "cell_type": "markdown", "id": "clinical-tuition", "metadata": {}, "source": [ "-----\n", "\n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }