|
@@ -2,50 +2,47 @@
|
|
|
"cells": [
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "respected-vintage",
|
|
|
+ "id": "naval-commodity",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "# Train your own GPT compatible Tokenzer and obtain vocab.json & merges.txt\n",
|
|
|
+ "# Train custom GPTBPE Tokenzer \n",
|
|
|
"---\n",
|
|
|
"\n",
|
|
|
"## Learning Objectives\n",
|
|
|
- "The goal of this lab is to demonstrate how to train your own GPTBPE tokenizer on your own raw text data \n",
|
|
|
"\n",
|
|
|
- "- train your own GPT compatible tokenizer given own text data in own langauge\n",
|
|
|
- " 1. option 1 - load from pretrained vocab and merge files, and fit to the new corpus \n",
|
|
|
- " 2. option 2 - train a GPT compatible tokenizer from scratch\n",
|
|
|
+ "In order to include the vocabulary of the local language into GPTBPE tokenizer, we need to be able to train GPTBPE Tokenizer on local language raw text data. The trained GPTBPE Tokenizer will produce it's own vocab.json and merges.txt files which is compatible with Megatron-LM's GPTBPE Tokenizer. \n",
|
|
|
"\n",
|
|
|
- "we will elaborate how to train your own GPT compatible tokenizer and obtain vocab and merge files\n",
|
|
|
+ "Previously in `Lab2-1_acquiring_data.ipynb`, we have acquired our own Swedish raw text data extracted from data source språkbank.\n",
|
|
|
+ "Therefore, the goal of this notebook, is to train our own GPTBPE Tokenizer on the Swedish raw text data obtained from `Lab2-1_acquiring_data.ipynb`.\n",
|
|
|
"\n",
|
|
|
- "we will be using HuggingFace's ByteLevel BPE Tokenizer and trainer to complete this task\n",
|
|
|
+ "We can either choose to load a previously trained GPTBPE Tokenizer by providing the vocab.json and merges.txt files to the GPTBPE Tokenizer before training further with the raw text data, or we can choose to train a completely new GPTBPE Tokenizer.\n",
|
|
|
"\n",
|
|
|
- "--------------------------------------------------------------------------------------------------------------------\n",
|
|
|
- "First of all, we need to install the [HuggingFace Tokenizer library](https://huggingface.co/transformers/installation.html)"
|
|
|
+ "The two options are covered in this notebook :\n",
|
|
|
+ "\n",
|
|
|
+ " 1. option 1 - load from pretrained vocab and merge files, then continue training with the new raw text.\n",
|
|
|
+ " 2. option 2 - train a GPT compatible tokenizer from scratch.\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "We will use HuggingFace's Tokenizer library and the trainer function in order train our own GPTBPE Tokenizer with our own raw text data.\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "First, we will install the [HuggingFace Tokenizer library](https://huggingface.co/transformers/installation.html)"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
- "execution_count": 1,
|
|
|
- "id": "transsexual-republican",
|
|
|
+ "execution_count": null,
|
|
|
+ "id": "cathedral-jumping",
|
|
|
"metadata": {},
|
|
|
- "outputs": [
|
|
|
- {
|
|
|
- "name": "stdout",
|
|
|
- "output_type": "stream",
|
|
|
- "text": [
|
|
|
- "Defaulting to user installation because normal site-packages is not writeable\n",
|
|
|
- "Requirement already satisfied: tokenizers in /home/x_zench/.local/lib/python3.8/site-packages (0.10.3)\n"
|
|
|
- ]
|
|
|
- }
|
|
|
- ],
|
|
|
+ "outputs": [],
|
|
|
"source": [
|
|
|
"!pip install tokenizers"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
- "execution_count": 3,
|
|
|
- "id": "golden-retailer",
|
|
|
+ "execution_count": null,
|
|
|
+ "id": "designing-occasion",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
@@ -56,11 +53,13 @@
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "accepting-simon",
|
|
|
+ "id": "separated-article",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "-------------------------------------------------------------------------------\n",
|
|
|
- "## How to use the python script below - \n",
|
|
|
+ "A python script for training custom GPTBPE Tokenizer is provided for your convenience : \n",
|
|
|
+ "\n",
|
|
|
+ "To view the python script, click on [trainGPTTokenizer.py](./Megatron-LM/sv_utils/trainGPTTokenizer.py)\n",
|
|
|
+ "\n",
|
|
|
" trainGPTTokenizer.py [-h] \n",
|
|
|
"\n",
|
|
|
" optional arguments:\n",
|
|
@@ -77,217 +76,54 @@
|
|
|
},
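|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "hypothetical-sketch-note",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Before running the script, the cell below gives a minimal, self-contained sketch of the core HuggingFace `tokenizers` API that a script like `trainGPTTokenizer.py` builds on. The file path and output directory are placeholders rather than the lab's actual arguments; consult the script itself for its exact behaviour."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "id": "hypothetical-sketch-code",
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "## Minimal sketch with placeholder paths -- not part of the lab's script\n",
|
|
|
+ "import os\n",
|
|
|
+ "from tokenizers import ByteLevelBPETokenizer\n",
|
|
|
+ "\n",
|
|
|
+ "raw_text_file = \"../dataset/SV/my_raw_text.txt\"  # placeholder raw text file\n",
|
|
|
+ "output_dir = \"../dataset/SV/sketch_32k/\"  # placeholder output directory\n",
|
|
|
+ "os.makedirs(output_dir, exist_ok=True)\n",
|
|
|
+ "\n",
|
|
|
+ "# train a byte-level BPE tokenizer (GPT-2 style) from scratch\n",
|
|
|
+ "tokenizer = ByteLevelBPETokenizer()\n",
|
|
|
+ "tokenizer.train(files=[raw_text_file],\n",
|
|
|
+ "                vocab_size=32000,\n",
|
|
|
+ "                min_frequency=2,\n",
|
|
|
+ "                special_tokens=[\"<|endoftext|>\"])  # minimal special token for GPT\n",
|
|
|
+ "\n",
|
|
|
+ "# save_model() writes vocab.json and merges.txt into output_dir\n",
|
|
|
+ "tokenizer.save_model(output_dir)"
|
|
|
+ ]
|
|
|
+ },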
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "latin-netscape",
|
|
|
+ "id": "therapeutic-kentucky",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "---\n",
|
|
|
- "## Load pretrained vocab and merge files into the trainer and then train on new txt\n",
|
|
|
- "#### OUTPUT should look similar to the below ---\n",
|
|
|
- " \n",
|
|
|
- " loading gpt2bpe english vocab and merge \n",
|
|
|
- " include minimal special token end of text \n",
|
|
|
- " [00:00:00] Pre-processing files (914 Mo) ░░░░░░░░ 0%\n",
|
|
|
- " [00:00:02] Pre-processing files (914 Mo) ░░░░░░░░ 1%\n",
|
|
|
- " [00:00:05] Pre-processing files (914 Mo) ░░░░░░░░ 2%\n",
|
|
|
- " [00:00:07] Pre-processing files (914 Mo) ░░░░░░░░ 3%\n",
|
|
|
- " [00:00:10] Pre-processing files (914 Mo) ░░░░░░░░ 4%\n",
|
|
|
- " ....\n",
|
|
|
- " [00:00:19] Compute merges ███████░ 30080 / 32000\n",
|
|
|
- " [00:00:19] Compute merges ███████░ 31040 / 32000\n",
|
|
|
- " [00:00:19] Compute merges ████████ 31743 / 31743\n",
|
|
|
- "\n",
|
|
|
- " Trained vocab size: 32000\n",
|
|
|
- " saving trained BPE model to : ./Megatron-LM/dataset/EN/32k/\n",
|
|
|
- " model saved ! "
|
|
|
+ "1. option 1 - load from pretrained vocab and merge files, then continue training with the new raw text."
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
- "execution_count": 4,
|
|
|
- "id": "visible-setup",
|
|
|
+ "execution_count": null,
|
|
|
+ "id": "modular-result",
|
|
|
"metadata": {},
|
|
|
- "outputs": [
|
|
|
- {
|
|
|
- "name": "stdout",
|
|
|
- "output_type": "stream",
|
|
|
- "text": [
|
|
|
- "loading gpt2bpe english vocab and merge \n",
|
|
|
- "\n",
|
|
|
- "include minimal special token end of text \n",
|
|
|
- "[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 0%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 1%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 3%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 5%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 7%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 9%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 11%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) █░░░░░░░ 14%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) █░░░░░░░ 16%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) █░░░░░░░ 18%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) █░░░░░░░ 21%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) █░░░░░░░ 23%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ██░░░░░░ 26%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ██░░░░░░ 29%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ██░░░░░░ 31%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ██░░░░░░ 34%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ██░░░░░░ 36%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ███░░░░░ 38%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ███░░░░░ 40%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ███░░░░░ 42%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ███░░░░░ 45%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ███░░░░░ 47%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ████░░░░ 50%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ████░░░░ 53%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ████░░░░ 56%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ████░░░░ 59%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ████░░░░ 62%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) █████░░░ 65%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) █████░░░ 68%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) █████░░░ 71%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) █████░░░ 74%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ██████░░ 76%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ██████░░ 79%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ██████░░ 82%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ██████░░ 85%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ███████░ 88%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:03] Pre-processing files (136 Mo) ███████░ 91%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:03] Pre-processing files (136 Mo) ███████░ 94%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:03] Pre-processing files (136 Mo) ███████░ 97%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:04] Pre-processing files (136 Mo) ████████ 100%\n",
|
|
|
- "[00:00:00] Tokenize words ████████ 0 / 0\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ░░░░░░░░ 45279 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words █░░░░░░░ 90558 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ██░░░░░░ 140868 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ██░░░░░░ 186147 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ███░░░░░ 236457 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ████░░░░ 286767 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words █████░░░ 337077 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ██████░░ 387387 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ██████░░ 437697 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ███████░ 488007 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ████████ 503185 / 503185\n",
|
|
|
- "\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs █░░░░░░░ 70434 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs ██░░░░░░ 140882 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs ███░░░░░ 216347 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs ████░░░░ 291798 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs █████░░░ 362267 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs ██████░░ 432688 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Count pairs ████████ 503185 / 503185\n",
|
|
|
- "\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:04] Compute merges ░░░░░░░░ 560 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:05] Compute merges ░░░░░░░░ 1120 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:05] Compute merges ░░░░░░░░ 1680 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges ░░░░░░░░ 2240 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges ░░░░░░░░ 2800 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges ░░░░░░░░ 3360 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges ░░░░░░░░ 3920 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges ░░░░░░░░ 4480 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges ░░░░░░░░ 5040 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges ░░░░░░░░ 5600 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges ░░░░░░░░ 6160 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges ░░░░░░░░ 6720 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges █░░░░░░░ 7280 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges █░░░░░░░ 7840 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges █░░░░░░░ 8400 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges █░░░░░░░ 8960 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges █░░░░░░░ 9520 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges █░░░░░░░ 10080 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges █░░░░░░░ 10640 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges █░░░░░░░ 11200 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges █░░░░░░░ 11760 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges █░░░░░░░ 12320 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges █░░░░░░░ 12880 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges █░░░░░░░ 13440 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ██░░░░░░ 14000 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ██░░░░░░ 14560 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ██░░░░░░ 15120 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ██░░░░░░ 15680 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ██░░░░░░ 16240 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ██░░░░░░ 16800 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ██░░░░░░ 17360 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ██░░░░░░ 17920 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ██░░░░░░ 18480 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ██░░░░░░ 19040 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ██░░░░░░ 19600 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ██░░░░░░ 20160 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ██░░░░░░ 20720 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ███░░░░░ 21280 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ███░░░░░ 21840 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ███░░░░░ 22400 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ███░░░░░ 22960 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ███░░░░░ 23520 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ███░░░░░ 24640 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ███░░░░░ 25200 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ███░░░░░ 25760 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ███░░░░░ 26320 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ███░░░░░ 26880 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ███░░░░░ 27440 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ████░░░░ 28000 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ████░░░░ 28560 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ████░░░░ 29120 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ████░░░░ 29680 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ████░░░░ 30240 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ████░░░░ 30800 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ████░░░░ 31360 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges ████░░░░ 32480 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges ████░░░░ 33040 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges ████░░░░ 34160 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges █████░░░ 35280 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges █████░░░ 36400 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges █████░░░ 37520 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges █████░░░ 38640 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges █████░░░ 39200 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges █████░░░ 40320 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges █████░░░ 41440 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges ██████░░ 42560 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges ██████░░ 43120 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges ██████░░ 43680 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges ██████░░ 44240 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges ██████░░ 45360 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges ██████░░ 45920 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges ██████░░ 47040 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges ██████░░ 48160 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges ███████░ 49280 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges ███████░ 50400 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges ███████░ 51520 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges ███████░ 52640 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges ███████░ 53760 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges ███████░ 54880 / 56000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges ████████ 55743 / 55743\n",
|
|
|
- "\n",
|
|
|
- "Trained vocab size: 56000\n",
|
|
|
- "saving trained BPE model to : ../dataset/SV/56k/\n",
|
|
|
- "model saved ! \n",
|
|
|
- "\n",
|
|
|
- "\n",
|
|
|
- "\n",
|
|
|
- "testing ...\n",
|
|
|
- "\n",
|
|
|
- "\n",
|
|
|
- "\n",
|
|
|
- "['ĠHar', 'ĠnÃ¥gon', 'Ġfunderat', 'ĠpÃ¥', 'Ġvarför', 'Ġman', 'Ġinte', 'ĠfÃ¥r', 'Ġinom', 'hu', 'ste', 'peratur', 'ens', 'Ġkurva', 'Ġsynlig', 'Ġi', 'Ġgraf', 'en', '?', 'ĠÃĦr', 'Ġdet', 'ĠnÃ¥gon', 'Ġsom', 'ĠfrÃ¥gat', 'ĠTher', 'm', 'ia', '?', 'ĠSkulle', 'Ġdet', 'Ġinte', 'Ġvara', 'Ġväsentligt', 'Ġatt', 'Ġkunna', 'Ġkolla', 'Ġhistor', 'iken', 'ĠpÃ¥', 'Ġden', 'ĠdÃ¥', 'Ġman', 'Ġskall', 'Ġställa', 'Ġin', 'Ġkurvan', '?']\n"
|
|
|
- ]
|
|
|
- }
|
|
|
- ],
|
|
|
+ "outputs": [],
|
|
|
"source": [
|
|
|
"!python ./Megatron-LM/sv_utils/trainGPTTokenizer.py --infile $raw_text_path --bpe_path $output_trained_tokenizer_model_path --load_pretrained --pretrained_gpt_dir=$pretrained_gpt_dir --vocab_size 56000"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "roman-advocate",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Below is the expected outputs :\n",
|
|
|
+ " \n",
|
|
|
+ " [00:00:14] Compute merges ███████░ 51520 / 56000\n",
|
|
|
+ " [00:00:14] Compute merges ███████░ 52640 / 56000\n",
|
|
|
+ " [00:00:14] Compute merges ███████░ 53760 / 56000\n",
|
|
|
+ " [00:00:14] Compute merges ███████░ 54880 / 56000\n",
|
|
|
+ " [00:00:14] Compute merges ████████ 55743 / 55743\n",
|
|
|
+ "\n",
|
|
|
+ " Trained vocab size: 56000\n",
|
|
|
+ " saving trained BPE model to : ../dataset/SV/56k/\n",
|
|
|
+ " model saved ! \n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ " testing ...\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ " ['ĠHar', 'ĠnÃ¥gon', 'Ġfunderat', 'ĠpÃ¥', 'Ġvarför', 'Ġman', 'Ġinte', 'ĠfÃ¥r', 'Ġinom', 'hu', 'ste', 'peratur', 'ens', 'Ġkurva', 'Ġsynlig', 'Ġi', 'Ġgraf', 'en', '?', 'ĠÃĦr', 'Ġdet', 'ĠnÃ¥gon', 'Ġsom', 'ĠfrÃ¥gat', 'ĠTher', 'm', 'ia', '?', 'ĠSkulle', 'Ġdet', 'Ġinte', 'Ġvara', 'Ġväsentligt', 'Ġatt', 'Ġkunna', 'Ġkolla', 'Ġhistor', 'iken', 'ĠpÃ¥', 'Ġden', 'ĠdÃ¥', 'Ġman', 'Ġskall', 'Ġställa', 'Ġin', 'Ġkurvan', '?']"
|
|
|
+ ]
|
|
|
+ },
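|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "hypothetical-pretrained-note",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "As a rough illustration, the `--load_pretrained` flag corresponds to initializing the tokenizer from an existing vocab.json / merges.txt pair before training on the new corpus, as sketched below with placeholder paths. How `trainGPTTokenizer.py` actually combines the pretrained files with the new corpus is defined in the script itself."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "id": "hypothetical-pretrained-code",
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "## Sketch only -- placeholder paths, not the lab's actual arguments\n",
|
|
|
+ "from tokenizers import ByteLevelBPETokenizer\n",
|
|
|
+ "\n",
|
|
|
+ "# load an existing GPT-2 style vocab/merges pair (placeholder location) ...\n",
|
|
|
+ "tokenizer = ByteLevelBPETokenizer.from_file(\"./pretrained/vocab.json\",\n",
|
|
|
+ "                                            \"./pretrained/merges.txt\")\n",
|
|
|
+ "\n",
|
|
|
+ "# ... then train on the new raw text (placeholder path)\n",
|
|
|
+ "tokenizer.train(files=[\"../dataset/SV/my_raw_text.txt\"],\n",
|
|
|
+ "                vocab_size=56000,\n",
|
|
|
+ "                special_tokens=[\"<|endoftext|>\"])"
|
|
|
+ ]
|
|
|
+ },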
|
|
|
+ {
|
|
|
"cell_type": "code",
|
|
|
- "execution_count": 9,
|
|
|
- "id": "analyzed-pacific",
|
|
|
+ "execution_count": null,
|
|
|
+ "id": "better-consideration",
|
|
|
"metadata": {},
|
|
|
- "outputs": [
|
|
|
- {
|
|
|
- "name": "stdout",
|
|
|
- "output_type": "stream",
|
|
|
- "text": [
|
|
|
- "merges.txt vocab.json\n"
|
|
|
- ]
|
|
|
- }
|
|
|
- ],
|
|
|
+ "outputs": [],
|
|
|
"source": [
|
|
|
"## verify merges.txt and vocab.json exist\n",
|
|
|
"!ls ../dataset/SV/56k/"
|
|
@@ -295,31 +131,16 @@
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "brown-pickup",
|
|
|
+ "id": "mysterious-gossip",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "---\n",
|
|
|
- "## Train completely from scratch with the raw txt to obtain vocab.json and merges.txt files\n",
|
|
|
- "#### OUTPUT should look similar to the below ---\n",
|
|
|
- " include minimal special token end of text \n",
|
|
|
- " [00:00:00] Pre-processing files (914 Mo) ░░░░░░░░ 0%\n",
|
|
|
- " [00:00:02] Pre-processing files (914 Mo) ░░░░░░░░ 1%\n",
|
|
|
- " [00:00:05] Pre-processing files (914 Mo) ░░░░░░░░ 2%\n",
|
|
|
- " [00:00:07] Pre-processing files (914 Mo) ░░░░░░░░ 3%\n",
|
|
|
- " ...\n",
|
|
|
- " [00:00:18] Compute merges ███████░ 30400 / 32000\n",
|
|
|
- " [00:00:18] Compute merges ███████░ 31360 / 32000\n",
|
|
|
- " [00:00:19] Compute merges ████████ 31743 / 31743\n",
|
|
|
- "\n",
|
|
|
- " Trained vocab size: 32000\n",
|
|
|
- " saving trained BPE model to : ./Megatron-LM/dataset/EN/32k/\n",
|
|
|
- " model saved ! \n"
|
|
|
+ "2. option 2 - train a GPT compatible tokenizer from scratch."
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
- "execution_count": 5,
|
|
|
- "id": "pointed-toner",
|
|
|
+ "execution_count": null,
|
|
|
+ "id": "unexpected-cowboy",
|
|
|
"metadata": {},
|
|
|
"outputs": [],
|
|
|
"source": [
|
|
@@ -329,171 +150,44 @@
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "code",
|
|
|
- "execution_count": 6,
|
|
|
- "id": "north-reality",
|
|
|
+ "execution_count": null,
|
|
|
+ "id": "atlantic-serbia",
|
|
|
"metadata": {},
|
|
|
- "outputs": [
|
|
|
- {
|
|
|
- "name": "stdout",
|
|
|
- "output_type": "stream",
|
|
|
- "text": [
|
|
|
- "include minimal special token end of text \n",
|
|
|
- "[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 0%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 1%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 4%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 7%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 9%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 12%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) █░░░░░░░ 14%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) █░░░░░░░ 17%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) █░░░░░░░ 20%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) █░░░░░░░ 23%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) ██░░░░░░ 26%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Pre-processing files (136 Mo) ██░░░░░░ 29%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ██░░░░░░ 32%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ██░░░░░░ 35%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ███░░░░░ 38%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ███░░░░░ 40%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ███░░░░░ 43%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ███░░░░░ 46%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ███░░░░░ 49%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ████░░░░ 52%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ████░░░░ 55%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ████░░░░ 58%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) ████░░░░ 61%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Pre-processing files (136 Mo) █████░░░ 64%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) █████░░░ 67%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) █████░░░ 70%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) █████░░░ 72%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ██████░░ 75%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ██████░░ 78%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ██████░░ 81%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ██████░░ 84%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ██████░░ 87%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ███████░ 90%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ███████░ 93%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:02] Pre-processing files (136 Mo) ███████░ 96%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:03] Pre-processing files (136 Mo) ███████░ 98%\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:03] Pre-processing files (136 Mo) ████████ 100%\n",
|
|
|
- "[00:00:00] Tokenize words ████████ 0 / 0\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ░░░░░░░░ 50310 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words █░░░░░░░ 100620 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ██░░░░░░ 150930 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ███░░░░░ 201240 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ███░░░░░ 251550 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ████░░░░ 301860 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words █████░░░ 352170 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ██████░░ 402480 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ███████░ 452790 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ███████░ 503100 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Tokenize words ████████ 503185 / 503185\n",
|
|
|
- "\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs █░░░░░░░ 75526 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs ██░░░░░░ 150944 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs ███░░░░░ 231465 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs ████░░░░ 301961 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs █████░░░ 372348 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs ███████░ 447777 / 503185\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Count pairs ████████ 503185 / 503185\n",
|
|
|
- "\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:04] Compute merges ░░░░░░░░ 320 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:04] Compute merges ░░░░░░░░ 640 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:05] Compute merges ░░░░░░░░ 960 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:05] Compute merges ░░░░░░░░ 1280 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:05] Compute merges ░░░░░░░░ 1600 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges ░░░░░░░░ 1920 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges ░░░░░░░░ 2240 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges ░░░░░░░░ 2560 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges ░░░░░░░░ 2880 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges ░░░░░░░░ 3200 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:06] Compute merges ░░░░░░░░ 3520 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges ░░░░░░░░ 3840 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges █░░░░░░░ 4160 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges █░░░░░░░ 4480 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges █░░░░░░░ 4800 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges █░░░░░░░ 5120 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges █░░░░░░░ 5440 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges █░░░░░░░ 5760 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges █░░░░░░░ 6080 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:07] Compute merges █░░░░░░░ 6400 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges █░░░░░░░ 6720 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges █░░░░░░░ 7040 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges █░░░░░░░ 7360 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges █░░░░░░░ 7680 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges ██░░░░░░ 8000 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges ██░░░░░░ 8320 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges ██░░░░░░ 8640 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges ██░░░░░░ 8960 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges ██░░░░░░ 9280 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges ██░░░░░░ 9600 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges ██░░░░░░ 9920 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Compute merges ██░░░░░░ 10240 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ██░░░░░░ 10560 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ██░░░░░░ 11200 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ██░░░░░░ 11520 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ███░░░░░ 12160 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ███░░░░░ 12800 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ███░░░░░ 13120 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ███░░░░░ 13440 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ███░░░░░ 14080 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ███░░░░░ 14720 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ███░░░░░ 15360 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ████░░░░ 16000 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ████░░░░ 16640 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ████░░░░ 17280 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ████░░░░ 17920 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ████░░░░ 18560 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ████░░░░ 18880 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ████░░░░ 19520 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges █████░░░ 20160 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges █████░░░ 20800 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges █████░░░ 21440 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges █████░░░ 22400 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges █████░░░ 23040 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ██████░░ 24000 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ██████░░ 24960 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ██████░░ 25920 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ██████░░ 26880 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ██████░░ 27840 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ███████░ 28800 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ███████░ 29440 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ███████░ 30080 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ███████░ 30720 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ███████░ 31360 / 32000\n",
|
|
|
- "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges ████████ 31743 / 31743\n",
|
|
|
- "\n",
|
|
|
- "Trained vocab size: 32000\n",
|
|
|
- "saving trained BPE model to : ../dataset/SV/32k/\n",
|
|
|
- "model saved ! \n",
|
|
|
- "\n",
|
|
|
- "\n",
|
|
|
- "\n",
|
|
|
- "testing ...\n",
|
|
|
- "\n",
|
|
|
- "\n",
|
|
|
- "\n",
|
|
|
- "['ĠHar', 'ĠnÃ¥gon', 'Ġfunderat', 'ĠpÃ¥', 'Ġvarför', 'Ġman', 'Ġinte', 'ĠfÃ¥r', 'Ġinom', 'hu', 'ste', 'peratur', 'ens', 'Ġkurva', 'Ġsynlig', 'Ġi', 'Ġgraf', 'en', '?', 'ĠÃĦr', 'Ġdet', 'ĠnÃ¥gon', 'Ġsom', 'ĠfrÃ¥gat', 'ĠTher', 'm', 'ia', '?', 'ĠSkulle', 'Ġdet', 'Ġinte', 'Ġvara', 'Ġväsentligt', 'Ġatt', 'Ġkunna', 'Ġkolla', 'Ġhistor', 'iken', 'ĠpÃ¥', 'Ġden', 'ĠdÃ¥', 'Ġman', 'Ġskall', 'Ġställa', 'Ġin', 'Ġkurvan', '?']\n"
|
|
|
- ]
|
|
|
- }
|
|
|
- ],
|
|
|
+ "outputs": [],
|
|
|
"source": [
|
|
|
"!python ./Megatron-LM/sv_utils/trainGPTTokenizer.py --infile $raw_text_path --bpe_path $output_trained_tokenizer_model_path --vocab_size 32000"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "heated-ranch",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "Below is the expected outputs :\n",
|
|
|
+ " \n",
|
|
|
+ " [00:00:11] Compute merges ███████░ 30720 / 32000\n",
|
|
|
+ " [00:00:11] Compute merges ███████░ 31360 / 32000\n",
|
|
|
+ " [00:00:12] Compute merges ████████ 31743 / 31743\n",
|
|
|
+ "\n",
|
|
|
+ " Trained vocab size: 32000\n",
|
|
|
+ " saving trained BPE model to : ../dataset/SV/32k/\n",
|
|
|
+ " model saved ! \n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ " testing ...\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ " ['ĠHar', 'ĠnÃ¥gon', 'Ġfunderat', 'ĠpÃ¥', 'Ġvarför', 'Ġman', 'Ġinte', 'ĠfÃ¥r', 'Ġinom', 'hu', 'ste', 'peratur', 'ens', 'Ġkurva', 'Ġsynlig', 'Ġi', 'Ġgraf', 'en', '?', 'ĠÃĦr', 'Ġdet', 'ĠnÃ¥gon', 'Ġsom', 'ĠfrÃ¥gat', 'ĠTher', 'm', 'ia', '?', 'ĠSkulle', 'Ġdet', 'Ġinte', 'Ġvara', 'Ġväsentligt', 'Ġatt', 'Ġkunna', 'Ġkolla', 'Ġhistor', 'iken', 'ĠpÃ¥', 'Ġden', 'ĠdÃ¥', 'Ġman', 'Ġskall', 'Ġställa', 'Ġin', 'Ġkurvan', '?']"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
"cell_type": "code",
|
|
|
- "execution_count": 10,
|
|
|
- "id": "wireless-galaxy",
|
|
|
+ "execution_count": null,
|
|
|
+ "id": "olive-japanese",
|
|
|
"metadata": {},
|
|
|
- "outputs": [
|
|
|
- {
|
|
|
- "name": "stdout",
|
|
|
- "output_type": "stream",
|
|
|
- "text": [
|
|
|
- "merges.txt vocab.json\n"
|
|
|
- ]
|
|
|
- }
|
|
|
- ],
|
|
|
+ "outputs": [],
|
|
|
"source": [
|
|
|
"## verify the merges.txt and vocab.json exist \n",
|
|
|
"!ls ../dataset/SV/32k/"
|
|
@@ -501,35 +195,26 @@
|
|
|
},
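|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "id": "hypothetical-smoke-test",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "To reproduce the `testing ...` step by hand, a small sanity check along these lines should work once vocab.json and merges.txt exist (the test sentence here is our own; any Swedish text will do):"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": null,
|
|
|
+ "id": "hypothetical-smoke-test-code",
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "## Sketch of a manual smoke test for the trained tokenizer\n",
|
|
|
+ "from tokenizers import ByteLevelBPETokenizer\n",
|
|
|
+ "\n",
|
|
|
+ "# load the files we just verified\n",
|
|
|
+ "tokenizer = ByteLevelBPETokenizer.from_file(\"../dataset/SV/32k/vocab.json\",\n",
|
|
|
+ "                                            \"../dataset/SV/32k/merges.txt\")\n",
|
|
|
+ "\n",
|
|
|
+ "encoding = tokenizer.encode(\"Har någon funderat på varför man inte får temperaturens kurva synlig i grafen?\")\n",
|
|
|
+ "print(encoding.tokens)"
|
|
|
+ ]
|
|
|
+ },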
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "requested-sphere",
|
|
|
+ "id": "mineral-middle",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"--- \n",
|
|
|
- "\n",
|
|
|
- "## Additional Resources\n",
|
|
|
- "\n",
|
|
|
- "HuggingFace Tokenizer Documentation : https://huggingface.co/docs/tokenizers/python/latest/quicktour.html\n",
|
|
|
- "\n",
|
|
|
- "Train GPT-2 in your own langauge : https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171"
|
|
|
+ "## Links and Resources\n",
|
|
|
+ "Don't forget to check out additional resources such as [HuggingFace Tokenizer Documentation](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) and [Train GPTBPE Tokenizer in your own langauge](https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171)."
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "contrary-quantum",
|
|
|
+ "id": "pregnant-template",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "---\n",
|
|
|
- "## Up Next : \n",
|
|
|
- "\n",
|
|
|
- "[customize preprocess data python script and convert to mmap](./Day3-4_customize_process2mmap.ipynb)\n",
|
|
|
- "\n",
|
|
|
- "## Back To Start Menu\n",
|
|
|
- "[start menu](../Start_Here.ipynb)"
|
|
|
+ "-----\n",
|
|
|
+ "## <p style=\"text-align:center;border:3px; padding: 1em\"> <a href=../Start_Here.ipynb>HOME</a> <a href=./Lab2-4_customize_process2mmap.ipynb>NEXT</a></p>"
|
|
|
]
|
|
|
},
|
|
|
{
|
|
|
"cell_type": "markdown",
|
|
|
- "id": "sized-google",
|
|
|
+ "id": "limiting-visiting",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
"-----\n",
|