{ "cells": [ { "cell_type": "markdown", "id": "induced-beverage", "metadata": {}, "source": [ "# Train your own GPT-compatible tokenizer and obtain vocab.json & merges.txt\n", "---\n", "\n", "## Learning Objectives\n", "- **The goal of this lab is to show you how to train your own GPTBPE tokenizer on your own raw text data.**\n", "  - Train your own GPT-compatible tokenizer on your own text data in your own language, using one of two options:\n", "    1. Option 1: load the pretrained vocab and merge files, then fit them to the new corpus\n", "    2. Option 2: train a GPT-compatible tokenizer from scratch\n", "\n", "This notebook walks through how to train your own GPT-compatible tokenizer and obtain the vocab and merge files.\n", "We will use HuggingFace's ByteLevel BPE tokenizer and trainer to complete this task.\n", "\n", "--------------------------------------------------------------------------------------------------------------------\n", "First, install the [HuggingFace Tokenizers library](https://huggingface.co/transformers/installation.html):\n", "\n", "`!pip install tokenizers`" ] },
{ "cell_type": "code", "execution_count": 1, "id": "turkish-phrase", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Defaulting to user installation because normal site-packages is not writeable\n", "Requirement already satisfied: tokenizers in /home/zcharpy/.local/lib/python3.8/site-packages (0.10.3)\n" ] } ], "source": [ "!pip install tokenizers" ] },
{ "cell_type": "code", "execution_count": 3, "id": "precise-airplane", "metadata": {}, "outputs": [], "source": [ "raw_text_path='../dataset/SV/webnyheter2013.txt'\n", "output_trained_tokenizer_model_path='../dataset/SV/32k/'\n", "pretrained_gpt_dir='./Megatron-LM'" ] },
{ "cell_type": "markdown", "id": "decreased-arnold", "metadata": {}, "source": [ "-------------------------------------------------------------------------------\n", "## How to use the Python script below\n", " trainGPTTokenizer.py [-h] \n", "\n", " optional
arguments:\n", " -h, --help show this help message and exit\n", " --infile INFILE path to the text files\n", " --bpe_path BPE_PATH output GPTBPT path\n", " --load_pretrained load pretrained GPT model\n", " --pretrained_gpt_dir PRETRAINED_GPT_DIR\n", " path to pretrained gpt vocab and merge files, default None\n", " --incl_special_toks load pretrained BPE model\n", " --vocab_size VOCAB_SIZE\n", " specify the vocab_size when training HF GPTBPE for own language usually 16k/32k/48k/64k" ] },
{ "cell_type": "markdown", "id": "formal-confidence", "metadata": {}, "source": [ "---\n", "## Option 1: load the pretrained vocab and merge files into the trainer, then train on the new text\n", "#### The output should look similar to the following:\n", " \n", " loading gpt2bpe english vocab and merge \n", " include minimal special token end of text \n", " [00:00:00] Pre-processing files (914 Mo) ░░░░░░░░ 0%\n", " [00:00:02] Pre-processing files (914 Mo) ░░░░░░░░ 1%\n", " [00:00:05] Pre-processing files (914 Mo) ░░░░░░░░ 2%\n", " [00:00:07] Pre-processing files (914 Mo) ░░░░░░░░ 3%\n", " [00:00:10] Pre-processing files (914 Mo) ░░░░░░░░ 4%\n", " ....\n", " [00:00:19] Compute merges ███████░ 30080 / 32000\n", " [00:00:19] Compute merges ███████░ 31040 / 32000\n", " [00:00:19] Compute merges ████████ 31743 / 31743\n", "\n", " Trained vocab size: 32000\n", " saving trained BPE model to : ./Megatron-LM/dataset/EN/32k/\n", " model saved ! " ] },
{ "cell_type": "code", "execution_count": 4, "id": "olive-sustainability", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loading gpt2bpe english vocab and merge \n", "\n", "include minimal special token end of text \n", "[00:00:00] Pre-processing files (136 Mo) ░░░░░░░░ 0%\n", "[00:00:36] Pre-processing files (136 Mo) ████████ 100%\n", "[00:00:00] Tokenize words ████████ 0 / 0\n", "[00:00:00] Tokenize words ████████ 503185 / 503185\n", "\n", "[00:00:00] Count pairs ░░░░░░░░ 5031 / 503185\n", "[00:01:45] Count pairs ████████ 503185 / 503185\n", "\n", "[00:00:03] Compute merges ░░░░░░░░ 320 / 32000\n", "[00:00:06] Compute merges ███░░░░░ 15360 /
32000\n", "[00:00:07] Compute merges ████████ 31743 / 31743\n", "\n", "Trained vocab size: 32000\n", "saving trained BPE model to : ../dataset/SV/32k/\n", "model saved ! \n", "\n", "\n", "\n", "testing ...\n", "\n", "\n", "\n", "['ĠHar', 'ĠnÃ¥gon', 'Ġfunderat', 'ĠpÃ¥', 'Ġvarför', 'Ġman', 'Ġinte', 'ĠfÃ¥r', 'Ġinom', 'hu', 'ste', 'peratur', 'ens', 'Ġkurva', 'Ġsynlig', 'Ġi', 'Ġgraf', 'en', '?', 'ĠÃĦr', 'Ġdet', 'ĠnÃ¥gon', 'Ġsom', 'ĠfrÃ¥gat', 'ĠTher', 'm', 'ia', '?', 'ĠSkulle', 'Ġdet', 'Ġinte', 'Ġvara', 'Ġväsentligt', 'Ġatt', 'Ġkunna', 'Ġkolla', 'Ġhistor', 'iken', 'ĠpÃ¥', 'Ġden', 'ĠdÃ¥', 'Ġman', 'Ġskall', 'Ġställa', 'Ġin', 'Ġkurvan', '?']\n" ] } ], "source": [ "!python ./Megatron-LM/sv_utils/trainGPTTokenizer.py --infile $raw_text_path --bpe_path $output_trained_tokenizer_model_path --load_pretrained --pretrained_gpt_dir=$pretrained_gpt_dir --vocab_size 32000" ] },
{ "cell_type": "markdown", "id": "hindu-obligation", "metadata": {}, "source": [ "---\n", "## Option 2: train completely from scratch on the raw text to obtain the vocab.json and merges.txt files\n", "#### The output should look similar to the following:\n", " include minimal special token end of text \n", " [00:00:00] Pre-processing files (914 Mo) ░░░░░░░░ 0%\n", " [00:00:02] Pre-processing files (914 Mo) ░░░░░░░░ 1%\n", " [00:00:05] Pre-processing files (914 Mo) ░░░░░░░░ 2%\n", " [00:00:07] Pre-processing files (914 Mo) ░░░░░░░░ 3%\n", " ...\n", " [00:00:18] Compute merges ███████░ 30400 / 32000\n", " [00:00:18] Compute merges ███████░ 31360 / 32000\n", " [00:00:19] Compute merges ████████ 31743 / 31743\n", "\n", " Trained vocab size: 32000\n", " saving trained BPE model to : ./Megatron-LM/dataset/EN/32k/\n", " model saved ! \n" ] },
{ "cell_type": "code", "execution_count": 13, "id": "digital-authentication", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "include minimal special token end of text \n", "[00:00:00] Pre-processing files (914 Mo) ░░░░░░░░ 0%\n", "[00:04:29] Pre-processing files (914 Mo) ████████ 100%\n", "[00:00:00] Tokenize words ████████ 0 / 0\n", "[00:00:02] Tokenize words ████████ 1723363 / 1723363\n", "\n",
"\u001b[2K\u001b[1B\u001b[1A[00:00:00] Count pairs ░░░░░░░░ 17233 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:01] Count pairs ░░░░░░░░ 34466 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:03] Count pairs ░░░░░░░░ 51699 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:05] Count pairs ░░░░░░░░ 68932 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:08] Count pairs ░░░░░░░░ 86165 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Count pairs ░░░░░░░░ 103398 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:16] Count pairs ░░░░░░░░ 120631 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:21] Count pairs ░░░░░░░░ 137864 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:26] Count pairs ░░░░░░░░ 155097 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:33] Count pairs ░░░░░░░░ 172330 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:40] Count pairs ░░░░░░░░ 189563 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:47] Count pairs ░░░░░░░░ 206796 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:51] Count pairs █░░░░░░░ 224029 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:52] Count pairs █░░░░░░░ 241262 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:53] Count pairs █░░░░░░░ 258495 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:55] Count pairs █░░░░░░░ 275728 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:58] Count pairs █░░░░░░░ 292961 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:01] Count pairs █░░░░░░░ 310194 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:05] Count pairs █░░░░░░░ 327427 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:10] Count pairs █░░░░░░░ 344660 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:15] Count pairs █░░░░░░░ 361893 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:21] Count pairs █░░░░░░░ 379126 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:28] Count pairs █░░░░░░░ 396359 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:35] Count pairs █░░░░░░░ 413592 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:43] Count pairs █░░░░░░░ 
430825 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:43] Count pairs ██░░░░░░ 448058 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:44] Count pairs ██░░░░░░ 465291 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:46] Count pairs ██░░░░░░ 482524 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:48] Count pairs ██░░░░░░ 499757 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:51] Count pairs ██░░░░░░ 516990 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:55] Count pairs ██░░░░░░ 534223 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:01:59] Count pairs ██░░░░░░ 551456 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:02:04] Count pairs ██░░░░░░ 568689 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:02:10] Count pairs ██░░░░░░ 585922 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:02:16] Count pairs ██░░░░░░ 603155 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:02:23] Count pairs ██░░░░░░ 620388 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:02:30] Count pairs ██░░░░░░ 637621 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:02:35] Count pairs ███░░░░░ 654854 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:02:35] Count pairs ███░░░░░ 672087 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:02:37] Count pairs ███░░░░░ 689320 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:02:39] Count pairs ███░░░░░ 706553 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:02:41] Count pairs ███░░░░░ 723786 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:02:45] Count pairs ███░░░░░ 741019 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:02:49] Count pairs ███░░░░░ 758252 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:02:53] Count pairs ███░░░░░ 775485 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:02:58] Count pairs ███░░░░░ 792718 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:03:04] Count pairs ███░░░░░ 809951 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:03:11] Count pairs ███░░░░░ 827184 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:03:18] Count pairs ███░░░░░ 844417 / 1723363\n", 
"\u001b[2K\u001b[1B\u001b[1A[00:03:26] Count pairs ███░░░░░ 861650 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:03:28] Count pairs ████░░░░ 878883 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:03:28] Count pairs ████░░░░ 896116 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:03:30] Count pairs ████░░░░ 913349 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:03:32] Count pairs ████░░░░ 930582 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:03:35] Count pairs ████░░░░ 947815 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:03:39] Count pairs ████░░░░ 965048 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:03:43] Count pairs ████░░░░ 982281 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:03:48] Count pairs ████░░░░ 999514 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:03:54] Count pairs ████░░░░ 1016747 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:04:00] Count pairs ████░░░░ 1033980 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:04:07] Count pairs ████░░░░ 1051213 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:04:15] Count pairs ████░░░░ 1068446 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:04:19] Count pairs █████░░░ 1085679 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:04:20] Count pairs █████░░░ 1102912 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:04:21] Count pairs █████░░░ 1120145 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:04:23] Count pairs █████░░░ 1137378 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:04:26] Count pairs █████░░░ 1154611 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:04:29] Count pairs █████░░░ 1171844 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:04:33] Count pairs █████░░░ 1189077 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:04:37] Count pairs █████░░░ 1206310 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:04:43] Count pairs █████░░░ 1223543 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:04:49] Count pairs █████░░░ 1240776 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:04:55] Count pairs █████░░░ 1258009 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:05:03] 
Count pairs █████░░░ 1275242 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:05:10] Count pairs █████░░░ 1292475 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:05:11] Count pairs ██████░░ 1309708 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:05:12] Count pairs ██████░░ 1326941 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:05:14] Count pairs ██████░░ 1344174 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:05:16] Count pairs ██████░░ 1361407 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:05:19] Count pairs ██████░░ 1378640 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:05:23] Count pairs ██████░░ 1395873 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:05:27] Count pairs ██████░░ 1413106 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:05:32] Count pairs ██████░░ 1430339 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:05:38] Count pairs ██████░░ 1447572 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:05:44] Count pairs ██████░░ 1464805 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:05:51] Count pairs ██████░░ 1482038 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:05:59] Count pairs ██████░░ 1499271 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:06:02] Count pairs ███████░ 1516504 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:06:04] Count pairs ███████░ 1533737 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:06:06] Count pairs ███████░ 1550970 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:06:07] Count pairs ███████░ 1568203 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:06:10] Count pairs ███████░ 1585436 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:06:13] Count pairs ███████░ 1602669 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:06:17] Count pairs ███████░ 1619902 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:06:22] Count pairs ███████░ 1637135 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:06:27] Count pairs ███████░ 1654368 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:06:33] Count pairs ███████░ 1671601 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:06:39] Count pairs ███████░ 1688834 / 
1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:06:47] Count pairs ███████░ 1706067 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:06:56] Count pairs ███████░ 1723300 / 1723363\n", "\u001b[2K\u001b[1B\u001b[1A[00:06:57] Count pairs ████████ 1723363 / 1723363\n", "\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:09] Compute merges ░░░░░░░░ 320 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ░░░░░░░░ 640 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:10] Compute merges ░░░░░░░░ 960 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ░░░░░░░░ 1280 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:11] Compute merges ░░░░░░░░ 1600 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges ░░░░░░░░ 1920 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges ░░░░░░░░ 2240 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges ░░░░░░░░ 2560 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:12] Compute merges ░░░░░░░░ 2880 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges ░░░░░░░░ 3200 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges ░░░░░░░░ 3520 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges ░░░░░░░░ 3840 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges █░░░░░░░ 4160 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:13] Compute merges █░░░░░░░ 4480 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges █░░░░░░░ 4800 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges █░░░░░░░ 5120 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges █░░░░░░░ 5440 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges █░░░░░░░ 5760 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges █░░░░░░░ 6080 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges █░░░░░░░ 6400 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:14] Compute merges █░░░░░░░ 6720 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:15] Compute merges 
█░░░░░░░ 7040 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:15] Compute merges █░░░░░░░ 7360 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:15] Compute merges █░░░░░░░ 7680 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:15] Compute merges ██░░░░░░ 8000 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:15] Compute merges ██░░░░░░ 8320 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:15] Compute merges ██░░░░░░ 8640 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:15] Compute merges ██░░░░░░ 8960 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:15] Compute merges ██░░░░░░ 9280 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:15] Compute merges ██░░░░░░ 9600 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:15] Compute merges ██░░░░░░ 9920 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:16] Compute merges ██░░░░░░ 10560 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:16] Compute merges ██░░░░░░ 10880 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:16] Compute merges ██░░░░░░ 11200 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:16] Compute merges ██░░░░░░ 11840 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:16] Compute merges ███░░░░░ 12160 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:16] Compute merges ███░░░░░ 12800 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:16] Compute merges ███░░░░░ 13440 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:16] Compute merges ███░░░░░ 13760 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:16] Compute merges ███░░░░░ 14400 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:17] Compute merges ███░░░░░ 15040 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:17] Compute merges ███░░░░░ 15680 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:17] Compute merges ████░░░░ 16320 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:17] Compute merges ████░░░░ 16960 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:17] Compute merges ████░░░░ 17600 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:17] Compute merges ████░░░░ 18240 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:17] 
Compute merges ████░░░░ 18880 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:17] Compute merges ████░░░░ 19520 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:17] Compute merges █████░░░ 20160 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:17] Compute merges █████░░░ 20800 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:17] Compute merges █████░░░ 21440 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:18] Compute merges █████░░░ 22080 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:18] Compute merges █████░░░ 22720 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:18] Compute merges █████░░░ 23360 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:18] Compute merges ██████░░ 24000 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:18] Compute merges ██████░░ 24960 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:18] Compute merges ██████░░ 25920 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:18] Compute merges ██████░░ 26560 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:18] Compute merges ██████░░ 27520 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:18] Compute merges ███████░ 28480 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:18] Compute merges ███████░ 29440 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:18] Compute merges ███████░ 30400 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:18] Compute merges ███████░ 31360 / 32000\n", "\u001b[2K\u001b[1B\u001b[1A[00:00:19] Compute merges ████████ 31743 / 31743\n", "\n", "Trained vocab size: 32000\n", "saving trained BPE model to : ./Megatron-LM/dataset/EN/32k/\n", "model saved ! 
\n", "\n", "\n", "\n", "testing ...\n", "\n", "\n", "\n", "['ĠHar', 'Ġn', 'Ã¥', 'gon', 'Ġfun', 'der', 'at', 'ĠpÃ¥', 'Ġvar', 'för', 'Ġman', 'Ġinte', 'Ġf', 'Ã¥r', 'Ġin', 'om', 'h', 'ust', 'e', 'perature', 'ns', 'Ġk', 'ur', 'va', 'Ġsyn', 'lig', 'Ġi', 'Ġgra', 'fen', '?', 'ĠÃĦ', 'r', 'Ġdet', 'Ġn', 'Ã¥', 'gon', 'Ġsom', 'Ġfr', 'Ã¥', 'gat', 'ĠTherm', 'ia', '?', 'ĠSk', 'ulle', 'Ġdet', 'Ġinte', 'Ġvar', 'a', 'Ġv', 'äs', 'ent', 'lig', 't', 'Ġatt', 'Ġk', 'unn', 'a', 'Ġk', 'olla', 'Ġhistor', 'iken', 'ĠpÃ¥', 'Ġden', 'Ġd', 'Ã¥', 'Ġman', 'Ġsk', 'all', 'Ġst', 'ä', 'lla', 'Ġin', 'Ġk', 'ur', 'van', '?']\n" ] } ], "source": [ "!python ./Megatron-LM/sv_utils/trainGPTTokenizer.py --infile $raw_text_path --bpe_path $output_trained_tokenizer_model_path --vocab_size 32000" ] }, { "cell_type": "markdown", "id": "earlier-memorabilia", "metadata": {}, "source": [ "---\n", "## Up Next\n", "\n", "[Customize the data preprocessing Python script and convert the data to mmap format](./Day3-4_customize_process2mmap.ipynb)\n", "\n", "## Back To Start Menu\n", "[Start menu](../Start_Here.ipynb)" ] }, { "cell_type": "markdown", "id": "exposed-forest", "metadata": {}, "source": [ "-----\n", "\n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }