@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
- "id": "sustainable-wrong",
+ "id": "mental-dating",
"metadata": {},
"source": [
"## GPT Tokenizer files\n",
@@ -14,15 +14,15 @@
"\n",
"Later on, we will use the observations from this notebook to train a GPTBPE Tokenizer with our own raw text data.\n",
"\n",
- "We will load and verify GPTBPE Tokenizer and make sure the output tokens and token ids are as expected. \n"
+ "We will load and verify the GPTBPE Tokenizer and make sure the output tokens and token ids are as expected. \n"
]
},
{
"cell_type": "markdown",
- "id": "fatal-think",
+ "id": "indoor-albany",
"metadata": {},
"source": [
- "Let's review the source code of [gpt2 tokenizer](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html)\n",
+ "Let's review the source code of [GPT2 tokenizer](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html)\n",
"\n",
" This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will\n",
" be encoded differently whether it is at the beginning of the sentence (without space) or not:\n",
@@ -35,12 +35,12 @@
" tokenizer(\" Hello world\")['input_ids']\n",
" [18435, 995]\n",
"\n",
- "We expect our custom tokenizer, which we will later on train in lab 2, will exhibit the same behavior of [treating spaces like parts of the tokens](https://huggingface.co/transformers/model_doc/gpt2.html#transformers.GPT2Tokenizer).\n"
+ "We expect that our custom tokenizer, which we will train later in lab 2, will exhibit the same behavior of [treating spaces like parts of the tokens](https://huggingface.co/transformers/model_doc/gpt2.html#transformers.GPT2Tokenizer).\n"
]
},
{
"cell_type": "markdown",
- "id": "missing-congo",
+ "id": "composed-amount",
"metadata": {},
"source": [
"Install necessary python libraries."
@@ -49,7 +49,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "private-aurora",
+ "id": "instructional-seller",
"metadata": {},
"outputs": [],
"source": [
@@ -59,10 +59,10 @@
},
{
"cell_type": "markdown",
- "id": "frequent-blues",
+ "id": "geological-drinking",
"metadata": {},
"source": [
- "Next, we proceed to fetch pretrained GPT Tokenizer files, namely the vocab and merge files, will ideally looks like. \n",
+ "Next, we will fetch the pretrained GPT Tokenizer files, namely the vocab and merges files, and observe what they should look like. \n",
"\n",
"We can later on use these observations to validate our custom trained GPTBPE tokenizer and the corresponding vocab.json and merges.txt file, in order to ensure the custom trained GPTBPE tokenizer will tokenze as expected."
]
@@ -70,7 +70,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "conceptual-mason",
+ "id": "destroyed-cleaning",
"metadata": {},
"outputs": [],
"source": [
@@ -80,7 +80,7 @@
},
{
"cell_type": "markdown",
- "id": "specific-pharmaceutical",
+ "id": "sustainable-sweet",
"metadata": {},
"source": [
"Examine the vocab and merge files, observe the presence of Ġ character.\n",
@@ -90,7 +90,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "pursuant-paradise",
+ "id": "overhead-freeware",
"metadata": {},
"outputs": [],
"source": [
@@ -107,7 +107,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "private-hunter",
+ "id": "eight-perception",
"metadata": {},
"outputs": [],
"source": [
@@ -116,10 +116,10 @@
},
{
"cell_type": "markdown",
- "id": "cellular-standing",
+ "id": "knowing-tsunami",
"metadata": {},
"source": [
- "The following code block will load a default GPT2Tokenizer from HuggingFace's **_transformer_** library, we verify the following :\n",
+ "The following code block will load a default GPT2Tokenizer from HuggingFace's **_transformer_** library. We will verify the following:\n",
"\n",
" from transformers import GPT2Tokenizer\n",
" tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
@@ -127,13 +127,13 @@
" tokenizer(\" Hello world\")['input_ids']\n",
" expected token ids for \" Hello world\" is [18435, 995]\n",
"\n",
- "Note: The HuggingFace's **_transformer_** library does not have functions to train GPTBPE tokenizer, it can load a pre-trained tokenizer given valid files. For training GPTBPE Tokenizer, we will need to use another library called **_tokenizers_** also from HuggingFace."
+ "Note: HuggingFace's **transformer** library does not have functions to train the GPTBPE tokenizer, but it can load a pre-trained tokenizer given valid files. For training the GPTBPE Tokenizer, we will need to use another library called **tokenizers**, also from HuggingFace."
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "driving-right",
+ "id": "headed-baseline",
"metadata": {},
"outputs": [],
"source": [
@@ -153,7 +153,7 @@
},
{
"cell_type": "markdown",
- "id": "collectible-rehabilitation",
+ "id": "lined-thousand",
"metadata": {},
"source": [
"Below is the expected outputs :\n",
@@ -166,24 +166,24 @@
},
{
"cell_type": "markdown",
- "id": "fluid-merit",
+ "id": "animated-durham",
"metadata": {},
"source": [
"In the next code block, we will examine how HuggingFace's **_tokenizers_** library loads a pretrained tokenizer given gpt2-vocab.json and merges.txt files. \n",
- "We will verify that, the usage of `use_gpt` flag will result in the same tokenization behavior, i.e the presence of the **Ġ** character. We will also double check that the token ids are identical to HuggingFace's **_transformer_** loaded `tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")` when applying tokenization to the exact same text ` Hello world`. \n",
+ "We will verify that the usage of the `use_gpt` flag will result in the same tokenization behavior, i.e. the presence of the **Ġ** character. We will also double check that the token IDs are identical to HuggingFace's **_transformer_** loaded `tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")` when applying tokenization to the exact same text ` Hello world`. \n",
"\n",
"Setting `use_gpt` to True will evoke the following : \n",
"\n",
" tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()\n",
" tokenizer.decoder = ByteLevelDecoder()\n",
" \n",
- "This is the expected tokenizer behavior for GPTBPE Tokenizer, this GPTBPE tokenizer will load the vocab.json and merges.txt files and tokenize as expected. Whereas setting `use_gpt` to False, will result in a normal BPE Tokenizer, the tokenization will behave differently."
+ "This is the expected behavior for the GPTBPE Tokenizer: it will load the vocab.json and merges.txt files and tokenize as expected, whereas setting the `use_gpt` flag to False will result in a normal BPE Tokenizer whose tokenization behaves differently."
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "quarterly-remains",
+ "id": "german-decimal",
"metadata": {},
"outputs": [],
"source": [
@@ -225,7 +225,7 @@
},
{
"cell_type": "markdown",
- "id": "incident-positive",
+ "id": "joined-executive",
"metadata": {},
"source": [
"Below is the expected outputs :\n",
@@ -241,16 +241,16 @@
},
{
"cell_type": "markdown",
- "id": "substantial-spank",
+ "id": "other-equality",
"metadata": {},
"source": [
- "What did we observed ? \n",
+ "What did we observe? \n",
"\n",
- "We observe that by setting `use_gpt` flag to True in HuggingFace's **_tokenizers_** library when loading the same gpt2-vocab.json and merges.txt will give us the expected behavor of GPTBPE tokenization. \n",
+ "Setting the `use_gpt` flag to True in HuggingFace's **_tokenizers_** library when loading the same gpt2-vocab.json and merges.txt gives us the expected behavior of GPTBPE tokenization. \n",
"\n",
- "We further verify, by applying tokenization to the exact same text ` Hello world`, the result of the tokenizer, with `use_gpt` flag = True, will match the result of the HuggingFace's **_transformer_** library loaded gpt2 tokenizer.\n",
+ "We can further verify that, by applying tokenization to the exact same text ` Hello world`, the result of the tokenizer with the `use_gpt` flag = True matches the result of the gpt2 tokenizer loaded with HuggingFace's **_transformer_** library.\n",
"\n",
- "Whereas setting `use_gpt` flag = False would result in a different behavior. \n",
+ "Whereas setting the `use_gpt` flag = False would result in different behavior. \n",
"\n",
"Therefore, we will enforce having :\n",
"\n",
@@ -262,7 +262,7 @@
},
{
"cell_type": "markdown",
- "id": "solid-aspect",
+ "id": "certain-indie",
"metadata": {},
"source": [
"We will now move the gpt-vocab.json and gpt2-merges.txt to the correct data folder as a preparation for the next step."
@@ -271,7 +271,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "electrical-worcester",
+ "id": "interstate-center",
"metadata": {},
"outputs": [],
"source": [
@@ -282,18 +282,18 @@
},
{
"cell_type": "markdown",
- "id": "related-saturn",
+ "id": "developing-finland",
"metadata": {},
"source": [
"---\n",
"\n",
"## Links and Resources\n",
- "Don't forget to check out additional resources such as [HuggingFace Tokenizer Documentation](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) and [Train GPT-2 in your own langauge](https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171).\n"
+ "Don't forget to check out additional resources such as [HuggingFace Tokenizer Documentation](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) and [Train GPT-2 in your own language](https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171).\n"
]
},
{
"cell_type": "markdown",
- "id": "surprised-venue",
+ "id": "appreciated-technology",
"metadata": {},
"source": [
"-----\n",
@@ -302,7 +302,7 @@
},
{
"cell_type": "markdown",
- "id": "graduate-windsor",
+ "id": "parallel-preliminary",
"metadata": {},
"source": [
"-----\n",
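
The Ġ character that the notebook asks readers to look for in gpt2-vocab.json and merges.txt comes from GPT-2's byte-level encoding. As a side note to this patch (not part of the notebook being changed), here is a minimal sketch of that mapping, reimplemented from the widely published GPT-2 reference `bytes_to_unicode` function, showing why a leading space surfaces as Ġ in the vocab and merges files:

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a printable unicode character.

    Printable bytes map to themselves; unprintable ones (including the
    space, byte 32) are shifted past 255 so each byte gets a visible,
    single-character stand-in. This is why tokens that start with a
    space appear with a leading 'Ġ' in gpt2-vocab.json and merges.txt.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shift unprintable bytes past 255
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
print(byte_encoder[ord(" ")])  # Ġ
print("".join(byte_encoder[b] for b in " Hello".encode("utf-8")))  # ĠHello
```

Under this mapping, ` Hello` and `Hello` become distinct strings (`ĠHello` vs `Hello`) before BPE merges are applied, which is exactly the space-sensitive behavior the notebook verifies with `tokenizer(" Hello world")['input_ids']`.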