{
"cells": [
{
"cell_type": "markdown",
"id": "otherwise-masters",
"metadata": {},
"source": [
"## Customize preprocess_data.py\n",
"---\n",
"\n",
"## Learning Objectives\n",
"\n",
"We fetched our own Swedish raw text data in `Lab2-1_acquiring_data.ipynb`, we learned how to find sentence boundary with custom functions in `Lab2-2_SentenceBoundary_and_Deduplicate.ipynb` , we also trained a GPTBPETokenizer and fitted it to our raw Swedish text. \n",
"\n",
"We are now ready to incorporate the custom sentence-splitter into preprocess_data.py and covert the raw Swedish text to , first json format, and then mmap format.\n",
"\n",
"Therefore, the goal of this notebook is to integrate all knowledge gained from both Lab 1 as well as the above notebooks, and challenge ourselves to further customize the preprocess_data.py with a custom sentence-splitter, and in the process, convert the new raw Sweden text to mmap format.\n",
"\n",
"More specifically, this notebook will cover the steps to :\n",
"\n",
"1. Convert the extracted raw Swedish text from webnyheter2013.txt to webnyheter2013.json.\n",
"2. Generate the mmap format files by default preprocess_data.py first to assure the possibility to move on to the next notebook in case time runs out.\n",
"\n",
"\n",
"Toward the end, there is a Mini-Challenge Jump to view Mini-Challenge.\n"
]
},
{
"cell_type": "markdown",
"id": "statutory-thesis",
"metadata": {},
"source": [
"1. Convert the extracted raw Swedish text from webnyheter2013.txt to webnyheter2013.json."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "horizontal-cause",
"metadata": {},
"outputs": [],
"source": [
"!python create_loose_json.py --infile ../dataset/SV/webnyheter2013.txt --outfile ../dataset/SV/webnyheter2013.json"
]
},
{
"cell_type": "markdown",
"id": "reserved-clear",
"metadata": {},
"source": [
"Below is the expected outputs :\n",
"\n",
" process 1000000 documents so far ...\n",
" example: – Vi har en bra generation som spelat tillsammans ett tag .\n",
"\n",
" finished processing 1249010 lines to loose json format"
]
},
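{
"cell_type": "markdown",
"id": "loose-json-peek-md",
"metadata": {},
"source": [
"Before moving on, it can help to peek at the generated file. The sketch below simply prints the first few lines and parses each with `json.loads`; it assumes the loose JSON convention that Megatron-LM expects, i.e. one {\"text\": ...} object per line.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "loose-json-peek-code",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Peek at the first few lines of the loose-JSON file.\n",
"# Assumption: one {\"text\": ...} JSON object per line (the loose json convention Megatron expects).\n",
"with open('../dataset/SV/webnyheter2013.json', 'r', encoding='utf-8') as f:\n",
"    for i, line in enumerate(f):\n",
"        if i >= 3:\n",
"            break\n",
"        doc = json.loads(line)\n",
"        print(doc['text'][:100])"
]
},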
{
"cell_type": "markdown",
"id": "fixed-closing",
"metadata": {},
"source": [
"2. Generate the mmap format files by default preprocess_data.py first to assure the possibility to move on to the next notebook in case time runs out."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "dried-intro",
"metadata": {},
"outputs": [],
"source": [
"INPUT_JSON_FILE='../dataset/SV/webnyheter2013.json'\n",
"OUTPUT_PATH='../dataset/SV/webnyheter2013_56kvocab'\n",
"VOCAB_FILE='../dataset/SV/56k/vocab.json'\n",
"MERGE_FILE='../dataset/SV/56k/merges.txt'\n",
"NUM_CPUS=16"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "addressed-meeting",
"metadata": {},
"outputs": [],
"source": [
"!python ./Megatron-LM/tools/preprocess_data.py \\\n",
" --input $INPUT_JSON_FILE \\\n",
" --output-prefix $OUTPUT_PATH \\\n",
" --json-keys text \\\n",
" --vocab-file $VOCAB_FILE \\\n",
" --merge-file $MERGE_FILE \\\n",
" --dataset-impl mmap \\\n",
" --tokenizer-type GPT2BPETokenizer \\\n",
" --workers $NUM_CPUS \\\n",
" --append-eod"
]
},
{
"cell_type": "markdown",
"id": "moderate-future",
"metadata": {},
"source": [
"Below is the expected outputs :\n",
"\n",
" Processed 1248300 documents (52998.601302473544 docs/s, 5.869853647730749 MB/s).\n",
" Processed 1248400 documents (53001.39142986273 docs/s, 5.870136451906283 MB/s).\n",
" Processed 1248500 documents (53004.16423593737 docs/s, 5.870477584597603 MB/s).\n",
" Processed 1248600 documents (53007.072626674184 docs/s, 5.870763528521501 MB/s).\n",
" Processed 1248700 documents (53009.92668081499 docs/s, 5.871081674576178 MB/s).\n",
" Processed 1248800 documents (53012.79399884911 docs/s, 5.871406835923378 MB/s).\n",
" Processed 1248900 documents (53015.61341376629 docs/s, 5.8717617499445 MB/s).\n",
" Processed 1249000 documents (53018.49277365899 docs/s, 5.8720826162486786 MB/s)."
]
},
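{
"cell_type": "markdown",
"id": "mmap-sanity-check-md",
"metadata": {},
"source": [
"As an optional sanity check, the sketch below opens the generated mmap files with Megatron's `indexed_dataset` module and prints the number of items and the first few token ids. It assumes this Megatron-LM checkout exposes `indexed_dataset.make_dataset`, the read-side counterpart of the `make_builder` call used in `preprocess_data.py`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "mmap-sanity-check-code",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.append('./Megatron-LM')\n",
"\n",
"# Assumption: this Megatron-LM checkout provides indexed_dataset.make_dataset,\n",
"# the reader that pairs with the make_builder call inside preprocess_data.py.\n",
"from megatron.data import indexed_dataset\n",
"\n",
"ds = indexed_dataset.make_dataset('../dataset/SV/webnyheter2013_56kvocab_text_document', impl='mmap')\n",
"print('number of items :', len(ds))\n",
"print('first item (token ids):', ds[0][:20])"
]
},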
{
"cell_type": "markdown",
"id": "electrical-executive",
"metadata": {},
"source": [
"Now we get the default mmap files (xxx.bin and xxx.idx ) and therefore guarantee the data needed for the next notebook to run.\n",
"We can now move on. We start by copy the old preprocess_data.py and rename it to `MYpreprocess_data.py`\n",
"\n",
"cp the preprocess_data.py into a new python script called `MYpreprocess_data.py`"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "greatest-receptor",
"metadata": {},
"outputs": [],
"source": [
"!cp ./Megatron-LM/tools/preprocess_data.py ./Megatron-LM/tools/MYpreprocess_data.py"
]
},
{
"cell_type": "markdown",
"id": "rough-pickup",
"metadata": {},
"source": [
"The below code block is our custom sentence-splitter `cut_sentence_with_quotation_marks`, the custom function is provided for your convenience for integarting to `MYpreprocess_data.py`"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "prostate-profession",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"import nltk\n",
"from nltk.tokenize import sent_tokenize\n",
"def normal_cut_sentence(temp):\n",
" return sent_tokenize(temp)\n",
"\n",
"def cut_sentence_with_quotation_marks(text):\n",
" p = re.compile(\"“.*?”\")\n",
" list = []\n",
" index = 0\n",
" length = len(text)\n",
" for i in p.finditer(text):\n",
" temp = ''\n",
" start = i.start()\n",
" end = i.end()\n",
" for j in range(index, start):\n",
" temp += text[j]\n",
" if temp != '':\n",
" temp_list = normal_cut_sentence(temp)\n",
" list += temp_list\n",
" temp = ''\n",
" for k in range(start, end):\n",
" temp += text[k]\n",
" if temp != ' ':\n",
" list.append(temp)\n",
" index = end\n",
" return list"
]
},
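{
"cell_type": "markdown",
"id": "splitter-quick-check-md",
"metadata": {},
"source": [
"As a quick check, the cell below runs the splitter on a small made-up Swedish snippet containing a “…” quotation; the sample text is illustrative only.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "splitter-quick-check-code",
"metadata": {},
"outputs": [],
"source": [
"# Quick check on a made-up example (the sample text is illustrative only).\n",
"sample = 'Hon sade “Vi har en bra generation. Vi spelar bra tillsammans” efter matchen. Laget vann med 2-1.'\n",
"for s in cut_sentence_with_quotation_marks(sample):\n",
"    print(s)"
]
},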
{
"cell_type": "markdown",
"id": "medical-incident",
"metadata": {},
"source": [
"---\n",
"## **Mini-Challenge ** - integrate the custom sentence splitter into MYpreprocess_data.py\n",
"\n",
"Task : Modify and overwrite `MYpreprocess_data.py` below to incoporate the custom `cut_sentence_with_quotation_marks`\n",
"\n",
"Pass : Successfully run Mypreprocess_data.py with the custom sentence splitter cut_sentence_with_quotation_marks and generate the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files.\n",
"\n",
"Note: the solution will be delivered to you at the end of Lab 2.\n",
"\n",
"---\n",
"Modify the below cell block to overwrite `MYpreprocess_data.py`. \n",
"After modification, Jump to Rerun cell to produce customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files.\n",
"\n",
"Jump to ReRun Cell "
]
},
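{
"cell_type": "markdown",
"id": "mini-challenge-hint-md",
"metadata": {},
"source": [
"A minimal, non-authoritative sketch of one possible wrapper is shown below: it exposes the same `tokenize` interface as `IdentitySplitter`, so an object like this could be assigned to `Encoder.splitter` inside `MYpreprocess_data.py`. It is only a hint, not the solution delivered at the end of Lab 2.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "mini-challenge-hint-code",
"metadata": {},
"outputs": [],
"source": [
"# Sketch only, not the official solution: a wrapper that mirrors IdentitySplitter's\n",
"# interface so that Encoder.splitter.tokenize(text) keeps working in MYpreprocess_data.py.\n",
"class CustomSentenceSplitter(object):\n",
"    def tokenize(self, *text):\n",
"        sentences = []\n",
"        for t in text:\n",
"            sentences.extend(cut_sentence_with_quotation_marks(t))\n",
"        return sentences\n",
"\n",
"# Quick check with a made-up example sentence:\n",
"print(CustomSentenceSplitter().tokenize('Hon sade “Hej där” och log. Sedan gick hon.'))"
]
},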
{
"cell_type": "code",
"execution_count": 2,
"id": "selected-depth",
"metadata": {},
"outputs": [],
"source": [
"%%writefile ./Megatron-LM/tools/MYpreprocess_data.py \n",
"# coding=utf-8\n",
"# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"\n",
"\"\"\"Processing data for pretraining.\"\"\"\n",
"\n",
"import argparse\n",
"import json\n",
"import multiprocessing\n",
"import os\n",
"import sys\n",
"sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),\n",
" os.path.pardir)))\n",
"import time\n",
"\n",
"import torch\n",
"try:\n",
" import nltk\n",
" nltk_available = True\n",
"except ImportError:\n",
" nltk_available = False\n",
"\n",
"from megatron.tokenizer import build_tokenizer\n",
"from megatron.data import indexed_dataset\n",
"\n",
"\n",
"# https://stackoverflow.com/questions/33139531/preserve-empty-lines-with-nltks-punkt-tokenizer\n",
"class CustomLanguageVars(nltk.tokenize.punkt.PunktLanguageVars):\n",
"\n",
" _period_context_fmt = r\"\"\"\n",
" \\S* # some word material\n",
" %(SentEndChars)s # a potential sentence ending\n",
" \\s* # <-- THIS is what I changed\n",
" (?=(?P\n",
" %(NonWord)s # either other punctuation\n",
" |\n",
" (?P\\S+) # <-- Normally you would have \\s+ here\n",
" ))\"\"\"\n",
"\n",
"class IdentitySplitter(object):\n",
" def tokenize(self, *text):\n",
" return text\n",
"\"\"\"[TODO]: modify this class to integrate the custom sentence splitter above \"\"\"\n",
"\n",
"class Encoder(object):\n",
" def __init__(self, args):\n",
" self.args = args\n",
" \n",
" def initializer(self):\n",
" # Use Encoder class as a container for global data\n",
" Encoder.tokenizer = build_tokenizer(self.args)\n",
" if self.args.split_sentences:\n",
" if not nltk_available:\n",
" print(\"NLTK is not available to split sentences.\")\n",
" exit()\n",
" splitter = nltk.load(\"tokenizers/punkt/english.pickle\")\n",
" if self.args.keep_newlines:\n",
" # this prevents punkt from eating newlines after sentences\n",
" Encoder.splitter = nltk.tokenize.punkt.PunktSentenceTokenizer(\n",
" train_text = splitter._params,\n",
" lang_vars = CustomLanguageVars())\n",
" else:\n",
" Encoder.splitter = splitter\n",
"\n",
" else:\n",
" Encoder.splitter = IdentitySplitter()\n",
"\n",
" def encode(self, json_line):\n",
" data = json.loads(json_line)\n",
" ids = {}\n",
" for key in self.args.json_keys:\n",
" text = data[key]\n",
" doc_ids = []\n",
" for sentence in Encoder.splitter.tokenize(text):\n",
" sentence_ids = Encoder.tokenizer.tokenize(sentence)\n",
" if len(sentence_ids) > 0:\n",
" doc_ids.append(sentence_ids)\n",
" if len(doc_ids) > 0 and self.args.append_eod:\n",
" doc_ids[-1].append(Encoder.tokenizer.eod)\n",
" ids[key] = doc_ids\n",
" return ids, len(json_line)\n",
"\n",
"def get_args():\n",
" parser = argparse.ArgumentParser()\n",
" group = parser.add_argument_group(title='input data')\n",
" group.add_argument('--input', type=str, required=True,\n",
" help='Path to input JSON')\n",
" group.add_argument('--json-keys', nargs='+', default=['text'],\n",
" help='space separate listed of keys to extract from json')\n",
" group.add_argument('--split-sentences', action='store_true',\n",
" help='Split documents into sentences.')\n",
" group.add_argument('--keep-newlines', action='store_true',\n",
" help='Keep newlines between sentences when splitting.')\n",
"\n",
" group = parser.add_argument_group(title='tokenizer')\n",
" group.add_argument('--tokenizer-type', type=str, required=True,\n",
" choices=['BertWordPieceLowerCase','BertWordPieceCase',\n",
" 'GPT2BPETokenizer'],\n",
" help='What type of tokenizer to use.')\n",
" group.add_argument('--vocab-file', type=str, default=None,\n",
" help='Path to the vocab file')\n",
" group.add_argument('--merge-file', type=str, default=None,\n",
" help='Path to the BPE merge file (if necessary).')\n",
" group.add_argument('--append-eod', action='store_true',\n",
" help='Append an token to the end of a document.')\n",
"\n",
"\n",
" group = parser.add_argument_group(title='output data')\n",
" group.add_argument('--output-prefix', type=str, required=True,\n",
" help='Path to binary output file without suffix')\n",
" group.add_argument('--dataset-impl', type=str, default='mmap',\n",
" choices=['lazy', 'cached', 'mmap'])\n",
"\n",
" group = parser.add_argument_group(title='runtime')\n",
" group.add_argument('--workers', type=int, default=1,\n",
" help='Number of worker processes to launch')\n",
" group.add_argument('--log-interval', type=int, default=100,\n",
" help='Interval between progress updates')\n",
" args = parser.parse_args()\n",
" args.keep_empty = False\n",
"\n",
" if args.tokenizer_type.lower().startswith('bert'):\n",
" if not args.split_sentences:\n",
" print(\"Bert tokenizer detected, are you sure you don't want to split sentences?\")\n",
"\n",
" # some default/dummy values for the tokenizer\n",
" args.rank = 0\n",
" args.make_vocab_size_divisible_by = 128\n",
" args.tensor_model_parallel_size = 1\n",
" args.vocab_extra_ids = 0\n",
"\n",
" return args\n",
"\n",
"def main():\n",
" args = get_args()\n",
" startup_start = time.time()\n",
"\n",
" print(\"Opening\", args.input)\n",
" fin = open(args.input, 'r', encoding='utf-8')\n",
"\n",
" if nltk_available and args.split_sentences:\n",
" nltk.download(\"punkt\", quiet=True)\n",
"\n",
" encoder = Encoder(args)\n",
" tokenizer = build_tokenizer(args)\n",
" pool = multiprocessing.Pool(args.workers, initializer=encoder.initializer)\n",
" encoded_docs = pool.imap(encoder.encode, fin, 25)\n",
" #encoded_docs = map(encoder.encode, fin)\n",
"\n",
" level = \"document\"\n",
" if args.split_sentences:\n",
" level = \"sentence\"\n",
"\n",
" print(f\"Vocab size: {tokenizer.vocab_size}\")\n",
" print(f\"Output prefix: {args.output_prefix}\")\n",
" output_bin_files = {}\n",
" output_idx_files = {}\n",
" builders = {}\n",
" for key in args.json_keys:\n",
" output_bin_files[key] = \"{}_{}_{}.bin\".format(args.output_prefix,\n",
" key, level)\n",
" output_idx_files[key] = \"{}_{}_{}.idx\".format(args.output_prefix,\n",
" key, level)\n",
" builders[key] = indexed_dataset.make_builder(output_bin_files[key],\n",
" impl=args.dataset_impl,\n",
" vocab_size=tokenizer.vocab_size)\n",
"\n",
" startup_end = time.time()\n",
" proc_start = time.time()\n",
" total_bytes_processed = 0\n",
" print(\"Time to startup:\", startup_end - startup_start)\n",
"\n",
" for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1):\n",
" total_bytes_processed += bytes_processed\n",
" for key, sentences in doc.items():\n",
" if len(sentences) == 0:\n",
" continue\n",
" for sentence in sentences:\n",
" builders[key].add_item(torch.IntTensor(sentence))\n",
" builders[key].end_document()\n",
" if i % args.log_interval == 0:\n",
" current = time.time()\n",
" elapsed = current - proc_start\n",
" mbs = total_bytes_processed/elapsed/1024/1024\n",
" print(f\"Processed {i} documents\",\n",
" f\"({i/elapsed} docs/s, {mbs} MB/s).\",\n",
" file=sys.stderr)\n",
"\n",
" for key in args.json_keys:\n",
" builders[key].finalize(output_idx_files[key])\n",
"\n",
"if __name__ == '__main__':\n",
" main()"
]
},
{
"cell_type": "markdown",
"id": "raised-victim",
"metadata": {},
"source": [
"Below cell block specify all the input parameters in order to run `MYpreprocess_data.py`. \n",
"\n",
"Please do **NOT** modify anything in below cell."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "fluid-dayton",
"metadata": {},
"outputs": [],
"source": [
"INPUT_JSON_FILE='../dataset/SV/webnyheter2013.json'\n",
"OUTPUT_PATH='../dataset/SV/customSentenceSplit'\n",
"VOCAB_FILE='../dataset/SV/32k/vocab.json'\n",
"MERGE_FILE='../dataset/SV/32k/merges.txt'\n",
"NUM_CPUS=16"
]
},
{
"cell_type": "markdown",
"id": "concerned-protest",
"metadata": {},
"source": [
"Below is a ReRun cell block to run `MYpreprocess_data.py` and produce the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files.\n",
"\n",
"\n",
"\n",
"Go back and modify `MYpreprocess_data.py`, click on this shortcut link to Jump to Modify MYpreprocess_data.py "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "rolled-welcome",
"metadata": {},
"outputs": [],
"source": [
"!python ./Megatron-LM/tools/MYpreprocess_data.py \\\n",
" --input $INPUT_JSON_FILE \\\n",
" --output-prefix $OUTPUT_PATH \\\n",
" --json-keys text \\\n",
" --vocab-file $VOCAB_FILE \\\n",
" --merge-file $MERGE_FILE \\\n",
" --dataset-impl mmap \\\n",
" --tokenizer-type GPT2BPETokenizer \\\n",
" --workers $NUM_CPUS \\\n",
" --append-eod"
]
},
{
"cell_type": "markdown",
"id": "reduced-court",
"metadata": {},
"source": [
"Check whether these two files : customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files are successfully generated."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "secondary-stereo",
"metadata": {},
"outputs": [],
"source": [
"! ls ../dataset/SV/"
]
},
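{
"cell_type": "markdown",
"id": "programmatic-file-check-md",
"metadata": {},
"source": [
"If you prefer a programmatic check, the sketch below verifies that both output files exist and are non-empty; the paths assume the default OUTPUT_PATH prefix used above.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "programmatic-file-check-code",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Programmatic check that the mmap files were produced and are non-empty.\n",
"# The paths follow the OUTPUT_PATH prefix used above.\n",
"for suffix in ('bin', 'idx'):\n",
"    path = f'../dataset/SV/customSentenceSplit_text_document.{suffix}'\n",
"    exists = os.path.exists(path)\n",
"    size = os.path.getsize(path) if exists else 0\n",
"    print(f'{path}: exists={exists}, size={size} bytes')"
]
},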
{
"cell_type": "code",
"execution_count": null,
"id": "premier-birth",
"metadata": {},
"outputs": [],
"source": [
"## clean up to free up space\n",
"!rm ./Megatron-LM/tools/MYpreprocess_data.py"
]
},
{
"cell_type": "markdown",
"id": "eastern-ministry",
"metadata": {},
"source": [
"-----\n",
"## HOME NEXT
"
]
},
{
"cell_type": "markdown",
"id": "qualified-admission",
"metadata": {},
"source": [
"-----\n",
"\n",
"\n",
"## Licensing \n",
"\n",
"This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}