{ "cells": [ { "cell_type": "markdown", "id": "otherwise-masters", "metadata": {}, "source": [ "## Customize preprocess_data.py\n", "---\n", "\n", "## Learning Objectives\n", "\n", "We fetched our own Swedish raw text data in `Lab2-1_acquiring_data.ipynb`, we learned how to find sentence boundary with custom functions in `Lab2-2_SentenceBoundary_and_Deduplicate.ipynb` , we also trained a GPTBPETokenizer and fitted it to our raw Swedish text. \n", "\n", "We are now ready to incorporate the custom sentence-splitter into preprocess_data.py and covert the raw Swedish text to , first json format, and then mmap format.\n", "\n", "Therefore, the goal of this notebook is to integrate all knowledge gained from both Lab 1 as well as the above notebooks, and challenge ourselves to further customize the preprocess_data.py with a custom sentence-splitter, and in the process, convert the new raw Sweden text to mmap format.\n", "\n", "More specifically, this notebook will cover the steps to :\n", "\n", "1. Convert the extracted raw Swedish text from webnyheter2013.txt to webnyheter2013.json.\n", "2. Generate the mmap format files by default preprocess_data.py first to assure the possibility to move on to the next notebook in case time runs out.\n", "\n", "\n", "Toward the end, there is a Mini-Challenge Jump to view Mini-Challenge.\n" ] }, { "cell_type": "markdown", "id": "statutory-thesis", "metadata": {}, "source": [ "1. Convert the extracted raw Swedish text from webnyheter2013.txt to webnyheter2013.json." ] }, { "cell_type": "code", "execution_count": null, "id": "horizontal-cause", "metadata": {}, "outputs": [], "source": [ "!python create_loose_json.py --infile ../dataset/SV/webnyheter2013.txt --outfile ../dataset/SV/webnyheter2013.json" ] }, { "cell_type": "markdown", "id": "reserved-clear", "metadata": {}, "source": [ "Below is the expected outputs :\n", "\n", " process 1000000 documents so far ...\n", " example: – Vi har en bra generation som spelat tillsammans ett tag .\n", "\n", " finished processing 1249010 lines to loose json format" ] }, { "cell_type": "markdown", "id": "fixed-closing", "metadata": {}, "source": [ "2. Generate the mmap format files by default preprocess_data.py first to assure the possibility to move on to the next notebook in case time runs out." 
] }, { "cell_type": "code", "execution_count": 7, "id": "dried-intro", "metadata": {}, "outputs": [], "source": [ "INPUT_JSON_FILE='../dataset/SV/webnyheter2013.json'\n", "OUTPUT_PATH='../dataset/SV/webnyheter2013_56kvocab'\n", "VOCAB_FILE='../dataset/SV/56k/vocab.json'\n", "MERGE_FILE='../dataset/SV/56k/merges.txt'\n", "NUM_CPUS=16" ] }, { "cell_type": "code", "execution_count": null, "id": "addressed-meeting", "metadata": {}, "outputs": [], "source": [ "!python ./Megatron-LM/tools/preprocess_data.py \\\n", " --input $INPUT_JSON_FILE \\\n", " --output-prefix $OUTPUT_PATH \\\n", " --json-keys text \\\n", " --vocab-file $VOCAB_FILE \\\n", " --merge-file $MERGE_FILE \\\n", " --dataset-impl mmap \\\n", " --tokenizer-type GPT2BPETokenizer \\\n", " --workers $NUM_CPUS \\\n", " --append-eod" ] }, { "cell_type": "markdown", "id": "moderate-future", "metadata": {}, "source": [ "Below is the expected outputs :\n", "\n", " Processed 1248300 documents (52998.601302473544 docs/s, 5.869853647730749 MB/s).\n", " Processed 1248400 documents (53001.39142986273 docs/s, 5.870136451906283 MB/s).\n", " Processed 1248500 documents (53004.16423593737 docs/s, 5.870477584597603 MB/s).\n", " Processed 1248600 documents (53007.072626674184 docs/s, 5.870763528521501 MB/s).\n", " Processed 1248700 documents (53009.92668081499 docs/s, 5.871081674576178 MB/s).\n", " Processed 1248800 documents (53012.79399884911 docs/s, 5.871406835923378 MB/s).\n", " Processed 1248900 documents (53015.61341376629 docs/s, 5.8717617499445 MB/s).\n", " Processed 1249000 documents (53018.49277365899 docs/s, 5.8720826162486786 MB/s)." ] }, { "cell_type": "markdown", "id": "electrical-executive", "metadata": {}, "source": [ "Now we get the default mmap files (xxx.bin and xxx.idx ) and therefore guarantee the data needed for the next notebook to run.\n", "We can now move on. 
 ] }, { "cell_type": "markdown", "id": "copy-script", "metadata": {}, "source": [ "We start by copying the old preprocess_data.py and renaming it to `MYpreprocess_data.py`.\n", "\n", "cp preprocess_data.py into a new Python script called `MYpreprocess_data.py`:" ] }, { "cell_type": "code", "execution_count": 2, "id": "greatest-receptor", "metadata": {}, "outputs": [], "source": [ "!cp ./Megatron-LM/tools/preprocess_data.py ./Megatron-LM/tools/MYpreprocess_data.py" ] }, { "cell_type": "markdown", "id": "south-devil", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "rough-pickup", "metadata": {}, "source": [ "The code block below is our custom sentence-splitter `cut_sentence_with_quotation_marks`; the function is provided for your convenience, for integrating into `MYpreprocess_data.py`." ] }, { "cell_type": "code", "execution_count": 3, "id": "prostate-profession", "metadata": {}, "outputs": [], "source": [ "import re\n", "import nltk\n", "from nltk.tokenize import sent_tokenize\n", "\n", "def normal_cut_sentence(temp):\n", "    return sent_tokenize(temp)\n", "\n", "def cut_sentence_with_quotation_marks(text):\n", "    # split text into sentences while keeping each “quoted span” together as one piece\n", "    p = re.compile(\"“.*?”\")\n", "    sentences = []\n", "    index = 0\n", "    for i in p.finditer(text):\n", "        start = i.start()\n", "        end = i.end()\n", "        # sentence-split the text that precedes this quoted span\n", "        temp = text[index:start]\n", "        if temp != '':\n", "            sentences += normal_cut_sentence(temp)\n", "        # keep the quoted span itself as a single piece\n", "        quoted = text[start:end]\n", "        if quoted != ' ':\n", "            sentences.append(quoted)\n", "        index = end\n", "    # sentence-split any remaining text after the last quoted span\n", "    # (this also covers text that contains no quotation marks at all)\n", "    tail = text[index:]\n", "    if tail.strip() != '':\n", "        sentences += normal_cut_sentence(tail)\n", "    return sentences"
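 ] }, { "cell_type": "markdown", "id": "splitter-demo-note", "metadata": {}, "source": [ "As a quick illustration (not part of the original lab flow), the cell below runs `cut_sentence_with_quotation_marks` on a small made-up example that contains “quotation marks”. The quoted span should stay together as a single piece while the surrounding text is split by NLTK; this assumes the NLTK punkt model can be downloaded or is already available in this environment." ] }, { "cell_type": "code", "execution_count": null, "id": "splitter-demo", "metadata": {}, "outputs": [], "source": [ "import nltk\n", "\n", "# Make sure the punkt sentence-tokenizer data is available for sent_tokenize.\n", "nltk.download('punkt', quiet=True)\n", "\n", "# Hypothetical example text (not taken from the dataset) to illustrate the splitter.\n", "sample = 'He said “I will be there tomorrow. Do not be late.” Then he left. The meeting went well.'\n", "for piece in cut_sentence_with_quotation_marks(sample):\n", "    print(repr(piece))"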
 ] }, { "cell_type": "markdown", "id": "large-birthday", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "medical-incident", "metadata": {}, "source": [ "---\n", "## **Mini-Challenge** - integrate the custom sentence splitter into MYpreprocess_data.py\n", "\n", "Task: Modify and overwrite `MYpreprocess_data.py` below to incorporate the custom `cut_sentence_with_quotation_marks` splitter.\n", "\n", "Pass: Successfully run `MYpreprocess_data.py` with the custom sentence splitter `cut_sentence_with_quotation_marks` and generate the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files.\n", "\n", "Note: the solution will be delivered to you at the end of Lab 2.\n", "\n", "---\n", "Modify the cell below to overwrite `MYpreprocess_data.py`.\n", "After modification, jump to the ReRun cell to produce the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files.\n", "\n", "Jump to ReRun Cell " ] }, { "cell_type": "code", "execution_count": 2, "id": "selected-depth", "metadata": {}, "outputs": [], "source": [ "%%writefile ./Megatron-LM/tools/MYpreprocess_data.py \n", "# coding=utf-8\n", "# Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "#     http://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License.\n", "\n", "\"\"\"Processing data for pretraining.\"\"\"\n", "\n", "import argparse\n", "import json\n", "import multiprocessing\n", "import os\n", "import sys\n", "sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),\n", "                                             os.path.pardir)))\n", "import time\n", "\n", "import torch\n", "try:\n", "    import nltk\n", "    nltk_available = True\n", "except ImportError:\n", "    nltk_available = False\n", "\n", "from megatron.tokenizer import build_tokenizer\n", "from megatron.data import indexed_dataset\n", "\n", "\n", "# https://stackoverflow.com/questions/33139531/preserve-empty-lines-with-nltks-punkt-tokenizer\n", "class CustomLanguageVars(nltk.tokenize.punkt.PunktLanguageVars):\n", "\n", "    _period_context_fmt = r\"\"\"\n", "        \\S*                          # some word material\n", "        %(SentEndChars)s             # a potential sentence ending\n", "        \\s*                       #  <-- THIS is what I changed\n", "        (?=(?P<after_tok>\n", "            %(NonWord)s              # either other punctuation\n", "            |\n", "            (?P<next_tok>\\S+)     #  <-- Normally you would have \\s+ here\n", "        ))\"\"\"\n", "\n", "class IdentitySplitter(object):\n", "    def tokenize(self, *text):\n", "        return text\n", "\"\"\"[TODO]: modify this class to integrate the custom sentence splitter above \"\"\"\n", "\n", "class Encoder(object):\n", "    def __init__(self, args):\n", "        self.args = args\n", "\n", "    def initializer(self):\n", "        # Use Encoder class as a container for global data\n", "        Encoder.tokenizer = build_tokenizer(self.args)\n", "        if self.args.split_sentences:\n", "            if not nltk_available:\n", "                print(\"NLTK is not available to split sentences.\")\n", "                exit()\n", "            splitter = nltk.load(\"tokenizers/punkt/english.pickle\")\n", "            if self.args.keep_newlines:\n", "                # this prevents punkt from eating newlines after sentences\n", "                Encoder.splitter = nltk.tokenize.punkt.PunktSentenceTokenizer(\n", "                    train_text = splitter._params,\n", "                    lang_vars = CustomLanguageVars())\n", "            else:\n", "                Encoder.splitter = splitter\n", "\n", "        else:\n", "            Encoder.splitter = IdentitySplitter()\n", "\n", "    def encode(self, json_line):\n", "        data = json.loads(json_line)\n", "        ids = {}\n", "        for key in self.args.json_keys:\n", "            text = data[key]\n", "            doc_ids = []\n", "            for sentence in Encoder.splitter.tokenize(text):\n", "                sentence_ids = Encoder.tokenizer.tokenize(sentence)\n", "                if len(sentence_ids) > 0:\n", "                    doc_ids.append(sentence_ids)\n", "            if len(doc_ids) > 0 and self.args.append_eod:\n", "                doc_ids[-1].append(Encoder.tokenizer.eod)\n", "            ids[key] = doc_ids\n", "        return ids, len(json_line)\n", "\n", "def get_args():\n", "    parser = argparse.ArgumentParser()\n", "    group = parser.add_argument_group(title='input data')\n", "    group.add_argument('--input', type=str, required=True,\n", "                       help='Path to input JSON')\n", "    group.add_argument('--json-keys', nargs='+', default=['text'],\n", "                       help='space separated list of keys to extract from json')\n", "    group.add_argument('--split-sentences', action='store_true',\n", "                       help='Split documents into sentences.')\n",
"    group.add_argument('--keep-newlines', action='store_true',\n", "                       help='Keep newlines between sentences when splitting.')\n", "\n", "    group = parser.add_argument_group(title='tokenizer')\n", "    group.add_argument('--tokenizer-type', type=str, required=True,\n", "                       choices=['BertWordPieceLowerCase','BertWordPieceCase',\n", "                                'GPT2BPETokenizer'],\n", "                       help='What type of tokenizer to use.')\n", "    group.add_argument('--vocab-file', type=str, default=None,\n", "                       help='Path to the vocab file')\n", "    group.add_argument('--merge-file', type=str, default=None,\n", "                       help='Path to the BPE merge file (if necessary).')\n", "    group.add_argument('--append-eod', action='store_true',\n", "                       help='Append an <eod> token to the end of a document.')\n", "\n", "\n", "    group = parser.add_argument_group(title='output data')\n", "    group.add_argument('--output-prefix', type=str, required=True,\n", "                       help='Path to binary output file without suffix')\n", "    group.add_argument('--dataset-impl', type=str, default='mmap',\n", "                       choices=['lazy', 'cached', 'mmap'])\n", "\n", "    group = parser.add_argument_group(title='runtime')\n", "    group.add_argument('--workers', type=int, default=1,\n", "                       help='Number of worker processes to launch')\n", "    group.add_argument('--log-interval', type=int, default=100,\n", "                       help='Interval between progress updates')\n", "    args = parser.parse_args()\n", "    args.keep_empty = False\n", "\n", "    if args.tokenizer_type.lower().startswith('bert'):\n", "        if not args.split_sentences:\n", "            print(\"Bert tokenizer detected, are you sure you don't want to split sentences?\")\n", "\n", "    # some default/dummy values for the tokenizer\n", "    args.rank = 0\n", "    args.make_vocab_size_divisible_by = 128\n", "    args.tensor_model_parallel_size = 1\n", "    args.vocab_extra_ids = 0\n", "\n", "    return args\n", "\n", "def main():\n", "    args = get_args()\n", "    startup_start = time.time()\n", "\n", "    print(\"Opening\", args.input)\n", "    fin = open(args.input, 'r', encoding='utf-8')\n", "\n", "    if nltk_available and args.split_sentences:\n", "        nltk.download(\"punkt\", quiet=True)\n", "\n", "    encoder = Encoder(args)\n", "    tokenizer = build_tokenizer(args)\n", "    pool = multiprocessing.Pool(args.workers, initializer=encoder.initializer)\n", "    encoded_docs = pool.imap(encoder.encode, fin, 25)\n", "    #encoded_docs = map(encoder.encode, fin)\n", "\n", "    level = \"document\"\n", "    if args.split_sentences:\n", "        level = \"sentence\"\n", "\n", "    print(f\"Vocab size: {tokenizer.vocab_size}\")\n", "    print(f\"Output prefix: {args.output_prefix}\")\n", "    output_bin_files = {}\n", "    output_idx_files = {}\n", "    builders = {}\n", "    for key in args.json_keys:\n", "        output_bin_files[key] = \"{}_{}_{}.bin\".format(args.output_prefix,\n", "                                                      key, level)\n", "        output_idx_files[key] = \"{}_{}_{}.idx\".format(args.output_prefix,\n", "                                                      key, level)\n", "        builders[key] = indexed_dataset.make_builder(output_bin_files[key],\n", "                                                     impl=args.dataset_impl,\n", "                                                     vocab_size=tokenizer.vocab_size)\n", "\n", "    startup_end = time.time()\n", "    proc_start = time.time()\n", "    total_bytes_processed = 0\n", "    print(\"Time to startup:\", startup_end - startup_start)\n", "\n", "    for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1):\n", "        total_bytes_processed += bytes_processed\n", "        for key, sentences in doc.items():\n", "            if len(sentences) == 0:\n", "                continue\n", "            for sentence in sentences:\n", "                builders[key].add_item(torch.IntTensor(sentence))\n", "            builders[key].end_document()\n", "        if i % args.log_interval == 0:\n",
"            current = time.time()\n", "            elapsed = current - proc_start\n", "            mbs = total_bytes_processed/elapsed/1024/1024\n", "            print(f\"Processed {i} documents\",\n", "                  f\"({i/elapsed} docs/s, {mbs} MB/s).\",\n", "                  file=sys.stderr)\n", "\n", "    for key in args.json_keys:\n", "        builders[key].finalize(output_idx_files[key])\n", "\n", "if __name__ == '__main__':\n", "    main()" ] }, { "cell_type": "markdown", "id": "raised-victim", "metadata": {}, "source": [ "The cell below specifies all the input parameters needed to run `MYpreprocess_data.py`.\n", "\n", "Please do **NOT** modify anything in the cell below." ] }, { "cell_type": "code", "execution_count": 11, "id": "fluid-dayton", "metadata": {}, "outputs": [], "source": [ "INPUT_JSON_FILE='../dataset/SV/webnyheter2013.json'\n", "OUTPUT_PATH='../dataset/SV/customSentenceSplit'\n", "VOCAB_FILE='../dataset/SV/32k/vocab.json'\n", "MERGE_FILE='../dataset/SV/32k/merges.txt'\n", "NUM_CPUS=16" ] }, { "cell_type": "markdown", "id": "concerned-protest", "metadata": {}, "source": [ "Below is the ReRun cell, which runs `MYpreprocess_data.py` and produces the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files.\n", "\n", "To go back and modify `MYpreprocess_data.py`, click on this shortcut link: Jump to Modify MYpreprocess_data.py " ] }, { "cell_type": "code", "execution_count": null, "id": "rolled-welcome", "metadata": {}, "outputs": [], "source": [ "!python ./Megatron-LM/tools/MYpreprocess_data.py \\\n", "    --input $INPUT_JSON_FILE \\\n", "    --output-prefix $OUTPUT_PATH \\\n", "    --json-keys text \\\n", "    --vocab-file $VOCAB_FILE \\\n", "    --merge-file $MERGE_FILE \\\n", "    --dataset-impl mmap \\\n", "    --tokenizer-type GPT2BPETokenizer \\\n", "    --workers $NUM_CPUS \\\n", "    --append-eod" ] }, { "cell_type": "markdown", "id": "reduced-court", "metadata": {}, "source": [ "Check whether the two files customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx were successfully generated." ] }, { "cell_type": "code", "execution_count": null, "id": "secondary-stereo", "metadata": {}, "outputs": [], "source": [ "! ls ../dataset/SV/" ] }, { "cell_type": "code", "execution_count": null, "id": "premier-birth", "metadata": {}, "outputs": [], "source": [ "## clean up to free up space\n", "!rm ./Megatron-LM/tools/MYpreprocess_data.py" ] }, { "cell_type": "markdown", "id": "eastern-ministry", "metadata": {}, "source": [ "-----\n", "## HOME       NEXT
" ] }, { "cell_type": "markdown", "id": "qualified-admission", "metadata": {}, "source": [ "-----\n", "\n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }