{
"cells": [
{
"cell_type": "markdown",
"id": "otherwise-masters",
"metadata": {},
"source": [
"## Customize preprocess_data.py\n",
"---\n",
"\n",
"## Learning Objectives\n",
"\n",
"We fetched our own Swedish raw text data in `Lab2-1_acquiring_data.ipynb`, we learned how to find sentence boundary with custom functions in `Lab2-2_SentenceBoundary_and_Deduplicate.ipynb` , we also trained a GPTBPETokenizer and fitted it to our raw Swedish text. \n",
"\n",
"We are now ready to incorporate the custom sentence-splitter into preprocess_data.py and covert the raw Swedish text to , first json format, and then mmap format.\n",
"\n",
"Therefore, the goal of this notebook is to integrate all knowledge gained from both Lab 1 as well as the above notebooks, and challenge ourselves to further customize the preprocess_data.py with a custom sentence-splitter, and in the process, convert the new raw Sweden text to mmap format.\n",
"\n",
"More specifically, this notebook will cover the steps to :\n",
"\n",
"1. Convert the extracted raw Swedish text from webnyheter2013.txt to webnyheter2013.json.\n",
"2. Generate the mmap format files by default preprocess_data.py first to assure the possibility to move on to the next notebook in case time runs out.\n",
"\n",
"\n",
"Toward the end, there is a Mini-Challenge Jump to view Mini-Challenge.\n"
]
},
{
"cell_type": "markdown",
"id": "statutory-thesis",
"metadata": {},
"source": [
"1. Convert the extracted raw Swedish text from webnyheter2013.txt to webnyheter2013.json."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "horizontal-cause",
"metadata": {},
"outputs": [],
"source": [
"!python create_loose_json.py --infile ../dataset/SV/webnyheter2013.txt --outfile ../dataset/SV/webnyheter2013.json"
]
},
{
"cell_type": "markdown",
"id": "reserved-clear",
"metadata": {},
"source": [
"Below is the expected outputs :\n",
"\n",
" process 1000000 documents so far ...\n",
" example: – Vi har en bra generation som spelat tillsammans ett tag .\n",
"\n",
" finished processing 1249010 lines to loose json format"
]
},
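{
"cell_type": "markdown",
"id": "loose-json-peek-md",
"metadata": {},
"source": [
"Before moving on, it can help to peek at the generated file. The sketch below simply prints the first few lines and parses each with `json.loads`; it assumes the loose JSON convention that Megatron-LM expects, i.e. one {\"text\": ...} object per line.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "loose-json-peek-code",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Peek at the first few lines of the loose-JSON file.\n",
"# Assumption: one {\"text\": ...} JSON object per line (the loose json convention Megatron expects).\n",
"with open('../dataset/SV/webnyheter2013.json', 'r', encoding='utf-8') as f:\n",
"    for i, line in enumerate(f):\n",
"        if i >= 3:\n",
"            break\n",
"        doc = json.loads(line)\n",
"        print(doc['text'][:100])"
]
},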
{
"cell_type": "markdown",
"id": "fixed-closing",
"metadata": {},
"source": [
"2. Generate the mmap format files by default preprocess_data.py first to assure the possibility to move on to the next notebook in case time runs out."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "dried-intro",
"metadata": {},
"outputs": [],
"source": [
"INPUT_JSON_FILE='../dataset/SV/webnyheter2013.json'\n",
"OUTPUT_PATH='../dataset/SV/webnyheter2013_56kvocab'\n",
"VOCAB_FILE='../dataset/SV/56k/vocab.json'\n",
"MERGE_FILE='../dataset/SV/56k/merges.txt'\n",
"NUM_CPUS=16"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "addressed-meeting",
"metadata": {},
"outputs": [],
"source": [
"!python ./Megatron-LM/tools/preprocess_data.py \\\n",
" --input $INPUT_JSON_FILE \\\n",
" --output-prefix $OUTPUT_PATH \\\n",
" --json-keys text \\\n",
" --vocab-file $VOCAB_FILE \\\n",
" --merge-file $MERGE_FILE \\\n",
" --dataset-impl mmap \\\n",
" --tokenizer-type GPT2BPETokenizer \\\n",
" --workers $NUM_CPUS \\\n",
" --append-eod"
]
},
{
"cell_type": "markdown",
"id": "moderate-future",
"metadata": {},
"source": [
"Below is the expected outputs :\n",
"\n",
" Processed 1248300 documents (52998.601302473544 docs/s, 5.869853647730749 MB/s).\n",
" Processed 1248400 documents (53001.39142986273 docs/s, 5.870136451906283 MB/s).\n",
" Processed 1248500 documents (53004.16423593737 docs/s, 5.870477584597603 MB/s).\n",
" Processed 1248600 documents (53007.072626674184 docs/s, 5.870763528521501 MB/s).\n",
" Processed 1248700 documents (53009.92668081499 docs/s, 5.871081674576178 MB/s).\n",
" Processed 1248800 documents (53012.79399884911 docs/s, 5.871406835923378 MB/s).\n",
" Processed 1248900 documents (53015.61341376629 docs/s, 5.8717617499445 MB/s).\n",
" Processed 1249000 documents (53018.49277365899 docs/s, 5.8720826162486786 MB/s)."
]
},
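{
"cell_type": "markdown",
"id": "mmap-sanity-check-md",
"metadata": {},
"source": [
"As an optional sanity check, the sketch below opens the generated mmap files with Megatron's `indexed_dataset` module and prints the number of items and the first few token ids. It assumes this Megatron-LM checkout exposes `indexed_dataset.make_dataset`, the read-side counterpart of the `make_builder` call used in `preprocess_data.py`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "mmap-sanity-check-code",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.append('./Megatron-LM')\n",
"\n",
"# Assumption: this Megatron-LM checkout provides indexed_dataset.make_dataset,\n",
"# the reader that pairs with the make_builder call inside preprocess_data.py.\n",
"from megatron.data import indexed_dataset\n",
"\n",
"ds = indexed_dataset.make_dataset('../dataset/SV/webnyheter2013_56kvocab_text_document', impl='mmap')\n",
"print('number of items :', len(ds))\n",
"print('first item (token ids):', ds[0][:20])"
]
},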
{
"cell_type": "markdown",
"id": "electrical-executive",
"metadata": {},
"source": [
"Now we get the default mmap files (xxx.bin and xxx.idx ) and therefore guarantee the data needed for the next notebook to run.\n",
"We can now move on. We start by copy the old preprocess_data.py and rename it to `MYpreprocess_data.py`\n",
"\n",
"cp the preprocess_data.py into a new python script called `MYpreprocess_data.py`"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "greatest-receptor",
"metadata": {},
"outputs": [],
"source": [
"!cp ./Megatron-LM/tools/preprocess_data.py ./Megatron-LM/tools/MYpreprocess_data.py"
]
},
{
"cell_type": "markdown",
"id": "rough-pickup",
"metadata": {},
"source": [
"The below code block is our custom sentence-splitter `cut_sentence_with_quotation_marks`, the custom function is provided for your convenience for integarting to `MYpreprocess_data.py`"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "prostate-profession",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"import nltk\n",
"from nltk.tokenize import sent_tokenize\n",
"def normal_cut_sentence(temp):\n",
" return sent_tokenize(temp)\n",
"\n",
"def cut_sentence_with_quotation_marks(text):\n",
" p = re.compile(\"“.*?”\")\n",
" list = []\n",
" index = 0\n",
" length = len(text)\n",
" for i in p.finditer(text):\n",
" temp = ''\n",
" start = i.start()\n",
" end = i.end()\n",
" for j in range(index, start):\n",
" temp += text[j]\n",
" if temp != '':\n",
" temp_list = normal_cut_sentence(temp)\n",
" list += temp_list\n",
" temp = ''\n",
" for k in range(start, end):\n",
" temp += text[k]\n",
" if temp != ' ':\n",
" list.append(temp)\n",
" index = end\n",
" return list"
]
},
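{
"cell_type": "markdown",
"id": "splitter-quick-check-md",
"metadata": {},
"source": [
"As a quick check, the cell below runs the splitter on a small made-up Swedish snippet containing a “…” quotation; the sample text is illustrative only.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "splitter-quick-check-code",
"metadata": {},
"outputs": [],
"source": [
"# Quick check on a made-up example (the sample text is illustrative only).\n",
"sample = 'Hon sade “Vi har en bra generation. Vi spelar bra tillsammans” efter matchen. Laget vann med 2-1.'\n",
"for s in cut_sentence_with_quotation_marks(sample):\n",
"    print(s)"
]
},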
{
"cell_type": "markdown",
"id": "medical-incident",
"metadata": {},
"source": [
"---\n",
"## **Mini-Challenge ** - integrate the custom sentence splitter into MYpreprocess_data.py\n",
"\n",
"Task : Modify and overwrite `MYpreprocess_data.py` below to incoporate the custom `cut_sentence_with_quotation_marks`\n",
"\n",
"Pass : Successfully run Mypreprocess_data.py with the custom sentence splitter cut_sentence_with_quotation_marks and generate the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files.\n",
"\n",
"Note: the solution will be delivered to you at the end of Lab 2.\n",
"\n",
"---\n",
"Modify the below cell block to overwrite `MYpreprocess_data.py`. \n",
"After modification, Jump to Rerun cell to produce customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files.\n",
"\n",
"Jump to ReRun Cell "
]
},
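{
"cell_type": "markdown",
"id": "mini-challenge-hint-md",
"metadata": {},
"source": [
"A minimal, non-authoritative sketch of one possible wrapper is shown below: it exposes the same `tokenize` interface as `IdentitySplitter`, so an object like this could be assigned to `Encoder.splitter` inside `MYpreprocess_data.py`. It is only a hint, not the solution delivered at the end of Lab 2.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "mini-challenge-hint-code",
"metadata": {},
"outputs": [],
"source": [
"# Sketch only, not the official solution: a wrapper that mirrors IdentitySplitter's\n",
"# interface so that Encoder.splitter.tokenize(text) keeps working in MYpreprocess_data.py.\n",
"class CustomSentenceSplitter(object):\n",
"    def tokenize(self, *text):\n",
"        sentences = []\n",
"        for t in text:\n",
"            sentences.extend(cut_sentence_with_quotation_marks(t))\n",
"        return sentences\n",
"\n",
"# Quick check with a made-up example sentence:\n",
"print(CustomSentenceSplitter().tokenize('Hon sade “Hej där” och log. Sedan gick hon.'))"
]
},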
{
"cell_type": "code",
"execution_count": 2,
"id": "selected-depth",
"metadata": {},
"outputs": [],
"source": [
"%%writefile ./Megatron-LM/tools/MYpreprocess_data.py \n",
"# coding=utf-8\n",
"# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"\n",
"\"\"\"Processing data for pretraining.\"\"\"\n",
"\n",
"import argparse\n",
"import json\n",
"import multiprocessing\n",
"import os\n",
"import sys\n",
"sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),\n",
" os.path.pardir)))\n",
"import time\n",
"\n",
"import torch\n",
"try:\n",
" import nltk\n",
" nltk_available = True\n",
"except ImportError:\n",
" nltk_available = False\n",
"\n",
"from megatron.tokenizer import build_tokenizer\n",
"from megatron.data import indexed_dataset\n",
"\n",
"\n",
"# https://stackoverflow.com/questions/33139531/preserve-empty-lines-with-nltks-punkt-tokenizer\n",
"class CustomLanguageVars(nltk.tokenize.punkt.PunktLanguageVars):\n",
"\n",
" _period_context_fmt = r\"\"\"\n",
" \\S* # some word material\n",
" %(SentEndChars)s # a potential sentence ending\n",
" \\s* # <-- THIS is what I changed\n",
" (?=(?P\n",
" %(NonWord)s # either other punctuation\n",
" |\n",
" (?P\\S+) # <-- Normally you would have \\s+ here\n",
" ))\"\"\"\n",
"\n",
"class IdentitySplitter(object):\n",
" def tokenize(self, *text):\n",
" return text\n",
"\"\"\"[TODO]: modify this class to integrate the custom sentence splitter above \"\"\"\n",
"\n",
"class Encoder(object):\n",
" def __init__(self, args):\n",
" self.args = args\n",
" \n",
" def initializer(self):\n",
" # Use Encoder class as a container for global data\n",
" Encoder.tokenizer = build_tokenizer(self.args)\n",
" if self.args.split_sentences:\n",
" if not nltk_available:\n",
" print(\"NLTK is not available to split sentences.\")\n",
" exit()\n",
" splitter = nltk.load(\"tokenizers/punkt/english.pickle\")\n",
" if self.args.keep_newlines:\n",
" # this prevents punkt from eating newlines after sentences\n",
" Encoder.splitter = nltk.tokenize.punkt.PunktSentenceTokenizer(\n",
" train_text = splitter._params,\n",
" lang_vars = CustomLanguageVars())\n",
" else:\n",
" Encoder.splitter = splitter\n",
"\n",
" else:\n",
" Encoder.splitter = IdentitySplitter()\n",
"\n",
" def encode(self, json_line):\n",
" data = json.loads(json_line)\n",
" ids = {}\n",
" for key in self.args.json_keys:\n",
" text = data[key]\n",
" doc_ids = []\n",
" for sentence in Encoder.splitter.tokenize(text):\n",
" sentence_ids = Encoder.tokenizer.tokenize(sentence)\n",
" if len(sentence_ids) > 0:\n",
" doc_ids.append(sentence_ids)\n",
" if len(doc_ids) > 0 and self.args.append_eod:\n",
" doc_ids[-1].append(Encoder.tokenizer.eod)\n",
" ids[key] = doc_ids\n",
" return ids, len(json_line)\n",
"\n",
"def get_args():\n",
" parser = argparse.ArgumentParser()\n",
" group = parser.add_argument_group(title='input data')\n",
" group.add_argument('--input', type=str, required=True,\n",
" help='Path to input JSON')\n",
" group.add_argument('--json-keys', nargs='+', default=['text'],\n",
" help='space separate listed of keys to extract from json')\n",
" group.add_argument('--split-sentences', action='store_true',\n",
" help='Split documents into sentences.')\n",
" group.add_argument('--keep-newlines', action='store_true',\n",
" help='Keep newlines between sentences when splitting.')\n",
"\n",
" group = parser.add_argument_group(title='tokenizer')\n",
" group.add_argument('--tokenizer-type', type=str, required=True,\n",
" choices=['BertWordPieceLowerCase','BertWordPieceCase',\n",
" 'GPT2BPETokenizer'],\n",
" help='What type of tokenizer to use.')\n",
" group.add_argument('--vocab-file', type=str, default=None,\n",
" help='Path to the vocab file')\n",
" group.add_argument('--merge-file', type=str, default=None,\n",
" help='Path to the BPE merge file (if necessary).')\n",
" group.add_argument('--append-eod', action='store_true',\n",
" help='Append an token to the end of a document.')\n",
"\n",
"\n",
" group = parser.add_argument_group(title='output data')\n",
" group.add_argument('--output-prefix', type=str, required=True,\n",
" help='Path to binary output file without suffix')\n",
" group.add_argument('--dataset-impl', type=str, default='mmap',\n",
" choices=['lazy', 'cached', 'mmap'])\n",
"\n",
" group = parser.add_argument_group(title='runtime')\n",
" group.add_argument('--workers', type=int, default=1,\n",
" help='Number of worker processes to launch')\n",
" group.add_argument('--log-interval', type=int, default=100,\n",
" help='Interval between progress updates')\n",
" args = parser.parse_args()\n",
" args.keep_empty = False\n",
"\n",
" if args.tokenizer_type.lower().startswith('bert'):\n",
" if not args.split_sentences:\n",
" print(\"Bert tokenizer detected, are you sure you don't want to split sentences?\")\n",
"\n",
" # some default/dummy values for the tokenizer\n",
" args.rank = 0\n",
" args.make_vocab_size_divisible_by = 128\n",
" args.tensor_model_parallel_size = 1\n",
" args.vocab_extra_ids = 0\n",
"\n",
" return args\n",
"\n",
"def main():\n",
" args = get_args()\n",
" startup_start = time.time()\n",
"\n",
" print(\"Opening\", args.input)\n",
" fin = open(args.input, 'r', encoding='utf-8')\n",
"\n",
" if nltk_available and args.split_sentences:\n",
" nltk.download(\"punkt\", quiet=True)\n",
"\n",
" encoder = Encoder(args)\n",
" tokenizer = build_tokenizer(args)\n",
" pool = multiprocessing.Pool(args.workers, initializer=encoder.initializer)\n",
" encoded_docs = pool.imap(encoder.encode, fin, 25)\n",
" #encoded_docs = map(encoder.encode, fin)\n",
"\n",
" level = \"document\"\n",
" if args.split_sentences:\n",
" level = \"sentence\"\n",
"\n",
" print(f\"Vocab size: {tokenizer.vocab_size}\")\n",
" print(f\"Output prefix: {args.output_prefix}\")\n",
" output_bin_files = {}\n",
" output_idx_files = {}\n",
" builders = {}\n",
" for key in args.json_keys:\n",
" output_bin_files[key] = \"{}_{}_{}.bin\".format(args.output_prefix,\n",
" key, level)\n",
" output_idx_files[key] = \"{}_{}_{}.idx\".format(args.output_prefix,\n",
" key, level)\n",
" builders[key] = indexed_dataset.make_builder(output_bin_files[key],\n",
" impl=args.dataset_impl,\n",
" vocab_size=tokenizer.vocab_size)\n",
"\n",
" startup_end = time.time()\n",
" proc_start = time.time()\n",
" total_bytes_processed = 0\n",
" print(\"Time to startup:\", startup_end - startup_start)\n",
"\n",
" for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1):\n",
" total_bytes_processed += bytes_processed\n",
" for key, sentences in doc.items():\n",
" if len(sentences) == 0:\n",
" continue\n",
" for sentence in sentences:\n",
" builders[key].add_item(torch.IntTensor(sentence))\n",
" builders[key].end_document()\n",
" if i % args.log_interval == 0:\n",
" current = time.time()\n",
" elapsed = current - proc_start\n",
" mbs = total_bytes_processed/elapsed/1024/1024\n",
" print(f\"Processed {i} documents\",\n",
" f\"({i/elapsed} docs/s, {mbs} MB/s).\",\n",
" file=sys.stderr)\n",
"\n",
" for key in args.json_keys:\n",
" builders[key].finalize(output_idx_files[key])\n",
"\n",
"if __name__ == '__main__':\n",
" main()"
]
},
{
"cell_type": "markdown",
"id": "raised-victim",
"metadata": {},
"source": [
"Below cell block specify all the input parameters in order to run `MYpreprocess_data.py`. \n",
"\n",
"Please do **NOT** modify anything in below cell."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "fluid-dayton",
"metadata": {},
"outputs": [],
"source": [
"INPUT_JSON_FILE='../dataset/SV/webnyheter2013.json'\n",
"OUTPUT_PATH='../dataset/SV/customSentenceSplit'\n",
"VOCAB_FILE='../dataset/SV/32k/vocab.json'\n",
"MERGE_FILE='../dataset/SV/32k/merges.txt'\n",
"NUM_CPUS=16"
]
},
{
"cell_type": "markdown",
"id": "concerned-protest",
"metadata": {},
"source": [
"Below is a ReRun cell block to run `MYpreprocess_data.py` and produce the customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files.\n",
"\n",
"\n",
"\n",
"Go back and modify `MYpreprocess_data.py`, click on this shortcut link to Jump to Modify MYpreprocess_data.py "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "rolled-welcome",
"metadata": {},
"outputs": [],
"source": [
"!python ./Megatron-LM/tools/MYpreprocess_data.py \\\n",
" --input $INPUT_JSON_FILE \\\n",
" --output-prefix $OUTPUT_PATH \\\n",
" --json-keys text \\\n",
" --vocab-file $VOCAB_FILE \\\n",
" --merge-file $MERGE_FILE \\\n",
" --dataset-impl mmap \\\n",
" --tokenizer-type GPT2BPETokenizer \\\n",
" --workers $NUM_CPUS \\\n",
" --append-eod"
]
},
{
"cell_type": "markdown",
"id": "reduced-court",
"metadata": {},
"source": [
"Check whether these two files : customSentenceSplit_text_document.bin and customSentenceSplit_text_document.idx files are successfully generated."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "secondary-stereo",
"metadata": {},
"outputs": [],
"source": [
"! ls ../dataset/SV/"
]
},
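{
"cell_type": "markdown",
"id": "programmatic-file-check-md",
"metadata": {},
"source": [
"If you prefer a programmatic check, the sketch below verifies that both output files exist and are non-empty; the paths assume the default OUTPUT_PATH prefix used above.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "programmatic-file-check-code",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Programmatic check that the mmap files were produced and are non-empty.\n",
"# The paths follow the OUTPUT_PATH prefix used above.\n",
"for suffix in ('bin', 'idx'):\n",
"    path = f'../dataset/SV/customSentenceSplit_text_document.{suffix}'\n",
"    exists = os.path.exists(path)\n",
"    size = os.path.getsize(path) if exists else 0\n",
"    print(f'{path}: exists={exists}, size={size} bytes')"
]
},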
{
"cell_type": "code",
"execution_count": null,
"id": "premier-birth",
"metadata": {},
"outputs": [],
"source": [
"## clean up to free up space\n",
"!rm ./Megatron-LM/tools/MYpreprocess_data.py"
]
},
{
"cell_type": "markdown",
"id": "eastern-ministry",
"metadata": {},
"source": [
"-----\n",
"## HOME NEXT
"
]
},
{
"cell_type": "markdown",
"id": "qualified-admission",
"metadata": {},
"source": [
"-----\n",
"\n",
"\n",
"## Licensing \n",
"\n",
"This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}