{ "cells": [
 {
  "cell_type": "markdown",
  "id": "convertible-whale",
  "metadata": {},
  "source": [
   "# 4_Jsonfy and preprocess to mmap format for optimizing data loading\n",
   "---\n",
   "\n",
   "## Learning Objectives\n",
   "- **The goal of this lab is to:**\n",
   "    - understand the motivation for preprocessing the training data into mmap format\n",
   "    - understand the assumptions made about the raw data\n",
   "    - jsonfy the raw text data into loose JSON format\n",
   "    - use preprocess_data.py to convert the cleaned data into mmap format in preparation for training\n",
   "\n",
   "----------------------------------------------------------\n",
   "### Understand the need for preprocessing to mmap format\n",
   "\n",
   "`np.load` reads an entire array from disk into memory, whereas `np.memmap` only maps the file into the process's address space and fetches data on demand. The timing comparison below shows how much cheaper opening a memory-mapped array is, which is why the training data is preprocessed into a memory-mapped (mmap) format."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 6,
  "id": "surrounded-counter",
  "metadata": {},
  "outputs": [],
  "source": [
   "import numpy as np\n",
   "\n",
   "# Create a random float64 array (1024 x 2048, ~16 MB) and save it to disk as myarr.npy\n",
   "out = np.random.random((1024, 2048))\n",
   "np.save('myarr', out)"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 7,
  "id": "complete-lindsay",
  "metadata": {},
  "outputs": [
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
     "3.07 ms ± 55.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
    ]
   }
  ],
  "source": [
   "%%timeit\n",
   "# np.load reads the whole array from disk into memory on every call\n",
   "out = np.load('myarr.npy')"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 9,
  "id": "conventional-mason",
  "metadata": {},
  "outputs": [
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
     "62 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
    ]
   }
  ],
  "source": [
   "%%timeit\n",
   "# np.memmap only maps the file; no data is read until elements are accessed.\n",
   "# The dtype/shape should match the array saved above (float64, 1024 x 2048);\n",
   "# for .npy files, np.load('myarr.npy', mmap_mode='r') also handles the header offset.\n",
   "array = np.memmap(\"myarr.npy\", mode=\"r\",\n",
   "                  dtype=np.float64, shape=(1024, 2048))"
  ]
 },
 {
  "cell_type": "markdown",
  "id": "functioning-stage",
  "metadata": {},
  "source": [
   "----------------------------------------------------------\n",
   "### The assumptions about the data\n",
   "\n",
   "    - loose JSON: one JSON object per line, one element per document\n",
   "    - the document text is read from the 'text' field by default; other fields can be selected with --json-keys\n",
   "\n",
   "    {\"src\": \"The Internet\", \"text\": \"jumps over the lazy dog\", \"type\": \"Eng\", \"id\": \"42\", \"title\": \"Second Part\"}"
  ]
 },
 {
  "cell_type": "markdown",
  "id": "eastern-habitat",
  "metadata": {},
  "source": [
   "----------------------------------------------------------\n",
   "### Jsonfy the raw text data into loose JSON format\n",
   "\n",
   "    python create_loose_json.py --help\n",
   "    usage: create_loose_json.py [-h] [--infile INFILE] [--outfile OUTFILE]\n",
   "\n",
   "    optional arguments:\n",
   "      -h, --help         show this help message and exit\n",
   "      --infile INFILE    input file path\n",
   "      --outfile OUTFILE  output file path"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 11,
  "id": "finite-marina",
  "metadata": {},
  "outputs": [
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
     "finished processing 74 lines to loose json format\n"
    ]
   }
  ],
  "source": [
   "!python create_loose_json.py --infile ./Megatron-LM/dataset/EN/extractedNVblogs.txt --outfile ./Megatron-LM/dataset/EN/extractedNVblogs.json"
  ]
 },
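 {
  "cell_type": "markdown",
  "id": "loose-json-check-md",
  "metadata": {},
  "source": [
   "Before converting to mmap format, it can help to sanity-check that the file produced above really is loose JSON, i.e. one JSON object per line. The next cell is a minimal, optional sketch of such a check; it assumes the output path used above and that create_loose_json.py stores the document text under the 'text' key."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "id": "loose-json-check",
  "metadata": {},
  "outputs": [],
  "source": [
   "import json\n",
   "\n",
   "# Minimal sanity check (assumes the output path used above):\n",
   "# each line of a loose JSON file should parse on its own and carry a 'text' field.\n",
   "with open('./Megatron-LM/dataset/EN/extractedNVblogs.json') as f:\n",
   "    first_doc = json.loads(f.readline())\n",
   "\n",
   "print(list(first_doc.keys()))\n",
   "print(first_doc['text'][:200])  # preview the first 200 characters of the first document"
  ]
 },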
 {
  "cell_type": "markdown",
  "id": "proof-pakistan",
  "metadata": {},
  "source": [
   "----------------------------------------------------------\n",
   "### Use preprocess_data.py to convert the cleaned data into mmap format in preparation for training\n",
   "\n",
   "Wrap the following into a bash script (for example with the `%%writefile process2mmap.sh` cell magic), filling in the paths to your own data, vocab and merge files:\n",
   "\n",
   "    INPUT_JSON_FILE=path_to_the_json_file\n",
   "    OUTPUT_PATH=path_to_save_the_converted_data_to\n",
   "    VOCAB_FILE=path_to_your_own_pretrained_vocab_file\n",
   "    MERGE_FILE=path_to_your_own_pretrained_merge_file\n",
   "    NUM_CPUS=16\n",
   "\n",
   "    python tools/preprocess_data.py \\\n",
   "        --input $INPUT_JSON_FILE \\\n",
   "        --output-prefix $OUTPUT_PATH \\\n",
   "        --json-keys text \\\n",
   "        --vocab-file $VOCAB_FILE \\\n",
   "        --merge-file $MERGE_FILE \\\n",
   "        --dataset-impl mmap \\\n",
   "        --tokenizer-type GPT2BPETokenizer \\\n",
   "        --workers $NUM_CPUS \\\n",
   "        --append-eod\n",
   "\n",
   "Note: `--append-eod` is very important, do not miss this flag!"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 5,
  "id": "regional-stake",
  "metadata": {},
  "outputs": [],
  "source": [
   "# Adjust these paths to point to your own json, vocab and merge files\n",
   "INPUT_JSON_FILE='../dataset/EN/extractedNVblogs.json'\n",
   "OUTPUT_PATH='../dataset/EN/CustomSentenceSplitter'\n",
   "VOCAB_FILE='../dataset/EN/50k/gpt2-vocab.json'\n",
   "MERGE_FILE='../dataset/EN/50k/gpt2-merges.txt'\n",
   "NUM_CPUS=16"
  ]
 },
 {
  "cell_type": "markdown",
  "id": "similar-commonwealth",
  "metadata": {},
  "source": [
   "---\n",
   "## The output should look similar to the following\n",
   "\n",
   "    Opening ./Megatron-LM/dataset/EN/extractedNVblogs.json\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    Vocab size: 50257\n",
   "    Output prefix: ./Megatron-LM/dataset/EN/NVblogs\n",
   "    Time to startup: 0.5460700988769531\n",
   "     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
   "     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 6,
  "id": "framed-point",
  "metadata": {},
  "outputs": [
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
     "Opening ./Megatron-LM/dataset/EN/extractedNVblogs.json\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "Vocab size: 50257\n",
     "Output prefix: ./Megatron-LM/dataset/EN/NVblogs\n",
     "Time to startup: 0.5460700988769531\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n"
    ]
   }
  ],
  "source": [
   "!python ./Megatron-LM/tools/preprocess_data.py \\\n",
   "    --input $INPUT_JSON_FILE \\\n",
   "    --output-prefix $OUTPUT_PATH \\\n",
   "    --json-keys text \\\n",
   "    --vocab-file $VOCAB_FILE \\\n",
   "    --merge-file $MERGE_FILE \\\n",
   "    --dataset-impl mmap \\\n",
   "    --tokenizer-type GPT2BPETokenizer \\\n",
   "    --workers $NUM_CPUS \\\n",
   "    --append-eod"
  ]
 },
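 {
  "cell_type": "markdown",
  "id": "verify-mmap-output-md",
  "metadata": {},
  "source": [
   "If preprocessing succeeded, the output prefix should now be accompanied by a .bin (token data) / .idx (index) file pair that the Megatron-LM data loader memory-maps at training time. The next cell is a minimal, optional sketch that simply lists whatever files were produced for the chosen output prefix; the exact file names depend on the --json-keys setting."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "id": "verify-mmap-output",
  "metadata": {},
  "outputs": [],
  "source": [
   "import glob\n",
   "import os\n",
   "\n",
   "# List the files preprocess_data.py produced for the chosen output prefix;\n",
   "# a .bin/.idx pair per json key is expected for the mmap dataset format.\n",
   "for path in sorted(glob.glob(OUTPUT_PATH + '*')):\n",
   "    print(path, os.path.getsize(path), 'bytes')"
  ]
 },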
 {
  "cell_type": "markdown",
  "id": "endless-vietnamese",
  "metadata": {},
  "source": [
   "---\n",
   "## Up Next:\n",
   "\n",
   "[Observe GPT runs vs performance](./Day2-5_Observe_GPT_runs_vs_performance.ipynb)\n",
   "\n",
   "## Back to Start Menu\n",
   "[Start menu](../Start_Here.ipynb)"
  ]
 },
 {
  "cell_type": "markdown",
  "id": "married-necklace",
  "metadata": {},
  "source": [
   "-----\n",
   "\n",
   "## Licensing\n",
   "\n",
   "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
  ]
 }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}