{ "cells": [ { "cell_type": "markdown", "id": "dependent-chemistry", "metadata": {}, "source": [ "# \n", "\n", "# 4_Jsonfy and preprocess to mmap format for optimizing data loading\n", "---\n", "\n", "## Learning Objectives\n", "- **The goal of this lab is to:**\n", " - motivation : understand the need for preprocessing to mmap format\n", " - the assumptions about the data \n", " - jsonfy the raw text data into loose json format\n", " - use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training\n", "\n", "----------------------------------------------------------\n", "### Understand the need for preprocessing to mmap format- \n" ] }, { "cell_type": "code", "execution_count": 1, "id": "square-louisville", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "out=np.random.random((1024,2048))\n", "np.save('myarr',out)" ] }, { "cell_type": "code", "execution_count": 2, "id": "human-appliance", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.84 ms ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%%timeit \n", "out=np.load('myarr.npy')" ] }, { "cell_type": "code", "execution_count": 3, "id": "heard-baseball", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "43 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" ] } ], "source": [ "%%timeit\n", "array = np.memmap(\"myarr.npy\", mode=\"r\",\n", " dtype=np.int16, shape=(1024, 1024))" ] }, { "cell_type": "code", "execution_count": 4, "id": "dynamic-nudist", "metadata": {}, "outputs": [], "source": [ "## clean up\n", "!rm myarr.npy" ] }, { "cell_type": "markdown", "id": "soviet-jumping", "metadata": {}, "source": [ "----------------------------------------------------------\n", "### the assumptions about the data -\n", " one element per document \n", " text in the 'text' field by default ,can be modified to extract other fields\n", " {\"src\": \"The Internet\", \"text\": \"jumps over the lazy dog\", \"type\": \"Eng\", \"id\": \"42\", \"title\": \"Second Part\"}\n" ] }, { "cell_type": "markdown", "id": "acting-covering", "metadata": {}, "source": [ "----------------------------------------------------------\n", "### jsonfy the raw text data into loose json format -\n", " python create_loose_json.py --help\n", " usage: create_loose_json.py [-h] [--infile INFILE] [--outfile OUTFILE]\n", "\n", " optional arguments:\n", " -h, --help show this help message and exit\n", " --infile INFILE input file path\n", " --outfile OUTFILE output file path" ] }, { "cell_type": "code", "execution_count": 5, "id": "organic-malaysia", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "finished processing 71 lines to loose json format\n" ] } ], "source": [ "!python create_loose_json.py --infile ../dataset/EN/extractedNVblogs.txt --outfile ../dataset/EN/extractedNVblogs.json" ] }, { "cell_type": "markdown", "id": "rubber-absolute", "metadata": {}, "source": [ "----------------------------------------------------------\n", "### use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training-\n", "\n", "wrap the following into a bash script :\n", "\n", " %% writefile process2mmap.sh \n", "\n", " INPUT_JSON_FILE=path_to_the_json_file\n", "\n", " OUTPUT_PATH=path_to_save_the_converted_data_to\n", "\n", " VOCAB_FILE=path_to_your_own_pretrained_vocab_file\n", "\n", " MERGE_FILE=path_to_your_own_pretrained_merge_file\n", "\n", " 
NUM_CPUS=16\n", "\n", "\n", " python tools/preprocess_data.py \\\n", " --input INPUT_JSON_FILE \\\n", " --output-prefix OUTPUT_PATH \\\n", " --json-keys text \\\n", " --vocab-file VOCAB_FILE \\\n", " --merge-file MERGE_FILE \\\n", " --dataset-impl mmap \\\n", " --tokenizer-type GPT2BPETokenizer \\\n", " --workers NUM_CPUS \\\n", " --append-eod <--- very important, do not miss this flag !\n" ] }, { "cell_type": "markdown", "id": "lined-transfer", "metadata": {}, "source": [ "----------------------------------------------------------\n", "se preprocess_data.py to convert the cleaned data into mmap format as a preparation for training-\n", "\n", "wrap the following into a bash script :\n", "\n", " %% writefile process2mmap.sh \n", "\n", " INPUT_JSON_FILE=path_to_the_json_file\n", "\n", " OUTPUT_PATH=path_to_save_the_converted_data_to\n", "\n", " VOCAB_FILE=path_to_your_own_pretrained_vocab_file\n", "\n", " MERGE_FILE=path_to_your_own_pretrained_merge_file\n", "\n", " NUM_CPUS=16\n", "\n", "\n", " python tools/preprocess_data.py \\\n", " --input INPUT_JSON_FILE \\\n", " --output-prefix OUTPUT_PATH \\\n", " --json-keys text \\\n", " --vocab-file VOCAB_FILE \\\n", " --merge-file MERGE_FILE \\\n", " --dataset-impl mmap \\\n", " --tokenizer-type GPT2BPETokenizer \\\n", " --workers NUM_CPUS \\\n", " --append-eod <--- very important, do not miss this flag !\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "marked-midnight", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gpt2-merges.txt gpt2-vocab.json\n" ] } ], "source": [ "!mv gpt2-vocab.json ../dataset/EN/50k/\n", "!mv gpt2-merges.txt ../dataset/EN/50k/\n", "!ls ../dataset/EN/50k/" ] }, { "cell_type": "code", "execution_count": 9, "id": "adjustable-hammer", "metadata": {}, "outputs": [], "source": [ "INPUT_JSON_FILE='../dataset/EN/extractedNVblogs.json'\n", "OUTPUT_PATH='../dataset/EN/NVblog'\n", "VOCAB_FILE='../dataset/EN/50k/gpt2-vocab.json'\n", "MERGE_FILE='../dataset/EN/50k/gpt2-merges.txt'\n", "NUM_CPUS=16" ] }, { "cell_type": "markdown", "id": "hidden-patrick", "metadata": {}, "source": [ "---\n", "## OUTPUT should looks similar to the following \n", "\n", " Opening ./Megatron-LM/dataset/EN/extractedNVblogs.json\n", " > building GPT2BPETokenizer tokenizer ...\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " Vocab size: 50257\n", " Output prefix: ./Megatron-LM/dataset/EN/NVblogs\n", " Time to startup: 0.5460700988769531\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)" ] }, { "cell_type": "code", "execution_count": 10, "id": "professional-lawyer", "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "Opening ../dataset/EN/extractedNVblogs.json\n", "> building GPT2BPETokenizer tokenizer ...\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "Vocab size: 50257\n", "Output prefix: ../dataset/EN/NVblog\n", "Time to startup: 0.1618051528930664\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n" ] } ], "source": [ "!python ./Megatron-LM/tools/preprocess_data.py \\\n", " --input $INPUT_JSON_FILE \\\n", " --output-prefix $OUTPUT_PATH \\\n", " --json-keys text \\\n", " --vocab-file $VOCAB_FILE \\\n", " --merge-file $MERGE_FILE \\\n", " --dataset-impl mmap \\\n", " --tokenizer-type GPT2BPETokenizer \\\n", " --workers $NUM_CPUS \\\n", " --append-eod" ] }, { "cell_type": "markdown", "id": "valuable-equilibrium", "metadata": {}, "source": [ "---\n", "## Up Next : \n", "\n", "[Observe_GPT_runs_vs_performance ](./Day2-5_Observe_GPT_runs_vs_performance.ipynb)\n", "\n", "## Back To Start Menu\n", "[start menu](../Start_Here.ipynb)" ] }, { "cell_type": "markdown", "id": "accurate-drinking", "metadata": {}, "source": [ "-----\n", "\n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). 
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }