{ "cells": [ { "cell_type": "markdown", "id": "dependent-chemistry", "metadata": {}, "source": [ "# \n", "\n", "# 4_Jsonfy and preprocess to mmap format for optimizing data loading\n", "---\n", "\n", "## Learning Objectives\n", "- **The goal of this lab is to:**\n", " - motivation : understand the need for preprocessing to mmap format\n", " - the assumptions about the data \n", " - jsonfy the raw text data into loose json format\n", " - use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training\n", "\n", "----------------------------------------------------------\n", "### Understand the need for preprocessing to mmap format- \n" ] }, { "cell_type": "code", "execution_count": 1, "id": "square-louisville", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "out=np.random.random((1024,2048))\n", "np.save('myarr',out)" ] }, { "cell_type": "code", "execution_count": 2, "id": "human-appliance", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.84 ms ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%%timeit \n", "out=np.load('myarr.npy')" ] }, { "cell_type": "code", "execution_count": 3, "id": "heard-baseball", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "43 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" ] } ], "source": [ "%%timeit\n", "array = np.memmap(\"myarr.npy\", mode=\"r\",\n", " dtype=np.int16, shape=(1024, 1024))" ] }, { "cell_type": "code", "execution_count": 4, "id": "dynamic-nudist", "metadata": {}, "outputs": [], "source": [ "## clean up\n", "!rm myarr.npy" ] }, { "cell_type": "markdown", "id": "soviet-jumping", "metadata": {}, "source": [ "----------------------------------------------------------\n", "### the assumptions about the data -\n", " one element per document \n", " text in the 'text' field by default ,can be modified to extract other fields\n", " {\"src\": \"The Internet\", \"text\": \"jumps over the lazy dog\", \"type\": \"Eng\", \"id\": \"42\", \"title\": \"Second Part\"}\n" ] }, { "cell_type": "markdown", "id": "acting-covering", "metadata": {}, "source": [ "----------------------------------------------------------\n", "### jsonfy the raw text data into loose json format -\n", " python create_loose_json.py --help\n", " usage: create_loose_json.py [-h] [--infile INFILE] [--outfile OUTFILE]\n", "\n", " optional arguments:\n", " -h, --help show this help message and exit\n", " --infile INFILE input file path\n", " --outfile OUTFILE output file path" ] }, { "cell_type": "code", "execution_count": 5, "id": "organic-malaysia", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "finished processing 71 lines to loose json format\n" ] } ], "source": [ "!python create_loose_json.py --infile ../dataset/EN/extractedNVblogs.txt --outfile ../dataset/EN/extractedNVblogs.json" ] }, { "cell_type": "markdown", "id": "rubber-absolute", "metadata": {}, "source": [ "----------------------------------------------------------\n", "### use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training-\n", "\n", "wrap the following into a bash script :\n", "\n", " %% writefile process2mmap.sh \n", "\n", " INPUT_JSON_FILE=path_to_the_json_file\n", "\n", " OUTPUT_PATH=path_to_save_the_converted_data_to\n", "\n", " VOCAB_FILE=path_to_your_own_pretrained_vocab_file\n", "\n", " MERGE_FILE=path_to_your_own_pretrained_merge_file\n", "\n", " 
NUM_CPUS=16\n", "\n", "\n", " python tools/preprocess_data.py \\\n", " --input INPUT_JSON_FILE \\\n", " --output-prefix OUTPUT_PATH \\\n", " --json-keys text \\\n", " --vocab-file VOCAB_FILE \\\n", " --merge-file MERGE_FILE \\\n", " --dataset-impl mmap \\\n", " --tokenizer-type GPT2BPETokenizer \\\n", " --workers NUM_CPUS \\\n", " --append-eod <--- very important, do not miss this flag !\n" ] }, { "cell_type": "markdown", "id": "lined-transfer", "metadata": {}, "source": [ "----------------------------------------------------------\n", "se preprocess_data.py to convert the cleaned data into mmap format as a preparation for training-\n", "\n", "wrap the following into a bash script :\n", "\n", " %% writefile process2mmap.sh \n", "\n", " INPUT_JSON_FILE=path_to_the_json_file\n", "\n", " OUTPUT_PATH=path_to_save_the_converted_data_to\n", "\n", " VOCAB_FILE=path_to_your_own_pretrained_vocab_file\n", "\n", " MERGE_FILE=path_to_your_own_pretrained_merge_file\n", "\n", " NUM_CPUS=16\n", "\n", "\n", " python tools/preprocess_data.py \\\n", " --input INPUT_JSON_FILE \\\n", " --output-prefix OUTPUT_PATH \\\n", " --json-keys text \\\n", " --vocab-file VOCAB_FILE \\\n", " --merge-file MERGE_FILE \\\n", " --dataset-impl mmap \\\n", " --tokenizer-type GPT2BPETokenizer \\\n", " --workers NUM_CPUS \\\n", " --append-eod <--- very important, do not miss this flag !\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "marked-midnight", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gpt2-merges.txt gpt2-vocab.json\n" ] } ], "source": [ "!mv gpt2-vocab.json ../dataset/EN/50k/\n", "!mv gpt2-merges.txt ../dataset/EN/50k/\n", "!ls ../dataset/EN/50k/" ] }, { "cell_type": "code", "execution_count": 9, "id": "adjustable-hammer", "metadata": {}, "outputs": [], "source": [ "INPUT_JSON_FILE='../dataset/EN/extractedNVblogs.json'\n", "OUTPUT_PATH='../dataset/EN/NVblog'\n", "VOCAB_FILE='../dataset/EN/50k/gpt2-vocab.json'\n", "MERGE_FILE='../dataset/EN/50k/gpt2-merges.txt'\n", "NUM_CPUS=16" ] }, { "cell_type": "markdown", "id": "hidden-patrick", "metadata": {}, "source": [ "---\n", "## OUTPUT should looks similar to the following \n", "\n", " Opening ./Megatron-LM/dataset/EN/extractedNVblogs.json\n", " > building GPT2BPETokenizer tokenizer ...\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " Vocab size: 50257\n", " Output prefix: ./Megatron-LM/dataset/EN/NVblogs\n", " Time to startup: 0.5460700988769531\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)" ] }, { "cell_type": "code", "execution_count": 10, "id": "professional-lawyer", "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "Opening ../dataset/EN/extractedNVblogs.json\n", "> building GPT2BPETokenizer tokenizer ...\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "Vocab size: 50257\n", "Output prefix: ../dataset/EN/NVblog\n", "Time to startup: 0.1618051528930664\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n" ] } ], "source": [ "!python ./Megatron-LM/tools/preprocess_data.py \\\n", " --input $INPUT_JSON_FILE \\\n", " --output-prefix $OUTPUT_PATH \\\n", " --json-keys text \\\n", " --vocab-file $VOCAB_FILE \\\n", " --merge-file $MERGE_FILE \\\n", " --dataset-impl mmap \\\n", " --tokenizer-type GPT2BPETokenizer \\\n", " --workers $NUM_CPUS \\\n", " --append-eod" ] }, { "cell_type": "markdown", "id": "valuable-equilibrium", "metadata": {}, "source": [ "---\n", "## Up Next : \n", "\n", "[Observe_GPT_runs_vs_performance ](./Day2-5_Observe_GPT_runs_vs_performance.ipynb)\n", "\n", "## Back To Start Menu\n", "[start menu](../Start_Here.ipynb)" ] }, { "cell_type": "markdown", "id": "accurate-drinking", "metadata": {}, "source": [ "-----\n", "\n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). 
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }