{ "cells": [
 {
  "cell_type": "markdown",
  "id": "convertible-whale",
  "metadata": {},
  "source": [
   "# 4_Jsonfy and preprocess to mmap format for optimizing data loading\n",
   "---\n",
   "\n",
   "## Learning Objectives\n",
   "- **The goal of this lab is to:**\n",
   "    - understand the motivation for preprocessing the training data into mmap format\n",
   "    - understand the assumptions made about the raw data\n",
   "    - jsonfy the raw text data into loose JSON format\n",
   "    - use preprocess_data.py to convert the cleaned data into mmap format in preparation for training\n",
   "\n",
   "----------------------------------------------------------\n",
   "### Understand the need for preprocessing to mmap format\n",
   "\n",
   "`np.load` reads an entire array from disk into memory, whereas `np.memmap` only maps the file into the process's address space and fetches data on demand. The timing comparison below shows how much cheaper opening a memory-mapped array is, which is why the training data is preprocessed into a memory-mapped (mmap) format."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 6,
  "id": "surrounded-counter",
  "metadata": {},
  "outputs": [],
  "source": [
   "import numpy as np\n",
   "\n",
   "# Create a random float64 array (1024 x 2048, ~16 MB) and save it to disk as myarr.npy\n",
   "out = np.random.random((1024, 2048))\n",
   "np.save('myarr', out)"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 7,
  "id": "complete-lindsay",
  "metadata": {},
  "outputs": [
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
     "3.07 ms ± 55.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
    ]
   }
  ],
  "source": [
   "%%timeit\n",
   "# np.load reads the whole array from disk into memory on every call\n",
   "out = np.load('myarr.npy')"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 9,
  "id": "conventional-mason",
  "metadata": {},
  "outputs": [
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
     "62 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
    ]
   }
  ],
  "source": [
   "%%timeit\n",
   "# np.memmap only maps the file; no data is read until elements are accessed.\n",
   "# The dtype/shape should match the array saved above (float64, 1024 x 2048);\n",
   "# for .npy files, np.load('myarr.npy', mmap_mode='r') also handles the header offset.\n",
   "array = np.memmap(\"myarr.npy\", mode=\"r\",\n",
   "                  dtype=np.float64, shape=(1024, 2048))"
  ]
 },
 {
  "cell_type": "markdown",
  "id": "functioning-stage",
  "metadata": {},
  "source": [
   "----------------------------------------------------------\n",
   "### The assumptions about the data\n",
   "\n",
   "    - loose JSON: one JSON object per line, one element per document\n",
   "    - the document text is read from the 'text' field by default; other fields can be selected with --json-keys\n",
   "\n",
   "    {\"src\": \"The Internet\", \"text\": \"jumps over the lazy dog\", \"type\": \"Eng\", \"id\": \"42\", \"title\": \"Second Part\"}"
  ]
 },
 {
  "cell_type": "markdown",
  "id": "eastern-habitat",
  "metadata": {},
  "source": [
   "----------------------------------------------------------\n",
   "### Jsonfy the raw text data into loose JSON format\n",
   "\n",
   "    python create_loose_json.py --help\n",
   "    usage: create_loose_json.py [-h] [--infile INFILE] [--outfile OUTFILE]\n",
   "\n",
   "    optional arguments:\n",
   "      -h, --help         show this help message and exit\n",
   "      --infile INFILE    input file path\n",
   "      --outfile OUTFILE  output file path"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 11,
  "id": "finite-marina",
  "metadata": {},
  "outputs": [
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
     "finished processing 74 lines to loose json format\n"
    ]
   }
  ],
  "source": [
   "!python create_loose_json.py --infile ./Megatron-LM/dataset/EN/extractedNVblogs.txt --outfile ./Megatron-LM/dataset/EN/extractedNVblogs.json"
  ]
 },
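 {
  "cell_type": "markdown",
  "id": "loose-json-check-md",
  "metadata": {},
  "source": [
   "Before converting to mmap format, it can help to sanity-check that the file produced above really is loose JSON, i.e. one JSON object per line. The next cell is a minimal, optional sketch of such a check; it assumes the output path used above and that create_loose_json.py stores the document text under the 'text' key."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "id": "loose-json-check",
  "metadata": {},
  "outputs": [],
  "source": [
   "import json\n",
   "\n",
   "# Minimal sanity check (assumes the output path used above):\n",
   "# each line of a loose JSON file should parse on its own and carry a 'text' field.\n",
   "with open('./Megatron-LM/dataset/EN/extractedNVblogs.json') as f:\n",
   "    first_doc = json.loads(f.readline())\n",
   "\n",
   "print(list(first_doc.keys()))\n",
   "print(first_doc['text'][:200])  # preview the first 200 characters of the first document"
  ]
 },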
 {
  "cell_type": "markdown",
  "id": "proof-pakistan",
  "metadata": {},
  "source": [
   "----------------------------------------------------------\n",
   "### Use preprocess_data.py to convert the cleaned data into mmap format in preparation for training\n",
   "\n",
   "Wrap the following into a bash script (for example with the `%%writefile process2mmap.sh` cell magic), filling in the paths to your own data, vocab and merge files:\n",
   "\n",
   "    INPUT_JSON_FILE=path_to_the_json_file\n",
   "    OUTPUT_PATH=path_to_save_the_converted_data_to\n",
   "    VOCAB_FILE=path_to_your_own_pretrained_vocab_file\n",
   "    MERGE_FILE=path_to_your_own_pretrained_merge_file\n",
   "    NUM_CPUS=16\n",
   "\n",
   "    python tools/preprocess_data.py \\\n",
   "        --input $INPUT_JSON_FILE \\\n",
   "        --output-prefix $OUTPUT_PATH \\\n",
   "        --json-keys text \\\n",
   "        --vocab-file $VOCAB_FILE \\\n",
   "        --merge-file $MERGE_FILE \\\n",
   "        --dataset-impl mmap \\\n",
   "        --tokenizer-type GPT2BPETokenizer \\\n",
   "        --workers $NUM_CPUS \\\n",
   "        --append-eod\n",
   "\n",
   "Note: `--append-eod` is very important, do not miss this flag!"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 5,
  "id": "regional-stake",
  "metadata": {},
  "outputs": [],
  "source": [
   "# Adjust these paths to point to your own json, vocab and merge files\n",
   "INPUT_JSON_FILE='../dataset/EN/extractedNVblogs.json'\n",
   "OUTPUT_PATH='../dataset/EN/CustomSentenceSplitter'\n",
   "VOCAB_FILE='../dataset/EN/50k/gpt2-vocab.json'\n",
   "MERGE_FILE='../dataset/EN/50k/gpt2-merges.txt'\n",
   "NUM_CPUS=16"
  ]
 },
 {
  "cell_type": "markdown",
  "id": "similar-commonwealth",
  "metadata": {},
  "source": [
   "---\n",
   "## The output should look similar to the following\n",
   "\n",
   "    Opening ./Megatron-LM/dataset/EN/extractedNVblogs.json\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    > building GPT2BPETokenizer tokenizer ...\n",
   "    Vocab size: 50257\n",
   "    Output prefix: ./Megatron-LM/dataset/EN/NVblogs\n",
   "    Time to startup: 0.5460700988769531\n",
   "     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
   "     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 6,
  "id": "framed-point",
  "metadata": {},
  "outputs": [
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
     "Opening ./Megatron-LM/dataset/EN/extractedNVblogs.json\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "> building GPT2BPETokenizer tokenizer ...\n",
     "Vocab size: 50257\n",
     "Output prefix: ./Megatron-LM/dataset/EN/NVblogs\n",
     "Time to startup: 0.5460700988769531\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n",
     " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n"
    ]
   }
  ],
  "source": [
   "!python ./Megatron-LM/tools/preprocess_data.py \\\n",
   "    --input $INPUT_JSON_FILE \\\n",
   "    --output-prefix $OUTPUT_PATH \\\n",
   "    --json-keys text \\\n",
   "    --vocab-file $VOCAB_FILE \\\n",
   "    --merge-file $MERGE_FILE \\\n",
   "    --dataset-impl mmap \\\n",
   "    --tokenizer-type GPT2BPETokenizer \\\n",
   "    --workers $NUM_CPUS \\\n",
   "    --append-eod"
  ]
 },
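 {
  "cell_type": "markdown",
  "id": "verify-mmap-output-md",
  "metadata": {},
  "source": [
   "If preprocessing succeeded, the output prefix should now be accompanied by a .bin (token data) / .idx (index) file pair that the Megatron-LM data loader memory-maps at training time. The next cell is a minimal, optional sketch that simply lists whatever files were produced for the chosen output prefix; the exact file names depend on the --json-keys setting."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "id": "verify-mmap-output",
  "metadata": {},
  "outputs": [],
  "source": [
   "import glob\n",
   "import os\n",
   "\n",
   "# List the files preprocess_data.py produced for the chosen output prefix;\n",
   "# a .bin/.idx pair per json key is expected for the mmap dataset format.\n",
   "for path in sorted(glob.glob(OUTPUT_PATH + '*')):\n",
   "    print(path, os.path.getsize(path), 'bytes')"
  ]
 },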
 {
  "cell_type": "markdown",
  "id": "endless-vietnamese",
  "metadata": {},
  "source": [
   "---\n",
   "## Up Next:\n",
   "\n",
   "[Observe GPT runs vs performance](./Day2-5_Observe_GPT_runs_vs_performance.ipynb)\n",
   "\n",
   "## Back to Start Menu\n",
   "[Start menu](../Start_Here.ipynb)"
  ]
 },
 {
  "cell_type": "markdown",
  "id": "married-necklace",
  "metadata": {},
  "source": [
   "-----\n",
   "\n",
   "## Licensing\n",
   "\n",
   "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0)."
  ]
 }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}