{ "cells": [ { "cell_type": "markdown", "id": "aggressive-madness", "metadata": {}, "source": [ "## Jsonfy + convert to mmap\n", "---\n", "\n", "## Learning Objectives\n", "\n", "The goal of this lab is to convert the raw data to Megatron-LM's raw text data to mmap format.\n", "\n", "In particular, we will cover the following steps :\n", "\n", " 1. Understand the need for preprocessing data to mmap format.\n", " 2. Convert the raw text data into loose json format.\n", " 3. Use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training.\n" ] }, { "cell_type": "markdown", "id": "alternative-supply", "metadata": {}, "source": [ "1. Understand the need for preprocessing data to mmap format.\n", "\n", "The below cell blocks will demonstrate the increased speed-up of using `np.memmap` instead of `np.load` to load arbitrary data.\n", "The `np.memmap` is integrated into preprocess_data.py. " ] }, { "cell_type": "code", "execution_count": 1, "id": "instructional-tunnel", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "out=np.random.random((1024,2048))\n", "np.save('myarr',out)" ] }, { "cell_type": "code", "execution_count": 2, "id": "brave-barcelona", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.84 ms ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%%timeit \n", "out=np.load('myarr.npy')" ] }, { "cell_type": "code", "execution_count": 3, "id": "warming-hardwood", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "43 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" ] } ], "source": [ "%%timeit\n", "array = np.memmap(\"myarr.npy\", mode=\"r\",\n", " dtype=np.int16, shape=(1024, 1024))" ] }, { "cell_type": "code", "execution_count": 4, "id": "documentary-pointer", "metadata": {}, "outputs": [], "source": [ "## clean up\n", "!rm myarr.npy" ] }, { "cell_type": "markdown", "id": "institutional-shape", "metadata": {}, "source": [ "2. jsonfy the raw text data into loose json format.\n", "\n", "The preprocess_data.py is expecting to receive json format data. Hence we need to convert the raw text data to json format first.\n", "It is assumed that the json format data, will have one element per document, and the value of the 'text' field in the json data will be extracted in preprocess_data.py. Other fields can also be specified for extraction. \n", "An example of how the json data should look is showing in the following: \n", "\n", " {\"src\": \"The Internet\", \"text\": \"jumps over the lazy dog\", \"type\": \"Eng\", \"id\": \"42\", \"title\": \"Second Part\"}\n" ] }, { "cell_type": "markdown", "id": "frozen-vinyl", "metadata": {}, "source": [ "We will now use the following python script to convert the raw text data into `extractedNVblogs.json` format as a preparation for the next step. \n", "\n", "\n", " python create_loose_json.py --help\n", " usage: create_loose_json.py [-h] [--infile INFILE] [--outfile OUTFILE]\n", "\n", " optional arguments:\n", " -h, --help show this help message and exit\n", " --infile INFILE input file path\n", " --outfile OUTFILE output file path" ] }, { "cell_type": "code", "execution_count": 5, "id": "surrounded-permit", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "finished processing 71 lines to loose json format\n" ] } ], "source": [ "!python create_loose_json.py --infile ../dataset/EN/extractedNVblogs.txt --outfile ../dataset/EN/extractedNVblogs.json" ] }, { "cell_type": "markdown", "id": "limiting-window", "metadata": {}, "source": [ "3. Use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training.\n", "\n", "We are now ready to feed `extractedNVblogs.json` data to Megatron-LM's preprocess_data.py in order to convert the data to mmap format.\n", "\n", "The following two code blocks will convert the `extractedNVblogs.json` to `NVblog_text_document.bin` and `NVblog_text_document.idx`" ] }, { "cell_type": "code", "execution_count": 9, "id": "numerical-tuner", "metadata": {}, "outputs": [], "source": [ "INPUT_JSON_FILE='../dataset/EN/extractedNVblogs.json'\n", "OUTPUT_PATH='../dataset/EN/NVblog'\n", "VOCAB_FILE='../dataset/EN/50k/gpt2-vocab.json'\n", "MERGE_FILE='../dataset/EN/50k/gpt2-merges.txt'\n", "NUM_CPUS=16" ] }, { "cell_type": "code", "execution_count": 10, "id": "inner-match", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Opening ../dataset/EN/extractedNVblogs.json\n", "> building GPT2BPETokenizer tokenizer ...\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "> building GPT2BPETokenizer tokenizer ...\n", "Vocab size: 50257\n", "Output prefix: ../dataset/EN/NVblog\n", "Time to startup: 0.1618051528930664\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n" ] } ], "source": [ "!python ./Megatron-LM/tools/preprocess_data.py \\\n", " --input $INPUT_JSON_FILE \\\n", " --output-prefix $OUTPUT_PATH \\\n", " --json-keys text \\\n", " --vocab-file $VOCAB_FILE \\\n", " --merge-file $MERGE_FILE \\\n", " --dataset-impl mmap \\\n", " --tokenizer-type GPT2BPETokenizer \\\n", " --workers $NUM_CPUS \\\n", " --append-eod" ] }, { "cell_type": "markdown", "id": "grand-demand", "metadata": {}, "source": [ "Below is the expected outputs :\n", "\n", " Opening ./Megatron-LM/dataset/EN/extractedNVblogs.json\n", " > building GPT2BPETokenizer tokenizer ...\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " > building GPT2BPETokenizer tokenizer ...\n", " Vocab size: 50257\n", " Output prefix: ./Megatron-LM/dataset/EN/NVblogs\n", " Time to startup: 0.5460700988769531\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)\n", " > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)" ] }, { "cell_type": "markdown", "id": "aware-insert", "metadata": {}, "source": [ "---\n", "\n", "## Links and Resources\n", "Don't forget to [Read More on MMAP](https://docs.python.org/3/library/mmap.html) and examine the [indexed_dataset builder](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/data/indexed_dataset.py#L407).\n" ] }, { "cell_type": "markdown", "id": "educational-ministry", "metadata": {}, "source": [ "-----\n", "##
\n" ] }, { "cell_type": "markdown", "id": "genetic-gamma", "metadata": {}, "source": [ "-----\n", "\n", "\n", "## Licensing \n", "\n", "This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }