## Jsonify + convert to mmap
---

## Learning Objectives

The goal of this lab is to convert raw text data into the mmap format used by Megatron-LM.

In particular, we will cover the following steps:

    1. Understand the need for preprocessing data into mmap format.
    2. Convert the raw text data into loose JSON format.
    3. Use preprocess_data.py to convert the cleaned data into mmap format in preparation for training.




1. Understand the need for preprocessing data into mmap format.

In [1]:
import numpy as np
out=np.random.random((1024,2048))
np.save('myarr',out)

In [2]:
%%timeit 
out=np.load('myarr.npy')

3.84 ms ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [3]:
%%timeit
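# np.memmap only maps the file; nothing is read from disk until elements are accessed.
# The dtype/shape here differ from the saved float64 (1024, 2048) array, so this times the open only.
array = np.memmap("myarr.npy", mode="r",
                  dtype=np.int16, shape=(1024, 1024))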

43 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [4]:
## clean up
!rm myarr.npy
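
The timings above differ by roughly two orders of magnitude because `np.load` reads the whole array into memory, while a memory map only sets up a view of the file and defers reading until elements are accessed. As a minimal sketch (not part of the original lab), the snippet below uses `np.load(..., mmap_mode="r")`, which parses the `.npy` header so the mapped array keeps its saved shape and dtype. The helper name `demo_mmap_load` and the temporary file location are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


def demo_mmap_load():
    """Save an array, reopen it as a read-only memory map, and read one row."""
    arr = np.random.random((1024, 2048))
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "myarr.npy")
        np.save(path, arr)

        # mmap_mode="r" returns a np.memmap; the .npy header supplies shape and dtype.
        mapped = np.load(path, mmap_mode="r")
        is_memmap = isinstance(mapped, np.memmap)
        shape, dtype = mapped.shape, str(mapped.dtype)

        # Only the pages backing row 10 need to be read here; copy them into RAM.
        row = np.array(mapped[10])
        same = np.array_equal(row, arr[10])

        # Drop the mapping before the temporary directory is removed.
        del mapped
    return is_memmap, shape, dtype, same


print(demo_mmap_load())
```

This lazy access pattern is what makes mmap attractive for training data: a data loader can index into a very large file without loading the whole file first.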

2. Convert the raw text data into loose JSON format.

preprocess_data.py expects to receive data in JSON format, so we first need to convert the raw text data to JSON.
The JSON data is assumed to have one element per document, and the value of the 'text' field is what preprocess_data.py extracts. Other fields can also be specified for extraction.
An example of how the JSON data should look is given below:

    {"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}
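
To make the "one element per document" layout concrete, here is a small sketch that reads loose JSON (one JSON object per line) and pulls out a single field. It only illustrates the idea and is not the code used by preprocess_data.py; the function name `extract_field` is hypothetical.

```python
import json


def extract_field(loose_json_lines, key="text"):
    """Return the value of `key` from every non-empty line of loose JSON."""
    values = []
    for lineno, line in enumerate(loose_json_lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between documents
        doc = json.loads(line)  # each line is an independent JSON object
        if key not in doc:
            raise KeyError(f"line {lineno}: missing {key!r} field")
        values.append(doc[key])
    return values


sample = [
    '{"src": "The Internet", "text": "jumps over the lazy dog", '
    '"type": "Eng", "id": "42", "title": "Second Part"}'
]
print(extract_field(sample))  # ['jumps over the lazy dog']
```

Running a check like this on the converted file before tokenization can catch malformed lines early.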


We will now use the following Python script to convert the raw text data into loose JSON format, written to `extractedNVblogs.json`, in preparation for the next step.


    python create_loose_json.py --help
        usage: create_loose_json.py [-h] [--infile INFILE] [--outfile OUTFILE]

        optional arguments:
          -h, --help         show this help message and exit
          --infile INFILE    input file path
          --outfile OUTFILE  output file path

In [5]:
!python create_loose_json.py --infile ../dataset/EN/extractedNVblogs.txt --outfile ../dataset/EN/extractedNVblogs.json

finished processing 71 lines to loose json format
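
The source of create_loose_json.py is not shown in this lab, so the following is only a guess at the kind of conversion it performs. It assumes that every non-empty line of the input file is one document and that only a 'text' field is written; the function names are hypothetical.

```python
import json


def lines_to_loose_json(lines):
    """Turn an iterable of raw text lines into loose JSON records (one per line)."""
    records = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # assumption: blank lines do not count as documents
        records.append(json.dumps({"text": text}, ensure_ascii=False))
    return records


def convert_file(infile, outfile):
    """Read raw text from `infile` and write loose JSON to `outfile`."""
    with open(infile, encoding="utf-8") as fin:
        records = lines_to_loose_json(fin)
    with open(outfile, "w", encoding="utf-8") as fout:
        for record in records:
            fout.write(record + "\n")
    return len(records)
```

With this sketch, `convert_file("in.txt", "out.json")` would return the number of documents written, similar in spirit to the line count reported by the script above.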


3. Use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training.

We are now ready to feed the `extractedNVblogs.json` data to Megatron-LM's preprocess_data.py to convert it to mmap format.

The following two code blocks convert `extractedNVblogs.json` to `NVblog_text_document.bin` and `NVblog_text_document.idx`.

In [9]:
INPUT_JSON_FILE='../dataset/EN/extractedNVblogs.json'
OUTPUT_PATH='../dataset/EN/NVblog'
VOCAB_FILE='../dataset/EN/50k/gpt2-vocab.json'
MERGE_FILE='../dataset/EN/50k/gpt2-merges.txt'
NUM_CPUS=16

In [10]:
!python ./Megatron-LM/tools/preprocess_data.py \
                       --input $INPUT_JSON_FILE \
                       --output-prefix $OUTPUT_PATH \
                       --json-keys text \
                       --vocab-file $VOCAB_FILE \
                       --merge-file $MERGE_FILE \
                       --dataset-impl mmap \
                       --tokenizer-type GPT2BPETokenizer \
                       --workers $NUM_CPUS \
                       --append-eod

Opening ../dataset/EN/extractedNVblogs.json
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
Vocab size: 50257
Output prefix: ../dataset/EN/NVblog
Time to startup: 0.1618051528930664
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
 > padded voca

The expected output is shown below:

                    Opening ./Megatron-LM/dataset/EN/extractedNVblogs.json
                    > building GPT2BPETokenizer tokenizer ...
                     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    Vocab size: 50257
                    Output prefix: ./Megatron-LM/dataset/EN/NVblogs
                    Time to startup: 0.5460700988769531
                     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
                     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
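
Once the command finishes, a quick way to confirm that preprocessing succeeded is to check that the `.bin` and `.idx` files were written. The sketch below derives the file names from the output prefix and JSON key, following the `NVblog_text_document` naming stated in this lab. The helper names are hypothetical, and the naming pattern is an assumption inferred from those two file names.

```python
from pathlib import Path


def expected_outputs(output_prefix, json_key="text"):
    """Build the .bin/.idx paths assumed to follow <prefix>_<key>_document.*"""
    base = f"{output_prefix}_{json_key}_document"
    return [Path(base + ".bin"), Path(base + ".idx")]


def report_outputs(output_prefix, json_key="text"):
    """Print the size of each expected output file, or 'missing' if absent."""
    for path in expected_outputs(output_prefix, json_key):
        status = f"{path.stat().st_size} bytes" if path.exists() else "missing"
        print(f"{path}: {status}")


report_outputs("../dataset/EN/NVblog")
```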

---

## Links and Resources
Don't forget to [read more on mmap](https://docs.python.org/3/library/mmap.html).


-----
## <p style="text-align:center;border:3px; padding: 1em"> <a href=../Start_Here.ipynb>HOME</a>&nbsp; &nbsp; &nbsp; <a href=./Lab1-6_Observe_GPT_runs_vs_performance.ipynb>NEXT</a></p>


-----


## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). 