# 4_Jsonfy and preprocess to mmap format for optimizing data loading
---

## Learning Objectives
- **The goals of this lab are to:**
    - understand the motivation for preprocessing data to mmap format
    - learn the assumptions made about the data
    - jsonfy the raw text data into loose JSON format
    - use preprocess_data.py to convert the cleaned data into mmap format in preparation for training

----------------------------------------------------------
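### Understand the need for preprocessing to mmap format

A memory-mapped file is read from disk on demand, so opening it costs far less than loading the whole array into memory. The cells below time a full `np.load` against a memory-mapped open of the same file.
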
In [6]:
import numpy as np
out=np.random.random((1024,2048))
np.save('myarr',out)

In [7]:
%%timeit 
out=np.load('myarr.npy')

3.07 ms ± 55.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [9]:
%%timeit
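# mmap_mode="r" memory-maps the file and parses the .npy header,
# so the dtype (float64) and shape (1024, 2048) match what was saved.
array = np.load("myarr.npy", mmap_mode="r")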

62 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
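
The timings above can be complemented by a correctness check. The following is a minimal sketch, assuming a temporary file and a smaller array than the one used in the lab (the file name `demo_arr.npy` and the sizes are illustrative only). It shows that `np.load(..., mmap_mode="r")` returns an `np.memmap` with the saved dtype and shape, and that indexing it yields the same values as a full load.

```python
import os
import tempfile

import numpy as np

# Illustrative array; the lab itself uses a (1024, 2048) float64 array.
data = np.random.random((256, 512))

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo_arr.npy")  # hypothetical file name
np.save(path, data)

# Full load: reads the whole file into memory.
full = np.load(path)

# Memory-mapped load: only the header is parsed up front;
# array pages are read from disk when they are indexed.
mapped = np.load(path, mmap_mode="r")

# Copy two rows out of the mapped file into a regular in-memory array.
rows = np.array(mapped[10:12])

print(type(mapped).__name__, mapped.dtype, mapped.shape)
print(np.array_equal(rows, full[10:12]))
```

Because the mapped array is opened read-only, it is a reasonable fit for training data that is written once and then read many times.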


----------------------------------------------------------
### The assumptions about the data
- one JSON object per document
- the text is in the `text` field by default; this can be modified to extract other fields

Example record:

    {"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}
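
The assumptions above can be sketched in a few lines of Python. This is only an illustration: the helper name `extract_field` is hypothetical, and the second record is made up to show a file with more than one document.

```python
import json

# Two example records in loose JSON form: one JSON object per line.
# The first is the record shown above; the second is made up for illustration.
loose_json = "\n".join([
    json.dumps({"src": "The Internet", "text": "jumps over the lazy dog",
                "type": "Eng", "id": "42", "title": "Second Part"}),
    json.dumps({"src": "The Internet", "text": "The quick brown fox",
                "type": "Eng", "id": "41", "title": "First Part"}),
])


def extract_field(lines, key="text"):
    """Return the value of `key` from each non-empty JSON line."""
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        values.append(record[key])
    return values


texts = extract_field(loose_json.splitlines())
titles = extract_field(loose_json.splitlines(), key="title")
print(texts)
print(titles)
```

Passing a different `key` mirrors the idea that the field to extract can be changed from the default `text`.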


----------------------------------------------------------
### Jsonfy the raw text data into loose JSON format
    python create_loose_json.py --help
        usage: create_loose_json.py [-h] [--infile INFILE] [--outfile OUTFILE]

        optional arguments:
          -h, --help         show this help message and exit
          --infile INFILE    input file path
          --outfile OUTFILE  output file path

In [11]:
!python create_loose_json.py --infile ./Megatron-LM/dataset/EN/extractedNVblogs.txt --outfile ./Megatron-LM/dataset/EN/extractedNVblogs.json

finished processing 74 lines to loose json format
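
The source of `create_loose_json.py` is not shown here, so the following is only a sketch of what such a conversion might look like. It assumes each non-empty input line is one document and that the text goes in the `text` field. The function name `to_loose_json` and the file names are hypothetical.

```python
import json
import os
import tempfile


def to_loose_json(infile, outfile, key="text"):
    """Write one JSON object per non-empty input line; return the count."""
    count = 0
    with open(infile, encoding="utf-8") as fin, \
         open(outfile, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # blank lines are not treated as documents
            fout.write(json.dumps({key: line}) + "\n")
            count += 1
    return count


# Build a tiny made-up input file to exercise the function.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "raw.txt")
dst = os.path.join(tmp_dir, "raw.json")
with open(src, "w", encoding="utf-8") as f:
    f.write("first document\n\nsecond document\n")

n = to_loose_json(src, dst)
print(f"wrote {n} records")
```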


----------------------------------------------------------
### Use preprocess_data.py to convert the cleaned data into mmap format as a preparation for training

Wrap the following into a bash script:

            %%writefile process2mmap.sh

            INPUT_JSON_FILE=path_to_the_json_file
            OUTPUT_PATH=path_to_save_the_converted_data_to
            VOCAB_FILE=path_to_your_own_pretrained_vocab_file
            MERGE_FILE=path_to_your_own_pretrained_merge_file
            NUM_CPUS=16

            python tools/preprocess_data.py \
                       --input $INPUT_JSON_FILE \
                       --output-prefix $OUTPUT_PATH \
                       --json-keys text \
                       --vocab-file $VOCAB_FILE \
                       --merge-file $MERGE_FILE \
                       --dataset-impl mmap \
                       --tokenizer-type GPT2BPETokenizer \
                       --workers $NUM_CPUS \
                       --append-eod   # very important, do not miss this flag!
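
To see why `--append-eod` matters, consider what happens when tokenized documents are concatenated into one stream. The sketch below is a toy illustration, not the implementation in `preprocess_data.py`. The helper `pack_documents` and the token ids of the two documents are made up; `50256` is the id of `<|endoftext|>` in the GPT-2 vocabulary.

```python
EOD_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary


def pack_documents(docs, append_eod=True):
    """Concatenate token-id lists, optionally ending each with EOD_ID."""
    stream = []
    for ids in docs:
        stream.extend(ids)
        if append_eod:
            stream.append(EOD_ID)
    return stream


# Made-up token ids for two short documents.
docs = [[11, 12, 13], [21, 22]]

with_eod = pack_documents(docs, append_eod=True)
without_eod = pack_documents(docs, append_eod=False)
print(with_eod)
print(without_eod)
```

Without the end-of-document token, the second stream gives no indication of where one document stops and the next begins.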




In [5]:
INPUT_JSON_FILE='../dataset/EN/extractedNVblogs.json'
OUTPUT_PATH='../dataset/EN/CustomSentenceSplitter'
VOCAB_FILE='../dataset/EN/50k/gpt2-vocab.json'
MERGE_FILE='../dataset/EN/50k/gpt2-merges.txt'
NUM_CPUS=16


---
## The output should look similar to the following

                    Opening ./Megatron-LM/dataset/EN/extractedNVblogs.json
                    > building GPT2BPETokenizer tokenizer ...
                     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    > building GPT2BPETokenizer tokenizer ...
                    Vocab size: 50257
                    Output prefix: ./Megatron-LM/dataset/EN/NVblogs
                    Time to startup: 0.5460700988769531
                     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
                     > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)

In [6]:
!python ./Megatron-LM/tools/preprocess_data.py \
                       --input $INPUT_JSON_FILE \
                       --output-prefix $OUTPUT_PATH \
                       --json-keys text \
                       --vocab-file $VOCAB_FILE \
                       --merge-file $MERGE_FILE \
                       --dataset-impl mmap \
                       --tokenizer-type GPT2BPETokenizer \
                       --workers $NUM_CPUS \
                       --append-eod

Opening ./Megatron-LM/dataset/EN/extractedNVblogs.json
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
> building GPT2BPETokenizer tokenizer ...
Vocab size: 50257
Output prefix: ./Megatron-LM/dataset/EN/NVblogs
Time to startup: 0.5460700988769531
 > padded vocab (size: 50257) with 47 dummy tokens (new size
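
After the run finishes, it can be useful to confirm that the mmap files were written. The sketch below assumes the naming convention `<output-prefix>_<json-key>_document.bin` and `.idx`; that convention is an assumption here and may differ between Megatron-LM versions, so verify it against your checkout. The helper `expected_outputs` is hypothetical.

```python
import os


def expected_outputs(output_prefix, json_key="text", level="document"):
    """Build the assumed .bin/.idx file names for a given output prefix."""
    base = f"{output_prefix}_{json_key}_{level}"
    return [base + ".bin", base + ".idx"]


# Same prefix as OUTPUT_PATH in the cell above.
paths = expected_outputs("../dataset/EN/CustomSentenceSplitter")
for p in paths:
    status = "found" if os.path.exists(p) else "missing"
    print(f"{p}: {status}")
```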

---
## Up Next:

[Observe_GPT_runs_vs_performance](./Day2-5_Observe_GPT_runs_vs_performance.ipynb)

## Back To Start Menu
[start menu](../Start_Here.ipynb)

-----


## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). 