
# End-to-End Training of Neural Retrievers for Open-Domain Question Answering

Below we present the steps to run unsupervised and supervised training and evaluation of the retriever for open-domain question answering.

## Retriever Training

### Unsupervised pretraining

1. Use `tools/preprocess_data.py` to preprocess the dataset for the Inverse Cloze Task (ICT), which we call unsupervised pretraining. This script takes as input a corpus in loose JSON format and creates fixed-size blocks of text as the fundamental units of data. For a corpus like Wikipedia, this means multiple sentences per block and multiple blocks per document. Run the script with the `--split-sentences` argument to make sentences the basic unit; it constructs one indexed dataset per key in `--json-keys`. We construct two datasets, one with the title of every document and another with the body. (A sketch of the expected input format follows the command.)

   ```bash
   python tools/preprocess_data.py \
       --input /path/to/corpus.json \
       --json-keys text title \
       --split-sentences \
       --tokenizer-type BertWordPieceLowerCase \
       --vocab-file /path/to/vocab.txt \
       --output-prefix corpus_indexed \
       --workers 10
   ```
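   The loose JSON format is, roughly, one JSON object per line, with the keys named in `--json-keys` (here `text` and `title`). Below is a minimal sketch of writing such a file; the document contents are placeholders, not real corpus data.

   ```python
   # Illustrative only: write a tiny corpus in the loose JSON format used above.
   # One JSON object per line; the "text" and "title" keys match --json-keys.
   import json

   docs = [
       {"title": "Sample Article", "text": "First sentence of the body. Second sentence."},
       {"title": "Another Article", "text": "Each line is a standalone JSON object."},
   ]

   with open("corpus.json", "w") as f:
       for doc in docs:
           f.write(json.dumps(doc) + "\n")
   ```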
2. The `examples/pretrain_ict.sh` script trains a 217M-parameter biencoder model for ICT retrieval on a single GPU. Single-GPU training is primarily intended for debugging, as the code is developed for distributed training. The script uses a pretrained BERT model, and we use a total batch size of 4096 for ICT training. (A sketch of how ICT forms query/context pairs follows.)
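   As a rough illustration of the Inverse Cloze Task itself (a hypothetical sketch, not this repository's implementation): one sentence of a block becomes the query, and the remaining sentences become the context. In the original ICT formulation the query sentence is occasionally left in the context so the model also learns lexical matching; the 10% keep rate below comes from that formulation and is an assumption here.

   ```python
   # Hypothetical sketch of ICT example construction; not this repository's code.
   import random

   def make_ict_example(block_sentences, keep_prob=0.1):
       """Split a block of sentences into a (query, context) pair.

       The query is one randomly chosen sentence; the context is the rest
       of the block. With probability keep_prob the query sentence stays
       in the context, so the model cannot rely purely on its removal.
       """
       idx = random.randrange(len(block_sentences))
       query = block_sentences[idx]
       if random.random() < keep_prob:
           context = list(block_sentences)  # keep the query sentence
       else:
           context = block_sentences[:idx] + block_sentences[idx + 1:]
       return query, " ".join(context)

   # Toy usage.
   block = ["The moon orbits the Earth.", "It has no atmosphere.", "Tides follow its pull."]
   query, context = make_ict_example(block)
   ```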

3. Evaluate the pretrained ICT model on Google's Natural Questions Open dataset using `examples/evaluate_retriever_nq.sh`. (A sketch of the top-k accuracy metric follows.)
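   Retriever quality on Natural Questions is conventionally reported as top-k accuracy: the fraction of questions for which at least one of the k highest-scoring retrieved blocks contains an answer. A minimal sketch of that metric, with illustrative names not taken from `evaluate_orqa.py`:

   ```python
   # Illustrative top-k retrieval accuracy; names are hypothetical.
   def top_k_accuracy(ranked_hits, k):
       """ranked_hits: one list of booleans per question, ordered by
       retriever score; True means the block contains an answer."""
       correct = sum(1 for hits in ranked_hits if any(hits[:k]))
       return correct / len(ranked_hits)

   # Example: 3 questions with their ranked hit lists.
   ranked_hits = [
       [False, True, False],   # answered at rank 2
       [False, False, False],  # never answered
       [True, False, False],   # answered at rank 1
   ]
   print(top_k_accuracy(ranked_hits, k=1))  # ~0.33
   print(top_k_accuracy(ranked_hits, k=2))  # ~0.67
   ```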

### Supervised finetuning

1. Use the above pretrained ICT model as the starting point for finetuning on Google's Natural Questions Open dataset. The script `examples/finetune_retriever_distributed.sh` provides an example of how to perform this training. Our finetuning process adds retriever score scaling and longer training (80 epochs) on top of DPR training. (A sketch of the score-scaled objective follows.)
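   As a rough sketch of the training objective (assuming the DPR-style in-batch negatives setup; dividing scores by the square root of the embedding dimension is our reading of "retriever score scaling", and all names below are illustrative):

   ```python
   # Hypothetical sketch of DPR-style in-batch training with score scaling;
   # not this repository's implementation.
   import math
   import torch
   import torch.nn.functional as F

   def in_batch_retriever_loss(query_emb, context_emb):
       """query_emb, context_emb: [batch, dim] biencoder embeddings.

       Row i of the score matrix treats context i as the positive and all
       other in-batch contexts as negatives. Scores are divided by
       sqrt(dim), analogous to scaled dot-product attention.
       """
       dim = query_emb.size(-1)
       scores = query_emb @ context_emb.t() / math.sqrt(dim)  # [batch, batch]
       targets = torch.arange(scores.size(0), device=scores.device)
       return F.cross_entropy(scores, targets)

   # Toy usage: batch of 4 random 128-d embeddings.
   loss = in_batch_retriever_loss(torch.randn(4, 128), torch.randn(4, 128))
   ```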

  2. Evaluate the finetuned model using the same evaluation script as mentioned above for the unsupervised model.

More details on the retriever are available in our paper.

## Reader Training

The reader component will be available soon.