
Contrastive Multi-Document Question Generation

This repo contains complete, reproducible code and the dataset for the arXiv paper https://arxiv.org/abs/1911.03047, accepted at EACL 2021. It builds on the Hugging Face Transformers and OpenAI GPT-2 repositories.

In the paper, we propose a novel generating coordinator model that leverages reinforcement learning with signals from multiple documents. We also develop a principled contrastive-learning-based regularization to promote the specificity of generated questions.

Recommended Environment

  • Linux Ubuntu 18.04
  • GPU with at least 12 GB of memory

Both training and evaluation require loading the GPT-2 generator block model, the Transformer-based coordinator model, and the pre-trained ranker model from which we derive reinforcement learning signals.

Model training was done on 8 NVIDIA Tesla V100 GPUs in parallel. We recommend running our codebase with multiple GPUs.
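
If you want to confirm that your machine meets these requirements, a quick PyTorch check of the visible GPUs and their memory is enough (a convenience snippet, not part of the repository's code):

import torch

# Print each visible GPU and its total memory; the setup reported in the paper
# used 8 Tesla V100s, and a single GPU should have roughly 12 GB or more.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")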

Dependencies

Setup

git clone https://github.com/woonsangcho/contrast_qgen

For pre-processing and constructing challenging negative samples, first download the raw MS-MARCO Conversational Search data and follow the preprocessing code. For convenience, you can download the dataset here and place it under your $DATA_PATH/. We randomly split the publicly available MS-MARCO-QA dataset into train/dev/test sets. Because the full collection is large, we randomly sampled a subset of the dev set to expedite training.
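
The linked preprocessing code is the authoritative reference; as a rough, hypothetical illustration of the split described above (not the repository's script), a random train/dev/test split with a subsampled dev set could look like the following, where the input file name, split ratios, and dev subset size are placeholders:

import json
import random

random.seed(0)  # fix the split for reproducibility

# Placeholder input: one MS-MARCO-QA example per line in a JSON-lines file.
with open("msmarco_qa.jsonl") as f:
    examples = [json.loads(line) for line in f]

random.shuffle(examples)
n = len(examples)
train = examples[: int(0.9 * n)]
dev = examples[int(0.9 * n) : int(0.95 * n)]
test = examples[int(0.95 * n) :]

# Subsample the dev set to speed up validation during training.
dev_small = random.sample(dev, min(1000, len(dev)))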

Download the pre-trained ranker, converted into PyTorch for convenience, here.

Download pre-indexed Lucene files here for ranking, and here for the retrieval-based baselines.

1. Fine-tuning of the pre-trained GPT-2 generator block model on the MS-MARCO domain

Download a pre-trained GPT-2 model (small) from this link.

Download the public MS-MARCO dataset, formatted for our codebase, here.

python src/train_gpt2_distributed.py --config $CONFIG_PATH

$CONFIG_PATH contains the path to the model configuration file: config_file/config_domain_tune.json. Modify your data file paths under $DATA_PATH.

If you would like to train the generator block from scratch rather than from the pre-trained GPT-2 model, append --init_checkpoint 'None'. However, we observed that fine-tuning a pre-trained GPT-2 model yields a better generator block (based on the validation set). To bypass this step for your convenience, you can download our fine-tuned GPT-2 model here.
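
For example, assuming the flag is simply appended to the same training command as above:

python src/train_gpt2_distributed.py --config $CONFIG_PATH --init_checkpoint 'None'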

2. Training the coordinator using RL and Set-induced Contrastive Regularization (SCR)

First, build and install PyLucene following the commands here.
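
As a quick sanity check that PyLucene is installed and can open one of the pre-built indexes above, something along these lines should run (the index path is a placeholder, and the exact import layout may vary with your PyLucene/Lucene version; this is not the repository's ranking code):

import lucene
from java.nio.file import Paths
from org.apache.lucene.index import DirectoryReader
from org.apache.lucene.store import FSDirectory

lucene.initVM()  # start the embedded JVM once per process

# Open a pre-built Lucene index directory and report how many documents it holds.
directory = FSDirectory.open(Paths.get("/path/to/lucene_index"))
reader = DirectoryReader.open(directory)
print("indexed documents:", reader.numDocs())
reader.close()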

We distributed the training across 8 GPUs using the following command.

python -m torch.distributed.launch --nproc_per_node=8 src/train_gpt2_distributed_rl.py --config $CONFIG_PATH

$CONFIG_PATH contains the path to the model configuration file: config_file/config_coordinator_rl.json. This contains the default configuration for training the full model. Modify the arguments to fit your environment. For other parameter options, see comments.

On 8 NVIDIA Tesla V100 GPUs, training takes about 2 days to complete. A pre-trained coordinator can be downloaded here.

The coordinator with default configuration has 14,501,377 parameters.
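
To verify this count for a loaded checkpoint, a generic PyTorch helper suffices (count_parameters is our own illustrative name, not a function from the repository; pass it whatever nn.Module your loading code returns for the coordinator):

import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Total number of parameters in the module; the coordinator with the
    # default configuration should report 14,501,377.
    return sum(p.numel() for p in model.parameters())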

3. Evaluating generated questions via automatic metrics

python src/evaluate_coordinator.py --config config_file/config_domain_tune.json --coordinator_model <path-to-the-coordinator-model>

python src/evaluate_coordinator_embedding.py --config config_file/config_domain_tune.json --coordinator_model <path-to-the-coordinator-model>

Contact

Please email all inquiries to Woon Sang Cho at: woonsang at princeton.edu.

Disclaimer

This repository aims to promote further research in multi-document question generation. The source code provided here contains the research pipeline, including the modeling code needed to produce a model weight file, as well as the generation code. The repository can be adapted to users' own data to generate outputs. We are not responsible for any generation resulting from third-party use of the shared files, including the pre-trained models or the generation code.

Citation

For citation, please use the following bibtex entry:

@article{cho2020contrastqgen,
  title={Contrastive Multi-Document Question Generation},
  author={Cho, Woon Sang and Zhang, Yizhe and Rao, Sudha and Celikyilmaz, Asli and Xiong, Chenyan and Gao, Jianfeng and Wang, Mengdi and Dolan, Bill},
  year={2020},
  eprint={1911.03047},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
