Tell Me How to Ask Again: Question Data Augmentation with Controllable Rewriting in Continuous Space

We provide a PyTorch implementation of the following paper:

Tell Me How to Ask Again: Question Data Augmentation with Controllable Rewriting in Continuous Space. Dayiheng Liu, Yeyun Gong, Jie Fu, Yu Yan, Jiusheng Chen, Jiancheng Lv, Nan Duan, and Ming Zhou. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020). [paper]

Prerequisites

  • Python 3.6
  • TensorFlow 1.10.0+
  • PyTorch 1.3.0+
  • nltk 3.3+
  • CUDA 9.0

Please install the Hugging Face Transformers library locally as follows:

cd pytorch-transformers-master
python setup.py install

Datasets

Download the SQuAD 2.0 dataset files (train-v2.0.json and dev-v2.0.json) here.

The Transformer autoencoder can be trained with the questions in train-v2.0.json.
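Collecting those questions means walking the nested SQuAD JSON layout (data → paragraphs → qas). The helper below is a sketch of that preprocessing step, not the repository's actual code; the sample dict only mirrors the train-v2.0.json structure:

```python
import json  # in practice: squad = json.load(open("train-v2.0.json"))

def extract_questions(squad):
    """Walk the SQuAD 2.0 layout: data -> paragraphs -> qas -> question."""
    questions = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                questions.append(qa["question"])
    return questions

# Tiny hand-made sample mirroring the train-v2.0.json structure.
sample = {"data": [{"paragraphs": [{"qas": [
    {"question": "When was the university founded?", "is_impossible": False},
    {"question": "Who founded the moon?", "is_impossible": True},
]}]}]}

print(extract_questions(sample))
# → ['When was the university founded?', 'Who founded the moon?']
```

Note that SQuAD 2.0 questions carry an `is_impossible` flag, which distinguishes the answerable and unanswerable examples that CRQDA rewrites between.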

Alternatively, the Transformer autoencoder can be pretrained on our collected 2M-question corpus, which contains about 2M questions drawn from the training sets of several MRC and QA datasets: SQuAD 2.0, Natural Questions, NewsQA, QuAC, TriviaQA, CoQA, HotpotQA, DuoRC, and MS MARCO. This corpus can be downloaded here.

In addition, the Transformer autoencoder can be pretrained on the large-scale English Wikipedia and BookCorpus corpora; please refer to here for download and preprocessing instructions. Afterwards, you will obtain a text file wikicorpus_en_one_article_per_line.txt for Transformer autoencoder pre-training.
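Assuming the file really does hold one article per line, as its name suggests, a minimal streaming reader for the pretraining text might look like this (a sketch, not the repository's data loader):

```python
import os
import tempfile

def iter_articles(path):
    """Yield one non-empty article per line from the pretraining text file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

# Demo on a throwaway file standing in for wikicorpus_en_one_article_per_line.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("First article text.\n\nSecond article text.\n")
    demo_path = tmp.name

articles = list(iter_articles(demo_path))
os.remove(demo_path)
print(articles)
# → ['First article text.', 'Second article text.']
```

Streaming line by line avoids loading the multi-gigabyte corpus into memory at once.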

CRQDA

Model Training

Pre-trained Language Model based MRC Model

We adopt the Hugging Face BERT (BertForQuestionAnswering) and RoBERTa (RobertaForQuestionAnswering) models as the SQuAD 2.0 MRC models.
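For intuition, these MRC heads emit start and end logits per token, and SQuAD 2.0 prediction compares the best answer span against a "no answer" score taken at the [CLS] position (index 0). The pure-Python sketch below illustrates that decision rule under simplifying assumptions; the real models' thresholding details differ:

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return the highest-scoring (start, end) span with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best, best_score

def predict(start_logits, end_logits, null_threshold=0.0):
    """SQuAD 2.0 style: answer span, or None if the null score wins."""
    null_score = start_logits[0] + end_logits[0]  # index 0 = [CLS]
    (s, e), span_score = best_span(start_logits[1:], end_logits[1:])
    if null_score - span_score > null_threshold:
        return None  # predicted unanswerable
    return (s + 1, e + 1)  # shift past the [CLS] position

print(predict([0.1, 5.0, 0.2], [0.1, 0.3, 4.0]))  # → (1, 2): answerable
print(predict([9.0, 0.1, 0.2], [9.0, 0.1, 0.2]))  # → None: unanswerable
```

This null-versus-span comparison is exactly the behavior CRQDA exploits: it perturbs a question's latent code until the MRC model's prediction flips between answerable and unanswerable.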

We provide a well-trained RoBERTa SQuAD 2.0 MRC model; its checkpoint can be downloaded here.

Transformer-based Autoencoder

Before training the Transformer-based Autoencoder, please put the checkpoint files of the well-trained RoBERTa SQuAD 2.0 MRC model into the default directory crqda/data/mrc_model, and put wikicorpus_en_one_article_per_line.txt (or another dataset, such as the 2M-question corpus) into the default directory crqda/data/.

Then train the Transformer-based Autoencoder with this script:

cd crqda
./run_train.sh

The Transformer-based Autoencoder will be saved at data/ae_models.

Rewriting Question with Gradient-based Optimization

To rewrite the question and obtain the augmented dataset, please run this script:

cd crqda
python inference.py \
--OS_ID 0 \
--GAP 33000 \
--NEG \
--ae_model_path 'data/ae_models/pytorch_model.bin'

Set --NEG to generate unanswerable questions, or --para to generate answerable questions. Since the rewriting process is slow, we provide a manual parallel rewriting scheme: OS_ID indicates which GPU should be used for the rewriting job, and GAP is the number of original training samples to rewrite on that GPU.
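Under that scheme, each GPU worker handles a contiguous block of the training set. The helper below shows our reading of how OS_ID and GAP partition the data; it is an assumption about inference.py's sharding, not its verbatim logic:

```python
def shard_bounds(os_id, gap, total):
    """Half-open index range [start, end) of samples for GPU worker os_id.

    os_id: which parallel worker/GPU this process is (0-based).
    gap:   number of original training samples per worker.
    total: size of the original training set.
    """
    start = os_id * gap
    end = min(start + gap, total)
    return start, end

# Example: four workers with GAP=33000 over a hypothetical 100,000 samples.
for os_id in range(4):
    print(os_id, shard_bounds(os_id, 33000, 100000))
# worker 0 → (0, 33000), worker 1 → (33000, 66000),
# worker 2 → (66000, 99000), worker 3 → (99000, 100000)
```

To rewrite the whole training set, launch one inference.py process per GPU with consecutive OS_ID values until the shards cover every sample.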

Here we provide a SQuAD 2.0 augmented dataset that contains the original SQuAD 2.0 training pairs plus unanswerable question pairs generated by CRQDA. It can be downloaded here.

Finetuning MRC model with Augmented Dataset

After question data augmentation with CRQDA, we can fine-tune the BERT-large model on the augmented dataset with the script:

cd pytorch-transformers-master/examples
./run_fine_tune_bert_with_crqda.sh

You may obtain results like:

"best_exact": 80.56093657879222,
"best_f1": 83.3359726931614,
"exact": 80.03032089615093,
"f1": 82.97608915068454

Other Baselines

We also provide implementations of the baselines, including EDA, Back-Translation, and Text-VAE, which can be found in baselines/EDA, baselines/Style-Transfer-Through-Back-Translation, and baselines/Mu-Forcing-VRAE, respectively.

Citation

@inproceedings{liu2020crqda,
    title = "Tell Me How to Ask Again: Question Data Augmentation with Controllable Rewriting in Continuous Space",
    author = "Liu, Dayiheng and Gong, Yeyun and Fu, Jie and Yan, Yu and Chen, Jiusheng and Lv, Jiancheng and Duan, Nan and Zhou, Ming",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2020"
}
