MQAG: Multiple-choice Question Answering and Generation for Assessing Information Consistency

Python Package

*Update (23/03/2023): The MQAG implementation has been added to the selfcheckgpt package. To use MQAG via this package:

pip install selfcheckgpt

Example usage (full example in this Jupyter notebook)

from selfcheckgpt.modeling_mqag import MQAG
mqag_model = MQAG()

# Usage 1: MQAG-score [Generation + Answering]
# `summary` (candidate) and `document` (reference) are plain text strings
# returns a dict of statistical distances between answer distributions, e.g. KL-divergence, counting, Hellinger distance, total variation
score = mqag_model.score(candidate=summary, reference=document, num_questions=5, verbose=True)

# Usage 2: MQAG-generate (Multiple-choice Question Generation)
questions = mqag_model.generate(context=context, do_sample=True, num_questions=3)
for i, question_item in enumerate(questions):
    print("------------------------------------")
    print(f"Q{i}: {question_item['question']}")
    print(f"A: {question_item['options'][0]}")
    print(f"B: {question_item['options'][1]}")
    print(f"C: {question_item['options'][2]}")
    print(f"D: {question_item['options'][3]}")
    
# Usage 3: MQAG-answer (Multiple-choice Question Answering)
# `question` is a string; `options` is a list of answer option strings
questions = [{'question': question, 'options': options}]
probs = mqag_model.answer(questions=questions, context=context)
print(probs[0])

MQAG (this repository)

Running MQAG Inference

python inference_mqag.py takes the arguments below. The source and summary files contain text in a one-document-per-line format (see examples/...; a short illustration of this format follows the argument list).

  • source_path: path to source (text) file
  • summary_path: path to summary (text) file
  • mqag_variant: mqag_src | mqag_sum
  • num_samples: number of questions to be drawn
  • generation_model1_path: path to Question+Answer Gen (t5-large)
  • generation_model2_path: path to Distractor Gen (t5-large)
  • generation_model_type: e.g. t5-large
  • answering_model_path: path to Answering model (longformer)
  • answering_model_type: e.g. longformer
  • use_gpu: True | False (whether or not to use GPU if available)
  • verbose: True | False (whether or not to print information)
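
For example, a paired source file and summary file might look as follows (toy one-line texts for illustration; the actual files are under examples/, and line i of the summary file is assessed against line i of the source file):

examples/0_source.txt (one full source document per line):

    First source article, written out in full on a single line ...
    Second source article, written out in full on a single line ...

examples/0_summary.txt (line i is the candidate summary of line i in the source file):

    One-line summary of the first article ...
    One-line summary of the second article ...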

Example usage:

python inference_mqag.py \
    --source_path=examples/0_source.txt \
    --summary_path=examples/0_summary.txt \
    --mqag_variant=mqag_sum \
    --num_samples=10 \
    --generation_model1_path=model_weights/t5-large-generation-Race-QuestionAnswer.pt \
    --generation_model2_path=model_weights/t5-large-generation-Race-Distractor.pt \
    --answering_model_path=model_weights/longformer-large-4096-Race-Answering.pt \
    --use_gpu=True \
    --verbose=True

Example Output:

[document=1/2, multiple-choice question=1/10]
Question: Two security guards have been threatened during a robbery at a _.
(1) bank
(2) securityguard
(3) school
(4) mosque
prob_sum_y = 0.995091	0.004669	0.000126	0.000114
prob_doc_x = 0.981052	0.017671	0.000337	0.000941
prob_nocontext = 0.716027	0.043105	0.061257	0.179611
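
The three rows are the answering model's probability distributions over the four options when conditioned on the summary (prob_sum_y), on the source document (prob_doc_x), and on no context at all (prob_nocontext). Purely as an illustration (the exact per-question aggregation and answerability handling are described in the paper and implemented in this repository), a bounded distance such as total variation can be computed between the first two distributions for each question:

import numpy as np

# answer distributions for the example question above
prob_sum_y = np.array([0.995091, 0.004669, 0.000126, 0.000114])  # conditioned on the summary
prob_doc_x = np.array([0.981052, 0.017671, 0.000337, 0.000941])  # conditioned on the source document

# total variation distance: 0.5 * sum_i |p_i - q_i|, bounded in [0, 1]
total_variation = 0.5 * np.abs(prob_sum_y - prob_doc_x).sum()    # ~0.014 for this question

# KL-divergence KL(p || q) = sum_i p_i * log(p_i / q_i), unbounded
kl_divergence = np.sum(prob_sum_y * np.log(prob_sum_y / prob_doc_x))  # ~0.0076 for this question

print(total_variation, kl_divergence)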

Experimental Results

The Pearson correlation coefficients (PCC) between the MQAG-score and human judgements are shown below.

Method                              | QAG-CNNDM | QAG-XSum | XSum-Faithful | XSum-Factual | Podcast | SummEval-Rel | SummEval-Cons
MQAG-Src (KL-div)                   | 0.143     | 0.097    | 0.088         | 0.054        | 0.321   | 0.559        | 0.599
MQAG-Sum (KL-div)                   | 0.450     | 0.283    | 0.135         | 0.179        | 0.789   | 0.753        | 0.954
MQAG-Sum (TotalVar)                 | 0.462     | 0.309    | 0.221         | 0.244        | 0.770   | 0.796        | 0.933
MQAG-Sum (TotalVar + Answerability) | 0.502     | 0.313    | 0.306         | 0.270        | 0.855   | 0.814        | 0.945

*Update:

  1. We found that comparing the two distributions using a bounded distance such as total variation yields better results than KL-divergence.
  2. Using an answerability measure to filter out poor (i.e., unanswerable) questions improves performance.
  3. G1 trained on SQuAD is also available for MQAG; see our model weights on HuggingFace: https://huggingface.co/potsawee/t5-large-generation-squad-QuestionAnswer (a loading sketch is given below).

Additional results and discussion about statistical distances, answerability, and other model variants can be found in our paper.
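
The SQuAD-trained G1 checkpoint above can be loaded directly with the HuggingFace transformers library. The sketch below is illustrative only: it loads the checkpoint and generates from a toy passage. According to the model card, the decoded output contains the generated question and its answer separated by a special separator token; check the model card for the exact output format before parsing it.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "potsawee/t5-large-generation-squad-QuestionAnswer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# toy passage; in practice, a passage from the document or summary being assessed
context = "Two security guards were threatened during a robbery at a bank in the city centre."

inputs = tokenizer(context, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True)

# the decoded text contains the generated question and answer (separator-delimited, per the model card)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))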

Training Multiple-choice QG and QA models

Note that trained weights are available via the download links provided at the beginning of this README. Here, we provide the scripts to train the QG and QA models, or to fine-tune them on other multiple-choice datasets. Hyperparameters and configurations are set manually inside the scripts just before def experiment(). The current version only supports T5 and Longformer, but you're welcome to modify the code to use a different architecture.

QG system

There are two generation models: (1) Question + Answer generation (the supposedly correct answer); (2) Distractor generation (the remaining options in addition to the answer).

  • model1: Question + Answer Generation Model

      python train_generation_qa.py
    
  • model2: Distractor Generation Model

      python train_generation_distractors.py
    

QA system

  • one answering model for predicting the probability distribution over the options

      python train_answering.py
    

Links to Datasets

We refer to the original papers that released the datasets.

Citation

@article{manakul2023mqag,
  title={MQAG: Multiple-choice Question Answering and Generation for Assessing Information Consistency in Summarization},
  author={Manakul, Potsawee and Liusie, Adian and Gales, Mark JF},
  journal={arXiv preprint arXiv:2301.12307},
  year={2023}
}
