Position Bias in Question Answering

This repository provides code for the paper 'Look at the First Sentence: Position Bias in Question Answering' (EMNLP 2020). It trains question-answering models on synthetic datasets with several de-biasing methods. We currently provide the synthetic datasets and the answer-position statistics for SQuAD.

Requirements

$ conda create -n position-bias python=3.6
$ conda activate position-bias
$ conda install pytorch==1.5.0 torchvision==0.6.0 cudatoolkit=10.1 -c pytorch
$ pip install -r requirements.txt

Note that the PyTorch build must match your CUDA version; adjust the cudatoolkit version above accordingly.

Dataset

We provide five synthetic datasets.

| Dataset | Answer position | Examples |
| --- | --- | --- |
| SQuAD-train-1st.json | First sentence | 28,263 |
| SQuAD-train-2nd.json | Second sentence | 20,593 |
| SQuAD-train-3rd.json | Third sentence | 15,567 |
| SQuAD-train-4th.json | Fourth sentence | 10,379 |
| SQuAD-train-5th.json | Fifth sentence and later | 12,610 |
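To make the split concrete, here is a minimal sketch of how an answer could be assigned to one of the five buckets from its SQuAD `answer_start` character offset. It assumes a naive regex sentence splitter; the sentence tokenizer actually used to build the datasets is not stated in this README, and the helper names (`split_sentences`, `answer_sentence_index`, `bucket_name`) are hypothetical.

```python
import re


def split_sentences(context):
    """Naive splitter (an assumption): return (start, end) character spans."""
    spans = []
    start = 0
    # Break after '.', '!' or '?' when followed by whitespace.
    for match in re.finditer(r"[.!?]\s+", context):
        spans.append((start, match.end()))
        start = match.end()
    # Keep the trailing sentence, which has no whitespace after it.
    if start < len(context):
        spans.append((start, len(context)))
    return spans


def answer_sentence_index(context, answer_start):
    """Return the 0-based index of the sentence containing answer_start."""
    for index, (begin, end) in enumerate(split_sentences(context)):
        if begin <= answer_start < end:
            return index
    raise ValueError("answer_start is outside the context")


def bucket_name(sentence_index):
    """Map a sentence index to the dataset suffix (1st ... 5th and later)."""
    names = ["1st", "2nd", "3rd", "4th"]
    return names[sentence_index] if sentence_index < 4 else "5th"


context = "Paris is in France. Berlin is in Germany. Rome is in Italy."
answer_start = context.index("Berlin")
print(bucket_name(answer_sentence_index(context, answer_start)))  # 2nd
```

A real pipeline would likely use a proper sentence tokenizer, since abbreviations such as "U.S. " break this regex.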

Train

The following example trains BERT on our synthetic dataset.

TRAIN_FILE=dataset/squad/SQuAD-train-1st.json
OUTPUT_DIR=logs/1st_bert
make train_bert TRAIN_FILE=${TRAIN_FILE} OUTPUT_DIR=${OUTPUT_DIR}

Train Baselines

You can train two de-biasing baselines (entropy regularization, randomized position) with the following examples.

# Entropy regularization
TRAIN_FILE=dataset/squad/SQuAD-train-1st.json
OUTPUT_DIR=logs/1st_ent_reg
make train_entropy_bert TRAIN_FILE=${TRAIN_FILE} OUTPUT_DIR=${OUTPUT_DIR}

# Randomized position
TRAIN_FILE=dataset/squad/SQuAD-train-1st.json
OUTPUT_DIR=logs/1st_random
make train_random_bert TRAIN_FILE=${TRAIN_FILE} OUTPUT_DIR=${OUTPUT_DIR}
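As a rough illustration of the first baseline, the sketch below shows one common form of entropy regularization: the negative log-likelihood of the gold position minus a weighted entropy term, so that over-confident distributions are penalized. This is an assumption about the general technique, not a copy of this repository's loss; the `weight` value and function names are hypothetical, and a QA model would apply the same idea to both start and end logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularized_loss(logits, gold_index, weight=0.1):
    """NLL of the gold position minus weight * entropy (hypothetical weight)."""
    probs = softmax(logits)
    nll = -math.log(probs[gold_index])
    return nll - weight * entropy(probs)


# A peaked prediction on the gold position: the entropy bonus is small.
print(entropy_regularized_loss([5.0, 0.0, 0.0, 0.0], gold_index=0))
```

With `weight=0` the function reduces to the plain cross-entropy loss, which makes the effect of the regularizer easy to compare.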

Train Bias Ensemble

The following examples train bias ensemble methods (bias product, learned-mixin) on each synthetic dataset. To select a synthetic dataset, set K to one of 1st, 2nd, 3rd, 4th, or 5th.

# Bias product
K=1st
TRAIN_FILE=dataset/squad/SQuAD-train-${K}.json
STAT_FILE=dataset/squad/${K}_stat.p
OUTPUT_DIR=logs/${K}_prod
make train_prod_bert TRAIN_FILE=${TRAIN_FILE} STAT_FILE=${STAT_FILE} OUTPUT_DIR=${OUTPUT_DIR}

# Learned-mixin
K=1st
TRAIN_FILE=dataset/squad/SQuAD-train-${K}.json
STAT_FILE=dataset/squad/${K}_stat.p
OUTPUT_DIR=logs/${K}_mixin
make train_mixin_bert TRAIN_FILE=${TRAIN_FILE} STAT_FILE=${STAT_FILE} OUTPUT_DIR=${OUTPUT_DIR}
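For readers unfamiliar with the two ensemble methods, the sketch below shows how they are usually defined: both add the log-probabilities of a fixed bias model to the model's own log-probabilities and renormalize, and learned-mixin additionally scales the bias term by a non-negative gate. Here the bias is a made-up prior over three answer positions; how the prior is stored in the STAT_FILE pickles is not documented in this README, and in a real model the gate would be predicted from the input rather than passed in as a constant.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]


def bias_product(model_logits, bias_log_probs):
    """Combine as log p_model + log p_bias, then renormalize."""
    model_log_probs = log_softmax(model_logits)
    combined = [m + b for m, b in zip(model_log_probs, bias_log_probs)]
    return log_softmax(combined)


def learned_mixin(model_logits, bias_log_probs, gate):
    """Combine as log p_model + gate * log p_bias, with gate >= 0."""
    model_log_probs = log_softmax(model_logits)
    combined = [m + gate * b for m, b in zip(model_log_probs, bias_log_probs)]
    return log_softmax(combined)


# Hypothetical position prior that strongly favors the first position.
prior = [0.7, 0.2, 0.1]
bias_log_probs = [math.log(p) for p in prior]
model_logits = [0.0, 0.0, 0.0]  # an undecided model

print([round(math.exp(x), 3) for x in bias_product(model_logits, bias_log_probs)])
print([round(math.exp(x), 3) for x in learned_mixin(model_logits, bias_log_probs, gate=0.0)])
```

In this family of methods the training loss is typically computed on the combined distribution, while predictions at test time come from the model's own logits, so the model is not rewarded for simply relearning the position prior.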

We also provide answer statistics of the full SQuAD dataset. After downloading the full SQuAD data, you can train the bias ensemble method with the following example.

TRAIN_FILE=dataset/squad/SQuAD-v1.1-train.json
STAT_FILE=dataset/squad/train_answer_stat.p
OUTPUT_DIR=logs/full_mixin
make train_mixin_bert TRAIN_FILE=${TRAIN_FILE} STAT_FILE=${STAT_FILE} OUTPUT_DIR=${OUTPUT_DIR}

Citation

@inproceedings{ko2020look,
      title={Look at the First Sentence: Position Bias in Question Answering}, 
      author={Ko, Miyoung and Lee, Jinhyuk and Kim, Hyunjae and Kim, Gangwoo and Kang, Jaewoo},
      year={2020},
      booktitle={EMNLP}
}