dmis-lab/position-bias

Position Bias in Question Answering

This repository provides code for the paper 'Look at the First Sentence: Position Bias in Question Answering' (EMNLP 2020). You can train a question-answering model on synthetic datasets with various de-biasing methods. We currently provide the synthetic datasets and the answer-position statistics of SQuAD.

Requirements

$ conda create -n position-bias python=3.6
$ conda activate position-bias
$ conda install pytorch==1.5.0 torchvision==0.6.0 cudatoolkit=10.1 -c pytorch
$ pip install -r requirements.txt

Note that the PyTorch build must match your CUDA version, so adjust the cudatoolkit version above if your system uses a different one.
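To check which CUDA version the installed PyTorch build targets, a short script along the following lines may help. This is a sketch and not part of the repository; the function name `cuda_report` is hypothetical.

```python
def cuda_report():
    """Return the installed PyTorch version and the CUDA version it was built for."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"torch": None, "cuda_build": None, "cuda_available": False}
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `cuda_build` does not match the CUDA version on your machine, or `cuda_available` is False on a GPU machine, reinstalling PyTorch with a matching cudatoolkit is the likely fix.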

Dataset

We provide five synthetic datasets.

Dataset | Answer position | Examples
SQuAD-train-1st.json | First sentence | 28,263
SQuAD-train-2nd.json | Second sentence | 20,593
SQuAD-train-3rd.json | Third sentence | 15,567
SQuAD-train-4th.json | Fourth sentence | 10,379
SQuAD-train-5th.json | Fifth sentence and later | 12,610
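To make the answer-position split concrete, the sketch below maps a SQuAD-style character offset (`answer_start`) to the sentence that contains it and then to one of the five subset names. It is an illustration only: the regular-expression sentence splitter and the helper names `split_sentences`, `answer_sentence_index`, and `subset_name` are assumptions, and the sentence splitter actually used to build the datasets is not stated in this README.

```python
import re

SUBSETS = ["1st", "2nd", "3rd", "4th", "5th"]


def split_sentences(context):
    """Naively split text into (start_offset, sentence) pairs.

    A sentence ends at '.', '!' or '?' followed by whitespace.
    """
    spans, start = [], 0
    for match in re.finditer(r"(?<=[.!?])\s+", context):
        spans.append((start, context[start:match.start()]))
        start = match.end()
    if start < len(context):
        spans.append((start, context[start:]))
    return spans


def answer_sentence_index(context, answer_start):
    """Return the 0-based index of the sentence containing answer_start."""
    index = 0
    for i, (start, _) in enumerate(split_sentences(context)):
        if start <= answer_start:
            index = i
    return index


def subset_name(context, answer_start):
    """Map an answer offset to a subset name; the fifth subset covers later sentences too."""
    return SUBSETS[min(answer_sentence_index(context, answer_start), len(SUBSETS) - 1)]


if __name__ == "__main__":
    context = (
        "Seoul is the capital of Korea. It lies on the Han River. "
        "The city hosted the 1988 Olympics."
    )
    print(subset_name(context, context.index("Han River")))
```

A naive splitter like this one breaks on abbreviations such as "Dr." or "U.S.", so counts produced this way should not be expected to match the table exactly.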

Train

The following example trains BERT on one of our synthetic datasets.

TRAIN_FILE=dataset/squad/SQuAD-train-1st.json
OUTPUT_DIR=logs/1st_bert
make train_bert TRAIN_FILE=${TRAIN_FILE} OUTPUT_DIR=${OUTPUT_DIR}
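To train on all five subsets in turn, the command above can be generated in a loop. The sketch below is a convenience wrapper, not part of the repository: the helper names `train_bert_command` and `train_all` are hypothetical, and the file and directory patterns are extrapolated from the `1st` example (an assumption for the other four subsets).

```python
import subprocess

POSITIONS = ["1st", "2nd", "3rd", "4th", "5th"]


def train_bert_command(k):
    """Build the `make train_bert` invocation for the subset named k."""
    return [
        "make",
        "train_bert",
        f"TRAIN_FILE=dataset/squad/SQuAD-train-{k}.json",
        f"OUTPUT_DIR=logs/{k}_bert",
    ]


def train_all(dry_run=True):
    """Print each command; run them one after another when dry_run is False."""
    for k in POSITIONS:
        command = train_bert_command(k)
        print(" ".join(command))
        if not dry_run:
            # check=True stops the loop as soon as one training run fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    train_all(dry_run=True)
```

Running with `dry_run=True` only prints the commands, which is a cheap way to confirm the paths before starting five training runs.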

Train Baselines

You can train two de-biasing baselines (entropy regularization, randomized position) with the following examples.

# Entropy regularization
TRAIN_FILE=dataset/squad/SQuAD-train-1st.json
OUTPUT_DIR=logs/1st_ent_reg
make train_entropy_bert TRAIN_FILE=${TRAIN_FILE} OUTPUT_DIR=${OUTPUT_DIR}

# Randomized position
TRAIN_FILE=dataset/squad/SQuAD-train-1st.json
OUTPUT_DIR=logs/1st_random
make train_random_bert TRAIN_FILE=${TRAIN_FILE} OUTPUT_DIR=${OUTPUT_DIR}
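For intuition about the entropy regularization baseline, the sketch below shows a generic confidence-penalty loss: cross-entropy on the gold position minus a weighted entropy of the predicted distribution, so that over-confident predictions are penalized. It is written in plain Python for readability and is not the repository's implementation; the function names and the default `weight=0.1` are assumptions, and the exact form and weight used by `train_entropy_bert` are not stated in this README.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularized_loss(logits, gold_index, weight=0.1):
    """Cross-entropy on the gold position minus weight * entropy of the prediction."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[gold_index])
    return cross_entropy - weight * entropy(probs)


if __name__ == "__main__":
    start_logits = [4.0, 1.0, 0.5, 0.2]  # toy scores over four token positions
    print(entropy_regularized_loss(start_logits, gold_index=0))
```

With `weight=0.0` the loss reduces to ordinary cross-entropy; larger weights reward flatter distributions over answer positions.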

Train Bias Ensemble

The following examples train the bias ensemble methods (bias product, learned-mixin) on a synthetic dataset. To select the dataset, set K to one of 1st, 2nd, 3rd, 4th, or 5th.

# Bias product
K=1st
TRAIN_FILE=dataset/squad/SQuAD-train-${K}.json
STAT_FILE=dataset/squad/${K}_stat.p
OUTPUT_DIR=logs/${K}_prod
make train_prod_bert TRAIN_FILE=${TRAIN_FILE} STAT_FILE=${STAT_FILE} OUTPUT_DIR=${OUTPUT_DIR}

# Learned-mixin
K=1st
TRAIN_FILE=dataset/squad/SQuAD-train-${K}.json
STAT_FILE=dataset/squad/${K}_stat.p
OUTPUT_DIR=logs/${K}_mixin
make train_mixin_bert TRAIN_FILE=${TRAIN_FILE} STAT_FILE=${STAT_FILE} OUTPUT_DIR=${OUTPUT_DIR}
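As a rough illustration of what the two ensemble methods compute, the sketch below combines a model's distribution over answer positions with a fixed position prior in log space. It follows the way bias product and learned-mixin are usually described, not the repository's code: the function names are hypothetical, the prior is assumed to come from the answer statistics in STAT_FILE, and the learned-mixin gate is passed in as a constant here, whereas in practice it would be predicted by the model for each input.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]


def bias_product(model_logits, bias_probs, eps=1e-12):
    """Training-time distribution: log_softmax(log p_model + log p_bias)."""
    model_log_probs = log_softmax(model_logits)
    # eps guards against log(0) when the prior assigns zero mass to a position.
    combined = [lp + math.log(max(b, eps)) for lp, b in zip(model_log_probs, bias_probs)]
    return log_softmax(combined)


def learned_mixin(model_logits, bias_probs, gate, eps=1e-12):
    """Like bias_product, but the bias term is scaled by a non-negative gate."""
    model_log_probs = log_softmax(model_logits)
    combined = [lp + gate * math.log(max(b, eps)) for lp, b in zip(model_log_probs, bias_probs)]
    return log_softmax(combined)


if __name__ == "__main__":
    logits = [0.2, 1.5, 0.3]   # toy model scores over three positions
    prior = [0.7, 0.2, 0.1]    # toy position prior favoring early positions
    print([round(math.exp(x), 3) for x in bias_product(logits, prior)])
    print([round(math.exp(x), 3) for x in learned_mixin(logits, prior, gate=0.5)])
```

In this family of methods the combined distribution is typically used only for the training loss, and predictions at test time come from the model's own distribution, so the model is discouraged from relearning what the prior already explains.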

We also provide answer statistics for the full SQuAD dataset. After downloading the full SQuAD data, you can train the bias ensemble method with the following example.

TRAIN_FILE=dataset/squad/SQuAD-v1.1-train.json
STAT_FILE=dataset/squad/train_answer_stat.p
OUTPUT_DIR=logs/full_mixin
make train_mixin_bert TRAIN_FILE=${TRAIN_FILE} STAT_FILE=${STAT_FILE} OUTPUT_DIR=${OUTPUT_DIR}
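The layout of the statistics files is not documented here. Assuming the `.p` extension denotes a standard Python pickle (an assumption, not confirmed by this README), a small helper such as the hypothetical `describe_stat_file` below can summarize the top-level object before you rely on it.

```python
import pickle


def describe_stat_file(path):
    """Load a pickle file and summarize the top-level object without assuming its layout."""
    with open(path, "rb") as handle:
        # Unpickling can execute arbitrary code; only load files from a source you trust.
        stats = pickle.load(handle)
    summary = {"type": type(stats).__name__}
    if hasattr(stats, "__len__"):
        summary["length"] = len(stats)
    if isinstance(stats, dict):
        summary["sample_keys"] = list(stats)[:5]
    return summary


if __name__ == "__main__":
    import sys

    # Usage: python describe_stat.py dataset/squad/train_answer_stat.p
    if len(sys.argv) > 1:
        print(describe_stat_file(sys.argv[1]))
```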

Citation

@inproceedings{ko2020look,
      title={Look at the First Sentence: Position Bias in Question Answering}, 
      author={Ko, Miyoung and Lee, Jinhyuk and Kim, Hyunjae and Kim, Gangwoo and Kang, Jaewoo},
      year={2020},
      booktitle={EMNLP}
}
