
BUG Dataset

A Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation (Levy et al., Findings of EMNLP 2021).

BUG was collected semi-automatically from different real-world corpora and is designed to be challenging in terms of societal gender-role assignments for machine translation and coreference resolution.

Setup

  1. Unzip data.tar.gz; this should create a data folder with the following files (a quick verification sketch follows these steps):
    • balanced_BUG.csv
    • full_BUG.csv
    • gold_BUG.csv
  2. Set up a Python 3.x environment and install the requirements:
pip install -r requirements.txt
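
If you want to confirm the archive unpacked correctly, a minimal Python check such as the following can help (the file names come from step 1; nothing here is part of the repository's scripts):

import os

# Expected partition files after unpacking data.tar.gz
expected = ["balanced_BUG.csv", "full_BUG.csv", "gold_BUG.csv"]
missing = [f for f in expected if not os.path.exists(os.path.join("data", f))]
assert not missing, f"Missing partition files: {missing}"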

Dataset Partitions

NOTE: These partitions vary slightly from those reported in the paper due to improvements and bug fixes made after submission. For reproducibility's sake, you can access the dataset version from the submission here.

Full BUG

105,687 sentences with a human entity, identified by their profession and a gendered pronoun.

Gold BUG

1,717 sentences, the gold-quality human-validated samples.

Balanced BUG

25,504 sentences, randomly sampled from Full BUG to ensure balance between male and female entities and between stereotypical and non-stereotypical gender role assignments.
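
For intuition, this kind of balancing can be approximated with a grouped sample over Full BUG. The sketch below is illustrative only and is not the original sampling script; it assumes the predicted gender and stereotype columns described in the Dataset Format section and pandas >= 1.1:

import pandas as pd

df = pd.read_csv("data/full_BUG.csv")
# Group by gender and stereotype label, then downsample every group
# to the size of the smallest one, yielding a balanced subset.
groups = df.groupby(["predicted gender", "stereotype"])
smallest = groups.size().min()
balanced = groups.sample(n=smallest, random_state=0).reset_index(drop=True)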

Dataset Format

Each file in the data folder is a CSV file adhering to the following format:

Column  Header                   Description
1       sentence_text            Text of the sentence containing a human entity, identified by their profession, and a gendered pronoun
2       tokens                   List of tokens (tokenized with spaCy)
3       profession               The profession of the entity in the sentence
4       g                        The pronoun in the sentence
5       profession_first_index   Word offset of the profession in the sentence
6       g_first_index            Word offset of the pronoun in the sentence
7       predicted gender         'male'/'female', determined by the pronoun
8       stereotype               -1/0/1 for an anti-stereotypical, neutral, or stereotypical sentence
9       distance                 Absolute distance in words between the pronoun and the profession
10      num_of_pronouns          Number of pronouns in the sentence
11      corpus                   The corpus from which the sentence was taken
12      data_index               The query index of the sentence's pattern
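
For example, the gold partition can be loaded and inspected with pandas (a minimal sketch; column names as in the table above):

import pandas as pd

df = pd.read_csv("data/gold_BUG.csv")
print(df.shape)  # expect 1,717 rows, one per validated sentence
print(df[["sentence_text", "profession", "g", "stereotype"]].head())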

Evaluations

See the instructions below for reproducing our evaluations on BUG.

Coreference

  1. Download the SpanBERT predictions from this link.
  2. Unzip and put coref_preds.jsonl in the predictions/ folder (a snippet for inspecting this file follows the list).
  3. From src/evaluations/, run python evaluate_coref.py --in=../../predictions/coref_preds.jsonl --out=../../visualizations/delta_s_by_dist.png.
  4. This should reproduce the coreference evaluation figure.
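
If you want to inspect the predictions before running the evaluation, note that each line of a .jsonl file is a standalone JSON object. The exact fields emitted by SpanBERT's predictor are not documented here, so this sketch only loads and counts the records:

import json

# Each line in a .jsonl file is one JSON record; the field layout of
# the SpanBERT predictions is not inspected here.
with open("predictions/coref_preds.jsonl") as f:
    records = [json.loads(line) for line in f]
print(f"{len(records)} prediction records")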

Conversions

CoNLL

To convert a data partition to CoNLL format, run:

python convert_to_conll.py --in=path/to/input/file --out=path/to/output/file

For example, try:

python convert_to_conll.py --in=../../data/gold_BUG.csv --out=./gold_bug.conll
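
convert_to_conll.py is the supported conversion path. For intuition only, a hand-rolled sketch of the same idea might emit one token per line and mark the profession and pronoun as a single coreference cluster; the script's actual column layout may differ, and this assumes the tokens column is a stringified list and that the profession span starts at profession_first_index:

import ast
import pandas as pd

df = pd.read_csv("data/gold_BUG.csv")
row = df.iloc[0]
tokens = ast.literal_eval(row["tokens"])  # assumed stringified token list
marked = {row["profession_first_index"], row["g_first_index"]}
for i, tok in enumerate(tokens):
    # Mark the profession head and the pronoun as one cluster (0).
    coref = "(0)" if i in marked else "-"
    print(i, tok, coref, sep="\t")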

Filter from SPIKE

  1. Download the desired SPIKE CSV files and save them all in the same directory (directory_path).
  2. Make sure the name of each file ends with _<corpus><x>.csv, where corpus is the name of the SPIKE dataset and x is the number of the query you entered in the search (for example, myspikedata_wikipedia18.csv). A sanity-check sketch for this convention follows the list.
  3. From src/evaluations/, run python Analyze.py directory_path.
  4. This should reproduce the full and balanced datasets.
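
As a sanity check on the naming convention in step 2, a small sketch like the following (not part of the repository) can flag files that do not match _<corpus><x>.csv; directory_path is a placeholder:

import os
import re

# Matches e.g. myspikedata_wikipedia18.csv -> corpus "wikipedia", query "18"
pattern = re.compile(r"_([A-Za-z]+)(\d+)\.csv$")
directory_path = "path/to/spike/files"  # placeholder
for name in os.listdir(directory_path):
    if not pattern.search(name):
        print(f"Unexpected file name: {name}")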

Citing

@misc{levy2021collecting,
      title={Collecting a Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation}, 
      author={Shahar Levy and Koren Lazar and Gabriel Stanovsky},
      year={2021},
      eprint={2109.03858},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
