Cross-document coreference resolution (CDCR) datasets with diverse annotation schemes

This repository contains the code used to produce the results reported in the LREC 2022 paper by Zhukova A., Hamborg F., and Gipp B., "Towards Evaluation of Cross-document Coreference Resolution Models Using Datasets with Diverse Annotation Schemes".
Please use the following BibTeX entry to cite the paper:

@inproceedings{Zhukova2022a,
  title        = {{T}owards {E}valuation of {C}ross-document {C}oreference {R}esolution {M}odels {U}sing {D}atasets with {D}iverse {A}nnotation {S}chemes},
  author       = {Zhukova, Anastasia and Hamborg, Felix and Gipp, Bela},
  year         = 2022,
  month        = {June},
  booktitle    = {Proceedings of the 13th Language Resources and Evaluation Conference},
  location     = {Marseille, France}
}

The repository contains code that parses the original formats of CDCR datasets into a common format (CoNLL format for coreference resolution plus a separate list of mentions) and calculates summary values that enable comparison of the datasets.

The parsing script for each dataset is located in that dataset's folder, whereas the summary script is located in the root folder. The parsed datasets are available in this repository in the folders listed below.

Installation

Python 3.8 is required.
Creating a virtual environment (venv) is recommended.
Install the libraries: pip install -r requirements.txt
Download the datasets and the required libraries from spaCy: python setup.py
Download and install Perl. Add perl to PATH, restart your computer, and check that perl has been installed correctly.
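
The last step can be checked from Python as well. The snippet below is a minimal sketch and not part of the repository: the helper name `check_environment` is hypothetical, and it assumes that "Python 3.8 required" can be treated as a minimum version.

```python
import shutil
import sys


def check_environment(min_python=(3, 8)):
    """Report whether the Python and Perl prerequisites appear to be met.

    Hypothetical helper for illustration; treating 3.8 as a minimum
    version is an assumption, not something the repository states.
    """
    return {
        # True if the running interpreter is at least min_python
        "python_ok": sys.version_info >= min_python,
        # Absolute path of the perl executable, or None if it is not on PATH
        "perl_path": shutil.which("perl"),
    }


if __name__ == "__main__":
    status = check_environment()
    print("Python version OK:", status["python_ok"])
    print("perl found at:", status["perl_path"])
```

If `perl_path` is `None`, perl is not visible on PATH for the current session.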

Dataset information

The parsing scripts and output folders are located here:

Dataset	Parsing script	Output files
ECB+	`ECBplus-prep/parse_ecbplus.py`	`ECBplus-prep/output_data`
NewsWCL50	`NewsWCL50-prep/parse_newswcl50.py`	`NewsWCL50-prep/output_data`

Each dataset contains three output files suitable for a CDCR model:

*dataset_name*.conll
entity_mentions.json
event_mentions.json

CoNLL format (simplified)

CoNLL format is a standard input format for within-document coreference resolution. The original format contains multiple columns with per-token information, e.g., POS tags and NER labels. We use a simplified format (based on the format of the input files used by Barhom et al. 2019) that contains tokens, their identifiers in the text (e.g., doc_id, sent_id), and the labels of coreference chains:

Column ID	Type	Description
0	string	Composed document id: topic/subtopic/doc ("-" is used if there is no subtopic)
1	int	Sentence ID
2	int	Token ID
3	string	Token
4	string	Coreference chain

Each document is enclosed in a begin tag and an end tag, and sentences are separated by new lines. Warning: some newline delimiters can be tokens themselves (e.g., in NewsWCL50).

Example:

#begin document 0/-/0_LL; part 000
0/-/0_LL 0 0 This -
0/-/0_LL 0 1 is -
0/-/0_LL 0 2 Jim (1)
0/-/0_LL 0 3 . -

0/-/0_LL 1 0 He (1)
0/-/0_LL 1 1 likes - 
0/-/0_LL 1 2 sports - 
0/-/0_LL 1 3 . -

#end document
#begin document 1/1ecb/12; part 000
1/1ecb/12 0 0 This -
1/1ecb/12 0 1 is -
1/1ecb/12 0 2 Anna (2)
1/1ecb/12 0 3 . -

1/1ecb/12 1 0 She (2)
1/1ecb/12 1 1 likes - 
1/1ecb/12 1 2 singing - 
1/1ecb/12 1 3 . -

#end document
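
The simplified format above can be read with a few lines of Python. This is a minimal sketch rather than the repository's own reader: the function name `parse_simplified_conll` is hypothetical, it assumes that the five columns are separated by whitespace, and it skips rows that do not have exactly five fields (such as rows affected by the newline-token caveat mentioned above).

```python
from collections import defaultdict


def parse_simplified_conll(text):
    """Parse simplified CoNLL text into {doc_key: [(sent_id, token_id, token, label), ...]}.

    Hypothetical helper; assumes whitespace-separated columns.
    """
    docs = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip sentence separators and the document begin/end tags.
        if not line or line.startswith(("#begin document", "#end document")):
            continue
        parts = line.split()
        if len(parts) != 5:
            # Rows with an unexpected number of fields are ignored in this sketch.
            continue
        doc_key, sent_id, token_id, token, label = parts
        docs[doc_key].append((int(sent_id), int(token_id), token, label))
    return dict(docs)


sample = """#begin document 0/-/0_LL; part 000
0/-/0_LL 0 0 This -
0/-/0_LL 0 1 is -
0/-/0_LL 0 2 Jim (1)
0/-/0_LL 0 3 . -

0/-/0_LL 1 0 He (1)
#end document
"""

parsed = parse_simplified_conll(sample)
# Keep only the tokens that carry a coreference label.
labelled = [row for row in parsed["0/-/0_LL"] if row[3] != "-"]
```

Here `labelled` holds the rows for "Jim" and "He", both tagged with chain `(1)`.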

Mentions.json

The format is adapted and extended from WEC-Eng and from the mention format used by Barhom et al. 2019.

Field	Type	Description
coref_chain	string	Unique identifier of the coreference chain to which this mention belongs.
description	string	Description of a coreference chain.
coref_type	string	Type of a coreference link, e.g., strict identity.
mention_id	string	Mention ID.
mention_type	string	Short form of a mention type, e.g., HUM
mention_full_type	string	Long form of a mention type, e.g., HUMAN_PART_PER
tokens_str	string	The full mention string, i.e., all consecutive characters of the mention as found in the text.
tokens_text	list of strings	The mention split into a list of tokens (the text of each token).
tokens_number	list of int	The mention split into a list of tokens (the token ID of each token as it occurs in the sentence).
mention_head	string	Head of the mention's phrase, e.g., Barack Obama
mention_head_id	int	Token ID of the head of the mention's phrase
mention_head_pos	string	POS tag of the head of the mention's phrase
mention_head_lemma	string	Lemma of the head of the mention's phrase
sent_id	int	Sentence ID
topic_id	int	Topic ID
topic	string	Topic description
subtopic	string	Subtopic name
doc_id	string	Document ID
is_continuous	bool	Whether all tokens of the annotated mention occur contiguously in the text.
is_singleton	bool	Whether the coreference chain consists of only one mention.
mention_context	list of strings	N tokens before and N tokens after the mention (N=100).
conll_doc_key	string	A composite key for one-to-one mapping of documents between the .conll and .json files.

Example:

[{
    "coref_chain": "0_Denuclearization_MISC", 
    "tokens_number": [33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], 
    "doc_id": "0_L", 
    "score": -1, 
    "sent_id": 21, 
    "mention_type": "MISC", 
    "mention_full_type": "MISC", 
    "mention_id": "0_L_21_49_VZrL", 
    "topic_id": 0, 
    "topic": "0_CIADirectorMikePompeoMeetingNorthKorea", 
    "subtopic": "-",
    "description": "Denuclearization", 
    "coref_type": "STRICT", 
    "mention_ner": "O", 
    "mention_head_pos": "PUNCT", 
    "mention_head_lemma": "\"", 
    "mention_head": "\"", 
    "mention_head_id": 49, 
    "is_continuous": true, 
    "is_singleton": false, 
    "mention_context": ["newspaper", ",", "Munhwa", "Ilbo", ",", "reported", "that", "the", "two", "countries", "were", "negotiating", "an", "announcement", "\"", "to", "ease", "military", "tensions", "and", "end", "a", "military", "confrontation", ",", "\"", "as", "part", "of", "the", "summit", "meeting", "planned", "between", "Mr.", "Kim", "and", "President", "Moon", "Jae", "-", "in", "of", "South", "Korea", ".", "\n", "That", "could", "involve", "pulling", "troops", "out", "of", "the", "Demilitarized", "Zone", ",", "making", "it", "a", "genuinely", "\"", "Demilitarized", "Zone", ".", "\"", "A", "South", "Korean", "government", "official", "later", "played", "down", "the", "report", ",", "saying", "it", "was", "too", "soon", "to", "tell", "what", "a", "joint", "statement", "by", "Mr.", "Moon", "and", "Mr.", "Kim", "would", "contain", ",", "other", "than", "broad", "and", "\"", "abstract", "\"", "statements", "about", "the", "need", "for", "North", "Korea", "to", "\"", "denuclearize", ".", "\"", "\n", "But", "analysts", "said", "South", "Korea", "was", "aiming", "for", "a", "comprehensive", "deal", ",", "in", "which", "the", "North", "agreed", "to", "give", "up", "its", "weapons", "in", "return", "for", "a", "security", "guarantee", ",", "including", "a", "peace", "treaty", ".", "Mr.", "Trump", "'s", "comments", "suggested", "he", "backed", "that", "effort", ".", "\n", "\"", "They", "do", "have", "my", "blessing", "to", "discuss", "the", "end", "of", "the", "war", ",", "\"", "he", "said", ".", "\"", "People", "do", "n't", "realize", "that", "the", "Korean", "War", "has", "not", "ended", ".", "It", "'s", "going", "on", "right", "now", ".", "And", "they", "are", "discussing", "an", "end", "to", "war", ".", "Subject", "to", "a", "deal", ",", "they"], 
    "tokens_str": "broad and \"abstract\" statements about the need for North Korea to \"denuclearize.\" ", 
    "tokens_text": ["broad", "and", "\"", "abstract", "\"", "statements", "about", "the", "need", "for", "North", "Korea", "to", "\"", "denuclearize", ".", "\""], 
    "conll_doc_key": "0/-/0_L"
}]
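
Because every mention record carries its `coref_chain`, the chains can be rebuilt by grouping the records. The snippet below is a minimal sketch under stated assumptions: the function name `group_mentions_by_chain` is hypothetical, and the inline sample uses made-up, shortened records that contain only a few of the fields listed above.

```python
import json
from collections import defaultdict


def group_mentions_by_chain(mentions):
    """Map each coref_chain identifier to the list of its mention strings.

    Hypothetical helper for illustration only.
    """
    chains = defaultdict(list)
    for mention in mentions:
        chains[mention["coref_chain"]].append(mention["tokens_str"])
    return dict(chains)


# Made-up, shortened records; real files contain the full set of fields.
sample_json = """[
  {"coref_chain": "c1", "tokens_str": "Jim", "conll_doc_key": "0/-/0_LL", "sent_id": 0},
  {"coref_chain": "c1", "tokens_str": "He", "conll_doc_key": "0/-/0_LL", "sent_id": 1},
  {"coref_chain": "c2", "tokens_str": "Anna", "conll_doc_key": "1/1ecb/12", "sent_id": 0}
]"""

mentions = json.loads(sample_json)
chains = group_mentions_by_chain(mentions)
# Document keys that can be used to look the mentions up in the .conll file.
doc_keys = sorted({m["conll_doc_key"] for m in mentions})
```

For a parsed dataset, the same function could be applied to the list obtained by calling `json.load` on an `entity_mentions.json` or `event_mentions.json` file from one of the output folders.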

Dataset summary metrics

The following values enable comparison of the CDCR datasets on dataset and topic levels.

Field	Type	Description
dataset	string	Name of the dataset
topic	string	Topic name (or empty for the line that contains stats for a full dataset)
articles	int	Number of articles in a dataset/topic
tokens	int	Number of tokens in a dataset/topic
coref_chain	int	Number of coref chains in a dataset/topic
mentions	int	Number of all mentions in a dataset/topic
event_mentions	int	Number of event mentions in a dataset/topic
entity_mentions	int	Number of entity mentions in a dataset/topic
singletons	int	Number of singleton coref chains in a dataset/topic
average_size	float	Average number of mentions in a coref chain, i.e., chain size
unique_lemmas_all	float	Lexical diversity measurement: the number of unique mention lemmas in a chain. Calculated on all coref chains.
unique_lemmas_wo_singl	float	Same as unique_lemmas_all, but calculated on non-singleton chains only.
phrasing_diversity_weighted_all	float	Lexical diversity measurement: phrasing diversity (see the LREC paper). Measures the diversity of the mentions given the variation and frequency of the chains' mentions. Calculated on all mentions.
phrasing_diversity_weighted_wo_singl	float	Same as phrasing_diversity_weighted_all, but calculated on non-singleton chains only.
F1_CONLL_all	float	F1 CoNLL (average of B3, MUC, and CEAF_e) calculated on the simple same-lemma baseline. Calculated on all coref chains.
F1_CONLL_wo_singl	float	Same as F1_CONLL_all, but calculated on non-singleton chains only.

The results of the dataset comparison are available in the /summary folder.
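
Some of the simpler values in the table can be illustrated directly on a list of mention records. The snippet below is a sketch, not the logic of `create_summary.py`: the function names are hypothetical, and it assumes that the unique-lemma values are averaged over chains and based on `mention_head_lemma`. Phrasing diversity is omitted because its definition is given in the paper rather than here.

```python
from collections import defaultdict
from statistics import mean


def chain_statistics(mentions):
    """Compute chain-level counts from mention records (illustrative sketch).

    Assumption: unique lemmas are counted per chain on mention_head_lemma
    and then averaged over chains.
    """
    lemmas_per_chain = defaultdict(list)
    for mention in mentions:
        lemmas_per_chain[mention["coref_chain"]].append(
            mention["mention_head_lemma"].lower()
        )

    sizes = [len(lemmas) for lemmas in lemmas_per_chain.values()]
    unique_all = [len(set(lemmas)) for lemmas in lemmas_per_chain.values()]
    unique_wo_singl = [
        len(set(lemmas)) for lemmas in lemmas_per_chain.values() if len(lemmas) > 1
    ]
    return {
        "coref_chain": len(sizes),
        "mentions": sum(sizes),
        "singletons": sum(1 for size in sizes if size == 1),
        "average_size": mean(sizes) if sizes else 0.0,
        "unique_lemmas_all": mean(unique_all) if unique_all else 0.0,
        "unique_lemmas_wo_singl": mean(unique_wo_singl) if unique_wo_singl else 0.0,
    }


def conll_f1(muc_f1, b3_f1, ceaf_e_f1):
    """F1 CoNLL as the unweighted average of the three metric F1 scores."""
    return (muc_f1 + b3_f1 + ceaf_e_f1) / 3


# Made-up records with only the two fields this sketch needs.
sample_mentions = [
    {"coref_chain": "A", "mention_head_lemma": "Obama"},
    {"coref_chain": "A", "mention_head_lemma": "president"},
    {"coref_chain": "A", "mention_head_lemma": "obama"},
    {"coref_chain": "B", "mention_head_lemma": "meeting"},
]
stats = chain_statistics(sample_mentions)
```

In this sample, chain A has three mentions with two distinct lemmas and chain B is a singleton, so the "wo_singl" value is computed on chain A alone.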
