Revisiting Joint Modeling of Cross-document Entity and Event Coreference Resolution

Introduction

This code was used in the paper "Paraphrasing vs Coreferring: Two Sides of the Same Coin" (link) by Yehudit Meged, Avi Caciularu, Vered Shwartz, and Ido Dagan,

and is based on the code of the paper "Revisiting Joint Modeling of Cross-document Entity and Event Coreference Resolution".

It is a neural model, implemented in PyTorch, for resolving cross-document entity and event coreference using paraphrasing knowledge.

The model was trained and evaluated on the ECB+ corpus.

Prerequisites

  • Python 3.6
  • PyTorch 0.4.0
    • We specifically used PyTorch 0.4.0 with CUDA 9.0 on Linux, which can be installed using the command: pip install https://download.pytorch.org/whl/cu90/torch-0.4.0-cp36-cp36m-linux_x86_64.whl
  • spaCy 2.0.18
    • Install the spaCy en model with python -m spacy download en
  • Matplotlib 3.0.2
  • NumPy 1.16.1
  • NLTK 3.4
  • scikit-learn 0.20.2
  • SciPy 1.2.1
  • seaborn 0.9.0
  • AllenNLP 0.5.1
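
Before running anything, it can help to verify that the installed package versions match these pins. The following is a minimal sanity-check sketch (it is not part of the repository and only refers to the package names listed above):

```python
# quick sanity check of the pinned dependency versions (Python 3.6 assumed)
import pkg_resources

expected = {
    "torch": "0.4.0", "spacy": "2.0.18", "matplotlib": "3.0.2",
    "numpy": "1.16.1", "nltk": "3.4", "scikit-learn": "0.20.2",
    "scipy": "1.2.1", "seaborn": "0.9.0", "allennlp": "0.5.1",
}
for package, pinned in expected.items():
    try:
        installed = pkg_resources.get_distribution(package).version
        status = "OK" if installed == pinned else "MISMATCH"
        print("{}: expected {}, installed {} [{}]".format(package, pinned, installed, status))
    except pkg_resources.DistributionNotFound:
        print("{}: NOT INSTALLED (expected {})".format(package, pinned))
```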

Testing Instructions

  • Download pretrained event and entity models and pre-processed data for the ECB+ corpus at https://drive.google.com/open?id=197jYq5lioefABWP11cr4hy4Ohh1HMPGK
    • Configure the model and test set paths in the configuration file test_config.json accordingly.
  • Run the script predict_model.py with the command: python src/all_models/predict_model.py --config_path test_config.json --out_dir <output_directory>

Where:

  • config_path - a path to a JSON file that holds the test configuration (test_config.json). This configuration file is explained in config_files_readme.md.
  • out_dir - an output directory.

Main output:

  • Two response (aka system prediction) files:
    • CD_test_entity_mention_based.response_conll - cross-document entity coreference results in CoNLL format.
    • CD_test_event_mention_based.response_conll - cross-document event coreference results in CoNLL format.
  • conll_f1_scores.txt - a text file containing the CoNLL coreference scorer's output (F1 score).
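
If you want to inspect the response files programmatically, the sketch below counts mentions and clusters in one of them. It is only a sketch: it assumes the standard CoNLL coreference layout, in which cluster IDs appear in parentheses in the last column of every non-comment line; adapt it if the files differ.

```python
# count mention starts and coreference clusters in a CoNLL-format response file
# (assumes cluster IDs appear as "(<id>" in the last column -- adapt if needed)
import re
import sys
from collections import Counter

cluster_counts = Counter()
with open(sys.argv[1], encoding="utf-8") as f:  # e.g. CD_test_event_mention_based.response_conll
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):    # skip "#begin/#end document" markers and blanks
            continue
        for cluster_id in re.findall(r"\((\d+)", line.split()[-1]):
            cluster_counts[cluster_id] += 1

print("mentions: {}, clusters: {}".format(sum(cluster_counts.values()), len(cluster_counts)))
```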

Note - the script's configuration file (test_config.json) also requires:

  • An output file of a within-document entity coreference system on the ECB+ corpus (provided in this repo at data/external/stanford_neural_wd_entity_coref_out/ecb_wd_coref.json)
  • An output file of the document clustering algorithm that has been used in the paper (provided in this repo at data/external/document_clustering/predicted_topics)

Training Instructions

  • Download the pre-processed data for the ECB+ corpus at https://drive.google.com/open?id=197jYq5lioefABWP11cr4hy4Ohh1HMPGK.
    • Alternatively, you can create the data from scratch by following the instructions below.
  • Download GloVe embeddings from https://nlp.stanford.edu/projects/glove/ (we used glove.6B.300d; a short loading sketch appears after the argument descriptions below).
  • Configure paths in the configuration file train_config.json (see details at config_files_readme.md).
  • Run the script train_model.py with the command: python src/all_models/train_model.py --config_path train_config.json --out_dir <output_directory>

Where:

  • config_path - a path to a JSON file that holds the training configuration (train_config.json). This configuration file is explained in config_files_readme.md.
  • out_dir - an output directory.
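
The GloVe file from the download step above is a plain-text file with one token per line followed by its vector values. If you need to load it yourself (for example, to check vocabulary coverage), a minimal sketch:

```python
# load glove.6B.300d.txt into a dict mapping token -> 300-dimensional vector
import numpy as np

def load_glove(path):
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

glove = load_glove("glove.6B.300d.txt")
print(glove["event"].shape)   # expected: (300,)
```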

Main Output:

  • Two trained models, saved to the files:
    • cd_event_best_model - the event model that achieved the highest B-cubed F1 score on the dev set.
    • cd_entity_best_model - the entity model that achieved the highest B-cubed F1 score on the dev set.
  • summery.txt - a summary of the training.
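
To reuse a trained model outside the provided scripts, something along these lines may work. It assumes the model files were written with torch.save on whole model objects, so the repository's model classes must be importable when loading; the paths below are illustrative.

```python
# load a trained cross-document event model (illustrative paths; adjust to your run)
# assumes the file was written with torch.save(model, path), so the repository's
# model classes must be importable -- run from the repo root or adjust sys.path
import sys
import torch

sys.path.append("src/all_models")
cd_event_model = torch.load("<output_directory>/cd_event_best_model", map_location="cpu")
cd_event_model.eval()   # switch to evaluation mode before scoring mention pairs
print(type(cd_event_model))
```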

Note - the script's configuration file (train_config.json) also requires:

  • An output file of a within-document entity coreference system on the ECB+ corpus (provided in this repo at data/external/stanford_neural_wd_entity_coref_out)

Creating Data from Scratch

This repository provides pre-processed data for the ECB+ corpus (download from https://drive.google.com/open?id=197jYq5lioefABWP11cr4hy4Ohh1HMPGK). If you want to create the data from scratch instead, follow these steps:

Download ELMo's files (the options file and weights) from https://allennlp.org/elmo (we used the Original 5.5B model files).

Loading the ECB+ corpus

Extract the gold mentions and documents from the ECB+ corpus: python src/data/make_dataset.py --ecb_path <ecb_path> --output_dir <output_directory> --data_setup 2 --selected_sentences_file data/raw/ECBplus_coreference_sentences.csv

Where:

  • ecb_path - a directory containing the ECB+ documents (can be downloaded from http://www.newsreader-project.eu/results/data/the-ecb-corpus/).
  • output_dir - an output directory.
  • data_setup - enter '2' to load the ECB+ data in the same evaluation setup as used in our experiments (see the setup description in the paper).
  • selected_sentences_file - a path to a CSV file containing the selected sentences.

Output: The script saves for each data split (train/dev/test):

  • A JSON file containing its mention objects.
  • A text file containing its sentences.
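
The exact schema of these files is defined by make_dataset.py; as a quick check that a run succeeded, you can simply load one of the JSON files and report its size (the filename below is a placeholder):

```python
# sanity-check one of the mention JSON files written by make_dataset.py
# (replace the placeholder with an actual file from your output directory)
import json

with open("<output_directory>/<split_mentions_file>.json", encoding="utf-8") as f:
    mentions = json.load(f)

print(type(mentions), len(mentions))   # the object layout follows make_dataset.py's schema
```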

Feature extraction

Run the feature extraction script, which extracts predicate-argument structures, mention heads, and ELMo embeddings for each mention in each split (train/dev/test): python src/features/build_features.py --config_path build_features_config.json --output_path <output_path>

Where:

  • config_path - a path to a JSON file that holds the feature extraction configuration (build_features_config.json). This configuration file is explained in config_files_readme.md.
  • output_path - a path to the output directory.

Output: This script saves three pickle files, each containing a Corpus object that represents one split:

  • train_data - the training data, used as an input to the script train_model.py.
  • dev_data - the dev data, used as an input to the script train_model.py.
  • test_data - the test data, used as an input to the script predict_model.py.
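
If you want to inspect one of these pickle files outside train_model.py / predict_model.py, note that unpickling requires the repository's classes (e.g. Corpus) to be importable; a minimal sketch, assuming the class definitions live under src/shared:

```python
# load a pre-processed split produced by build_features.py
# unpickling needs the repository's classes (e.g. Corpus) on the import path;
# "src/shared" is an assumption -- adjust if the class definitions live elsewhere
import pickle
import sys

sys.path.append("src/shared")
with open("<output_path>/train_data", "rb") as f:
    train_corpus = pickle.load(f)

print(type(train_corpus))   # expected: a Corpus object
```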

Note - the script's configuration file also requires:

  • The output files of the script make_dataset.py (JSON and text files).
  • Output files of the SwiRL SRL system on the ECB+ corpus (provided in this repo at data/external/swirl_output).

Contact info

Contact Yehudit Meged at yehuditmeged@gmail.com for questions about this repository.
