REDSandT: Improving Distantly-Supervised Relation Extraction through BERT-based Label & Instance Embedding

This repository contains the code of our paper:
Improving Distantly-Supervised Relation Extraction through BERT-based Label & Instance Embedding
Despina Christou and Grigorios Tsoumakas

REDSandT Overview

REDSandT (Relation Extraction with Distant Supervision and Transformers) is a novel distantly-supervised, transformer-based RE method that captures highly informative instance and label embeddings for RE by transferring common knowledge from the pre-trained BERT language model. Experiments on two widely used benchmark datasets, NYT-10 and GDS, show that REDSandT captures a broader set of relations with higher confidence, including relations in the long tail.

Using Git Repository

Clone the repository from our GitHub page, then create a virtual environment:

conda create --name redsandt python=3.6

Activate it:

conda activate redsandt

Finally, install the requirements:

pip install -r requirements.txt

Datasets

We evaluate our model on the standard benchmark datasets for distantly supervised relation extraction: NYT-10 (Riedel et al., 2010) and GDS (Jat et al., 2018).

We enhance both datasets with extra information, including compressed forms of the original relational instances (STP, SDP) and generic entity types extracted through spaCy.

[Figure: example of the STP and SDP versions of a text]

We present the 'NYT-10-enhanced' and 'GDS-enhanced' datasets.

'NYT-10-enhanced' includes the following information:

"text": Relational Instance (same as in NYT-10)
"stp": Sub-Tree path - Connects an entity pair to their least common ancestor's parent
"sdp": Sub-Dependency path - Connects an entity pair to their least common ancestor
"{h,t}_id": Head/Tail unique id (same as in NYT-10)
"{h,t}_word": Head/Tail tokens (same as in NYT-10)
"{h,t}_char_pos": Head/Tail char pos in "text" (same as in NYT-10)
"{h,t}_token_pos": Head/Tail token pos in "text"
"{h,t}_ne": Head/Tail entity types (captured with spaCy for each "text")
"relation": Freebase Relation (same as in NYT-10)
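
The "stp" and "sdp" definitions above can be illustrated with a minimal sketch. It assumes a dependency parse is already available as a list of head indices (hand-written here for a toy sentence, not produced by spaCy), and the function names are hypothetical. This is not the preprocessing code used to build the enhanced datasets, which may order or filter tokens differently.

```python
from typing import List, Tuple


def ancestors(heads: List[int], i: int) -> List[int]:
    """Return the chain i, head(i), head(head(i)), ... up to the root.

    The root is assumed to point to itself (heads[root] == root).
    """
    chain = [i]
    while heads[chain[-1]] != chain[-1]:
        chain.append(heads[chain[-1]])
    return chain


def compressed_paths(
    tokens: List[str], heads: List[int], h: int, t: int
) -> Tuple[List[str], List[str]]:
    """Return (sdp, stp) token lists for head entity h and tail entity t."""
    h_chain = ancestors(heads, h)
    t_chain = ancestors(heads, t)
    t_set = set(t_chain)
    # Least common ancestor: first node on h's chain that is also on t's chain.
    lca = next(n for n in h_chain if n in t_set)
    # SDP: both entities and every node between them and the LCA.
    keep = set(h_chain[: h_chain.index(lca) + 1])
    keep |= set(t_chain[: t_chain.index(lca) + 1])
    sdp = [tokens[i] for i in sorted(keep)]
    # STP: the same nodes plus the LCA's parent (a no-op if the LCA is the root).
    stp = [tokens[i] for i in sorted(keep | {heads[lca]})]
    return sdp, stp


# Toy sentence with a hand-written parse (illustrative only).
tokens = ["Reports", "say", "Obama", "was", "born", "in", "Hawaii"]
heads = [1, 1, 4, 4, 1, 4, 5]  # "say" is the root and points to itself
sdp, stp = compressed_paths(tokens, heads, h=2, t=6)
print("SDP:", " ".join(sdp))
print("STP:", " ".join(stp))
```

For this toy parse, the SDP keeps the tokens between the two entities and their least common ancestor ("born"), while the STP additionally keeps that ancestor's parent ("say").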

'GDS-enhanced' includes the following information:

"text": Relational Instance (same as in GDS)
"stp": Sub-Tree path - Connects an entity pair to their least common ancestor's parent
"{h_FB,t_FB}_ID": Head/Tail Freebase unique id (same as in GDS)
"{h,t}_word": Head/Tail tokens (same as in GDS)
"{h,t}_ne": Head/Tail entity types (captured with spaCy for each "text")
"relation": Relation (same as in GDS)
"relation_id": Relation id
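
As a quick sanity check after downloading, one could verify that a record carries the fields listed above. The sketch below makes two assumptions that the list does not confirm: that each record is a flat JSON object, and that the brace notation expands to keys such as "h_word" and "t_word". The helper name and all values in the sample record are placeholders, not real GDS data.

```python
import json
from typing import Dict, List

# Assumed expansion of the field list for 'GDS-enhanced'.
GDS_ENHANCED_KEYS = [
    "text", "stp",
    "h_FB_ID", "t_FB_ID",
    "h_word", "t_word",
    "h_ne", "t_ne",
    "relation", "relation_id",
]


def missing_keys(record: Dict[str, object]) -> List[str]:
    """Return the expected keys that are absent from a record."""
    return [key for key in GDS_ENHANCED_KEYS if key not in record]


# Placeholder record serialized and parsed as JSON (values are invented).
line = json.dumps({
    "text": "<sentence>", "stp": "<compressed sentence>",
    "h_FB_ID": "<head id>", "t_FB_ID": "<tail id>",
    "h_word": "<head>", "t_word": "<tail>",
    "h_ne": "<head type>", "t_ne": "<tail type>",
    "relation": "<relation>", "relation_id": 0,
})
record = json.loads(line)
print(missing_keys(record))
```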

To facilitate reproducibility of our results and encourage further research on relation extraction using compressed forms of instances and generic entity types, we provide both datasets' enhanced versions. These can be found here.

Please unzip and place 'NYT-10-enhanced' and 'GDS-enhanced' folders under /benchmark.

Training

Run the following command:

python redsandt.py --dataset <dataset> --config <path_to_config_file> --model_dir <model_dir> --model_name <model_name> --train --eval

For the NYT-10 dataset: python redsandt.py --dataset "NYT-10" --config "experiments/configs/NYT-10/REDSandT/config.json" --model_dir "REDSandT" --model_name "redsandt" --train --eval
For the GDS dataset: python redsandt.py --dataset "GDS" --config "experiments/configs/GDS/REDSandT/config.json" --model_dir "REDSandT" --model_name "redsandt_gids" --train --eval
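
To make the role of each flag explicit, here is a hypothetical argparse sketch that mirrors the command line above. It is not the parser in redsandt.py: the `choices` list, the `required` settings, and the help strings are assumptions inferred from the example commands.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-in parser with the flags shown in the commands above."""
    parser = argparse.ArgumentParser(
        description="Illustrative mirror of the redsandt.py flags"
    )
    # Assumed: only the two datasets named in this README are accepted.
    parser.add_argument("--dataset", choices=["NYT-10", "GDS"], required=True)
    parser.add_argument("--config", required=True, help="path to config.json")
    parser.add_argument("--model_dir", required=True, help="checkpoint folder")
    parser.add_argument("--model_name", required=True, help="checkpoint name")
    # Boolean switches: present means True, absent means False.
    parser.add_argument("--train", action="store_true")
    parser.add_argument("--eval", action="store_true")
    return parser


# Parse the GDS training command from above without touching sys.argv.
args = build_parser().parse_args([
    "--dataset", "GDS",
    "--config", "experiments/configs/GDS/REDSandT/config.json",
    "--model_dir", "REDSandT",
    "--model_name", "redsandt_gids",
    "--train", "--eval",
])
print(args.dataset, args.model_name, args.train, args.eval)
```

Dropping `--train` from the argument list corresponds to the evaluation-only invocation.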

Evaluation

The models we trained on 'NYT-10-enhanced' and 'GDS-enhanced' can be found here.

Please unzip and place NYT-10 and GDS folders under /experiments/ckpt.

Run the following command:

python redsandt.py --dataset <dataset> --config <path_to_config_file> --model_dir <model_dir> --model_name <model_name> --eval

For the NYT-10 dataset: python redsandt.py --dataset "NYT-10" --config "experiments/configs/NYT-10/REDSandT/config.json" --model_dir "REDSandT" --model_name "redsandt" --eval
For the GDS dataset: python redsandt.py --dataset "GDS" --config "experiments/configs/GDS/REDSandT/config.json" --model_dir "REDSandT" --model_name "redsandt_gids" --eval

Baselines

The "baselines_pr" folder gathers the precision-recall values of several state-of-the-art baselines for both NYT-10 and GDS. Download from here and unzip to use.
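
One common way to compare precision-recall values across models is the area under the precision-recall curve. The sketch below computes it with the trapezoidal rule. It assumes the values have already been loaded into two equal-length lists (the file format inside "baselines_pr" is not described here), the function name is hypothetical, and the numbers are toy values rather than results from any model.

```python
from typing import Sequence


def pr_auc(recall: Sequence[float], precision: Sequence[float]) -> float:
    """Area under a precision-recall curve using the trapezoidal rule."""
    if len(recall) != len(precision):
        raise ValueError("recall and precision must have the same length")
    # Sort points by recall so the integration runs left to right.
    points = sorted(zip(recall, precision))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Toy values for illustration only.
toy_recall = [0.0, 0.2, 0.4]
toy_precision = [1.0, 0.8, 0.6]
print(f"AUC over the toy curve: {pr_auc(toy_recall, toy_precision):.3f}")
```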

Citations

If you use our code in your research or find our repository useful, please consider citing our work.

@article{christou2021improving,
  author={Christou, Despina and Tsoumakas, Grigorios},  
  title={Improving Distantly-Supervised Relation Extraction Through BERT-Based Label and Instance Embeddings},
  journal={IEEE Access},  
  volume={9},    
  pages={62574-62582},  
  year={2021},
  publisher={IEEE}, 
  doi={10.1109/ACCESS.2021.3073428}}
