[COLING22] Text-to-Text Extraction and Verbalization of Biomedical Event Graphs


Text-to-Text Extraction and Verbalization of Biomedical Event Graphs

This repository provides the source code & data of our paper: Text-to-Text Extraction and Verbalization of Biomedical Event Graphs.

In bioinformatics, events represent complex interactions mentioned in the scientific literature, involving a set of entities (e.g., proteins, genes, diseases, drugs), each contributing with a specific semantic role (e.g., theme, cause, site). Biomedical events include, for instance, molecular reactions, organism-level outcomes, and adverse drug reactions.

Text-to-event (event extraction, EE) and event-to-text (event graph verbalization, EGV) systems bridge natural language and symbolic representations. They provide a step towards decoupling concept units (what to say) from language competencies (how to say it). Almost all contributions in the event realm revolve around semantic parsing, usually employing discriminative architectures and cumbersome multi-step pipelines limited to a small number of target interaction types. Although less explored, EGV also holds great potential, targeting the generation of informative text constrained on semantic graphs, which is crucial in applications such as conversational agents and summarization systems.

We present the first lightweight framework to solve both event extraction and event verbalization with a unified text-to-text approach, allowing us to fuse all the resources designed so far for different tasks. To this end, we introduce a new event graph linearization technique and release highly comprehensive event-text paired datasets (BioT2E and BioE2T), covering more than 150 event types from multiple biology subareas (English language). By casting both parsing and generation as translation, we report baseline transformer model results on multiple biomedical text mining benchmarks and natural language generation metrics. Our extractive models surpass single-task state-of-the-art competitors, and we observe promising capabilities for the controlled generation of coherent natural language utterances from structured data.
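As a toy illustration of the text-to-text framing, an event graph can be flattened into a plain string that a seq2seq model consumes (EE) or produces (EGV). The bracket syntax and event schema below are illustrative assumptions, not the linearization format defined in the paper:

```python
# Toy sketch: linearize one biomedical event into a flat string.
# The "[Type | role = value]" syntax is an illustrative assumption,
# NOT the paper's actual linearization technique.

def linearize_event(event, entities):
    """Turn one event dict into a bracketed text fragment."""
    parts = [f"[{event['type']}", f"trigger = {event['trigger']}"]
    for role, arg_id in event["args"]:
        parts.append(f"{role.lower()} = {entities[arg_id]}")
    return " | ".join(parts) + "]"

# A minimal event: "IL-4 activates STAT6" as a Positive_regulation event.
entities = {"T1": "IL-4", "T2": "STAT6"}
event = {"type": "Positive_regulation",
         "trigger": "activates",
         "args": [("Theme", "T2"), ("Cause", "T1")]}

print(linearize_event(event, entities))
# -> [Positive_regulation | trigger = activates | theme = STAT6 | cause = IL-4]
```

The reverse direction (EGV) would map such a string back to fluent text, e.g. "IL-4 activates STAT6."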

Requirements

General

  • Python (verified on 3.8)
  • CUDA (verified on 11.1)

Python Packages

  • See docker/requirements.txt

Datasets

Original EE datasets

Our BioE2T and BioT2E datasets are derived from 10 influential benchmarks originally designed for biomedical EE (BEE) and primarily released within BioNLP-ST competitions. For your convenience, we include these freely accessible benchmarks directly within the repository: data/datasets/original_datasets.tar.gz.

| Corpus | Domain | #Documents | Annotation Schema |
| --- | --- | --- | --- |
| Genia Event Corpus (GE08) | Human blood cell transcription factors | 1,000 abstracts | 35 entity types, 35 event types |
| Genia Event 2011 (GE11) | See GE08 | 1,210 abstracts, 14 full papers | 2 entity types, 9 event types, 2 modifiers |
| Epigenetics and Post-translational Modification (EPI11) | Epigenetic change and common protein post-translational modifications | 1,200 abstracts | 2 entity types, 14 event types, 2 modifiers |
| Infectious Diseases (ID11) | Two-component regulatory systems | 30 full papers | 5 entity types, 10 event types, 2 modifiers |
| Multi-Level Event Extraction (MLEE) | Blood vessel development from the subcellular to the whole-organism level | 262 abstracts | 16 entity types, 19 event types |
| GENIA-MK | See GE08 | 1,000 abstracts | 35 entity types, 35 event types, 5 modifiers (+2 inferable) |
| Genia Event 2013 (GE13) | See GE08 | 34 full papers | 2 entity types, 13 event types, 2 modifiers |
| Cancer Genetics (CG13) | Cancer biology | 600 abstracts | 18 entity types, 40 event types, 2 modifiers |
| Pathway Curation (PC13) | Reactions, pathways, and curation | 525 abstracts | 4 entity types, 23 event types, 2 modifiers |
| Gene Regulation Ontology (GRO13) | Human gene regulation and transcription | 300 abstracts | 174 entity types, 126 event types |

BioT2E and BioE2T

We publicly release our BioT2E (data/datasets/biot2e) and BioE2T (data/datasets/bioe2t) text-to-text datasets for event extraction and event graph verbalization, respectively. For replicability, we also provide the preprocessing, filtering, and sampling scripts (notebooks/create_datasets.ipynb) used to generate them automatically, mostly from EE datasets following the .txt/.a1/.a2 or .ann standoff structure.
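For readers unfamiliar with the .a1/.a2 standoff structure mentioned above, the sketch below parses a few annotation lines in the standard BioNLP-ST layout (text-bound `T*` lines and event `E*` lines). The helper and variable names are ours, and the sketch ignores less common cases such as discontinuous spans:

```python
# Minimal sketch of reading BioNLP-ST standoff annotation lines.
# Covers only the common cases: T* (entities/triggers) and E* (events).

def parse_standoff(lines):
    """Split standoff lines into text-bound spans (T*) and events (E*)."""
    spans, events = {}, {}
    for line in lines:
        ann_id, rest = line.rstrip("\n").split("\t", 1)
        if ann_id.startswith("T"):           # "Type start end\ttext"
            meta, text = rest.split("\t")
            ann_type, start, end = meta.split()
            spans[ann_id] = (ann_type, int(start), int(end), text)
        elif ann_id.startswith("E"):         # "Type:Trigger Role1:Arg1 ..."
            head, *args = rest.split()
            etype, trigger = head.split(":")
            events[ann_id] = {"type": etype, "trigger": trigger,
                              "args": [tuple(a.split(":")) for a in args]}
    return spans, events

a_lines = ["T1\tProtein 0 4\tIL-4",
           "T2\tPositive_regulation 5 14\tactivates",
           "E1\tPositive_regulation:T2 Theme:T1"]
spans, events = parse_standoff(a_lines)
print(events["E1"]["type"])   # -> Positive_regulation
```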

Models

We trained and evaluated T5 and BART models.

  • We reimplemented T5-Base (∼220M parameters, 12 layers, 768 hidden size, 12 heads) in Flax (T5X), starting from the Google Research codebase; see https://github.com/disi-unibo-nlp/bio-ee-egv/blob/main/src/utils/t5x.
  • We built our BART-Base (∼139M parameters, 12 layers, 768 hidden size, 16 heads) model in PyTorch using the Hugging Face Transformers library.

Evaluation

  1. Generate prediction files using the following scripts.
  2. Check the evaluation notebook (./notebooks/evaluate_ee.ipynb) to run the automatic evaluation.
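As background on what the EGV scores under Checkpoints measure, the sketch below computes a bare-bones unigram-overlap ROUGE-1 F1. It is a didactic approximation; the evaluation notebook may use a different tokenizer and an established ROUGE library:

```python
# Toy ROUGE-1 F1: unigram overlap between a candidate and a reference.
# Didactic only; not the implementation used in the evaluation notebook.
from collections import Counter

def rouge1_f1(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())      # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("IL-4 activates STAT6",
                      "IL-4 activates STAT6 signaling"), 3))
# -> 0.857
```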

T5X

  • EE → python3 ./src/test_scripts/t5x/test_ee_t5.py
  • EGV → python3 ./src/test_scripts/t5x/test_egv_t5.py
  • PubMed Summ → python3 ./src/test_scripts/t5x/test_summarization_t5.py
  • Multi-task Learning (EE + EGV + PubMed Summ) → python3 ./src/test_scripts/t5x/test_mtl_t5.py

BART

  • EE → python3 ./src/test_scripts/bart/test_ee_bart.py
  • EGV → python3 ./src/test_scripts/bart/test_egv_bart.py

Checkpoints

| EE trained model | Checkpoint | Val F1 (%), AVG on 10 benchmarks |
| --- | --- | --- |
| T5 [BioT2E] | [link] | 80.25 |
| BART [BioT2E] | [link] | 73.50 |

| EGV trained model | Checkpoint | Val ROUGE-1/2/L F1 AVG (%) |
| --- | --- | --- |
| T5 [BioE2T] | [link] | 65.40 |
| BART [BioE2T] | [link] | 54.30 |

✉ Contacts

♣ = Maintainers

If you run into trouble or have suggestions or ideas, the Discussion board might already have relevant information. If not, you can post your questions there 💬🗨.

License

This project is released under the CC-BY-NC-SA 4.0 license (see LICENSE).

Citation

If you use the reported code, datasets, or models in your research, please cite:

@inproceedings{frisoni-etal-2022-text,
  title = "Text-to-Text Extraction and Verbalization of Biomedical Event Graphs",
  author = "Frisoni, Giacomo  and
    Moro, Gianluca  and
    Balzani, Lorenzo",
  booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
  month = oct,
  year = "2022",
  address = "Gyeongju, Republic of Korea",
  publisher = "International Committee on Computational Linguistics",
  url = "https://aclanthology.org/2022.coling-1.238",
  pages = "2692--2710"
}