[COLING22] Text-to-Text Extraction and Verbalization of Biomedical Event Graphs


Text-to-Text Extraction and Verbalization of Biomedical Event Graphs

This repository provides the source code & data of our paper: Text-to-Text Extraction and Verbalization of Biomedical Event Graphs.

In bioinformatics, events represent complex interactions mentioned in the scientific literature, involving a set of entities (e.g., proteins, genes, diseases, drugs), each contributing with a specific semantic role (e.g., theme, cause, site). Biomedical events include, for instance, molecular reactions, organism-level outcomes, and adverse drug reactions.

Text-to-event (event extraction, EE) and event-to-text (event graph verbalization, EGV) systems bridge natural language and symbolic representations. They provide a step towards decoupling concept units (what to say) from language competencies (how to say it). Almost all contributions in the event realm revolve around semantic parsing, usually employing discriminative architectures and cumbersome multi-step pipelines limited to a small number of target interaction types. Although less explored, EGV also holds great potential, targeting the generation of informative text constrained on semantic graphs, which is crucial in applications such as conversational agents and summarization systems.

We present the first lightweight framework to solve both event extraction and event verbalization with a unified text-to-text approach, allowing us to fuse all the resources designed so far for different tasks. To this end, we introduce a new event graph linearization technique and release highly comprehensive event-text paired datasets (BioT2E and BioE2T), covering more than 150 event types from multiple biology subareas (English language). By casting both parsing and generation as translation, we report baseline transformer model results on multiple biomedical text mining benchmarks and natural language generation metrics. Our extractive models surpass single-task state-of-the-art competitors, and we observe promising capabilities for the controlled generation of coherent natural language utterances from structured data.
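As a toy illustration of the text-to-text framing, an event graph can be flattened into a plain string that a seq2seq model consumes (EE) or produces (EGV). The bracket syntax and event schema below are illustrative assumptions, not the linearization format defined in the paper:

```python
# Toy sketch: linearize one biomedical event into a flat string.
# The "[Type | role = value]" syntax is an illustrative assumption,
# NOT the paper's actual linearization technique.

def linearize_event(event, entities):
    """Turn one event dict into a bracketed text fragment."""
    parts = [f"[{event['type']}", f"trigger = {event['trigger']}"]
    for role, arg_id in event["args"]:
        parts.append(f"{role.lower()} = {entities[arg_id]}")
    return " | ".join(parts) + "]"

# A minimal event: "IL-4 activates STAT6" as a Positive_regulation event.
entities = {"T1": "IL-4", "T2": "STAT6"}
event = {"type": "Positive_regulation",
         "trigger": "activates",
         "args": [("Theme", "T2"), ("Cause", "T1")]}

print(linearize_event(event, entities))
# -> [Positive_regulation | trigger = activates | theme = STAT6 | cause = IL-4]
```

The reverse direction (EGV) would map such a string back to fluent text, e.g. "IL-4 activates STAT6."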

Requirements

General

  • Python (verified on 3.8)
  • CUDA (verified on 11.1)

Python Packages

  • See docker/requirements.txt

Datasets

Original EE datasets

Our BioE2T and BioT2E datasets are derived from 10 influential benchmarks originally designed for biomedical EE (BEE) and primarily released within BioNLP-ST competitions. For your convenience, we include these freely accessible benchmarks directly within the repository: data/datasets/original_datasets.tar.gz.

| Corpus | Domain | #Documents | Annotation Schema |
| --- | --- | --- | --- |
| Genia Event Corpus (GE08) | Human blood cell transcription factors | 1,000 abstracts | 35 entity types, 35 event types |
| Genia Event 2011 (GE11) | See GE08 | 1,210 abstracts, 14 full papers | 2 entity types, 9 event types, 2 modifiers |
| Epigenetics and Post-translational Modification (EPI11) | Epigenetic change and common protein post-translational modifications | 1,200 abstracts | 2 entity types, 14 event types, 2 modifiers |
| Infectious Diseases (ID11) | Two-component regulatory systems | 30 full papers | 5 entity types, 10 event types, 2 modifiers |
| Multi-Level Event Extraction (MLEE) | Blood vessel development from the subcellular to the whole-organism level | 262 abstracts | 16 entity types, 19 event types |
| GENIA-MK | See GE08 | 1,000 abstracts | 35 entity types, 35 event types, 5 modifiers (+2 inferable) |
| Genia Event 2013 (GE13) | See GE08 | 34 full papers | 2 entity types, 13 event types, 2 modifiers |
| Cancer Genetics (CG13) | Cancer biology | 600 abstracts | 18 entity types, 40 event types, 2 modifiers |
| Pathway Curation (PC13) | Reactions, pathways, and curation | 525 abstracts | 4 entity types, 23 event types, 2 modifiers |
| Gene Regulation Ontology (GRO13) | Human gene regulation and transcription | 300 abstracts | 174 entity types, 126 event types |

BioT2E and BioE2T

We publicly release our BioT2E (data/datasets/biot2e) and BioE2T (data/datasets/bioe2t) text-to-text datasets for event extraction and event graph verbalization, respectively. For replicability, we also provide the preprocessing, filtering, and sampling scripts (notebooks/create_datasets.ipynb) used to generate them automatically, mostly from EE datasets following the .txt/.a1/.a2 or .ann standoff structure.
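For readers unfamiliar with the .a1/.a2 standoff structure mentioned above, the sketch below parses a few annotation lines in the standard BioNLP-ST layout (text-bound `T*` lines and event `E*` lines). The helper and variable names are ours, and the sketch ignores less common cases such as discontinuous spans:

```python
# Minimal sketch of reading BioNLP-ST standoff annotation lines.
# Covers only the common cases: T* (entities/triggers) and E* (events).

def parse_standoff(lines):
    """Split standoff lines into text-bound spans (T*) and events (E*)."""
    spans, events = {}, {}
    for line in lines:
        ann_id, rest = line.rstrip("\n").split("\t", 1)
        if ann_id.startswith("T"):           # "Type start end\ttext"
            meta, text = rest.split("\t")
            ann_type, start, end = meta.split()
            spans[ann_id] = (ann_type, int(start), int(end), text)
        elif ann_id.startswith("E"):         # "Type:Trigger Role1:Arg1 ..."
            head, *args = rest.split()
            etype, trigger = head.split(":")
            events[ann_id] = {"type": etype, "trigger": trigger,
                              "args": [tuple(a.split(":")) for a in args]}
    return spans, events

a_lines = ["T1\tProtein 0 4\tIL-4",
           "T2\tPositive_regulation 5 14\tactivates",
           "E1\tPositive_regulation:T2 Theme:T1"]
spans, events = parse_standoff(a_lines)
print(events["E1"]["type"])   # -> Positive_regulation
```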

Models

We trained and evaluated T5 and BART models.

  • We reimplemented T5-Base (∼220M parameters, 12 layers, 768 hidden size, 12 heads) in Flax (T5X), starting from the Google Research codebase; see https://github.com/disi-unibo-nlp/bio-ee-egv/blob/main/src/utils/t5x.
  • We built our BART-Base (∼139M parameters, 12 layers, 768 hidden size, 16 heads) model in PyTorch using the Hugging Face Transformers library.

Evaluation

  1. Generate prediction files using the following scripts.
  2. Check the evaluation notebook (./notebooks/evaluate_ee.ipynb) to run the automatic evaluation.
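As background on what the EGV scores under Checkpoints measure, the sketch below computes a bare-bones unigram-overlap ROUGE-1 F1. It is a didactic approximation; the evaluation notebook may use a different tokenizer and an established ROUGE library:

```python
# Toy ROUGE-1 F1: unigram overlap between a candidate and a reference.
# Didactic only; not the implementation used in the evaluation notebook.
from collections import Counter

def rouge1_f1(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())      # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("IL-4 activates STAT6",
                      "IL-4 activates STAT6 signaling"), 3))
# -> 0.857
```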

T5X

  • EE → python3 ./src/test_scripts/t5x/test_ee_t5.py
  • EGV → python3 ./src/test_scripts/t5x/test_egv_t5.py
  • PubMed Summ → python3 ./src/test_scripts/t5x/test_summarization_t5.py
  • Multi-task Learning (EE + EGV + PubMed Summ) → python3 ./src/test_scripts/t5x/test_mtl_t5.py

BART

  • EE → python3 ./src/test_scripts/bart/test_ee_bart.py
  • EGV → python3 ./src/test_scripts/bart/test_egv_bart.py

Checkpoints

| EE trained model | Checkpoint | Val F1 (%), AVG on 10 benchmarks |
| --- | --- | --- |
| T5 [BioT2E] | [link] | 80.25 |
| BART [BioT2E] | [link] | 73.50 |

| EGV trained model | Checkpoint | Val ROUGE-1/2/L F1 AVG (%) |
| --- | --- | --- |
| T5 [BioE2T] | [link] | 65.40 |
| BART [BioE2T] | [link] | 54.30 |

✉ Contacts

♣ = Maintainers

If you run into trouble or have suggestions or ideas, the Discussion board might already have relevant information. If not, you can post your questions there 💬🗨.

License

This project is released under the CC-BY-NC-SA 4.0 license (see LICENSE).

Citation

If you use the reported code, datasets, or models in your research, please cite:

@inproceedings{frisoni-etal-2022-text,
  title = "Text-to-Text Extraction and Verbalization of Biomedical Event Graphs",
  author = "Frisoni, Giacomo  and
    Moro, Gianluca  and
    Balzani, Lorenzo",
  booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
  month = oct,
  year = "2022",
  address = "Gyeongju, Republic of Korea",
  publisher = "International Committee on Computational Linguistics",
  url = "https://aclanthology.org/2022.coling-1.238",
  pages = "2692--2710"
}