cosbi-research/beesl

Biomedical Event Extraction as Sequence Labeling (BeeSL)

BeeSL is a deep learning solution that is fast, accurate, and end-to-end. Unlike current methods, it does not require any external knowledge base or preprocessing tools, as it builds on BERT. Empirical results show that BeeSL's speed and accuracy make it a viable approach for large-scale real-world scenarios.

This repository contains the source code for Biomedical Event Extraction as Sequence Labeling (BeeSL).

For more information on ongoing work in biomedical knowledge extraction you may want to visit the COSBI knowledge extraction page or get in touch with the COSBI Bioinformatics lab. We'll be happy to help!

How does BeeSL work?

Biomedical events are structured representations that comprise multiple information units (Figure 1, above the line). We encode such an event structure into a representation in which each token (roughly, a word) is assigned the following labels, which summarize the parts of the original event structure that pertain to it (Figure 1, below the line):

  • dependent (d), or type of mention: what the token is in the event, either an event trigger, an entity, or nothing;
  • relation (r), or thematic role: the role the argument token plays in the event;
  • head (h): the head of an event is a verbal form; each token participating in an event is labeled with a reference (subscript) to the event verb type it takes part in.

Figure 1: Above the dashed line: an (italicized) text excerpt with four biomedical events. The mention types (d) shown above the text are (boxed) triggers and entities. Thematic roles (r), characterizing the event, label the edges among the relevant mentions. Below the dashed line: our proposed encoding for mention types (d), thematic roles (r) and head verbs (h). See the paper for more details.
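As a rough illustration of the encoding idea, the sketch below attaches a mention type (d) and a list of <r,h> pairs to each token. The class name, the field names, the way the head is written (here simply the event type), and the example sentence are all assumptions made for this sketch; they are not BeeSL's actual label format, which is described in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TokenLabels:
    """Labels attached to one token (illustrative names, not BeeSL's own)."""
    token: str
    d: str = "O"  # mention type: a trigger type, an entity type, or "O" for nothing
    rh: List[Tuple[str, str]] = field(default_factory=list)  # (role r, head h) pairs


# Hypothetical excerpt: "expression and phosphorylation of $PROTEIN"
sentence = [
    TokenLabels("expression", d="Gene_expression"),
    TokenLabels("and"),
    TokenLabels("phosphorylation", d="Phosphorylation"),
    TokenLabels("of"),
    # The same entity is the Theme of two different events,
    # so it carries two <r,h> pairs at once.
    TokenLabels("$PROTEIN", d="Protein",
                rh=[("Theme", "Gene_expression"), ("Theme", "Phosphorylation")]),
]


def multi_event_tokens(labels):
    """Return the tokens that take part in more than one event."""
    return [t.token for t in labels if len(t.rh) > 1]


print(multi_event_tokens(sentence))
```

The last token is the reason a single label per token is not enough: it needs one pair for each event it participates in.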

At this point we recast event extraction as a sequence labeling task in which any token may have multiple associated labels. Adopting a systems thinking approach, we design a multi-label-aware encoding strategy for jointly modeling the intermediate tasks via multi-task learning.

After encoding events as a sequence of labels, the labels for the token sequences are predicted using a neural architecture that employs BERT as encoder. Dedicated classifiers predict the individual label parts (referred to as tasks). Experimental results show that the best results are achieved by learning two tasks in a multi-task setup: a single-label classifier for the mention types (d), and a multi-label classifier for thematic roles (r) and heads (h), whose <r,h> pairs capture the participation of the same token in multiple events. The sequences are finally decoded to the original event representation (Figure 1, above the line).
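The difference between the two classifiers can be sketched in a few lines of plain Python. This is only an illustration of single-label versus multi-label decoding in general: the scores, the 0.5 threshold, and the function names are assumptions, and BeeSL's own decoding may differ.

```python
def decode_single_label(scores):
    """Pick exactly one label: the one with the highest score."""
    return max(scores, key=scores.get)


def decode_multi_label(scores, threshold=0.5):
    """Keep every label whose score reaches the threshold (possibly none)."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical per-token scores for the mention type (d) task.
d_scores = {"O": 0.05, "Protein": 0.90, "Gene_expression": 0.05}

# Hypothetical per-token scores for the <r,h> task: each label is a pair.
rh_scores = {
    ("Theme", "Gene_expression"): 0.81,
    ("Theme", "Phosphorylation"): 0.64,
    ("Cause", "Positive_regulation"): 0.07,
}

print(decode_single_label(d_scores))  # one mention type per token
print(decode_multi_label(rh_scores))  # zero or more <r,h> pairs per token
```

A token always gets one mention type, but it can keep several <r,h> pairs, which is what lets it take part in more than one event.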

Installation

It is recommended to install an environment management system (e.g., miniconda3) to avoid conflicts with other programs. After installing miniconda3, create the environment and install the requirements:

cd $BEESL_DIR                             # the folder where you put this codebase
conda create --name beesl-env python=3.7  # create a Python 3.7 env called beesl-env
conda activate beesl-env                  # activate the environment
python -m pip install -r requirements.txt # install the packages from requirements.txt

NOTE: we have tried hard, but there is no easy way to ship the installation of conda across operating systems and users, so this step has to be done manually. Advanced users may also set up their own Python installation and install the packages listed in requirements.txt.

Download the pre-trained BioBERT-Base v1.1 (+ PubMed 1M) model and run:

# Extract the model, convert it to pytorch, and clean the directory
tar xC models -f $DOWNLOAD_DIR/biobert_v1.1_pubmed.tar.gz 
pytorch_transformers bert models/biobert_v1.1_pubmed/model.ckpt-1000000 models/biobert_v1.1_pubmed/bert_config.json models/biobert_v1.1_pubmed/pytorch_model.bin
rm models/biobert_v1.1_pubmed/model.ckpt*

Download the GENIA event data with our automated script:

sh download_data.sh

Download the BeeSL model described in the paper.

curl -O https://www.cosbi.eu/fx/2354/model.tar.gz

Installing the predictive model

Place the downloaded model https://www.cosbi.eu/fx/2354/model.tar.gz in beesl/models/beesl-model/. In that folder you may later place your own trained models. The models are declared in the file config/params.json, setting the parameter pretrained_model. The provided config/params.json already references the model at that path. If you place the model somewhere else, make sure to update the configuration.
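If you keep the model elsewhere, the configuration can also be updated programmatically. The helper below is a minimal sketch with two assumptions that are not confirmed by this README: that config/params.json is plain JSON, and that pretrained_model is a top-level key. The function name is hypothetical.

```python
import json
from pathlib import Path


def set_pretrained_model(config_path, model_path):
    """Point the config at a different model archive and return the old value.

    Assumption: `pretrained_model` is a top-level key of a plain JSON file.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    previous = config.get("pretrained_model")
    config["pretrained_model"] = str(model_path)
    path.write_text(json.dumps(config, indent=4) + "\n")
    return previous


# Example call (paths are illustrative):
# set_pretrained_model("config/params.json", "models/beesl-model/model.tar.gz")
```

Check the structure of your params.json before using it, and keep a backup, since the file is rewritten in place.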

You now have everything in place and are ready to start using the system.

Usage

While this is a research product, the quality reached by the system makes it suitable to be used in real research settings for either event detection or training new models of your own.

The system was designed to be trained on data where entity mentions have been hidden. This allows the model to learn the wider linguistic construction rather than the mentions themselves and to avoid overfitting to the training data, making it better suited to general use beyond the data it was trained on. The process is called masking of the mention type (d) (e.g., writing $PROTEIN in place of G6PD). A model trained on masked data will perform event extraction best on masked data. Easy masking/unmasking commands are provided in the following examples.
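To make the idea of masking concrete, here is a small sketch that replaces entity mentions with a placeholder built from their type. It is not the logic of bioscripts/preprocess.py; the function name, the (start, end, type) span representation, and the example sentence are assumptions for illustration.

```python
def mask_mentions(text, mentions):
    """Replace each (start, end, type) character span with a $TYPE placeholder."""
    masked = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, mention_type in sorted(mentions, reverse=True):
        masked = masked[:start] + "$" + mention_type.upper() + masked[end:]
    return masked


# Hypothetical sentence with two protein mentions given as character offsets.
text = "G6PD deficiency reduces expression of TP53."
mentions = [(0, 4, "Protein"), (38, 42, "Protein")]

print(mask_mentions(text, mentions))
```

After masking, both mentions read $PROTEIN, so a model sees the surrounding construction rather than the specific protein names.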

Before starting, just ensure that your conda environment is activated:

conda activate beesl-env                    # activate the environment

Event extraction (prediction)

To detect biomedical events, run:

# conversion from BioNLP format and masking of "type" mentions
python bioscripts/preprocess.py --corpus $CORPUS_FOLDER --masking type

$CORPUS_FOLDER contains the biomedical text in the standard BioNLP standoff format, e.g., the $BEESL_DIR/data/GE11 folder you just downloaded. This command creates a subfolder named masked, containing files in the BeeSL input format that are ready for the next step:

# actual event extraction
python predict.py $PATH_TO_MODEL $BEESL_INPUT_FILE $PREDICTIONS_FILE --device $DEVICE

Where:

  • $PATH_TO_MODEL: a serialized model fine-tuned on biomedical events, for example the one provided above at https://www.cosbi.eu/fx/2354/model.tar.gz.
  • $BEESL_INPUT_FILE: a file in BeeSL format whose entities you have just masked with the previous command. For an example, see the provided $BEESL_DIR/data/GE11/masked/test.mt.1. More info on the BeeSL file format.
  • $PREDICTIONS_FILE: the file where the predicted events will be written, in BeeSL format.
  • $DEVICE: the device on which to run the inference (CPU: -1, GPU: 0, 1, ...).

The detected event parts and text portions are now masked in the $PREDICTIONS_FILE. To recover the entities, unmask them with:

# unmasking of "type" mentions
python bioscripts/preprocess.py --corpus $CORPUS_FOLDER --masking no

The unmasked BeeSL prediction file can be converted into the BioNLP standoff format with the following two lines. An output/ folder will be created in the BeeSL project with the converted files:

# Merge predicted labels
python bio-mergeBack.py $PREDICTIONS_FILE $BEESL_INPUT_FILE 2 > $PREDICTIONS_NOT_MASKED
# Convert them back to the BioNLP standoff format
python bioscripts/postprocess.py --filepath $PREDICTIONS_NOT_MASKED
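For readers who want to inspect the converted files, the sketch below reads the two most common kinds of BioNLP standoff lines: text-bound annotations (T) and events (E). It is a simplified reader written for illustration, not part of BeeSL; it ignores other line types such as modifications and equivalences, and the function name and example lines are assumptions.

```python
def parse_standoff(lines):
    """Split BioNLP standoff lines into text-bound annotations (T) and events (E)."""
    mentions, events = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # e.g. "T1<TAB>Protein 0 4<TAB>G6PD"
            mention_type, start, end = fields[1].split()
            mentions[ann_id] = (mention_type, int(start), int(end), fields[2])
        elif ann_id.startswith("E"):
            # e.g. "E1<TAB>Gene_expression:T2 Theme:T1"
            trigger, *arguments = fields[1].split()
            event_type, trigger_id = trigger.split(":")
            events[ann_id] = (event_type, trigger_id,
                              [tuple(arg.split(":")) for arg in arguments])
    return mentions, events


# Hypothetical annotations for the text "G6PD expression".
example = [
    "T1\tProtein 0 4\tG6PD",
    "T2\tGene_expression 5 15\texpression",
    "E1\tGene_expression:T2 Theme:T1",
]
mentions, events = parse_standoff(example)
print(mentions["T1"])
print(events["E1"])
```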

For example, if you want to evaluate the prediction performance on the GENIA test set (in the BioNLP standoff format), compress the results (cd $BEESL_DIR/output/ && tar -czf predictions.tar.gz *.a2) and submit predictions.tar.gz to the official GENIA online evaluation service.
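If tar is not available on your system, the same archive can be built with Python's standard library. This is a sketch of an equivalent step, not a script shipped with BeeSL; the function name is hypothetical.

```python
import tarfile
from pathlib import Path


def pack_predictions(output_dir, archive_name="predictions.tar.gz"):
    """Bundle every .a2 file in output_dir into a gzip-compressed tar archive."""
    output_dir = Path(output_dir)
    archive_path = output_dir / archive_name
    a2_files = sorted(output_dir.glob("*.a2"))
    with tarfile.open(str(archive_path), "w:gz") as archive:
        for a2_file in a2_files:
            # arcname keeps the files at the archive root, like running tar inside output/
            archive.add(str(a2_file), arcname=a2_file.name)
    return archive_path, len(a2_files)


# Example call (path is illustrative):
# pack_predictions("output")
```

The function returns the archive path and the number of files packed, which makes it easy to spot an empty output folder before submitting.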

Training a new model

To train a new model, type:

# conversion from BioNLP format and masking of "type" mentions
python bioscripts/preprocess.py --corpus $CORPUS_FOLDER --masking type

$CORPUS_FOLDER contains the biomedical text in the standard BioNLP standoff format, e.g., the $BEESL_DIR/data/GE11 folder you just downloaded. This command creates a subfolder named masked, containing files in the BeeSL input format that are ready for the next step:

# actual model training
python train.py --name $NAME --dataset_config $DATASET_CONFIG --parameters_config $PARAMETERS_CONFIG --device $DEVICE

The serialized model, trained on masked data, will be stored in beesl/logs/$NAME/$DATETIME/model.tar.gz, where $DATETIME is a folder that distinguishes multiple executions with the same $NAME. A performance report will be in beesl/logs/$NAME/$DATETIME/results.txt. To use your newly trained model to predict new data, see the installation instructions above.
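When the same $NAME has been trained several times, a small helper can locate the most recent archive. The sketch below assumes only the logs/$NAME/$DATETIME/model.tar.gz layout described above and picks the run by file modification time rather than by parsing the folder name; the function name is hypothetical.

```python
from pathlib import Path


def latest_model(logs_dir, name):
    """Return the model.tar.gz of the most recently modified run of `name`, or None."""
    candidates = list((Path(logs_dir) / name).glob("*/model.tar.gz"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Example call (paths are illustrative):
# latest_model("logs", "my-experiment")
```

The returned path can then be passed as $PATH_TO_MODEL to predict.py or referenced from the configuration.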

Reference and Contact

If you use this work in your research paper, we provide the full citation details for your reference.

@inproceedings{ramponi-etal-2020-biomedical,
    title     = "{B}iomedical {E}vent {E}xtraction as {S}equence {L}abeling",
    author    = "Ramponi, Alan and van der Goot, Rob and Lombardo, Rosario and Plank, Barbara",
    year      = "2020",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    publisher = "Association for Computational Linguistics",
    pages     = "5357--5367",
    location  = "Online",
    url       = "https://aclanthology.org/2020.emnlp-main.431/" 
}

For any information or request you may want to get in touch with the COSBI Bioinformatics lab. We'll be happy to help!
