BioNER: Named Entity Recognition in the Biomedical Domain
This repository contains the code for BioNER, an LSTM-based model designed for biomedical named entity recognition (NER).

Download

We provide models trained on the following datasets:

| Dataset | Mirror (Siasky) | Mirror (Mega) |
| --- | --- | --- |
| MedMentions full | Download Model | Download Model |
| MedMentions ST21pv | Download Model | Download Model |
| JNLPBA | Download Model | Download Model |

In addition, the word embeddings trained with fastText on PubMed Baseline 2021 are provided for the following n-gram ranges:

| n-gram range | Mirror (Siasky) | Mirror (Mega) | Mirror (Storj) |
| --- | --- | --- | --- |
| 3-4 | Download | Download | Download |
| 3-6 | Download | Download | Download |

Installation

Install the dependencies.

pip install -r requirements.txt

Because deterministic behaviour is enabled by default, you may need to set the CUBLAS_WORKSPACE_CONFIG environment variable to prevent RuntimeErrors when running on CUDA.

export CUBLAS_WORKSPACE_CONFIG=:4096:8
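If you prefer to set the variable from Python instead of the shell, a minimal sketch (the `torch` line is only a hedged illustration of where the variable matters; BioNER enables determinism itself):

```python
import os

# cuBLAS reads this variable at initialisation, so set it before the
# first CUDA call; setdefault respects a value already exported in the shell.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

# With the variable in place, PyTorch's deterministic mode no longer raises
# a RuntimeError for cuBLAS operations (illustrative, not called here):
# torch.use_deterministic_algorithms(True)
```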

Usage

Dataset Preprocessing

BioNER expects a dataset in the CoNLL-2003 format. We used the tool bconv for preprocessing the MedMentions dataset.
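For orientation, a CoNLL-2003-style file holds one token per line with whitespace-separated columns, the entity tag in IOB notation as the last column, and blank lines separating sentences. An illustrative two-column excerpt (tokens and tags invented for the example):

```
Selegiline	B-Chemical
induced	O
postural	B-Disease
hypotension	I-Disease
.	O
```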

Training

You can train the BioNER model either with the provided Makefile or by executing train_bioner.py directly. If you use the Makefile, fill in its empty fields before the first run.

make train-bioner ngrams=3-4

Annotation

You can annotate a CoNLL-2003 dataset in the following way:

# --embeddings: path to the word embeddings file
# --dataset:    path to the CoNLL-2003 dataset
# --outputFile: path to the output file for storing the annotated dataset
# --model:      path to the trained BioNER model
python annotate_dataset.py \
    --embeddings <embeddings> \
    --dataset <dataset> \
    --outputFile <output-file> \
    --model <model>

Furthermore, you can add the --enableExportCoNLL flag to export an additional file in the same parent folder as the outputFile, which can be used for evaluation with the original conlleval.pl Perl script.
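As a quick way to inspect the annotated output, here is a hedged Python sketch that parses a CoNLL-style file into sentences of (token, tag) pairs. It assumes whitespace-separated columns with the tag in the last column, which matches the usual CoNLL-2003 layout but has not been checked against BioNER's exact output:

```python
from typing import List, Tuple


def read_conll(path: str) -> List[List[Tuple[str, str]]]:
    """Parse a CoNLL-style file into sentences of (token, tag) pairs.

    Assumes one token per line, whitespace-separated columns with the
    entity tag last, and blank lines between sentences.
    """
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:  # blank line marks a sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            cols = line.split()
            current.append((cols[0], cols[-1]))
    if current:  # flush a trailing sentence without a final blank line
        sentences.append(current)
    return sentences
```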