Transfer-learning-for-BNER-Bioinformatics-2018

This repository contains supplementary data and links to the model and corpora used for the paper Transfer learning for biomedical named entity recognition with neural networks.

Code

Corpora pre-processing steps were collected into a single script, along with a Jupyter notebook version for ease of use. Both the script and the notebook can be found in code.

Model

The model used in this study is NeuroNER [1], a domain-independent named entity recognizer based on a bidirectional long short-term memory network coupled with a conditional random field (LSTM-CRF). The repository for the model can be found here.

NeuroNER uses standard Python config (.ini) files to specify hyperparameters. We provide three of these config files for reproducibility (see code/configs); a short sketch for inspecting them follows the list:

  1. baseline.ini: config used when training on the target data sets only (i.e., the baseline).
  2. source.ini: config used when training on the source data sets.
  3. transfer.ini: config used when transferring a model trained on a source data set for further training on a target data set.
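
The config files themselves are the authoritative record of the hyperparameters. As a minimal sketch, and assuming the files parse with Python's built-in configparser (they are standard .ini files), they can be inspected like this:

```python
# Sketch only: print every option in the provided NeuroNER config files.
# The path code/configs follows the repository layout described above;
# section and option names are taken from the files themselves.
import configparser
from pathlib import Path

for ini_path in sorted(Path("code/configs").glob("*.ini")):
    parser = configparser.ConfigParser()
    parser.read(ini_path)
    print(f"\n{ini_path.name}")
    for section in parser.sections():
        for option, value in parser[section].items():
            print(f"  [{section}] {option} = {value}")
```

Comparing baseline.ini and transfer.ini this way makes the transfer-learning-specific settings easy to spot.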

Word Embeddings

The word embeddings used in this study were obtained from here [2]. Code for converting the word vectors to the .txt format required by NeuroNER can be found in the Jupyter notebook in code, under the data cleaning section.
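
The notebook is the authoritative conversion. As a rough illustration, and assuming the downloaded vectors are in binary word2vec format, a conversion to plain text could look like the following (file names are placeholders, and gensim is used here instead of the notebook's own code):

```python
# Sketch only: rewrite binary word2vec embeddings as whitespace-separated
# text. The actual conversion used for the paper lives in the Jupyter
# notebook under code/; file names below are placeholders.
from gensim.models import KeyedVectors

# Load the pretrained vectors (assumes word2vec binary format).
vectors = KeyedVectors.load_word2vec_format("biomedical-w2v.bin", binary=True)

# Write them back out as plain text, one token and its vector per line.
vectors.save_word2vec_format("biomedical-w2v.txt", binary=False)
```

The exact text layout NeuroNER expects (e.g., whether a header line is present) should be checked against the notebook.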

Corpora

All corpora used in this study that may be re-distributed are included in the corpora folder, in Brat standoff format.

Data can be uncompressed with the following command: tar -zxvf <name_of_corpora>.
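
For convenience, the same extraction can be scripted; the sketch below assumes the archives are gzip-compressed tarballs named *.tar.gz, matching the command above:

```python
# Sketch only: extract every corpus archive under corpora/ in one go.
# Assumes gzip-compressed tarballs named *.tar.gz.
import tarfile
from pathlib import Path

for archive in Path("corpora").glob("*.tar.gz"):
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=archive.parent)
```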

Alternatively, the corpora can be publicly accessed at the following links:

| Corpus | Text Genre | Standard | Entities | Publication |
| --- | --- | --- | --- | --- |
| AZDC | Scientific Article | Gold | diseases | link |
| BioCreative II GM | Scientific Article | Gold | genes/proteins | link |
| BioInfer | Scientific Article | Gold | genes/proteins | link |
| BioSemantics | Patent | Gold | chemicals, diseases | link |
| CALBC-III-Small | Scientific Article | Silver | chemicals, diseases, species, genes/proteins | link |
| CDR | Scientific Article | Gold | chemicals, diseases | link |
| CellFinder | Scientific Article | Gold | species, genes/proteins, cells, anatomy | link |
| CHEMDNER Patent | Patent | Gold | chemicals | link |
| DECA | Scientific Article | Gold | genes/proteins | link |
| FSU-PRGE | Scientific Article | Gold | genes/proteins | link |
| Linneaus | Scientific Article | Gold | species | link |
| LocText | Scientific Article | Gold | species, genes/proteins | link |
| IEPA | Scientific Article | Gold | genes/proteins | link |
| miRNA | Scientific Article | Gold | diseases, species, genes/proteins | link |
| NCBI disease | Scientific Article | Gold | diseases | link |
| S800 | Scientific Article | Gold | species | link |
| Variome | Scientific Article | Gold | diseases, species, genes/proteins | link |

Many of these corpora can also be accessed and visualized in the browser here [3].
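
Since the corpora above are distributed in Brat standoff format, entity annotations live in .ann files alongside the .txt documents. A minimal sketch of reading the entity ("T") lines is shown below; the file path is a placeholder, and discontinuous spans are collapsed to their overall extent:

```python
# Sketch only: read entity annotations from a Brat standoff .ann file.
# Entity lines start with "T" and hold an ID, a label with character
# offsets, and the covered text. The file path below is a placeholder.
from pathlib import Path

def read_entities(ann_path):
    entities = []
    for line in Path(ann_path).read_text(encoding="utf-8").splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes, notes, etc.
        entity_id, annotation, text = line.split("\t", 2)
        parts = annotation.split()
        label, start, end = parts[0], parts[1], parts[-1]  # overall span
        entities.append((entity_id, label, int(start), int(end), text))
    return entities

# Example (placeholder path): list the first few entities of one document.
for entity in read_entities("path/to/document.ann")[:5]:
    print(entity)
```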

Supplementary Information

The supplementary data can be found in the file supplementary/additional_file_1.pdf. Additionally, blacklists used for the silver-standard corpora (SSCs) can be found in supplementary/blacklists.

Citations

  1. Dernoncourt, F., Lee, J. Y., & Szolovits, P. (2017). NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. arXiv preprint arXiv:1705.05487.
  2. Moen, S. P. F. G. H., & Ananiadou, T. S. S. (2013). Distributional semantics resources for biomedical text processing. In Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan (pp. 39-43).
  3. Stenetorp, P., Topić, G., Pyysalo, S., Ohta, T., Kim, J. D., & Tsujii, J. I. (2011, June). BioNLP shared task 2011: Supporting resources. In Proceedings of the BioNLP Shared Task 2011 Workshop (pp. 112-120). Association for Computational Linguistics.
