ncbi/BioConceptVec

BioConceptVec:
creating and evaluating literature-based biomedical concept embeddings on a large scale

Text corpora

We trained BioConceptVec on the entire PubMed corpus. The text was split and tokenized with NLTK, and all words were lowercased.

Using PubTator to annotate concepts in PubMed

We used PubTator to annotate biomedical concepts in PubMed. The annotations cover genes, mutations, chemicals, diseases and cell lines. The trained embeddings contain over 400,000 concepts.

BioConceptVec: embeddings and concept files

We release four versions of BioConceptVec (cbow, skip-gram, GloVe and fastText). For each version, we provide both the full embedding (concepts and other words) in binary format and a concept-only file in JSON format.

  1. BioConceptVec cbow: embedding (2.4GB) and concept-only (798MB).
  2. BioConceptVec skip-gram: embedding (2.4GB) and concept-only (812MB).
  3. BioConceptVec glove: embedding (2.4GB) and concept-only (835MB).
  4. BioConceptVec fastText: embedding (2.4GB) and concept-only (813MB).
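
As a rough illustration of working with a concept-only file, the sketch below loads a JSON file and ranks concepts by cosine similarity to a query concept. It assumes the file is a single JSON object mapping concept IDs to lists of floats; that layout, the helper names (`load_concept_vectors`, `cosine_similarity`, `most_similar`), the concept IDs and the tiny 3-dimensional vectors are all made up for the demo, so check the tutorial for the actual format before relying on it.

```python
import json
import math
import os
import tempfile


def load_concept_vectors(path):
    """Load a concept-only file, ASSUMED to be {concept_id: [float, ...]}."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(vectors, query_id, top_n=5):
    """Rank every other concept by cosine similarity to query_id, highest first."""
    query = vectors[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in vectors.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def demo():
    # Toy data only: these IDs and vectors are hypothetical, not real BioConceptVec entries.
    toy = {
        "Gene_A": [1.0, 0.0, 0.0],
        "Gene_B": [0.9, 0.1, 0.0],
        "Disease_C": [0.0, 1.0, 0.0],
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy_concepts.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(toy, handle)
        vectors = load_concept_vectors(path)
    return most_similar(vectors, "Gene_A", top_n=2)


if __name__ == "__main__":
    for concept_id, score in demo():
        print(concept_id, round(score, 3))
```

Note that `json.load` reads the whole file into memory, so a concept-only file of roughly 800MB will likely need several gigabytes of RAM. The binary embedding files are a different format; if they follow the word2vec binary layout (an assumption here), gensim's `KeyedVectors.load_word2vec_format(path, binary=True)` would be the usual way to load them.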

Tutorial

For a quick start, see this tutorial on how to use BioConceptVec (both the embedding and the concept-only files).

Datasets

We also make all 9 evaluation datasets publicly available. They cover 4 applications:

  1. Drug-Gene interactions. The dataset contains (1) ID: the instance ID, (2) num_of_genes: the number of genes for this instance, (3) pos_rel_genes: the IDs of related genes, and (4) neg_rel_genes: the IDs of unrelated genes.

  2. Gene-Gene interactions. The 5 gene-gene interaction datasets use the same format as above.

  3. Protein-Protein interactions. There are two datasets: (1) combined: protein-protein interactions created based on STRING combined scores and (2) exp700: protein-protein interactions created based on STRING experimental scores over 700. Both datasets contain train, valid and test files. Each file contains (1) query: the query protein ID, (2) subject: the subject protein ID, (3) score: the STRING score and (4) label: whether the pair is a protein-protein interaction.

  4. Drug-Drug interactions. This dataset comes from the SemEval-2013 Drug-Drug interaction task. Please see the details there.
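
The field lists above can be modelled as small record types. The following is only a sketch: it assumes the protein-protein files are tab-separated with a header row and that `label` is stored as 0 or 1, neither of which is stated here. The class and function names, and the sample rows, are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class InteractionInstance:
    """One drug-gene or gene-gene instance, mirroring the four fields listed above."""
    instance_id: str          # "ID" in the dataset description
    num_of_genes: int
    pos_rel_genes: List[str]  # IDs of related genes
    neg_rel_genes: List[str]  # IDs of unrelated genes


@dataclass
class ProteinPair:
    """One row of a protein-protein interaction file."""
    query: str
    subject: str
    score: float
    label: int  # ASSUMED encoding: 1 = interaction, 0 = no interaction


def read_protein_pairs(handle, delimiter="\t"):
    """Parse rows with query/subject/score/label columns from an open text file."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [
        ProteinPair(
            query=row["query"],
            subject=row["subject"],
            score=float(row["score"]),
            label=int(row["label"]),
        )
        for row in reader
    ]


# Made-up rows for illustration; real files use actual protein IDs and STRING scores.
SAMPLE = "query\tsubject\tscore\tlabel\nP1\tP2\t900\t1\nP1\tP3\t150\t0\n"


def demo():
    pairs = read_protein_pairs(io.StringIO(SAMPLE))
    positives = [pair for pair in pairs if pair.label == 1]
    return pairs, positives


if __name__ == "__main__":
    all_pairs, positive_pairs = demo()
    print(len(all_pairs), "pairs,", len(positive_pairs), "labelled as interactions")
```

If the real files use a different delimiter or label encoding, only `read_protein_pairs` needs adjusting; the record types follow the field names given in the list.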

References

When using our resources, please cite the following papers:

Chen, Q., Lee, K., Yan, S., Kim, S., Wei, C. H., & Lu, Z. (2019). BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale. To appear in PLOS Computational Biology.

Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine.