BioConceptVec:
creating and evaluating literature-based biomedical concept embeddings on a large scale

Text corpora

We created BioConceptVec from the entire PubMed corpus. The text was split and tokenized using NLTK, and all words were lowercased.

Using PubTator to annotate concepts in PubMed

We used PubTator to annotate biomedical concepts in PubMed. The annotations cover genes, mutations, chemicals, diseases and cell lines. The trained embeddings contain over 400,000 concepts.

BioConceptVec: embeddings and concept files

We release four versions of BioConceptVec (cbow, skip-gram, glove and fastText). For each version, we provide both the full embedding (which contains concepts and other words) in binary format and the concept-only file in JSON format.

BioConceptVec cbow: embedding (2.4GB) and concept-only (798MB).
BioConceptVec skip-gram: embedding (2.4GB) and concept-only (812MB).
BioConceptVec glove: embedding (2.4GB) and concept-only (835MB).
BioConceptVec fastText: embedding (2.4GB) and concept-only (813MB).
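
As a rough illustration of working with a concept-only file, the sketch below loads a JSON file and ranks concepts by cosine similarity to a query concept. It assumes the file is a single JSON object mapping concept IDs to lists of floats; that layout, the helper names (`load_concept_vectors`, `cosine_similarity`, `most_similar`) and the toy IDs and values are illustrative assumptions, not part of the release. Check the tutorial for the actual layout.

```python
import json
import math
import os
import tempfile


def load_concept_vectors(path):
    """Load a concept-only JSON file (assumed: concept ID -> list of floats)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(vectors, query_id, top_n=5):
    """Rank every other concept by cosine similarity to query_id."""
    query = vectors[query_id]
    scored = [
        (concept_id, cosine_similarity(query, vector))
        for concept_id, vector in vectors.items()
        if concept_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy stand-in for a real concept-only file; IDs and values are made up.
toy = {
    "Concept_A": [1.0, 0.0, 0.0],
    "Concept_B": [0.9, 0.1, 0.0],
    "Concept_C": [0.0, 1.0, 0.0],
}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_concepts.json")
    with open(toy_path, "w", encoding="utf-8") as handle:
        json.dump(toy, handle)
    concept_vectors = load_concept_vectors(toy_path)

neighbors = most_similar(concept_vectors, "Concept_A", top_n=2)
print(neighbors)
```

With a real file, replace the toy path with the downloaded concept-only JSON. The pure-Python similarity loop is only meant to show the idea; for 400,000 concepts a vectorized library would be far faster.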

Tutorial

For a quick start, see the tutorial notebook (bioconcept_tutorial.ipynb), which shows how to use both the embedding and the concept-only files.

Datasets

We also make all 9 evaluation datasets publicly available. They cover 4 applications:

Drug-Gene interactions. The dataset contains (1) ID: the instance ID, (2) num_of_genes: the number of genes for this instance, (3) pos_rel_genes: the IDs of related genes, and (4) neg_rel_genes: the IDs of unrelated genes.
Gene-Gene interactions. The 5 gene-gene interaction datasets use the same format as above.
Protein-Protein interactions. There are two datasets: (1) combined: protein-protein interactions created based on STRING combined scores and (2) exp700: protein-protein interactions created based on STRING experimental scores over 700. Both datasets contain train, valid and test files. Each file contains (1) query: the query protein ID, (2) subject: the subject protein ID, (3) score: the STRING score and (4) label: whether the pair is a protein-protein interaction.
Drug-Drug interactions. This dataset comes from the SemEval-2013 Drug-Drug interaction task. Please see the details there.
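
To make the protein-protein interaction fields concrete, here is a minimal sketch of reading rows with the four fields listed above (query, subject, score, label). The on-disk format is not specified here, so the sketch assumes a tab-delimited file with a header row, a numeric score and a 0/1 label. The function name `read_ppi_rows` and the sample protein IDs are hypothetical.

```python
import csv
import io


def read_ppi_rows(handle, delimiter="\t"):
    """Parse rows with query, subject, score and label columns.

    Assumes a header row, a numeric score and a 0/1 label; adjust the
    delimiter and conversions to match the actual files.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    rows = []
    for record in reader:
        rows.append(
            {
                "query": record["query"],
                "subject": record["subject"],
                "score": float(record["score"]),
                "label": int(record["label"]),
            }
        )
    return rows


# In-memory sample with made-up protein IDs and scores, standing in for a
# train, valid or test file.
sample = io.StringIO(
    "query\tsubject\tscore\tlabel\n"
    "protein_1\tprotein_2\t850\t1\n"
    "protein_1\tprotein_3\t120\t0\n"
)
ppi_rows = read_ppi_rows(sample)
positives = [row for row in ppi_rows if row["label"] == 1]
print(len(ppi_rows), len(positives))
```

For a real file, pass an open file handle instead of the in-memory sample.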

References

When using our resources, please cite the following papers:

Chen, Q., Lee, K., Yan, S., Kim, S., Wei, C. H., & Lu, Z. (2019). BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale. To appear in PLOS Computational Biology.

Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine.
