Qingyu Chen edited this page May 12, 2020 · 14 revisions

FAQs

Will you make source corpora available?

Unfortunately, we cannot redistribute the corpora because of copyright restrictions. The PubMed abstracts can be downloaded from https://www.ncbi.nlm.nih.gov/pubmed. The MIMIC-III Clinical Database can be downloaded from https://physionet.org/works/MIMICIIIClinicalDatabase/access.shtml.

How do I use the BioWordVec and BioSentVec models?

BioWordVec is distributed in the binary word2vec C format. One way to read it is with gensim. The following example is taken from the gensim website.

To use the BioWordVec vectors:

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(filename, binary=True)

To use the BioWordVec model:

Based on a recent speed test, we recommend loading the BioWordVec model with the fasttext library:

import fasttext
model = fasttext.load_model(filename)

Alternatively, you could use gensim:

# Recent gensim releases; older releases used FastText.load_fasttext_format(filename)
from gensim.models.fasttext import load_facebook_model
model = load_facebook_model(filename)

BioSentVec is built upon sent2vec. To infer sentence embeddings, please see the "Directly from Python" section of the sent2vec documentation. The following example is taken from their website:

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .") 
embs = model.embed_sentences(["first sentence .", "another sentence"])
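
Once two sentence embeddings are available, a common next step is to compare them with cosine similarity. The following is a minimal sketch in plain Python, not part of sent2vec or BioSentVec: the function name `cosine_similarity` and the toy vectors are assumptions for illustration, and real embeddings would come from `model.embed_sentence`. If an embedding comes back as a 2-D array with a single row, pass that row (for example `emb[0]`) rather than the whole array.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine similarity of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; report 0.0 instead of dividing by zero.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real sentence embeddings (hypothetical values).
emb_a = [0.2, 0.1, 0.7]
emb_b = [0.1, 0.1, 0.8]
print(cosine_similarity(emb_a, emb_b))
```

Values close to 1 indicate embeddings pointing in nearly the same direction, and values near 0 indicate unrelated directions.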

Where can I find the code to preprocess the text?

The preprocessing methods can be found in the src folder. In general, the text was first tokenized using NLTK and then lowercased.
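
The tokenize-then-lowercase step described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the src folder: the function name `preprocess` is hypothetical, the choice of NLTK's `TreebankWordTokenizer` (which needs no extra data download) is an assumption, and the regex fallback only exists so the sketch still runs where NLTK is not installed.

```python
import re

try:
    # Rule-based NLTK tokenizer (assumed choice; the src scripts may use another one).
    from nltk.tokenize import TreebankWordTokenizer
    _tokenize = TreebankWordTokenizer().tokenize
except ImportError:
    # Fallback without NLTK: split into word runs and single punctuation marks.
    _tokenize = re.compile(r"\w+|[^\w\s]").findall


def preprocess(text):
    """Tokenize the text, lowercase each token, and join tokens with spaces."""
    tokens = _tokenize(text)
    return " ".join(token.lower() for token in tokens)


print(preprocess("BRCA1 mutations increase Breast Cancer risk."))
```

Queries passed to the models should go through the same kind of preprocessing as the training text, so that tokens match the vocabulary the embeddings were trained on.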

Where can I find the code to generate the models?

The bash scripts can be found in the src folder.

How do I cite BioSentVec?

@article{chen2018biosentvec,
  title={BioSentVec: creating sentence embeddings for biomedical texts},
  author={Chen, Qingyu and Peng, Yifan and Lu, Zhiyong},
  journal={arXiv preprint arXiv:1810.09302},
  year={2018}
}