using NLU for biobert embeddings -- takes a really long time on list of 10,000 words, and on 1 word #103

krico1 · 2022-02-24T18:54:17Z

Hi, we are working on generating BioBERT embeddings for our project. When we run it on a single word, it takes about a second. When we run it on a list of 10,000 words, it either times out or takes hours to run. Is this normal? Below is how we are using it:

def load_biobert(self):
    # Load BioBERT model (for sentence-type embeddings)
    self.logger.info("Loading BioBERT model...")
    start = time.time()
    biobert = nlu.load('en.embed_sentence.biobert.pmc_base_cased')
    end = time.time()
    self.logger.info('done (BioBERT loading time: %.2fs seconds)', end - start)
    return biobert

def get_biobert_embeddings(self, strings):
    embedding_list = []
    for string in strings:
        self.logger.debug("...Generating embedding for: %s", string)
        embedding_list.append(self.get_biobert_embedding(string))
    return embedding_list

def get_biobert_embedding(self, string):
    embedding = self.biobert.predict(string, output_level='sentence', get_embeddings=True)
    return embedding.sentence_embedding_biobert.values[0]

C-K-Loan · 2022-02-27T14:41:07Z

Hi @krico1, large embedding models like BioBERT can be quite slow because of the size of the deep learning models behind them.
You can get a roughly 10x speedup by running NLU in GPU mode.

All you need to do is set gpu=True and make sure the GPU is available to TensorFlow beforehand.
Then you can call the following to get the GPU pipe:
nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True)

See this notebook as a reference.

Also note: if you have a large dataset, it will be faster to feed NLU all the data at once instead of one string at a time.
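As a rough sketch of what feeding the data in bulk could look like: the helper below passes lists of strings to predict instead of looping one string per call. The name embed_in_batches, the batch_size parameter, and the chunking itself are illustrative assumptions, not part of NLU. The column name sentence_embedding_biobert is taken from the code in the question, and the sketch assumes predict returns one row per input string, which may not hold for inputs that are split into several sentences.

```python
def embed_in_batches(pipe, strings, batch_size=1000,
                     column="sentence_embedding_biobert"):
    """Collect embeddings by calling pipe.predict on chunks of strings.

    `pipe` is assumed to be the object returned by nlu.load(...).
    `batch_size` and `column` are hypothetical knobs for this sketch;
    set batch_size >= len(strings) to send everything in one call.
    """
    embeddings = []
    for start in range(0, len(strings), batch_size):
        chunk = list(strings[start:start + batch_size])
        # One predict call per chunk instead of one call per string.
        df = pipe.predict(chunk, output_level="sentence", get_embeddings=True)
        # Assumes one output row per input string, in input order.
        embeddings.extend(list(df[column]))
    return embeddings


# Usage (requires NLU to be installed; not executed here):
# import nlu
# pipe = nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True)
# vectors = embed_in_batches(pipe, list_of_words, batch_size=2000)
```

Chunking is only there to keep memory bounded on very large inputs. If the full list fits comfortably, a single predict call on the whole list is the simplest form of the same idea.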

MargheCap · 2022-05-02T07:59:44Z

@C-K-Loan Hi! Unfortunately, I am not able to obtain the embeddings, even when adding get_embeddings=True. I tried multiple models and additional parameters, but with no success.
In particular, nlu.load(biobert).predict("random sentence", output_level='token', get_embeddings=True) does not give the expected output. I thought the column was being dropped, so I added drop_irrelevant_cols=False, but still no success.

Thank you!

raven44099 · 2022-12-03T08:40:15Z

@C-K-Loan I have the same problem as @MargheCap . I assume it has something to do with how we install the nlu package. Could you share how you install it?

With my installation (below), I get this rather slow calculation:

And I checked the GPU visibility to Tensorflow:

import tensorflow as tf
tf.config.list_physical_devices('GPU')

gives --> [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

For installation, I used:

!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
import nlu
pipe = nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True)

I used this installation because it was proposed in this colab-sheet: https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/component_examples/sentence_embeddings/NLU_BERT_sentence_embeddings_and_t-SNE_visualization_Example.ipynb#scrollTo=rBXrqlGEYA8G
Furthermore, the quick_start_google_colab.ipynb linked here ( https://nlp.johnsnowlabs.com/docs/en/install#google-colab-notebook ) uses from sparknlp.pretrained import PretrainedPipeline, but I don't know how to load the model with it. Using pipe = PretrainedPipeline('en.embed_sentence.biobert.pmc_base_cased', gpu=True) gives an error: ...unexpected keyword argument 'gpu'

C-K-Loan added the question Further information is requested label Feb 27, 2022

C-K-Loan self-assigned this Feb 27, 2022
