using NLU for biobert embeddings -- takes a really long time on list of 10,000 words, and on 1 word #103

Open
krico1 opened this issue Feb 24, 2022 · 3 comments

krico1 commented Feb 24, 2022

Hi, we are generating BioBERT embeddings for our project. Running it on a single word takes about a second. Running it on a list of 10,000 words either times out or takes hours. Is this normal? Below is how we are using it:

import time

import nlu

def load_biobert(self):
    # Load BioBERT model (for sentence-type embeddings)
    self.logger.info("Loading BioBERT model...")
    start = time.time()
    biobert = nlu.load('en.embed_sentence.biobert.pmc_base_cased')
    end = time.time()
    self.logger.info('done (BioBERT loading time: %.2f seconds)', end - start)
    return biobert

def get_biobert_embeddings(self, strings):
    # Embeds the strings one at a time: one predict() call per string
    embedding_list = []
    for string in strings:
        self.logger.debug("...Generating embedding for: %s", string)
        embedding_list.append(self.get_biobert_embedding(string))
    return embedding_list

def get_biobert_embedding(self, string):
    # Returns the sentence embedding of a single input string
    embedding = self.biobert.predict(string, output_level='sentence', get_embeddings=True)
    return embedding.sentence_embedding_biobert.values[0]

@C-K-Loan
Member

Hi @krico1, large embedding models like BioBERT can be quite slow because of the size of the underlying deep learning model.
You can get a roughly 10x speedup by using NLU in GPU mode.

All you need to do is set gpu=True and make sure the GPU is available to TensorFlow beforehand.
Then call the following to get the GPU pipeline:
nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True)
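
A minimal sketch of the "make sure the GPU is available" step, assuming TensorFlow is installed in the same Python environment. The helper names gpu_visible and load_biobert_auto are hypothetical and not part of NLU. The check only reports what the Python TensorFlow package sees, so treat it as a proxy for what the NLU backend will actually use, not as a guarantee.

```python
def gpu_visible():
    """Return True if the Python TensorFlow package reports at least one GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        # Without TensorFlow in this environment we cannot check, so report False.
        return False
    return len(tf.config.list_physical_devices('GPU')) > 0


def load_biobert_auto(use_gpu=None):
    """Load the BioBERT sentence-embedding pipeline, using GPU mode if one is visible.

    Hypothetical convenience wrapper around nlu.load(..., gpu=...).
    """
    import nlu  # imported lazily so gpu_visible() can be used on its own

    if use_gpu is None:
        use_gpu = gpu_visible()
    return nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=use_gpu)
```

Calling load_biobert_auto() falls back to CPU mode when no GPU is reported, and load_biobert_auto(use_gpu=True) forces GPU mode.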

See this notebook as a reference.

Also note: If you have a large dataset at hand, it will be faster to feed NLU all the data at once instead of one by one
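
As a rough sketch of "all the data at once": the per-string loop from the original post can be replaced by a single predict() call on the whole list. The function name embed_batch is hypothetical. The column name sentence_embedding_biobert is taken from the original post, and the sketch assumes each input string yields exactly one output row, which should hold for single words but may not hold for multi-sentence inputs.

```python
def embed_batch(pipe, strings, column='sentence_embedding_biobert'):
    """Embed all strings with one predict() call instead of one call per string.

    `pipe` is the object returned by nlu.load(...). The result is assumed to be
    a table (for example a pandas DataFrame) with one embedding per row in `column`.
    """
    texts = list(strings)
    if not texts:
        # Nothing to embed, so skip the pipeline call entirely.
        return []
    result = pipe.predict(texts, output_level='sentence', get_embeddings=True)
    return list(result[column])
```

Usage would look like embeddings = embed_batch(biobert, words), where biobert is the pipeline loaded once up front. The pipeline overhead is then paid once for the whole list rather than 10,000 times.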

@C-K-Loan C-K-Loan added the question Further information is requested label Feb 27, 2022
@C-K-Loan C-K-Loan self-assigned this Feb 27, 2022
@MargheCap

@C-K-Loan Hi! Unfortunately, I am not able to obtain the embeddings, even when adding get_embeddings=True. I tried multiple models and other parameters, but with no success.
In particular, nlu.load(biobert).predict("random sentence", output_level='token', get_embeddings=True) does not give the expected output. I thought the column was being dropped, so I added drop_irrelevant_cols=False, but that did not help either.

thank you!

raven44099 commented Dec 3, 2022

@C-K-Loan I have the same problem as @MargheCap . I assume it has something to do with how we install the nlu package. Could you share how you install it?

With my installation (below), the calculation is rather slow (timing screenshot omitted).

I also checked that the GPU is visible to TensorFlow:

import tensorflow as tf
tf.config.list_physical_devices('GPU')

gives --> [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

For installation, I used:

!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
import nlu
pipe = nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True) 

I used this installation because it was proposed in this Colab notebook: https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/component_examples/sentence_embeddings/NLU_BERT_sentence_embeddings_and_t-SNE_visualization_Example.ipynb#scrollTo=rBXrqlGEYA8G
Furthermore, the quick_start_google_colab.ipynb linked here ( https://nlp.johnsnowlabs.com/docs/en/install#google-colab-notebook ) uses from sparknlp.pretrained import PretrainedPipeline, but I don't know how to load the model with it. Using pipe = PretrainedPipeline('en.embed_sentence.biobert.pmc_base_cased', gpu=True) gives an error: ...unexpected keyword argument 'gpu'
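
For context on that error, here is a hedged sketch of how the same embeddings might be produced with Spark NLP directly. In Spark NLP, GPU support is requested when the session is started with sparknlp.start(gpu=True), not per pipeline. The string 'en.embed_sentence.biobert.pmc_base_cased' is used with nlu.load in this thread and appears to be an NLU reference, not a Spark NLP pipeline name. The model name sent_biobert_pmc_base_cased and the function name build_biobert_sentence_pipeline are assumptions for illustration; check the Models Hub for the exact model name.

```python
def build_biobert_sentence_pipeline(model_name="sent_biobert_pmc_base_cased"):
    """Build a Spark NLP pipeline producing BioBERT sentence embeddings.

    `model_name` is an ASSUMED Spark NLP model name, not confirmed in this thread.
    """
    # Imports live inside the function so the sketch can be read without Spark installed.
    import sparknlp
    from pyspark.ml import Pipeline
    from sparknlp.annotator import BertSentenceEmbeddings
    from sparknlp.base import DocumentAssembler

    # GPU mode is chosen for the whole Spark session at start-up.
    spark = sparknlp.start(gpu=True)

    # Wrap the raw "text" column into Spark NLP document annotations.
    document = DocumentAssembler().setInputCol("text").setOutputCol("document")

    # Download the pretrained model and embed each document.
    embeddings = (
        BertSentenceEmbeddings.pretrained(model_name, "en")
        .setInputCols(["document"])
        .setOutputCol("sentence_embeddings")
    )
    return spark, Pipeline(stages=[document, embeddings])
```

With spark, pipeline = build_biobert_sentence_pipeline(), a DataFrame with a "text" column can be embedded via pipeline.fit(df).transform(df).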
