
BioBERT Model Available - Trained on BioASQ for Question Answer #88

Open
trisongz opened this issue Mar 26, 2020 · 8 comments
@trisongz

Hi - I wanted to share a model that I've pretrained from scratch using BERT Large Cased and the BioASQ 7b - factoid dataset on TPU v2-8.

Original Implementation:
https://github.com/dmis-lab/biobert

Dataset can also be found on their repo.

Model Details:
loss = 0.41782737
step = 18000
max_seq_length = 384
learning_rate = 3e-6
doc_stride = 128

The model is TensorFlow-based; I haven't yet converted it to PyTorch/transformers, and I haven't evaluated it.

I'd like to continue training it on COVID-related questions, as well as additional data from BioASQ, but I haven't yet found an easy way to convert the raw BioASQ data into the format needed for training. If someone would like to do that so I can continue training the model further, please let me know.
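Roughly the conversion I have in mind, as a minimal sketch: the field names (`body`, `exact_answer`, `snippets`, `type`) follow my reading of the raw BioASQ factoid format, so treat them as assumptions, and real data would need more careful answer matching than an exact substring search.

```python
import json

def bioasq_factoid_to_squad(questions):
    """Convert raw BioASQ factoid questions to SQuAD-style training data.

    Only keeps answers that appear verbatim in a snippet, so that a
    character offset (answer_start) can be computed for extractive QA.
    """
    paragraphs = []
    for q in questions:
        if q.get("type") != "factoid":
            continue
        # For factoid questions, exact_answer is a list of synonym lists
        answers = [a for group in q.get("exact_answer", []) for a in group]
        for snippet in q.get("snippets", []):
            context = snippet["text"]
            qas = []
            for ans in answers:
                start = context.find(ans)
                if start == -1:
                    continue  # answer span not present in this snippet
                qas.append({
                    "id": q["id"],
                    "question": q["body"],
                    "answers": [{"text": ans, "answer_start": start}],
                })
            if qas:
                paragraphs.append({"context": context, "qas": qas})
    return {"version": "BioASQ",
            "data": [{"title": "BioASQ", "paragraphs": paragraphs}]}

# Tiny worked example with one hypothetical factoid question
sample = [{
    "id": "q1",
    "type": "factoid",
    "body": "Which gene is mutated in cystic fibrosis?",
    "exact_answer": [["CFTR"]],
    "snippets": [{"text": "Mutations in the CFTR gene cause cystic fibrosis."}],
}]
squad = bioasq_factoid_to_squad(sample)
print(json.dumps(squad, indent=2))
```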

You should be able to download all the files (with gsutil installed) by running:

gsutil -m cp -r gs://ce-covid-public/biobert-large-cased/* /path/to/folder/

If someone wants to run evaluation on the models and provide the metrics, I can update this.

Question - when running the backend on Docker with GPU enabled and BERT embeddings, it doesn't seem to be using the GPUs even with all the correct drivers. Is there some documentation around this?

Great job on the progress so far! I believe there's a lot of value in what's being done.

@tholor
Member

tholor commented Mar 27, 2020

Hey @trisongz,

Thanks for sharing this! Did I understand correctly that you took the text corpus from BioASQ 7b and pretrained a BERT from scratch with the MLM and NSP objectives?
I haven't worked with the BioASQ dataset yet, but it seems it could also be helpful for actual QA training (possibly even extractive QA in SQuAD style?) or at least passage ranking.

For conversion: Is it the TF format used in the original BERT repo by Google? It would be helpful if you (or someone from the community here) could convert it to PyTorch / transformers.

Question - when running the backend on Docker with GPU enabled and BERT embeddings, it doesn't seem to be using the GPUs even with all the correct drivers. Is there some documentation around this?

Yes, the current docker image supports CPU only. We could add another one for GPU that inherits from nvidia/cuda:10.1-runtime. For now, you could also run the backend without docker (via uvicorn backend.api:app)
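Until a GPU image exists, two quick workarounds (a sketch, not tested here; `--gpus all` needs Docker 19.03+ with the NVIDIA container toolkit installed, and the image/module names are taken from the comment above):

```shell
# Verify that containers can see the GPU at all
docker run --rm --gpus all nvidia/cuda:10.1-runtime nvidia-smi

# Or skip docker and run the backend directly on the host
uvicorn backend.api:app
```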

@trisongz
Author

The objective for this BERT model is extractive QA in SQuAD style, so it should be able to do question answering given the input text. I figured that would be the most useful given the context of the solution we're aiming for. I took the implementation from BioBERT and used the original BERT checkpoints from Google, since the BioBERT model had slightly fewer parameters.

So the TF format should match the original BERT Implementation.

@tholor
Member

tholor commented Mar 27, 2020

Oh great, that's even better :)

Could you convert the model to PyTorch / transformers format?
We are actually planning to index the CORD-19 dataset in Elasticsearch and use haystack to do QA via a retriever-reader approach. Your model could be a promising reader :)

We are also about to start some expert labeling sessions to gather training/eval data for QA on the CORD-19 dataset. We will share this data once it's available. Maybe you could try to evaluate your model or continue training on it?

@trisongz
Author

I actually spent a bit of time cleaning up the CORD-19 dataset and compiled it into a single jsonl file. It's pre-processed: SciBERT was used to label potential diseases mentioned, solutions, and results; any non-English papers or papers shorter than 100 words were removed; and (I think) any that were missing an abstract had one generated using Gensim's summarization library.

https://drive.google.com/open?id=1fd0QJ7soYpYQubeWUxeYmUJEK7eX_RhK

Would love to have the additional dataset to continue training. Would you also be able to add a script to convert it to the same format as https://storage.cloud.google.com/ce-covid-public/BioASQ-6b/train/Full-Abstract/BioASQ-train-factoid-6b-full-annotated.json to save some time?

I'll try to have the checkpoint converted and provide both formats for anyone to continue training since I know that's always a pain when models are only in one format/framework.
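For the TF → PyTorch step, I expect the transformers conversion CLI to work, since the checkpoint follows Google's original BERT layout (a sketch; the file names are placeholders for whichever checkpoint step you pick):

```shell
transformers-cli convert --model_type bert \
  --tf_checkpoint model.ckpt-18000 \
  --config bert_config.json \
  --pytorch_dump_output pytorch_model.bin
```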

@tholor
Member

tholor commented Mar 29, 2020

Great! Thanks for sharing your data.

Would love to have the additional dataset to continue training. Would you also be able to add a script to convert it to the same format as https://storage.cloud.google.com/ce-covid-public/BioASQ-6b/train/Full-Abstract/BioASQ-train-factoid-6b-full-annotated.json to save some time?

Looks like standard SQuAD format, right? We can definitely provide the dataset in this format. Just be aware that it might take some time until we have enough labels gathered.
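One thing worth checking when generating data in that format: every answer_start offset has to point exactly at the answer text inside its context, or extractive training silently degrades. A quick validation sketch (the nesting follows the standard SQuAD v1.1 layout; the sample values are made up):

```python
def validate_squad_offsets(squad):
    """Return ids of QA pairs whose answer_start doesn't match the answer text."""
    bad = []
    for article in squad["data"]:
        for para in article["paragraphs"]:
            ctx = para["context"]
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    start = ans["answer_start"]
                    if ctx[start:start + len(ans["text"])] != ans["text"]:
                        bad.append(qa["id"])
    return bad

# Hand-made example in SQuAD v1.1 layout (illustrative values)
sample = {"data": [{"title": "demo", "paragraphs": [{
    "context": "The virus spreads mainly through respiratory droplets.",
    "qas": [{"id": "1", "question": "How does the virus mainly spread?",
             "answers": [{"text": "respiratory droplets", "answer_start": 33}]}],
}]}]}
print(validate_squad_offsets(sample))  # an empty list means all offsets line up
```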

@ViktorAlm
Contributor

https://github.com/abachaa/MedQuAD

@tholor
Member

tholor commented Mar 30, 2020

Nice, good find @ViktorAlm. From a quick look, this dataset seems to contain many question-answer pairs, but it lacks the "context" text, since it's extracted from FAQ websites. Not sure how this could be useful for extractive QA 🤔

@ViktorAlm
Contributor

ViktorAlm commented Mar 30, 2020

I think the context is the link at the top; I'm still looking at it. I'm spending most of my time on normal work / training some Swedish models. If the spans don't match the URLs, then I guess it's not very useful. Maybe for pretraining QNLI before QA. I'm not familiar with methods for better QA. It might be useful for the annotators to see a huge list of questions, though.

edit:
Yep, it just seems to be FAQs :( I got so excited.

She does seem to have data that would be useful for Sentence-BERT and other things, though.
ex. https://github.com/abachaa/RQE_Data_AMIA2016
