
BioBERT Model Available - Trained on BioASQ for Question Answer #88

Open
trisongz opened this issue Mar 26, 2020 · 8 comments
@trisongz

Hi - I wanted to share a model that I've pretrained from scratch using BERT Large Cased and the BioASQ 7b - factoid dataset on TPU v2-8.

Original Implementation:
https://github.com/dmis-lab/biobert

Dataset can also be found on their repo.

Model Details:
loss = 0.41782737
step = 18000
max_seq_length = 384
learning_rate = 3e-6
doc_stride = 128

The model is TensorFlow-based; I haven't yet converted it to PyTorch/transformers, and I haven't evaluated it.

I'd like to continue training it on COVID-related questions, as well as additional data from BioASQ, but I haven't yet found an easy way to convert the raw BioASQ data into the format needed for training. If someone would like to do that so I can continue training the model further, please let me know.
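Roughly the conversion I have in mind, as a minimal sketch: the field names (`body`, `exact_answer`, `snippets`, `type`) follow my reading of the raw BioASQ factoid format, so treat them as assumptions, and real data would need more careful answer matching than an exact substring search.

```python
import json

def bioasq_factoid_to_squad(questions):
    """Convert raw BioASQ factoid questions to SQuAD-style training data.

    Only keeps answers that appear verbatim in a snippet, so that a
    character offset (answer_start) can be computed for extractive QA.
    """
    paragraphs = []
    for q in questions:
        if q.get("type") != "factoid":
            continue
        # For factoid questions, exact_answer is a list of synonym lists
        answers = [a for group in q.get("exact_answer", []) for a in group]
        for snippet in q.get("snippets", []):
            context = snippet["text"]
            qas = []
            for ans in answers:
                start = context.find(ans)
                if start == -1:
                    continue  # answer span not present in this snippet
                qas.append({
                    "id": q["id"],
                    "question": q["body"],
                    "answers": [{"text": ans, "answer_start": start}],
                })
            if qas:
                paragraphs.append({"context": context, "qas": qas})
    return {"version": "BioASQ",
            "data": [{"title": "BioASQ", "paragraphs": paragraphs}]}

# Tiny worked example with one hypothetical factoid question
sample = [{
    "id": "q1",
    "type": "factoid",
    "body": "Which gene is mutated in cystic fibrosis?",
    "exact_answer": [["CFTR"]],
    "snippets": [{"text": "Mutations in the CFTR gene cause cystic fibrosis."}],
}]
squad = bioasq_factoid_to_squad(sample)
print(json.dumps(squad, indent=2))
```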

You should be able to download all the files (with gsutil installed) by running:

gsutil -m cp -r gs://ce-covid-public/biobert-large-cased/* /path/to/folder/

If someone wants to run evaluation on the models and provide the metrics, I can update this.

Question - when running the backend on Docker with GPU enabled and BERT embeddings, it doesn't seem to be using the GPUs even with all the correct drivers. Is there some documentation around this?

Great job on the progress so far! I believe there's a lot of value in what's being done.

@tholor
Member

tholor commented Mar 27, 2020

Hey @trisongz,

Thanks for sharing this! Did I understand correctly that you took the text corpus from BioASQ 7b and pretrained a BERT from scratch with the MLM and NSP objectives?
I haven't worked with the BioASQ dataset yet, but it seems it could also be helpful for actual QA training (possibly even extractive QA in SQuAD style?) or at least passage ranking.

For conversion: Is it the TF format used in the original BERT repo by Google? It would be helpful if you (or someone from the community here) could convert it to PyTorch / transformers.

Question - when running the backend on Docker with GPU enabled and BERT embeddings, it doesn't seem to be using the GPUs even with all the correct drivers. Is there some documentation around this?

Yes, the current docker image supports CPU only. We could add another one for GPU that inherits from nvidia/cuda:10.1-runtime. For now, you could also run the backend without docker (via uvicorn backend.api:app)
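Until a GPU image exists, two quick workarounds (a sketch, not tested here; `--gpus all` needs Docker 19.03+ with the NVIDIA container toolkit installed, and the image/module names are taken from the comment above):

```shell
# Verify that containers can see the GPU at all
docker run --rm --gpus all nvidia/cuda:10.1-runtime nvidia-smi

# Or skip docker and run the backend directly on the host
uvicorn backend.api:app
```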

@trisongz
Author

The objective for this BERT model is extractive QA in SQuAD style, so it should be able to do question answering given the input text. I figured that would be the most useful given the context of the solution we're aiming for. I took the implementation from BioBERT and used the original BERT checkpoints from Google, since the BioBERT model had slightly fewer parameters.

So the TF format should match the original BERT Implementation.

@tholor
Member

tholor commented Mar 27, 2020

Oh great, that's even better :)

Could you convert the model to PyTorch / transformers format?
We are actually planning to index the CORD-19 dataset in Elasticsearch and use haystack to do QA via a retriever-reader approach. Your model could be a promising reader :)

We are also about to start some expert labeling sessions to gather training/eval data for QA on the CORD-19 dataset. We will share this data once it's available. Maybe you could try to evaluate your model or continue training on it?

@trisongz
Author

I actually spent a bit of time cleaning up the CORD-19 dataset and compiled it into a single jsonl file. It's pre-processed: SciBERT was used to label potential diseases mentioned, solutions, and results; any non-English papers or papers shorter than 100 words were removed; and (I think) any that were missing an abstract had one generated using Gensim's summarization library.

https://drive.google.com/open?id=1fd0QJ7soYpYQubeWUxeYmUJEK7eX_RhK

Would love to have the additional dataset to continue training. Would you also be able to add a script to convert it to the same format as https://storage.cloud.google.com/ce-covid-public/BioASQ-6b/train/Full-Abstract/BioASQ-train-factoid-6b-full-annotated.json to save some time?

I'll try to have the checkpoint converted and provide both formats for anyone to continue training since I know that's always a pain when models are only in one format/framework.
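For the TF → PyTorch step, I expect the transformers conversion CLI to work, since the checkpoint follows Google's original BERT layout (a sketch; the file names are placeholders for whichever checkpoint step you pick):

```shell
transformers-cli convert --model_type bert \
  --tf_checkpoint model.ckpt-18000 \
  --config bert_config.json \
  --pytorch_dump_output pytorch_model.bin
```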

@tholor
Member

tholor commented Mar 29, 2020

Great! Thanks for sharing your data.

Would love to have the additional dataset to continue training. Would you also be able to add a script to convert it to the same format as https://storage.cloud.google.com/ce-covid-public/BioASQ-6b/train/Full-Abstract/BioASQ-train-factoid-6b-full-annotated.json to save some time?

Looks like standard SQuAD format, right? We can definitely provide the dataset in this format. Just be aware that it might take some time until we have enough labels gathered.
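One thing worth checking when generating data in that format: every answer_start offset has to point exactly at the answer text inside its context, or extractive training silently degrades. A quick validation sketch (the nesting follows the standard SQuAD v1.1 layout; the sample values are made up):

```python
def validate_squad_offsets(squad):
    """Return ids of QA pairs whose answer_start doesn't match the answer text."""
    bad = []
    for article in squad["data"]:
        for para in article["paragraphs"]:
            ctx = para["context"]
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    start = ans["answer_start"]
                    if ctx[start:start + len(ans["text"])] != ans["text"]:
                        bad.append(qa["id"])
    return bad

# Hand-made example in SQuAD v1.1 layout (illustrative values)
sample = {"data": [{"title": "demo", "paragraphs": [{
    "context": "The virus spreads mainly through respiratory droplets.",
    "qas": [{"id": "1", "question": "How does the virus mainly spread?",
             "answers": [{"text": "respiratory droplets", "answer_start": 33}]}],
}]}]}
print(validate_squad_offsets(sample))  # an empty list means all offsets line up
```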

@ViktorAlm
Contributor

https://github.com/abachaa/MedQuAD

@tholor
Member

tholor commented Mar 30, 2020

Nice, good find @ViktorAlm. From a quick look, this dataset seems to contain many question-answer pairs, but it lacks the "context" text, since it's extracted from FAQ websites. Not sure how this could be useful for extractive QA 🤔

@ViktorAlm
Contributor

ViktorAlm commented Mar 30, 2020

I think the context is the link at the top; I'm still looking at it. I'm spending most of my time on normal work / training some Swedish models. If the spans don't match the URLs, then I guess it's not very useful. Maybe for pretraining QNLI before QA. I'm not familiar with methods for better QA. It might be useful for the annotators to see a huge list of questions, though.

edit:
Yep, it just seems to be FAQs :( I got so excited.

She does seem to have data that would be useful for Sentence-BERT and other things, though.
ex. https://github.com/abachaa/RQE_Data_AMIA2016
