
Need help understanding "Train Your Own QA Models" Tutorial #14

ronykalfarisi opened this issue Aug 14, 2019 · 8 comments
ronykalfarisi commented Aug 14, 2019

Hi all,
First of all, thank you so much for releasing such brilliant work. I need your help understanding the tutorial Jupyter notebook for training our own QA model. Before trying to train on our own data, I managed to run DocProductPresentation successfully (and downloaded all the necessary files).

To train on our own data, I downloaded "sampleData.csv" and the Train_Your_Own_QA notebook. When I ran the training, I got an OOM error (my GPU is an RTX 2070 with 8 GB). My first step was to reduce the batch size by half, and so on; however, even after I set the batch size to 1, I still got the OOM error. So I played a bit with "bert_config.json" from the BioBERT pre-trained model and changed num_hidden_layers to 6 (the default is 12), and then training ran. Also, I noticed you set num_epochs to 1, so I didn't change it.
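For reference, the config edit I made looks roughly like this (a minimal sketch; the helper name and path handling are mine, and only the standard bert_config.json keys are assumed):

```python
import json

def shrink_bert_config(path, num_hidden_layers=6):
    """Reduce encoder depth in a BERT-style bert_config.json to cut GPU memory.

    Note: the released BioBERT checkpoint was trained with 12 layers, so
    loading it with fewer layers discards part of the pre-trained weights.
    """
    with open(path) as f:
        config = json.load(f)
    config["num_hidden_layers"] = num_hidden_layers
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    return config
```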

Once training finished (it took around 35 minutes), I used the DocProductPresentation notebook to test the new model. However, the results were totally off-topic for the questions I asked. To check whether the new model works as intended, I copied a question straight from "sampleData.csv" and still got an off-topic answer.

So, my questions are,

  1. What GPU did you use to train your model? Is 8 GB of VRAM not enough? Does the OOM error come from loading the BioBERT model or from your architecture?
  2. Did you use num_epochs 1 when training on "sampleData.csv" and still get good results? If not, what parameters should I use?
  3. I noticed you used "Float16EmbeddingsExpanded.pkl" in the DocProductPresentation notebook but not in Train_Your_Own_QA. What is the purpose of this file?
  4. Is the answer auto-generated or just retrieved from "sampleData.csv"? If it is retrieval, the model must look in some kind of database or pool of QA pairs; where is this?
  5. Also, I couldn't find some of the returned answers in "sampleData.csv"; where do these answers come from?

Thank you so much for your help.

@ronykalfarisi ronykalfarisi changed the title Need help with "Train Your Own QA Models" Tutorial Need help understanding "Train Your Own QA Models" Tutorial Aug 14, 2019
Santosh-Gupta (Member) commented Aug 14, 2019

> What GPU did you use to train your model? Is 8 GB of VRAM not enough? Does the OOM error come from loading the BioBERT model or from your architecture?

@JayYip or @ash3n should be able to answer that one.

Did you use num_epochs 1 in training "sampleData.csv" and got good result? If not, what are good parameters I need to use?

@JayYip or @ash3n can confirm, but I believe we ran very few epochs. In the single digits, possibly 1.

> I noticed you used "Float16EmbeddingsExpanded.pkl" in the DocProductPresentation notebook but not in Train_Your_Own_QA. What is the purpose of this file?

We could not load the original embeddings into Google Colab without it crashing.

> Is the answer auto-generated or just retrieved from "sampleData.csv"? If it is retrieval, the model must look in some kind of database or pool of QA pairs; where is this?

We have two notebooks: one uses GPT-2 to generate answers, the other does simple retrieval.

> the model must look in some kind of database or pool of QA pairs; where is this?

The "Float16EmbeddingsExpanded.pkl"

> Also, I couldn't find some of the returned answers in "sampleData.csv"; where do these answers come from?

Try this dataset instead:

https://github.com/Santosh-Gupta/datasets

To see if your text data is trainable, you can train on your data with just the FFNN: encode your texts with BERT (the average of all the context vectors from the second-to-last layer), and use a separate FFNN layer for each of the question and answer embeddings. It won't be as good as training the BERT weights, but it's much faster and should give you decent results. If the results don't make sense this way, something may be off with your data.
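A rough sketch of that two-tower FFNN setup, in plain NumPy with made-up dimensions (real training would add a loss, e.g. softmax over the matched pairs, and an optimizer on top):

```python
import numpy as np

rng = np.random.default_rng(0)
d_bert, d_out = 768, 256  # BERT feature size; shared output space (assumed)

# Separate FFNN projection for each tower, as described above.
W_q = rng.normal(0, 0.02, (d_bert, d_out))  # question tower
W_a = rng.normal(0, 0.02, (d_bert, d_out))  # answer tower

def embed(x, W):
    # x: (batch, d_bert) frozen BERT features
    # (second-to-last-layer average of the context vectors)
    h = x @ W
    return h / np.linalg.norm(h, axis=1, keepdims=True)

# Score every question against every answer; the diagonal holds the
# matched pairs, so training would push it to dominate each row.
q = embed(rng.normal(size=(4, d_bert)), W_q)
a = embed(rng.normal(size=(4, d_bert)), W_a)
scores = q @ a.T  # (4, 4) similarity matrix
```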

If you are using scientific/medical texts, you will want to use SciBERT or BioBERT, and then use bert-as-a-service to batch-encode your texts. If not, I would recommend using TensorFlow Hub or PyTorch Hub to mass-encode your texts. I especially recommend PyTorch's RoBERTa weights.

JayYip (Collaborator) commented Aug 26, 2019

Sorry for the late reply.

> What GPU did you use to train your model? Is 8 GB of VRAM not enough? Does the OOM error come from loading the BioBERT model or from your architecture?

I used a Titan Xp, but I think there's something wrong with it, since it still raised OOM even when the batch size was set to 1.

> Did you use num_epochs 1 when training on "sampleData.csv" and still get good results? If not, what parameters should I use?

We trained for a couple of epochs. You can try something between 5 and 10.

Santosh-Gupta (Member) commented

@JayYip @ash3n now that the TF 2.0 hackathon is over, maybe we should switch to the PyTorch Hugging Face BERT, which is somehow very lightweight. It runs on the Colab Tesla K80 GPU no problem. It is widely used and continuously updated.

JayYip (Collaborator) commented Aug 26, 2019

> It runs on the Colab Tesla K80 GPU no problem.

TensorFlow and PyTorch are not that different in terms of GPU memory. A K80 should be fine for training a 12-layer transformer.

> It is widely used and continuously updated

I agree with this point, but that will take some work. We need to change the input pipeline from tf.data to torch.utils.data and the model from Keras to PyTorch. It'll take a couple of days, and I'm not sure whether I have time to do it.

Santosh-Gupta (Member) commented

> > It runs on the Colab Tesla K80 GPU no problem.
>
> TensorFlow and PyTorch are not that different in terms of GPU memory. A K80 should be fine for training a 12-layer transformer.
>
> > It is widely used and continuously updated
>
> I agree with this point, but that will take some work. We need to change the input pipeline from tf.data to torch.utils.data and the model from Keras to PyTorch. It'll take a couple of days, and I'm not sure whether I have time to do it.

True. Maybe for another project. I am actually working on archive manatee, which uses the exact same architecture: two-tower BERT.

ronykalfarisi (Author) commented

@Santosh-Gupta & @JayYip thanks so much guys.

abhijeet201998 commented

@ronykalfarisi @JayYip Hey, can you help me out with running the model locally on my machine?

JayYip (Collaborator) commented Mar 3, 2020

@abhijeet201998 The code is tested on a Linux machine with a Titan Xp GPU. Not 100% sure whether it will work on Windows or macOS.
