Results with custom dataset #112

Open
aaronbriel opened this issue Nov 19, 2020 · 1 comment

aaronbriel commented Nov 19, 2020

Hello!

First of all, thank you again for your incredible contributions, not only this dataset but, most importantly, the Haystack toolset!

I was able to closely approximate the results of your paper when running https://github.com/deepset-ai/FARM/blob/master/examples/question_answering_crossvalidation.py, although I had to reduce batch_size to 25 to prevent the following error: RuntimeError: CUDA out of memory. Tried to allocate 540.00 MiB (GPU 0; 15.78 GiB total capacity; 14.29 GiB already allocated; 386.75 MiB free; 14.35 GiB reserved in total by PyTorch). This was on an Ubuntu 18.04 VM with a Tesla V100 GPU and 128 GB of disk space. As mentioned, the results I obtained were quite close:
XVAL EM: 0.26151560178306094
XVAL f1: 0.5858967501101285
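
For reference, the batch size change was the only edit I made to the example script. This is just a sketch; the variable name comes from the FARM example and its default value may differ between versions:

```python
# The only modification to question_answering_crossvalidation.py:
# shrink the batch so the run fits in the 16 GB of a Tesla V100.
batch_size = 25  # reduced from the script's default to avoid the CUDA OOM above
```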

I created a custom Covid-19 dataset that combines a preprocessed/cleansed subset of the dataset from the paper "Collecting Verified COVID-19 Question Answer Pairs" (Poliak et al., 2020) and a SQuADified version of your dataset, faq_covidbert.csv. For the latter I used your annotation tool to map questions to chunks in the answers, treating the full answers as contexts.
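
To make the "SQuADified" part concrete, here is a minimal sketch of the structure I produced with the annotation tool. The question, context, and answer strings are invented placeholders, not actual rows from either dataset:

```python
import json

# Placeholder example of one SQuAD-style entry: the full FAQ answer is the
# context, and the annotated chunk inside it is the extractive answer span.
context = "Full answer text from faq_covidbert.csv ... the relevant chunk ... more text."
answer_text = "the relevant chunk"

squad_entry = {
    "title": "covid_faq",
    "paragraphs": [
        {
            "context": context,
            "qas": [
                {
                    "id": "example-0",
                    "question": "Question mapped to this answer",
                    "is_impossible": False,
                    "answers": [
                        # answer_start is the character offset of the chunk in the context
                        {"text": answer_text, "answer_start": context.find(answer_text)},
                    ],
                }
            ],
        }
    ],
}

# Wrap entries in the usual top-level SQuAD container before training.
with open("covid_custom_squad.json", "w") as f:
    json.dump({"version": "v2.0", "data": [squad_entry]}, f, indent=2)
```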

I trained a model with this dataset using the hyperparameters you specify here: https://huggingface.co/deepset/roberta-base-squad2-covid#hyperparameters. Informal tests on various Covid-19-related questions indicate that my model generates better responses than roberta-base-squad2-covid, which isn't surprising, as inspection of both datasets reveals that mine contains far more Covid-19-specific questions and answers.
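
For completeness, the training run followed FARM's standard extractive QA fine-tuning flow. This is a condensed sketch rather than my exact script: the file names are mine, and the epoch/learning-rate values shown are placeholders standing in for the hyperparameters listed on the model card.

```python
from pathlib import Path

from farm.data_handler.data_silo import DataSilo
from farm.data_handler.processor import SquadProcessor
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.language_model import LanguageModel
from farm.modeling.optimization import initialize_optimizer
from farm.modeling.prediction_head import QuestionAnsweringHead
from farm.modeling.tokenization import Tokenizer
from farm.train import Trainer
from farm.utils import initialize_device_settings, set_all_seeds

set_all_seeds(seed=42)
device, n_gpu = initialize_device_settings(use_cuda=True)

lang_model = "deepset/roberta-base-squad2"  # base model named on the model card
batch_size = 25        # same reduction as for cross-validation
n_epochs = 3           # placeholder; in my run taken from the model card
learning_rate = 3e-5   # placeholder; in my run taken from the model card

tokenizer = Tokenizer.load(pretrained_model_name_or_path=lang_model, do_lower_case=False)

# SquadProcessor reads SQuAD-style JSON; the train/dev file names are mine, not FARM defaults.
processor = SquadProcessor(
    tokenizer=tokenizer,
    max_seq_len=384,
    label_list=["start_token", "end_token"],
    metric="squad",
    train_filename="covid_custom_train.json",
    dev_filename="covid_custom_dev.json",
    test_filename=None,
    data_dir=Path("data/covid_custom"),
)
data_silo = DataSilo(processor=processor, batch_size=batch_size, distributed=False)

# RoBERTa language model with an extractive QA head on top.
model = AdaptiveModel(
    language_model=LanguageModel.load(lang_model),
    prediction_heads=[QuestionAnsweringHead()],
    embeds_dropout_prob=0.1,
    lm_output_types=["per_token"],
    device=device,
)

model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    learning_rate=learning_rate,
    device=device,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=n_epochs,
)

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=n_epochs,
    n_gpu=n_gpu,
    lr_schedule=lr_schedule,
    evaluate_every=100,
    device=device,
)
trainer.train()

model.save("saved_models/roberta-covid-custom")
processor.save("saved_models/roberta-covid-custom")
```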

However, when running question_answering_crossvalidation.py with my dataset, the metrics are noticeably worse than those observed with your dataset, or even the baseline referenced in the paper. Here are the EM and f1 scores I obtained with my dataset:
XVAL EM: 0.21554054054054053
XVAL f1: 0.4432141443807887

Can you provide any insight as to why this would be the case? Thank you so much!

@aaronbriel
Author

I'll assume that the overall low scores, similar to what was noted in the paper, could be related to the complexity of the question/answer pairs combined with the large contexts and the absence of multiple annotations per question.
