Results with custom dataset #112

Open
aaronbriel opened this issue Nov 19, 2020 · 1 comment

aaronbriel commented Nov 19, 2020

Hello!

First of all, thank you again for your incredible contributions, not only this dataset but, most importantly, the Haystack toolset!

I was able to closely approximate the results of your paper when running https://github.com/deepset-ai/FARM/blob/master/examples/question_answering_crossvalidation.py, although I had to reduce batch_size to 25 to prevent the following error: RuntimeError: CUDA out of memory. Tried to allocate 540.00 MiB (GPU 0; 15.78 GiB total capacity; 14.29 GiB already allocated; 386.75 MiB free; 14.35 GiB reserved in total by PyTorch). This was on an Ubuntu 18.04 VM with a Tesla V100 GPU and 128 GB of disk space. As mentioned, the results I obtained were quite close:
XVAL EM: 0.26151560178306094
XVAL f1: 0.5858967501101285
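
For reference, the batch size change was the only edit I made to the example script. This is just a sketch; the variable name comes from the FARM example and its default value may differ between versions:

```python
# The only modification to question_answering_crossvalidation.py:
# shrink the batch so the run fits in the 16 GB of a Tesla V100.
batch_size = 25  # reduced from the script's default to avoid the CUDA OOM above
```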

I created a custom Covid-19 dataset that combines a preprocessed/cleansed subset of the dataset from the paper "Collecting Verified COVID-19 Question Answer Pairs" (Poliak et al., 2020) and a SQuADified version of your dataset, faq_covidbert.csv. For the latter I used your annotation tool to map questions to chunks in the answers, treating the full answers as contexts.
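
To make the "SQuADified" part concrete, here is a minimal sketch of the structure I produced with the annotation tool. The question, context, and answer strings are invented placeholders, not actual rows from either dataset:

```python
import json

# Placeholder example of one SQuAD-style entry: the full FAQ answer is the
# context, and the annotated chunk inside it is the extractive answer span.
context = "Full answer text from faq_covidbert.csv ... the relevant chunk ... more text."
answer_text = "the relevant chunk"

squad_entry = {
    "title": "covid_faq",
    "paragraphs": [
        {
            "context": context,
            "qas": [
                {
                    "id": "example-0",
                    "question": "Question mapped to this answer",
                    "is_impossible": False,
                    "answers": [
                        # answer_start is the character offset of the chunk in the context
                        {"text": answer_text, "answer_start": context.find(answer_text)},
                    ],
                }
            ],
        }
    ],
}

# Wrap entries in the usual top-level SQuAD container before training.
with open("covid_custom_squad.json", "w") as f:
    json.dump({"version": "v2.0", "data": [squad_entry]}, f, indent=2)
```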

I trained a model with this dataset using the hyperparameters you specify here: https://huggingface.co/deepset/roberta-base-squad2-covid#hyperparameters. Informal tests on various Covid-19-related questions indicate that my model generates better responses than roberta-base-squad2-covid, which isn't surprising, as inspection of both datasets reveals that mine contains far more Covid-19-specific questions and answers.
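
For completeness, the training run followed FARM's standard extractive QA fine-tuning flow. This is a condensed sketch rather than my exact script: the file names are mine, and the epoch/learning-rate values shown are placeholders standing in for the hyperparameters listed on the model card.

```python
from pathlib import Path

from farm.data_handler.data_silo import DataSilo
from farm.data_handler.processor import SquadProcessor
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.language_model import LanguageModel
from farm.modeling.optimization import initialize_optimizer
from farm.modeling.prediction_head import QuestionAnsweringHead
from farm.modeling.tokenization import Tokenizer
from farm.train import Trainer
from farm.utils import initialize_device_settings, set_all_seeds

set_all_seeds(seed=42)
device, n_gpu = initialize_device_settings(use_cuda=True)

lang_model = "deepset/roberta-base-squad2"  # base model named on the model card
batch_size = 25        # same reduction as for cross-validation
n_epochs = 3           # placeholder; in my run taken from the model card
learning_rate = 3e-5   # placeholder; in my run taken from the model card

tokenizer = Tokenizer.load(pretrained_model_name_or_path=lang_model, do_lower_case=False)

# SquadProcessor reads SQuAD-style JSON; the train/dev file names are mine, not FARM defaults.
processor = SquadProcessor(
    tokenizer=tokenizer,
    max_seq_len=384,
    label_list=["start_token", "end_token"],
    metric="squad",
    train_filename="covid_custom_train.json",
    dev_filename="covid_custom_dev.json",
    test_filename=None,
    data_dir=Path("data/covid_custom"),
)
data_silo = DataSilo(processor=processor, batch_size=batch_size, distributed=False)

# RoBERTa language model with an extractive QA head on top.
model = AdaptiveModel(
    language_model=LanguageModel.load(lang_model),
    prediction_heads=[QuestionAnsweringHead()],
    embeds_dropout_prob=0.1,
    lm_output_types=["per_token"],
    device=device,
)

model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    learning_rate=learning_rate,
    device=device,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=n_epochs,
)

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=n_epochs,
    n_gpu=n_gpu,
    lr_schedule=lr_schedule,
    evaluate_every=100,
    device=device,
)
trainer.train()

model.save("saved_models/roberta-covid-custom")
processor.save("saved_models/roberta-covid-custom")
```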

However, when running question_answering_crossvalidation.py with my dataset, the metrics are noticeably worse than those observed with your dataset, or even the baseline referenced in the paper. Here are the EM and f1 scores I obtained with my dataset:
XVAL EM: 0.21554054054054053
XVAL f1: 0.4432141443807887

Can you provide any insight as to why this would be the case? Thank you so much!

@aaronbriel
Author

I'll assume that the overall low scores, similar to what was noted in the paper, could be related to the complexity of the question/answer pairs combined with the large contexts and the absence of multiple annotations per question.
