This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

NQ Dataset #221

Open
varshakishore opened this issue Jun 14, 2022 · 1 comment

Comments

varshakishore commented Jun 14, 2022

I know that this code base is no longer supported, but I have a couple of questions about the NQ dataset.

The official dataset page says "Natural Questions contains 307K training examples, 8K examples for development, and a further 8K examples for testing". However, the DPR paper reports that the NQ dataset is much smaller (unfiltered training set is 79,168, filtered training set is 58,880, dev set is 8,757 and test set is 3,610). Why is this the case? Are you using an older version of the NQ dataset?

Also, I downloaded the datasets using your scripts. The training set does indeed have 58,880 samples, but the dev set only has 6,515 samples. Why are some of the samples missing from the dev set?
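For anyone wanting to reproduce these counts, here is a minimal sketch. It assumes the downloaded file is a single JSON array with one object per question; the helper names `count_examples` and `retained_fraction` and the tiny demo file are illustrative and not part of the DPR scripts. It also compares how much of each split survives, using only the numbers quoted in this issue.

```python
import json
import os
import tempfile


def count_examples(path):
    """Return the number of records in a JSON file holding one top-level array."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    return len(data)


def retained_fraction(kept, total):
    """Fraction of examples that remain after filtering."""
    return kept / total


# Demo on a tiny stand-in file (hypothetical content, not real NQ data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump([{"question": "q1"}, {"question": "q2"}], f)
    demo_count = count_examples(demo_path)

# Numbers quoted in this issue: filtered vs. unfiltered train, downloaded vs. reported dev.
train_ratio = retained_fraction(58_880, 79_168)
dev_ratio = retained_fraction(6_515, 8_757)
print(f"train retained: {train_ratio:.3f}, dev retained: {dev_ratio:.3f}")
```

Both ratios come out close to 0.74, which would be consistent with the downloaded dev file having gone through the same kind of filtering as the training set, although nothing in this thread confirms that.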

@vladk232

Hi @varshakishore,
The open-domain version of NQ is a subset of the "main" NQ dataset, and there is no direct correspondence between their dev/test splits (the official NQ dev set serves as the test set for open-domain NQ). You can find information about the differences on Google's NQ GitHub page. As described in the paper, we used the same filtering process as in the ORQA paper. Google hadn't released the open-domain version of NQ at the time, so we just repeated the same steps. They have since released all the splits, so you can simply reference or download open-domain NQ from the official site.
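To make the subset relationship concrete, here is a rough sketch of that style of filtering. It is not the DPR or ORQA code. It assumes, from my reading of the ORQA paper, that only questions with short answers are kept and that answers longer than five tokens are dropped; the record layout (`question`, `short_answers`), the whitespace tokenization, and the function names are all hypothetical.

```python
MAX_ANSWER_TOKENS = 5  # assumption: cap as recalled from the ORQA paper


def keep_short_answers(example, max_tokens=MAX_ANSWER_TOKENS):
    """Return the example's short answers with at most max_tokens whitespace tokens."""
    return [
        answer
        for answer in example.get("short_answers", [])
        if 0 < len(answer.split()) <= max_tokens
    ]


def to_open_domain(examples, max_tokens=MAX_ANSWER_TOKENS):
    """Keep only questions that still have at least one qualifying short answer."""
    filtered = []
    for example in examples:
        answers = keep_short_answers(example, max_tokens)
        if answers:
            # The evidence document is discarded; only question and answers remain.
            filtered.append({"question": example["question"], "answers": answers})
    return filtered


# Hypothetical records, not real NQ data.
raw = [
    {"question": "who wrote hamlet", "short_answers": ["William Shakespeare"]},
    {"question": "why is the sky blue", "short_answers": []},
    {
        "question": "what is photosynthesis",
        "short_answers": ["the process by which plants turn light into chemical energy"],
    },
]
open_domain = to_open_domain(raw)
print(len(raw), "->", len(open_domain))
```

A filter like this drops questions with no short answer or only long ones, which is one reason an open-domain split can be far smaller than the 307K examples on the official page.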
