[Feature Request] Exclude reported sentences by default. #120

HarikalarKutusu · 2022-01-30T11:05:05Z

As far as I can see, reported.tsv is not taken into account anywhere in the workflow:

If a sentence gets reported, it is logged in reported.tsv.
It continues to be shown to users and keeps getting new recordings.
These recordings keep being shown in Listen/validation, where they can be validated or invalidated.
CorporaCreator does not take reported.tsv into account when creating the datasets, so these recordings go into validated.tsv if they have enough votes, and from there they may end up in the train/dev/test sets and thus in training.

I examined 240 records in the v7.0 Turkish dataset and found the following:

As you can see, 22% were OK, 40% could be corrected, and only 38% were rightly rejected. Categories like "offensive language/slang" or "political" can be very subjective. The rightfully reported ones are indeed cases of wrong grammar, OCR mistakes, or heavy use of foreign names.

Given the size of the whole dataset, I think losing some sentences to unjustified reports is acceptable. Letting wrong sentences (grammar and spelling) into training is not.

I would suggest the following:

Add a command-line parameter -x to exclude the sentences listed in reported.tsv. It should default to TRUE, but one could turn it off to keep them.
Remove all recordings of sentences listed in reported.tsv from the validated/train/dev/test sets.
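To make the suggestion concrete, here is a minimal sketch of the filtering step. It is not CorporaCreator's actual code: the function names, the long option name `--exclude-reported`, and the sample rows are made up for illustration. It also assumes that both files have a `sentence` column and that matching on the sentence text is good enough (matching on a sentence ID would be stricter). `argparse.BooleanOptionalAction` requires Python 3.9 or later.

```python
import argparse
import csv
import io


def read_tsv(text):
    """Parse TSV text into a list of dict rows (one dict per record)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def exclude_reported(clips, reported, exclude=True):
    """Return the clips whose sentence is not listed in the reported rows.

    With exclude=False the clips are returned unchanged.
    """
    if not exclude:
        return list(clips)
    reported_sentences = {row["sentence"] for row in reported}
    return [clip for clip in clips if clip["sentence"] not in reported_sentences]


def build_parser():
    """Hypothetical CLI: -x is on by default, --no-exclude-reported turns it off."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-x",
        "--exclude-reported",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="drop recordings of sentences listed in reported.tsv",
    )
    return parser


# Made-up sample data standing in for validated.tsv and reported.tsv.
validated = read_tsv(
    "path\tsentence\n"
    "clip1.mp3\tGood sentence.\n"
    "clip2.mp3\tBad sentnce.\n"
    "clip3.mp3\tBad sentnce.\n"
)
reported = read_tsv("sentence\treason\nBad sentnce.\tgrammar-or-spelling\n")

args = build_parser().parse_args([])  # no flags given, so exclusion is on
kept = exclude_reported(validated, reported, exclude=args.exclude_reported)
print([clip["path"] for clip in kept])  # ['clip1.mp3']
```

If the train/dev/test sets are derived from the validated set, applying the filter once before the split would keep reported sentences out of all of them.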

One might go further and add another field, "verified", to reported.tsv: a dataset engineer manually reviews the reports, and only the "verified" ones get removed.
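A rough sketch of how the "verified" idea could look. The `verified` column does not exist in reported.tsv today; its name and the accepted values ("1", "true", "yes") are assumptions for illustration, as is the function name.

```python
import csv
import io


def sentences_to_exclude(reported_rows, require_verified=True):
    """Collect the sentences to drop from a list of reported.tsv rows.

    With require_verified=True, only rows whose hypothetical 'verified'
    column holds a truthy marker are used; otherwise every report counts.
    """
    truthy = {"1", "true", "yes"}
    excluded = set()
    for row in reported_rows:
        marker = (row.get("verified") or "").strip().lower()
        if not require_verified or marker in truthy:
            excluded.add(row["sentence"])
    return excluded


# Made-up reported.tsv content with the proposed extra column.
sample = (
    "sentence\treason\tverified\n"
    "Bad sentnce.\tgrammar-or-spelling\t1\n"
    "A perfectly fine sentence.\toffensive-language\t0\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))

print(sentences_to_exclude(rows))  # {'Bad sentnce.'}
print(len(sentences_to_exclude(rows, require_verified=False)))  # 2
```

This way the 22% of reports that were actually OK would stay in the dataset once someone has reviewed them.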

laubonghaudoi · 2022-05-17T15:05:40Z

Strong +1 to this request. Reported sentences should be auto-excluded, as they damage the data quality of the corpus.

HarikalarKutusu mentioned this issue May 24, 2022

Do not allow reported sentences in record/listen common-voice/common-voice#3717

Open
