[Feature Request] Exclude reported sentences by default. #120

Open
HarikalarKutusu opened this issue Jan 30, 2022 · 1 comment

Comments

@HarikalarKutusu
Contributor

As far as I can see, reported.tsv is not taken into account anywhere in the workflow:

  • If a sentence gets reported it is logged into reported.tsv
  • It continues to be shown to users and keeps getting new recordings
  • These recordings keep being shown in Listen/validation and they can get validated/invalidated
  • CorporaCreator does not take reported.tsv into account. When it creates the datasets, these recordings go into validated.tsv if they have enough votes, so they can end up in the train/dev/test sets and therefore in training.

I examined 240 records in the v7.0 Turkish dataset and found the following:

image

image

As you can see, 22% were OK, 40% could be corrected, and only 38% were rightly rejected. Reasons like "offensive language/slang" or "political" can be very subjective. The rightfully reported ones, when you analyze them, do indeed have wrong grammar, OCR mistakes, or heavy use of foreign names.

Given the size of the whole dataset, I think losing some sentences to bad reports will be OK. What is not desirable is getting wrong sentences (grammar and spelling) into training.

I would suggest the following:

  1. Add a command-line parameter -x to exclude the sentences in reported.tsv. It should default to TRUE, but one could turn it off to include them.
  2. Remove all recordings of sentences listed in reported.tsv from the validated/train/dev/test sets.
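
The two suggestions above could be sketched roughly as follows. This is only an illustration, not CorporaCreator's actual code: the function names are hypothetical, `exclude_reported` merely stands in for the proposed -x parameter, and matching on a shared `sentence` column in both files is an assumption.

```python
import csv
import io


def load_reported_sentences(reported_tsv):
    """Collect the sentences listed in a reported.tsv-style file.

    Assumes a tab-separated file with a header row and a 'sentence' column.
    """
    reader = csv.DictReader(reported_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["sentence"] for row in reader}


def filter_clips(clips, reported, exclude_reported=True):
    """Drop clips whose sentence was reported.

    exclude_reported mirrors the proposed -x parameter: on by default,
    and switching it off keeps every clip.
    """
    if not exclude_reported:
        return list(clips)
    return [clip for clip in clips if clip["sentence"] not in reported]


# Tiny made-up data, only for illustration.
reported_file = io.StringIO("sentence\treason\nBad sentnce\tgrammar-or-spelling\n")
clips = [
    {"path": "a.mp3", "sentence": "Bad sentnce"},
    {"path": "b.mp3", "sentence": "A fine sentence."},
]

reported = load_reported_sentences(reported_file)
kept = filter_clips(clips, reported)  # only the clip for b.mp3 remains
```

Applying the same filter before the validated/train/dev/test split would keep reported sentences out of every derived set at once.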

One might go further and add another field, "verified", to reported.tsv: a dataset engineer reviews the reports manually, and only the "verified" ones get removed.
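
A minimal sketch of how such a "verified" field might be honored, assuming it is stored as a `verified` column holding `true`/`false` next to a `sentence` column. The column names, the value format, and the function name are all assumptions for illustration; no such field exists in reported.tsv today.

```python
import csv
import io


def verified_reported_sentences(reported_tsv, only_verified=True):
    """Return reported sentences, optionally only the manually verified ones.

    Rows without a 'verified' value are treated as not verified.
    """
    reader = csv.DictReader(reported_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    sentences = set()
    for row in reader:
        is_verified = (row.get("verified") or "").strip().lower() == "true"
        if is_verified or not only_verified:
            sentences.add(row["sentence"])
    return sentences


# Made-up rows: one confirmed report and one rejected after review.
example_tsv = (
    "sentence\treason\tverified\n"
    "Bad sentnce\tgrammar-or-spelling\ttrue\n"
    "A fine sentence.\toffensive-language\tfalse\n"
)

to_remove = verified_reported_sentences(io.StringIO(example_tsv))
```

With this approach, subjective reports that a reviewer rejects would no longer cost the dataset any recordings.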

@laubonghaudoi

Strong +1 to this request. Reported sentences should be auto-excluded, as they damage the data quality of the corpus.
