
The single audio per sentence restriction is too strict for most languages #113

Open
ftyers opened this issue Apr 6, 2021 · 3 comments

ftyers commented Apr 6, 2021

I've been training quite a few models recently, and after getting through about 18 Common Voice languages I realised that most of the data wasn't being included. The issue surfaced when I was looking for an additional datapoint with more training data than Tatar to fill out the following graph:
[image: graph of training data per language]

It seemed odd to me that Portuguese only had 7 hours of data, but not odd enough. Then I looked at Basque.

test: Final amount of imported audio: 8:08:13 from 8:11:21.
dev: Final amount of imported audio: 7:47:56 from 7:48:58.
train: Final amount of imported audio: 10:51:34 from 10:51:44.
validated: Final amount of imported audio: 89:35:24 from 89:43:41.

The total amount of data available in the training split was a fraction of what was validated.

The obvious solution is that everyone goes and makes their own splits. But this is a bit unsatisfactory because then people's results won't be comparable. I imagine one of the desiderata of the dataset releases and splits is that they be standard and comparable.

Another solution would be to add command-line flags:

  • --strict-speaker: a given speaker appears in only one split
  • --strict-sentence: a given sentence appears in only one split
  • --strict-audio: only a single recording per sentence is kept

--strict-speaker and --strict-sentence should be turned on by default; together they ensure that the model doesn't get to peek at either the speaker or the sentence.

--strict-audio should be turned off by default; it is more of a model-optimisation choice, e.g. you could consider having more than one recording per sentence as a kind of augmentation.
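As a rough sketch (not the actual CorporaCreator implementation), the three proposed flags could be applied like this, assuming each row is a (client_id, sentence, clip) tuple:

```python
import random
from collections import defaultdict

def make_splits(rows, strict_speaker=True, strict_sentence=True,
                strict_audio=False, dev_frac=0.1, test_frac=0.1, seed=0):
    """Toy splitter illustrating the proposed flags.

    `rows` is a list of (client_id, sentence, clip) tuples; the function
    names and defaults here are hypothetical, for illustration only.
    """
    rng = random.Random(seed)

    if strict_audio:
        # --strict-audio: keep only the first recording of each sentence.
        seen, kept = set(), []
        for row in rows:
            if row[1] not in seen:
                seen.add(row[1])
                kept.append(row)
        rows = kept

    splits = {"train": [], "dev": [], "test": []}
    if strict_speaker:
        # --strict-speaker: all of a speaker's clips land in one split.
        by_speaker = defaultdict(list)
        for row in rows:
            by_speaker[row[0]].append(row)
        speakers = list(by_speaker)
        rng.shuffle(speakers)
        cut_dev = int(len(speakers) * dev_frac)
        cut_test = cut_dev + int(len(speakers) * test_frac)
        for i, spk in enumerate(speakers):
            name = "dev" if i < cut_dev else "test" if i < cut_test else "train"
            splits[name].extend(by_speaker[spk])
    else:
        # Otherwise split at the clip level.
        rows = list(rows)
        rng.shuffle(rows)
        cut_dev = int(len(rows) * dev_frac)
        cut_test = cut_dev + int(len(rows) * test_frac)
        splits["dev"], splits["test"], splits["train"] = (
            rows[:cut_dev], rows[cut_dev:cut_test], rows[cut_test:])

    if strict_sentence:
        # --strict-sentence: drop dev/test clips whose sentence is in train.
        train_sents = {r[1] for r in splits["train"]}
        for name in ("dev", "test"):
            splits[name] = [r for r in splits[name] if r[1] not in train_sents]
    return splits
```

Note that with --strict-audio off, every validated clip survives into some split, which is exactly the behaviour the Basque numbers above are missing.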

It would also be worth looking into balancing the train/dev/test by gender, but that is certainly another issue.
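For what gender balancing might look like, here is a naive downsampling pass (purely hypothetical, assuming each row carries the gender field from the Common Voice metadata):

```python
import random
from collections import defaultdict

def balance_by_gender(rows, seed=0):
    """Downsample so each reported gender is equally represented.

    `rows` is a list of (client_id, gender, clip) tuples. This is a
    hypothetical helper, not part of CorporaCreator; a real version
    would also need to handle missing/unreported gender.
    """
    rng = random.Random(seed)
    by_gender = defaultdict(list)
    for row in rows:
        by_gender[row[1]].append(row)
    # Cap every gender group at the size of the smallest one.
    cap = min(len(clips) for clips in by_gender.values())
    balanced = []
    for clips in by_gender.values():
        rng.shuffle(clips)
        balanced.extend(clips[:cap])
    return balanced
```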

@ftyers ftyers changed the title CorporaCreator throws away ~70% of training data The single audio per sentence restriction is too strict for most languages Apr 6, 2021
@ftyers ftyers mentioned this issue Apr 6, 2021

ftyers commented Apr 6, 2021

I'd like to add that this restriction implicitly biases the training set unless extra steps are taken (which I believe they aren't):

  • it favours recordings by the earliest contributors, who are more likely to be white and male for most of the languages we currently have enabled;
  • or it favours a random contributor, who is also more likely to be white and male.

At least this is the case for most languages currently in Common Voice.


HarikalarKutusu commented Dec 8, 2021

I hit this wall while trying to train with v7.0 of the Turkish dataset. Before getting our hands on the new dataset, I wanted to know where we stood with v7.0 to see the effect of our campaign. I used @ftyers's technical paper for replication - acoustic model only for now...

But v7.0 gave worse results than v6.1... So I did a roundup of all dataset versions:

[image: results across dataset versions]

As v7.0 training converged at a rather early stage, I had to analyze the splits... So I did another roundup:

[image: split statistics across dataset versions]

Two additional notes before commenting on these:

  1. Until v7.0 the text corpus was the same, so the split sizes are about the same. In 2021 we added more text to the corpus, so the number of recordings in these splits and the percentage of voice data used increased.
  2. In 2021 some male contributors each added many recordings (>500, up to 4,000).

With v7 the problem shows up in several places:

  1. The number of distinct voices dropped considerably.
  2. All training data comes from (young) male contributors, with no women - while testing is done by women.
  3. There is one young male speaker with an accent who has a very large number of recordings in both the TRAIN and DEV sets.

Here is data for the last point:

Train

[image: per-speaker recording counts in the train split]

Dev

[image: per-speaker recording counts in the dev split]

So, I think this is the worst possible scenario. Because these splits are meant to be a benchmark, I think a better split algorithm is needed. @ftyers's PR is only one part of the solution.

Your comments are greatly appreciated...

@HarikalarKutusu

Addendum: Finished a sweep with an optimized LM. The difference persists, even though the size of the validated set increased by 52.18%...

[image: results after the LM-optimized sweep]
