Is there a limit to the audio duration? #181

Open
JJun-Guo opened this issue Jun 6, 2023 · 22 comments
Labels
extract-improvements, question

Comments

@JJun-Guo

JJun-Guo commented Jun 6, 2023

Is there a limit to the audio duration?

@HarikalarKutusu
Contributor

Hey @JJun-Guo, recordings in Common Voice are currently limited to 10 seconds.

Here is a related recent discussion on allowing more:
https://discourse.mozilla.org/t/discussion-relaxation-of-the-10-sec-recording-limitation/114142

@JJun-Guo
Author

JJun-Guo commented Jun 6, 2023 via email

@HarikalarKutusu
Contributor

HarikalarKutusu commented Jun 6, 2023

I need to check the code, but off the top of my head, it was 1 sec and was later dropped to 0.5...

Actually, since the duration also includes silences, short utterances can easily be recorded by leaving some silence at the start or at the end of the recording.

@HarikalarKutusu
Contributor

HarikalarKutusu commented Jun 6, 2023

I was wrong. It is 1 sec. 0.5 sec is for the benchmark sentences (numbers etc).

https://github.com/common-voice/common-voice/blob/3bccdf446f6acd8a9afda1db7a9a1664457e611d/web/src/components/pages/contribution/speak/speak.tsx#L42

But as I stated in the link given in the previous post, state-of-the-art models work better with longer utterances. E.g., Whisper works best with 5-25 sec recordings...

So, it is better to get an average char duration and calculate a minimum sentence length from there...
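A rough sketch of that calculation follows. The function name and the sample numbers are illustrative assumptions, not taken from the Common Voice code; the real average should be measured on a language's validated recordings.

```python
import math

def min_sentence_chars(total_chars: int, total_seconds: float, target_seconds: float) -> int:
    """Estimate how many characters a sentence needs to reach target_seconds.

    total_chars / total_seconds is the average speaking rate measured on
    existing recordings (hypothetical inputs in the example below).
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    chars_per_second = total_chars / total_seconds
    return math.ceil(chars_per_second * target_seconds)

# Made-up numbers: 540,000 characters spoken over 36,000 seconds is
# 15 chars/sec, so a 5-second target needs about 75 characters.
print(min_sentence_chars(540_000, 36_000.0, 5.0))
```

Note that the measured rate includes leading and trailing silence, so the result is only an approximate lower bound.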

@JJun-Guo
Author

JJun-Guo commented Jun 6, 2023 via email

@HarikalarKutusu
Contributor

AFAIK, a rule of thumb is to train a model on the kind of data it will see in the wild. For a general-purpose ASR model that is subjected to everyday speech, I think it should include shorter recordings, because spontaneous speech/conversations include them extensively, as in short answers to questions: yes, no, ok, fine, etc., or "What do you want?" => "Tea..."

I think it is best to have more-or-less evenly distributed durations (a flat curve), and thus sentence lengths. One could work on improving their Common Voice dataset to remedy peaks in the distribution.

I created webapps where people can examine their datasets in more detail, which also helps in this area - for all CV languages.
For example, this is the duration distribution of CV 13.0 Turkish validated recordings:

[image: duration distribution chart]

And this is the distribution in text corpus:
[image: text corpus distribution chart]

Because we had few CC0 sentence resources, we had to rely on volunteers writing common everyday sentences, which are short and dropped the average recording duration to 3.6 secs - from around 4 secs. We need to remedy this issue...

You can check your language from here:
https://analyzer.cv-toolbox.web.tr/

You can also check the overall changes in time here:
https://metadata.cv-toolbox.web.tr/

@HarikalarKutusu
Contributor

If you are working on the cv-sentence-extractor rules (first run):

Getting longer sentences is better, I think. It is easier to get shorter sentences from other sources. Once the extractor has taken sentences from an article, that article is done.

Some points on this:

  • As stated, state-of-the-art models train better on recordings of 5 secs or longer.
  • Most languages in CV have an average of 4-6 secs.
  • A longer sentence will result in a longer recording.
  • And finally, what matters is the total duration of the training/fine-tuning set.
  • Instead of getting 3000 sentences from 1000 articles with average 4 sec, if you take so that the average is 8, the duration will double.
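The arithmetic behind the last point can be checked with a few lines. This is only an illustration: it assumes every sentence is recorded exactly once, and `total_hours` is a hypothetical helper name.

```python
def total_hours(sentence_count: int, avg_seconds: float) -> float:
    """Total audio duration in hours if every sentence is recorded once."""
    return sentence_count * avg_seconds / 3600

# Same 3000 sentences (3 per article from 1000 articles), two average durations.
short_avg = total_hours(3000, 4.0)
long_avg = total_hours(3000, 8.0)
print(f"{short_avg:.2f} h vs {long_avg:.2f} h")  # prints "3.33 h vs 6.67 h"
```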

@MichaelKohler
Member

Instead of getting 3000 sentences from 1000 articles with average 4 sec, if you take so that the average is 8, the duration will double.

Not wrong, but it might be risky without proper testing. Note that if the Sentence Extractor can't find 3 sentences with the required length, it will not try again with fewer words; it will just use what it got and continue on to the next article. Of course, with proper analysis of the source it would be possible to fully optimize this.

@HarikalarKutusu
Contributor

@MichaelKohler, can this be made adaptive? I mean, not to put an absolute minimum, but to set a "requested_minimum"; if the 3 sentences are not found, fill the remaining slots with shorter ones...
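To make the idea concrete, here is a minimal Python sketch. It is not how cv-sentence-extractor is implemented; `pick_sentences` and its parameters are hypothetical, and it assumes the candidates have already passed all other rules.

```python
import random

def pick_sentences(candidates, requested_min_words, max_per_article=3, rng=None):
    """Pick up to max_per_article sentences, preferring ones that meet the soft minimum."""
    rng = rng or random.Random()

    def word_count(sentence):
        return len(sentence.split())

    # First pool: sentences that satisfy the requested (soft) minimum.
    preferred = [s for s in candidates if word_count(s) >= requested_min_words]
    # Second pool: everything shorter, used only to fill remaining slots.
    shorter = [s for s in candidates if word_count(s) < requested_min_words]

    # Random order within each pool (an assumption of this sketch).
    rng.shuffle(preferred)
    rng.shuffle(shorter)

    picked = preferred[:max_per_article]
    picked += shorter[: max_per_article - len(picked)]
    return picked

article = [
    "one two three four five six",
    "one two",
    "one two three four five six seven",
    "one",
]
print(pick_sentences(article, requested_min_words=5))
```

With this article, both long sentences are always picked and one of the two short ones fills the third slot.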

@MichaelKohler
Member

MichaelKohler commented Jun 7, 2023

Yes, certainly would be an option, but that would need to be implemented. Overall this would mean going over the sentences multiple times for the case where it won't find enough sentences the first time, but probably not such a big hit on performance overall. In the end, for development purposes that won't matter and for the final run it's fine as well as that runs in the GitHub Action.

@HarikalarKutusu
Contributor

As you know, working on this was on my to-do list, if only I can get really good results... I'll look into this. E.g., sorting sentences by length can help performance.

@MichaelKohler
Member

sorting sentences by length can help performance.

Mh, this made me think. Now I wonder if the legal requirement is just "maximum 3 sentences per article" or if there could be issues if we always pick the 3 longest sentences. In some articles the longest 3 sentences might be the majority of content. Probably something that would need to be verified just to make sure. To be clear: I only ever knew about the "maximum 3 sentences per article" without any further restrictions, but I can't guarantee that this is exactly what the lawyers said.

@HarikalarKutusu
Contributor

HarikalarKutusu commented Jun 7, 2023

Very good point... But this is how it works now, isn't it? So, as of now, if an article has 3 sentences, they are all taken if they match the rules.
One could add a check so that the char count of the selected (rule-verified) sentences is at most, say, 50% of the article's total.
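Such a check could look roughly like this. It is a sketch only: `within_share_cap` is a hypothetical name, the 50% value is just the example figure from above, and no legal threshold is implied.

```python
def within_share_cap(selected, article_sentences, max_share=0.5):
    """Return True if the selected sentences make up at most max_share of the article's characters."""
    total_chars = sum(len(s) for s in article_sentences)
    if total_chars == 0:
        return False  # nothing to take from an empty article
    selected_chars = sum(len(s) for s in selected)
    return selected_chars / total_chars <= max_share

article = ["aaaa", "bbbb", "cc"]  # 10 characters in total
print(within_share_cap(["aaaa"], article))          # 4 of 10 chars
print(within_share_cap(["aaaa", "bbbb"], article))  # 8 of 10 chars
```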

@MichaelKohler
Member

Right now it's fully random, but rejecting what does not fit the rules. So generally, by analyzing the full Wikipedia dump, you could optimize the minimum words rule to get the most words out. But that would be different from always taking the longest sentences.

Of course depending on the requirements additional rules can be added. At this point I don't even know if it would be a problem or not to do it that way.

@HarikalarKutusu
Contributor

As I mentioned above, with the state-of-the-art models and HW advancements, it is better to get longer audio, thus longer texts. A change in this repo towards this goal would be awesome. Especially because there is no going back once 3-4 word sentences are taken...

With longer sentences, duplicates/similarities will also drop substantially, and more of the possible vocabulary will go into the text corpus. I think the more common words are already in the corpora or can easily be added from other sources, but less frequent ones will be needed by everyone (if too-technical/problematic/hard-to-read ones are correctly ruled out).

If it is legally possible of course...

@MichaelKohler added the help wanted, question, and extract-improvements labels Jun 13, 2023
@MichaelKohler
Member

@jessicarose Similar to the other question I tagged you in, could you also check here whether we would, in theory, be allowed to always take the 3 longest sentences per article? Thanks!

@MichaelKohler removed the help wanted label Jun 19, 2023
@HarikalarKutusu
Contributor

Sorry to ping the issue...

I'm close to finalizing my work and I need to ask whether taking the longest three sentences will ever be possible - because there is no going back.

@HarikalarKutusu
Contributor

Is there a limit to the audio duration?

@JJun-Guo, the recording limit was increased to 15 seconds in Common Voice v1.114.2.

@MichaelKohler: Probably all rule files should be adapted to this change, including the defaults.

@MichaelKohler
Member

@HarikalarKutusu Thanks for keeping track of this. I agree. Do you know what the correct value for EN would be, so that we can set it as the default? And do you have time to reach out to all language contributors to get a new estimate? I'd be fine with one PR updating all the values, as I think it's a rather low-risk change. One thing to note is that some languages use characters and some use words.

@HarikalarKutusu
Contributor

HarikalarKutusu commented Mar 17, 2024

@MichaelKohler I think a 50% increase should be fine for both max words and max characters.
The problem is with the minimums. The new Common Voice "guideline" suggests 10-15 sec recording times, in line with the newer model architectures.
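For the maximums, the 50% figure is simply the ratio of the new 15 sec limit to the old 10 sec limit. A small sketch of scaling a rule set that way follows; the key names and starting values are placeholders for illustration, not the actual entries of any rules file.

```python
import math

def scale_limits(limits, old_seconds=10, new_seconds=15):
    """Scale each maximum by the ratio of the new to the old recording limit."""
    factor = new_seconds / old_seconds  # 15 / 10 = 1.5, i.e. a 50% increase
    return {name: math.floor(value * factor) for name, value in limits.items()}

# Placeholder values; each language community would start from its own rules.
print(scale_limits({"max_words": 14, "max_characters": 100}))
```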

With the new v17.0, I can add some character speed measurements, possibly per user, and their distribution to the Analyzer, so that one can, for example, see the 95th percentile coverage from those values. But that part should be handled by the communities, as you suggest.

The languages that have already run the cv-sentence-extractor most probably got shorter sentences and might like to increase that limit for re-runs, also taking into account the recently introduced rules.

I have time for PRs and posts in Discourse, but you might need to point to them in case somebody decides on a re-run...

@MichaelKohler
Member

but you might need to point to them in case somebody decides on a re-run...

I can try to keep this in mind :)

@HarikalarKutusu
Contributor

I opened a discussion here; your input would be very valuable:
https://discourse.mozilla.org/t/discussion-best-practices-steps-for-increased-recording-duration/129205
