
Training wav2letter++ streaming convnets (TDS + CTC) #101

Open · erksch opened this issue Apr 4, 2020 · 6 comments

erksch commented Apr 4, 2020

Hey!

First of all, I think your work is amazing and making all your models available is just so generous.

I checked out your German wav2letter model, and as far as I can tell from your train config (w2l_config_conv_glu_train.cfg), the acoustic model is based on conv_glu with the ASG criterion from the original wav2letter paper.

Facebook released its streaming_convnets recipe in January, which enables online speech recognition with streaming capability, and I would kill for a German model of that kind. Here is a link to the architecture file and the training config.

I want to train the acoustic model with the hardware resources I have available and updated German speech corpora (like the most recent Common Voice release with 500 hours of German speech).
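
For context, wav2letter++ training is driven by a gflags-style flagsfile. Below is a minimal sketch of what a streaming-convnets CTC config could look like; all paths and hyperparameter values are placeholders I made up for illustration, not the values from the official recipe:

```
# train.cfg -- hedged sketch; paths and hyperparameters are placeholders
--runname=streaming_convnets_de
--rundir=/path/to/rundir
--datadir=/path/to/lists
--train=train.lst
--valid=dev.lst
--arch=am_500ms_future_context.arch
--tokens=tokens.txt
--lexicon=lexicon.txt
--criterion=ctc
--mfsc=true
--filterbanks=80
--lr=0.01
--momentum=0.8
--maxgradnorm=0.5
--batchsize=8
--nthread=4
```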

Regarding your experience training a wav2letter model:

  • How many and what kind of GPUs do you use for training? (The wav2letter folks said here that they used 32 GPUs for training the streaming convnets acoustic model, which sounds a little bit insane; see the launch sketch after this list.)
  • How much RAM does the system need, or is the load primarily on the GPUs?
  • How long did the training of your wav2letter model take?
  • Are there any pitfalls when training with wav2letter?
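
On the multi-GPU point: as far as I understand, wav2letter++ scales across GPUs via MPI, so even with far fewer than 32 GPUs one can launch distributed training along these lines (binary path and process count are placeholders for my setup):

```
# one process per GPU; 4 GPUs here instead of Facebook's 32
mpirun -n 4 /path/to/wav2letter/build/Train train \
    --flagsfile=train.cfg \
    --enable_distributed=true
```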

Many thanks :)

gooofy (Owner) commented Apr 4, 2020 via email

erksch (Author) commented Apr 4, 2020

Thank you for your reply and the insights!
That's a lot of time 😅

Since you mention it, what about your language models? How long did training take, say, for the large order-6 German LM?
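
(For anyone reading along: if the LM is built with KenLM, an order-6 ARPA model would be estimated with something like the following; the pruning and memory settings here are my guesses, not necessarily what was used for this project.)

```
# estimate an order-6 LM, pruning singleton 5- and 6-grams to keep the size tractable
lmplz -o 6 --prune 0 0 0 0 1 1 -S 40% -T /tmp < corpus.txt > lm6.arpa
# convert to KenLM binary format for faster loading at decode time
build_binary lm6.arpa lm6.bin
```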

And if there are domain-specific words that I really want my speech recognition to know about, should I add examples of them to the speech corpora, or should I make sure those words are well represented in the language model text corpora? Or both? Or should the language model text corpus be identical to the speech corpus transcripts?
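
To make the domain-word question concrete: my understanding is that each word also needs an entry in the wav2letter decoder lexicon mapping it to its token spelling, roughly like this (assuming plain letter tokens with | as the word boundary; with the word-piece tokens of the streaming_convnets recipe the spellings would look different):

```
# lexicon.txt: one entry per line, the word followed by its token sequence
bundesnetzagentur	b u n d e s n e t z a g e n t u r |
fernwartung	f e r n w a r t u n g |
```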

Sorry for all the questions :D

svenha (Contributor) commented Apr 4, 2020

Hi Erik.

Your project sounds interesting. I have only one remark, about annotation quality: this problem came up several times in this project, and Guenter spent a lot of time correcting annotation problems in speech corpora. So if you include the latest Common Voice data set, which is very new, I would be cautious and try to spot problematic audio files and/or annotations.

Just curious: What WERs are you expecting?

Sven

erksch (Author) commented Apr 4, 2020

@svenha you're right. In theory, the Common Voice dataset has already been reviewed by its users, but I don't know whether that actually ensures the data's quality.
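
One cheap sanity filter I'm considering: the Common Voice TSV metadata carries per-clip review counts, so clips with a weak review margin can be dropped up front (column names as in the Common Voice release format; the thresholds here are arbitrary):

```python
import pandas as pd

# keep only clips with at least two positive reviews and no negative ones
df = pd.read_csv("validated.tsv", sep="\t")
clean = df[(df["up_votes"] >= 2) & (df["down_votes"] == 0)]
clean.to_csv("validated_clean.tsv", sep="\t", index=False)
print(f"kept {len(clean)} of {len(df)} clips")
```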

Regarding WER, I don't have concrete expectations. I'll compare it to a microphone streaming implementation with a Kaldi model (the German Zamia Kaldi model) and see which feels better and more robust.

gooofy (Owner) commented Apr 4, 2020 via email

lagidigu commented
@erksch did you have any success training a streaming convnet on the Mozilla dataset? I'm planning to attempt something similar.
