On what kind of datasets was the model trained? #38

Open
sraghuram90 opened this issue Oct 15, 2020 · 4 comments
Comments

@sraghuram90

Which datasets was this Kaldi Active Grammar model trained on?
If you included public datasets, could you name them?
Is the pretrained model you mentioned the Zamia speech model?

@JohnDoe02

I was also curious about this. According to here (cf. stage 2), it should be: LibriSpeech, TED-LIUM, Mozilla's Common Voice, Tatoeba, and TensorFlow's speech_commands.

@daanzu
Owner

daanzu commented Nov 1, 2020

Actually, daanzu_multi_en is a partial and unfinished training pipeline. I have ended up working with a heavily modified version of the Zamia pipeline. The datasets are:

  • Common Voice
  • Common Voice single word
  • Librispeech
  • LJ Speech
  • M-AILabs
  • Google Speech Commands
  • Tatoeba
  • TedLIUM3
  • Voxforge
  • A collection of TTS I generated

@zhouyong64

What about kaldi_model_daanzu_20211030-biglm? Was it also trained on these datasets?

@daanzu
Owner

daanzu commented Dec 12, 2021

@zhouyong64

Yes, the new model is trained on the same datasets. The major change is that it now includes models necessary for running g2p_en for local pronunciation generation.
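
For anyone wondering what local pronunciation generation with g2p_en looks like in general, here is a minimal sketch. It is not taken from the kaldi-active-grammar code and does not show how the model files above are wired in. It only assumes that a g2p_en `G2p()` instance can be called with text and returns a list of phoneme strings; the helper names `load_g2p_en` and `pronounce` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

# Any callable mapping text to a list of phoneme strings.
# A g2p_en `G2p()` instance has this shape.
G2pFn = Callable[[str], List[str]]


def load_g2p_en() -> G2pFn:
    """Hypothetical helper: import g2p_en lazily so the rest of the
    sketch works even when the package is not installed."""
    from g2p_en import G2p  # third-party package: pip install g2p_en
    return G2p()


def pronounce(words: Iterable[str], g2p: G2pFn) -> Dict[str, List[str]]:
    """Hypothetical helper: map each word to its phoneme sequence,
    dropping tokens that are not phonemes (spaces, punctuation)."""
    lexicon: Dict[str, List[str]] = {}
    for word in words:
        lexicon[word] = [p for p in g2p(word) if p.strip() and p[0].isalpha()]
    return lexicon
```

With g2p_en installed, `pronounce(["grammar"], load_g2p_en())` would return a one-entry dictionary of ARPABET-style phonemes. Whether and how those need to be mapped to the model's own phone set is not covered here.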
