Open-Speech-EkStep/vakyansh-models

Vakyansh Open Source Models

  1. Pretrained ASR Models
  2. Finetuned ASR Models
  3. Language Models
  4. Punctuation Models
  5. TTS Models
  6. Gender Classification Model
  7. Language Identification Models
  8. Interspeech 2021 ASR Models

Pretrained ASR Models

wav2vec2-code | nemo-code

| Pretrained Model | Description | Architecture | Hours |
|---|---|---|---|
| Vakyansh-Conformer-SSL | Pre-trained with the NeMo toolkit on 34,000 hours of unlabeled audio in 39 Indian languages. This includes 15,000 hours of news recordings available on the internet, and 10,000 hours of YouTube audio and other audio data. In addition, 9,000 hours of Indian English audio were taken from NPTEL lectures open sourced by AI4Bharat. This model was trained in collaboration with NVIDIA (NVIDIA Graphics Pvt Ltd). We thank NVIDIA for providing the compute resources to train this model. | Conformer-Large | 34,000 |
| CLSRIL-23 | Cross Lingual Speech Representations for Indic Languages. Contains 10,000 hours of training data from 23 Indic languages. Citation: https://arxiv.org/abs/2107.07402 | wav2vec2-Base | 10,000 |
| hindi_pretrained_4kh | Trained on 4,200 hours of Hindi data | wav2vec2-Base | 4,200 |
| kannada_pretrained_1400h | Trained on 1,400 hours of Kannada data | wav2vec2-XLSR | 1,400 |



Finetuned ASR Models

Conformer based models

Repo

| Language | Pretrained Model | Finetuned Model | Finetuned Hours | Arch |
|---|---|---|---|---|
| Hindi | Vakyansh Conformer SSL | hindi_large_ssl_2500 | 2,500 h | Large |
| Indian English | Vakyansh Conformer SSL | indian_en_large_ssl_700 | 700 h | Large |
| Kannada | Vakyansh Conformer SSL | kannada_large_ssl_1000 | 1,000 h | Large |
| Punjabi | Vakyansh Conformer SSL | punjabi_large_ssl_500 | 500 h | Large |
| Tamil | Vakyansh Conformer SSL | tamil_large_ssl_900 | 900 h | Large |



wav2vec2 based models

Repo

Citation: https://arxiv.org/abs/2203.16512

| Language | Pretrained Model | Finetuned Model | Dictionary | Single Model for Inference | Finetuned Hours | TS model |
|---|---|---|---|---|---|---|
| Hindi | CLSRIL-23 | him_4200 | dict | hindi_infer | 4,200 h | hindi_ts |
| Indian English | CLSRIL-23 | enm_700 | dict | english_infer | 700 h | english_ts |
| Kannada | CLSRIL-23 | knm_560 | dict | kannada_infer | 560 h | kannada_ts |
| Tamil | CLSRIL-23 | tam_250 | dict | tamil_infer | 250 h | tamil_ts |
| Bengali | CLSRIL-23 | bnm_200 | dict | bengali_infer | 200 h | bengali_ts |
| Nepali | CLSRIL-23 | nem_130 | dict | nepali_infer | 130 h | nepali_ts |
| Telugu | CLSRIL-23 | tem_100 | dict | telugu_infer | 100 h | telugu_ts |
| Gujarati | CLSRIL-23 | gum_100 | dict | gujarati_infer | 100 h | gujarati_ts |
| Marathi | CLSRIL-23 | mrm_100 | dict | marathi_infer | 100 h | marathi_ts |
| Odia | CLSRIL-23 | orm_100 | dict | odia_infer | 100 h | odia_ts |
| Sanskrit | CLSRIL-23 | sam_60 | dict | sanskrit_infer | 60 h | sanskrit_ts |
| Maithili | CLSRIL-23 | maim_50 | dict | maithili_infer | 50 h | maithili_ts |
| Urdu | CLSRIL-23 | urm_60h | dict | urdu_infer | 60 h | urdu_ts |
| Punjabi | CLSRIL-23 | pam_10h | dict | punjabi_infer | 10 h | punjabi_ts |
| Dogri | CLSRIL-23 | doi_55h | dict | dogri_infer | 55 h | dogri_ts |
| Malayalam | CLSRIL-23 | mlm_8h | dict | malayalam_infer | 8 h | malayalam_ts |
| Bhojpuri | CLSRIL-23 | bhom_60h | dict | bhojpuri_infer | 60 h | bhojpuri_ts |
| Assamese | CLSRIL-23 | asm_8h | dict | assamese_infer | 8 h | assamese_ts |



Language Models

Repo

The language models below are meant to be used together with the finetuned ASR models at decoding time.

| Language | Type | Lexicon | LM | Text Corpus |
|---|---|---|---|---|
| Hindi | kenlm 5-gram | hindi_lexicon | hindi_lm | hindi_text |
| Indian English | kenlm 5-gram | english_lexicon | english_lm | english_text |
| Kannada | kenlm 5-gram | kannada_lexicon | kannada_lm | kannada_text |
| Tamil | kenlm 5-gram | tamil_lexicon | tamil_lm | tamil_text |
| Bengali | kenlm 5-gram | bengali_lexicon | bengali_lm | bengali_text |
| Nepali | kenlm 5-gram | nepali_lexicon | nepali_lm | nepali_text |
| Telugu | kenlm 5-gram | telugu_lexicon | telugu_lm | telugu_text |
| Gujarati | kenlm 5-gram | gujarati_lexicon | gujarati_lm | gujarati_text |
| Marathi | kenlm 5-gram | marathi_lexicon | marathi_lm | marathi_text |
| Odia | kenlm 5-gram | odia_lexicon | odia_lm | odia_lm |
| Sanskrit | kenlm 5-gram | sanskrit_lexicon | sanskrit_lm | sanskrit_text |
| Maithili | kenlm 5-gram | maithili_lexicon | maithili_lm | maithili_text |
| Urdu | kenlm 5-gram | urdu_lexicon | urdu_lm | urdu_text |
| Punjabi | kenlm 5-gram | punjabi_lexicon | punjabi_lm | punjabi_text |
| Dogri | kenlm 5-gram | dogri_lexicon | dogri_lm | dogri_text |
| Malayalam | kenlm 5-gram | malayalam_lexicon | malayalam_lm | malayalam_text |
| Bhojpuri | kenlm 5-gram | bhojpuri_lexicon | bhojpuri_lm | bhojpuri_text |
| Rajasthani | kenlm 5-gram | rajasthani_lexicon | rajasthani_lm | rajasthani_text |
| Assamese | kenlm 5-gram | assamese_lexicon | assamese_lm | assamese_text |
| Hinglish | kenlm 5-gram | hinglish_lexicon | hinglish_lm | hinglish_text |
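
The README does not describe how these language models are combined with the acoustic models, so the following is only a generic sketch of one common approach: re-ranking ASR hypotheses by adding a weighted language model score to the acoustic score. The function names, the weights, and the toy scoring function standing in for a real 5-gram model are all hypothetical.

```python
from typing import Callable, List, Tuple


def rescore(
    hypotheses: List[Tuple[str, float]],
    lm_log_prob: Callable[[str], float],
    lm_weight: float = 0.5,
    word_bonus: float = 0.0,
) -> List[Tuple[str, float]]:
    """Re-rank (text, acoustic_log_prob) pairs using an external LM score.

    total = acoustic_log_prob + lm_weight * lm_log_prob(text) + word_bonus * n_words
    """
    rescored = []
    for text, acoustic_score in hypotheses:
        n_words = len(text.split())
        total = acoustic_score + lm_weight * lm_log_prob(text) + word_bonus * n_words
        rescored.append((text, total))
    # Highest combined score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def toy_lm(text: str) -> float:
    """Hypothetical stand-in for an n-gram LM: sums fixed per-word log scores."""
    word_scores = {"the": -1.0, "cat": -2.0, "sat": -2.0}
    return sum(word_scores.get(word, -10.0) for word in text.split())


# The acoustic model slightly prefers the second hypothesis,
# but the LM penalizes the unlikely word "sad" in this toy vocabulary.
hyps = [("the cat sat", -5.0), ("the cat sad", -4.5)]
ranked = rescore(hyps, toy_lm)
best_text = ranked[0][0]
print(best_text)
```

In practice `toy_lm` would be replaced by a call into a real language model, and `lm_weight` and `word_bonus` would be tuned on held-out data.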

Dataset Credits: We thank AI4Bharat for open sourcing the Indic-Corp dataset. Link. We modified the original data by tokenizing and removing duplicates.
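
As an illustration of the kind of preprocessing mentioned here (tokenizing and removing duplicates), a minimal sketch follows. The exact tokenization rules used for these corpora are not stated in this README, so the punctuation set and the function names below are assumptions.

```python
import re
from typing import Iterable, Iterator

# Assumed punctuation set: common ASCII marks plus the danda used in several Indic scripts.
_PUNCT_RE = re.compile(r'([।!?,.;:"()])')


def tokenize(line: str) -> str:
    """Separate punctuation from words and collapse runs of whitespace."""
    spaced = _PUNCT_RE.sub(r" \1 ", line)
    return " ".join(spaced.split())


def dedupe_tokenized(lines: Iterable[str]) -> Iterator[str]:
    """Yield each tokenized line once, in first-seen order, skipping empty lines."""
    seen = set()
    for line in lines:
        tokens = tokenize(line)
        if tokens and tokens not in seen:
            seen.add(tokens)
            yield tokens


corpus = ["Hello, world!", "Hello ,  world !", "", "नमस्ते दुनिया।"]
cleaned = list(dedupe_tokenized(corpus))
print(cleaned)
```

Deduplicating after tokenization means lines that differ only in spacing around punctuation are treated as the same sentence.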

Domain Specific Language Models

| Language | Type | Domain | Lexicon | LM | Text Corpus |
|---|---|---|---|---|---|
| English | kenlm 5-gram | Biomedical | bio_lexicon | bio_lm | bio_lm_eng_text |



Punctuation Models

Training Repo

Inference Repo

| Language | Model | Data |
|---|---|---|
| Hindi | hi.zip | hindi_data |
| Assamese | as.zip | assamese_data |
| Bengali | bn.zip | bengali_data |
| Gujarati | gu.zip | gujarati_data |
| Kannada | kn.zip | kannada_data |
| Malayalam | ml.zip | malayalam_data |
| Marathi | mr.zip | marathi_data |
| Odia | or.zip | odia_data |
| Punjabi | pa.zip | punjabi_data |
| Tamil | ta.zip | tamil_data |
| Telugu | te.zip | telugu_data |

Dataset Credits: We thank AI4Bharat for open sourcing the Indic-Corp dataset. Link. We modified the original data by tokenizing and removing duplicates.



TTS Models

The models below were trained using a combination of Glow-TTS and HiFi-GAN.

Repo

| Language | Language Code | Gender | glow ckpt | hifi-gan ckpt |
|---|---|---|---|---|
| Hindi | hi | Female | hi_0_glow | hi_0_hifi |
| Hindi | hi | Male | hi_1_glow | hi_1_hifi |
| Kannada | kn | Female | kn_0_glow | kn_0_1_hifi |
| Kannada | kn | Male | kn_1_glow | kn_0_1_hifi |
| Tamil | ta | Female | ta_0_glow | ta_0_1_hifi |
| Tamil | ta | Male | ta_1_glow | ta_0_1_hifi |
| Telugu | te | Female | te_0_glow | te_0_1_hifi |
| Telugu | te | Male | te_1_glow | te_0_1_hifi |
| Odia | or | Female | or_0_glow | or_0_1_hifi |
| Odia | or | Male | or_1_glow | or_0_1_hifi |
| Malayalam | ml | Female | ml_0_glow | ml_0_hifi |
| Malayalam | ml | Male | ml_1_glow | ml_1_hifi |
| Marathi | mr | Female | mr_0_glow | mr_1_hifi |
| Gujarati | gu | Male | gu_0_glow | gu_0_hifi |
| Bengali | bn | Female | bn_0_glow | bn_0_1_hifi |
| Bengali | bn | Male | bn_1_glow | bn_0_1_hifi |
| English | en | Female | en_0_glow | en_0_hifi |
| English | en | Male | en_1_glow | en_1_hifi |

Dataset Credits: We thank IITM for open sourcing the Indic-TTS dataset. Link
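
Each voice needs two checkpoints: a Glow-TTS model that turns text into a mel spectrogram, and a HiFi-GAN vocoder that turns the spectrogram into a waveform. Several voices share one vocoder checkpoint, so a small lookup can help pick the right pair. The sketch below only restates the TTS table in this section; the labels are the link names from that table, not file paths, and the helper name `checkpoints_for` is hypothetical.

```python
from typing import Dict, Optional, Tuple

# (language code, gender) -> (Glow-TTS label, HiFi-GAN label), copied from the TTS table.
TTS_CHECKPOINTS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("hi", "female"): ("hi_0_glow", "hi_0_hifi"),
    ("hi", "male"): ("hi_1_glow", "hi_1_hifi"),
    ("kn", "female"): ("kn_0_glow", "kn_0_1_hifi"),
    ("kn", "male"): ("kn_1_glow", "kn_0_1_hifi"),
    ("ta", "female"): ("ta_0_glow", "ta_0_1_hifi"),
    ("ta", "male"): ("ta_1_glow", "ta_0_1_hifi"),
    ("te", "female"): ("te_0_glow", "te_0_1_hifi"),
    ("te", "male"): ("te_1_glow", "te_0_1_hifi"),
    ("or", "female"): ("or_0_glow", "or_0_1_hifi"),
    ("or", "male"): ("or_1_glow", "or_0_1_hifi"),
    ("ml", "female"): ("ml_0_glow", "ml_0_hifi"),
    ("ml", "male"): ("ml_1_glow", "ml_1_hifi"),
    ("mr", "female"): ("mr_0_glow", "mr_1_hifi"),
    ("gu", "male"): ("gu_0_glow", "gu_0_hifi"),
    ("bn", "female"): ("bn_0_glow", "bn_0_1_hifi"),
    ("bn", "male"): ("bn_1_glow", "bn_0_1_hifi"),
    ("en", "female"): ("en_0_glow", "en_0_hifi"),
    ("en", "male"): ("en_1_glow", "en_1_hifi"),
}


def checkpoints_for(language_code: str, gender: str) -> Optional[Tuple[str, str]]:
    """Return the (glow, hifi-gan) labels for a voice, or None if it is not listed."""
    return TTS_CHECKPOINTS.get((language_code.lower(), gender.lower()))


print(checkpoints_for("kn", "Male"))
print(checkpoints_for("mr", "male"))
```

Returning `None` for unlisted voices (for example, Marathi male) keeps the gaps in the table explicit instead of guessing a checkpoint name.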



Gender Classification Model

Repo

| Type | Model Type | Model |
|---|---|---|
| Gender Classification | SVC | Model |



Language Identification Models

Repo

| Type | Model |
|---|---|
| Hindi_vs_Others | Model |
| Tamil_vs_Others | Model |



Interspeech 2021 ASR Models

Comp Link

| Language | Pretrained Model | Finetuned Model | Dictionary | Single Model for Inference |
|---|---|---|---|---|
| Telugu | CLSRIL-23 | te_40h_interspeech | dict | telugu_infer_interspeech |
| Tamil | CLSRIL-23 | ta_40h_interspeech | dict | tamil_infer_interspeech |
| Gujarati | CLSRIL-23 | gu_40h_interspeech | dict | gujarati_infer_interspeech |
| Hinglish | CLSRIL-23 | hinglish_interspeech | dict | hinglish_infer_interspeech |



Citation

If you use any of our resources, please cite the following article:

@misc{chadha2022vakyansh,
    title={Vakyansh: ASR Toolkit for Low Resource Indic languages},
    author={Harveen Singh Chadha and Anirudh Gupta and Priyanshi Shah and Neeraj Chhimwal and Ankur Dhuriya and Rishabh Gaur and Vivek Raghavan},
    year={2022},
    eprint={2203.16512},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

If you use the pretrained model (CLSRIL-23) please cite the following article:

@misc{gupta2021clsril23,
      title={CLSRIL-23: Cross Lingual Speech Representations for Indic Languages}, 
      author={Anirudh Gupta and Harveen Singh Chadha and Priyanshi Shah and Neeraj Chhimwal and Ankur Dhuriya and Rishabh Gaur and Vivek Raghavan},
      year={2021},
      eprint={2107.07402},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}