2019

In the paper we consider:

  • different architectures for acoustic modeling:
    • ResNet
    • TDS
    • Transformer
  • different training criteria:
    • Seq2Seq
    • CTC
  • different settings:
    • supervised LibriSpeech 1k hours
    • supervised LibriSpeech 1k hours + unsupervised LibriVox 57k hours (for LibriVox we generate pseudo-labels and use them as targets)
  • and different language models:
    • word-piece (ngram, ConvLM)
    • word-based (ngram, ConvLM, transformer)

Data preparation

Run the preparation of the data and auxiliary files (lexicon, token set, etc.), setting the necessary paths instead of [...]: data_dst is the path where the data will be stored, model_dst is the path where the auxiliary files will be stored.

pip install sentencepiece==0.1.82
python3 ../../utilities/prepare_librispeech_wp_and_official_lexicon.py --data_dst [...] --model_dst [...] --nbest 10 --wp 10000
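For example, a minimal sketch tying these paths to the $MODEL_DST variable used below (the paths are hypothetical placeholders; any writable locations work):

# hypothetical placeholder paths
export DATA_DST=/path/to/librispeech_data
export MODEL_DST=/path/to/librispeech_aux
pip install sentencepiece==0.1.82
python3 ../../utilities/prepare_librispeech_wp_and_official_lexicon.py --data_dst "$DATA_DST" --model_dst "$MODEL_DST" --nbest 10 --wp 10000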

Besides the data, auxiliary files for acoustic and language model training/evaluation will be generated:

cd $MODEL_DST
tree -L 2
.
├── am
│   ├── librispeech-train-all-unigram-10000.model
│   ├── librispeech-train-all-unigram-10000.tokens
│   ├── librispeech-train-all-unigram-10000.vocab
│   ├── librispeech-train+dev-unigram-10000-nbest10.lexicon
│   ├── librispeech-train-unigram-10000-nbest10.lexicon
│   └── train.txt
└── decoder
    ├── 4-gram.arpa
    ├── 4-gram.arpa.lower
    └── decoder-unigram-10000-nbest10.lexicon
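
As a quick sanity check one can inspect the generated files, for example (a minimal sketch; the file names come from the tree above, the exact contents depend on the trained SentencePiece model):

# the tokens file should contain on the order of 10000 word-piece tokens
wc -l $MODEL_DST/am/librispeech-train-all-unigram-10000.tokens
# each lexicon line maps a word to one of its word-piece spellings
head -n 3 $MODEL_DST/decoder/decoder-unigram-10000-nbest10.lexicon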

Instructions to reproduce training and decoding

  • To reproduce acoustic model training on LibriSpeech (1k hours) and beam-search decoding of these models, check the librispeech directory.
  • Details on pseudo-label preparation are in the lm_corpus_and_PL_generation directory (the raw LM corpus, which has no intersection with the LibriVox data, is prepared in raw_lm_corpus).
  • To reproduce acoustic model training on LibriSpeech 1k hours + unsupervised LibriVox data (with generated pseudo-labels) and beam-search decoding of these models, check the librivox directory.
  • Details on language model training can be found in the lm directory.
  • Beam dumps for the best models and beam rescoring can be found in the rescoring directory.
  • The analysis of disentangling acoustic and linguistic representations (TTS and segmentation experiments) is in lm_analysis.

Tokens and Lexicon sets

Lexicon | Tokens | Beam-search lexicon | WP tokenizer model

The tokens and lexicon files generated in $MODEL_DST/am/ and $MODEL_DST/decoder/ are the same as those in the table.

Pre-trained acoustic models

Below is information about the pre-trained acoustic models, which one can use, for example, to reproduce the decoding step.

Dataset | Acoustic model dev-clean | Acoustic model dev-other
LibriSpeech | Resnet CTC clean | Resnet CTC other
LibriSpeech + LibriVox | Resnet CTC clean | Resnet CTC other
LibriSpeech | TDS CTC clean | TDS CTC other
LibriSpeech + LibriVox | TDS CTC clean | TDS CTC other
LibriSpeech | Transformer CTC clean | Transformer CTC other
LibriSpeech + LibriVox | Transformer CTC clean | Transformer CTC other
LibriSpeech | Resnet S2S clean | Resnet S2S other
LibriSpeech + LibriVox | Resnet S2S clean | Resnet S2S other
LibriSpeech | TDS Seq2Seq clean | TDS Seq2Seq other
LibriSpeech + LibriVox | TDS Seq2Seq clean | TDS Seq2Seq other
LibriSpeech | Transformer Seq2Seq clean | Transformer Seq2Seq other
LibriSpeech + LibriVox | Transformer Seq2Seq clean | Transformer Seq2Seq other

Pre-trained language models

LM type | Language model | Vocabulary | Architecture | LM fairseq | Dict fairseq
ngram word | 4-gram | - | - | - | -
ngram wp | 6-gram | - | - | - | -
GCNN word | GCNN | vocabulary | Archfile | fairseq | fairseq dict
GCNN wp | GCNN | vocabulary | Archfile | fairseq | fairseq dict
Transformer | - | - | - | fairseq | fairseq dict

To reproduce the decoding step from the paper, download these models into $MODEL_DST/am/ and $MODEL_DST/decoder/, respectively.
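
A minimal sketch of where the downloaded files go (the actual URLs are the links in the tables above and are not reproduced here; the output file names are hypothetical placeholders):

# acoustic models go to $MODEL_DST/am/, language models and decoder files go to $MODEL_DST/decoder/
wget -O "$MODEL_DST/am/am_transformer_ctc.bin" "<acoustic model link from the table>"
wget -O "$MODEL_DST/decoder/lm_wp_6gram.bin" "<language model link from the table>"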

Non-overlap LM corpus (the official LibriSpeech LM corpus with the LibriVox data excluded)

One can use the prepared corpus to train an LM for generating pseudo-labels on the LibriVox data: raw corpus, normalized corpus, and a 4-gram LM with a 200k vocabulary.
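
If one prefers to retrain the n-gram LM on this corpus instead of downloading the provided one, here is a minimal sketch with KenLM (assuming the KenLM binaries are installed and the normalized corpus is saved as corpus.txt; both file names are placeholders):

# train a 4-gram LM on the non-overlap corpus and binarize it for faster loading
lmplz -o 4 < corpus.txt > lm_nonoverlap_4gram.arpa
build_binary lm_nonoverlap_4gram.arpa lm_nonoverlap_4gram.bin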

Generated pseudo-labels used in the paper

We also open-sourced the generated pseudo-labels on which we trained our models: pl and pl with overlap. (Make sure to fix the prefixes of the file names in the lists; right now they are set to /root/librivox.)
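
A minimal sketch of fixing the prefixes (assuming the downloaded lists are plain-text files with absolute audio paths; the list file name is a hypothetical placeholder):

# point the hard-coded /root/librivox prefix to the actual location of the LibriVox audio
sed -i "s|/root/librivox|$DATA_DST/librivox|g" librivox_pseudo_labels.lst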

Citation

@article{synnaeve2019end,
  title={End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures},
  author={Synnaeve, Gabriel and Xu, Qiantong and Kahn, Jacob and Grave, Edouard and Likhomanenko, Tatiana and Pratap, Vineel and Sriram, Anuroop and Liptchinsky, Vitaliy and Collobert, Ronan},
  journal={arXiv preprint arXiv:1911.08460},
  year={2019}
}