uiuc-sst/asr24

Well within 24 hours, transcribe 40 hours of recorded speech in a surprise language.

Build an ASR for a surprise language L from a pre-trained acoustic model, an L pronunciation dictionary, and an L language model. This approach converts phones directly into L words. It is less noisy than using multiple cross-trained ASRs to produce English words, from which phone strings are extracted, merged by PTgen, and reconstituted into L words.

A full description with performance measurements is on arXiv, and in:
M. Hasegawa-Johnson, L. Rolston, C. Goudeseune, G. A. Levow, and K. Kirchhoff,
Grapheme-to-phoneme transduction for cross-language ASR, Statistical Language and Speech Processing, pp. 3–19, 2020.

Install software:

Kaldi

If you don't already have a version of Kaldi newer than 2016 Sep 30, get and build it following the instructions in its INSTALL files.

    git clone https://github.com/kaldi-asr/kaldi
    cd kaldi/tools; make -j $(nproc)
    cd ../src; ./configure --shared && make depend -j $(nproc) && make -j $(nproc)
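
A quick way to confirm the build succeeded is to check for the decoder binary that later steps call (still from kaldi/src):

    # Should exist and be executable after a successful build.
    ls -l online2bin/online2-wav-nnet3-latgen-faster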

brno-phnrec

Put Brno U. of Technology's phoneme recognizer next to the usual s5 directory.

    sudo apt-get install libopenblas-dev libopenblas-base
    cd kaldi/egs/aspire
    git clone https://github.com/uiuc-sst/brno-phnrec.git
    cd brno-phnrec/PhnRec
    make

This repo

Put this next to the usual s5 directory.
(The package nodejs is for ./sampa2ipa.js.)

    sudo apt-get install nodejs
    cd kaldi/egs/aspire
    git clone https://github.com/uiuc-sst/asr24.git
    cd asr24

Extension of ASpIRE

    cd kaldi/egs/aspire/asr24
    wget -qO- http://dl.kaldi-asr.org/models/0001_aspire_chain_model.tar.gz | tar xz
    steps/online/nnet3/prepare_online_decoding.sh \
      --mfcc-config conf/mfcc_hires.conf \
      data/lang_chain exp/nnet3/extractor \
      exp/chain/tdnn_7b exp/tdnn_7b_chain_online
    utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test \
      exp/tdnn_7b_chain_online exp/tdnn_7b_chain_online/graph_pp

The tarball unpacks the subdirectories data and exp. Then prepare_online_decoding.sh builds, in exp/tdnn_7b_chain_online, the files phones.txt, tree, final.mdl, conf/, etc.
The last command, mkgraph.sh, can take 45 minutes (30 for CVTE Mandarin) and use a lot of memory, because it calls fstdeterminizestar on a large language model, as Dan Povey explains.
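
As a sanity check, these files (all named above or used by the decoding command below) should now exist:

    ls exp/tdnn_7b_chain_online/final.mdl exp/tdnn_7b_chain_online/phones.txt
    ls exp/tdnn_7b_chain_online/graph_pp/HCLG.fst exp/tdnn_7b_chain_online/graph_pp/words.txt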

  • Verify that it can transcribe English, in mono 16-bit 8 kHz .wav format. Either use the provided 8khz.wav, or sox MySpeech.wav -r 8000 -c 1 8khz.wav, or ffmpeg -i MySpeech.wav -acodec pcm_s16le -ac 1 -ar 8000 8khz.wav.
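
To double-check a converted file's format, soxi (installed with sox) reports each field:

    soxi -c 8khz.wav   # channels: expect 1
    soxi -r 8khz.wav   # sample rate: expect 8000
    soxi -b 8khz.wav   # bits per sample: expect 16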

(The scripts cmd.sh and path.sh say where to find kaldi/src/online2bin/online2-wav-nnet3-latgen-faster.)

    . cmd.sh && . path.sh
    online2-wav-nnet3-latgen-faster \
      --online=false  --do-endpointing=false \
      --frame-subsampling-factor=3 \
      --config=exp/tdnn_7b_chain_online/conf/online.conf \
      --max-active=7000 \
      --beam=15.0  --lattice-beam=6.0  --acoustic-scale=1.0 \
      --word-symbol-table=exp/tdnn_7b_chain_online/graph_pp/words.txt \
      exp/tdnn_7b_chain_online/final.mdl \
      exp/tdnn_7b_chain_online/graph_pp/HCLG.fst \
      'ark:echo utterance-id1 utterance-id1|' \
      'scp:echo utterance-id1 8khz.wav|' \
      'ark:/dev/null'

CVTE Mandarin

  • Get the Mandarin chain model (3.4 GB, about 10 minutes). This makes a subdir cvte/s5, containing a words.txt, HCLG.fst, and final.mdl.
    wget -qO- http://kaldi-asr.org/models/0002_cvte_chain_model.tar.gz | tar xz
    steps/online/nnet3/prepare_online_decoding.sh \
      --mfcc-config conf/mfcc_hires.conf \
      data/lang_chain exp/nnet3/extractor \
      exp/chain/tdnn_7b cvte/s5/exp/chain/tdnn
    utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test \
      cvte/s5/exp/chain/tdnn cvte/s5/exp/chain/tdnn/graph_pp

For each language L, build an ASR:

Get raw text.

  • Into $L/train_all/text put word strings in L (scraped from wherever), roughly 10 words per line, at most 500k lines. These may be quite noisy, because they will be cleaned up anyway.
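
One possible way to reflow scraped text into that shape (a sketch only; scraped.txt is a hypothetical input file):

    # One word per line, drop empty lines, regroup 10 words per line, cap at 500k lines.
    tr -s '[:space:]' '\n' < scraped.txt | grep -v '^$' \
      | awk '{printf "%s%s", $0, (NR % 10 ? " " : "\n")} END {if (NR % 10) print ""}' \
      | head -n 500000 > $L/train_all/text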

Get a G2P.

  • Into $L/train_all/g2aspire.txt put a G2P: a few hundred lines, each containing grapheme(s), whitespace, and space-delimited Aspire-style phones.
    If it has CR line terminators, convert them to standard ones in vi with :%s/^M/\r/g, typing control-V before the ^M (or use the sed sketch after this list).
    If it starts with a BOM, remove it: vi -b g2aspire.txt, and just x that character away.

  • If you need to build the G2P, ./g2ipa2asr.py $L_wikipedia_symboltable.txt aspire2ipa.txt phoibletable.csv > $L/train_all/g2aspire.txt.
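
For a non-interactive alternative to the vi edits above, GNU sed can make the same fixes (a sketch; the sample entries are hypothetical, not from any real G2P):

    # Each line: grapheme(s), whitespace, space-delimited Aspire-style phones, e.g.
    #   ch   ch
    #   ai   ay
    sed -i 's/\r$//' $L/train_all/g2aspire.txt                # CR line terminators
    sed -i '1s/^\xEF\xBB\xBF//' $L/train_all/g2aspire.txt     # leading UTF-8 BOM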

Build an ASR.

  • ./run.sh $L makes an L-customized HCLG.fst.
  • To instead use a prebuilt LM, ./run_from_wordlist.sh $L. See that script for usage.

Transcribe speech:

Get recordings.

On ifp-serv-03.ifp.illinois.edu, get LDC speech and convert it to a flat dir of 8 kHz .wav files. First cd to the corpus, e.g. one of:

    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Russian/LDC2016E111/RUS_20160930
    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Tamil/TAM_EVAL_20170601/TAM_EVAL_20170601
    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Uzbek/LDC2016E66/UZB_20160711

    mkdir /tmp/8k
    for f in */AUDIO/*.flac; do sox "$f" -r 8000 -c 1 /tmp/8k/"$(basename "${f%.*}").wav"; done
    tar cf /workspace/ifp-53_1-data/eval/8k.tar -C /tmp 8k
    rm -rf /tmp/8k

For BABEL .sph files:

    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Assamese/LDC2016E02/conversational/training/audio
    tar cf /tmp/foo.tar BABEL*.sph
    scp /tmp/foo.tar ifp-53:/tmp

On ifp-53,

    mkdir ~/kaldi/egs/aspire/asr24/$L-8khz
    cd myTmpSphDir
    tar xf /tmp/foo.tar
    for f in *.sph; do ~/kaldi/tools/sph2pipe_v2.5/sph2pipe -p -f rif "$f" /tmp/a.wav; \
        sox /tmp/a.wav -r 8000 -c 1 ~/kaldi/egs/aspire/asr24/$L-8khz/"$(basename "${f%.*}").wav"; done
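
As an optional sanity check, the converted .wav count should match the .sph count:

    ls *.sph | wc -l
    ls ~/kaldi/egs/aspire/asr24/$L-8khz/*.wav | wc -l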

On the host that will run the transcription jobs, e.g. ifp-53:

    cd kaldi/egs/aspire/asr24
    wget -qO- http://www.ifp.illinois.edu/~camilleg/e/8k.tar | tar xf -
    mv 8k $L-8khz
  • ./mkscp.rb $L-8khz $(nproc) $L splits the ASR tasks into one job per CPU core, each job with roughly the same audio duration (see the sketch after this list).
    It reads $L-8khz, the dir of 8 kHz speech files.
    It makes $L-submit.sh.
  • ./$L-submit.sh launches these jobs in parallel.
  • After those jobs complete, collect the transcriptions with
    grep -h -e '^TAM_EVAL' $L/lat/*.log | sort > $L-scrips.txt (or ...^RUS_, ^BABEL_, etc.).
  • To sftp transcriptions to Jon May as elisa.tam-eng.eval-asr-uiuc.y3r1.v8.xml.gz, with timestamp June 11 and version 8,
    grep -h -e '^TAM_EVAL' tamil/lat/*.log | sort | sed -e 's/ /\t/' | ./hyp2jonmay.rb /tmp/jon-tam tam 20180611 8
    (If UTF-8 errors occur, simplify letters by appending to the sed command args such as -e 's/Ñ/N/g'.)
  • Collect each .wav file's n best transcriptions with
    cat $L/lat/*.ascii | sort > $L-nbest.txt.
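
For illustration, the duration-balanced split that mkscp.rb performs amounts to the following (a sketch of the idea, not the script itself): measure each file's duration with soxi, sort longest-first, and greedily assign each file to the job with the smallest total duration so far.

    njobs=$(nproc)
    for f in $L-8khz/*.wav; do printf '%s\t%s\n' "$(soxi -D "$f")" "$f"; done \
      | sort -rn \
      | awk -v n="$njobs" '{
          best = 1                          # job with the smallest load so far
          for (i = 2; i <= n; i++) if (load[i] < load[best]) best = i
          load[best] += $1                  # $1 is the duration in seconds
          print $2 >> ("job" best ".lst")   # append this file to that job list
        }'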

Special postprocessing.

If your transcriptions used nonsense English words, convert them to phones and then, via a trie or longest common substring, into L-words:

  • ./trie-$L.rb < trie1-scrips.txt > $L-trie-scrips.txt.
  • make multicore-$L; wait; grep ... > $L-lcs-scrips.txt.

Typical results.

RUS_20160930 was transcribed in 67 minutes, 13 MB/min, 12x real time.

A 3.1 GB subset of Assamese LDC2016E02 was transcribed in 440 minutes, 7 MB/min, 6.5x real time. (This may have been slower because it exhausted ifp-53's memory.)

Arabic/NEMLAR_speech/NMBCN7AR, 2.2 GB (40 hours), was transcribed in 147 minutes, 14 MB/min, 16x real time. (This may have been faster because it was a few long (half-hour) files instead of many brief ones.)

TAM_EVAL_20170601 was transcribed in 45 minutes, 21 MB/min, 19x real time.

Generating the lattices $L/lat/* took 1.04x as long for Russian, 0.93x as long (!) for Arabic, and 1.7x as long for Tamil.