
word timestamps of an each individual word in the inference #987

Open
vchagari opened this issue Aug 31, 2021 · 13 comments

@vchagari

Question:
Is there a way to accurately compute the timing of each individual word relative to the start of the audio?

Note:
I referred to the existing ticket #809, but it does not appear to contain a solution. Could you please point me to the right resource for finding accurate word-level timings?

Ticket I referred to: #809

Thanks

@tlikhomanenko
Contributor

Hey!

Here https://github.com/flashlight/flashlight/blob/master/flashlight/app/asr/Decode.cpp#L641 you have the per-frame token indices in rawTokenPrediction, so you can do any postprocessing and print the computed word timings there. The only thing to keep in mind when converting back to the original time scale is the model stride.
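
For illustration, here is a minimal sketch of that conversion, assuming a 10ms feature stride (--framestridems=10) and a total model stride read off your architecture (the names and the stride value below are illustrative, not taken from the codebase):

constexpr int kFeatureStrideMs = 10; // --framestridems
constexpr int kTotalModelStride = 8; // product of the strides inside your arch (assumed value here)

// Map an output-frame index from rawTokenPrediction to milliseconds of audio.
int frameToMs(int frameIdx) {
  return frameIdx * kTotalModelStride * kFeatureStrideMs;
}
// frameToMs(6) == 480: the 7th output frame starts 480ms into the audio.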

@vchagari
Author

vchagari commented Sep 2, 2021

Hi @tlikhomanenko,

Thank you very much for the response. Could you please explain how to convert the per-frame token indices to word timings? Please provide an example if possible.

I think my model has frame stride set to 10ms.

Thanks
Vamsi Chagari

@tlikhomanenko
Contributor

Well, I can only guide you through Decode.cpp (not the online inference, if that is what you are referring to).

One question before going further: what values did you set for the following flags?

FLAGS_criterion,
FLAGS_surround,
FLAGS_replabel,
FLAGS_usewordpiece,
FLAGS_wordseparator

Also, what is the model architecture (what stride is applied inside the model itself)?

@vchagari
Author

vchagari commented Sep 5, 2021

Hi @tlikhomanenko,

Okay, thank you. I am referring to Decode.cpp.

Please find the info below:
Flags:
Criterion is set to “ctc”
Surround is not set
Replabel is not set
Usewordpiece is set to “true”
Wordseparator is set to “_”

Please find the model architecture below:
https://github.com/flashlight/wav2letter/blob/master/recipes/streaming_convnets/librispeech/am_500ms_future_context.arch

Stride: I haven't changed anything; whatever the default value is in the streaming convnets recipe.

More Info:

  1. I used fork to create a new AM model from the base model (am_500ms_future_context_dev_other.bin) with my data.

  2. Configuration used for training:
    --runname=inference_2019
    --rundir=/data/set3/
    --datadir=/data/set3
    --tokens=/data/set3/librispeech-train-all-unigram-10000.tokens
    --arch=/data/set3/am_500ms_future_context.arch
    --train=lists/train.lst
    --valid=lists/dev.lst
    --lexicon=/data/set3/decoder-unigram-10000-nbest10-02-04-2021.lexicon
    --criterion=ctc
    --batchsize=8
    --lr=0.01
    --momentum=0.8
    --maxgradnorm=0.5
    --reportiters=1000
    --nthread=6
    --mfsc=true
    --usewordpiece=true
    --wordseparator=_
    --filterbanks=80
    --minisz=200
    --mintsz=2
    --maxisz=33000
    --enable_distributed=true
    --pcttraineval=1
    --minloglevel=0
    --logtostderr
    --onorm=target
    --sqnorm
    --localnrmlleftctx=300
    --lr_decay=10000
    --input=wav
    --itersave=true
    --iter=100000000

Thank you

@tlikhomanenko
Contributor

So from https://github.com/flashlight/flashlight/blob/master/flashlight/app/asr/Decode.cpp#L641 you have rawTokenPrediction, an array of token indices, one per frame. Then, looping over this array, you call

std::vector<std::string> tokens;
for (auto index : rawTokenPrediction) {
  tokens.push_back(tokenDict.getEntry(index));
}

Now tokens contains word pieces. Since your model arch has a total stride of 8 and the features originally use a 10ms stride, every frame now corresponds to 80ms. You can then parse the duplications in tokens and set the word timings accordingly.

For example, if tokens is ["_hel", "_hel", "_hel", "lo", "lo", "lo", "_world", "_world"], then you have "hello" from 0-480ms and "world" from 480-640ms.
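
A rough sketch of that parsing (illustrative code, not from the repo), assuming 80ms per output frame (total stride 8 × 10ms features), "_" marking the start of a new word, and no CTC blank frames (those come up later in this thread):

#include <string>
#include <vector>

struct WordTiming {
  std::string word;
  int startMs;
  int endMs;
};

// Group per-frame word pieces into words with start/end times.
std::vector<WordTiming> wordTimings(
    const std::vector<std::string>& frameTokens, int msPerFrame) {
  std::vector<WordTiming> out;
  for (size_t i = 0; i < frameTokens.size(); ++i) {
    const std::string& tok = frameTokens[i];
    int start = static_cast<int>(i) * msPerFrame;
    int end = static_cast<int>(i + 1) * msPerFrame;
    bool startsWord = tok.rfind("_", 0) == 0; // word-separator prefix
    if (i > 0 && tok == frameTokens[i - 1]) {
      out.back().endMs = end; // same piece repeated: extend the current word
    } else if (startsWord || out.empty()) {
      out.push_back({tok.substr(startsWord ? 1 : 0), start, end}); // new word
    } else {
      out.back().word += tok; // continuation word piece
      out.back().endMs = end;
    }
  }
  return out;
}

With {"_hel", "_hel", "_hel", "lo", "lo", "lo", "_world", "_world"} and msPerFrame = 80 this yields {"hello", 0, 480} and {"world", 480, 640}, matching the example above.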

@vchagari
Author

vchagari commented Sep 12, 2021

Thank you @tlikhomanenko for the response, i appreciate it.

I tested the decoder after making the code changes. The word timings I calculated based on the info in the "rawTokenPrediction" and "tokenDict" data structures do not seem to match the timings of the words in the audio.

Is the 80ms frame size correct? Please correct me if I am wrong. Also, what does the "#" represent in the tokenDict entries?

Here are the output details of the two audio files I tested with the decoder:

  1. test_2_4_wav16.wav: which has "see you later" in the audio
    Based on the info from the tokenDict data-structure and from the audio wav file:

    Word    No. of frames    Actual time it took in the audio
    see     8                ~300ms
    you     2                ~160ms
    later   2                ~300ms

    Decoder stdout output:
    tokenDict.getEntry(468)=_
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(87)=_see
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(14)=_you
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(665)=later
    tokenDict.getEntry(468)=_

    |T|: see you later
    |P|: see you later
    |t|: s e e _ y o u _ l a t e r
    |p|: _ s e e _ y o u _ l a t e r
    [sample: test_2_4_wav16.wav, WER: 0%, LER: 7.69231%, slice WER: 0%, slice LER: 7.69231%, decoded samples (thread 2): 1]
    I0912 12:20:32.203042 31342 Decode.cpp:742] ------

    Audio file timings screenshot:

[screenshot: See_you_later_screenshot]

  2. test_2_2_wav16.wav: which has "hello nancy" in the audio
    Based on the info from the tokenDict data-structure and from the audio wav file:

    Word    No. of frames    Actual time it took in the audio
    hello   8                ~470ms
    nancy   4                ~400ms

    Decoder stdout output:
    tokenDict.getEntry(468)=_
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(7960)=_hello
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(4408)=_nancy
    tokenDict.getEntry(9997)=#
    |T|: hello nancy
    |P|: hello nancy
    |t|: h e l l o _ n a n c y
    |p|: _ h e l l o _ n a n c y
    [sample: test_2_2_wav16.wav, WER: 0%, LER: 9.09091%, slice WER: 0%, slice LER: 9.09091%, decoded samples (thread 1): 1]

    Audio file timings Screenshots:

[screenshot: Screen Shot 2021-09-12 at 12 28 56 PM]

More info:

  1. Code Changes to decode.cpp to print the word pieces:
    auto rawWordPrediction = results[i].words;
    auto rawTokenPrediction = results[i].tokens;
    std::cout << "nTopHyps=" << nTopHyps << std::endl;
    std::vector<std::string> tokens_tmp;
    for (auto index : rawTokenPrediction) {
      std::string tmp_str = tokenDict.getEntry(index);
      tokens_tmp.push_back(tmp_str);
      std::cout << "tokenDict.getEntry(" << index << ")=" << tokenDict.getEntry(index) << std::endl;
    }

  2. Decoder cfg, lst files content, CMD and terminal output.

    Decoder cfg:
    --am=/data/set3/inference_2019/results/001_model_iter_02.bin
    --tokensdir=/data/set3/
    --tokens=librispeech-train-all-unigram-10000.tokens
    --lexicon=/data/set3/decoder-unigram-10000-nbest10-data-02-04-2021.lexicon
    --datadir=/data/tests/decoder_changes
    --test=listfile.lst
    --uselexicon=true
    --decodertype=wrd
    --lmtype=kenlm
    --lmweight=0.67470637680685
    --beamsize=100
    --beamsizetoken=100
    --beamthreshold=20
    --wordscore=0.62867952607587
    --silscore=0
    --eosscore=0
    --nthread_decoder=8
    --unkscore=-Infinity
    --smearing=max

    LIST FILE:
    test_2_4_wav16.wav /data/tests/decoder_changes/test_2_4_wav16.wav 895.0 see you later
    test_2_2_wav16.wav /data/tests/decoder_changes/test_2_2_wav16.wav 1020.0 hello nancy

    CMD:
    ./Decoder --flagsfile /data/tests/decoder_changes/decode.cfg --lm=/data/wav2letter_env/kenlm/build/bin/lm_o4.bin --show
    --showletters --sclite /data/tests/decoder_changes

    Decoder terminal output:

    I0912 12:20:27.416009 31342 Decode.cpp:106] Gflags after parsing
    --flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/data//set3/02_04_2021/inference_2019/results/001_model_iter_091.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=/data//set3/am_500ms_future_context.arch; --archdir=; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=8; --beamsize=100; --beamsizetoken=100; --beamthreshold=20; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/data/tests/decoder_changes; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/data/tests/decoder_changes/decode.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=wav; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=100000000; --itersave=true; --labelsmooth=0; --leftWindowSize=50; --lexicon=/data//set3/decoder-unigram-10000-nbest10-data-02-04-2021.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/data/wav2letter_env/kenlm/build/bin/lm_o4.bin; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0.67470637680684997; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.01; --lr_decay=10000; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxisz=33000; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=200; --minrate=3; --minsil=0; --mintsz=2; --momentum=0.80000000000000004; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=8; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=1; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=1000; --rightWindowSize=50; --rndv_filepath=; --rundir=/data//set3/02_04_2021; --runname=inference_2019; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=/data/tests/decoder_changes; --seed=0; --show=true; --showletters=true; --silscore=0; --smearing=max; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=tkn; --test=listfile.lst; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/data//set3/; --train=lists/train.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=lists/dev.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0.62867952607586997; --wordseparator=; --world_rank=0; --world_size=32; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; 
--logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
    I0912 12:20:27.418490 31342 Decode.cpp:127] Number of classes (network): 9998
    I0912 12:20:28.334156 31342 Decode.cpp:134] Number of words: 204170
    I0912 12:20:28.419312 31342 Decode.cpp:247] [Decoder] LM constructed.
    I0912 12:20:30.019878 31342 Decode.cpp:274] [Decoder] Trie planted.
    I0912 12:20:30.260799 31342 Decode.cpp:286] [Decoder] Trie smeared.
    I0912 12:20:30.665376 31342 W2lListFilesDataset.cpp:141] 2 files found.
    I0912 12:20:30.665395 31342 Utils.cpp:104] Filtered 0/2 samples
    I0912 12:20:30.665408 31342 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 2
    I0912 12:20:30.665736 31592 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 1
    I0912 12:20:30.665736 31598 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 2
    I0912 12:20:30.665737 31597 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
    I0912 12:20:30.665937 31590 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 4
    I0912 12:20:30.665993 31596 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 7
    I0912 12:20:30.666023 31595 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 3
    I0912 12:20:30.666026 31593 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 5
    I0912 12:20:30.666069 31594 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 6
    nTopHyps=1
    nTopHyps=1
    tokenDict.getEntry(468)=_

    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(7960)=_hello
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(4408)=_nancy
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(468)=_

    |T|: hello nancy
    |P|: hello nancy
    |t|: h e l l o _ n a n c y
    |p|: _ h e l l o _ n a n c y
    [sample: test_2_2_wav16.wav, WER: 0%, LER: 9.09091%, slice WER: 0%, slice LER: 9.09091%, decoded samples (thread 1): 1]
    nTopHyps=1
    tokenDict.getEntry(468)=_

    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(87)=_see
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(14)=_you
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(665)=later
    tokenDict.getEntry(468)=_

    |T|: see you later
    |P|: see you later
    |t|: s e e _ y o u _ l a t e r
    |p|: _ s e e _ y o u _ l a t e r
    [sample: test_2_4_wav16.wav, WER: 0%, LER: 7.69231%, slice WER: 0%, slice LER: 7.69231%, decoded samples (thread 2): 1]
    I0912 12:20:32.203042 31342 Decode.cpp:742] ------
    [Decode listfile.lst (2 samples) in 1.5376s (actual decoding time 0.0122s/sample) -- WER: 0, LER: 8.33333]

@tlikhomanenko
Contributor

Well, "#" is the CTC blank token. Also, if I remember correctly (https://github.com/flashlight/flashlight/blob/master/flashlight/lib/text/decoder/LexiconDecoder.cpp#L257, https://github.com/flashlight/flashlight/blob/master/flashlight/lib/text/decoder/LexiconDecoder.cpp#L27), you need to remove the first and last silence tokens, as we add them artificially during decoding. Then the duration in frames seems similar to what you have in the audio at 80ms per frame.
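
A rough sketch of that post-processing (illustrative names, not repo code): strip the two artificial silence frames, skip "#" blanks, and count frames per word, e.g. attributing blank frames to the most recently emitted word:

#include <string>
#include <utility>
#include <vector>

// Returns (word, frame count) pairs; multiply counts by ~80ms to get durations.
std::vector<std::pair<std::string, int>> framesPerWord(
    std::vector<std::string> frames) {
  if (frames.size() >= 2) {
    frames.erase(frames.begin()); // artificial leading silence
    frames.pop_back();            // artificial trailing silence
  }
  std::vector<std::pair<std::string, int>> counts;
  for (const auto& tok : frames) {
    if (tok == "#") { // CTC blank: no new token emitted on this frame
      if (!counts.empty()) {
        ++counts.back().second; // extend the current word by one frame
      }
      continue;
    }
    if (tok.rfind("_", 0) == 0) { // word separator: a new word begins
      counts.push_back({tok.substr(1), 1});
    } else if (!counts.empty()) {
      counts.back().first += tok; // word-piece continuation
      ++counts.back().second;
    }
  }
  return counts;
}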

I am not sure CTC is a good criterion for learning accurate alignment; it is better to use ASG, and moreover to do it with letter tokens rather than word pieces. The overall task is to predict the transcription correctly, not the alignment, even though alignment is still necessary to do ASR well.

@vchagari
Author

vchagari commented Sep 13, 2021

Hi @tlikhomanenko,

Thank you for your comments.

  1. Even if I remove the first and last silence tokens as you said, the timing of each individual word still does not match the timing in the actual audio.

    I assumed each tokenDict entry corresponds to one frame of the decoder output; please correct me if that is wrong. Let's take the "see you later" audio and its decoder output:

    Decoder Output:
    tokenDict.getEntry(468)=
    tokenDict.getEntry(9997)=# ----->1st Frame
    tokenDict.getEntry(9997)=# ----->2nd Frame
    tokenDict.getEntry(9997)=# ----->3rd Frame
    tokenDict.getEntry(9997)=# ----->4th Frame
    tokenDict.getEntry(9997)=# ----->5th Frame
    tokenDict.getEntry(9997)=# ----->6th Frame
    tokenDict.getEntry(87)=_see ----->7th Frame
    tokenDict.getEntry(9997)=# ----->8th Frame
    tokenDict.getEntry(14)=_you ----->9th Frame
    tokenDict.getEntry(9997)=# ----->10th Frame
    tokenDict.getEntry(665)=later ----->11th Frame
    tokenDict.getEntry(468)=

    Word    No. of frames    Time per tokenDict entries    Actual time it took in the audio
    see     7                7 * 80 = 560ms                ~300ms
    you     2                2 * 80 = 160ms                ~160ms
    later   2                2 * 80 = 160ms                ~300ms

    Please correct me if my calculations are wrong.

  2. Okay, thank you for the inputs. I have a few questions; please address them.

    i) Config changes needed in the train cfg:
    Do I need to train my model with "criterion" set to "asg" and "usewordpiece" set to "false" in the train cfg?
    Do I have to change any other configuration in the train cfg? Please let me know.

    Note: I will use fork to create a new AM model with my data from the base model (am_500ms_future_context_dev_other.bin)

    Train cfg:
    --runname=inference_2019
    --rundir=/data/set3/
    --datadir=/data/set3
    --tokens=/data/set3/librispeech-train-all-unigram-10000.tokens
    --arch=/data/set3/am_500ms_future_context.arch
    --train=lists/train.lst
    --valid=lists/dev.lst
    --lexicon=/data/set3/decoder-unigram-10000-nbest10-02-04-2021.lexicon
    --criterion=ctc --> Change this to asg
    --batchsize=8
    --lr=0.01
    --momentum=0.8
    --maxgradnorm=0.5
    --reportiters=1000
    --nthread=6
    --mfsc=true
    --usewordpiece=true ---> Change this to false
    --wordseparator=_
    --filterbanks=80
    --minisz=200
    --mintsz=2
    --maxisz=33000
    --enable_distributed=true
    --pcttraineval=1
    --minloglevel=0
    --logtostderr
    --onorm=target
    --sqnorm
    --localnrmlleftctx=300
    --lr_decay=10000
    --input=wav
    --itersave=true
    --iter=100000000

    ii) Config changes needed in the decode cfg file:
    In the decoder cfg, what settings do I have to change? Just "decodertype" set to "tkn"? Please let me know.
    Decoder cfg:
    --am=/data/set3/inference_2019/results/001_model_iter_02.bin
    --tokensdir=/data/set3/
    --tokens=librispeech-train-all-unigram-10000.tokens
    --lexicon=/data/set3/decoder-unigram-10000-nbest10-data-02-04-2021.lexicon
    --datadir=/data/tests/decoder_changes
    --test=listfile.lst
    --uselexicon=true
    --decodertype=wrd
    --lmtype=kenlm
    --lmweight=0.67470637680685
    --beamsize=100
    --beamsizetoken=100
    --beamthreshold=20
    --wordscore=0.62867952607587
    --silscore=0
    --eosscore=0
    --nthread_decoder=8
    --unkscore=-Infinity
    --smearing=max

@tlikhomanenko
Contributor

  • Please correct me if my calculations are wrong.

Yep, it looks correct to me. Again, the total duration after removing the first and last frames now looks correct. The problem with the segmentation is what I said about the model itself and word pieces.

About config changes, please have a look at this model for example https://github.com/flashlight/wav2letter/tree/main/recipes/lexicon_free or a more recent one with a transformer https://github.com/flashlight/wav2letter/tree/main/recipes/slimIPL - they are trained with letters. You need to change the tokens and lexicon and decrease the stride in the model itself (it is too large otherwise; it should be 2 or 3). You should not fork the model, because forking only resets the optimizer, not the model itself.

Also, I would first check without the decoder whether the Viterbi path gives a meaningful alignment; otherwise it is definitely a problem of the word-piece usage.

Also have a look at the tool here https://github.com/flashlight/flashlight/tree/master/flashlight/app/asr/tools/alignment to perform alignment without a language model.

@vchagari vchagari reopened this Sep 15, 2021
@vchagari
Author

vchagari commented Sep 16, 2021

Hi @tlikhomanenko,

I realized later that you might be referring to the total duration. Thank you for your comments. I did explore the other wav2letter recipes and found that the lexicon_free, conv_glu and learnable frontend recipes use the ASG criterion.

I also ran the decoder with the lexicon_free recipe pre-trained models (AM & LM) and files (tokens, lexicon and so on). Is the frame size used in the lexicon_free arch 10ms? (Arch file: https://github.com/flashlight/wav2letter/blob/main/recipes/lexicon_free/librispeech/am.arch). The "framestridems" is set to 10 in the base AM model, and I assume the stride is 1? If so, the word timings reported seem to be more accurate compared to the streaming convnets pre-trained recipe models/files.

A few questions I have; please address them:

  1. How do I determine the stride value from the model arch file, and how and where do I set it correctly?

  2. What is the default value of the "target" parameter? Do I have to explicitly set it to "ltr"?

  3. About the config changes you mentioned above, are you saying I can still use streaming convnets (the same architecture file) but have to change the tokens, lexicon, LM, stride value, and the train and decoder cfgs similar to the lexicon_free/conv_glu recipes, and then train the model from scratch on the LibriSpeech dataset along with my data set?

  4. If I can train as I mentioned in question 3 above, can I use the trained model for inference? Can the "streaming_tds_model_converter" tool convert a model trained with the streaming convnets architecture and the "ASG" criterion?

I also ran the AM alone (using the Test binary) for the streaming_convnets recipe with my models, same config as shown in the previous comments. The timing is almost the same as the decoder results (not correct).

Thank you for this. I used the "align" executable in wav2letter v0.2 with my streaming convnets recipe models, same config as mentioned in the previous comments in this thread. It didn't help actually; the timing was off, though I am not sure if I interpreted it correctly. Please see the screenshot below:
[screenshot: Screen Shot 2021-09-15 at 6 52 00 PM]

@tlikhomanenko
Contributor

Yep, correct: the stride of the arch is 1 and the data preprocessing uses a 10ms stride, so each frame after the network corresponds to 10ms of audio.

  • How do I determine the stride value from the model arch file, and how and where do I set it correctly?

Stride can be applied in conv and pooling layers, so you can simply check those types of layers to see whether they have striding.
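
For reference, the total stride is the product (not the sum) of the per-layer strides. A quick check, not repo code (the stride values below are the xStride values of the conv layers in am_500ms_future_context.arch, as listed later in this thread):

#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  std::vector<int> convStrides = {2, 2, 2, 1};
  int totalStride = std::accumulate(
      convStrides.begin(), convStrides.end(), 1, std::multiplies<int>());
  // totalStride == 2 * 2 * 2 * 1 == 8, so with --framestridems=10 each output
  // frame of this model covers 8 * 10 = 80ms of audio.
  std::cout << "total stride: " << totalStride << std::endl;
  return 0;
}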

  • What is the default value of the "target" parameter? Do I have to explicitly set it to "ltr"?

Where do you see this parameter? I don't see it in lexfree train config.

  • About the config changes you mentioned above, are you saying I can still use streaming convnets (the same architecture file) but have to change the tokens, lexicon, LM, stride value, and the train and decoder cfgs similar to the lexicon_free/conv_glu recipes, and then train the model from scratch on the LibriSpeech dataset along with my data set?

Yes, and potentially this should work, because I believe the arch itself is good and can be used with another token set and stride too. The main question is what you are doing and why. Do you need an online alignment model (streaming convnets are online in the sense of not using a large future context)? Otherwise you can retrain the lexfree model with SpecAugment and use it, or use the recent RASR transformer model with CTC, or retrain the RASR transformer model with ASG. There are a lot of options.

  • If I can train as I mentioned in question 3 above, can I use the trained model for inference? Can the "streaming_tds_model_converter" tool convert a model trained with the streaming convnets architecture and the "ASG" criterion?

If you use the same type of architecture and only change its params, like stride, number of layers, etc., then it should work (cc @vineelpratap). About inference decoding - not sure, cc @xuqiantong; maybe you need to change the decoding with respect to ASG (no blanks, but repetition tokens instead). I would test whether the RASR transformer model works well for alignment: if yes, then retrain streaming convnets with CTC as before but with letter tokens and stride 2-3 instead. This will give an online model which you can use, plus better alignment.
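
For context on "no blanks but repetition tokens": with ASG the token set typically includes replabels (e.g. "1", "2") that stand for extra repetitions of the previous token, so the timing post-processing differs from the CTC case. A hedged sketch of expanding them for letter tokens (illustrative only; check the replabel handling in the repo for the exact behavior):

#include <cctype>
#include <string>
#include <vector>

// Expand replabels back into repeated letters, e.g. {"h","e","l","1","o"} -> "hello".
std::string expandReplabels(const std::vector<std::string>& toks) {
  std::string out;
  for (const auto& t : toks) {
    if (t.size() == 1 && std::isdigit(static_cast<unsigned char>(t[0]))) {
      int reps = t[0] - '0'; // "1" repeats the previous letter once, "2" twice, ...
      if (!out.empty()) {
        out.append(reps, out.back());
      }
    } else {
      out += t;
    }
  }
  return out;
}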

@vchagari
Author

vchagari commented Sep 23, 2021

Thank you for your comments.
@tlikhomanenko: Please address my questions below.

  1. There are no convolution or pooling layers in the lexicon_free architecture, so the default stride is 1?
    Lexicon_Free arch: https://github.com/flashlight/wav2letter/blob/main/recipes/lexicon_free/librispeech/am.arch

  2. Okay, I did explore the architecture files and related docs. I have a question regarding the streaming convnets arch: isn't the total stride 7? Could you please tell me how you calculated the total stride as 8?
    Arch: https://github.com/flashlight/wav2letter/blob/main/recipes/streaming_convnets/librispeech/am_500ms_future_context.arch
    I see that there are 4 convolution layers in total with strides (2,1), (2,1), (2,1), (1,1). If I add the xStride values, it comes to 7. Please correct me if I am wrong.

  3. The parameter 'target' is set to "ltr" in the slimIPL recipe.
    Config: https://github.com/flashlight/wav2letter/blob/main/recipes/slimIPL/10h_supervised_slimipl.cfg

  4. Okay, thank you. Yes, I need an online alignment model. I did check the RASR transformer recipe models; the timings look much better.

  5. I have a question; @vineelpratap, @xuqiantong, please address it. I will have to train the streaming convnets from scratch with letter tokens and the CTC criterion. To reduce the total stride from 7 to 3 in the current streaming convnets architecture (https://github.com/flashlight/wav2letter/blob/main/recipes/streaming_convnets/librispeech/am_500ms_future_context.arch), do I need to remove a convolutional layer and set the remaining convolutional layers' stride to 1? Am I correct? Please let me know. Also, wouldn't that hurt the AM model accuracy?

@vchagari
Author

vchagari commented Oct 12, 2021

Hi @tlikhomanenko,

I changed the total stride from 7 to 3 in the streaming convnets recipe architecture and trained it from scratch on LibriSpeech data with letter tokens. The AM model seems to have trained fine, but the word timings reported by the AM/decoder are still bad. Please find the modified architecture below; could you please let me know if the changes I made make sense?

Note: I tried total strides of 4 and 2 as well, with no luck. I also experimented with removing the 2nd/3rd/4th PD+CN+R+DO+LN+TDS layer blocks in the corresponding training experiments.

Original Arch File:
https://github.com/flashlight/wav2letter/blob/main/recipes/streaming_convnets/librispeech/am_500ms_future_context.arch

Modified Arch File:
Changes:
Removed the SAUG layer.
Reduced the first conv layer stride to 1.
Removed the second PD, conv, R, DO, LN, TDS layers.
Reduced the third conv layer stride to 1.
###############################
V -1 NFEAT 1 0
PD 0 5 3
C2 1 15 10 1 1 1 0 0
R
DO 0.1
LN 1 2
TDS 15 9 80 0.1 0 1 0
TDS 15 9 80 0.1 0 1 0
PD 0 9 1
C2 15 23 12 1 1 1 0 0
R
DO 0.1
LN 1 2
TDS 23 11 80 0.1 0 1 0
TDS 23 11 80 0.1 0 1 0
TDS 23 11 80 0.1 0 1 0
TDS 23 11 80 0.1 0 0 0
PD 0 10 0
C2 23 27 11 1 1 1 0 0
R
DO 0.1
LN 1 2
TDS 27 11 80 0.1 0 0 0
TDS 27 11 80 0.1 0 0 0
TDS 27 11 80 0.1 0 0 0
TDS 27 11 80 0.1 0 0 0
TDS 27 11 80 0.1 0 0 0
RO 2 1 0 3
V 2160 -1 1 0
L 2160 NLABEL
V NLABEL 0 -1 1
