
word timestamps of an each individual word in the inference #987

Open
vchagari opened this issue Aug 31, 2021 · 13 comments

@vchagari

Question:
Is there a way to accurately compute the timing of each individual word relative to the start of the audio?

Note:
I referred to the existing ticket #809, but it does not appear to contain a solution. Could you please point me to the right resource for finding accurate word-level timings?

Ticket I referred to: #809

Thanks

@tlikhomanenko
Contributor

Hey!

Here https://github.com/flashlight/flashlight/blob/master/flashlight/app/asr/Decode.cpp#L641 you have the per-frame token indices in rawTokenPrediction, so you can do any postprocessing and print the computed word timings there. The only thing to keep in mind when converting back to the original time scale is the model stride.
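
For illustration, here is a minimal sketch of that conversion, assuming a 10ms feature stride (--framestridems=10) and a total model stride read off your architecture (the names and the stride value below are illustrative, not taken from the codebase):

constexpr int kFeatureStrideMs = 10; // --framestridems
constexpr int kTotalModelStride = 8; // product of the strides inside your arch (assumed value here)

// Map an output-frame index from rawTokenPrediction to milliseconds of audio.
int frameToMs(int frameIdx) {
  return frameIdx * kTotalModelStride * kFeatureStrideMs;
}
// frameToMs(6) == 480: the 7th output frame starts 480ms into the audio.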

@vchagari
Author

vchagari commented Sep 2, 2021

Hi @tlikhomanenko,

Thank you very much for the response. Could you please explain how to convert the per-frame token indices to word timings? Please provide an example if possible.

I think my model has frame stride set to 10ms.

Thanks
Vamsi Chagari

@tlikhomanenko
Contributor

Well, I can only guide you through Decode.cpp (not the online inference, if that is what you are referring to).

One question before going further: what values did you set for the following flags?

FLAGS_criterion,
FLAGS_surround,
FLAGS_replabel,
FLAGS_usewordpiece,
FLAGS_wordseparator

Also, what is the model architecture (what stride is applied inside the model itself)?

@vchagari
Author

vchagari commented Sep 5, 2021

Hi @tlikhomanenko,

Okay, thank you. I am referring to Decode.cpp.

Please find the info below:
Flags:
Criterion is set to “ctc”
Surround is not set
Replabel is not set
Usewordpiece is set to “true”
Wordseparator is set to “_”

Please find the model architecture below:
https://github.com/flashlight/wav2letter/blob/master/recipes/streaming_convnets/librispeech/am_500ms_future_context.arch

Stride: I haven't changed anything; whatever the default value is in the streaming convnets recipe.

More Info:

  1. I used fork to create a new AM model from the base model (am_500ms_future_context_dev_other.bin) with my data.

  2. Configuration used for training:
    --runname=inference_2019
    --rundir=/data/set3/
    --datadir=/data/set3
    --tokens=/data/set3/librispeech-train-all-unigram-10000.tokens
    --arch=/data/set3/am_500ms_future_context.arch
    --train=lists/train.lst
    --valid=lists/dev.lst
    --lexicon=/data/set3/decoder-unigram-10000-nbest10-02-04-2021.lexicon
    --criterion=ctc
    --batchsize=8
    --lr=0.01
    --momentum=0.8
    --maxgradnorm=0.5
    --reportiters=1000
    --nthread=6
    --mfsc=true
    --usewordpiece=true
    --wordseparator=_
    --filterbanks=80
    --minisz=200
    --mintsz=2
    --maxisz=33000
    --enable_distributed=true
    --pcttraineval=1
    --minloglevel=0
    --logtostderr
    --onorm=target
    --sqnorm
    --localnrmlleftctx=300
    --lr_decay=10000
    --input=wav
    --itersave=true
    --iter=100000000

Thank you

@tlikhomanenko
Contributor

So from https://github.com/flashlight/flashlight/blob/master/flashlight/app/asr/Decode.cpp#L641 you have rawTokenPrediction, an array of token indices, one per frame. Then, looping over this array, you call

std::vector<std::string> tokens;
for (auto index : rawTokenPrediction) {
  tokens.push_back(tokenDict.getEntry(index));
}

Now tokens contains word pieces. Since your model arch has a total stride of 8 and the features originally use a 10ms stride, every frame now corresponds to 80ms. You can then parse the duplications in tokens and set the word timings accordingly.

For example, if tokens is ["_hel", "_hel", "_hel", "lo", "lo", "lo", "_world", "_world"], then you have "hello" from 0-480ms and "world" from 480-640ms.
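
A rough sketch of that parsing (illustrative code, not from the repo), assuming 80ms per output frame (total stride 8 × 10ms features), "_" marking the start of a new word, and no CTC blank frames (those come up later in this thread):

#include <string>
#include <vector>

struct WordTiming {
  std::string word;
  int startMs;
  int endMs;
};

// Group per-frame word pieces into words with start/end times.
std::vector<WordTiming> wordTimings(
    const std::vector<std::string>& frameTokens, int msPerFrame) {
  std::vector<WordTiming> out;
  for (size_t i = 0; i < frameTokens.size(); ++i) {
    const std::string& tok = frameTokens[i];
    int start = static_cast<int>(i) * msPerFrame;
    int end = static_cast<int>(i + 1) * msPerFrame;
    bool startsWord = tok.rfind("_", 0) == 0; // word-separator prefix
    if (i > 0 && tok == frameTokens[i - 1]) {
      out.back().endMs = end; // same piece repeated: extend the current word
    } else if (startsWord || out.empty()) {
      out.push_back({tok.substr(startsWord ? 1 : 0), start, end}); // new word
    } else {
      out.back().word += tok; // continuation word piece
      out.back().endMs = end;
    }
  }
  return out;
}

With {"_hel", "_hel", "_hel", "lo", "lo", "lo", "_world", "_world"} and msPerFrame = 80 this yields {"hello", 0, 480} and {"world", 480, 640}, matching the example above.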

@vchagari
Author

vchagari commented Sep 12, 2021

Thank you @tlikhomanenko for the response, i appreciate it.

I tested the decoder after making the code changes. The word timings I calculated based on the info in the "rawTokenPrediction" and "tokenDict" data structures do not seem to match the timings of the words in the audio.

Is the 80ms frame size correct? Please correct me if I am wrong. Also, what does the "#" represent in the tokenDict entries?

Here are the output details of the two audio files I tested with the decoder:

  1. test_2_4_wav16.wav: which has "see you later" in the audio
    Based on the info from the tokenDict data-structure and from the audio wav file:

    Word    No. of frames    Actual time it took in the audio
    see     8                ~300ms
    you     2                ~160ms
    later   2                ~300ms

    Decoder stdout output:
    tokenDict.getEntry(468)=_
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(87)=_see
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(14)=_you
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(665)=later
    tokenDict.getEntry(468)=_

    |T|: see you later
    |P|: see you later
    |t|: s e e _ y o u _ l a t e r
    |p|: _ s e e _ y o u _ l a t e r
    [sample: test_2_4_wav16.wav, WER: 0%, LER: 7.69231%, slice WER: 0%, slice LER: 7.69231%, decoded samples (thread 2): 1]
    I0912 12:20:32.203042 31342 Decode.cpp:742] ------

    Audio file timings screenshot:

[screenshot: See_you_later_screenshot]

  2. test_2_2_wav16.wav: which has "hello nancy" in the audio
    Based on the info from the tokenDict data-structure and from the audio wav file:

    Word    No. of frames    Actual time it took in the audio
    hello   8                ~470ms
    nancy   4                ~400ms

    Decoder stdout output:
    tokenDict.getEntry(468)=_
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(7960)=_hello
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(4408)=_nancy
    tokenDict.getEntry(9997)=#
    |T|: hello nancy
    |P|: hello nancy
    |t|: h e l l o _ n a n c y
    |p|: _ h e l l o _ n a n c y
    [sample: test_2_2_wav16.wav, WER: 0%, LER: 9.09091%, slice WER: 0%, slice LER: 9.09091%, decoded samples (thread 1): 1]

    Audio file timings Screenshots:

[screenshot: Screen Shot 2021-09-12 at 12 28 56 PM]

More info:

  1. Code Changes to decode.cpp to print the word pieces:
    auto rawWordPrediction = results[i].words;
    auto rawTokenPrediction = results[i].tokens;
    std::cout << "nTopHyps=" << nTopHyps << std::endl;
    std::vector<std::string> tokens_tmp;
    for (auto index : rawTokenPrediction) {
      std::string tmp_str = tokenDict.getEntry(index);
      tokens_tmp.push_back(tmp_str);
      std::cout << "tokenDict.getEntry(" << index << ")=" << tokenDict.getEntry(index) << std::endl;
    }

  2. Decoder cfg, lst files content, CMD and terminal output.

    Decoder cfg:
    --am=/data/set3/inference_2019/results/001_model_iter_02.bin
    --tokensdir=/data/set3/
    --tokens=librispeech-train-all-unigram-10000.tokens
    --lexicon=/data/set3/decoder-unigram-10000-nbest10-data-02-04-2021.lexicon
    --datadir=/data/tests/decoder_changes
    --test=listfile.lst
    --uselexicon=true
    --decodertype=wrd
    --lmtype=kenlm
    --lmweight=0.67470637680685
    --beamsize=100
    --beamsizetoken=100
    --beamthreshold=20
    --wordscore=0.62867952607587
    --silscore=0
    --eosscore=0
    --nthread_decoder=8
    --unkscore=-Infinity
    --smearing=max

    LIST FILE:
    test_2_4_wav16.wav /data/tests/decoder_changes/test_2_4_wav16.wav 895.0 see you later
    test_2_2_wav16.wav /data/tests/decoder_changes/test_2_2_wav16.wav 1020.0 hello nancy

    CMD:
    ./Decoder --flagsfile /data/tests/decoder_changes/decode.cfg --lm=/data/wav2letter_env/kenlm/build/bin/lm_o4.bin --show
    --showletters --sclite /data/tests/decoder_changes

    Decoder terminal output:

    I0912 12:20:27.416009 31342 Decode.cpp:106] Gflags after parsing
    --flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/data//set3/02_04_2021/inference_2019/results/001_model_iter_091.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=/data//set3/am_500ms_future_context.arch; --archdir=; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=8; --beamsize=100; --beamsizetoken=100; --beamthreshold=20; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/data/tests/decoder_changes; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/data/tests/decoder_changes/decode.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=wav; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=100000000; --itersave=true; --labelsmooth=0; --leftWindowSize=50; --lexicon=/data//set3/decoder-unigram-10000-nbest10-data-02-04-2021.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/data/wav2letter_env/kenlm/build/bin/lm_o4.bin; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0.67470637680684997; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.01; --lr_decay=10000; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxisz=33000; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=200; --minrate=3; --minsil=0; --mintsz=2; --momentum=0.80000000000000004; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=8; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=1; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=1000; --rightWindowSize=50; --rndv_filepath=; --rundir=/data//set3/02_04_2021; --runname=inference_2019; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=/data/tests/decoder_changes; --seed=0; --show=true; --showletters=true; --silscore=0; --smearing=max; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=tkn; --test=listfile.lst; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/data//set3/; --train=lists/train.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=lists/dev.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0.62867952607586997; --wordseparator=; --world_rank=0; --world_size=32; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; 
--logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
    I0912 12:20:27.418490 31342 Decode.cpp:127] Number of classes (network): 9998
    I0912 12:20:28.334156 31342 Decode.cpp:134] Number of words: 204170
    I0912 12:20:28.419312 31342 Decode.cpp:247] [Decoder] LM constructed.
    I0912 12:20:30.019878 31342 Decode.cpp:274] [Decoder] Trie planted.
    I0912 12:20:30.260799 31342 Decode.cpp:286] [Decoder] Trie smeared.
    I0912 12:20:30.665376 31342 W2lListFilesDataset.cpp:141] 2 files found.
    I0912 12:20:30.665395 31342 Utils.cpp:104] Filtered 0/2 samples
    I0912 12:20:30.665408 31342 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 2
    I0912 12:20:30.665736 31592 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 1
    I0912 12:20:30.665736 31598 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 2
    I0912 12:20:30.665737 31597 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
    I0912 12:20:30.665937 31590 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 4
    I0912 12:20:30.665993 31596 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 7
    I0912 12:20:30.666023 31595 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 3
    I0912 12:20:30.666026 31593 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 5
    I0912 12:20:30.666069 31594 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 6
    nTopHyps=1
    nTopHyps=1
    tokenDict.getEntry(468)=_

    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(7960)=_hello
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(4408)=_nancy
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(468)=_

    |T|: hello nancy
    |P|: hello nancy
    |t|: h e l l o _ n a n c y
    |p|: _ h e l l o _ n a n c y
    [sample: test_2_2_wav16.wav, WER: 0%, LER: 9.09091%, slice WER: 0%, slice LER: 9.09091%, decoded samples (thread 1): 1]
    nTopHyps=1
    tokenDict.getEntry(468)=_

    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(87)=_see
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(14)=_you
    tokenDict.getEntry(9997)=#
    tokenDict.getEntry(665)=later
    tokenDict.getEntry(468)=_

    |T|: see you later
    |P|: see you later
    |t|: s e e _ y o u _ l a t e r
    |p|: _ s e e _ y o u _ l a t e r
    [sample: test_2_4_wav16.wav, WER: 0%, LER: 7.69231%, slice WER: 0%, slice LER: 7.69231%, decoded samples (thread 2): 1]
    I0912 12:20:32.203042 31342 Decode.cpp:742] ------
    [Decode listfile.lst (2 samples) in 1.5376s (actual decoding time 0.0122s/sample) -- WER: 0, LER: 8.33333]

@tlikhomanenko
Contributor

Well, "#" is the CTC blank token. Also, if I remember correctly (https://github.com/flashlight/flashlight/blob/master/flashlight/lib/text/decoder/LexiconDecoder.cpp#L257, https://github.com/flashlight/flashlight/blob/master/flashlight/lib/text/decoder/LexiconDecoder.cpp#L27), you need to remove the first and last silence tokens, as we add them artificially during decoding. Then the duration in frames seems similar to what you have in the audio at 80ms per frame.
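
A rough sketch of that post-processing (illustrative names, not repo code): strip the two artificial silence frames, skip "#" blanks, and count frames per word, e.g. attributing blank frames to the most recently emitted word:

#include <string>
#include <utility>
#include <vector>

// Returns (word, frame count) pairs; multiply counts by ~80ms to get durations.
std::vector<std::pair<std::string, int>> framesPerWord(
    std::vector<std::string> frames) {
  if (frames.size() >= 2) {
    frames.erase(frames.begin()); // artificial leading silence
    frames.pop_back();            // artificial trailing silence
  }
  std::vector<std::pair<std::string, int>> counts;
  for (const auto& tok : frames) {
    if (tok == "#") { // CTC blank: no new token emitted on this frame
      if (!counts.empty()) {
        ++counts.back().second; // extend the current word by one frame
      }
      continue;
    }
    if (tok.rfind("_", 0) == 0) { // word separator: a new word begins
      counts.push_back({tok.substr(1), 1});
    } else if (!counts.empty()) {
      counts.back().first += tok; // word-piece continuation
      ++counts.back().second;
    }
  }
  return counts;
}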

I am not sure CTC is a good criterion for learning accurate alignment; it is better to use ASG, and moreover to do it with letter tokens rather than word pieces. The overall task is to predict the transcription correctly, not the alignment, even though alignment is still necessary to do ASR well.

@vchagari
Author

vchagari commented Sep 13, 2021

Hi @tlikhomanenko,

Thank you for your comments.

  1. Even if I remove the first and last silence tokens as you said, the timing of each individual word still does not match the timing in the actual audio.

    I assumed each tokenDict entry corresponds to one frame of the decoder output; please correct me if that is wrong. Let's take the "see you later" audio and its decoder output:

    Decoder Output:
    tokenDict.getEntry(468)=
    tokenDict.getEntry(9997)=# ----->1st Frame
    tokenDict.getEntry(9997)=# ----->2nd Frame
    tokenDict.getEntry(9997)=# ----->3rd Frame
    tokenDict.getEntry(9997)=# ----->4th Frame
    tokenDict.getEntry(9997)=# ----->5th Frame
    tokenDict.getEntry(9997)=# ----->6th Frame
    tokenDict.getEntry(87)=_see ----->7th Frame
    tokenDict.getEntry(9997)=# ----->8th Frame
    tokenDict.getEntry(14)=_you ----->9th Frame
    tokenDict.getEntry(9997)=# ----->10th Frame
    tokenDict.getEntry(665)=later ----->11th Frame
    tokenDict.getEntry(468)=

    Word    No. of frames    Time per tokenDict entries    Actual time it took in the audio
    see     7                7 * 80 = 560ms                ~300ms
    you     2                2 * 80 = 160ms                ~160ms
    later   2                2 * 80 = 160ms                ~300ms

    Please correct me if my calculations are wrong.

  2. Okay, thank you for the inputs. I have a few questions; please address them.

    i) Config changes needed in the train cfg:
    Do I need to train my model with "criterion" set to "asg" and "usewordpiece" set to "false" in the train cfg?
    Do I have to change any other configuration in the train cfg? Please let me know.

    Note: I will use fork to create a new AM model with my data from the base model (am_500ms_future_context_dev_other.bin)

    Train cfg:
    --runname=inference_2019
    --rundir=/data/set3/
    --datadir=/data/set3
    --tokens=/data/set3/librispeech-train-all-unigram-10000.tokens
    --arch=/data/set3/am_500ms_future_context.arch
    --train=lists/train.lst
    --valid=lists/dev.lst
    --lexicon=/data/set3/decoder-unigram-10000-nbest10-02-04-2021.lexicon
    --criterion=ctc --> Change this to asg
    --batchsize=8
    --lr=0.01
    --momentum=0.8
    --maxgradnorm=0.5
    --reportiters=1000
    --nthread=6
    --mfsc=true
    --usewordpiece=true ---> Change this to false
    --wordseparator=_
    --filterbanks=80
    --minisz=200
    --mintsz=2
    --maxisz=33000
    --enable_distributed=true
    --pcttraineval=1
    --minloglevel=0
    --logtostderr
    --onorm=target
    --sqnorm
    --localnrmlleftctx=300
    --lr_decay=10000
    --input=wav
    --itersave=true
    --iter=100000000

    ii) Config changes needed in the decode cfg file:
    In the decoder cfg, what settings do I have to change? Just "decodertype" set to "tkn"? Please let me know.
    Decoder cfg:
    --am=/data/set3/inference_2019/results/001_model_iter_02.bin
    --tokensdir=/data/set3/
    --tokens=librispeech-train-all-unigram-10000.tokens
    --lexicon=/data/set3/decoder-unigram-10000-nbest10-data-02-04-2021.lexicon
    --datadir=/data/tests/decoder_changes
    --test=listfile.lst
    --uselexicon=true
    --decodertype=wrd
    --lmtype=kenlm
    --lmweight=0.67470637680685
    --beamsize=100
    --beamsizetoken=100
    --beamthreshold=20
    --wordscore=0.62867952607587
    --silscore=0
    --eosscore=0
    --nthread_decoder=8
    --unkscore=-Infinity
    --smearing=max

@tlikhomanenko
Contributor

  • Please correct me if my calculations are wrong.

Yep, it looks correct to me. Again, the total duration after removing the first and last frames now looks correct. The problem with the segmentation is what I said about the model itself and word pieces.

About config changes, please have a look at this model for example https://github.com/flashlight/wav2letter/tree/main/recipes/lexicon_free or a more recent one with a transformer https://github.com/flashlight/wav2letter/tree/main/recipes/slimIPL - they are trained with letters. You need to change the tokens and lexicon and decrease the stride in the model itself (it is too large otherwise; it should be 2 or 3). You should not fork the model, because forking only resets the optimizer, not the model itself.

Also, I would first check without the decoder whether the Viterbi path gives a meaningful alignment; otherwise it is definitely a problem of the word-piece usage.

Also have a look at the tool here https://github.com/flashlight/flashlight/tree/master/flashlight/app/asr/tools/alignment to perform alignment without a language model.

@vchagari vchagari reopened this Sep 15, 2021
@vchagari
Author

vchagari commented Sep 16, 2021

Hi @tlikhomanenko,

I realized later that you might be referring to the total duration. Thank you for your comments. I did explore the other wav2letter recipes and found that the lexicon_free, conv_glu and learnable frontend recipes use the ASG criterion.

I also ran the decoder with the lexicon_free recipe pre-trained models (AM & LM) and files (tokens, lexicon and so on). Is the frame size used in the lexicon_free arch 10ms? (Arch file: https://github.com/flashlight/wav2letter/blob/main/recipes/lexicon_free/librispeech/am.arch). The "framestridems" is set to 10 in the base AM model, and I assume the stride is 1? If so, the word timings reported seem to be more accurate compared to the streaming convnets pre-trained recipe models/files.

A few questions I have; please address them:

  1. How do I determine the stride value from the model arch file, and how and where do I set it correctly?

  2. What is the default value of the "target" parameter? Do I have to explicitly set it to "ltr"?

  3. About the config changes you mentioned above, are you saying I can still use streaming convnets (the same architecture file) but have to change the tokens, lexicon, LM, stride value, and the train and decoder cfgs similar to the lexicon_free/conv_glu recipes, and then train the model from scratch on the LibriSpeech dataset along with my data set?

  4. If I can train as I mentioned in question 3 above, can I use the trained model for inference? Can the "streaming_tds_model_converter" tool convert a model trained with the streaming convnets architecture and the "ASG" criterion?

I also ran the AM alone (using the Test binary) for the streaming_convnets recipe with my models, same config as shown in the previous comments. The timing is almost the same as the decoder results (not correct).

Thank you for this. I used the "align" executable in wav2letter v0.2 with my streaming convnets recipe models, same config as mentioned in the previous comments in this thread. It didn't help actually; the timing was off, though I am not sure if I interpreted it correctly. Please see the screenshot below:
[screenshot: Screen Shot 2021-09-15 at 6 52 00 PM]

@tlikhomanenko
Contributor

Yep, correct: the stride of the arch is 1 and the data preprocessing uses a 10ms stride, so each frame after the network corresponds to 10ms of audio.

  • How do I determine the stride value from the model arch file, and how and where do I set it correctly?

Stride can be applied in conv and pooling layers, so you can simply check those types of layers to see whether they have striding.
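
For reference, the total stride is the product (not the sum) of the per-layer strides. A quick check, not repo code (the stride values below are the xStride values of the conv layers in am_500ms_future_context.arch, as listed later in this thread):

#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  std::vector<int> convStrides = {2, 2, 2, 1};
  int totalStride = std::accumulate(
      convStrides.begin(), convStrides.end(), 1, std::multiplies<int>());
  // totalStride == 2 * 2 * 2 * 1 == 8, so with --framestridems=10 each output
  // frame of this model covers 8 * 10 = 80ms of audio.
  std::cout << "total stride: " << totalStride << std::endl;
  return 0;
}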

  • What is the default value of the "target" parameter? Do I have to explicitly set it to "ltr"?

Where do you see this parameter? I don't see it in lexfree train config.

  • About the config changes you mentioned above, are you saying I can still use streaming convnets (the same architecture file) but have to change the tokens, lexicon, LM, stride value, and the train and decoder cfgs similar to the lexicon_free/conv_glu recipes, and then train the model from scratch on the LibriSpeech dataset along with my data set?

Yes, and potentially this should work, because I believe the arch itself is good and can be used with another token set and stride too. The main question is what you are doing and why. Do you need an online alignment model (streaming convnets are online in the sense of not using a large future context)? Otherwise you can retrain the lexfree model with SpecAugment and use it, or use the recent RASR transformer model with CTC, or retrain the RASR transformer model with ASG. There are a lot of options.

  • If I can train as I mentioned in question 3 above, can I use the trained model for inference? Can the "streaming_tds_model_converter" tool convert a model trained with the streaming convnets architecture and the "ASG" criterion?

If you use the same type of architecture and only change its params, like stride, number of layers, etc., then it should work (cc @vineelpratap). About inference decoding - not sure, cc @xuqiantong; maybe you need to change the decoding with respect to ASG (no blanks, but repetition tokens instead). I would test whether the RASR transformer model works well for alignment: if yes, then retrain streaming convnets with CTC as before but with letter tokens and stride 2-3 instead. This will give an online model which you can use, plus better alignment.
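
For context on "no blanks but repetition tokens": with ASG the token set typically includes replabels (e.g. "1", "2") that stand for extra repetitions of the previous token, so the timing post-processing differs from the CTC case. A hedged sketch of expanding them for letter tokens (illustrative only; check the replabel handling in the repo for the exact behavior):

#include <cctype>
#include <string>
#include <vector>

// Expand replabels back into repeated letters, e.g. {"h","e","l","1","o"} -> "hello".
std::string expandReplabels(const std::vector<std::string>& toks) {
  std::string out;
  for (const auto& t : toks) {
    if (t.size() == 1 && std::isdigit(static_cast<unsigned char>(t[0]))) {
      int reps = t[0] - '0'; // "1" repeats the previous letter once, "2" twice, ...
      if (!out.empty()) {
        out.append(reps, out.back());
      }
    } else {
      out += t;
    }
  }
  return out;
}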

@vchagari
Author

vchagari commented Sep 23, 2021

Thank you for your comments.
@tlikhomanenko: Please address my questions below.

  1. There are no convolution or pooling layers in the lexicon_free architecture, so the default stride is 1?
    Lexicon_Free arch: https://github.com/flashlight/wav2letter/blob/main/recipes/lexicon_free/librispeech/am.arch

  2. Okay, I did explore the architecture files and related docs. I have a question regarding the streaming convnets arch: isn't the total stride 7? Could you please tell me how you calculated the total stride as 8?
    Arch: https://github.com/flashlight/wav2letter/blob/main/recipes/streaming_convnets/librispeech/am_500ms_future_context.arch
    I see that there are 4 convolution layers in total with strides (2,1), (2,1), (2,1), (1,1). If I add the xStride values, it comes to 7. Please correct me if I am wrong.

  3. The parameter 'target' is set to "ltr" in the slimIPL recipe.
    Config: https://github.com/flashlight/wav2letter/blob/main/recipes/slimIPL/10h_supervised_slimipl.cfg

  4. Okay, thank you. Yes, I need an online alignment model. I did check the RASR transformer recipe models; the timings look much better.

  5. I have a question; @vineelpratap, @xuqiantong, please address it. I will have to train the streaming convnets from scratch with letter tokens and the CTC criterion. To reduce the total stride from 7 to 3 in the current streaming convnets architecture (https://github.com/flashlight/wav2letter/blob/main/recipes/streaming_convnets/librispeech/am_500ms_future_context.arch), do I need to remove a convolutional layer and set the remaining convolutional layers' stride to 1? Am I correct? Please let me know. Also, wouldn't that hurt the AM model accuracy?

@vchagari
Author

vchagari commented Oct 12, 2021

Hi @tlikhomanenko,

I changed the total stride from 7 to 3 in the streaming convnets recipe architecture and trained it from scratch on LibriSpeech data with letter tokens. The AM model seems to have trained fine, but the word timings reported by the AM/decoder are still bad. Please find the modified architecture below; could you please let me know if the changes I made make sense?

Note: I tried total strides of 4 and 2 as well, with no luck. I also experimented with removing the 2nd/3rd/4th PD+CN+R+DO+LN+TDS layer blocks in the corresponding training experiments.

Original Arch File:
https://github.com/flashlight/wav2letter/blob/main/recipes/streaming_convnets/librispeech/am_500ms_future_context.arch

Modified Arch File:
Changes:
Removed the SAUG layer.
Reduced the first conv layer stride to 1.
Removed the second PD, conv, R, DO, LN, TDS layers.
Reduced the third conv layer stride to 1.
###############################
V -1 NFEAT 1 0
PD 0 5 3
C2 1 15 10 1 1 1 0 0
R
DO 0.1
LN 1 2
TDS 15 9 80 0.1 0 1 0
TDS 15 9 80 0.1 0 1 0
PD 0 9 1
C2 15 23 12 1 1 1 0 0
R
DO 0.1
LN 1 2
TDS 23 11 80 0.1 0 1 0
TDS 23 11 80 0.1 0 1 0
TDS 23 11 80 0.1 0 1 0
TDS 23 11 80 0.1 0 0 0
PD 0 10 0
C2 23 27 11 1 1 1 0 0
R
DO 0.1
LN 1 2
TDS 27 11 80 0.1 0 0 0
TDS 27 11 80 0.1 0 0 0
TDS 27 11 80 0.1 0 0 0
TDS 27 11 80 0.1 0 0 0
TDS 27 11 80 0.1 0 0 0
RO 2 1 0 3
V 2160 -1 1 0
L 2160 NLABEL
V NLABEL 0 -1 1
