
[tf_clean] swb1/v1-tf WFST decoding - checking on assumptions #193

Closed
efosler opened this issue Jun 25, 2018 · 43 comments

efosler commented Jun 25, 2018

Decided a new thread would be good for this issue.

Right now the SWB tf code as checked in seems to have a discrepancy, and I'm writing down some of my assumptions as I work through cleaning up the WFST decode.

It looks to me like run_ctc_phn.sh creates a set of training/cv labels that ignores noises (and overwrites units.txt, removing spn and npn). However, utils/ctc_compile_dict_token.sh assumes that units.txt and lexicon.txt are synchronized, resulting in the lovely error:

FATAL: FstCompiler: Symbol "spn" is not mapped to any integer arc ilabel, symbol table = data/lang_phn/tokens.txt, source = standard input, line = 1171

The fix is pretty simple (synchronizing the lexicon) but I'm trying to figure out how much to modify the utils/ctc_compile_dict_token.sh script vs. correcting the prep script to do the appropriate correction. I'm thinking that I'll correct the prep script, but if anyone has any thoughts on that let me know.
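
For concreteness, the synchronization I have in mind is just filtering the lexicon down to entries whose units all survive in units.txt. A minimal sketch - paths and the output filename are illustrative, not the actual prep-script logic:

    # Drop lexicon entries whose pronunciations use units (e.g. spn, npn)
    # that were removed from units.txt. Paths are illustrative.
    units = {line.split()[0] for line in open('data/local/dict_phn/units.txt')}

    with open('data/local/dict_phn/lexicon.txt') as fin, \
         open('data/local/dict_phn/lexicon_synced.txt', 'w') as fout:
        for line in fin:
            fields = line.split()
            # keep only entries whose pronunciation is fully covered by units.txt
            if fields and all(p in units for p in fields[1:]):
                fout.write(line)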

@ramonsanabria

Great, yes, thanks @efosler - we never managed to get a full recipe for tf + WFST. I have some spare scripts, but nothing clean and official...

Yes, this is correct:

> It looks to me like run_ctc_phn.sh creates a set of training/cv labels that ignores noises (and overwrites units.txt, removing spn and npn). However, utils/ctc_compile_dict_token.sh assumes that units.txt and lexicon.txt are synchronized, resulting in the lovely error:

I was wondering if there is an improved (or simpler) way to do the data preparation for characters and phonemes separately. Do you have any thoughts? I can try to help. Otherwise we can reuse the preparation from the master branch.

Also, another random thing that I observed with swbd: I tried to prepare the char setup by substituting written words for numbers and removing noises, but in the end it did not work out...

I am working on integrating the CharRNN decoding recipe that we have (it doesn't perform better than WFST, but it allows open vocabulary): https://arxiv.org/abs/1708.04469.

Please let me know if I can help you somehow; I will be very happy to!

Thanks again!


efosler commented Jun 25, 2018

Let me think about it as I play with the scripts. I just created a _tf version of swbd1_decode_graph.sh which gets rid of the -unk option, but that feels like there could be a better factorization.


efosler commented Jun 26, 2018

So, an update: the good news is that I was able to get a decode to run all the way through. There does seem to be a bit of underperformance w.r.t. Yajie's runs on the non-tf version. Currently, I'm seeing 24.7% WER on eval2000 (vs. 21.0) using SWB + Fisher LM. I think there are a few differences:

  • 4 layer BiLSTM vs 5 layer BiLSTM
  • I'm not sure that the default tf recipe currently checked in has speaker adaptation (or in fact if the original has speaker adaptation). The RESULTS file seems to indicate speaker adaptation but now looking through v1 I don't see how that happens, if it happens.

I'm sure that there is some other possible set of differences in parameters as well.

Just to check: what I did was just work with the output of ./steps/decode_ctc_am_tf.sh and feed the logprobs through latgen-faster. NB this just runs test.py in ctc-am rather than nnet.py or anything else (not sure if this is the right thing to do, but it's what's checked in).

Any thoughts on diffs between the tf and old versions that might be causing the discrepancy?


ramonsanabria commented Jun 26, 2018 via email


efosler commented Jun 26, 2018

Let me re-run it - I just started up the non-tf version and I think it blew away the tf version (hmpf). I am pretty sure that it isn't using the prior probabilities of each phone (not character) but I'm not sure. (I don't see where that would have been injected into the system).

@ericbolo

@efosler I would also like to integrate the tf acoustic model into a WFST for decoding. As I understand this thread, you have managed to do that. Is any of your code in the repo?

I pulled tf_clean and asr_egs/swbd/v1-tf/run_ctc_phn.sh only does acoustic decoding.

Would be great if I could avoid starting from scratch :)


efosler commented Jun 29, 2018

Sorry for the delay - I had a few other things pop up. The non-tf run didn't finish before we had a massive server shutdown because of a planned power outage (sigh). So @ericbolo, let me try to run the v1-tf branch again and I can check it in against my copy of the repository. I think that @ramonsanabria has had a better outcome than I have.

Basically, the things I had to do were slight modifications to building the TLG graph, followed by calling latgen-faster and score_sclite.sh. I'm sure that the decoding parameters aren't right, and I have to investigate whether I have the priors involved or not before decoding.


ericbolo commented Jul 2, 2018 via email


ramonsanabria commented Jul 2, 2018 via email


ramonsanabria commented Jul 2, 2018 via email


efosler commented Jul 2, 2018

@ramonsanabria thanks! We've been having some NFS issues here so I haven't gotten my full pass to run. It would be great to have this recipe in the mix. Does this want to go into v1-tf or should there be a v2-tf?


efosler commented Jul 3, 2018

We finally got the NFS issues resolved so I should have the training done by tomorrowish. @ramonsanabria, two questions -

  1. I noticed that the default forward pass used epoch 14 (rather than the full model) - was there a reason for that, or is that something that I should clean up? (This would be part of the reason for substandard results, but more likely item 2 below...)
  2. I do not believe that the decoding is using priors. I see a start on that in ctc-am/tf/tf_test.py, but it doesn't seem to do anything with the priors, nor is there a model.priors file built during training (as far as I can tell). Am I missing something?

@ramonsanabria

Hi Eric, sorry for not responding to the last message. We were also a little bit busy with JSALT.

Regarding tf-v2: yes, we can do that, cool idea. @fmetze, what do you think? I am still fine-tuning everything (WFST and AM) and I should include some code (mostly for the BPE generation), but it would be a good idea. We are preparing a second publication for SLT; after acceptance we can release the whole recipe.

OK, let me send all the parameters that I am using. Can you share your TER results with your configuration? You might find some parameters that are currently not implemented in the master branch (dropout, etc.), but with the intersecting set of parameters you should be fine. With this configuration on swbd, I remember that @fmetze achieved something close to 11% TER.

    target_scheme {'no_name_language': {'no_target_name': 47}}
    drop_out 0.0
    sat_conf {'num_sat_layers': 2, 'continue_ckpt_sat': False, 'sat_stage': 'fine_tune', 'sat_type': 'non_adapted'}
    init_nproj 80
    clip 0.1
    nlayer 5
    nhidden 320
    data_dir /tmp/tmp.EKt4xyU6eX
    min_lr_rate 0.0005
    half_rate 0.5
    do_shuf True
    nepoch 30
    grad_opt grad
    random_seed 15213
    model_dir exp/fmetze_test_43j26e/model
    input_feats_dim 129
    batch_size 16
    kl_weight 0.0
    lstm_type cudnn
    lr_rate 0.05
    model deepbilstm
    nproj 340
    final_nproj 0
    half_after 8
    train_dir exp/fmetze_test_43j26e/
    online_augment_conf {'window': 3, 'subsampling': 3, 'roll': True}
    clip_norm False
    l2 0.001
    store_model True
    debug False
    continue_ckpt
    half_period 4
    force_lr_epoch False
    batch_norm True
Let me take a look at bullet point 2 later in the day. For testing you should use https://github.com/srvk/eesen/blob/tf_clean/tf/ctc-am/test.py; it will generate various versions of the forward pass (log_probs, logits, probs, etc.) with blank in position 0. I will need to clean this up so that the script only outputs what is really needed.

Once you have the log_probs, you can just apply the normal Eesen C++ recipe (i.e., apply the WFST to the log_probs). I am not sure why my character-based WFST is not working; I could make it work with bpe300 and other units, but not with characters. I will try to get back to you later on this.

Thanks!


efosler commented Jul 3, 2018

No worries on lag - I think this is going to be an "over several weeks" thing as this isn't first priority for any of us (although high priority overall).

The TER I'm seeing is more around 15% (still training, but I don't see it likely to get much under 15%) - I will see if there are any diffs.

Meanwhile once I get the pipeline to finish, I'll check in a local copy for @ericbolo so that he can play around, since it is a working pipeline even if it isn't efficient or as high accuracy.

Thanks!


efosler commented Jul 3, 2018

Just for the record, here are diffs on config:
nlayer: 4 (vs 5)
input_feats_dim: 120 (vs 129)
batch_size: 32 (vs 16)
lr_rate: 0.005 (vs 0.05)
nproj: 60 (vs 340)
online_augment_conf.roll = False (vs True)
l2: 0.0 (vs 0.001)
batch_norm: False (vs True)

So it's pretty clear that there are some significant differences, and I'd believe the sum total of them could result in a 4% difference in TER (particularly layers, l2, batch norm, lr_rate, and maybe nproj). The really interesting question is what the extra 9 features are - it looks like one additional base feature which has deltas/double-deltas and windowing applied.

{'continue_ckpt': '', 'diff_num_target_ckpt': False, 'force_lr_epoch': False, 'random_seed': 15213, 'debug': False, 'store_model': True, 'data_dir': '/scratch/tmp/fosler/tmp.xJUz4scH4T', 'train_dir': 'exp/train_phn_l4_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80', 'batch_size': 32, 'do_shuf': True, 'nepoch': 30, 'lr_rate': 0.005, 'min_lr_rate': 0.0005, 'half_period': 4, 'half_rate': 0.5, 'half_after': 8, 'drop_out': 0.0, 'clip_norm': False, 'kl_weight': 0.0, 'model': 'deepbilstm', 'lstm_type': 'cudnn', 'nproj': 60, 'final_nproj': 100, 'init_nproj': 80, 'l2': 0.0, 'nlayer': 4, 'nhidden': 320, 'clip': 0.1, 'batch_norm': False, 'grad_opt': 'grad', 'sat_conf': {'sat_type': 'non_adapted', 'sat_stage': 'fine_tune', 'num_sat_layers': 2, 'continue_ckpt_sat': False}, 'online_augment_conf': {'window': 3, 'subsampling': 3, 'roll': False}, 'input_feats_dim': 120, 'target_scheme': {'no_name_language': {'no_target_name': 43}}, 'model_dir': 'exp/train_phn_l4_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/model'}


ramonsanabria commented Jul 3, 2018 via email


ericbolo commented Jul 4, 2018 via email


efosler commented Jul 5, 2018

@ericbolo I've uploaded my changes to efosler/eesen so you can grab the newest copy. This should work - there are a few diffs with the graph prep scripts. Here's a list of the files that I changed so that you can just grab them if you want:

  • asr_egs/swbd/v1-tf/local/swbd1_data_prep.sh
  • asr_egs/swbd/v1-tf/local/swbd1_decode_graph_tf.sh
  • asr_egs/swbd/v1-tf/run_ctc_phn.sh
  • asr_egs/swbd/v1/local/swbd1_data_prep.sh
    [cosmetic changes only to these]
  • asr_egs/wsj/steps/decode_ctc_lat_tf.sh
  • asr_egs/wsj/steps/train_ctc_tf.sh
    [python 3 compatibility]
  • asr_egs/wsj/utils/ctc_token_fst.py
  • asr_egs/wsj/utils/model_topo.py

My next step will be to try to rework the recipe so that it matches the parameters sent by @ramonsanabria . Once I've got that done and confirmed I'll send a pull request.


efosler commented Jul 5, 2018

NB: the decode script is woefully non-parallel (needs to be fixed), but for the online stuff this won't matter.


ericbolo commented Jul 6, 2018 via email


efosler commented Jul 6, 2018

Hey @ramonsanabria, quick question: you said...

> Also those ones that I think are implemented: 'nproj': 60, 'final_nproj': 100, 'init_nproj': 80. Otherwise I will push code to have them.

Looking through the code base, it seems like these are passed as parameters - will it not do the right thing if those parameters are set?


efosler commented Jul 7, 2018

About to go offline for a bit, so I won't be able to report on the full run, but training with the parameters above (same as @fmetze 's run but with nproj=60, final_nproj=100, init_nproj=80) does get down to 11.7% TER, so I will make those the default with the script going forward. Decoding hasn't happened yet.


ericbolo commented Jul 20, 2018 via email


ericbolo commented Jul 24, 2018

re: priors. As @efosler noted, it seems the priors are not used in the current decoding code.

in tf_test.py:

    if(config[constants.CONFIG_TAGS_TEST.USE_PRIORS]):
        #TODO we need to add priors also:
        #feed_priors={i: y for i, y in zip(model.priors, config["prior"])}
        print(config[constants.CONFIG_TAGS_TEST.PRIORS_SCHEME])

model.priors doesn't seem to be generated anywhere, but we can use label.counts to generate it.
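
A rough sketch of how label.counts could be applied - illustrative only; the counts-file format, the blank-first ordering, and the prior scale are all assumptions:

    import numpy as np

    # Normalize the label counts (blank assumed in position 0) into priors
    # and subtract the log-priors from the frame log-posteriors.
    def apply_priors(log_posts, counts_file, prior_scale=1.0):
        # strip any Kaldi-style brackets before parsing the counts
        text = open(counts_file).read().replace('[', ' ').replace(']', ' ')
        counts = np.array([float(c) for c in text.split()])
        priors = counts / counts.sum()
        return log_posts - prior_scale * np.log(priors + 1e-20)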
@fmetze, are priors used in the original (C++) implementation?


ramonsanabria commented Jul 24, 2018 via email


ramonsanabria commented Jul 24, 2018

Here is the commit of the new asr_egs/wsj/utils/nnet_notf.py (which does not use tf): 543c9ed

Here are the parts of the code that I posted in the previous message (email replies do not support markdown code blocks):

labels=$dir_am/label.counts

gunzip -c $dir_am/labels.tr.gz | \
  awk '{line=$0; gsub(" "," 0 ",line); print line " 0";}' | \
  /data/ASR5/fmetze/eesen-block-copy/src/decoderbin/analyze-counts \
    --verbose=1 --binary=false ark:- $labels

$decode_cmd JOB=1:$nj $mdl/log/decode.JOB.log \
  cat $PWD/$mdl/split$nj/JOB/feats.scp \| sort -k 1 \| \
  python utils/nnet.py --label-counts $labels --temperature $temperature --blank-scale $bkscale \| \
  latgen-faster --max-active=$max_active --max-mem=$max_mem --beam=$beam --lattice-beam=$lattice_beam \
    --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \
    $graphdir/TLG.fst ark:- "ark:|gzip -c > $mdl/lat/lat.JOB.gz" || exit 1;


efosler commented Aug 8, 2018

So, an update on my progress with SWB (now that I'm getting back to this). I haven't tried out @ramonsanabria 's code above yet.

I'm able to train a SWB system getting 11.8% TER on the CV set (much better than before). However, decoding with this (again not with priors) gives me a 40+% WER - much worse than the previous setup. I'm trying to debug this to understand where things are going wrong.

One thing I tried to do was turn on TER calculation during the forward pass. Had to do some modifications to steps/decode_ctc_am_tf.sh to make it pass the right flags to the test module. However, that was a non-starter it seems - the forward pass just hangs with no errors.

Seems like the next best step would be to just try to switch to @ramonsanabria 's decode strategy and abandon steps/decode_ctc_am_tf.sh?


efosler commented Aug 10, 2018

@ramonsanabria what's a good (rough) value for blank_scale?


efosler commented Aug 10, 2018

@ramonsanabria Now looking through nnet.py (and the non-tf version) - this actually takes the output of the net and applies the smoothing and priors as a filter, right? The code snippet you have above doesn't actually run the net forward, it seems to me, but would do something funky to the features in feats.scp.
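
If I'm reading it right, the filter stage amounts to something like the sketch below - my interpretation, not the actual nnet.py; the blank index (0), the temperature smoothing, and the exact scaling scheme are all assumptions on my part:

    import numpy as np

    # Per utterance: smooth, down-weight the blank class, renormalize, then
    # divide by the priors to get pseudo-likelihoods for latgen-faster.
    def filter_posteriors(posts, priors, blank_scale=1.0, temperature=1.0):
        posts = posts ** (1.0 / temperature)       # temperature smoothing
        posts[:, 0] *= blank_scale                 # scale the blank posterior
        posts /= posts.sum(axis=1, keepdims=True)  # renormalize per frame
        return np.log(posts + 1e-20) - np.log(priors + 1e-20)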


ramonsanabria commented Aug 10, 2018 via email


ericbolo commented Aug 12, 2018 via email


efosler commented Aug 15, 2018

Alas, I won't be in India either. (Maybe I might be able to stop by CMU sometime this semester.)

Update on progress: I wrote a small script to do greedy decoding on a logit/posterior stream and calculate the TER. (Will post this to my repo soonish and then send a pull request.) Found that on the SWB eval2000 test set I was getting 30% TER (this after priors; without priors it is worse). I was slightly puzzled by that, so I decided to calculate the TER on the train_dev set for SWB - I'm getting roughly 21-22% TER. This was a system that was reporting 11.8% TER on the same set during training. So something is rather hinky. Still digging, but if anyone has ideas, let me know.


efosler commented Aug 16, 2018

I think I've enabled tf_train to dump out the forward pass on cv to see what's going on - whether there is a difference in the output. It took me a good chunk of the evening. One thing I did run across is that the forward pass on subsampled data gets averaged in tf_test - it's not clear to me whether the TER reported in tf_train is over the averaged stream or (as I suspect) over all variants. I don't think this could account for a factor of two in TER, though.

FWIW, I think the code would be cleaner if tf_train and tf_test were factorized some - I had to copy a lot of code over and I worry about inconsistencies between them (although they are hooked together through the model class).


efosler commented Aug 16, 2018

Update from yesterday (now that the swbd system has had some time to train): the dumped cv ark files do not show the same CTC error rate as the system claims. I suspect that the averaging might be doing something weird. Writing down assumptions here so someone can pick this apart:

  • The implementation of the greedy decoder is just to take the max at each frame, remove duplicates, and then remove blanks. Shortest decoder I've ever written:

import itertools

def greedy_decode(logits):
    # max label per frame, collapse repeats, drop blanks (index 0)
    return [i for i, _ in itertools.groupby(logits.argmax(1)) if i > 0]

  • The averaging code (taken from tf_test.py) checks if there is augmented data. If so, it computes the average logit stream from all examples. But if the network places the tokens at radically different frames across variants, the individual streams will likely give different decodings.

(Now this is making me wonder if the test set was augmented... hmmm...)

Anyway, just to give a sample of the difference in TER:

Reported by tf during training:

            Validate cost: 40.4, ter: 27.6%, #example: 11190
            Validate cost: 32.9, ter: 22.2%, #example: 11190
            Validate cost: 30.2, ter: 21.2%, #example: 11190
            Validate cost: 27.8, ter: 19.2%, #example: 11190
            Validate cost: 26.8, ter: 18.3%, #example: 11190
            Validate cost: 35.0, ter: 23.7%, #example: 11190
            Validate cost: 28.4, ter: 19.4%, #example: 11190
            Validate cost: 24.8, ter: 17.1%, #example: 11190

Decoding on the averaged stream:

TER = 76690 / 152641 = 50.2
TER = 69108 / 152380 = 45.4
TER = 68611 / 152380 = 45.0
TER = 62259 / 152380 = 40.9
TER = 59838 / 152380 = 39.3
TER = 72821 / 152380 = 47.8
TER = 61498 / 152380 = 40.4
TER = 59800 / 152380 = 39.2


efosler commented Aug 16, 2018

@ramonsanabria and @fmetze can you confirm what the online feature augmentation is doing? I think I misunderstood it in my comments above. (I had visions of other types of augmentation going on but reading the code I think it's simpler than I thought.)

Looking through the code, it seems like when you have the subsample and window set to 3, what it's doing is stacking three frames of the input and making the input sequence three times shorter. Is it also creating three variants with different shifts? I'm trying to figure out where the averaging would come in later.
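
To make my reading concrete, here's a toy version of what I think window=3 / subsampling=3 does - my interpretation, not the actual augmentation code:

    import numpy as np

    # Stack `window` consecutive frames, then keep every `subsample`-th stacked
    # frame; one shifted variant per starting offset (0, 1, 2).
    def stack_and_subsample(feats, window=3, subsample=3):
        T, _ = feats.shape
        padded = np.pad(feats, ((0, window), (0, 0)), mode='edge')
        stacked = np.concatenate([padded[i:i + T] for i in range(window)], axis=1)
        return [stacked[offset::subsample] for offset in range(subsample)]

If that's right, training sees all three shifted variants, which is where the question of how to combine them at test time comes from.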


efosler commented Aug 18, 2018

OK, I have figured out the discrepancy in output between forward passes and what is recorded by the training regime. tl;dr - the augmentation and averaging code in tf_test.py is at fault and should not currently be trusted. I'm working on a fix.

When training is done with augmentation (in this example, with window 3) 3 different shifted copies are created for training with stacked features. The TER is calculated for each copy (stream) by taking a forward pass and greedy decoding over the logits, then getting edit distance to the labels. The reported TER is over all copies.
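
(For reference, the TER here is just edit distance over the label sequences divided by the number of reference labels - a minimal sketch of that computation:)

    def edit_distance(ref, hyp):
        # Dynamic-programming Levenshtein distance over two label sequences.
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            cur = [i]
            for j, h in enumerate(hyp, 1):
                cur.append(min(prev[j] + 1,               # deletion
                               cur[-1] + 1,               # insertion
                               prev[j - 1] + (r != h)))   # substitution
            prev = cur
        return prev[-1]

    # TER = sum of per-utterance edit distances / total number of reference labels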

At test time, it is not really clear what to do with 3 copies of the same logit stream. The test code (which I've replicated in the forward pass during training) assumes that the correct thing to do is to average the logit streams. This would be appropriate for a traditional frame-based NN system. However, in a CTC-based system there is no guarantee of synchronization of outputs, so averaging the streams means that sometimes the blank label will dominate where it should not (for example: if one stream greedily labels "A blank blank", the second "blank A blank", and the third "blank blank A", then the average stream might label "blank blank blank" - causing a deletion).
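
A toy example of the effect, with made-up numbers (columns are [blank, A]):

    import numpy as np

    s1 = np.array([[0.1, 0.9], [0.8, 0.2], [0.8, 0.2]])  # greedy: A _ _
    s2 = np.array([[0.8, 0.2], [0.1, 0.9], [0.8, 0.2]])  # greedy: _ A _
    s3 = np.array([[0.8, 0.2], [0.8, 0.2], [0.1, 0.9]])  # greedy: _ _ A
    avg = (s1 + s2 + s3) / 3
    print(avg.argmax(1))  # [0 0 0] - blank wins every frame, so the A is deleted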

I verified this by dumping out only the first stream rather than the average, and found that the CV TER was identical to that reported by the trainer. (That's not to say that the decoding was identical, but that the end number was the same.)

Upshot: it's probably best to arbitrarily take one of the streams and use it at test time - although is there a more appropriate combination scheme?


efosler commented Aug 18, 2018

Created new issue for this particular bug. #194


efosler commented Aug 20, 2018

Latest update:
Decoding with the sw+fish LM, incorporating priors, and fixing the averaging bug leads to 19.2% WER on eval2000, with the swbd subset getting 13.4% WER (the Kaldi triphone-based system gets 13.3% WER on the same set, although that may be a more involved model). I think that this is close enough for a baseline to declare victory. I'll clean stuff up and then make a pull request.


efosler commented Aug 23, 2018

Successful full train and decode; I also tested out a run with a slightly larger net (with a bit of improvement). Adding these baselines to the README file.

# CTC Phonemes on the Complete set (with 5 BiLSTM layers) with WFST decode
%WER 12.5 | 1831 21395 | 88.9 7.7 3.4 1.5 12.5 49.6 | exp/train_phn_fbank_pitch_l5_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/results_epoch22_bs5.0_sw1_fsh_tgpr/score_8/eval2000.ctm.swbd.filt.sys
%WER 18.3 | 4459 42989 | 83.9 11.7 4.4 2.2 18.3 57.3 | exp/train_phn_fbank_pitch_l5_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/results_epoch22_bs4.0_sw1_fsh_tgpr/score_8/eval2000.ctm.filt.sys
%WER 23.9 | 2628 21594 | 79.0 15.5 5.6 2.8 23.9 62.5 | exp/train_phn_fbank_pitch_l5_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/results_epoch22_bs4.0_sw1_fsh_tgpr/score_8/eval2000.ctm.callhm.filt.sys

# Slightly larger model (400 units, 80 internal projections) with WFST decode
%WER 12.2 | 1831 21395 | 89.2 7.7 3.1 1.4 12.2 49.7 | exp/train_phn_fbank_pitch_l5_c400_mdeepbilstm_w3_ntrue_p80_ip80_fp80/results_epoch23_bs7.0_sw1_fsh_tgpr/score_10/eval2000.ctm.swbd.filt.sys
%WER 17.8 | 4459 42989 | 84.1 11.1 4.8 1.9 17.8 57.1 | exp/train_phn_fbank_pitch_l5_c400_mdeepbilstm_w3_ntrue_p80_ip80_fp80/results_epoch23_bs7.0_sw1_fsh_tgpr/score_9/eval2000.ctm.filt.sys
%WER 23.4 | 2628 21594 | 79.3 14.8 5.9 2.7 23.4 62.1 | exp/train_phn_fbank_pitch_l5_c400_mdeepbilstm_w3_ntrue_p80_ip80_fp80/results_epoch23_bs7.0_sw1_fsh_tgpr/score_10/eval2000.ctm.callhm.filt.sys

@ramonsanabria

Awesome, thank you so much Eric! The numbers look great. Can you share the full training configuration?

Thank you again!


efosler commented Aug 23, 2018

Just submitted the pull request (#196).


efosler commented Aug 23, 2018

Once we decide that #196 is all good, I think we can close this particular thread!!!


efosler commented Aug 24, 2018

OK, closing this particular thread. Whew!

@efosler efosler closed this as completed Aug 24, 2018