
[tf_clean] swb1/v1-tf WFST decoding - checking on assumptions #193

Closed
efosler opened this issue Jun 25, 2018 · 43 comments

efosler commented Jun 25, 2018

Decided a new thread would be good for this issue.

Right now the SWB tf code as checked in seems to have a discrepancy, and I'm writing down some of my assumptions as I work through cleaning up the WFST decode.

It looks to me like run_ctc_phn.sh creates a set of training/cv labels that ignores noises (and overwrites units.txt, removing spn and npn). However, utils/ctc_compile_dict_token.sh assumes that units.txt and lexicon.txt are synchronized, resulting in the lovely error:

FATAL: FstCompiler: Symbol "spn" is not mapped to any integer arc ilabel, symbol table = data/lang_phn/tokens.txt, source = standard input, line = 1171

The fix is pretty simple (synchronizing the lexicon) but I'm trying to figure out how much to modify the utils/ctc_compile_dict_token.sh script vs. correcting the prep script to do the appropriate correction. I'm thinking that I'll correct the prep script, but if anyone has any thoughts on that let me know.
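
For concreteness, the synchronization I have in mind is just filtering the lexicon down to entries whose units all survive in units.txt. A minimal sketch - paths and the output filename are illustrative, not the actual prep-script logic:

    # Drop lexicon entries whose pronunciations use units (e.g. spn, npn)
    # that were removed from units.txt. Paths are illustrative.
    units = {line.split()[0] for line in open('data/local/dict_phn/units.txt')}

    with open('data/local/dict_phn/lexicon.txt') as fin, \
         open('data/local/dict_phn/lexicon_synced.txt', 'w') as fout:
        for line in fin:
            fields = line.split()
            # keep only entries whose pronunciation is fully covered by units.txt
            if fields and all(p in units for p in fields[1:]):
                fout.write(line)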

@ramonsanabria

Great, yes, thanks @efosler - we never managed to get a full recipe for tf + WFST. I have some spare scripts, but nothing clean and official...

Yes, this is correct:

> It looks to me like run_ctc_phn.sh creates a set of training/cv labels that ignores noises (and overwrites units.txt, removing spn and npn). However, utils/ctc_compile_dict_token.sh assumes that units.txt and lexicon.txt are synchronized, resulting in the lovely error:

I was wondering if there is an improved (or simpler) way to do the data preparation for characters and phonemes separately. Do you have any thoughts? I can try to help. Otherwise we can reuse the preparation from the master branch.

Also, another random thing that I observed with swbd: I tried to prepare the char setup by substituting written words for numbers and removing noises, but in the end it did not work out...

I am working on integrating the CharRNN decoding recipe that we have (it doesn't perform better than WFST, but it allows open vocabulary): https://arxiv.org/abs/1708.04469.

Please let me know if I can help you somehow; I will be very happy to!

Thanks again!


efosler commented Jun 25, 2018

Let me think about it as I play with the scripts. I just created a _tf version of swbd1_decode_graph.sh which gets rid of the -unk option, but that feels like there could be a better factorization.


efosler commented Jun 26, 2018

So, an update: the good news is that I was able to get a decode to run all the way through. There does seem to be a bit of underperformance w.r.t. Yajie's runs on the non-tf version. Currently, I'm seeing 24.7% WER on eval2000 (vs. 21.0) using SWB + Fisher LM. I think there are a few differences:

  • 4 layer BiLSTM vs 5 layer BiLSTM
  • I'm not sure that the default tf recipe currently checked in has speaker adaptation (or in fact if the original has speaker adaptation). The RESULTS file seems to indicate speaker adaptation but now looking through v1 I don't see how that happens, if it happens.

I'm sure that there is some other possible set of differences in parameters as well.

Just to check: what I did was just work with the output of ./steps/decode_ctc_am_tf.sh and feed the logprobs through latgen-faster. NB this just runs test.py in ctc-am rather than nnet.py or anything else (not sure if this is the right thing to do, but it's what's checked in).

Any thoughts on diffs between the tf and old versions that might be causing the discrepancy?


ramonsanabria commented Jun 26, 2018 via email


efosler commented Jun 26, 2018

Let me re-run it - I just started up the non-tf version and I think it blew away the tf version (hmpf). I am pretty sure that it isn't using the prior probabilities of each phone (not character) but I'm not sure. (I don't see where that would have been injected into the system).

@ericbolo

@efosler I would also like to integrate the tf acoustic model into a WFST for decoding. As I understand this thread, you have managed to do that. Is any of your code in the repo?

I pulled tf_clean and asr_egs/swbd/v1-tf/run_ctc_phn.sh only does acoustic decoding.

Would be great if I could avoid starting from scratch :)


efosler commented Jun 29, 2018

Sorry for the delay - I had a few other things pop up. The non-tf run didn't finish before we had a massive server shutdown because of a planned power outage (sigh). So @ericbolo, let me try to run the v1-tf branch again and I can check it in against my copy of the repository. I think that @ramonsanabria has had a better outcome than I have.

Basically, the things I had to do were slight modifications to building the TLG graph, followed by calling latgen-faster and score_sclite.sh. I'm sure that the decoding parameters aren't right, and I have to investigate whether I have the priors involved or not before decoding.


ericbolo commented Jul 2, 2018 via email


ramonsanabria commented Jul 2, 2018 via email


ramonsanabria commented Jul 2, 2018 via email


efosler commented Jul 2, 2018

@ramonsanabria thanks! We've been having some NFS issues here so I haven't gotten my full pass to run. It would be great to have this recipe in the mix. Does this want to go into v1-tf or should there be a v2-tf?


efosler commented Jul 3, 2018

We finally got the NFS issues resolved so I should have the training done by tomorrowish. @ramonsanabria, two questions -

  1. I noticed that the default forward pass used epoch 14 (rather than the full model) - was there a reason for that, or is that something that I should clean up? (This would be part of the reason for substandard results, but more likely item 2 below...)
  2. I do not believe that the decoding is using priors. I see a start on that in ctc-am/tf/tf_test.py, but it doesn't seem to do anything with the priors, nor is there a model.priors file built during training (as far as I can tell). Am I missing something?

@ramonsanabria

Hi Eric, sorry for not responding to the last message. We were also a little bit busy with JSALT.

Regarding tf-v2: yes, we can do that, cool idea. @fmetze, what do you think? I am still fine-tuning everything (WFST and AM) and I should include some code (mostly for the BPE generation), but it would be a good idea. We are preparing a second publication for SLT; after acceptance we can release the whole recipe.

OK, let me send all the parameters that I am using. Can you share your TER results with your configuration? You might find some parameters that are currently not implemented in the master branch (dropout, etc.), but with the intersecting set of parameters you should be fine. With this configuration on swbd, I remember that @fmetze achieved something close to 11% TER.

    target_scheme {'no_name_language': {'no_target_name': 47}}
    drop_out 0.0
    sat_conf {'num_sat_layers': 2, 'continue_ckpt_sat': False, 'sat_stage': 'fine_tune', 'sat_type': 'non_adapted'}
    init_nproj 80
    clip 0.1
    nlayer 5
    nhidden 320
    data_dir /tmp/tmp.EKt4xyU6eX
    min_lr_rate 0.0005
    half_rate 0.5
    do_shuf True
    nepoch 30
    grad_opt grad
    random_seed 15213
    model_dir exp/fmetze_test_43j26e/model
    input_feats_dim 129
    batch_size 16
    kl_weight 0.0
    lstm_type cudnn
    lr_rate 0.05
    model deepbilstm
    nproj 340
    final_nproj 0
    half_after 8
    train_dir exp/fmetze_test_43j26e/
    online_augment_conf {'window': 3, 'subsampling': 3, 'roll': True}
    clip_norm False
    l2 0.001
    store_model True
    debug False
    continue_ckpt
    half_period 4
    force_lr_epoch False
    batch_norm True
Let me take a look at bullet point 2 later in the day. For testing you should use https://github.com/srvk/eesen/blob/tf_clean/tf/ctc-am/test.py; it will generate various versions of the forward pass (log_probs, logits, probs, etc.) with blank in position 0. I will need to clean this up so that the script only outputs what is really needed.

Once you have the log_probs, you can just apply the normal Eesen C++ recipe (i.e., apply the WFST to the log_probs). I am not sure why my character-based WFST is not working; I could make it work with bpe300 and other units, but not with characters. I will try to get back to you later on this.

Thanks!


efosler commented Jul 3, 2018

No worries on lag - I think this is going to be an "over several weeks" thing as this isn't first priority for any of us (although high priority overall).

The TER I'm seeing is more around 15% (still training, but I don't see it likely to get much under 15%) - I will see if there are any diffs.

Meanwhile once I get the pipeline to finish, I'll check in a local copy for @ericbolo so that he can play around, since it is a working pipeline even if it isn't efficient or as high accuracy.

Thanks!


efosler commented Jul 3, 2018

Just for the record, here are diffs on config:
nlayer: 4 (vs 5)
input_feats_dim: 120 (vs 129)
batch_size: 32 (vs 16)
lr_rate: 0.005 (vs 0.05)
nproj: 60 (vs 340)
online_augment_conf.roll = False (vs True)
l2: 0.0 (vs 0.001)
batch_norm: False (vs True)

So it's pretty clear that there are some significant differences, and I'd believe the sum total of them could result in a 4% difference in TER (particularly layers, l2, batch norm, lr_rate, and maybe nproj). The really interesting question is what the extra 9 features are - it looks like one additional base feature which has deltas/double-deltas and windowing applied.

{'continue_ckpt': '', 'diff_num_target_ckpt': False, 'force_lr_epoch': False, 'random_seed': 15213, 'debug': False, 'store_model': True, 'data_dir': '/scratch/tmp/fosler/tmp.xJUz4scH4T', 'train_dir': 'exp/train_phn_l4_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80', 'batch_size': 32, 'do_shuf': True, 'nepoch': 30, 'lr_rate': 0.005, 'min_lr_rate': 0.0005, 'half_period': 4, 'half_rate': 0.5, 'half_after': 8, 'drop_out': 0.0, 'clip_norm': False, 'kl_weight': 0.0, 'model': 'deepbilstm', 'lstm_type': 'cudnn', 'nproj': 60, 'final_nproj': 100, 'init_nproj': 80, 'l2': 0.0, 'nlayer': 4, 'nhidden': 320, 'clip': 0.1, 'batch_norm': False, 'grad_opt': 'grad', 'sat_conf': {'sat_type': 'non_adapted', 'sat_stage': 'fine_tune', 'num_sat_layers': 2, 'continue_ckpt_sat': False}, 'online_augment_conf': {'window': 3, 'subsampling': 3, 'roll': False}, 'input_feats_dim': 120, 'target_scheme': {'no_name_language': {'no_target_name': 43}}, 'model_dir': 'exp/train_phn_l4_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/model'}


ramonsanabria commented Jul 3, 2018 via email


ericbolo commented Jul 4, 2018 via email


efosler commented Jul 5, 2018

@ericbolo I've uploaded my changes to efosler/eesen so you can grab the newest copy. This should work - there are a few diffs with the graph prep scripts. Here's a list of the files that I changed so that you can just grab them if you want:

  • asr_egs/swbd/v1-tf/local/swbd1_data_prep.sh
  • asr_egs/swbd/v1-tf/local/swbd1_decode_graph_tf.sh
  • asr_egs/swbd/v1-tf/run_ctc_phn.sh
  • asr_egs/swbd/v1/local/swbd1_data_prep.sh
    [cosmetic changes only to these]
  • asr_egs/wsj/steps/decode_ctc_lat_tf.sh
  • asr_egs/wsj/steps/train_ctc_tf.sh
    [python 3 compatibility]
  • asr_egs/wsj/utils/ctc_token_fst.py
  • asr_egs/wsj/utils/model_topo.py

My next step will be to try to rework the recipe so that it matches the parameters sent by @ramonsanabria . Once I've got that done and confirmed I'll send a pull request.


efosler commented Jul 5, 2018

NB: the decode script is woefully non-parallel (needs to be fixed), but for the online stuff this won't matter.


ericbolo commented Jul 6, 2018 via email


efosler commented Jul 6, 2018

Hey @ramonsanabria, quick question: you said...

> Also those ones that I think are implemented: 'nproj': 60, 'final_nproj': 100, 'init_nproj': 80. Otherwise I will push code to have them.

Looking through the code base, it seems like these are passed as parameters - will it not do the right thing if those parameters are set?


efosler commented Jul 7, 2018

About to go offline for a bit, so I won't be able to report on the full run, but training with the parameters above (same as @fmetze 's run but with nproj=60, final_nproj=100, init_nproj=80) does get down to 11.7% TER, so I will make those the default with the script going forward. Decoding hasn't happened yet.


ericbolo commented Jul 20, 2018 via email


ericbolo commented Jul 24, 2018

re: priors. As @efosler noted, it seems the priors are not used in the current decoding code.

in tf_test.py:

    if(config[constants.CONFIG_TAGS_TEST.USE_PRIORS]):
        #TODO we need to add priors also:
        #feed_priors={i: y for i, y in zip(model.priors, config["prior"])}
        print(config[constants.CONFIG_TAGS_TEST.PRIORS_SCHEME])

model.priors doesn't seem to be generated anywhere, but we can use label.counts to generate it.
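
A rough sketch of how label.counts could be applied - illustrative only; the counts-file format, the blank-first ordering, and the prior scale are all assumptions:

    import numpy as np

    # Normalize the label counts (blank assumed in position 0) into priors
    # and subtract the log-priors from the frame log-posteriors.
    def apply_priors(log_posts, counts_file, prior_scale=1.0):
        # strip any Kaldi-style brackets before parsing the counts
        text = open(counts_file).read().replace('[', ' ').replace(']', ' ')
        counts = np.array([float(c) for c in text.split()])
        priors = counts / counts.sum()
        return log_posts - prior_scale * np.log(priors + 1e-20)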
@fmetze, are priors used in the original (C++) implementation?


ramonsanabria commented Jul 24, 2018 via email


ramonsanabria commented Jul 24, 2018

Here is the commit of the new asr_egs/wsj/utils/nnet_notf.py (which does not use tf): 543c9ed

Here are the parts of the code that I posted in the previous message (email replies do not support markdown code blocks):

labels=$dir_am/label.counts

gunzip -c $dir_am/labels.tr.gz | \
  awk '{line=$0; gsub(" "," 0 ",line); print line " 0";}' | \
  /data/ASR5/fmetze/eesen-block-copy/src/decoderbin/analyze-counts \
    --verbose=1 --binary=false ark:- $labels

$decode_cmd JOB=1:$nj $mdl/log/decode.JOB.log \
  cat $PWD/$mdl/split$nj/JOB/feats.scp \| sort -k 1 \| \
  python utils/nnet.py --label-counts $labels --temperature $temperature --blank-scale $bkscale \| \
  latgen-faster --max-active=$max_active --max-mem=$max_mem --beam=$beam --lattice-beam=$lattice_beam \
    --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \
    $graphdir/TLG.fst ark:- "ark:|gzip -c > $mdl/lat/lat.JOB.gz" || exit 1;


efosler commented Aug 8, 2018

So, an update on my progress with SWB (now that I'm getting back to this). I haven't tried out @ramonsanabria 's code above yet.

I'm able to train a SWB system getting 11.8% TER on the CV set (much better than before). However, decoding with this (again not with priors) gives me a 40+% WER - much worse than the previous setup. I'm trying to debug this to understand where things are going wrong.

One thing I tried to do was turn on TER calculation during the forward pass. Had to do some modifications to steps/decode_ctc_am_tf.sh to make it pass the right flags to the test module. However, that was a non-starter it seems - the forward pass just hangs with no errors.

Seems like the next best step would be to just try to switch to @ramonsanabria 's decode strategy and abandon steps/decode_ctc_am_tf.sh?


efosler commented Aug 10, 2018

@ramonsanabria what's a good (rough) value for blank_scale?


efosler commented Aug 10, 2018

@ramonsanabria Now looking through nnet.py (and the non-tf version) - this actually takes the output of the net and applies the smoothing and priors as a filter, right? The code snippet you have above doesn't actually run the net forward, it seems to me, but would do something funky to the features in feats.scp.
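
If I'm reading it right, the filter stage amounts to something like the sketch below - my interpretation, not the actual nnet.py; the blank index (0), the temperature smoothing, and the exact scaling scheme are all assumptions on my part:

    import numpy as np

    # Per utterance: smooth, down-weight the blank class, renormalize, then
    # divide by the priors to get pseudo-likelihoods for latgen-faster.
    def filter_posteriors(posts, priors, blank_scale=1.0, temperature=1.0):
        posts = posts ** (1.0 / temperature)       # temperature smoothing
        posts[:, 0] *= blank_scale                 # scale the blank posterior
        posts /= posts.sum(axis=1, keepdims=True)  # renormalize per frame
        return np.log(posts + 1e-20) - np.log(priors + 1e-20)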


ramonsanabria commented Aug 10, 2018 via email


ericbolo commented Aug 12, 2018 via email


efosler commented Aug 15, 2018

Alas, I won't be in India either. (Maybe I might be able to stop by CMU sometime this semester.)

Update on progress: I wrote a small script to do greedy decoding on a logit/posterior stream and calculate the TER. (Will post this to my repo soonish and then send a pull request.) Found that on the SWB eval2000 test set I was getting 30% TER (this after priors; without priors it is worse). I was slightly puzzled by that, so I decided to calculate the TER on the train_dev set for SWB - I'm getting roughly 21-22% TER. This was a system that was reporting 11.8% TER on the same set during training. So something is rather hinky. Still digging, but if anyone has ideas, let me know.


efosler commented Aug 16, 2018

I think I've enabled tf_train to dump out the forward pass on cv to see what's going on - whether there is a difference in the output. It took me a good chunk of the evening. One thing I did run across is that the forward pass on subsampled data gets averaged in tf_test - it's not clear to me whether the TER reported in tf_train is over the averaged stream or (as I suspect) over all variants. I don't think this could account for a factor of two in TER, though.

FWIW, I think the code would be cleaner if tf_train and tf_test were factorized some - I had to copy a lot of code over and I worry about inconsistencies between them (although they are hooked together through the model class).


efosler commented Aug 16, 2018

Update from yesterday (now that the swbd system has had some time to train): the dumped cv ark files do not show the same CTC error rate as the system claims. I suspect that the averaging might be doing something weird. Writing down assumptions here so someone can pick this apart:

  • The implementation of the greedy decoder is just to take the max at each frame, remove duplicates, and then remove blanks. Shortest decoder I've ever written:

import itertools

def greedy_decode(logits):
    # max label per frame, collapse repeats, drop blanks (index 0)
    return [i for i, _ in itertools.groupby(logits.argmax(1)) if i > 0]

  • The averaging code (taken from tf_test.py) checks if there is augmented data. If so, it computes the average logit stream from all examples. But if the network places the tokens at radically different frames across variants, the individual streams will likely give different decodings.

(Now this is making me wonder if the test set was augmented... hmmm...)

Anyway, just to give a sample of the difference in TER:

Reported by tf during training:

            Validate cost: 40.4, ter: 27.6%, #example: 11190
            Validate cost: 32.9, ter: 22.2%, #example: 11190
            Validate cost: 30.2, ter: 21.2%, #example: 11190
            Validate cost: 27.8, ter: 19.2%, #example: 11190
            Validate cost: 26.8, ter: 18.3%, #example: 11190
            Validate cost: 35.0, ter: 23.7%, #example: 11190
            Validate cost: 28.4, ter: 19.4%, #example: 11190
            Validate cost: 24.8, ter: 17.1%, #example: 11190

Decoding on the averaged stream:

TER = 76690 / 152641 = 50.2
TER = 69108 / 152380 = 45.4
TER = 68611 / 152380 = 45.0
TER = 62259 / 152380 = 40.9
TER = 59838 / 152380 = 39.3
TER = 72821 / 152380 = 47.8
TER = 61498 / 152380 = 40.4
TER = 59800 / 152380 = 39.2


efosler commented Aug 16, 2018

@ramonsanabria and @fmetze can you confirm what the online feature augmentation is doing? I think I misunderstood it in my comments above. (I had visions of other types of augmentation going on but reading the code I think it's simpler than I thought.)

Looking through the code, it seems like when you have the subsample and window set to 3, what it's doing is stacking three frames of the input and making the input sequence three times shorter. Is it also creating three variants with different shifts? I'm trying to figure out where the averaging would come in later.
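
To make my reading concrete, here's a toy version of what I think window=3 / subsampling=3 does - my interpretation, not the actual augmentation code:

    import numpy as np

    # Stack `window` consecutive frames, then keep every `subsample`-th stacked
    # frame; one shifted variant per starting offset (0, 1, 2).
    def stack_and_subsample(feats, window=3, subsample=3):
        T, _ = feats.shape
        padded = np.pad(feats, ((0, window), (0, 0)), mode='edge')
        stacked = np.concatenate([padded[i:i + T] for i in range(window)], axis=1)
        return [stacked[offset::subsample] for offset in range(subsample)]

If that's right, training sees all three shifted variants, which is where the question of how to combine them at test time comes from.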


efosler commented Aug 18, 2018

OK, I have figured out the discrepancy in output between forward passes and what is recorded by the training regime. tl;dr - the augmentation and averaging code in tf_test.py is at fault and should not currently be trusted. I'm working on a fix.

When training is done with augmentation (in this example, with window 3) 3 different shifted copies are created for training with stacked features. The TER is calculated for each copy (stream) by taking a forward pass and greedy decoding over the logits, then getting edit distance to the labels. The reported TER is over all copies.
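
(For reference, the TER here is just edit distance over the label sequences divided by the number of reference labels - a minimal sketch of that computation:)

    def edit_distance(ref, hyp):
        # Dynamic-programming Levenshtein distance over two label sequences.
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            cur = [i]
            for j, h in enumerate(hyp, 1):
                cur.append(min(prev[j] + 1,               # deletion
                               cur[-1] + 1,               # insertion
                               prev[j - 1] + (r != h)))   # substitution
            prev = cur
        return prev[-1]

    # TER = sum of per-utterance edit distances / total number of reference labels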

At test time, it is not really clear what to do with 3 copies of the same logit stream. The test code (which I've replicated in the forward pass during training) assumes that the correct thing to do is to average the logit streams. This would be appropriate for a traditional frame-based NN system. However, in a CTC-based system there is no guarantee of synchronization of outputs, so averaging the streams means that sometimes the blank label will dominate where it should not (for example: if one stream greedily labels "A blank blank", the second "blank A blank", and the third "blank blank A", then the average stream might label "blank blank blank" - causing a deletion).
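
A toy example of the effect, with made-up numbers (columns are [blank, A]):

    import numpy as np

    s1 = np.array([[0.1, 0.9], [0.8, 0.2], [0.8, 0.2]])  # greedy: A _ _
    s2 = np.array([[0.8, 0.2], [0.1, 0.9], [0.8, 0.2]])  # greedy: _ A _
    s3 = np.array([[0.8, 0.2], [0.8, 0.2], [0.1, 0.9]])  # greedy: _ _ A
    avg = (s1 + s2 + s3) / 3
    print(avg.argmax(1))  # [0 0 0] - blank wins every frame, so the A is deleted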

I verified this by dumping out only the first stream rather than the average, and found that the CV TER was identical to that reported by the trainer. (That's not to say that the decoding was identical, but that the end number was the same.)

Upshot: it's probably best to arbitrarily take one of the streams and use it at test time - although is there a more appropriate combination scheme?


efosler commented Aug 18, 2018

Created new issue for this particular bug. #194


efosler commented Aug 20, 2018

Latest update:
Decoding with the sw+fish LM, incorporating priors, and fixing the averaging bug leads to 19.2% WER on eval2000, with the swbd subset getting 13.4% WER (the Kaldi triphone-based system gets 13.3% WER on the same set, although that may be a more involved model). I think that this is close enough for a baseline to declare victory. I'll clean stuff up and then make a pull request.


efosler commented Aug 23, 2018

Successful full train and decode; I also tested out a run with a slightly larger net (with a bit of improvement). Adding these baselines to the README file.

# CTC Phonemes on the Complete set (with 5 BiLSTM layers) with WFST decode
%WER 12.5 | 1831 21395 | 88.9 7.7 3.4 1.5 12.5 49.6 | exp/train_phn_fbank_pitch_l5_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/results_epoch22_bs5.0_sw1_fsh_tgpr/score_8/eval2000.ctm.swbd.filt.sys
%WER 18.3 | 4459 42989 | 83.9 11.7 4.4 2.2 18.3 57.3 | exp/train_phn_fbank_pitch_l5_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/results_epoch22_bs4.0_sw1_fsh_tgpr/score_8/eval2000.ctm.filt.sys
%WER 23.9 | 2628 21594 | 79.0 15.5 5.6 2.8 23.9 62.5 | exp/train_phn_fbank_pitch_l5_c320_mdeepbilstm_w3_ntrue_p60_ip80_fp80/results_epoch22_bs4.0_sw1_fsh_tgpr/score_8/eval2000.ctm.callhm.filt.sys

# Slightly larger model (400 units, 80 internal projections) with WFST decode
%WER 12.2 | 1831 21395 | 89.2 7.7 3.1 1.4 12.2 49.7 | exp/train_phn_fbank_pitch_l5_c400_mdeepbilstm_w3_ntrue_p80_ip80_fp80/results_epoch23_bs7.0_sw1_fsh_tgpr/score_10/eval2000.ctm.swbd.filt.sys
%WER 17.8 | 4459 42989 | 84.1 11.1 4.8 1.9 17.8 57.1 | exp/train_phn_fbank_pitch_l5_c400_mdeepbilstm_w3_ntrue_p80_ip80_fp80/results_epoch23_bs7.0_sw1_fsh_tgpr/score_9/eval2000.ctm.filt.sys
%WER 23.4 | 2628 21594 | 79.3 14.8 5.9 2.7 23.4 62.1 | exp/train_phn_fbank_pitch_l5_c400_mdeepbilstm_w3_ntrue_p80_ip80_fp80/results_epoch23_bs7.0_sw1_fsh_tgpr/score_10/eval2000.ctm.callhm.filt.sys

@ramonsanabria

Awesome, thank you so much Eric! The numbers look great. Can you share the full training configuration?

Thank you again!


efosler commented Aug 23, 2018

Just submitted the pull request (#196).


efosler commented Aug 23, 2018

Once we decide that #196 is all good, I think we can close this particular thread!!!


efosler commented Aug 24, 2018

OK, closing this particular thread. Whew!

@efosler efosler closed this as completed Aug 24, 2018