
WIP: begin to add CTC training with kaldi pybind and PyTorch. #3947

Open · wants to merge 11 commits into base: pybind11

Conversation

@csukuangfj (Contributor) commented Feb 20, 2020

Work in progress.

@danpovey (Contributor) left a comment

A few comments. Bear in mind that my intention when implementing CTC is to allow the supervision to be generic FSTs, not limited to linear sequences. This may already be what you are doing. This will allow dictionaries with multiple entries for words, for instance, and optional silence. The forward-backward code would do the same as our current numerator forward-backward code, but I want to implement it on the GPU. Meixu was going to look into GPU programming for this. I could help as well myself; I wrote Kaldi's denominator forward-backward code.

import argparse

parser = argparse.ArgumentParser(description='convert text to labels')

parser.add_argument('--lexicon-filename', dest='lexicon_filename', type=str)
parser.add_argument('--tokens-filename', dest='tokens_filename', type=str)
@danpovey (Contributor):

Please use the standard OpenFST symbol-table format for these tokens.
I'm open to other opinions, but since we'll probably have these symbols present in FSTs, I think symbol 0 should be reserved for <eps> and <blk> should be 1, and we can just apply an offset of 1 when interpreting the nnet outputs.
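For illustration, a minimal sketch of this offset-of-1 convention (the helper names are hypothetical, not from this PR):

```python
# Proposed convention: symbol 0 is reserved for <eps>, symbol 1 is <blk>.
# The nnet output has no row for <eps>, so output index i corresponds to
# symbol i + 1 in the table, and output index 0 is the blank.

def nnet_index_to_symbol_id(i: int) -> int:
    """Map a nnet output index to an OpenFST symbol id."""
    return i + 1

def symbol_id_to_nnet_index(symbol_id: int) -> int:
    """Map a non-<eps> symbol id back to a nnet output index."""
    assert symbol_id >= 1, '<eps> (symbol 0) has no nnet output row'
    return symbol_id - 1
```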

@danpovey (Contributor):

... if the format is already the symbol-table format, bear in mind that the order of lines is actually arbitrary; what matters is the integer there.

@csukuangfj (Contributor Author):

I reuse the notation from EESEN (https://github.com/srvk/eesen), which refers to
phones.txt as tokens.txt.

tokens.txt is actually a phone symbol table, with

<eps> 0
<blk> 1
(other phones follow)

The code here does not place any constraint on the order of lines. What
matters is only the integer id of each symbol. The first two integers, 0 and 1,
are reserved: 0 is reserved for <eps>, and here I reserve 1 for
the blank symbol.

The script generating tokens.txt takes these constraints into account.

Since there is a T in TLG.fst, I keep using tokens.txt here instead
of phones.txt. I can switch to phones.txt if you think that is more natural
in Kaldi.
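For reference, a minimal sketch (not the actual script in this PR) of loading tokens.txt as an OpenFST-style symbol table and checking the reserved entries:

```python
def load_token_table(filename):
    """Load an OpenFST-style symbol table: one '<symbol> <integer>' per line.

    The order of lines is arbitrary; only the integer ids matter.
    """
    sym2id = {}
    with open(filename, encoding='utf-8') as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip empty lines
            sym2id[fields[0]] = int(fields[1])
    # Check the two reserved entries described above.
    assert sym2id['<eps>'] == 0
    assert sym2id['<blk>'] == 1
    return sym2id
```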

egs/aishell/s10b/local/token_to_fst.py (resolved)
@csukuangfj (Contributor Author):

I intend to use Baidu's warp-ctc (https://github.com/baidu-research/warp-ctc)
or PyTorch's built-in CTCLoss (https://pytorch.org/docs/stable/nn.html#torch.nn.CTCLoss).

Neither of them supports words with multiple pronunciations. I currently use only
the pronunciation of a word from its first appearance and ignore the alternative pronunciations.

Can we first implement a baseline that considers only one pronunciation?
This approach is the easiest, and we can reuse existing APIs to compute the CTC loss.
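For concreteness, a minimal sketch of PyTorch's built-in CTCLoss on a padded batch (all shapes and dimensions below are made up for illustration); warp-ctc exposes an essentially equivalent interface:

```python
import torch
import torch.nn as nn

T, N, C = 50, 2, 219  # frames, batch size, nnet output dim (index 0 = blank)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 20), dtype=torch.long)  # label ids; 0 is the blank
input_lengths = torch.tensor([50, 42])   # valid frames per utterance
target_lengths = torch.tensor([20, 15])  # valid labels per utterance

ctc_loss = nn.CTCLoss(blank=0, reduction='mean')
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

Note that blank=0 here matches the offset-of-1 convention discussed above: nnet output index 0 is the blank, i.e. symbol 1 (<blk>) in tokens.txt.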

@danpovey (Contributor) commented Feb 20, 2020 via email

@csukuangfj (Contributor Author):

The loss drops from 83 to 4.8 after 100 batches and stops decreasing. I am trying
to find out the reason.

@danpovey (Contributor) commented Feb 21, 2020 via email

@csukuangfj (Contributor Author):

@danpovey thanks

My current network architecture is

layer0: input-batchnorm

layer1: lstm
layer2: projection + tanh

layer3: lstm
layer4: projection + tanh

layer5: lstm
layer6: projection + tanh

layer7: lstm
layer8: projection + tanh

layer9: prefinal-affine
layer10: log-softmax
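For concreteness, a minimal PyTorch sketch of such a stack (all dimensions are placeholders, not necessarily the values used in this PR):

```python
import torch
import torch.nn as nn

class LstmCtcModel(nn.Module):
    """4 x (LSTM -> projection -> tanh), then affine + log-softmax."""

    def __init__(self, input_dim=40, hidden_dim=512, proj_dim=256, output_dim=219):
        super().__init__()
        self.input_batchnorm = nn.BatchNorm1d(input_dim)  # layer0
        self.lstms = nn.ModuleList()
        self.projections = nn.ModuleList()
        in_dim = input_dim
        for _ in range(4):  # layers 1-8: four LSTM + projection/tanh pairs
            self.lstms.append(nn.LSTM(in_dim, hidden_dim, batch_first=True))
            self.projections.append(nn.Linear(hidden_dim, proj_dim))
            in_dim = proj_dim
        self.prefinal_affine = nn.Linear(proj_dim, output_dim)  # layer9

    def forward(self, x):
        # x: (batch, time, feat_dim); BatchNorm1d wants (batch, feat_dim, time)
        x = self.input_batchnorm(x.transpose(1, 2)).transpose(1, 2)
        for lstm, proj in zip(self.lstms, self.projections):
            x, _ = lstm(x)
            x = torch.tanh(proj(x))
        x = self.prefinal_affine(x)
        return nn.functional.log_softmax(x, dim=-1)  # layer10
```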

Do you mean I should replace layer9: prefinal-affine with an LSTM layer?

@danpovey (Contributor) commented Feb 21, 2020 via email

@csukuangfj (Contributor Author):

@danpovey
Do we need to normalize the coefficients?

That is, to replace

  • [-1, 0, 1]
  • [1, 0, -2, 0, 1]

with

  • [-0.5, 0, 0.5]
  • [0.25, 0, -0.5, 0, 0.25]
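For illustration, here is my understanding of the two variants as fixed 1-D convolutions along the time axis (a hedged sketch, not code from this PR):

```python
import torch
import torch.nn.functional as F

def apply_time_filter(feats, coeffs):
    """Convolve each feature dimension with a fixed kernel along time.

    feats: (batch, time, feat_dim); coeffs: odd-length list, e.g. [-1, 0, 1].
    """
    feat_dim = feats.size(-1)
    # F.conv1d computes cross-correlation; flip the kernel for true convolution.
    kernel = torch.tensor(coeffs, dtype=feats.dtype).flip(0)
    weight = kernel.repeat(feat_dim, 1, 1)  # (feat_dim, 1, k): one kernel per channel
    x = feats.transpose(1, 2)  # (batch, feat_dim, time)
    y = F.conv1d(x, weight, padding=len(coeffs) // 2, groups=feat_dim)
    return y.transpose(1, 2)

feats = torch.randn(2, 100, 40)
unnormalized = apply_time_filter(feats, [-1, 0, 1])
normalized = apply_time_filter(feats, [-0.5, 0, 0.5])
# The normalized variant only rescales the output by a constant factor, which
# a following batchnorm or learned affine layer would absorb anyway.
assert torch.allclose(normalized, 0.5 * unnormalized, atol=1e-6)
```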

@danpovey (Contributor) commented Feb 24, 2020 via email

We need to replace the LSTM with a TDNN, since the LSTM is difficult to get to converge.
@csukuangfj (Contributor Author):

Still a work in progress.

If you use only 8 wave files for training and decoding, you can
get a CER as low as 0.04, which verifies that the pipeline is working.

I find that the LSTM model is hard to get to converge. I'm going to replace it with TDNN-F.

@csukuangfj (Contributor Author):

Decode results for the current pull request are as follows:

|          | this pull request | haowen's PyTorch (#3925) | fanlu's PyTorch (#3940) with online CMVN | haowen's Kaldi (#3925) |
|----------|-------------------|--------------------------|------------------------------------------|------------------------|
| test CER | 12.91             | 7.86                     | 7.31                                     | 7.08                   |
| test WER | 21.90             | 16.56                    | 15.97                                    | 15.72                  |
| dev CER  | 11.81             | 6.47                     | 6.16                                     | 5.99                   |
| dev WER  | 20.46             | 14.45                    | 14.01                                    | 13.86                  |

The first column uses nearly the same TDNN-F model architecture as the remaining columns,
except that it has no xent regularizer. In addition, the first column uses the CTC loss
instead of the chain loss.

Training takes about 59 minutes per epoch, and the decode results are for the 12th epoch.
The CTC loss value is 0.086.

There is still a big gap in CER/WER between this pull request and the chain model.
I will add an LSTM layer before the output affine layer, add spectral augmentation,
and run the training again.

@danpovey (Contributor):

@csukuangfj If you are implementing CTC-CRF, and if I understand it correctly from this paper
http://oa.ee.tsinghua.edu.cn/~ouzhijian/pdf/ctc-crf.pdf
it is the same as LF-MMI except that there is no context dependency, the self-loop (blank symbol) is shared between all phones, and there is no optional silence. The current tree-building mechanism in Kaldi doesn't allow for one pdf to be shared and the others not; however, I do remember doing experiments with making the blank shared vs. not shared (I don't recall how), and not shared was a bit better.
It should be possible to use the same forward-backward code for both numerator and denominator, for CTC-CRF as for LF-MMI.

@csukuangfj (Contributor Author):

@danpovey

Thanks. I am still learning and have read the implementation of Kaldi's
denominator computation. I find that the denominator part of CTC-CRF is adapted
from Kaldi's code. CTC-CRF takes more than 3 days on the AIShell dataset
to reach a reasonable CER, which is too slow to be acceptable. I am trying to figure
out the reason and to reuse as much code from Kaldi as possible.

@csukuangfj (Contributor Author) commented Mar 11, 2020

@danpovey
I've read the denominator implementation of CTC-CRF and find that it is
a re-implementation of the chain denominator computation, with the following differences:

  • It does not perform 100 iterations over the FST to get the initial probability for each state;
    only the start states take their probability from the state weight, all other states have
    initial probability 0, and the initial state probabilities are not normalized.

  • It has no leaky HMM.

  • It performs the computation in log space.

Unlike in chain training, examples in the same batch of CTC training
do not have the same number of frames, so DenominatorComputation in src/chain
cannot be used directly for CTC training.

I would like to implement a class CtcDenominatorComputation that can handle
examples with different sequence lengths in the same batch; a sketch of the
masking idea follows below.

Would the above differences (or tricks in Kaldi) make a significant impact on the final training result?
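To make the masking point concrete, here is a minimal sketch of a batched log-space forward pass over a dense transition matrix, with padded frames masked per example (final-state weights are omitted for brevity; this is not Kaldi's DenominatorComputation):

```python
import torch

def log_space_forward(log_probs, transitions, lengths):
    """Batched forward algorithm in log space with per-example lengths.

    log_probs:   (batch, max_time, num_states) emission log-probs per state.
    transitions: (num_states, num_states); transitions[i, j] is the log weight
                 of moving from state i to state j.
    lengths:     (batch,) number of valid (unpadded) frames per example.

    As described above: only the start state (state 0 here) gets initial
    probability, and the initial probabilities are not normalized.
    """
    batch, max_time, num_states = log_probs.shape
    alpha = log_probs.new_full((batch, num_states), float('-inf'))
    alpha[:, 0] = 0.0  # log(1): all initial mass on the start state
    total = log_probs.new_full((batch,), float('-inf'))
    for t in range(max_time):
        # alpha'(j) = logsumexp_i(alpha(i) + trans(i, j)) + emission(t, j)
        new_alpha = torch.logsumexp(
            alpha.unsqueeze(2) + transitions.unsqueeze(0), dim=1
        ) + log_probs[:, t]
        active = (t < lengths).unsqueeze(1)  # freeze alpha on padded frames
        alpha = torch.where(active, new_alpha, alpha)
        # Record the total where the example ends exactly at this frame.
        done = lengths == t + 1
        total = torch.where(done, torch.logsumexp(alpha, dim=1), total)
    return total  # (batch,) log total path weight per example
```

A real implementation would use a sparse FST representation and run this recursion on the GPU, but the per-example masking idea carries over.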

@danpovey (Contributor):

@csukuangfj I don't think those things would make a significant difference. Also, I don't think that name is really appropriate: firstly, it isn't CTC (the special characteristic of CTC is that it is normalized to 1, so it needs no denominator computation). Regarding the different sequence lengths... it would be interesting to allow the denominator computation to handle different sequence lengths (the numerator as well). The difficulty is doing this efficiently: GPU programming is only efficient if you can form suitable batches. What I was thinking was, it might be better to still have batches with fixed-size elements, but instead focus on allowing the numerator computation to work with that. If we had to break sentences into pieces, we could either handle it by constraining to an FST like Kaldi's current implementation of chain training, or use the regular CTC-type FST but allow all states to be initial and final, so the sequence can start and end in the middle.

@csukuangfj (Contributor Author):

I am trying to sort examples by their sequence lengths so that there is as little padding
as possible within each batch (see the sampler sketch at the end of this comment).


> What I was thinking was, it might be better to still have batches with fixed-size elements, but instead focus on allowing the numerator computation to work with that.

I'm still learning the internals of chain training. We can come back to this once the handling
of different sequence lengths is finished; at the least, it can serve as a baseline.
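For concreteness, a minimal sketch of the length-sorting idea as a PyTorch batch sampler (the class and argument names are mine, not from this PR):

```python
import torch
from torch.utils.data import Sampler

class LengthSortedBatchSampler(Sampler):
    """Yield batches of dataset indices grouped by similar sequence length.

    Sorting by length before batching keeps the padding within each batch small.
    """

    def __init__(self, lengths, batch_size, shuffle=True):
        self.shuffle = shuffle
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        self.batches = [
            order[i:i + batch_size] for i in range(0, len(order), batch_size)
        ]

    def __iter__(self):
        # Shuffle the order of batches (not within batches) so training sees
        # varied lengths across steps while padding stays minimal.
        if self.shuffle:
            perm = torch.randperm(len(self.batches)).tolist()
        else:
            perm = range(len(self.batches))
        for i in perm:
            yield self.batches[i]

    def __len__(self):
        return len(self.batches)
```

This would be passed as batch_sampler= to a DataLoader, together with a collate function that pads only within each batch.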

stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the "stale" label Jun 19, 2020
stale bot commented Jul 19, 2020

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.

stale bot closed this Jul 19, 2020
@kkm000 reopened this Jul 19, 2020
stale bot removed the "stale" label Jul 19, 2020
@kkm000 added the "stale-exclude" label Jul 21, 2020