Recognition too slow? Specifically in `CTCPrefixScoreTH.call` #5619

albertz · 2024-01-10T08:43:01Z

albertz
Jan 10, 2024

Via py-spy top:

Total Samples 150464
GIL: 33.00%, Active: 100.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename)                                                                                                        
  0.00% 100.00%   0.000s     1504s   <module> (returnn/rnn.py)
  0.00% 100.00%   0.000s     1504s   main (returnn/returnn/__main__.py)
  0.00% 100.00%   0.000s     1504s   forward_with_callback (returnn/returnn/torch/engine.py)
  0.00% 100.00%   0.000s     1504s   execute_main_task (returnn/returnn/__main__.py)
  0.00% 100.00%   0.010s     1504s   _returnn_v2_forward_step (i6_experiments/users/zeyer/recog.py)
  0.00% 100.00%   0.000s     1504s   _run_step (returnn/returnn/torch/engine.py)
  0.00% 100.00%   0.030s     1504s   model_recog (i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/espnet.py)
  2.00% 100.00%   11.96s     1502s   _call_impl (torch/nn/modules/module.py)
  1.00% 100.00%    4.26s     1502s   _wrapped_call_impl (torch/nn/modules/module.py)
  0.00% 100.00%    3.98s     1501s   forward (espnet/nets/beam_search.py)
  2.00%  94.00%   21.34s     1436s   search (espnet/nets/batch_beam_search.py)
  0.00%  64.00%   0.210s     1053s   score_partial (espnet/nets/batch_beam_search.py)
  0.00%  64.00%    2.27s     1053s   batch_score_partial (espnet/nets/scorers/ctc.py)
 64.00%  64.00%    1050s     1051s   __call__ (espnet/nets/ctc_prefix_score.py)
  0.00%  23.00%   0.400s    246.6s   score_full (espnet/nets/batch_beam_search.py)
  0.00%  23.00%   0.490s    246.2s   batch_score (espnet2/asr/decoder/transformer_decoder.py)
  0.00%  20.00%    1.87s    236.5s   forward_one_step (espnet2/asr/decoder/transformer_decoder.py)
  4.00%  20.00%   22.86s    227.0s   forward (espnet/nets/pytorch_backend/transformer/decoder_layer.py)
  0.00%   8.00%   28.82s    156.4s   forward (espnet/nets/pytorch_backend/transformer/attention.py)
  4.00%   4.00%   67.44s    68.64s   forward (torch/nn/modules/linear.py)
  3.00%   4.00%   46.92s    67.53s   forward_attention (espnet/nets/pytorch_backend/transformer/attention.py)
  0.00%   6.00%    8.09s    60.84s   post_process (espnet/nets/batch_beam_search.py)
  1.00%   4.00%   11.46s    60.02s   forward_qkv (espnet/nets/pytorch_backend/transformer/attention.py)
  1.00%   3.00%   19.05s    53.77s   <dictcomp> (espnet/nets/batch_beam_search.py)
  5.00%   5.00%   49.42s    52.17s   <listcomp> (espnet/nets/batch_beam_search.py)
  2.00%   2.00%   28.73s    28.73s   select_state (espnet/nets/scorers/ctc.py)
  1.00%   1.00%   21.51s    21.51s   append_token (espnet/nets/beam_search.py)
  0.00%   0.00%   20.35s    20.35s   merge_scores (espnet/nets/beam_search.py)
  1.00%   4.00%    1.62s    19.74s   forward (espnet/nets/pytorch_backend/transformer/positionwise_feed_forward.py)
  0.00%   2.00%   0.500s    17.43s   forward (espnet/nets/pytorch_backend/transformer/layer_norm.py)
  0.00%   2.00%    1.07s    16.93s   forward (torch/nn/modules/normalization.py)
  2.00%   2.00%   15.16s    15.40s   layer_norm (torch/nn/functional.py)
  0.00%   0.00%    3.69s    14.57s   batchfy (espnet/nets/batch_beam_search.py)
  1.00%   1.00%    8.21s     8.21s   __iter__ (torch/_tensor.py)
  1.00%   1.00%    8.08s     8.08s   <listcomp> (espnet2/asr/decoder/transformer_decoder.py)
  0.00%   1.00%   0.660s     7.93s   forward (torch/nn/modules/dropout.py)
  1.00%   1.00%    6.75s     7.27s   dropout (torch/nn/functional.py)
  0.00%   0.00%   0.130s     7.13s   unbatchfy (espnet/nets/batch_beam_search.py)
  0.00%   0.00%    5.64s     5.64s   __getattr__ (torch/nn/modules/module.py)
  0.00%   0.00%    4.67s     4.67s   select_state (espnet/nets/scorer_interface.py)
  0.00%   0.00%    4.41s     4.58s   pad_sequence (torch/nn/utils/rnn.py)
  1.00%   1.00%    4.30s     4.30s   batch_beam (espnet/nets/batch_beam_search.py)
  0.00%   0.00%    1.65s     3.56s   _batch_select (espnet/nets/batch_beam_search.py)
  0.00%   0.00%   0.130s     3.46s   forward (torch/nn/modules/container.py)
  0.00%   1.00%   0.130s     3.28s   forward (torch/nn/modules/activation.py)

Or py-spy dump:

Thread 738709 (active+gil): "MainThread"
    __call__ (espnet/nets/ctc_prefix_score.py:160)
    batch_score_partial (espnet/nets/scorers/ctc.py:126)
    score_partial (espnet/nets/batch_beam_search.py:233)
    search (espnet/nets/batch_beam_search.py:312)
    forward (espnet/nets/beam_search.py:437)
    _call_impl (torch/nn/modules/module.py:1527)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    model_recog (i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/espnet.py:455)
    _returnn_v2_forward_step (i6_experiments/users/zeyer/recog.py:451)
    _run_step (returnn/returnn/torch/engine.py:639)
    forward_with_callback (returnn/returnn/torch/engine.py:946)
    execute_main_task (returnn/returnn/__main__.py:510)
    main (returnn/returnn/__main__.py:663)
    <module> (returnn/rnn.py:11)

By far most of the time is spent in __call__ (espnet/nets/ctc_prefix_score.py:160), i.e. CTCPrefixScoreTH.__call__.

This is using not the official ESPnet scripts (like asr_inference) but an own script. Code:

    beam_size = 12
    ctc_weight = 0.3
    lm_weight = 0.6  # not used currently...
    ngram_weight = 0.9  # not used currently...
    penalty = 0.0
    normalize_length = False
    maxlenratio = 0.0
    minlenratio = 0.0

    scorers = {}
    asr_model = model
    decoder = asr_model.decoder

    ctc = CTCPrefixScorer(ctc=asr_model.ctc, eos=asr_model.eos)
    token_list = asr_model.token_list
    scorers.update(
        decoder=decoder,
        ctc=ctc,
        length_bonus=LengthBonus(len(token_list)),
    )

    weights = dict(
        decoder=1.0 - ctc_weight,
        ctc=ctc_weight,
        lm=lm_weight,
        ngram=ngram_weight,
        length_bonus=penalty,
    )

    assert all(isinstance(v, BatchScorerInterface) for k, v in scorers.items()), f"non-batch scorers: {scorers}"

    beam_search = BatchBeamSearch(
        beam_size=beam_size,
        weights=weights,
        scorers=scorers,
        sos=asr_model.sos,
        eos=asr_model.eos,
        vocab_size=len(token_list),
        token_list=token_list,
        pre_beam_score_key=None if ctc_weight == 1.0 else "full",
        normalize_length=normalize_length,
    )

    speech = data.raw_tensor  # [B, Nsamples]
    print("Speech shape:", speech.shape, "device:", speech.device)
    lengths = data_spatial_dim.dyn_size  # [B]
    batch = {"speech": speech, "speech_lengths": lengths}
    logging.info("speech length: " + str(speech.size(1)))

    # Encoder forward (batched)
    enc, enc_olens = asr_model.encode(**batch)
    print("Encoded shape:", enc.shape, "device:", enc.device)

    ...

    # BatchBeamSearch is misleading: It still only operates on a single sequence,
    # but just handles all hypotheses in a batched way.
    # So we must iterate over all the sequences here from the input.
    for i in range(batch_size):
        nbest_hyps: List[Hypothesis] = beam_search(
            x=enc[i, : enc_olens[i]], maxlenratio=maxlenratio, minlenratio=minlenratio
        )
        print("best:", " ".join(token_list[v] for v in nbest_hyps[0].yseq))
        # I'm not exactly sure why, but sometimes we get even more hyps?
        assert len(nbest_hyps) >= beam_size, f"got {len(nbest_hyps)} hyps, expected beam size {beam_size}"
        for j in range(beam_size):
            hyp: Hypothesis = nbest_hyps[j]
            ...

The code runs on a GeForce GTX 1080 Ti.

It is running on one of Librispeech dev-clean/dev-other/test-clean/test-other. It takes more than 4h (I'm not exactly sure how long, but after 4h, my recog job gets killed, and I really think that's already wrong, it should be way faster).

Am I doing sth wrong? Is this unexpected?

It is still an early checkpoint from training, after 1 epoch of training on Librispeech. Looking at the recognized outputs, I see lots of repetition at the end:
best: <s> AND THE TIME OF HIS FEET OF THE MAN OF THE HOUSE OF THE HOUSE OF THE OTHER SIDE OF THE KING OF THE DOOR OF THE OTHER MAN OF THE ROOM IN THE WAY OF THE DOOR OF THE OTHER SIDE OF THE TIME OF THE KING OF THE TIME IN THE WAY OF THE HOUSE AND THE DOOR OF THE KING OF THE FIRST TO THE HOUSE OF THE KING OF THE WAY OF THE WORLD IN THE DOOR OF WHICH HE HAD BEEN IN THE FIRST TIME IN THE WAY AND THE OLD WOMAN OF HIS EYES AND THE MAN OF HIS FACE OF THE DOOR OF THE HOUSE OF THE KING OF THE KING TO THE WORLD OF THE KING OF THE OTHER IN THE KING OF THE HOUSE OF THE KING OF THE HOUSE OF THE KING OF THE OTHER OF THE KING OF THE KING OF THE HOUSE OF THE KING OF THE HOUSE OF THE KING OF THE HOUSE OF THE KING OF THE KING OF THE HOUSE OF THE KING OF THE OTHER OF THE KING OF THE HOUSE OF THE KING OF THE HOUSE OF THE KING OF THE KING OF THE OTHER IN THE OTHER IN THE KING OF THE KING OF THE KING OF THE KING OF THE OTHER OF THE KING OF THE OTHER OF THE KING OF THE OTHER IN THE OTHER IN THE WORLD OF THE MAN OF THE WAY OF THE KING OF THE HOUSE OF THE KING OF THE OLD MAN OF THE KING OF THE HOUSE OF THE WAY OF THE KING OF THE OTHER OF THE KING OF THE KING IN THE OTHER OF THE WAY OF THE KING OF THE KING OF THE HOUSE OF THE WAY OF THE KING OF THE HOUSE OF THE KING OF THE OLD MAN OF THE KING OF THE HOUSE OF THE KING OF THE OLD MAN OF THE KING IN THE OLD MAN OF THE KING OF THE KING OF THE KING OF THE KING OF THE WORLD OF THE MAN OF THE KING OF THE KING OF THE KING OF THE KING OF THE KING OF THE WORLD OF THE MAN OF THE KING IN THE HOUSE OF THE MAN OF THE KING OF THE KING OF THE WORLD OF THE KING OF THE KING OF THE WORLD OF THE KING OF THE KING OF THE OLD MAN OF THE KING OF THE KING OF THE MAN OF THE HOUSE OF THE KING OF THE KING OF THE OLD MAN OF THE KING OF THE KING OF THE OLD MAN OF THE KING OF THE KING OF THE KING OF THE OLD MAN OF THE KING OF THE KING OF THE KING OF THE OLD MAN OF THE KING OF THE KING OF THE WORLD OF THE KING OF THE KING OF THE KING OF THE OLD MAN OF THE KING OF THE KING OF THE KING OF THE KING OF THE KING OF THE OLD MAN OF THE KING OF THE OLD MAN OF THE KING OF THE KING OF THE OLD MAN OF THE KING OF THE KING OF THE KING OF THE KING OF THE KING OF THE KING OF THE KING OF THE KING OF THE KING OF THE KING OF THE KING OF THE KING OF THE OLD MAN OF THE KING OF THE KING OF THE OLD MAN OF THE KING OF THE KING OF THE KING OF THE KING OF THE OLD MAN OF THE KING OF THE OLD MAN OF THE KING OF THE OLD MAN OF THE KING OF THE HOUSE OF THE KING OF THE KING OF THE KING OF THE KING OF THE KING OF THE KING OF THE OLD MAN OF THE KING OF THE OLD MAN OF THE KING OF THE OLD MAN OF THE KING OF THE KING OF THE OLD MAN OF THE KING OF THE OLD MAN OF THE KING OF THE KING OF THE OLD MAN OF THE OLD MAN OF THE KING OF THE HOUSE OF THE OLD MAN OF THE OLD MAN OF THE KING OF THE OLD MAN OF THE DOOR OF THE OLD MAN OF THE KING OF THE OLD MAN OF THE OLD MAN OF THE HOUSE OF THE KING OF THE HOUSE OF THE KING OF THE OLD MAN OF THE KING OF THE HOUSE OF THE KING OF THE DOOR OF THE KING OF THE OLD MAN OF THE KING OF THE HOUSE OF THE KING OF THE KING OF THE OLD MAN OF THE KING OF THE OLD MAN OF THE KING OF THE OLD MAN OF THE KING OF THE OLD MAN OF THE OLD MAN OF THE OLD MAN OF THE OLD MAN OF THE KING OF THE HOUSE OF THE KING OF THE OLD MAN OF THE KING OF THE <s>

So maybe that is the reason? Because the checkpoint is still bad, it gets in this bad mode of repeating, and calculating the CTX prefix scores for such long sequences just takes so long? So this is expected?

In the log, I see also lots of messages like this:

WARNING:espnet.nets.beam_search:best hypo length: 792 == max output length: 792
WARNING:espnet.nets.beam_search:decoding may be stopped by the max output length limitation, please consider to increase the maxlenratio.
WARNING:espnet.nets.beam_search:best hypo length: 792 == max output length: 792
WARNING:espnet.nets.beam_search:decoding may be stopped by the max output length limitation, please consider to increase the maxlenratio.
WARNING:espnet.nets.beam_search:best hypo length: 785 == max output length: 785
WARNING:espnet.nets.beam_search:decoding may be stopped by the max output length limitation, please consider to increase the maxlenratio.
WARNING:espnet.nets.beam_search:best hypo length: 736 == max output length: 736
WARNING:espnet.nets.beam_search:decoding may be stopped by the max output length limitation, please consider to increase the maxlenratio.
WARNING:espnet.nets.beam_search:best hypo length: 727 == max output length: 727
WARNING:espnet.nets.beam_search:decoding may be stopped by the max output length limitation, please consider to increase the maxlenratio.
WARNING:espnet.nets.beam_search:best hypo length: 724 == max output length: 724
WARNING:espnet.nets.beam_search:decoding may be stopped by the max output length limitation, please consider to increase the maxlenratio.

The dev scores from this checkpoint:

'dev_loss_acc': 0.33511875137902686,
'dev_loss_cer': 0.604846930158311,
'dev_loss_cer_ctc': 0.816621374392855,
'dev_loss_loss': 150.03562169835186,
'dev_loss_loss_att': 162.88359531457874,
'dev_loss_loss_ctc': 120.05701626210973,
'dev_loss_total': 5.1276568760862755,  # normalized
'dev_loss_wer': 1.0,

But I also wonder, I would have thought that the CTC scores for such repetitions should be very bad, so it should never run into such problems. But still, this gives the best score here?

Answered by sw005320

Jan 10, 2024

Thanks for the detailed report.

I would have thought that the CTC scores for such repetitions should be very bad.

I agree...
I'm not sure why it happens.

@takaaki-hori, could you give us the comments about the issues?
I also want to know why the thresholding techniques (Section 3.3 in https://www.merl.com/publications/docs/TR2019-102.pdf) are not working.
Maybe we disabled it.
(I just forgot the details)

View full answer

sw005320 · 2024-01-10T23:44:45Z

sw005320
Jan 10, 2024
Maintainer

Thanks for the detailed report.

I would have thought that the CTC scores for such repetitions should be very bad.

I agree...
I'm not sure why it happens.

@takaaki-hori, could you give us the comments about the issues?
I also want to know why the thresholding techniques (Section 3.3 in https://www.merl.com/publications/docs/TR2019-102.pdf) are not working.
Maybe we disabled it.
(I just forgot the details)

2 replies

albertz Jan 11, 2024
Author

Note, in the meantime, I found that I had the problem with EOS/SOS being at index 0, i.e. the same as blank index, and CTC was overwriting the EOS score. This is fixed via this PR: #5620

Now this looks much better. So maybe just this was the reason?

sw005320 Jan 11, 2024
Maintainer

I see. Thanks for the investigation!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recognition too slow? Specifically in `CTCPrefixScoreTH.call` #5619

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 2 replies

{{title}}

{{title}}

{{title}}

Select a reply

Recognition too slow? Specifically in CTCPrefixScoreTH.__call__ #5619

albertz Jan 10, 2024

Replies: 1 comment · 2 replies

sw005320 Jan 10, 2024 Maintainer

albertz Jan 11, 2024 Author

sw005320 Jan 11, 2024 Maintainer

Recognition too slow? Specifically in `CTCPrefixScoreTH.call` #5619

albertz
Jan 10, 2024

Replies: 1 comment 2 replies

sw005320
Jan 10, 2024
Maintainer

albertz Jan 11, 2024
Author

sw005320 Jan 11, 2024
Maintainer