By far most of the time is spent in …

This is not using the official ESPnet scripts (like `asr_inference`) but my own script. Code:

```python
import logging
from typing import List

from espnet.nets.batch_beam_search import BatchBeamSearch
from espnet.nets.beam_search import Hypothesis
from espnet.nets.scorer_interface import BatchScorerInterface
from espnet.nets.scorers.ctc import CTCPrefixScorer
from espnet.nets.scorers.length_bonus import LengthBonus

beam_size = 12
ctc_weight = 0.3
lm_weight = 0.6  # not used currently...
ngram_weight = 0.9  # not used currently...
penalty = 0.0
normalize_length = False
maxlenratio = 0.0
minlenratio = 0.0

scorers = {}
asr_model = model
decoder = asr_model.decoder
ctc = CTCPrefixScorer(ctc=asr_model.ctc, eos=asr_model.eos)
token_list = asr_model.token_list
scorers.update(
    decoder=decoder,
    ctc=ctc,
    length_bonus=LengthBonus(len(token_list)),
)
weights = dict(
    decoder=1.0 - ctc_weight,
    ctc=ctc_weight,
    lm=lm_weight,
    ngram=ngram_weight,
    length_bonus=penalty,
)
assert all(isinstance(v, BatchScorerInterface) for k, v in scorers.items()), f"non-batch scorers: {scorers}"
beam_search = BatchBeamSearch(
    beam_size=beam_size,
    weights=weights,
    scorers=scorers,
    sos=asr_model.sos,
    eos=asr_model.eos,
    vocab_size=len(token_list),
    token_list=token_list,
    pre_beam_score_key=None if ctc_weight == 1.0 else "full",
    normalize_length=normalize_length,
)

speech = data.raw_tensor  # [B, Nsamples]
print("Speech shape:", speech.shape, "device:", speech.device)
lengths = data_spatial_dim.dyn_size  # [B]
batch = {"speech": speech, "speech_lengths": lengths}
logging.info("speech length: " + str(speech.size(1)))

# Encoder forward (batched)
enc, enc_olens = asr_model.encode(**batch)
print("Encoded shape:", enc.shape, "device:", enc.device)

...

# BatchBeamSearch is misleading: it still only operates on a single sequence,
# but handles all hypotheses of that sequence in a batched way.
# So we must iterate over all the sequences of the input batch here.
batch_size = speech.shape[0]
for i in range(batch_size):
    nbest_hyps: List[Hypothesis] = beam_search(
        x=enc[i, : enc_olens[i]], maxlenratio=maxlenratio, minlenratio=minlenratio
    )
    print("best:", " ".join(token_list[v] for v in nbest_hyps[0].yseq))
    # I'm not exactly sure why, but sometimes we get even more hyps?
    assert len(nbest_hyps) >= beam_size, f"got {len(nbest_hyps)} hyps, expected beam size {beam_size}"
    for j in range(beam_size):
        hyp: Hypothesis = nbest_hyps[j]
        ...
```

The code runs on a GeForce GTX 1080 Ti, decoding one of Librispeech dev-clean/dev-other/test-clean/test-other. It takes more than 4h (I'm not exactly sure how long; after 4h, my recog job gets killed, and I really think that's already wrong, it should be way faster). Am I doing something wrong? Is this unexpected?

It is still an early checkpoint, after 1 epoch of training on Librispeech. Looking at the recognized outputs, I see lots of repetition at the end. So maybe that is the reason? Because the checkpoint is still bad, it gets into this degenerate mode of repeating, and calculating the CTC prefix scores for such long sequences just takes so long? So this is expected?

In the log, I also see lots of messages like this:
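For context on the runtime: in ESPnet's `BeamSearch`, the `maxlenratio` argument determines the maximum hypothesis length, and `maxlenratio=0.0` (as in my script above) means the cap is the full encoder output length, so a degenerate checkpoint can keep repeating tokens until it hits the number of encoder frames. A minimal sketch of that rule (my paraphrase of the logic, not the actual ESPnet code):

```python
def compute_maxlen(num_enc_frames: int, maxlenratio: float) -> int:
    """Sketch of how the maximum output length is derived from maxlenratio.

    maxlenratio == 0.0 means "decode up to the full encoder length", so with
    e.g. 1000 encoder frames a bad model may emit up to 1000 tokens of
    repetition before EOS is forced.
    """
    if maxlenratio == 0.0:
        return num_enc_frames
    # a positive ratio caps the output length relative to the input length
    return max(1, int(maxlenratio * num_enc_frames))

print(compute_maxlen(1000, 0.0))  # 1000: unbounded repetition up to encoder length
print(compute_maxlen(1000, 0.5))  # 500: a positive ratio bounds the hypothesis length
```

So one experiment would be to pass a positive `maxlenratio` and see whether the job finishes in reasonable time.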
The dev scores from this checkpoint:
But I also wonder: I would have thought that the CTC scores for such repetitions should be very bad, so it should never run into such problems. But still, this gives the best score here?
Thanks for the detailed report.
I agree...
I'm not sure why it happens.
@takaaki-hori, could you comment on these issues?
I also want to know why the thresholding techniques (Section 3.3 in https://www.merl.com/publications/docs/TR2019-102.pdf) are not working.
Maybe we disabled it (I just forgot the details).
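For what it's worth, my understanding of the related pre-beam mechanism (controlled by `pre_beam_score_key` / `pre_beam_ratio` in the script above): the expensive partial scorers, such as the CTC prefix scorer, are only evaluated on the top candidates under the cheap full scores. A toy sketch of that idea (conceptual only; the names and scoring here are made up, this is not ESPnet's actual code):

```python
import heapq

def beam_step(full_scores, ctc_prefix_score, pre_beam_size, beam_size, ctc_weight):
    """One conceptual beam-search expansion step with pre-beam pruning.

    full_scores: dict token -> cheap (e.g. decoder) log-prob for the current prefix
    ctc_prefix_score: callable token -> expensive CTC prefix log-prob
    Only the pre_beam_size best tokens under the cheap score get the expensive
    CTC prefix score; everything else is pruned beforehand.
    """
    # 1. pre-beam: keep only the top candidates by the cheap score
    pre_beam = heapq.nlargest(pre_beam_size, full_scores, key=full_scores.get)
    # 2. the expensive scorer runs only on the surviving candidates
    combined = {
        tok: (1.0 - ctc_weight) * full_scores[tok]
        + ctc_weight * ctc_prefix_score(tok)
        for tok in pre_beam
    }
    # 3. final beam: top beam_size tokens by the combined score
    return heapq.nlargest(beam_size, combined, key=combined.get)

toy_scores = {0: -1.0, 1: -0.5, 2: -3.0, 3: -0.1}
print(beam_step(toy_scores, lambda t: -0.2 * t, 3, 2, 0.3))  # [3, 1]
```

If that pruning (or the thresholding from the paper) were disabled, every vocabulary entry would get the expensive CTC prefix score at every step, which could explain part of the slowdown.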