-
Dear Team and Community,

I am currently working on measuring the latency of RNN transducer operations. At present, I can profile the encoding and decoding times independently from the inference log file. However, I've noticed that RNN transducer inference is not implemented in a streaming fashion in the espnet2/bin/asr_inference.py script.

My primary question is: if I want to measure the endpoint latency of an RNN transducer during streaming inference (where "endpoint latency" is the time between the end of speech and the end of the entire inference process), how can I accurately measure or estimate this value?

As I understand it, the RNN transducer decodes the encoder output frame by frame. Given that the sum of the encoder Real-Time Factor (RTF) and the decoder RTF is currently less than 1, can I assume that the endpoint latency corresponds to the time taken to decode the last frame?

I apologize if this question seems somewhat elementary. Thank you in advance.
Replies: 3 comments 2 replies
-
@b-flo, can you answer this?
-
To add a bit more detail: in some papers I have read, the endpoint latency at the 50th percentile (EP50) for RNN transducers is around 300 ms to 450 ms. If I only count the processing time of the last frame, I assume it would not take that long.
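For reference, EP50 is simply the 50th percentile of the per-utterance endpoint latencies. A minimal sketch of that calculation, assuming the nearest-rank percentile definition (papers may use a different interpolation) and made-up latency values used only for illustration:

```python
import math
from typing import Sequence


def latency_percentile(latencies_ms: Sequence[float], percentile: float) -> float:
    """Return the given percentile of the latencies (nearest-rank method)."""
    if not latencies_ms:
        raise ValueError("need at least one latency measurement")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(latencies_ms)
    # Nearest rank: smallest value with at least `percentile` % of the data at or below it.
    rank = math.ceil(percentile / 100 * len(ordered))
    return ordered[rank - 1]


# Illustrative values only, not measurements from ESPnet or from any paper.
example_latencies_ms = [300.0, 450.0, 320.0, 410.0, 380.0]
ep50 = latency_percentile(example_latencies_ms, 50)
ep90 = latency_percentile(example_latencies_ms, 90)
```

With one latency value per test utterance, the same helper gives EP90 or any other percentile.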
-
Hi, sorry for the delay!
Not sure which version you use, but for streaming you should look at asr_inference_streaming.py or asr_transducer_inference.py. There are two Transducer versions in ESPnet; see the tutorial doc.
Hmm, I'm not entirely sure we can use the described EP latency measure one-to-one in ESPnet. The paper describing the endpoint detection is not available and I can't recall the details. @sw005320, are you familiar with that paper? Anyway, a similar measure would be the difference between 1) the time when the last predicted token/chunk is returned by the beam search process (we don't have an explicit endpoint token), and 2) the time when the last audio chunk containing speech is received in Speech2Text (i.e., the timestamp of the last speech frame).
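The difference described above could be measured roughly as follows. This is only a sketch under stated assumptions: `decode_chunk` is a hypothetical callable standing in for whatever streaming call you use (it is not an ESPnet API), and the `contains_speech` flag is assumed to come from a VAD or from reference alignments.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def measure_endpoint_latency(
    chunks: Iterable[Tuple[object, bool]],
    decode_chunk: Callable[[object], List[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Time between receiving the last speech chunk and the last returned token.

    `chunks` yields (audio_chunk, contains_speech) pairs in arrival order.
    `decode_chunk` is a hypothetical stand-in for the streaming decoder step.
    """
    last_speech_received = None
    last_output_returned = None
    for chunk, contains_speech in chunks:
        received_at = clock()  # 2) time the chunk is handed to the recognizer
        if contains_speech:
            last_speech_received = received_at
        tokens = decode_chunk(chunk)
        if tokens:
            last_output_returned = clock()  # 1) time the newest tokens are returned
    if last_speech_received is None or last_output_returned is None:
        return None  # no speech seen or nothing decoded
    return last_output_returned - last_speech_received
```

Note that when chunks are read from a file rather than a live microphone, the "received" time is the time the chunk is handed to the decoder, so the result reflects compute time only. The value can also be negative if the final tokens are emitted before the last speech chunk arrives.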