Endpoint latency of RNN transducer measurement #5433

rxpwang · 2023-09-15T22:06:07Z


Dear Team and Community,

I am currently working on measuring the latency of RNN transducer operations. Presently, I can profile the encoding and decoding times independently from the inference log file. However, I've noticed that the RNN transducer inference process is not currently implemented in a streaming fashion within the espnet2/bin/asr_inference.py script.

My primary question is: if I intend to measure the endpoint latency of RNN transducer operations during streaming inference (where "endpoint latency" is defined as the time between the completion of speech and the end of the entire inference process), how can I accurately measure or estimate this value? As I understand it, the RNN transducer decodes the encoder output frame by frame. Given that the sum of the encoder Real-Time Factor (RTF) and the decoder RTF is currently less than 1, can I assume that the endpoint latency corresponds to the time taken to decode the last frame? I apologize if this question seems somewhat elementary.

Thank you in advance.

sw005320 · 2023-09-16T20:42:23Z

Maintainer

@b-flo, can you answer it?


rxpwang · 2023-09-16T21:56:26Z

Author

To add a bit more context: in some papers I have read, the endpoint latency at the 50th percentile (EP50) for RNN transducers is around 300 ms to 450 ms. If I only counted the processing time of the last frame, I would not expect it to take that long.
https://arxiv.org/pdf/2010.11148.pdf
https://arxiv.org/pdf/2004.11544.pdf
https://arxiv.org/pdf/2003.12710.pdf
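
For anyone wanting to compare against such figures, here is a minimal sketch of how a percentile like EP50 could be computed from per-utterance endpoint latencies. The function name `endpoint_percentiles` and the sample values are purely illustrative (they are not measurements), and the papers above may use a different percentile convention than the "inclusive" interpolation assumed here.

```python
import statistics


def endpoint_percentiles(latencies_ms):
    """Summarize per-utterance endpoint latencies (in milliseconds).

    Assumption: one latency value per utterance, at least two values.
    """
    # quantiles(n=100) returns 99 cut points; index 49 is the 50th
    # percentile and index 89 is the 90th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"EP50": cuts[49], "EP90": cuts[89]}


# Illustrative values only, not real measurements.
sample_latencies_ms = [100, 200, 300, 400, 500]
print(endpoint_percentiles(sample_latencies_ms))
```

Reporting a higher percentile alongside the median helps show whether a few slow utterances dominate the tail.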


b-flo · 2023-09-19T08:34:13Z

Maintainer

Hi,

Sorry for the delay!

> However, I've noticed that the RNN transducer inference process is not currently implemented in a streaming fashion within the espnet2/bin/asr_inference.py script.

I'm not sure which version you are using, but for streaming you should look at asr_inference_streaming.py or asr_transducer_inference.py. There are two Transducer versions in ESPnet; see the tutorial doc.

> if I intend to measure the endpoint latency of RNN transducer operations during streaming inference (where "endpoint latency" is defined as the time between the completion of speech and the end of the entire inference process), how can I accurately measure or estimate this value?

Hmm, I'm not entirely sure we can apply the described EP latency measure one-to-one in ESPnet. The paper describing the endpoint detection is not available and I can't recall the details. @sw005320 are you familiar with that paper?

Anyway, it would be similar to computing the difference between 1) the time when the last predicted token/chunk is returned by the beam search process (we don't have an explicit endpoint token), and 2) the time when the last audio chunk containing speech is received in Speech2Text (i.e., the timestamp of the last speech frame).
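
The difference described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `decode_chunk` and `contains_speech` are hypothetical callables standing in for the streaming decoder call and for whatever speech/non-speech decision is available (they are not ESPnet APIs), and the chunks are assumed to arrive at real-time pace.

```python
import time
from typing import Callable, Iterable, Optional


def measure_endpoint_latency(
    chunks: Iterable[bytes],
    decode_chunk: Callable[[bytes], str],      # hypothetical decoder wrapper
    contains_speech: Callable[[bytes], bool],  # hypothetical speech detector
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Seconds between receiving the last speech chunk and the last
    non-empty partial result being returned; None if either is missing."""
    last_speech_received = None
    last_output_returned = None
    for chunk in chunks:
        received_at = clock()  # 2) time the audio chunk is received
        if contains_speech(chunk):
            last_speech_received = received_at
        partial = decode_chunk(chunk)
        if partial:
            last_output_returned = clock()  # 1) time a token/chunk is returned
    if last_speech_received is None or last_output_returned is None:
        return None
    # Can be negative if no output follows the last speech chunk.
    return last_output_returned - last_speech_received
```

When decoding from a file, all chunks are available immediately, so the arrival pacing would need to be simulated (for example by sleeping for each chunk's duration) for the number to be meaningful.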

2 replies

rxpwang Sep 19, 2023
Author

Thank you for your response. I hadn't noticed that asr_transducer_inference.py exists for the streaming transducer; thank you for pointing it out. I guess I previously trained and ran inference with the transducer model incorrectly, because I didn't set asr_task=asr_transducer in the run.sh script. I will look into it and give it another try. Thank you also for the insight on latency measurement. I will mark this as the answer.

b-flo Sep 20, 2023
Maintainer

> I guess I previously trained and ran inference with the transducer model incorrectly, because I didn't set asr_task=asr_transducer in the run.sh script.

Not necessarily! You can define a Transducer either with asr_task=asr and decoder_type=transducer or with asr_task=asr_transducer. I recommend reading the documentation to understand the main differences.

Also, note that for streaming, the shared version (asr) relies on the contextual block approach, while the standalone version (asr_transducer) relies on a chunk-wise scheme. More explanation and references are given in the tutorial.
