
Streaming Transformer Transducer #249

stefan-falk opened this issue Jan 11, 2021 · 9 comments
stefan-falk commented Jan 11, 2021

Hi!

I am currently working on a streaming Transformer Transducer (T-T) myself (using TensorFlow), but I'm struggling to get started with the actual inference part. I've been referred to your repository from ESPnet (see espnet/espnet#2533 (comment)), as you may have noticed or will notice.

I was wondering if you could share some knowledge on how you are tackling this problem. As for me, I started by looking at "Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset" and noticed that they propose something called triggered attention ([1], [2]). In contrast, what I've been told is that you are using Monotonic Chunkwise Attention (MoChA).

I'm not quite sure how either of them works in detail, but if you could point me somewhere or help me get started, it would be much appreciated!

hirofumi0810 (Owner) commented

@stefan-falk Hello. The difference between triggered attention and MoChA is the computational complexity of each generation step: triggered attention requires O(T^2), while MoChA needs only O(T) because the context size is very limited.
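To make the per-step cost concrete, here is a minimal inference-time sketch of hard MoChA, assuming a dot-product energy, a fixed chunk width, and the standard 0.5 firing threshold. The function name and signature are illustrative only, not this repository's actual API:

```python
# A minimal sketch of hard MoChA at inference time (hypothetical names,
# not the repository's implementation). The monotonic head scans forward
# from its previous stopping point; once it fires, soft attention is
# computed over a fixed-size chunk only, so per-step cost does not grow
# with the full encoder length.
import torch
import torch.nn.functional as F

def mocha_infer_step(query, keys, values, t_prev, chunk_size=4, threshold=0.5):
    """query: (d,) decoder state; keys/values: (T, d) encoder outputs;
    t_prev: frame where the monotonic head stopped at the previous step.
    Returns (context, t) or (None, t_prev) if the head has not fired yet,
    i.e. wait for more encoder frames in a streaming setting."""
    T = keys.size(0)
    for t in range(t_prev, T):
        # Selection probability of frame t (sigmoid of a dot-product energy).
        p_select = torch.sigmoid(keys[t] @ query)
        if p_select >= threshold:  # head "fires" at frame t
            # Soft attention over a chunk of at most `chunk_size` frames
            # ending at t -- this bounded context is the whole point.
            start = max(0, t - chunk_size + 1)
            alpha = F.softmax(keys[start:t + 1] @ query, dim=0)
            context = alpha @ values[start:t + 1]
            return context, t
    return None, t_prev  # no frame selected yet; keep streaming
```

Training-time MoChA instead computes an expected (soft) alignment so the selection is differentiable; the sketch covers only the decoding path discussed here.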

stefan-falk (Author) commented

@hirofumi0810 Ah, I see. Then MoChA it is! Thank you. Do you already have working code for MoChA training and decoding? If so, I'd love to take a look at it to get started.

hirofumi0810 (Owner) commented

@stefan-falk You can start from here.
I'll send a big PR for streaming inference with the Transformer/Conformer encoder (+ caching) this week. Stay tuned!

stefan-falk (Author) commented Jan 12, 2021

@hirofumi0810 Thanks a lot! I'll be looking at the code :) And thanks for your great work!

Update

For anybody coming here: there's also a mocha.py.

stefan-falk (Author) commented Jan 14, 2021

@hirofumi0810 Can we even use MoChA inside a Transducer model? I think I misunderstood something along the way here. 😆

What I am looking for is a way to stream a Transducer-based model. In particular, I'd like to be able to stream the Transformer-Transducer (T-T) as in [1]. Are you working towards this as well?

hirofumi0810 (Owner) commented

@stefan-falk MoChA is a different framework from the Transducer, so it is not common to combine them.
That said, there is work combining attention and RNN-T (https://arxiv.org/abs/2005.08497).
I have implemented RNN-T as well (see the README). I'll support streaming inference with RNN-T soon.
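For contrast with MoChA: a Transducer has no encoder-decoder attention at all; a joint network combines each encoder frame with the prediction-network state directly. A minimal sketch (module and argument names are hypothetical, not this repository's code):

```python
# A minimal Transducer joint-network sketch (hypothetical names). Each
# encoder frame f_t is fused with the label-side prediction state g_u;
# the extra "blank" output lets the model emit nothing for a frame,
# which is what makes frame-synchronous streaming decoding possible.
import torch
import torch.nn as nn

class TransducerJoint(nn.Module):
    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size + 1)  # +1 for blank

    def forward(self, f_t, g_u):
        # f_t: (B, enc_dim) one encoder frame; g_u: (B, pred_dim) label state
        return self.out(torch.tanh(self.enc_proj(f_t) + self.pred_proj(g_u)))
```

At streaming inference time, a greedy decoder can walk over encoder frames as they arrive, repeatedly applying the joint and advancing the prediction network whenever a non-blank label wins, then moving to the next frame on blank.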

stefan-falk (Author) commented Jan 15, 2021

Thank you for the link!

However, I've kind of moved away from RNN-based Transducer models after seeing how much smaller the Transformer-Transducer (T-T) and Conformer-Transducer (C-T) models can be: a 30M-parameter C-T model outperforms a 130M-parameter RNN-T model. On my hardware, I am not even able to train such an RNN-T model 😆

Here is a quick (not very scientific) comparison from my own experiments on a German dataset.

Please note that these results were not obtained in a streaming scenario, as I am not yet able to stream Transformer-based models.

| Name   | WER    | Encoder     | Decoder |
|--------|--------|-------------|---------|
| C-T    | 12.74% | Conformer   | RNN     |
| TRNN-T | 18.38% | Transformer | RNN     |
| RNN-T  | 24.60% | RNN         | RNN     |

jinggaizi commented

@stefan-falk Hi Stefan, did you run your experiments on the German dataset with ESPnet? ESPnet1 or ESPnet2?

stefan-falk (Author) commented

@jinggaizi This is just a mix of different public datasets, e.g. Common Voice and Spoken Wikipedia.
