
Confusion about computation in paper 'Simple Recurrent Units for Highly Parallelizable Recurrence' ? #157

Open
liziru opened this issue Jan 22, 2021 · 4 comments

liziru commented Jan 22, 2021

Thanks for your work. @taoleicn @taolei87

In the paper 'Simple Recurrent Units for Highly Parallelizable Recurrence', I found the following computation:
[image: the SRU light recurrence equations from the EMNLP paper, including the term v ⊙ c_{t−1}]

To facilitate parallelization, our light recurrence component uses a point-wise multiplication v ⊙ c_{t−1} instead.
With this simplification, each dimension of the state vectors becomes independent and hence parallelizable.

I don't see how the computation becomes parallelizable by using a point-wise multiplication over the dimensions of the state vector.
Since c_{t−1} is still used, I think this SRU cannot be parallelized.

Besides, I also found the formulation in 'Training RNNs as Fast as CNNs', which I think can be parallelized:
[image: the SRU equations from the arXiv paper 'Training RNNs as Fast as CNNs']

Also, what are the differences between these two papers? I also posted this question on Zhihu.

Looking forward to your reply.

@liziru liziru changed the title Confusion about computation in paper 'Simple Recurrent Units for Highly Parallelizable Recurrence' Confusion about computation in paper 'Simple Recurrent Units for Highly Parallelizable Recurrence' ? Jan 23, 2021

taolei87 commented Jan 23, 2021

Hi @liziru, thanks for posting these great questions.

(1) Difference between the two versions. In our first arXiv version, we didn't include the element-wise hidden-to-hidden connection (v ⊙ c_{t−1}) for simplicity; it is easier to derive the backward computation in that case. We added the hidden-to-hidden connection to obtain additional modeling capacity and better empirical results in our second version (the EMNLP submission). See, for example, the comparison of the two variants in Table 5 and Figure 5 of the EMNLP paper.
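
For concreteness, here is a minimal sketch (illustrative only, not the library code) of the forget gate in the two versions for a single time step; `Wf_x` is an assumed stand-in for the pre-computed projection W_f x_t:

```python
import torch

# Hidden size d; Wf_x = W_f x_t is computed once for all time steps as part
# of one large matmul, so only the element-wise recurrence part differs.
d = 4
Wf_x   = torch.randn(d)   # input projection W_f x_t
c_prev = torch.randn(d)   # previous internal state c_{t-1}
v_f    = torch.randn(d)   # per-dimension parameter vector (second version only)
b_f    = torch.zeros(d)

f_v1 = torch.sigmoid(Wf_x + b_f)                 # arXiv v1: no hidden-to-hidden term
f_v2 = torch.sigmoid(Wf_x + v_f * c_prev + b_f)  # EMNLP v2: adds v_f * c_{t-1}
```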

(2) Parallelization. There are two different kinds of parallelization here. First, the matrix multiplications in (1)-(4) can be grouped into a single one and therefore parallelized across both time steps and hidden dimensions. Second, you are correct that the point-wise computation cannot be parallelized across time steps, but it can be parallelized across the hidden dimension, since each dimension is independent. In other words, to compute c[t][i] you first need c[t-1][i], but you can compute c[*][i] and c[*][j] in parallel for i != j. Note that this second kind of parallelization is not available in LSTM / GRU, since in those RNNs the hidden-to-hidden connections are fully connected.
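
As an illustration of the second point, here is a minimal sketch of the element-wise recurrence c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ (W x_t) (not the actual CUDA kernel):

```python
import torch

# The gates and transformed inputs for all T steps come from one batched
# matmul, so the only sequential part is the loop over t below. Within each
# step, every one of the d hidden dimensions is updated independently, which
# is what a kernel can parallelize (e.g., one thread per dimension).
T, d = 8, 4
f = torch.sigmoid(torch.randn(T, d))  # forget gates f_t for all time steps
u = torch.randn(T, d)                 # transformed inputs W x_t for all steps
c = torch.zeros(d)
for t in range(T):                    # sequential over time...
    c = f[t] * c + (1 - f[t]) * u[t]  # ...element-wise over the d dimensions
```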

Hope this helps.


liziru commented Jan 25, 2021

Thanks for your reply. @taolei87
Is this GitHub implementation the first arxiv version or the second version (EMNLP submission)?
I want to run some tests on the first version for certain reasons. Where can I find the implementation of the first version?
I would appreciate it very much.

taolei87 commented

@liziru It is version 2 by default, but you can test v1 by passing v1=True to the SRU or SRUCell constructor.
See https://github.com/asappresearch/sru/blob/master/sru/modules.py#L444
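
For reference, a usage sketch based on the suggestion above (the sizes here are arbitrary examples; see the linked code for the exact constructor signature):

```python
import torch
from sru import SRU

rnn = SRU(input_size=128, hidden_size=128, num_layers=2, v1=True)  # v1=True selects the first arXiv formulation
x = torch.randn(50, 16, 128)   # (seq_len, batch, input_size)
output, state = rnn(x)         # output: (seq_len, batch, hidden_size)
```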


liziru commented Jan 25, 2021

Great! Thank you very much.
