LinearAttention Module #169

Open
rachtibat opened this issue Oct 20, 2022 · 1 comment
Labels: model compatibility (Compatibility for new or variations of existing models)

Comments

rachtibat (Contributor) commented Oct 20, 2022

Hi Christopher,

I hope you're doing well. I'm really glad that the zennit community is growing, congratulations!
With a growing community, more nn.Modules need to be explained, which is why I'm writing this issue.
A student in our department is trying to explain a LinearAttention module (the implementation is below for reference).

It contains a series of torch.einsum and torch.transpose operations.

It also uses the rearrange function from the einops library, which provides an alternative syntax for basic torch operations like transpose, reshape, etc.

I think zennit should be able to analyze a series of reshaping and transposing operations, but I am not completely sure.
I'd be glad if you could give your opinion on analyzing such a linear attention module. If you don't know, that's no problem either (: then it's the beginning of a new research topic.

(The softmax function is also a problem, but maybe Arras et al. have a solution to this which the student could implement...)

Best,
Reduan

import torch
from torch import nn
from einops import rearrange


class LinearAttention(nn.Module):
    def __init__(self, dim, heads=4, dim_head=32):
        super().__init__()
        self.scale = dim_head**-0.5
        self.heads = heads
        hidden_dim = dim_head * heads
        self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias=False)

        self.to_out = nn.Sequential(nn.Conv2d(hidden_dim, dim, 1),
                                    nn.GroupNorm(1, dim))

    def forward(self, x):
        b, c, h, w = x.shape
        # project to queries, keys and values with a single 1x1 convolution
        qkv = self.to_qkv(x).chunk(3, dim=1)
        # split heads and flatten the spatial dimensions: (b, heads, dim_head, h*w)
        q, k, v = map(
            lambda t: rearrange(t, "b (h c) x y -> b h c (x y)", h=self.heads), qkv
        )

        # linear attention: normalize q over the feature axis, k over the spatial axis
        q = q.softmax(dim=-2)
        k = k.softmax(dim=-1)

        q = q * self.scale
        # aggregate values weighted by keys: (b, heads, dim_head, dim_head)
        context = torch.einsum("b h d n, b h e n -> b h d e", k, v)

        # distribute the aggregated context back to the query positions
        out = torch.einsum("b h d e, b h d n -> b h e n", context, q)
        # restore the spatial layout: (b, heads*dim_head, h, w)
        out = rearrange(out, "b h c (x y) -> b (h c) x y", h=self.heads, x=h, y=w)
        return self.to_out(out)
chr5tphr (Owner) commented Nov 2, 2022

Hey Reduan,

thank you for the issue!
You can have a look at this work, where they introduce LRP for Transformers (i.e. also attention heads).
I have talked to @tschnake before about bringing transformers to Zennit, which is still as WIP as it gets.

About the implementation details:

The rearrange operation is just a re-indexing, so the correct approach for it is simply the gradient, which means it is already supported by Zennit.
The einsum is a linear operation, so it can be handled like a linear layer in LRP (see the sketch after this list).
The softmax is a little tricky. In the work above, they handle this by viewing the gating terms as constants.
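
To make the first two points concrete, here is a minimal plain-PyTorch sketch (deliberately not using Zennit's API) of an epsilon-style relevance pass through the second einsum above, followed by carrying the relevance through rearrange with the plain gradient; all names and shapes are illustrative, not existing Zennit code.

import torch
from einops import rearrange

# illustrative shapes matching the LinearAttention module above
b, heads, dim_head, n = 2, 4, 32, 64                  # n = x * y spatial positions
context = torch.randn(b, heads, dim_head, dim_head)   # result of the first einsum
q = torch.randn(b, heads, dim_head, n, requires_grad=True)

# the einsum is linear in q, so it can be treated like a linear layer
out = torch.einsum("b h d e, b h d n -> b h e n", context, q)

# epsilon-style LRP for a linear operation: R_q = q * d/dq [ sum(out * R_out / (out + eps)) ]
eps = 1e-6
relevance_out = out.detach()                           # stand-in for the relevance from above
grad_outputs = relevance_out / (out.detach() + eps)
(grad_q,) = torch.autograd.grad(out, q, grad_outputs=grad_outputs)
relevance_q = q.detach() * grad_q

# rearrange is a pure re-indexing: passing relevance through it via the plain
# gradient only re-orders the entries, so the total relevance is conserved
relevance_img = rearrange(relevance_q, "b h c (x y) -> b (h c) x y", x=8, y=8)
print(relevance_q.sum().item(), relevance_img.sum().item())  # identical sums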

In code, we may get away with requiring torch.nn.Softmax to be used and implementing a Constant rule, which sets the gradient to zero, although I need to think a little more about whether this would work as intended.
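
For illustration only, such a Constant rule could look roughly like the sketch below, assuming it uses the same Hook.backward interface as Zennit's existing rules (e.g. Pass); the name Constant and its use for torch.nn.Softmax are just the idea from above and not an existing part of Zennit.

import torch
from zennit.core import Hook


class Constant(Hook):
    '''Hypothetical rule: treat the module's output as a constant, so that no
    relevance is propagated through it during the backward pass.'''
    def backward(self, module, grad_input, grad_output):
        # replace the incoming relevance with zeros instead of passing it on
        return tuple(torch.zeros_like(grad) for grad in grad_output)

A composite could then map torch.nn.Softmax to this rule, so the gating terms produced by the softmax are treated as constants during the relevance pass.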

Otherwise, we could also implement a canonizer (or a meta-rule) for the most popular library implementing attention layers.

chr5tphr added the model compatibility label on Aug 11, 2023