[one-optimize] Optimize part of the transformer's attention-head #12917
Comments
@BalyshevArtem, this is awesome!
Would you let me know which model you used? In the model I used, only one FullyConnected layer was created in the corresponding part, so it seems that the structure varies slightly depending on the model.
I used a model generated in one of the internal repos - a modified Llama2 (split head).
@BalyshevArtem Thanks for a good idea :) As @periannath mentioned, the original pattern seems to have duplicate FCs, i.e., the two FCs are in fact the same. So the baseline would be the pattern with a single FC layer. For the second fusion, the second Mul is for applying rotary embedding, which would be a user input (not constant) if the model supports dynamic behavior. If a model only supports fixed positions (all input tokens' positions are fixed, which means that the number of previously cached tokens is also fixed), this would be an effective optimization.
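For reference, a minimal numpy sketch of a standard (LLaMA-style) rotary-embedding application that this comment refers to; the names, shapes, and frequency layout are illustrative assumptions, not taken from the model in question:

```python
import numpy as np

def rotate_half(t):
    # split the last dimension in half, negate the second half, swap the halves
    h = t.shape[-1] // 2
    return np.concatenate([-t[..., h:], t[..., :h]], axis=-1)

def apply_rope(q, cos, sin):
    # the "second Mul" mentioned above multiplies by cos/sin tables;
    # cos/sin depend on token positions, so they are constants only
    # when the positions are fixed
    return q * cos + rotate_half(q) * sin

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))                            # one head, head_dim = 8
pos, dim = 3, 8
inv_freq = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
angles = np.concatenate([pos * inv_freq, pos * inv_freq])  # LLaMA-style layout
print(apply_rope(q, np.cos(angles), np.sin(angles)))
```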
This fusion looks good to me. One minor concern is that this will reduce the operator count but create a new constant tensor. Care should be taken not to increase the model size too much.
Yes :)
Ah, your model seems to be the one whose attention heads are split. I thought about the pattern without the head split. Below is the original pattern of rotary embedding whose heads are not split. After the heads are split, it seems that a new FC is created as the FC is fused with Mul (the left Mul in the above graph). I think that kind of fusion should be applied carefully.
@BalyshevArtem Could you share any preliminary result after this optimization, e.g., impacts on cycles/traffic? If there is some sensitive information, please use our internal repo.
Sure, I will post results in internal repo :)
In this example, we can also apply some optimizations:
It seems that the first fusion is invalid. Please check the begin/end of StridedSlice.
The order of the two sliced tensors is changed, so it is impossible to convert the pattern to a simple Mul.
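A minimal numpy sketch of this point, with an assumed last dimension of 8: the StridedSlice/Neg/Concatenation pattern swaps the two halves, so an element-wise Mul by a +/-1 constant gives a different result.

```python
import numpy as np

x = np.arange(8.0)                                   # last dim = 8, split point = 4

# StridedSlice/Neg/Concatenation as used by rotary embedding:
# the two halves are swapped and one of them is negated
rotated = np.concatenate([-x[4:], x[:4]])

# what a plain element-wise Mul by a +/-1 constant can express:
mask = np.concatenate([-np.ones(4), np.ones(4)])
masked = x * mask                                    # halves are not reordered

print(rotated)   # [-4. -5. -6. -7.  0.  1.  2.  3.]
print(masked)    # [-0. -1. -2. -3.  4.  5.  6.  7.]  -> different
```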
Yes, you're right, thank you! Indeed, the tensor is divided in half and these halves are swapped. Such a pattern can still be optimized, but it gets more complicated. Let's expand the pattern in question by adding FullyConnected.
So the idea is to first split the weights and rotate them in the same way as the StridedSlices->Concatenation does. So in the example from #12917 (comment), we need to change the weights for FullyConnected (with shape …). It turns out to be a highly specialized optimization pattern, but at the same time it allows us to greatly reduce unnecessary calculations and even reduce the binary size, thanks to fusing constants and weights.
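A minimal numpy sketch of this idea under assumed shapes (not taken from the model): the StridedSlices->Concatenation reordering is folded into the FullyConnected weights offline, so a single FC already produces the rotated output.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 16))                     # FC input
W = rng.standard_normal((8, 16))                     # FC weights, [out, in]

def rotate_half(t):
    # what the StridedSlices->Concatenation computes on the FC output
    h = t.shape[-1] // 2
    return np.concatenate([-t[..., h:], t[..., :h]], axis=-1)

# original pattern: FullyConnected followed by slice/neg/concat on the activation
y_ref = rotate_half(x @ W.T)

# optimized pattern: swap and negate the weight's output rows offline,
# so a single FullyConnected already produces the rotated result
W_rot = np.concatenate([-W[4:], W[:4]], axis=0)
y_opt = x @ W_rot.T

assert np.allclose(y_ref, y_opt)
```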
@BalyshevArtem I've answered the question in the internal repo.
What
Let's introduce two new optimization passes to simplify and accelerate part of the transformer's attention head.
Originally, it has the following pattern that we can optimize:
1. First we can fuse the pattern into a Mul operation whose constant consists of 1s, with -1 where there was a Neg operation (a small numpy sketch of all three steps is given after this list).
As a result we will have:
2. Then we can fuse the Mul with the FullyConnected nodes twice and get:
3. And finally, by fusing the horizontal FC layers, we will get a single FC node:
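A minimal numpy sketch of the three steps under assumed shapes, for the simplified case where the two halves are not reordered (see the discussion above about the half-swap):

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal((1, 16))                    # shared input
W1 = rng.standard_normal((8, 16))                    # first FC weights,  [out, in]
W2 = rng.standard_normal((8, 16))                    # second FC weights, [out, in]

# step 1: the slice/Neg/concat sub-pattern becomes a Mul with a constant
# of 1s and -1s (assuming the two halves keep their order)
mask = np.concatenate([np.ones(4), -np.ones(4)])

# original pattern: two FCs on the same input, one branch followed by the Mul
y_ref = np.concatenate([x @ W1.T, (x @ W2.T) * mask], axis=-1)

# step 2: fold the Mul constant into the FC weights (scale the output rows)
W2_fused = W2 * mask[:, None]

# step 3: horizontal fusion -- stack the weights so that a single FC remains
W_single = np.concatenate([W1, W2_fused], axis=0)
y_opt = x @ W_single.T

assert np.allclose(y_ref, y_opt)
```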
Why
To speed up and simplify attention-based models.
How