GPU fusion kernel perf drops when it has different layouts for the I/O tensors. #766
shawnwang18 started this conversation in General
Replies: 1 comment
-
I think that's the bug right there: with proper codegen we should never have uncoalesced reads inside the kernel. This is already covered by your point (2): all fused kTranspose instructions should go through transpose codegen, and if they don't, it's a bug. Could you post an example of the transpose instruction, after all optimizations, which does not?
-
We observed that the performance of an XLA fusion kernel may severely regress when the kernel uses different layouts for its input/output tensors. The issue seems to occur frequently when multiple layouts are used in the same XLA model.
In principle, we want reads and writes of the fusion kernel's I/O tensors to be coalesced, so that memory accesses benefit from cache locality. Coalescing is usually guaranteed when the fusion kernel's I/O layouts match: for example, in a kLoop fused kernel, neighboring GPU threads process output elements whose addresses are consecutive, and the input elements those threads fetch also come from consecutive memory addresses, provided both input and output tensors use the same memory layout.
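To make the coalescing argument concrete, here is a small sketch (not XLA code; the 64x64 shape and warp model are illustrative assumptions) that computes which input offsets a "warp" of adjacent threads touches in a kLoop-style fusion, for a matching layout versus a transposed one:

```python
# Sketch: model which element offsets a warp of adjacent GPU threads reads
# from an input tensor, when thread t handles the t-th output element in
# row-major output order. Shapes and warp size are illustrative.

def strides_row_major(shape):
    """Element strides for a row-major (last-dimension-minor) layout."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def warp_addresses(shape, input_strides, warp_size=32):
    """Linear input offsets read by threads 0..warp_size-1."""
    out_strides = strides_row_major(shape)
    addrs = []
    for t in range(warp_size):
        # Unflatten thread id t into a row-major multi-index of the output.
        idx, rem = [], t
        for s in out_strides:
            idx.append(rem // s)
            rem %= s
        # Offset of the same logical element under the *input* strides.
        addrs.append(sum(i * s for i, s in zip(idx, input_strides)))
    return addrs

shape = [64, 64]
same = warp_addresses(shape, strides_row_major(shape))  # input layout == output layout
transposed = warp_addresses(shape, [1, 64])             # input stored transposed

print(same[:4])        # [0, 1, 2, 3] -> consecutive, coalesced
print(transposed[:4])  # [0, 64, 128, 192] -> stride-64, uncoalesced
```

In the matching-layout case the warp reads one contiguous segment; in the transposed case each thread lands 64 elements apart, so each read becomes a separate memory transaction.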
However, we see that multiple XLA fusion optimization passes do not consider the impact of tensor layouts, which can end up generating a very large fusion kernel with very bad data locality. A simple example that generates different I/O tensor layouts is shown below:
The above graph shows the layouts after the GpuLayoutAssignment pass. Here the custom-call instruction requires mandatory layout {1,0,2,3}, while the Parameter instruction requires mandatory layout {3,2,1,0}. XLA introduces a copy operator to transform between the layouts, since the Add operator requires its operands to share the same layout. During the GpuInstructionFusion pass, the Add and Copy operators are fused, which creates a fusion kernel whose operand tensors have different layouts, and locality is bad when reading the Parameter instruction's output tensor inside the fused kernel.
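For readers less familiar with XLA layouts: a layout is a minor-to-major list of dimension numbers, and the two layouts above imply very different element strides. A small sketch (the rank-4 shape is a made-up example; the two layouts are the ones from the graph):

```python
# Sketch: map an XLA minor-to-major layout list to per-dimension element
# strides. The first entry of minor_to_major is the minor-most dimension.

def layout_to_strides(shape, minor_to_major):
    strides = [0] * len(shape)
    running = 1
    for dim in minor_to_major:
        strides[dim] = running
        running *= shape[dim]
    return strides

shape = [2, 3, 4, 5]
print(layout_to_strides(shape, [3, 2, 1, 0]))  # [60, 20, 5, 1] (row-major)
print(layout_to_strides(shape, [1, 0, 2, 3]))  # [3, 1, 6, 24]
```

Elements that are adjacent under one layout (stride 1 along dimension 3 in {3,2,1,0}) sit 24 elements apart under the other, which is why a layout-changing Copy is needed before the Add, and why reading across the mismatched layout inside one fused kernel is uncoalesced.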
What makes the perf even worse is that later fusion passes, such as FusionMerger, GpuMultiOutputFusion, and GpuHorizontalInputFusion, also do not consider any layout impact, which in some cases ends up generating a very large fusion kernel with different-layout I/O tensors.
We have a case in a Stable Diffusion benchmark where the fused kernel has different I/O layouts; compared with the same fusion pattern using a single layout, the current fusion kernel's perf may be 9x slower.
The unit test and the before/after-optimization HLO graphs can be accessed here.
We (NVIDIA) had an internal discussion about the issue; possible ways to resolve it include:
1. When deciding whether to fuse a Copy/Transpose instruction, check whether fusing would introduce too much uncoalesced access; if so, do not fuse it. Layout constraints also need to be added to the other fusion passes (FusionMerger, GpuMultiOutputFusion, GpuHorizontalInputFusion).
2. If we do fuse the Copy/Transpose operator, is it possible to use a tiled transpose algorithm when generating code for the kernel, similar to the approach used by IrEmitterUnnested::EmitUnnestedTranspose?
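The tiled-transpose idea in point (2) can be sketched as follows. This is a pure-Python model of the technique, not XLA's actual codegen: stage one TILE x TILE block in a small buffer (on a GPU, shared memory) so that both the reads and the writes are contiguous; the TILE value and the 6x8 shape are illustrative.

```python
# Sketch: blocked (tiled) transpose of a flat row-major matrix. Reads sweep
# rows of the source contiguously, writes sweep rows of the destination
# contiguously; only the small staging buffer is accessed with a stride.

TILE = 4  # a GPU kernel would typically use 32, one warp row per tile row

def tiled_transpose(src, rows, cols):
    """Transpose a rows x cols matrix stored as a flat row-major list."""
    dst = [0] * (rows * cols)
    for r0 in range(0, rows, TILE):
        for c0 in range(0, cols, TILE):
            tile = {}  # staging buffer; shared memory in a real GPU kernel
            for r in range(r0, min(r0 + TILE, rows)):
                for c in range(c0, min(c0 + TILE, cols)):
                    tile[(r, c)] = src[r * cols + c]   # contiguous reads
            for c in range(c0, min(c0 + TILE, cols)):
                for r in range(r0, min(r0 + TILE, rows)):
                    dst[c * rows + r] = tile[(r, c)]   # contiguous writes
    return dst

m = list(range(6 * 8))  # 6x8 row-major matrix
t = tiled_transpose(m, 6, 8)
assert all(t[c * 6 + r] == m[r * 8 + c] for r in range(6) for c in range(8))
```

Applied inside a fused kernel, this would let a fused Copy/Transpose keep both its input reads and output writes coalesced, at the cost of staging tiles in shared memory, rather than forcing one side of the access to be strided.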
We would like to initiate a discussion on possible fixes for this issue. Any ideas?