GPU fusion kernel perf drops when it has different layouts for the I/O tensors. #766
shawnwang18 started this conversation in General
Replies: 1 comment
-
I think that's the bug right there: with proper codegen we should never have uncoalesced reads inside the kernel. This is already covered by your point (2): all fused kTranspose instructions should go through transpose codegen, and if they don't, it's a bug. Could you post an example of the transpose instruction, after all optimizations, which does not?
-
We observed that the performance of an XLA fusion kernel may severely regress when the kernel uses different layouts for its input/output tensors. The issue seems to occur frequently when multiple layouts are used in the same XLA model.
In principle, we want reads and writes of the fusion kernel's I/O tensors to be coalesced, so that memory accesses benefit from cache locality. Coalescing is usually guaranteed when the fusion kernel's I/O layouts match: for example, in a kLoop fused kernel, neighboring GPU threads process output elements whose addresses are consecutive, and the input elements those threads fetch also come from consecutive memory addresses, provided both input and output tensors use the same memory layout.
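To make the coalescing argument concrete, here is a small sketch (not XLA code; the 64x64 shape and warp model are illustrative assumptions) that computes which input offsets a "warp" of adjacent threads touches in a kLoop-style fusion, for a matching layout versus a transposed one:

```python
# Sketch: model which element offsets a warp of adjacent GPU threads reads
# from an input tensor, when thread t handles the t-th output element in
# row-major output order. Shapes and warp size are illustrative.

def strides_row_major(shape):
    """Element strides for a row-major (last-dimension-minor) layout."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def warp_addresses(shape, input_strides, warp_size=32):
    """Linear input offsets read by threads 0..warp_size-1."""
    out_strides = strides_row_major(shape)
    addrs = []
    for t in range(warp_size):
        # Unflatten thread id t into a row-major multi-index of the output.
        idx, rem = [], t
        for s in out_strides:
            idx.append(rem // s)
            rem %= s
        # Offset of the same logical element under the *input* strides.
        addrs.append(sum(i * s for i, s in zip(idx, input_strides)))
    return addrs

shape = [64, 64]
same = warp_addresses(shape, strides_row_major(shape))  # input layout == output layout
transposed = warp_addresses(shape, [1, 64])             # input stored transposed

print(same[:4])        # [0, 1, 2, 3] -> consecutive, coalesced
print(transposed[:4])  # [0, 64, 128, 192] -> stride-64, uncoalesced
```

In the matching-layout case the warp reads one contiguous segment; in the transposed case each thread lands 64 elements apart, so each read becomes a separate memory transaction.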
However, we see that multiple XLA fusion optimization passes do not consider the impact of tensor layouts, which can end up generating a very large fusion kernel with very bad data locality. A simple example that generates different I/O tensor layouts is shown below:
The above graph shows the layouts after the GpuLayoutAssignment pass. Here the custom-call instruction requires mandatory layout {1,0,2,3}, while the Parameter instruction requires mandatory layout {3,2,1,0}. XLA introduces a copy operator to transform between the layouts, since the Add operator requires its operands to share the same layout. During the GpuInstructionFusion pass, the Add and Copy operators are fused, which creates a fusion kernel whose operand tensors have different layouts, and locality is bad when reading the Parameter instruction's output tensor inside the fused kernel.
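For readers less familiar with XLA layouts: a layout is a minor-to-major list of dimension numbers, and the two layouts above imply very different element strides. A small sketch (the rank-4 shape is a made-up example; the two layouts are the ones from the graph):

```python
# Sketch: map an XLA minor-to-major layout list to per-dimension element
# strides. The first entry of minor_to_major is the minor-most dimension.

def layout_to_strides(shape, minor_to_major):
    strides = [0] * len(shape)
    running = 1
    for dim in minor_to_major:
        strides[dim] = running
        running *= shape[dim]
    return strides

shape = [2, 3, 4, 5]
print(layout_to_strides(shape, [3, 2, 1, 0]))  # [60, 20, 5, 1] (row-major)
print(layout_to_strides(shape, [1, 0, 2, 3]))  # [3, 1, 6, 24]
```

Elements that are adjacent under one layout (stride 1 along dimension 3 in {3,2,1,0}) sit 24 elements apart under the other, which is why a layout-changing Copy is needed before the Add, and why reading across the mismatched layout inside one fused kernel is uncoalesced.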
What makes the perf even worse is that later fusion passes, such as FusionMerger, GpuMultiOutputFusion, and GpuHorizontalInputFusion, also do not consider any layout impact, which in some cases ends up generating a very large fusion kernel with different-layout I/O tensors.
We have a case in a Stable Diffusion benchmark where the fused kernel has different I/O layouts; compared with the same fusion pattern using a single layout, the current fusion kernel's perf may be 9x slower.
The unit test and the before/after-optimization HLO graphs can be accessed here.
We (NVIDIA) had an internal discussion about the issue; possible ways to resolve it include:
1. When deciding whether to fuse a Copy/Transpose instruction, check whether fusing would introduce too much uncoalesced access; if so, do not fuse it. Layout constraints also need to be added to the other fusion passes (FusionMerger, GpuMultiOutputFusion, GpuHorizontalInputFusion).
2. If we do fuse the Copy/Transpose operator, is it possible to use a tiled transpose algorithm when generating code for the kernel, similar to the approach used by IrEmitterUnnested::EmitUnnestedTranspose?
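The tiled-transpose idea in point (2) can be sketched as follows. This is a pure-Python model of the technique, not XLA's actual codegen: stage one TILE x TILE block in a small buffer (on a GPU, shared memory) so that both the reads and the writes are contiguous; the TILE value and the 6x8 shape are illustrative.

```python
# Sketch: blocked (tiled) transpose of a flat row-major matrix. Reads sweep
# rows of the source contiguously, writes sweep rows of the destination
# contiguously; only the small staging buffer is accessed with a stride.

TILE = 4  # a GPU kernel would typically use 32, one warp row per tile row

def tiled_transpose(src, rows, cols):
    """Transpose a rows x cols matrix stored as a flat row-major list."""
    dst = [0] * (rows * cols)
    for r0 in range(0, rows, TILE):
        for c0 in range(0, cols, TILE):
            tile = {}  # staging buffer; shared memory in a real GPU kernel
            for r in range(r0, min(r0 + TILE, rows)):
                for c in range(c0, min(c0 + TILE, cols)):
                    tile[(r, c)] = src[r * cols + c]   # contiguous reads
            for c in range(c0, min(c0 + TILE, cols)):
                for r in range(r0, min(r0 + TILE, rows)):
                    dst[c * rows + r] = tile[(r, c)]   # contiguous writes
    return dst

m = list(range(6 * 8))  # 6x8 row-major matrix
t = tiled_transpose(m, 6, 8)
assert all(t[c * 6 + r] == m[r * 8 + c] for r in range(6) for c in range(8))
```

Applied inside a fused kernel, this would let a fused Copy/Transpose keep both its input reads and output writes coalesced, at the cost of staging tiles in shared memory, rather than forcing one side of the access to be strided.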
We would like to initiate a discussion on possible fixes for this issue. Any ideas?