Significant perf drop when using dynamic ranges in GPU kernel #470

luraess · 2024-04-03T15:07:50Z

Running the CUDA benchmarks from the HPCBenchmarks.jl tests shows a significant performance drop when using KA with a dynamic range definition. The tests below were performed on a GH200 using a local CUDA 12.4 install and Julia 1.10.2.

Using a dynamic ndrange, as implemented in the benchmark https://github.com/PTsolvers/HPCBenchmarks.jl/blob/a5985aaaf931efb0caf194d669e3bfcb90c5c08e/CUDA/diffusion_3d.jl#L39:

diffusion_kernel_ka!(CUDABackend(), 256)($A_new, $A, $h; ndrange=($n, $n, $n))

returns a large perf drop (the KA kernel takes roughly 1.65x as long as plain CUDA.jl) compared to plain CUDA.jl and reference CUDA C:

[ Info: diffusion 3D
[ Info: N = 256
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(104.865 μs)
  "reference" => Trial(92.161 μs)
  "julia-ka" => Trial(173.473 μs)
[ Info: N = 512
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(771.301 μs)
  "reference" => Trial(672.581 μs)
  "julia-ka" => Trial(1.299 ms)
[ Info: N = 1024
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(6.251 ms)
  "reference" => Trial(5.833 ms)
  "julia-ka" => Trial(10.285 ms)

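For a sense of scale, the slowdown can be computed directly from the timings printed above. This is only arithmetic on the reported numbers; the variable names are illustrative.

```julia
# Timings in microseconds, copied from the output above (N = 256, 512, 1024).
julia_cuda = [104.865, 771.301, 6251.0]    # plain CUDA.jl ("julia")
julia_ka   = [173.473, 1299.0, 10285.0]    # KA with dynamic ndrange ("julia-ka")

# Runtime of the KA kernel relative to the plain CUDA.jl kernel.
slowdown = julia_ka ./ julia_cuda

for (n, s) in zip((256, 512, 1024), slowdown)
    println("N = $n: KA takes ", round(s; digits=2), "x the CUDA.jl time")
end
```

The ratio stays between roughly 1.65x and 1.68x across all three problem sizes, so the overhead scales with the amount of work rather than being a fixed launch cost.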
Modifying it to use a static range definition instead:

diffusion_kernel_ka!(CUDABackend(), 256, ($n, $n, $n))($A_new, $A, $h)

returns

[ Info: diffusion 3D
[ Info: N = 256
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(104.993 μs)
  "reference" => Trial(92.416 μs)
  "julia-ka" => Trial(103.649 μs)
[ Info: N = 512
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(770.790 μs)
  "reference" => Trial(672.037 μs)
  "julia-ka" => Trial(769.701 μs)
[ Info: N = 1024
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(6.250 ms)
  "reference" => Trial(5.873 ms)
  "julia-ka" => Trial(6.121 ms)

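For readers unfamiliar with the two launch forms, here is a minimal sketch on the CPU backend. It assumes KernelAbstractions.jl is installed; `scale_kernel!` is a made-up kernel, not the diffusion kernel from the benchmark. Passing `ndrange` at launch leaves it a runtime value, while passing it when instantiating the kernel makes it part of the kernel's type.

```julia
using KernelAbstractions

# Made-up example kernel: B = s * A, one work-item per element.
@kernel function scale_kernel!(B, A, s)
    I = @index(Global, Cartesian)
    B[I] = s * A[I]
end

backend = CPU()
A = rand(Float32, 16, 16, 16)
B_dyn  = similar(A)
B_stat = similar(A)

# Dynamic: only the workgroup size is fixed when the kernel is instantiated;
# the ndrange is supplied as a keyword at launch.
scale_kernel!(backend, 64)(B_dyn, A, 2f0; ndrange=size(A))

# Static: the ndrange is supplied together with the workgroup size,
# so it is known when the kernel is compiled.
scale_kernel!(backend, 64, size(A))(B_stat, A, 2f0)

KernelAbstractions.synchronize(backend)
```

Both launches compute the same result; only the information available to the compiler differs.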

vchuravy · 2024-04-03T16:26:17Z

Yeah this is due to KA allowing for arbitrary dimensions instead of just limiting the user to 3.

You end up in https://github.com/JuliaGPU/CUDA.jl/blob/7f725c0a117c2ba947015f48833630605501fb3a/src/CUDAKernels.jl#L178
and thereafter in

KernelAbstractions.jl/src/nditeration.jl, line 73 in c5fe83c:

@inline function expand(ndrange::NDRange{N}, groupidx::CartesianIndex{N}, idx::CartesianIndex{N}) where N

So if the ndrange is not known at compile time, the code here is not optimized away and we execute quite a few more integer operations, which is particularly costly for small kernels.
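
The effect can be reproduced on the CPU in plain Julia. The sketch below is an analogy, not KA's code: `lin2cart` is a hypothetical helper that converts a linear index to a Cartesian one, once with the extents passed as a runtime tuple and once with the extents lifted into the type domain via `Val`, so that the compiler sees them as constants.

```julia
# Extents only known at run time: the divisors are ordinary values.
lin2cart(dims::NTuple{3,Int}, i::Int) = Tuple(CartesianIndices(dims)[i])

# Extents encoded in the type: the divisors are compile-time constants.
lin2cart(::Val{dims}, i::Int) where {dims} = Tuple(CartesianIndices(dims)[i])

dims = (4, 1024, 1024)
@assert lin2cart(dims, 5000) == lin2cart(Val(dims), 5000)

# To compare the generated code in the REPL (needs InteractiveUtils):
#   @code_llvm lin2cart(dims, 5000)
#   @code_llvm lin2cart(Val(dims), 5000)
```

For power-of-two extents like these, the `Val` method is expected to compile to shifts and masks, while the runtime-tuple method has to keep generic divisions.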

One avenue I have been meaning to try, but never got around to, is to ensure that most of the index calculations occur using Int32.

vchuravy · 2024-04-03T16:28:59Z

Can you run CUDA.@device_code dir="out" for both kernels? The optimized .ll would be of particular interest.

luraess · 2024-04-04T06:42:47Z

Here are the outputs from the device_code for dynamic (dyn) and static (stat) expressions.

out_dyn.zip
out_stat.zip

vchuravy · 2024-04-04T18:58:32Z

There is a performance pitfall that I didn't expect...

KernelAbstractions.jl/src/nditeration.jl, line 83 in c5fe83c:

Base.@propagate_inbounds function expand(ndrange::NDRange, groupidx::Integer, idx::Integer)

; │┌ @ /srv/scratch/lraess/julia_depot/packages/KernelAbstractions/zPAn3/src/nditeration.jl:84 within `expand`
; ││┌ @ abstractarray.jl:1291 within `getindex`
; │││┌ @ abstractarray.jl:1336 within `_getindex`
; ││││┌ @ abstractarray.jl:1343 within `_to_subscript_indices`
; │││││┌ @ abstractarray.jl:1365 within `_unsafe_ind2sub`
; ││││││┌ @ abstractarray.jl:2962 within `_ind2sub` @ abstractarray.jl:3000
; │││││││┌ @ int.jl:86 within `-`
          %57 = zext i32 %56 to i64, !dbg !280
; │││││││└
; │││││││┌ @ abstractarray.jl:3013 within `_ind2sub_recurse`
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %58 = udiv i64 %57, %.fca.1.0.0.0.0.extract, !dbg !145
; ││││││││└└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse` @ abstractarray.jl:3013
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %59 = icmp sgt i64 %.fca.1.0.0.1.0.extract, 0, !dbg !281
            br i1 %59, label %pass11, label %fail10, !dbg !281

fail10:                                           ; preds = %pass
            call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception117 to i64)), !dbg !281
            call fastcc void @gpu_signal_exception({ i64, i32 } %state), !dbg !281
            call void @llvm.trap(), !dbg !281
            call void @llvm.trap(), !dbg !281
            call void asm sideeffect "exit;", ""(), !dbg !281
            unreachable, !dbg !281

pass11:

We have a call to div there, which checks for a zero divisor and throws an error if the check fails.
div on its own is bad enough, and I was trying to avoid those in the happy path...
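
The branch follows from Julia's semantics: integer `div` raises a `DivideError` for a zero divisor, so a divisor that is unknown at compile time has to be checked before dividing. A minimal CPU illustration (the helper name is made up):

```julia
# Returns true if integer division of `n` by `d` raises a DivideError.
function div_throws(n::Int, d::Int)
    try
        div(n, d)
        return false
    catch err
        return err isa DivideError
    end
end

@assert div_throws(10, 0)     # zero divisor: the error path
@assert !div_throws(10, 3)    # nonzero divisor: the happy path
```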

vchuravy · 2024-04-04T18:59:56Z

x-ref: JuliaGPU/GPUArrays.jl#520

vchuravy · 2024-04-04T19:04:48Z

In contrast, with a constant ndrange:

; │┌ @ /srv/scratch/lraess/julia_depot/packages/KernelAbstractions/zPAn3/src/nditeration.jl:84 within `expand`
; ││┌ @ abstractarray.jl:1291 within `getindex`
; │││┌ @ abstractarray.jl:1336 within `_getindex`
; ││││┌ @ abstractarray.jl:1343 within `_to_subscript_indices`
; │││││┌ @ abstractarray.jl:1365 within `_unsafe_ind2sub`
; ││││││┌ @ abstractarray.jl:2962 within `_ind2sub` @ abstractarray.jl:3000
; │││││││┌ @ int.jl:86 within `-`
          %5 = zext i32 %4 to i64, !dbg !71
; │││││││└
; │││││││┌ @ abstractarray.jl:3013 within `_ind2sub_recurse`
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %6 = lshr i64 %5, 2, !dbg !89
; ││││││││└└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse` @ abstractarray.jl:3013
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %7 = lshr i64 %5, 12, !dbg !95
; ││││││││└└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse` @ abstractarray.jl:3014
; ││││││││┌ @ int.jl:88 within `*`
           %.neg = mul nsw i64 %7, -1024, !dbg !99
; ││││││││└
; ││││││││┌ @ int.jl:86 within `-`
           %8 = add nsw i64 %.neg, %6, !dbg !102
; │││││││└└
; │││││││┌ @ int.jl:86 within `-`
          %9 = zext i32 %3 to i64, !dbg !71
; │││││││└
; │││││││┌ @ abstractarray.jl:3013 within `_ind2sub_recurse`
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %10 = lshr i64 %9, 8, !dbg !89
; ││││││││└└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse`
; ││││││││┌ @ int.jl:86 within `-`
           %11 = and i64 %9, 255, !dbg !103
; ││││││││└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse` @ abstractarray.jl:3014 @ abstractarray.jl:3008
; ││││││││┌ @ abstractarray.jl:3018 within `_lookup`
; │││││││││┌ @ int.jl:87 within `+`
            %12 = add nuw nsw i64 %10, 1, !dbg !104
; ││└└└└└└└└

The division is turned into an lshr.
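
The shift amounts in the IR are consistent with power-of-two extents: `lshr 2` divides by 4, `lshr 12` divides by 4096, and the `mul`/`add` pair reduces modulo 1024, while `lshr 8` with `and 255` splits by 256. The extents 4, 1024, and 256 are read off the constants in the IR, not taken from KA's source. The hypothetical helper below checks these identities for non-negative indices in plain Julia.

```julia
# Checks, for zero-based indices 0:n-1, that the shift/mask forms in the IR
# match division and remainder by the corresponding powers of two.
function shifts_match_division(n::Int)
    all(0:(n - 1)) do i0
        q4    = i0 >>> 2     # %6 = lshr %5, 2   (divide by 4)
        q4096 = i0 >>> 12    # %7 = lshr %5, 12  (divide by 4 * 1024)
        q4 == div(i0, 4) &&
            q4096 == div(i0, 4096) &&
            q4 - 1024 * q4096 == rem(q4, 1024) &&   # %8 = %6 + (-1024 * %7)
            (i0 >>> 8) == div(i0, 256) &&           # %10 = lshr %9, 8
            (i0 & 255) == rem(i0, 256)              # %11 = and %9, 255
    end
end

@assert shifts_match_division(4 * 1024 * 16)
```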

luraess · 2024-04-04T19:16:15Z

Should one apply more globally what was done for Metal there?

vchuravy · 2024-04-04T19:36:10Z

I am not sure right now.

- We could special-case 1D/2D/3D NDRanges.
- Maybe https://github.com/maleadt/StaticCartesian.jl would help, but in this case we don't have a static set of Cartesian indices.
- The core issue is that we are converting a linear index to a Cartesian index. Can we get around that without breaking KA tiling?
- (Low priority) Do the indexing math in 32-bit.
- Profile to see whether the issue is the udiv or the exception branch. (We could remove the exception branch.)
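
To make the special-casing and 32-bit ideas concrete, here is a rough sketch of a 3D-only conversion done entirely in `Int32`. `linear_to_cartesian3` is a hypothetical helper, not a KernelAbstractions function. It still performs two integer divisions (with their zero checks); it only fixes the dimensionality and narrows the operand width.

```julia
# Hypothetical helper: 1-based linear index -> 1-based (x, y, z), all in Int32.
@inline function linear_to_cartesian3(dims::NTuple{3,Int32}, lin::Int32)
    nx, ny, _ = dims
    i0 = lin - Int32(1)          # zero-based linear index
    q  = div(i0, nx)             # index with the x coordinate stripped off
    x  = i0 - q * nx
    z  = div(q, ny)
    y  = q - z * ny
    return (x + Int32(1), y + Int32(1), z + Int32(1))
end

# Cross-check against Base's CartesianIndices on a small, non-power-of-two grid.
dims = Int32.((5, 7, 3))
for lin in 1:prod(dims)
    @assert linear_to_cartesian3(dims, Int32(lin)) ==
            Tuple(CartesianIndices((5, 7, 3))[lin])
end
```

Whether this helps in practice would still need the profiling mentioned above, since the divisions themselves remain.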
