Significant perf drop when using dynamic ranges in GPU kernel #470

Open
luraess opened this issue Apr 3, 2024 · 8 comments

luraess commented Apr 3, 2024

Running the CUDA benchmarks from the HPCBenchmarks.jl tests shows a significant performance drop when using KA with a dynamic range definition. The tests below were performed on a GH200 using a local CUDA 12.4 install and Julia 1.10.2.

diffusion_kernel_ka!(CUDABackend(), 256)($A_new, $A, $h; ndrange=($n, $n, $n))

returns a nearly 50% perf drop compared to plain CUDA.jl and reference CUDA C:

[ Info: diffusion 3D
[ Info: N = 256
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(104.865 μs)
  "reference" => Trial(92.161 μs)
  "julia-ka" => Trial(173.473 μs)
[ Info: N = 512
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(771.301 μs)
  "reference" => Trial(672.581 μs)
  "julia-ka" => Trial(1.299 ms)
[ Info: N = 1024
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(6.251 ms)
  "reference" => Trial(5.833 ms)
  "julia-ka" => Trial(10.285 ms)

Modifying it to use a static range definition:

diffusion_kernel_ka!(CUDABackend(), 256, ($n, $n, $n))($A_new, $A, $h)

returns

[ Info: diffusion 3D
[ Info: N = 256
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(104.993 μs)
  "reference" => Trial(92.416 μs)
  "julia-ka" => Trial(103.649 μs)
[ Info: N = 512
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(770.790 μs)
  "reference" => Trial(672.037 μs)
  "julia-ka" => Trial(769.701 μs)
[ Info: N = 1024
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(6.250 ms)
  "reference" => Trial(5.873 ms)
  "julia-ka" => Trial(6.121 ms)
Member

vchuravy commented Apr 3, 2024

Yeah this is due to KA allowing for arbitrary dimensions instead of just limiting the user to 3.

You end up in https://github.com/JuliaGPU/CUDA.jl/blob/7f725c0a117c2ba947015f48833630605501fb3a/src/CUDAKernels.jl#L178
and thereafter in

@inline function expand(ndrange::NDRange{N}, groupidx::CartesianIndex{N}, idx::CartesianIndex{N}) where N

So if we don't know the ndrange at compile time, the code here won't be optimized away and we execute quite a few more integer operations, which is particularly costly for small kernels.

One avenue I have been meaning to try, but never got around to, is to ensure that most of the index calculations occur using Int32.
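
To make the point about an unknown ndrange concrete, here is a minimal sketch in plain Julia (no GPU or KernelAbstractions required). The names `expand_dynamic` and `expand_static` are hypothetical and only mimic the linear-to-Cartesian conversion; they are not the KA implementation. The assumption is that passing the extents through `Val` lets the compiler treat them as constants, much like a static ndrange does.

```julia
# Dynamic: the extents are ordinary runtime values, so the conversion from a
# linear index to a Cartesian index has to use real integer divisions.
expand_dynamic(dims::NTuple{3,Int}, i::Int) = CartesianIndices(dims)[i]

# Static: the extents live in the type domain (via Val), so every
# specialization of this method sees them as compile-time constants.
expand_static(::Val{dims}, i::Int) where {dims} = CartesianIndices(dims)[i]

dims = (4, 4, 4)

# Both variants compute the same index; only the generated code differs.
idx_dyn  = expand_dynamic(dims, 6)
idx_stat = expand_static(Val(dims), 6)
```

Comparing the two with `@code_llvm` (from InteractiveUtils) is one way to check whether the divisions are folded in the static variant on a given Julia version.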

Member

vchuravy commented Apr 3, 2024

Can you use CUDA.@device_code dir="out" for both kernels? In particular, the optimized .ll would be of interest.

Author

luraess commented Apr 4, 2024

Here are the outputs from the device_code for dynamic (dyn) and static (stat) expressions.

out_dyn.zip
out_stat.zip

Member

vchuravy commented Apr 4, 2024

There is a performance pitfall that I didn't expect...

Base.@propagate_inbounds function expand(ndrange::NDRange, groupidx::Integer, idx::Integer)

; │┌ @ /srv/scratch/lraess/julia_depot/packages/KernelAbstractions/zPAn3/src/nditeration.jl:84 within `expand`
; ││┌ @ abstractarray.jl:1291 within `getindex`
; │││┌ @ abstractarray.jl:1336 within `_getindex`
; ││││┌ @ abstractarray.jl:1343 within `_to_subscript_indices`
; │││││┌ @ abstractarray.jl:1365 within `_unsafe_ind2sub`
; ││││││┌ @ abstractarray.jl:2962 within `_ind2sub` @ abstractarray.jl:3000
; │││││││┌ @ int.jl:86 within `-`
          %57 = zext i32 %56 to i64, !dbg !280
; │││││││└
; │││││││┌ @ abstractarray.jl:3013 within `_ind2sub_recurse`
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %58 = udiv i64 %57, %.fca.1.0.0.0.0.extract, !dbg !145
; ││││││││└└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse` @ abstractarray.jl:3013
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %59 = icmp sgt i64 %.fca.1.0.0.1.0.extract, 0, !dbg !281
            br i1 %59, label %pass11, label %fail10, !dbg !281

fail10:                                           ; preds = %pass
            call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception117 to i64)), !dbg !281
            call fastcc void @gpu_signal_exception({ i64, i32 } %state), !dbg !281
            call void @llvm.trap(), !dbg !281
            call void @llvm.trap(), !dbg !281
            call void asm sideeffect "exit;", ""(), !dbg !281
            unreachable, !dbg !281

pass11:                    

We have a call to div there, which checks for a zero divisor and throws an error if the check fails.
div on its own is bad enough, and I was trying to avoid those in the happy path...
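
As a small illustration of where the exception branch comes from (plain Julia on the CPU, not the GPU code path itself): integer `div` in Julia is defined to throw a `DivideError` for a zero divisor, so when the divisor is not known at compile time a guard has to be emitted alongside the division.

```julia
# Integer division by zero is an error in Julia, so `div` with a runtime
# divisor needs a guard in front of the actual division instruction.
result = try
    div(1, 0)
catch err
    err
end

threw_divide_error = result isa DivideError
```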

Member

vchuravy commented Apr 4, 2024

x-ref: JuliaGPU/GPUArrays.jl#520

Member

vchuravy commented Apr 4, 2024

In contrast, with a constant ndrange:

; │┌ @ /srv/scratch/lraess/julia_depot/packages/KernelAbstractions/zPAn3/src/nditeration.jl:84 within `expand`
; ││┌ @ abstractarray.jl:1291 within `getindex`
; │││┌ @ abstractarray.jl:1336 within `_getindex`
; ││││┌ @ abstractarray.jl:1343 within `_to_subscript_indices`
; │││││┌ @ abstractarray.jl:1365 within `_unsafe_ind2sub`
; ││││││┌ @ abstractarray.jl:2962 within `_ind2sub` @ abstractarray.jl:3000
; │││││││┌ @ int.jl:86 within `-`
          %5 = zext i32 %4 to i64, !dbg !71
; │││││││└
; │││││││┌ @ abstractarray.jl:3013 within `_ind2sub_recurse`
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %6 = lshr i64 %5, 2, !dbg !89
; ││││││││└└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse` @ abstractarray.jl:3013
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %7 = lshr i64 %5, 12, !dbg !95
; ││││││││└└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse` @ abstractarray.jl:3014
; ││││││││┌ @ int.jl:88 within `*`
           %.neg = mul nsw i64 %7, -1024, !dbg !99
; ││││││││└
; ││││││││┌ @ int.jl:86 within `-`
           %8 = add nsw i64 %.neg, %6, !dbg !102
; │││││││└└
; │││││││┌ @ int.jl:86 within `-`
          %9 = zext i32 %3 to i64, !dbg !71
; │││││││└
; │││││││┌ @ abstractarray.jl:3013 within `_ind2sub_recurse`
; ││││││││┌ @ abstractarray.jl:3020 within `_div`
; │││││││││┌ @ int.jl:295 within `div`
            %10 = lshr i64 %9, 8, !dbg !89
; ││││││││└└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse`
; ││││││││┌ @ int.jl:86 within `-`
           %11 = and i64 %9, 255, !dbg !103
; ││││││││└
; ││││││││ @ abstractarray.jl:3014 within `_ind2sub_recurse` @ abstractarray.jl:3014 @ abstractarray.jl:3008
; ││││││││┌ @ abstractarray.jl:3018 within `_lookup`
; │││││││││┌ @ int.jl:87 within `+`
            %12 = add nuw nsw i64 %10, 1, !dbg !104
; ││└└└└└└└└

The division is turned into an lshr.
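
The equivalence behind that rewrite can be checked in plain Julia. This is only a sketch of the arithmetic identity for non-negative values and power-of-two divisors (the constants 4 and 256 mirror the shifts by 2 and 8 and the mask 255 in the IR above); it says nothing about what the compiler does for other extents.

```julia
# For non-negative x and a power-of-two divisor 2^k:
#   div(x, 2^k) == x >>> k        (logical shift right, i.e. lshr)
#   rem(x, 2^k) == x & (2^k - 1)  (bit mask)
shift_matches = all(0:10_000) do x
    div(x, 4)   == x >>> 2 &&
    div(x, 256) == x >>> 8 &&
    rem(x, 256) == x & 255
end
```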

Author

luraess commented Apr 4, 2024

Should what was done for Metal there be applied more globally?

Member

vchuravy commented Apr 4, 2024

I am not sure right now.

  1. We could special-case 1D/2D/3D NDRanges.
  2. Maybe https://github.com/maleadt/StaticCartesian.jl would help, but in this case we don't have a static set of Cartesian indices.
  3. The core issue is that we are converting a linear index to a Cartesian one. Can we get around that without breaking KA tiling?
  4. (Low-priority) Do the indexing math in 32-bit.
  5. Profile to see whether the issue is the udiv or the exception branch. (We could remove the exception branch.)
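
A rough sketch of what options 1 and 4 might look like together, in plain Julia: a hand-written 3D conversion that keeps all arithmetic in Int32. The function `ind2sub3` is hypothetical and is not part of KernelAbstractions; whether this is actually faster on a GPU would need the profiling mentioned in option 5.

```julia
# Hypothetical 3D special case: convert a 1-based linear index into a 1-based
# (x, y, z) triple using only Int32 arithmetic. The extent in z is not needed
# for the conversion itself.
function ind2sub3(nx::Int32, ny::Int32, i::Int32)
    i0 = i - Int32(1)      # switch to 0-based indexing
    x  = rem(i0, nx)       # fastest-varying dimension
    t  = div(i0, nx)
    y  = rem(t, ny)
    z  = div(t, ny)        # slowest-varying dimension
    return (x + Int32(1), y + Int32(1), z + Int32(1))
end

# Compare against Base's CartesianIndices for a small, non-cubic range.
ref = CartesianIndices((4, 5, 3))
agrees = all(1:length(ref)) do i
    ind2sub3(Int32(4), Int32(5), Int32(i)) == Tuple(ref[i])
end
```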
