
Looping over high-dimensional arrays #438

Open
b-fg opened this issue Nov 16, 2023 · 7 comments

b-fg commented Nov 16, 2023

Together with @weymouth we are trying to create a kernel that loops over an n-dimensional array and applies a function to each element. While we can certainly do this, the speedup we observe when comparing the @kernel ("KA") and non-@kernel ("serial") implementations varies a lot depending on which array slice we access. This is probably related to Julia arrays being column-major, but the difference is striking here, and KA does not perform as well as the serial version.
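
As a quick illustration of the layout point (my own check, not part of the original report; `u_check` is just a scratch array with the same shape as the MWE below):

u_check = zeros(Float32, 256, 256, 256, 3)
strides(u_check)   # (1, 256, 65536, 16777216): the first index is the contiguous one
# So a slice that fixes the first dimension, like (1:1, 1:N, 1:N), steps through
# memory 256 Float32s at a time instead of element by element.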

Here is a simple MWE that demonstrates this. It has been run with julia -t 1 to force a single thread and allow a fair comparison between the KA and serial implementations. A GPU test is also included for comparison, and the same issue shows up there.

using CUDA
using BenchmarkTools

N=256
u_cu = CUDA.zeros(N,N,N,3);
u = Array(u_cu);
R1 = CartesianIndices((1:1,1:N,1:N))
R2 = CartesianIndices((1:N,1:1,1:N))
R3 = CartesianIndices((1:N,1:N,1:1))

test_serial(a,i,R) = for I in R
    a[I,i] = log1p(Float32(prod(I.I)))
end
using KernelAbstractions
@kernel function kern(a,@Const(i))
    I = @index(Global,Cartesian)
    a[I,i] = log1p(Float32(prod(I.I)))
end
test_kernel(a,i,R) = kern(get_backend(a),64)(a,i,ndrange=size(R))

@btime test_serial($u,1,$R1)
@btime test_serial($u,2,$R2)
@btime test_serial($u,3,$R3)
@btime test_kernel($u,1,$R1)
@btime test_kernel($u,2,$R2)
@btime test_kernel($u,3,$R3)
@btime CUDA.@sync test_kernel($u_cu,1,$R1)
@btime CUDA.@sync test_kernel($u_cu,2,$R2)
@btime CUDA.@sync test_kernel($u_cu,3,$R3)

The timings are:

950.046 μs (0 allocations: 0 bytes) # serial R1
928.321 μs (0 allocations: 0 bytes) # serial R2
998.689 μs (0 allocations: 0 bytes) # serial R3
5.500 ms (10 allocations: 256 bytes) # KA CPU R1
1.045 ms (10 allocations: 256 bytes) # KA CPU R2
991.688 μs (10 allocations: 256 bytes) # KA CPU R3
217.867 μs (42 allocations: 2.23 KiB) # KA GPU R1
13.077 μs (42 allocations: 2.23 KiB) # KA GPU R2
14.328 μs (42 allocations: 2.23 KiB) # KA GPU R3

Is there something wrong with the MWE? Could this be done differently? It would be nice to understand what is going on here. Thanks!

@vchuravy
Member

@kernel function kern()
    I = @index(Global,Cartesian)
    @show I
end
kern(CPU(), 64)(ndrange=(2, 3, 4))
I = CartesianIndex(1, 1, 1)
I = CartesianIndex(2, 1, 1)
I = CartesianIndex(1, 2, 1)
I = CartesianIndex(2, 2, 1)
I = CartesianIndex(1, 3, 1)
I = CartesianIndex(2, 3, 1)
I = CartesianIndex(1, 1, 2)
I = CartesianIndex(2, 1, 2)
I = CartesianIndex(1, 2, 2)
I = CartesianIndex(2, 2, 2)
I = CartesianIndex(1, 3, 2)
I = CartesianIndex(2, 3, 2)
I = CartesianIndex(1, 1, 3)
I = CartesianIndex(2, 1, 3)
I = CartesianIndex(1, 2, 3)
I = CartesianIndex(2, 2, 3)
I = CartesianIndex(1, 3, 3)
I = CartesianIndex(2, 3, 3)
I = CartesianIndex(1, 1, 4)
I = CartesianIndex(2, 1, 4)
I = CartesianIndex(1, 2, 4)
I = CartesianIndex(2, 2, 4)
I = CartesianIndex(1, 3, 4)
I = CartesianIndex(2, 3, 4)

Seems to follow the same iteration order as:

for i in CartesianIndices((2,3,4))
    @show i
end
i = CartesianIndex(1, 1, 1)
i = CartesianIndex(2, 1, 1)
i = CartesianIndex(1, 2, 1)
i = CartesianIndex(2, 2, 1)
i = CartesianIndex(1, 3, 1)
i = CartesianIndex(2, 3, 1)
i = CartesianIndex(1, 1, 2)
i = CartesianIndex(2, 1, 2)
i = CartesianIndex(1, 2, 2)
i = CartesianIndex(2, 2, 2)
i = CartesianIndex(1, 3, 2)
i = CartesianIndex(2, 3, 2)
i = CartesianIndex(1, 1, 3)
i = CartesianIndex(2, 1, 3)
i = CartesianIndex(1, 2, 3)
i = CartesianIndex(2, 2, 3)
i = CartesianIndex(1, 3, 3)
i = CartesianIndex(2, 3, 3)
i = CartesianIndex(1, 1, 4)
i = CartesianIndex(2, 1, 4)
i = CartesianIndex(1, 2, 4)
i = CartesianIndex(2, 2, 4)
i = CartesianIndex(1, 3, 4)
i = CartesianIndex(2, 3, 4)

@vchuravy
Member

One thing to note is that your kern(CPU(), 64) is equivalent to a workgroup size of (64, 1, 1).

So I am not surprised that R1 = CartesianIndices((1:1,1:N,1:N)) is slow. For that I would expect you to need (1, 64, 1).
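
For concreteness, a minimal sketch of what that looks like with the kern(a, i) kernel and the ranges from the MWE above (my own illustration, not benchmarked; test_kernel_wgs is just a hypothetical helper name):

# Pass a tuple workgroup size whose large dimension lines up with the
# dimensions the ndrange actually spans.
test_kernel_wgs(a, i, R, wgs) = kern(get_backend(a), wgs)(a, i, ndrange=size(R))

test_kernel_wgs(u, 1, R1, (1, 64, 1))   # R1 spans dims 2 and 3
test_kernel_wgs(u, 2, R2, (64, 1, 1))   # R2 spans dims 1 and 3
test_kernel_wgs(u, 3, R3, (64, 1, 1))   # R3 spans dims 1 and 2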

@weymouth

Say what now? This isn't split up automatically?

@vchuravy
Member

Well, I thought I had documented that clearly, but I can't seem to find it...

Take a look at:
https://juliagpu.github.io/KernelAbstractions.jl/stable/examples/performance/

The workgroupsize is also a tuple, in which you provide the dimensions of the workgroup.

@weymouth

Ok, that explains everything.
We'll need to redo our macro to make the workgroup size adaptive. While I'm at it, is there a way to predict the best workgroup size to use?

@b-fg
Author

b-fg commented Nov 16, 2023

Understood @vchuravy! Thanks for clarifying.

@b-fg
Author

b-fg commented Jan 15, 2024

A bit more on this: it looks like if we try to evaluate the required workgroupsize at runtime using

workgroupsize = ntuple(j->j==argmax(size(R)) ? 64 : 1, length(size(R)))

and then pass it as the kernel argument, the resulting kernels are much slower. On the other hand, hardcoding it to 64 or (64,1,1) results in a much faster computation. Is there a specific reason why this might be happening?
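
For what it's worth (my own check, not part of the original comment; wgs is just a hypothetical helper name), evaluating that formula directly on the three ranges from the MWE gives the shapes one would expect:

wgs(R) = ntuple(j -> j == argmax(size(R)) ? 64 : 1, length(size(R)))

wgs(R1)  # (1, 64, 1)   argmax of (1, 256, 256) is 2
wgs(R2)  # (64, 1, 1)   argmax of (256, 1, 256) is 1
wgs(R3)  # (64, 1, 1)   argmax of (256, 256, 1) is 1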

Edit:

Actually, I have seen that the macro that generates the kernel sometimes fails to produce the expected workgroupsize. For example, when size(R)=(258,258,1) it results in (1,1,1) (it should be (64,64,1)), which is why it is slow. So I believe this is not a KA problem, but rather the way we are generating the kernels in our macro... cc @weymouth.
