
Looping over high-dimensional arrays #438

Open
b-fg opened this issue Nov 16, 2023 · 7 comments

b-fg commented Nov 16, 2023

Together with @weymouth we are trying to create a kernel that loops over an n-dimensional array and applies a function to each element. While we can certainly do this, the speedup we observe when comparing the @kernel ("KA") and non-@kernel ("serial") implementations varies a lot depending on which array slice we access. This is probably related to Julia arrays being column-major, but the difference is striking here, and KA does not perform as well as the serial version.
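
As a quick illustration of the layout point (my own check, not part of the original report; `u_check` is just a scratch array with the same shape as the MWE below):

u_check = zeros(Float32, 256, 256, 256, 3)
strides(u_check)   # (1, 256, 65536, 16777216): the first index is the contiguous one
# So a slice that fixes the first dimension, like (1:1, 1:N, 1:N), steps through
# memory 256 Float32s at a time instead of element by element.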

Here is a simple MWE that demonstrates this. It has been run with julia -t 1 to force a single thread and allow a fair comparison between the KA and serial implementations. A GPU test is also included for comparison, and the same issue shows up there.

using CUDA
using BenchmarkTools

N=256
u_cu = CUDA.zeros(N,N,N,3);
u = Array(u_cu);
R1 = CartesianIndices((1:1,1:N,1:N))
R2 = CartesianIndices((1:N,1:1,1:N))
R3 = CartesianIndices((1:N,1:N,1:1))

test_serial(a,i,R) = for I in R
    a[I,i] = log1p(Float32(prod(I.I)))
end
using KernelAbstractions
@kernel function kern(a,@Const(i))
    I = @index(Global,Cartesian)
    a[I,i] = log1p(Float32(prod(I.I)))
end
test_kernel(a,i,R) = kern(get_backend(a),64)(a,i,ndrange=size(R))

@btime test_serial($u,1,$R1)
@btime test_serial($u,2,$R2)
@btime test_serial($u,3,$R3)
@btime test_kernel($u,1,$R1)
@btime test_kernel($u,2,$R2)
@btime test_kernel($u,3,$R3)
@btime CUDA.@sync test_kernel($u_cu,1,$R1)
@btime CUDA.@sync test_kernel($u_cu,2,$R2)
@btime CUDA.@sync test_kernel($u_cu,3,$R3)

The timings are:

950.046 μs (0 allocations: 0 bytes) # serial R1
928.321 μs (0 allocations: 0 bytes) # serial R2
998.689 μs (0 allocations: 0 bytes) # serial R3
5.500 ms (10 allocations: 256 bytes) # KA CPU R1
1.045 ms (10 allocations: 256 bytes) # KA CPU R2
991.688 μs (10 allocations: 256 bytes) # KA CPU R3
217.867 μs (42 allocations: 2.23 KiB) # KA GPU R1
13.077 μs (42 allocations: 2.23 KiB) # KA GPU R2
14.328 μs (42 allocations: 2.23 KiB) # KA GPU R3

Is there something wrong with the MWE? Could this be done differently? It would be nice to understand what is going on here. Thanks!

@vchuravy
Member

@kernel function kern()
    I = @index(Global,Cartesian)
    @show I
end
kern(CPU(), 64)(ndrange=(2, 3, 4))
I = CartesianIndex(1, 1, 1)
I = CartesianIndex(2, 1, 1)
I = CartesianIndex(1, 2, 1)
I = CartesianIndex(2, 2, 1)
I = CartesianIndex(1, 3, 1)
I = CartesianIndex(2, 3, 1)
I = CartesianIndex(1, 1, 2)
I = CartesianIndex(2, 1, 2)
I = CartesianIndex(1, 2, 2)
I = CartesianIndex(2, 2, 2)
I = CartesianIndex(1, 3, 2)
I = CartesianIndex(2, 3, 2)
I = CartesianIndex(1, 1, 3)
I = CartesianIndex(2, 1, 3)
I = CartesianIndex(1, 2, 3)
I = CartesianIndex(2, 2, 3)
I = CartesianIndex(1, 3, 3)
I = CartesianIndex(2, 3, 3)
I = CartesianIndex(1, 1, 4)
I = CartesianIndex(2, 1, 4)
I = CartesianIndex(1, 2, 4)
I = CartesianIndex(2, 2, 4)
I = CartesianIndex(1, 3, 4)
I = CartesianIndex(2, 3, 4)

Seems to follow the same iteration order as:

for i in CartesianIndices((2,3,4))
    @show i
end
i = CartesianIndex(1, 1, 1)
i = CartesianIndex(2, 1, 1)
i = CartesianIndex(1, 2, 1)
i = CartesianIndex(2, 2, 1)
i = CartesianIndex(1, 3, 1)
i = CartesianIndex(2, 3, 1)
i = CartesianIndex(1, 1, 2)
i = CartesianIndex(2, 1, 2)
i = CartesianIndex(1, 2, 2)
i = CartesianIndex(2, 2, 2)
i = CartesianIndex(1, 3, 2)
i = CartesianIndex(2, 3, 2)
i = CartesianIndex(1, 1, 3)
i = CartesianIndex(2, 1, 3)
i = CartesianIndex(1, 2, 3)
i = CartesianIndex(2, 2, 3)
i = CartesianIndex(1, 3, 3)
i = CartesianIndex(2, 3, 3)
i = CartesianIndex(1, 1, 4)
i = CartesianIndex(2, 1, 4)
i = CartesianIndex(1, 2, 4)
i = CartesianIndex(2, 2, 4)
i = CartesianIndex(1, 3, 4)
i = CartesianIndex(2, 3, 4)

@vchuravy
Member

One thing to note is that your kern(CPU(), 64) is equivalent to a workgroup size of (64, 1, 1).

So I am not surprised that R1 = CartesianIndices((1:1,1:N,1:N)) is slow. For that I would expect you to need (1, 64, 1).
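
For concreteness, a minimal sketch of what that looks like with the kern(a, i) kernel and the ranges from the MWE above (my own illustration, not benchmarked; test_kernel_wgs is just a hypothetical helper name):

# Pass a tuple workgroup size whose large dimension lines up with the
# dimensions the ndrange actually spans.
test_kernel_wgs(a, i, R, wgs) = kern(get_backend(a), wgs)(a, i, ndrange=size(R))

test_kernel_wgs(u, 1, R1, (1, 64, 1))   # R1 spans dims 2 and 3
test_kernel_wgs(u, 2, R2, (64, 1, 1))   # R2 spans dims 1 and 3
test_kernel_wgs(u, 3, R3, (64, 1, 1))   # R3 spans dims 1 and 2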

@weymouth

Say what now? This isn't split up automatically?

@vchuravy
Member

Well, I thought I had documented that clearly, but I can't seem to find it...

Take a look at:
https://juliagpu.github.io/KernelAbstractions.jl/stable/examples/performance/

The workgroupsize is also a tuple, in which you provide the dimensions of the workgroup.

@weymouth

Ok, that explains everything.
We'll need to redo our macro to make the workgroup size adaptive. While I'm at it, is there a way to predict the best workgroup size to use?

@b-fg
Author

b-fg commented Nov 16, 2023

Understood @vchuravy! Thanks for clarifying.

@b-fg
Author

b-fg commented Jan 15, 2024

A bit more on this: it looks like if we try to evaluate the required workgroupsize at runtime using

workgroupsize = ntuple(j->j==argmax(size(R)) ? 64 : 1, length(size(R)))

and then pass it as the kernel argument, the resulting kernels are much slower. On the other hand, hardcoding it to 64 or (64,1,1) results in a much faster computation. Is there a specific reason why this might be happening?
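
For what it's worth (my own check, not part of the original comment; wgs is just a hypothetical helper name), evaluating that formula directly on the three ranges from the MWE gives the shapes one would expect:

wgs(R) = ntuple(j -> j == argmax(size(R)) ? 64 : 1, length(size(R)))

wgs(R1)  # (1, 64, 1)   argmax of (1, 256, 256) is 2
wgs(R2)  # (64, 1, 1)   argmax of (256, 1, 256) is 1
wgs(R3)  # (64, 1, 1)   argmax of (256, 256, 1) is 1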

Edit:

Actually, I have seen that the macro that generates the kernel sometimes fails to produce the expected workgroupsize. For example, when size(R)=(258,258,1) it results in (1,1,1) (it should be (64,64,1)), which is why it is slow. So I believe this is not a KA problem, but rather the way we are generating the kernels in our macro... cc @weymouth.
