
CPU __thread_run could loop over CartesianIndices? #448

Open · rafaqz opened this issue Dec 29, 2023 · 7 comments

rafaqz commented Dec 29, 2023

I noticed in Stencils.jl that when I'm using a fast stencil (e.g. a 3x3 window summed over a Matrix{Bool}), the indexing in __thread_run takes longer than actually reading and summing the stencil!

It seems to be because the conversion from linear back to Cartesian indices is pretty slow: I'm getting 4 ns for N=2, 7 ns for N=3, and 11 ns for N=4 on my laptop, so there is also a penalty for adding dimensions.
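For reference, a minimal sketch of how that conversion cost could be measured (illustrative only, not the exact benchmark from Stencils.jl; ci2/ci4 are made-up names and the timings vary by machine):

using BenchmarkTools
# Cost of one linear -> Cartesian conversion: one integer div per dimension.
ci2 = CartesianIndices((64, 64))
ci4 = CartesianIndices((8, 8, 8, 8))
i = 123
@btime $ci2[$i]   # N=2: a few ns
@btime $ci4[$i]   # N=4: slower, more divs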

Could we switch the loop to iterating over CartesianIndices directly?

I guess it would make dividing up the array a little messier, and it might be slower for really large workloads where an even split of tasks matters more than 7 ns per operation. It could have a keyword argument to choose between behaviours.
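To make the suggestion concrete, here is a sketch of the two iteration styles (illustrative only, not the actual __thread_run code; sum_linear/sum_cartesian are hypothetical names):

# Illustrative comparison, not the actual __thread_run implementation.
function sum_linear(A)
    s = zero(eltype(A))
    inds = CartesianIndices(A)
    for i in eachindex(A)
        s += A[inds[i]]           # pays a linear -> Cartesian div each iteration
    end
    s
end

function sum_cartesian(A)
    s = zero(eltype(A))
    for I in CartesianIndices(A)  # counters are just incremented, no div
        s += A[I]
    end
    s
end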

vchuravy (Member) commented
Might be interesting; I haven't looked into the execution there too closely.

Do you have a benchmark?

rafaqz (Author) commented Dec 29, 2023

Just some Stencils.jl profiles on another machine.

But I can write up a PR and we can benchmark it.

vchuravy (Member) commented
If you can contribute it here: https://github.com/JuliaGPU/KernelAbstractions.jl/tree/main/benchmark that would be nice!

rafaqz (Author) commented Dec 29, 2023

Seems it's because my workgroup size was 4 - I guess you're expecting much larger workgroups on the CPU?

I never totally got my head around what workgroup size means on the CPU when the work is divvied up before the workgroup anyway. I was guessing the workgroup size didn't make much difference, but this is a case where it does (very small workloads).
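To make that concrete, here's a toy model of how I picture the CPU dispatch (an assumption for illustration, not the actual KernelAbstractions internals; toy_run is a made-up name):

# Toy model: the ndrange is split across tasks up front, so the workgroup
# size only shapes each task's inner loop.
function toy_run(f, n, groupsize)
    nt = Threads.nthreads()
    chunk = cld(n, nt)
    Threads.@threads for t in 1:nt
        lo = (t - 1) * chunk + 1
        hi = min(t * chunk, n)
        for base in lo:groupsize:hi
            for i in base:min(base + groupsize - 1, hi)  # one "workgroup"
                f(i)
            end
        end
    end
end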

rafaqz (Author) commented Dec 29, 2023

I guess it's kind of academic if you can get around it with large workgroups. But comparing workgroup sizes of 1 and 64:

using KernelAbstractions
# Presumed kernel definition (not shown in the original comment):
@kernel function copy_kernel!(A, @Const(B))
    I = @index(Global, Cartesian)
    @inbounds A[I] = B[I]
end
kernel1! = copy_kernel!(CPU(), 1)
kernel64! = copy_kernel!(CPU(), 64)
A = rand(16, 16, 16, 16)
B = rand(16, 16, 16, 16)

Benchmarks (via BenchmarkTools):

julia> @btime kernel1!(A, B; ndrange=size(A))
  1.799 ms (99 allocations: 6.80 KiB)

julia> @btime kernel64!(A, B; ndrange=size(A))
  439.169 μs (99 allocations: 6.80 KiB)

And you can see the difference in the profile for 1 vs 64 (left vs right) is all integer division from the linear-to-Cartesian conversion.

[Screenshot: profile flame graphs for workgroup sizes 1 (left) and 64 (right)]

using ProfileView
@profview for i in 1:100 kernel1!(A, B; ndrange=size(A)) end
@profview for i in 1:100 kernel64!(A, B; ndrange=size(A)) end

vchuravy (Member) commented
Yeah, for the CPU I often use a workgroup size of 1024.

rafaqz (Author) commented Jan 15, 2024

I've been wondering if the CPU workgroup size should mean "how much we unroll".
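A rough sketch of that reading (hypothetical; foreach_unrolled is just an illustrative name, with the workgroup size W as a compile-time inner trip count):

# Hypothetical: workgroup size W as a fixed inner trip count the compiler
# can unroll / SIMD-vectorize.
function foreach_unrolled(f, n, ::Val{W}) where {W}
    i = 1
    while i + W - 1 <= n
        for j in 0:W-1      # fixed trip count: unrollable
            f(i + j)
        end
        i += W
    end
    while i <= n            # remainder loop
        f(i)
        i += 1
    end
end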
