KernelAbstractions support #147

Draft: wants to merge 9 commits into master
Conversation

@leios (Contributor) commented Sep 7, 2023

The KernelAbstractions branch now compiles, so I thought I would put forward a quick draft PR while I figure out all the runtime bugs.

Notes:

  1. This builds off of "adding preliminary AMDGPU support" (#99) and should replace it entirely.
  2. CUDA has been removed and replaced with KernelAbstractions and GPUArrays. As an important note here, GPUArrays is not strictly necessary except to replicate the behavior of the boolean GPU flag (i.e. isa(a, AbstractGPUArray)); a small sketch follows this list.
  3. If this is merged, other GPU types (Metal, AMD, Intel) will also be supported, but I can only test on AMD (and maybe Metal if I can get someone to try it on a Mac).
  4. I need to add in the changes from "Non-atomic pairwise force summation kernels" (#133). If there is something we are missing on the KernelAbstractions side, I can try to add it in, but I think we are good to go.
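
On point 2, a minimal sketch of the idea (hypothetical helper names, not Molly's actual internals; assumes a KernelAbstractions version that provides get_backend):

```julia
using GPUArrays, KernelAbstractions

# The old boolean GPU flag becomes a check on the array type ...
on_gpu(a) = a isa AbstractGPUArray

# ... and the KA backend is recovered from the array itself, so the same code
# path works for Array, CuArray, ROCArray, etc. (helper names are hypothetical).
array_backend(a) = KernelAbstractions.get_backend(a)
```

With something like this, the only GPU-specific choice left at the call sites is which backend the arrays live on.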

@leios (Contributor, Author) commented Sep 7, 2023

Ah, while I'm here, I'll briefly explain the syntactic differences from CUDA (a short sketch follows the list):

  1. Indexing is easier: @index(Global / Group / Local, Linear / NTuple / CartesianIndex) vs (blockIdx().x - 1) * blockDim().x + threadIdx().x for CUDA.
  2. Kernels run off of an ndrange for the range of elements (OpenCL-inspired syntax).
  3. Launching kernels requires configuring them with a backend, see: https://github.com/leios/Molly.jl/blob/KA_support/src/kernels.jl#L21
  4. Certain functions now execute on the backend: CUDA.zeros(...) -> zeros(backend, args...).
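
To make the list above concrete, here is a minimal sketch of writing and launching a KA kernel (a generic toy example rather than one of Molly's kernels; the workgroup size of 64 is an arbitrary choice):

```julia
using KernelAbstractions

# A trivial SAXPY-style kernel: @index replaces CUDA.jl's manual
# (blockIdx().x - 1) * blockDim().x + threadIdx().x arithmetic.
@kernel function scale_add!(y, @Const(x), a)
    i = @index(Global, Linear)
    y[i] = a * x[i] + y[i]
end

backend = CPU()                                   # or CUDABackend(), ROCBackend(), ...
n = 1024
x = KernelAbstractions.ones(backend, Float32, n)  # CUDA.zeros(...) -> zeros(backend, args...)
y = KernelAbstractions.zeros(backend, Float32, n)

kernel! = scale_add!(backend, 64)                 # configure the kernel for a backend
kernel!(y, x, 2.0f0; ndrange = n)                 # ndrange gives the range of elements
KernelAbstractions.synchronize(backend)           # launches are asynchronous
```

Note that scale_add!(backend, 64) only specializes the kernel for a backend and workgroup size; nothing runs until the returned callable is invoked with an ndrange.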

The tricky thing about this PR was removing the CUDA dependency outside of the kernels. There is still one call in zygote.jl I gotta figure out: https://github.com/leios/Molly.jl/blob/KA_support/src/zygote.jl#L698

@jgreener64 (Collaborator)

Great work so far. Making the code compatible with generic array types is a nice change, and having the kernels work on different devices would be a selling point of the package.

I would be interested to see the performance of the kernels compared to the CUDA versions. Also whether it plays nicely with Enzyme. Good luck with the runtime errors.

@leios (Contributor, Author) commented Sep 8, 2023

I think I can finish this up today or early next week (emphasis on think), but to quickly answer the questions:

  1. KA essentially just writes vendor-specific code (i.e. CUDA) from the generic code input, so if we don't have identical performance to CUDA, then that's a bug. I'll do performance testing similar to that in "Non-atomic pairwise force summation kernels" (#133) once the code is cleaned up.
  2. Enzyme should also not be a problem; however, there are some reports of issues with error handling: "Enzyme + KA stalls on error instead of reporting it" (EnzymeAD/Enzyme.jl#365).

@jgreener64 (Collaborator)

Great. Not urgent, but how well does KernelAbstractions.jl deal with warp-level code, e.g. warpsize() and sync_warp()?

@leios (Contributor, Author) commented Sep 8, 2023

That's a good question. We can probably expose the APIs available from CUDA, but I am not sure how AMDGPU deals with these. We would also need to figure out what they correspond to on the parallel CPU backend.

I think these are the tools we need: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/warp_ops/index.html
So they are available; it's just a matter of exposing them in KA and figuring out what they correspond to on different backends.

Ah, as an important note (that I somehow failed to mention before), an advantage of KA is that it also provides a parallel CPU implementation, so the kernels can be written once and executed everywhere (a rough sketch follows). I didn't do that in this PR because that brings up design questions related to Molly internals.
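
As a rough sketch of the write-once idea (a toy kernel, not one of Molly's; the AMD lines are commented out and assume AMDGPU.jl provides ROCBackend and ROCArray):

```julia
using KernelAbstractions

@kernel function accumulate_forces!(forces, @Const(coords))
    i = @index(Global, Linear)
    # Toy per-element update; a real force kernel would loop over neighbours here.
    forces[i] += coords[i]
end

# On the CPU the same kernel body runs as a (multithreaded) loop over workgroups.
backend = CPU()
coords = rand(Float32, 1000)
forces = zeros(Float32, 1000)
accumulate_forces!(backend, 64)(forces, coords; ndrange = length(forces))
KernelAbstractions.synchronize(backend)

# On an AMD GPU only the backend and array types would change, e.g.
#   using AMDGPU
#   backend = ROCBackend()
#   coords, forces = ROCArray(coords), ROCArray(forces)
```

The CPU backend splits the ndrange into workgroups and runs them on Julia threads, so Julia needs to be started with threads to actually see CPU parallelism.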

@jgreener64 (Collaborator)

> I didn't do that in this PR because that brings up design questions related to Molly internals.

Yeah we can discuss that after this PR. I would be okay with switching if there was no performance hit.

Judging from the discussion on the linked PR, there is currently no warp support in KA. It may be necessary to leave that CUDA kernel in and have a separate KA kernel for other backends until warp support comes to KA.

@leios (Contributor, Author) commented Sep 9, 2023

Ok, so a couple of quick notes here:

  1. There are a few host calls that are not yet supported by AMDGPU (such as findall). My understanding was that such calls would eventually be ported to GPUArrays, but I don't think that has happened yet. Note that some of the stalling here is because we are waiting to get KA into GPUArrays ("Transition GPUArrays to KernelAbstractions", JuliaGPU/GPUArrays.jl#451). At least for findall, the kernel is not that complex: https://github.com/JuliaGPU/CUDA.jl/blob/master/src/indexing.jl#L23, so we could put it into AMDGPU or something for now; however, we are stuck on an older version of AMDGPU due to some package conflicts. The quick fix would be to do it the ol' fashioned way and just stick the necessary kernels in Molly under a file like kernel_hacks.jl or something. Such issues were also what stalled "adding preliminary AMDGPU support" (#99).
  2. "Non-atomic pairwise force summation kernels" (#133) seems to use only warpsize and warp_sync for warp-level semantics. The KA kernel would probably get the warpsize on the host and then pass it in as a parameter (see the sketch after this list). warp_sync is a bit more interesting because, well, at least in the old days warps didn't need any synchronizing. It seems that things changed with Volta and most people missed the memo. Because of this, the easiest thing to do would be to keep the CUDA dependency for that one kernel. We could also add warp-level semantics to KA, but that would take some time to propagate to all the independent GPU APIs and (as mentioned in 1) we are kinda stuck on older versions of AMDGPU and CUDA because of compatibility with other packages.
  3. I am realizing that there is a greater conflict with this PR. Namely, I don't know if I have the bandwidth to do any sort of maintenance on Molly after this PR is in. I don't know if it's fair to ask you to merge 1000 lines of code with a new API and then leave. On the other hand, getting this to work on AMD would be great and really useful. Let me think on that.
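
On point 2, a hedged sketch of what passing the warp size in from the host could look like (a hypothetical skeleton, not the #133 kernel; assumes CUDA.jl's host-side warpsize(device()) query):

```julia
using CUDA, KernelAbstractions

@kernel function tile_first!(out, @Const(vals), ws)
    i = @index(Global, Linear)
    lane = (i - 1) % ws + 1      # lane within a warp-sized tile
    tile = (i - 1) ÷ ws + 1
    # Anything that would actually need warp_sync or shuffles stays in the
    # dedicated CUDA kernel until KA grows warp-level semantics.
    if lane == 1
        out[tile] = vals[i]      # placeholder: one write per tile
    end
end

backend = CUDABackend()
ws = CUDA.warpsize(CUDA.device())   # queried on the host, passed as a plain argument
vals = CUDA.rand(Float32, 32 * 128)
out = KernelAbstractions.zeros(backend, Float32, cld(length(vals), ws))
tile_first!(backend, 128)(out, vals, ws; ndrange = length(vals))
KernelAbstractions.synchronize(backend)
```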

@jgreener64 (Collaborator)

> Because of this, the easiest thing to do would be to keep the CUDA dependency for that one kernel.

That is okay.

> I don't know if it's fair to ask you to merge 1000 lines of code with a new API and then leave.

I wouldn't worry about this. Currently I only merge stuff that I am able to maintain, or where I think I can skill up to the point of maintaining it. The changes here seem reasonable and worth merging once any errors and performance regressions are fixed. There is a wider question about whether KernelAbstractions.jl will continue to be maintained compared to CUDA.jl, but it seems to have decent traction now.

@leios (Contributor, Author) commented Sep 10, 2023

Yeah, the plan is for KA to be used even within GPUArrays, so it's not going anywhere anytime soon. Speaking of which, the "correct" course of action for KA in Molly would be to get KA into GPUArrays first and then use that to implement any missing features at the GPUArrays level.

Would it be better for me to separate this PR then? Maybe one doing the generic Array stuff and then another with the KA support?

@jgreener64 (Collaborator)

I would try and get this PR working as is. Only if that becomes difficult would it be worth splitting out and merging the generic array support.

If KA is here for the long haul, then there is a benefit to switching the kernels even if only CUDA works currently: when changes happen elsewhere, AMDGPU support will work without any changes required in Molly.
