KernelAbstractions support #147

Draft: wants to merge 9 commits into master
Conversation

@leios (Contributor) commented Sep 7, 2023

The KernelAbstractions branch now compiles, so I thought I would put forward a quick draft PR while I figure out all the runtime bugs.

Notes:

  1. This builds off of "adding preliminary AMDGPU support" (#99) and should replace it entirely.
  2. CUDA has been removed and replaced with KernelAbstractions and GPUArrays. As an important note here, GPUArrays is not strictly necessary except to replicate the behavior of the boolean GPU flag (i.e. isa(a, AbstractGPUArray)); a small sketch follows this list.
  3. If this is merged, other GPU types (Metal, AMD, Intel) will also be supported, but I can only test on AMD (and maybe Metal if I can get someone to try it on a Mac).
  4. I need to add in the changes from "Non-atomic pairwise force summation kernels" (#133). If there is something we are missing on the KernelAbstractions side, I can try to add it in, but I think we are good to go.
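
On point 2, a minimal sketch of the idea (hypothetical helper names, not Molly's actual internals; assumes a KernelAbstractions version that provides get_backend):

```julia
using GPUArrays, KernelAbstractions

# The old boolean GPU flag becomes a check on the array type ...
on_gpu(a) = a isa AbstractGPUArray

# ... and the KA backend is recovered from the array itself, so the same code
# path works for Array, CuArray, ROCArray, etc. (helper names are hypothetical).
array_backend(a) = KernelAbstractions.get_backend(a)
```

With something like this, the only GPU-specific choice left at the call sites is which backend the arrays live on.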

@leios (Contributor, Author) commented Sep 7, 2023

Ah, while I'm here, I'll briefly explain the syntactic differences from CUDA (a short sketch follows the list):

  1. Indexing is easier: @index(Global / Group / Local, Linear / NTuple / CartesianIndex) vs (blockIdx().x - 1) * blockDim().x + threadIdx().x for CUDA.
  2. Kernels run off of an ndrange for the range of elements (OpenCL-inspired syntax).
  3. Launching kernels requires configuring them with a backend, see: https://github.com/leios/Molly.jl/blob/KA_support/src/kernels.jl#L21
  4. Certain functions now execute on the backend: CUDA.zeros(...) -> zeros(backend, args...).
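
To make the list above concrete, here is a minimal sketch of writing and launching a KA kernel (a generic toy example rather than one of Molly's kernels; the workgroup size of 64 is an arbitrary choice):

```julia
using KernelAbstractions

# A trivial SAXPY-style kernel: @index replaces CUDA.jl's manual
# (blockIdx().x - 1) * blockDim().x + threadIdx().x arithmetic.
@kernel function scale_add!(y, @Const(x), a)
    i = @index(Global, Linear)
    y[i] = a * x[i] + y[i]
end

backend = CPU()                                   # or CUDABackend(), ROCBackend(), ...
n = 1024
x = KernelAbstractions.ones(backend, Float32, n)  # CUDA.zeros(...) -> zeros(backend, args...)
y = KernelAbstractions.zeros(backend, Float32, n)

kernel! = scale_add!(backend, 64)                 # configure the kernel for a backend
kernel!(y, x, 2.0f0; ndrange = n)                 # ndrange gives the range of elements
KernelAbstractions.synchronize(backend)           # launches are asynchronous
```

Note that scale_add!(backend, 64) only specializes the kernel for a backend and workgroup size; nothing runs until the returned callable is invoked with an ndrange.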

The tricky thing about this PR was removing the CUDA dependency outside of the kernels. There is still one call in zygote.jl I gotta figure out: https://github.com/leios/Molly.jl/blob/KA_support/src/zygote.jl#L698

@jgreener64 (Collaborator)

Great work so far. Making the code compatible with generic array types is a nice change, and having the kernels work on different devices would be a selling point of the package.

I would be interested to see the performance of the kernels compared to the CUDA versions. Also whether it plays nicely with Enzyme. Good luck with the runtime errors.

@leios (Contributor, Author) commented Sep 8, 2023

I think I can finish this up today or early next week (emphasis on think), but to quickly answer the questions:

  1. KA essentially just writes vendor-specific code (i.e. CUDA) from the generic code input, so if we don't have identical performance to CUDA, then that's a bug. I'll do performance testing similar to that in "Non-atomic pairwise force summation kernels" (#133) once the code is cleaned up.
  2. Enzyme should also not be a problem; however, there are some reports of issues with error handling: "Enzyme + KA stalls on error instead of reporting it" (EnzymeAD/Enzyme.jl#365).

@jgreener64 (Collaborator)

Great. Not urgent, but how well does KernelAbstractions.jl deal with warp-level code, e.g. warpsize() and sync_warp()?

@leios (Contributor, Author) commented Sep 8, 2023

That's a good question. We can probably expose the APIs available from CUDA, but I am not sure how AMDGPU deals with these. We would also need to figure out what they correspond to on the parallel CPU backend.

I think these are the tools we need: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/warp_ops/index.html
So they are available; it's just a matter of exposing them in KA and figuring out what they correspond to on different backends.

Ah, as an important note (that I somehow failed to mention before), an advantage of KA is that it also provides a parallel CPU implementation, so the kernels can be written once and executed everywhere (a rough sketch follows). I didn't do that in this PR because that brings up design questions related to Molly internals.
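
As a rough sketch of the write-once idea (a toy kernel, not one of Molly's; the AMD lines are commented out and assume AMDGPU.jl provides ROCBackend and ROCArray):

```julia
using KernelAbstractions

@kernel function accumulate_forces!(forces, @Const(coords))
    i = @index(Global, Linear)
    # Toy per-element update; a real force kernel would loop over neighbours here.
    forces[i] += coords[i]
end

# On the CPU the same kernel body runs as a (multithreaded) loop over workgroups.
backend = CPU()
coords = rand(Float32, 1000)
forces = zeros(Float32, 1000)
accumulate_forces!(backend, 64)(forces, coords; ndrange = length(forces))
KernelAbstractions.synchronize(backend)

# On an AMD GPU only the backend and array types would change, e.g.
#   using AMDGPU
#   backend = ROCBackend()
#   coords, forces = ROCArray(coords), ROCArray(forces)
```

The CPU backend splits the ndrange into workgroups and runs them on Julia threads, so Julia needs to be started with threads to actually see CPU parallelism.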

@jgreener64 (Collaborator)

> I didn't do that in this PR because that brings up design questions related to Molly internals.

Yeah we can discuss that after this PR. I would be okay with switching if there was no performance hit.

Judging from the discussion on the linked PR, there is currently no warp support in KA. It may be necessary to leave that CUDA kernel in and have a separate KA kernel for other backends until warp support comes to KA.

@leios (Contributor, Author) commented Sep 9, 2023

Ok, so a couple of quick notes here:

  1. There are a few host calls that are not yet supported by AMDGPU (such as findall). My understanding was that such calls would eventually be ported to GPUArrays, but I don't think that has happened yet. Note that some of the stalling here is because we are waiting to get KA into GPUArrays ("Transition GPUArrays to KernelAbstractions", JuliaGPU/GPUArrays.jl#451). At least for findall, the kernel is not that complex: https://github.com/JuliaGPU/CUDA.jl/blob/master/src/indexing.jl#L23, so we could put it into AMDGPU or something for now; however, we are stuck on an older version of AMDGPU due to some package conflicts. The quick fix would be to do it the ol' fashioned way and just stick the necessary kernels in Molly under a file like kernel_hacks.jl or something. Such issues were also what stalled "adding preliminary AMDGPU support" (#99).
  2. "Non-atomic pairwise force summation kernels" (#133) seems to use only warpsize and warp_sync for warp-level semantics. The KA kernel would probably get the warpsize on the host and then pass it in as a parameter (see the sketch after this list). warp_sync is a bit more interesting because, well, at least in the old days warps didn't need any synchronizing. It seems that things changed with Volta and most people missed the memo. Because of this, the easiest thing to do would be to keep the CUDA dependency for that one kernel. We could also add warp-level semantics to KA, but that would take some time to propagate to all the independent GPU APIs and (as mentioned in 1) we are kinda stuck on older versions of AMDGPU and CUDA because of compatibility with other packages.
  3. I am realizing that there is a greater conflict with this PR. Namely, I don't know if I have the bandwidth to do any sort of maintenance on Molly after this PR is in. I don't know if it's fair to ask you to merge 1000 lines of code with a new API and then leave. On the other hand, getting this to work on AMD would be great and really useful. Let me think on that.
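
On point 2, a hedged sketch of what passing the warp size in from the host could look like (a hypothetical skeleton, not the #133 kernel; assumes CUDA.jl's host-side warpsize(device()) query):

```julia
using CUDA, KernelAbstractions

@kernel function tile_first!(out, @Const(vals), ws)
    i = @index(Global, Linear)
    lane = (i - 1) % ws + 1      # lane within a warp-sized tile
    tile = (i - 1) ÷ ws + 1
    # Anything that would actually need warp_sync or shuffles stays in the
    # dedicated CUDA kernel until KA grows warp-level semantics.
    if lane == 1
        out[tile] = vals[i]      # placeholder: one write per tile
    end
end

backend = CUDABackend()
ws = CUDA.warpsize(CUDA.device())   # queried on the host, passed as a plain argument
vals = CUDA.rand(Float32, 32 * 128)
out = KernelAbstractions.zeros(backend, Float32, cld(length(vals), ws))
tile_first!(backend, 128)(out, vals, ws; ndrange = length(vals))
KernelAbstractions.synchronize(backend)
```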

@jgreener64 (Collaborator)

> Because of this, the easiest thing to do would be to keep the CUDA dependency for that one kernel.

That is okay.

> I don't know if it's fair to ask you to merge 1000 lines of code with a new API and then leave.

I wouldn't worry about this. Currently I only merge stuff that I am able to maintain, or where I think I can skill up to the point of maintaining it. The changes here seem reasonable and worth merging once any errors and performance regressions are fixed. There is a wider question about whether KernelAbstractions.jl will continue to be maintained compared to CUDA.jl, but it seems to have decent traction now.

@leios (Contributor, Author) commented Sep 10, 2023

Yeah, the plan is for KA to be used even within GPUArrays, so it's not going anywhere anytime soon. Speaking of which, the "correct" course of action for KA in Molly would be to get KA into GPUArrays first and then use that to implement any missing features at the GPUArrays level.

Would it be better for me to separate this PR then? Maybe one doing the generic Array stuff and then another with the KA support?

@jgreener64 (Collaborator)

I would try and get this PR working as is. Only if that becomes difficult would it be worth splitting out and merging the generic array support.

If KA is here for the long haul, then there is a benefit to switching the kernels even if only CUDA works currently: when changes happen elsewhere, AMDGPU support will work without any changes required in Molly.
