Differentiable GPU kernels #60

Open
jgreener64 opened this issue Mar 30, 2022 · 16 comments

@jgreener64
Collaborator

At the minute the GPU/differentiable path is Zygote-compatible and hence uses non-mutating broadcasted operations. This works, but is rather slow and very GPU memory-intensive.
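For context, the broadcasting style looks roughly like this (a minimal sketch, not the exact Molly code; the Lennard-Jones constants are made up):

```julia
using StaticArrays

# Hypothetical pair force for illustration (σ and ϵ are made up).
function lj_force(dr, σ=0.3, ϵ=0.2)
    r2 = sum(abs2, dr)
    six = (σ^2 / r2)^3
    return (24ϵ * (2 * six^2 - six) / r2) * dr
end

# Non-mutating broadcast over all neighbour pairs. Zygote can differentiate
# this, but it materialises a displacement and a force per pair, and the
# scatter back to per-atom forces must also avoid mutation.
function pair_forces(coords, pairs)
    drs = map(p -> coords[p[2]] - coords[p[1]], pairs)
    return lj_force.(drs)
end
```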

Long term the plan is to switch to Enzyme-compatible GPU kernels to calculate and sum the forces using the neighbour list. This will be much faster both with and without gradients, and should help us move towards the speeds of existing MD software. These kernels could be used as part of the general interaction interface as is, or another interface could emerge to use them. Enzyme and Zygote can be used together, so it should be possible to replace the force summation alone and retain the functionality of the package.

One consideration is how general such kernels should be. A general pairwise force summation kernel for user-defined force functions would be useful for Lennard-Jones and Coulomb interactions, and hence would be sufficient for macromolecular simulation. Other more specialised multi-body kernels could live in Molly or elsewhere depending on how generic they are.
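One way to keep such a kernel general is to pass the force function in as an argument, so that Lennard-Jones, Coulomb or any user-defined pair potential compiles to its own specialised kernel. A rough sketch of the idea (names are illustrative; forces are stored as a 3×N matrix so the atomic adds operate on scalars):

```julia
using CUDA

# Sketch of a general pairwise kernel over a neighbour list; `force_fn`
# is any user-supplied function of the displacement vector.
function pairwise_force_kernel!(forces, coords, neighbours, force_fn)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= length(neighbours)
        i, j = neighbours[idx]
        f = force_fn(coords[j] - coords[i])
        for k in 1:3
            # Atomic adds because many pairs touch the same atom.
            CUDA.@atomic forces[k, i] += f[k]
            CUDA.@atomic forces[k, j] -= f[k]
        end
    end
    return nothing
end

# Launch, e.g.:
# @cuda threads=256 blocks=cld(length(neighbours), 256) pairwise_force_kernel!(fs, cs, nbs, lj_force)
```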

Another concern is how the neighbour list is best stored (calculation of the neighbour list can also be GPU accelerated but that is a somewhat separate issue).

Something to bear in mind is the extension from using one to multiple GPUs for the same simulation. It is probably best to start with one GPU and go from there.

This issue is to track and discuss this development. cc @leios

Useful links:

@vchuravy

Trying out Enzyme will depend on EnzymeAD/Enzyme.jl#230, which I hope to get to by mid-April

@leios
Contributor

leios commented Jun 23, 2022

Hey, I kinda left this in the dark as issues with other projects cropped up. So far, I have:

  1. Mocked up a simple NBody simulator in Julia to get an idea of the performance benefits of KernelAbstractions (KA) over the broadcasting currently used in Molly. With my other package, I could hit a target of 100,000 particles running 10 timesteps in roughly 10 seconds on my old GTX 970 without a neighborlist. To be honest, that simulator was for gravity simulations, but swapping out the force law is kinda trivial (a sketch follows this list).
  2. Did some basic performance analysis of Molly to see the current GPU utilization. I think this can be improved a lot by a switch to KA / CUDA.
  3. Tried to figure out some memory-efficient ways to store the neighborlist on the GPU. Still working on this one.
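On point 1, the force-law swap really is just passing a different pair function; the summation loop or kernel around it is unchanged. A toy sketch (constants made up):

```julia
using StaticArrays

# Toy softened-gravity pair force.
grav_force(dr; ϵ=1f-4) = (r2 = sum(abs2, dr) + ϵ; dr / (r2 * sqrt(r2)))

# Toy Lennard-Jones pair force with the same signature.
function lj_force(dr; σ=0.3f0, ϵ=0.2f0)
    r2 = sum(abs2, dr)
    six = (σ^2 / r2)^3
    return (24ϵ * (2 * six^2 - six) / r2) * dr
end

# The same naive summation works for either law.
net_force(coords, i, f) = sum(f(coords[j] - coords[i]) for j in eachindex(coords) if j != i)
```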

Regardless, here are my immediate plans:

  1. Create a draft PR with KA in Molly for a simple VelocityVerlet run (maybe just optimizing the already existing GPU example). This will start with the forces!(...) calculation, as that is the one taking the most time outside of the neighborlist generation. Note that because this calculation also requires a neighborlist, I'll be playing around with it as well, but I want to leave any "big" changes for neighborlist generation to a second (or third) pass.
  2. If the benchmarks seem good, I'll work towards getting KA everywhere else and also mess around with Enzyme.
  3. After that, I will then tackle the neighborlists. The reason I am separating them out is that there doesn't seem to be a strong consensus on the correct way to do them. In addition, we are currently trying to figure out exactly what we want out of the neighborlist API for CESMIX and may be creating a package for that. If you have good papers to read on GPU neighborlists and best practices, let me know. I found a few here and there, but could be missing something major.

Currently, I'm hitting a few small issues with Molly in terms of memory usage and stuff like this, but I am slowly figuring things out.

All-in-all, this process will take some time, but I wanted to check in to make sure none of this will step on your toes for summer development of Molly.

@jgreener64
Collaborator Author

Thanks for the update. Your plan sounds good and I am looking forward to seeing what you come up with.

A lot of the current design is the way it is due to Zygote-compatibility, so if CUDA kernels with Enzyme were used then hopefully we can get rid of a lot of memory issues. As you have found out, GPU utilisation is very poor and broadcasting across the whole neighbour list uses a huge amount of memory.

It makes sense to start with the force function and think about neighbours after. I'm not an expert on neighbour lists, though I did learn a lot from Gromacs papers from 2015 and 2020 (and a related talk).

I am keen for the simulator definitions to be high-level and not implemented in CUDA/KernelAbstractions if possible. If that becomes the bottleneck later it can be reconsidered. Anything within accelerations and find_neighbors is fair game.

I would be interested to hear how the speed changes with single vs. double precision; I don't know which you are working in currently. Numerical accuracy is something to think about, with most simulation software using mixed precision.

I am currently working to replace box_size throughout the whole package with boundary types, opening the door to arbitrary boundary conditions. I don't think it should affect your work too much but I thought I'd make you aware. For systems in a cubic box the SVector{3} will be replaced by a struct containing an SVector{3}, with relevant dispatches on vector and wrap_coords_vec.
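In case it helps planning around it, the shape of that change is roughly the following (a sketch; the final names and fields may differ):

```julia
using StaticArrays

# Sketch of the planned boundary type for a cubic box.
struct CubicBoundary{T}
    side_lengths::SVector{3, T}
end

# Minimum-image displacement between two coordinates.
vector(c1, c2, b::CubicBoundary) = @. (c2 - c1) - b.side_lengths * round((c2 - c1) / b.side_lengths)

# Wrap a coordinate back into the box.
wrap_coords_vec(c, b::CubicBoundary) = @. c - b.side_lengths * floor(c / b.side_lengths)
```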

@leios
Contributor

leios commented Jun 23, 2022

I had no intention of touching simulator definitions directly. In fact, I don't really think a switch to KA would speed up much outside of accelerations and find_neighbors, so those functions were my targets.

For multi-GPU, we might need to add some functionality to communicate across devices, but I am not sure where that should live right now and I am putting it off until we get single GPU performance a little better.

For single vs double precision, I wrote my other package to do either, but not both. Where do the benefits come in for mixed precision?

Here are quick tests (again, my own code with 100,000 particles for 10 timesteps):

|         | GTX 970     | P100        |
|---------|-------------|-------------|
| Float32 | 11.471057 s | 10.994425 s |
| Float64 | 73.748271 s | 14.873951 s |

This kinda makes sense as my GTX 970 doesn't have fast double precision support... My guess is that the Float32 timings are similar because I made sure to use shared memory for all essential operations and the P100 is also quite old by now. If you look at the GPU performance (blue bar at the top-ish), you see that each step is basically using the GPU as efficiently as possible:

[Screenshot: GPU utilization profile (2022-06-23)]

If you are interested, here is the nbody (no neighborlist) code: https://github.com/leios/Galily.jl/blob/cde36f32ae5d6e650f3bbec511cb40b6dc5b269e/src/forces.jl#L68
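The key trick in there is the standard tiled n-body pattern: each block stages a tile of positions in shared memory so every global load is reused by all threads in the block. A condensed sketch of that pattern (not the linked code verbatim; the pair force is a toy one):

```julia
using CUDA, StaticArrays

# Toy softened pair force.
@inline pair_force(dr, ϵ=1f-4) = (r2 = sum(abs2, dr) + ϵ; dr / (r2 * sqrt(r2)))

function nbody_tiled!(forces, coords, ::Val{TILE}) where TILE
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    n = length(coords)
    tile = CuStaticSharedArray(SVector{3, Float32}, TILE)
    ci = i <= n ? coords[i] : zero(SVector{3, Float32})
    f = zero(SVector{3, Float32})
    for start in 1:TILE:n
        # Cooperatively load one tile of positions into shared memory.
        j = start + threadIdx().x - 1
        tile[threadIdx().x] = j <= n ? coords[j] : zero(SVector{3, Float32})
        sync_threads()
        for t in 1:min(TILE, n - start + 1)
            jj = start + t - 1
            if i <= n && jj != i
                f += pair_force(tile[t] - ci)
            end
        end
        sync_threads()
    end
    i <= n && (forces[i] = f)
    return nothing
end

# Launch with threads matching the tile size, e.g.:
# @cuda threads=256 blocks=cld(n, 256) nbody_tiled!(fs, cs, Val(256))
```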

The (big) caveat is that this ignores the neighborlist calculation and AD support. If we run Molly (10 timesteps, 5000 particles, Float32, GTX 970), it runs in 5.5 s and floods my available memory. The GPU performance is also a lot bumpier:

[Screenshot: Molly GPU utilization profile (2022-06-23)]

I realize a 970 isn't a target for Molly, but the same card can run my simple set-up with 5000 particles in either 0.37 seconds for Float64 or 0.12 seconds for Float32 (~45x faster with a way smaller memory footprint). My goal is to get Molly as close as possible to the numbers I am getting in the other code while also supporting AD.

Anyway, this might take some time, but I'll try to keep you in the loop!

Also, thanks for the papers / looking forward to the boundary changes!

@jgreener64
Collaborator Author

Thanks for the benchmarks and the link to the code, looks interesting.

> For multi-GPU, we might need to add some functionality to communicate across devices, but I am not sure where that should live right now and I am putting it off until we get single GPU performance a little better.

Yes I think that's a longer-term goal.

I think anything confined to the force calculation can be single precision, and hence benefit from the speed available there. My understanding is that double precision is used for certain values, e.g. the virial, where accumulation error is relevant (https://manual.gromacs.org/current/reference-manual/definitions.html#mixed-or-double-precision).
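In code the pattern is something like the following (illustrative constants): per-pair arithmetic in Float32 for speed, with the accumulator in Float64 where error builds up over many additions.

```julia
using StaticArrays

# Illustrative mixed-precision energy sum: Float32 pair terms,
# Float64 accumulator to limit accumulation error.
function potential_energy(coords::Vector{SVector{3, Float32}}, pairs; σ=0.3f0, ϵ=0.2f0)
    acc = 0.0  # Float64
    for (i, j) in pairs
        r2 = sum(abs2, coords[j] - coords[i])  # Float32 arithmetic
        six = (σ^2 / r2)^3
        acc += 4ϵ * (six^2 - six)              # promoted to Float64 on +=
    end
    return acc
end
```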

@jgreener64
Collaborator Author

I just pushed the branch kernels, showing work in progress on CUDA kernels for force and energy summation. The tests pass apart from differentiable simulation, which will require more work to get going. Initial indications show a ~15x speedup over the previous version, ~0.8 ms/step for 3k atoms, with room for optimisation.

Ideally we would use KernelAbstractions.jl as per @leios' work, but this could serve as a partial step and/or a reference point.

The next step is to look at getting Enzyme working for CPU force summation, then for GPU force summation. I think the GPU step will need CUDA atomic support to be added to Enzyme (EnzymeAD/Enzyme.jl#421) and also a fix for EnzymeAD/Enzyme.jl#428.

@leios I know you did some work on atomics for KernelAbstractions.jl, do you have an idea of whether CUDA atomic support will be added for Enzyme.jl?

@leios
Contributor

leios commented Sep 15, 2022

The kernels branch looks good and is easily extendable to KA for AMDGPU / CPU support as well (it shouldn't change CUDA performance). I have a KA branch as well that gets similar performance. I also saw a notable improvement to CPU performance in my branch, which I was somewhat surprised by.

Seeing as the kernels branch is already so far along, maybe it would be best for me to stop developing my own branch and instead just port what you have to KA when the time comes (for CPU / AMDGPU support). So my priorities are:

  1. Finish adding preliminary AMDGPU support #99. It will probably be done in the next day or two.
  2. Take a look at the Enzyme issue. No matter if we are using KA or CUDA, we will have the same atomic issues with Enzyme, I think (unless the Atomix package does some tricky stuff that I am not aware of). I've been digging into the atomics stuff a bit in the past few weeks and am in the same circle as a bunch of the key developers, so I can try to make headway there (or at least figure out what the hold-up is). I just sent a message to the main Atomix developer to see if they had put any thought into Enzyme yet. I also just sent a message to William from Enzyme to see if there's any way I can poke at the issue myself.
  3. Port the kernels branch to KA when it's ready instead of working on my own stuff. This way we are not stepping on each other's toes. Right now, it looks like the kernels branch is using a similar set-up to what is already on the master branch, just CUDA-ified. Honestly, as long as there's CUDA code to work off of, a KA port would only take an afternoon (along with a few days of testing) to complete; a sketch of the KA syntax follows this list. KA also gives parallel CPU and AMDGPU support without a loss in CUDA performance, so I see it as a net win.
  4. I've been talking to one of the devs that is working on Dagger.jl. They seem to have a nice demo of a wrapper around MPI (essentially) that should make the multi-GPU step way easier.
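For reference on point 3, the translation is mostly mechanical: the blockIdx/threadIdx arithmetic becomes @index(Global) and the launch goes through a backend object. Something like this (a sketch against a recent KA API; the @atomic macro comes from KA's Atomix integration):

```julia
using KernelAbstractions

# KA version of a pairwise force kernel; @index(Global) replaces the
# CUDA.jl blockIdx/threadIdx index boilerplate.
@kernel function pairwise_force_ka!(forces, @Const(coords), @Const(neighbours), force_fn)
    idx = @index(Global)
    i, j = neighbours[idx]
    f = force_fn(coords[j] - coords[i])
    for k in 1:3
        @atomic forces[k, i] += f[k]
        @atomic forces[k, j] -= f[k]
    end
end

# Launch on whatever backend the arrays live on, e.g.:
# kernel! = pairwise_force_ka!(get_backend(forces), 256)
# kernel!(forces, coords, neighbours, lj_force; ndrange=length(neighbours))
```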

What do you think? Could we maybe target mid October for 1-3 (depending on how complicated 2 is)? The way I see it, you are doing a great job with the kernels branch already, but are missing key features from elsewhere in Julia. Maybe I should focus on making sure those features get done while you are working on the kernels.

I'm actually quite behind on my branch right now, but a few comments there:

  • It is not obvious to me that the atomics-based kernels will necessarily be faster than ones that just do the force calculation twice (sketched below this list). I think it might depend on how complicated the force calculations get, but with the equal and opposite forces, almost every add is an atomic add, which might be quite slow. I'm only just now looking through the code, but I think if you use a grid size that is just the upper triangle of the square interaction matrix (just the list of all possible interactions between i and j), you can minimize the number of atomic operations by cleverly choosing your blocks and threads. It looks like you are already doing this to some extent (looking at: https://github.com/JuliaMolSim/Molly.jl/blob/kernels/src/cuda.jl#L25)?
  • I had to revamp the neighborlists a bit to get them to play nicely with kernels. It looks like this is done as well? I basically just wrote a simple function to minimize the memory footprint of a neighborlist and pass it to the GPU. It seems like the neighborlist passion has died down a bit at MIT, so I need to poke a few people and see what progress is being made there.
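For comparison, the atomics-free alternative parallelises over atoms rather than pairs: each thread sums the full force on its own atom, so every pair is evaluated twice but all writes are thread-private. A sketch (`neighbours_of` is a hypothetical per-atom adjacency layout):

```julia
using CUDA, StaticArrays

# Each thread owns one atom; no atomic operations are needed, at the
# cost of computing every interaction twice.
function force_per_atom_kernel!(forces, coords, neighbours_of, force_fn)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(coords)
        f = zero(SVector{3, Float32})
        for j in neighbours_of[i]
            f += force_fn(coords[j] - coords[i])
        end
        forces[i] = f  # thread-private write
    end
    return nothing
end
```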

@jgreener64
Collaborator Author

> The kernels branch looks good and is easily extendable to KA for AMDGPU / CPU support as well (it shouldn't change CUDA performance). I have a KA branch as well that gets similar performance.

> Port the kernels branch to KA when it's ready

Yes sounds good.

> Take a look at the Enzyme issue

This would be really useful, and is something I probably won't have the skills for myself.

> What do you think? Could we maybe target mid October for 1-3 (depending on how complicated 2 is)?

I think that is possible; as you say, it will depend on how difficult the atomic issue is and whether I can work Enzyme in more broadly without issues.

> It is not obvious to me that the atomics-based kernels will necessarily be faster than one that just does the force calculation twice.

I agree for the case with no neighbour list - the first two links in the first post in this issue use that approach. I wouldn't be against implementing that kernel for the non-neighbour list case. I couldn't work out how to get that scheme to work with a neighbour list though, bearing in mind that the proportion of neighbours to possible interactions goes down as N goes up. Perhaps there is a way. In OpenMM they use a complex scheme to avoid atomic adds (https://github.com/openmm/openmm/blob/master/platforms/cuda/src/kernels/nonbonded.cu#L42-L103) and also use multiple force buffers (http://docs.openmm.org/7.2.0/developerguide/developer.html#computing-forces). I'm okay with leaving that speed on the table for now though to get everything working.

> I had to revamp the neighborlists a bit to get them to play nicely with kernels.

On the branch currently I just store a CuArray of tuples for i, j and 1-4 interaction. There may be a better way, I haven't looked at optimising the neighbour calculation much.
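Concretely, that is roughly:

```julia
using CUDA

# Flat device array of (i, j, weight_14) neighbour tuples; the element
# type is isbits, so it moves to the GPU without restructuring.
neighbours = CuArray([(Int32(1), Int32(2), false),
                      (Int32(1), Int32(3), true)])
```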

@leios
Contributor

leios commented Sep 15, 2022

Great! Thanks for the info / I'll see if I can work on the Enzyme issues. Apparently the bottleneck is proving the correctness of the adjoint. I'll dig into it more next week.

As for the neighborlists and atomic avoidance, I agree. The most important thing right now is the switch to a kernel-based design; optimizations can come later.

@leios
Contributor

leios commented Sep 22, 2022

EnzymeAD/Enzyme#849 (I didn't write the PR)

Looks promising! I'll chat with a few other folks tomorrow about how to easily test things in Julia. I think we need to wait for the next jll bump first.

@jgreener64
Collaborator Author

Great! Enzyme seems to be improving fast at the moment.

@jgreener64
Collaborator Author

After a lot of work and help from the Enzyme team, v0.15.0 adds differentiable CUDA.jl kernels and much-improved performance across the board.

The kernels are currently rather simple and leave a lot of performance on the table. Going from simple to complex kernels is a smaller leap than going from broadcasting to kernels though. Hopefully there will be a GSoC student looking at that this summer.

As Enzyme.jl begins to work on more of Julia it will be interesting to see if we can replace Zygote.jl for the simulators too, improving performance by reducing memory allocations using mutation.
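As a flavour of why that helps: Enzyme differentiates mutating code that Zygote rejects outright, so simulators could update state in place. A minimal sketch (the exact autodiff calling convention depends on the Enzyme.jl version):

```julia
using Enzyme

# A mutating accumulation; Zygote errors on the in-place write,
# Enzyme handles it without extra allocations.
function sum_sq!(out, xs)
    for x in xs
        out[1] += x^2
    end
    return nothing
end

xs, dxs = rand(5), zeros(5)
out, dout = zeros(1), ones(1)
autodiff(Reverse, sum_sq!, Const, Duplicated(out, dout), Duplicated(xs, dxs))
# dxs ≈ 2 .* xs
```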

@leios
Contributor

leios commented Apr 19, 2023

I'm happy to see the kernels branch merged! Great work! I'm happy to extend #99 to the kernels branch as well. I don't think it has to be merged immediately, as long as it is kept up-to-date for people who need it.

@jgreener64
Collaborator Author

It would certainly be good to have support for different GPU array types. I don't know how well the current CUDA.jl kernels play with other array types though.

Maybe a job for KernelAbstractions.jl in the long run. Personally I am leaning towards optimising the CUDA.jl kernels first and then investigating a switch to KernelAbstractions.jl, but I am happy to support anyone who wants to try the switch on the current code.

@leios
Contributor

leios commented Apr 19, 2023

Ok, in that case, let me mock up the KernelAbstractions branch and you can look at it. It should work just fine and have the same performance. Anyone who needs the AMD / Metal / Intel GPU support can swap to it.

If you are fine with the KernelAbstractions syntax, you can merge it; otherwise, we can keep it updated and mention it in the documentation.

@jgreener64
Collaborator Author

Great. I would be inclined to merge it, even if it lived alongside more complex CUDA.jl kernels later on. It will be interesting to see how nicely Enzyme plays with KernelAbstractions too.
