
[Feature Request] Single Precision cuFFT for GPU Package #4043

Open · jobs-git opened this issue Jan 15, 2024 · 4 comments

Comments

@jobs-git commented Jan 15, 2024

Summary

As the title says: add support for single-precision cuFFT in the GPU package.

Detailed Description

According to the documentation (https://docs.lammps.org/Speed_gpu.html), the GPU package computes FFTs on the CPU to take advantage of MPI communication between processors. However, modern GPUs may be faster than CPUs with MPI for this work, so using cuFFT on the GPU would be the better approach, with MPI processes distributing the load in the multi-GPU case. See this post for a possible model: https://developer.nvidia.com/blog/creating-faster-molecular-dynamics-simulations-with-gromacs-2020/
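For reference, a minimal sketch of what the requested capability looks like at the cuFFT level (illustrative only, not LAMMPS code; the grid size and in-place layout are placeholders):

```cpp
// Sketch: single-precision (C2C) 3D FFT on the GPU via cuFFT.
// Build with: nvcc fft_sketch.cu -lcufft
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
  const int nx = 64, ny = 64, nz = 64;          // placeholder PPPM-like grid
  const size_t bytes = sizeof(cufftComplex) * nx * ny * nz;

  cufftComplex *data;                            // single-precision complex
  cudaMalloc((void **)&data, bytes);
  cudaMemset(data, 0, bytes);                    // stand-in for real charge data

  cufftHandle plan;
  cufftPlan3d(&plan, nx, ny, nz, CUFFT_C2C);     // C2C = single precision

  // Forward and inverse transforms execute entirely on the device,
  // with no host-device round trip in between.
  cufftExecC2C(plan, data, data, CUFFT_FORWARD);
  cufftExecC2C(plan, data, data, CUFFT_INVERSE);
  cudaDeviceSynchronize();

  cufftDestroy(plan);
  cudaFree(data);
  return 0;
}
```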

@stanmoore1 (Contributor)

Should work with the KOKKOS package.
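For reference, a sketch of how that looks in practice, assuming LAMMPS was built with KOKKOS and CUDA support (see the Speed_kokkos docs for the exact options for a given version):

```sh
# run on one GPU with the KOKKOS package; the kk suffix selects
# pppm/kk, which executes its FFTs on the device
mpirun -np 1 lmp -k on g 1 -sf kk -in in.script
```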

@ndtrung81 (Contributor)

Right, the KOKKOS package already supports FFTs on the GPUs. Supporting this feature in the GPU package needs volunteers at this point. My estimate is that the host-device transfers needed for the forward and backward FFTs on the GPUs could easily undermine any computational gain from the GPUs, and this gets worse with multiple MPI ranks due to communication costs.

@alphataubio (Contributor) commented Jan 19, 2024

My experience with GPUs has been similar in many cases... you gain an order of magnitude by vectorizing a math operation on the GPU, but you lose two orders of magnitude moving the data between CPU and GPU.

The only approach I've seen that makes sense is the KOKKOS package: keep everything on the GPU as much as possible.

@akohlmey (Member)

> The only approach I've seen that makes sense is the KOKKOS package: keep everything on the GPU as much as possible.

I disagree. What the best approach is depends strongly on the problem at hand and the available hardware. The GPU package can outperform KOKKOS under multiple circumstances by using GPU acceleration only for the parts best suited to it, i.e. the pair style. Particularly in parallel, it is usually better to not use GPU acceleration for PPPM at all (hence the -pk pair/only yes option). The bonded interactions and Kspace can then run concurrently with the data transfer to (positions) and from (forces) the GPU and with the GPU computation. If the pair style is on the expensive side, or the (Coulomb) cutoff on the larger side, then the computation on the CPU is almost "free" since it is done concurrently with the GPU calculation.

Also, only the GPU package currently supports mixed (and single) precision force computation, so it is a big win for consumer GPUs, which often have crippled double-precision performance. The 'GPU as accelerator' strategy of the GPU package benefits from having more CPU cores or a weaker GPU, while the 'CPU as decelerator' approach of KOKKOS is particularly beneficial for very large systems on leadership-class clusters with many nodes and many high-end data center GPUs per node.

Bottom line: like with so many things in science, the answer is more often than not "it depends."
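To make the 'GPU as accelerator' setup described above concrete, here is a sketch of a corresponding invocation, assuming a GPU-package build (in.script is a placeholder; flag details may differ by LAMMPS version):

```sh
# accelerate only the pair style on one GPU; PPPM, bonded terms, etc.
# remain on the 8 CPU ranks and overlap with the GPU computation
mpirun -np 8 lmp -sf gpu -pk gpu 1 pair/only yes -in in.script
```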
