
[Feature Request] Single Precision cuFFT for GPU Package #4043

Open · jobs-git opened this issue Jan 15, 2024 · 4 comments

Comments

@jobs-git commented Jan 15, 2024

Summary

As the title says: add support for single-precision cuFFT in the GPU package.

Detailed Description

According to the documentation (https://docs.lammps.org/Speed_gpu.html), the GPU package computes FFTs on the CPU to take advantage of MPI communication between processors. However, modern GPUs may be faster than CPUs with MPI for this work, so using cuFFT on the GPU would be the better approach, with MPI processes distributing the load in the multi-GPU case. See this post for a possible model: https://developer.nvidia.com/blog/creating-faster-molecular-dynamics-simulations-with-gromacs-2020/
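For reference, a minimal sketch of what the requested capability looks like at the cuFFT level (illustrative only, not LAMMPS code; the grid size and in-place layout are placeholders):

```cpp
// Sketch: single-precision (C2C) 3D FFT on the GPU via cuFFT.
// Build with: nvcc fft_sketch.cu -lcufft
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
  const int nx = 64, ny = 64, nz = 64;          // placeholder PPPM-like grid
  const size_t bytes = sizeof(cufftComplex) * nx * ny * nz;

  cufftComplex *data;                            // single-precision complex
  cudaMalloc((void **)&data, bytes);
  cudaMemset(data, 0, bytes);                    // stand-in for real charge data

  cufftHandle plan;
  cufftPlan3d(&plan, nx, ny, nz, CUFFT_C2C);     // C2C = single precision

  // Forward and inverse transforms execute entirely on the device,
  // with no host-device round trip in between.
  cufftExecC2C(plan, data, data, CUFFT_FORWARD);
  cufftExecC2C(plan, data, data, CUFFT_INVERSE);
  cudaDeviceSynchronize();

  cufftDestroy(plan);
  cudaFree(data);
  return 0;
}
```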

@stanmoore1 (Contributor)

Should work with the KOKKOS package.
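For reference, a sketch of how that looks in practice, assuming LAMMPS was built with KOKKOS and CUDA support (see the Speed_kokkos docs for the exact options for a given version):

```sh
# run on one GPU with the KOKKOS package; the kk suffix selects
# pppm/kk, which executes its FFTs on the device
mpirun -np 1 lmp -k on g 1 -sf kk -in in.script
```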

@ndtrung81 (Contributor)

Right, the KOKKOS package already supports FFTs on the GPUs. Supporting this feature in the GPU package needs volunteers at this point. My estimate is that the host-device transfers needed for the forward and backward FFTs on the GPUs could easily undermine any computational gain from the GPUs, and this gets worse with multiple MPI ranks due to communication costs.

@alphataubio (Contributor) commented Jan 19, 2024

My experience with GPUs has been similar in many cases... you gain an order of magnitude by vectorizing a math operation on the GPU, but you lose two orders of magnitude moving the data between CPU and GPU.

The only approach I've seen that makes sense is the KOKKOS package: keep everything on the GPU as much as possible.

@akohlmey (Member)

> The only approach I've seen that makes sense is the KOKKOS package: keep everything on the GPU as much as possible.

I disagree. What the best approach is depends strongly on the problem at hand and the available hardware. The GPU package can outperform KOKKOS under multiple circumstances by using GPU acceleration only for the parts best suited to it, i.e. the pair style. Particularly in parallel, it is usually better to not use GPU acceleration for PPPM at all (hence the -pk pair/only yes option). The bonded interactions and Kspace can then run concurrently with the data transfer to (positions) and from (forces) the GPU and with the GPU computation. If the pair style is on the expensive side, or the (Coulomb) cutoff on the larger side, then the computation on the CPU is almost "free" since it is done concurrently with the GPU calculation.

Also, only the GPU package currently supports mixed (and single) precision force computation, so it is a big win for consumer GPUs, which often have crippled double-precision performance. The 'GPU as accelerator' strategy of the GPU package benefits from having more CPU cores or a weaker GPU, while the 'CPU as decelerator' approach of KOKKOS is particularly beneficial for very large systems on leadership-class clusters with many nodes and many high-end data center GPUs per node.

Bottom line: like with so many things in science, the answer is more often than not "it depends."
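To make the 'GPU as accelerator' setup described above concrete, here is a sketch of a corresponding invocation, assuming a GPU-package build (in.script is a placeholder; flag details may differ by LAMMPS version):

```sh
# accelerate only the pair style on one GPU; PPPM, bonded terms, etc.
# remain on the 8 CPU ranks and overlap with the GPU computation
mpirun -np 8 lmp -sf gpu -pk gpu 1 pair/only yes -in in.script
```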
