Cabana-based code takes same time to run on both GPU and on parallel CPU cores #749

Open
dineshadepu opened this issue Apr 24, 2024 · 2 comments
Labels
question Further information is requested

Comments

@dineshadepu
Contributor

Hi all,

Similar to #748, this is also a question.

I have an HPC system with the following configuration:

AMD Ryzen Threadripper PRO 5975WX (32 cores) and two NVIDIA RTX A5500 GPUs.

In issue #748 I mentioned that I am working on an SPH-DEM solver and have so far implemented the SPH solver (which still has bugs) and the DEM solver (done) independently; they are not coupled yet. I ran settling_of_bodies_in_tank.cpp on both parallel CPU cores and on the GPU, using the following command:

time ./examples/03RBBodiesSettling 0.1 1.0 1.0 1000 0.1 200

This run uses 1000 rigid bodies and a total simulated time of $0.1$ seconds, for a total of $1800$ steps. The 1000 rigid bodies correspond to $127436$ particles.

The total run times are:

| OpenMP | CUDA |
|---|---|
| 9.17 seconds | 9.8 seconds |

However, the ExaMPM code developed by the Cabana developers is much faster on the GPU than on parallel CPU cores.
For comparison, I took the DamBreak example and ran it on the CPU with the following command:

time ./examples/DamBreak 0.05 2 0 0.001 1.0 10 OpenMP

and on the GPU with:

time ./examples/DamBreak 0.05 2 0 0.001 1.0 10 CUDA

I get the following run times:

| OpenMP | CUDA |
|---|---|
| 33 seconds | 0.98 seconds |

The CUDA run is more than 30 times faster than the CPU run. Unfortunately, I am unable to get similar numbers for my own code.
I believe I followed the best practices, so I am really not sure why I am missing this performance boost. Is it because I am using two AoSoAs in my case, or is it something else? Is there a way to debug this performance issue? I am almost done with both the SPH and DEM codes; only the coupling is left to be added. Can you please help me with this issue?
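
For reference, the ratios implied by the run times reported above can be worked out directly. This is only a back-of-the-envelope calculation: the numbers come from `time`, so they include start-up and output, and the kernel-only ratios may differ.

$$
S_{\text{DEM}} = \frac{t_{\text{OpenMP}}}{t_{\text{CUDA}}} = \frac{9.17}{9.8} \approx 0.94,
\qquad
S_{\text{DamBreak}} = \frac{33}{0.98} \approx 33.7
$$

Under the same assumption, the DEM run on CUDA processes about $127436 \times 1800 / 9.8 \approx 2.3 \times 10^{7}$ particle-steps per second.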

Thank you so much. I will provide any additional information regarding this.

@streeve
Member

streeve commented Apr 25, 2024

100k particles is often the break-even point for performance comparisons between a CPU node and a single GPU. It's generally (but not necessarily) not enough work to fully utilize even a single GPU, let alone two.

You can certainly still get performance improvements, particularly from avoiding memory allocation (as in the previous issue), communication, etc. The important question is: what is the timing breakdown on the CPU and on the GPU? It's very likely the code spends very different amounts of time in different sections on each kind of hardware.
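
One way to get such a timing breakdown is to accumulate wall-clock time per section of the time step and compare the totals between the OpenMP and CUDA builds. The following is a minimal sketch using only the C++ standard library. `RegionTimer`, `busy_work`, `profile_steps`, and the region names `"neighbors"` and `"forces"` are hypothetical and stand in for the real sections of the solver; none of them are part of Cabana or Kokkos.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <map>
#include <string>

// Hypothetical helper (not part of Cabana or Kokkos): accumulates
// wall-clock seconds per named region of a time step.
class RegionTimer
{
  public:
    using Clock = std::chrono::steady_clock;

    void start( const std::string& name ) { _begin[name] = Clock::now(); }

    // In a Kokkos/Cabana code, call Kokkos::fence() before stop():
    // kernel launches can be asynchronous, so without a fence the time
    // may be attributed to a later region that happens to synchronize.
    void stop( const std::string& name )
    {
        auto it = _begin.find( name );
        assert( it != _begin.end() && "stop() called without start()" );
        std::chrono::duration<double> dt = Clock::now() - it->second;
        _total[name] += dt.count();
        _begin.erase( it );
    }

    // Accumulated seconds for a region; 0.0 if it was never timed.
    double seconds( const std::string& name ) const
    {
        auto it = _total.find( name );
        return it == _total.end() ? 0.0 : it->second;
    }

    void print() const
    {
        for ( const auto& entry : _total )
            std::printf( "%-12s %10.6f s\n", entry.first.c_str(),
                         entry.second );
    }

  private:
    std::map<std::string, Clock::time_point> _begin;
    std::map<std::string, double> _total;
};

// Stand-in workload so the sketch runs on its own.
inline double busy_work( int n )
{
    volatile double sum = 0.0;
    for ( int i = 0; i < n; ++i )
        sum = sum + 1.0 / ( i + 1.0 );
    return sum;
}

// Times two assumed regions over a number of steps.
inline RegionTimer profile_steps( int num_steps )
{
    RegionTimer timer;
    for ( int step = 0; step < num_steps; ++step )
    {
        timer.start( "neighbors" );
        busy_work( 20000 );
        timer.stop( "neighbors" );

        timer.start( "forces" );
        busy_work( 50000 );
        timer.stop( "forces" );
    }
    return timer;
}
```

Calling `profile_steps( 100 ).print()` lists the accumulated time per region. In a real solver the `busy_work` calls would be replaced by the actual neighbor-list build, force computation, integration, and output steps, and the same breakdown would be collected once per backend to see which section dominates on each.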

@dineshadepu
Contributor Author

dineshadepu commented Apr 30, 2024

Thanks for the input, Sam. I will leave this open and continue with the rest of the code development. I will post an update once I have finished profiling the code on both architectures.

@streeve streeve changed the title Code takes same time to run on both GPU and on parallel CPU cores (No performance increase). Cabana-based code takes same time to run on both GPU and on parallel CPU cores Apr 30, 2024
@streeve streeve added the question Further information is requested label Apr 30, 2024