Cabana-based code takes same time to run on both GPU and on parallel CPU cores #749

dineshadepu · 2024-04-24T19:20:21Z

Hi all,

Similar to #748, this is also a question.

I have an HPC system with the following configuration:

AMD Ryzen Threadripper PRO 5975WX 32-Cores and two NVIDIA RTX A5500 GPUs.

In issue #748 I mentioned that I am working on an SPH-DEM solver and have so far implemented the SPH solver (which still has bugs) and the DEM solver (done) independently; they are not coupled yet. I ran settling_of_bodies_in_tank.cpp on both parallel CPU cores and on the GPU and compared the run times.

I used the following command to run:

time ./examples/03RBBodiesSettling 0.1 1.0 1.0 1000 0.1 200

This run uses 1000 rigid bodies and a total simulated time of $0.1$ seconds, for a total of $1800$ steps. The 1000 rigid bodies are made up of $127436$ particles.

The total time taken is:

| OpenMP | CUDA |
| --- | --- |
| 9.17 seconds | 9.8 seconds |

However, the ExaMPM code developed by the Cabana developers is much faster on the GPU than on parallel CPU cores.
For comparison, I took the DamBreak example and ran it with the following command:

time ./examples/DamBreak 0.05 2 0 0.001 1.0 10 OpenMP

and for GPU

time ./examples/DamBreak 0.05 2 0 0.001 1.0 10 CUDA

I get the following run times:

| OpenMP | CUDA |
| --- | --- |
| 33 seconds | 0.98 seconds |

The CUDA run is more than 30 times faster than the CPU run (33 seconds vs. 0.98 seconds). Unfortunately, I am unable to get similar numbers for my own code.
I believe I followed the best practices, so I am not sure why I am missing out on this performance boost. Could it be because I am using two AoSoAs, or is it something else? Is there a way to debug this performance issue? I am almost done with both the SPH and DEM codes; only the coupling remains to be added. Can you please help me with this issue?
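
As a quick check on the numbers quoted above, the speedup $S = T_{\text{OpenMP}} / T_{\text{CUDA}}$ for the two cases can be worked out using only the timings reported in this post:

$$S_{\text{rigid bodies}} = \frac{9.17}{9.8} \approx 0.94, \qquad S_{\text{DamBreak}} = \frac{33}{0.98} \approx 33.7$$

So the rigid-body run is marginally slower on the GPU than on the CPU, while the ExaMPM DamBreak run is roughly 34 times faster.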

Thank you so much. I will provide any additional information regarding this.


streeve · 2024-04-25T12:42:13Z

100k particles is often the break-even point for performance comparisons between a CPU node and a single GPU. It's generally (but not necessarily) not enough work to fully utilize even a single GPU, let alone two.

You can certainly still get performance improvements, particularly from avoiding memory allocation (as in the previous issue), communication, etc. The important question is: what is the timing breakdown on the CPU and on the GPU? It's very likely that the code spends very different amounts of time in different sections on each type of hardware.
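
One way to get such a timing breakdown is to accumulate wall-clock time per section of the time-stepping loop and compare the totals between the OpenMP and CUDA builds. The following is only a minimal sketch using the C++ standard library: `SectionTimer` and the section names are hypothetical and are not part of Cabana or Kokkos. It also assumes that on a GPU backend you synchronize (for example with `Kokkos::fence()`) before calling `stop()`, because kernels launch asynchronously and the time would otherwise be charged to a later section.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <ostream>
#include <string>

// Accumulates wall-clock time per named section of a time-stepping loop.
// Hypothetical helper for illustration; not part of Cabana or Kokkos.
class SectionTimer
{
  public:
    using Clock = std::chrono::steady_clock;

    // Mark the start of a section.
    void start( const std::string& name ) { _start[name] = Clock::now(); }

    // Mark the end of a section and add the elapsed time to its total.
    // Assumption: on a GPU backend the caller has already synchronized
    // (e.g. with Kokkos::fence()) so that the kernels have finished.
    void stop( const std::string& name )
    {
        auto it = _start.find( name );
        assert( it != _start.end() && "stop() called without start()" );
        std::chrono::duration<double> dt = Clock::now() - it->second;
        _total[name] += dt.count();
        _calls[name] += 1;
        _start.erase( it );
    }

    // Total seconds spent in a section (0 if it was never timed).
    double seconds( const std::string& name ) const
    {
        auto it = _total.find( name );
        return it == _total.end() ? 0.0 : it->second;
    }

    // Number of completed start/stop pairs for a section.
    int calls( const std::string& name ) const
    {
        auto it = _calls.find( name );
        return it == _calls.end() ? 0 : it->second;
    }

    // Print one line per section: name, total seconds, number of calls.
    void report( std::ostream& os ) const
    {
        for ( const auto& entry : _total )
            os << entry.first << ": " << entry.second << " s in "
               << calls( entry.first ) << " calls\n";
    }

  private:
    std::map<std::string, Clock::time_point> _start;
    std::map<std::string, double> _total;
    std::map<std::string, int> _calls;
};
```

In a solver loop, each phase (for instance neighbor search, force computation, integration, and output; these names are placeholders) would be wrapped in `start`/`stop` calls, and `report( std::cout )` would be called once at the end of the run on each backend.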

dineshadepu · 2024-04-30T10:36:25Z

Thanks for the input, Sam. I will leave this open and continue with the rest of the code development. I will post an update here once I have finished profiling the code on both architectures.

streeve changed the title ~~Code takes same time to run on both GPU and on parallel CPU cores (No performance increase).~~ Cabana-based code takes same time to run on both GPU and on parallel CPU cores Apr 30, 2024

streeve added the question Further information is requested label Apr 30, 2024
