
Using Specfem with heterogeneous machines #700

Open · kpouget opened this issue Aug 5, 2020 · 4 comments

@kpouget (Contributor) commented Aug 5, 2020

Hello Specfem developers,

I am playing with Specfem3D_Globe and getting it to run on OpenShift Kubernetes (see this video for a first demo/illustration).

I have 2 questions related to Specfem3D execution:

  1. Is there, in the code, any optimization tied to the CPU micro-architecture? I'm not very familiar with such optimizations, but I understand that at compile time you can optimize the binary for the instruction set of one specific CPU or another. We would like to run benchmarks on a cluster with multiple micro-architectures and select the right binary (container) at launch time (see the sketch after this list).
  2. When running Specfem with MPI, can we mix GPU=Cuda|OpenCL with GPU=No?
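
For point 1, here is a sketch of how we imagine the launch-time selection, assuming node-feature-discovery is deployed so that nodes expose CPU-feature labels; the deployment name and the per-architecture image split are hypothetical:

```bash
# Pin the AVX-512 build of the solver image to nodes that actually expose
# AVX-512, using the cpu-cpuid labels published by node-feature-discovery.
# The deployment name "specfem-solver-avx512" is purely illustrative.
oc patch deployment specfem-solver-avx512 --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"feature.node.kubernetes.io/cpu-cpuid.AVX512F":"true"}}}}}'
```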

thanks,

Kevin

@danielpeter (Contributor)

hi Kevin, great to see it set up through Kubernetes, I will need to try that out soon :)

for 1, there are no CPU-specific instructions (e.g., hand-written intrinsics) in the code. tailoring to a specific CPU architecture happens through the compiler and the corresponding flags, e.g., when running the ./configure script with specific --host or --target options.
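
as a sketch, one could build one binary per target micro-architecture roughly like this (GNU compilers assumed; FLAGS_CHECK is the flag variable used by the SPECFEM build scripts, check flags.guess or ./configure --help for your version):

```bash
# portable baseline binary, runs on any x86-64 node:
./configure FC=gfortran MPIFC=mpif90 FLAGS_CHECK="-O3 -march=x86-64"
make clean all

# binary tuned for one specific micro-architecture, e.g. Skylake AVX-512 nodes:
./configure FC=gfortran MPIFC=mpif90 FLAGS_CHECK="-O3 -march=skylake-avx512"
make clean all
```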

for 2, setting the flag GPU_MODE = .true. or .false. runs the solver exclusively on GPUs or exclusively on CPUs; there is no setting to run a hybrid simulation with part of the MPI processes on CPU and the others on GPU. for the globe version this would be a bit tricky, since one would need to change the partition sizes to balance the load, so it becomes a meshing challenge with the cubed-sphere mesher. it could however be done in the SPECFEM3D_Cartesian version by modifying the code a bit. I did this once with the Cartesian version, but found little gain in adding CPU processes to the GPU ones: the GPUs were taking on most of the work, so the time-to-solution was determined by how fast the GPUs were, and adding CPU workers helped only a little.
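
for reference, the corresponding setting in DATA/Par_file; note it is a single global switch, there is no per-rank override:

```
# DATA/Par_file: all MPI processes run either on GPUs or on CPUs
GPU_MODE                        = .true.
```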

best wishes,
daniel

@kpouget (Contributor, Author) commented Aug 26, 2020

Hello Daniel,

> for 1, there are no CPU-specific instructions (e.g., hand-written intrinsics) in the code. tailoring to a specific CPU architecture happens through the compiler and the corresponding flags, e.g., when running the ./configure script with specific --host or --target options.

ok, I see. Do you happen to know if Specfem is sensitive to such CPU variations?

> for 2, setting the flag GPU_MODE = .true. or .false. runs the solver exclusively on GPUs or exclusively on CPUs; there is no setting to run a hybrid simulation with part of the MPI processes on CPU and the others on GPU. […]

ok, makes sense, thanks

we're currently benchmarking Specfem with classic bare-metal runs, varying mainly NEX_XI/NEX_ETA (16/32/64/128) and NPROC_XI/NPROC_ETA (running on 1/4/8/16 machines) with the default DATA problem; I wonder whether other examples would be interesting for the benchmark (I would like each execution to take between 15 and 45 min, 1h30 max).

with MPI_NPROC=16 | MPI_SLOTS=4 | OMP_THREADS=2 | NEX=128 (=4 x 8-core machines) this took 1h33min.
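
for reference, a hypothetical launch line matching that run (Open MPI syntax assumed; the binary path follows the default SPECFEM3D_GLOBE layout, and the hostfile is illustrative):

```bash
# 16 MPI ranks over 4 machines, 4 ranks per node, 2 OpenMP threads per rank
export OMP_NUM_THREADS=2
mpirun -np 16 --map-by ppr:4:node --hostfile hosts.txt ./bin/xspecfem3D
```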

@kpouget (Contributor, Author) commented Nov 13, 2020

Hello Daniel,

FYI we published two blog posts about Specfem on OpenShift (kubernetes):

https://www.openshift.com/blog/a-complete-guide-for-running-specfem-scientific-hpc-workload-on-red-hat-openshift
https://www.openshift.com/blog/demonstrating-performance-capabilities-of-red-hat-openshift-for-running-scientific-hpc-workloads

it's not about "heterogeneous machines" (the title of this issue), but it is still the continuation of what I mentioned above.
and no GPU at this stage, but I'm currently working on that, for testing purposes.

@danielpeter (Contributor)

hi Kevin,

thanks for posting! let me add a corresponding entry in the manual.

it looks like SPECFEM - even more so than GROMACS - shows very good performance results on such an OpenShift platform, with scaling at almost the same performance level as the bare-metal runs. this is probably due to the local communication pattern and the overlapping of computation and communication, which help mitigate the overhead of OpenShift's network layer.

anyway, i plan to see if we could drop the static compilation requirements of the package. your OpenShift setup might then become simpler as well - will let you know if this becomes an option.

many thanks,
daniel
