
[BUG] Strange communication lapse while scaling PPPM to extreme scale #4133

Open
hagertnl opened this issue Apr 11, 2024 · 53 comments

@hagertnl
Contributor

hagertnl commented Apr 11, 2024

Hi all

I have been scaling the PPPM long-range solver out to Frontier scale with an SPC/E water box (targeting 9261 nodes), using the Kokkos+HIP backend. I noticed that I got very good weak-scaling behavior, much better than expected, but with an odd "shoulder" in the scaling -- using 64 nodes as the base of the weak scaling, 512 nodes scaled much worse than 4096 nodes did (parallel efficiency went down, then back up). The plot is attached to this ticket. I found this strange, so I poked further.

I analyzed the bytes sent to/from each MPI rank during a 50-timestep run using an MPI tracing tool I built with LLNL's wrap utility. I simply dumped message size and destination for every MPI call. MPI_Send is the dominant MPI call that sends the majority of particle and PPPM data, so I focused on this.
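For reference, that wrapper boils down to a PMPI interposer; a minimal sketch of the idea (not the actual generated code, and only covering MPI_Send):

// Minimal PMPI interposer sketch: log the size and destination of every MPI_Send,
// then forward to the real implementation. Build as a shared library and
// LD_PRELOAD it, or link it into the executable.
#include <mpi.h>
#include <cstdio>

extern "C" int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                        int dest, int tag, MPI_Comm comm)
{
  int me, typesize;
  PMPI_Comm_rank(comm, &me);           // dest is relative to comm, so report the rank in comm
  PMPI_Type_size(datatype, &typesize);
  std::fprintf(stderr, "MPI_Send %d -> %d : %ld bytes\n",
               me, dest, (long)count * typesize);
  return PMPI_Send(buf, count, datatype, dest, tag, comm);
}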

I chose the rank with the maximum time in MPI_Send and, for each node count, computed the total number of MPI_Send calls and total bytes sent by that rank. Here is what I found:

| Node count | Max MPI_Send time | Total bytes | Total messages |
| --- | --- | --- | --- |
| 64 | 64 s | 453.1x10^9 | 41.2K |
| 512 | 115 s | 435.2x10^9 | 251.3K |
| 4096 | 89 s | 481.5x10^9 | 98.8K |
| 9261 | 188 s | 473.9x10^9 | 136.5K |

The 512-node point didn't look right, so next I looked at where the data was sent. I've attached four plots to this ticket named "r2a_n<N>_rank1_count.png". Each shows the number of MPI_Send calls made by rank 1 to each destination rank on the x-axis. From what I know of LAMMPS, there are two main reasons to communicate with MPI_Send: particle/neighbor-list exchange and PPPM FFT data exchange. I'd expect to see six "peak" ranks where particle exchange happens (one neighbor in each Cartesian direction), and then every other rank in the simulation receiving FFT data from rank 1. Looking at the plots, that holds at 64 and 512 nodes, but at 4096 nodes there is a large number of ranks that rank 1 never communicates with. Do you have any insight? I'm happy to put time towards figuring this out; I'm just not sure where to look.

I also ran this problem on Summit, but 4096 Summit nodes can only run the 512-Frontier-node problem due to HBM sizes. Summit showed similar scaling behavior up to the 512-node problem (run on 4096 nodes of Summit).

LAMMPS Version and Platform

LAMMPS commit 0849863 built for Kokkos+HIP with hipcc from ROCm 5.7.1 and cray-mpich/8.1.28; uses hipFFT.
Run on Frontier, but the behavior also seems to be trending in this direction on Summit.

Expected Behavior

I'd expect the MPI_Send message count to scale consistently with node count, not jump up sharply and then drop back down once the node count is large enough.

Actual Behavior

Rank 1 calls MPI_Send 41.2K, 251.3K, 98.8K, and 136.5K times at 64, 512, 4096, and 9261 nodes respectively. The 4096+ node points seem wrong.

Steps to Reproduce

I can provide input files as needed. Using an srun line like:

srun -N ${SLURM_NNODES} -n $(expr ${SLURM_NNODES} \* 8) -c 7 --gpus-per-task=1 --gpu-bind=closest \
    lmp -k on g 1 -sf kk -pk kokkos gpu/aware on neigh half comm device newton off neigh/transpose on sort device atom/map no \
    -in input.in -log none -v x ${x} -v y ${y} -v z ${z}

Further Information, Files, and Links

64-node:

r2a_n64_rank1_count

512-node:

r2a_n512_rank1_count

4096-node:

r2a_n4096_rank1_count

9261-node:

r2a_n9261_rank1_count

Parallel efficiency plot:

lammps_parallel_efficieny

@hagertnl hagertnl added the bug label Apr 11, 2024
@hagertnl
Contributor Author

@stanmoore1 I'm curious to get your thoughts on this, if you have a second. Since it only shows up at massive scale, I'm happy to put some time towards this; I just don't know yet where to start looking in LAMMPS.

@hagertnl
Contributor Author

Interestingly, our energy breakdowns are exactly what we expect them to be. So it's not like LAMMPS is giving horribly wrong answers -- the printed thermodynamic data is correct -- though we also don't run the system for very long.

@stanmoore1
Contributor

stanmoore1 commented Apr 11, 2024

@hagertnl, that is odd; a few questions:

  • What is the FFT grid nx*ny*nz breakdown that is produced for each case?
  • Is the shoulder repeatable across multiple subsequent runs?
  • Where is the extra time showing up in the log file timing breakdown? I.e. is there a load imbalance, like one node is running slower than others? I'd be curious to see the log files.

Here is where the FFT grid is printed out in the log file:

PPPM initialization ...
WARNING: Using 12-bit tables for long-range coulomb (../kspace.cpp:316)
  G vector (1/distance) = 0.248593
  grid = 48 60 36

@hagertnl
Contributor Author

@stanmoore1

  • FFT grid: the grid scales perfectly with problem size/node count -- 2880 x 2880 x 2880 at 64 nodes, then each dimension is multiplied by 2 when we go up to 512 nodes, and so on. So 512 is 5760 x 5760 x 5760. The G vector is constant at 0.33317903 for all sizes.
  • The shoulder is 100% repeatable. We have tried basically all possible node configurations -- specifying node layout to contain the run to the fewest number of Dragonfly groups as possible, randomizing the nodes, and running multiple jobs over several weeks.
  • The extra time shows up in the KSpace breakdown (columns: min time | avg time | max time | %varavg | %total):
    64-node: Kspace | 244.2 | 244.8 | 245.2 | 1.2 | 87.58
    512-node: Kspace | 355.08 | 356.92 | 358.19 | 2.9 | 90.56
    4096-node: Kspace | 282.12 | 284.62 | 286.73 | 4.8 | 87.74

@stanmoore1
Contributor

When you run another set of jobs, can you add kspace_modify fftbench yes to the input? It will do some additional timings at the end, see https://docs.lammps.org/kspace_modify.html, comparing the time for 3D vs 1D FFTs.
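In the input that's just one extra line; a sketch (the kspace_style accuracy below is a placeholder, not necessarily what input.in uses):

kspace_style    pppm 1.0e-4      # placeholder accuracy; keep whatever input.in already uses
kspace_modify   fftbench yes     # prints 3d vs 1d FFT timings at the end of the run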

I chose the rank with the maximum time in MPI_Send and computed the total MPI_Send's called and total number of bytes for that rank for each node count

Could you do the same thing but compare the same rank in each case, e.g. rank 0? Just want to make sure it is apples to apples; the variance is pretty low, so a little jitter could make any random rank the slowest in a given run.

@sjplimp any other ideas?

@hagertnl
Contributor Author

Will do!

Also, all the plots I showed were from rank 1, looking at the data it sent to all other ranks. The tabular data was only for the rank with the maximum MPI_Send time. I can grab the data for rank 1 to line up with the plots.

@hagertnl
Contributor Author

Results from FFT bench:

64-node:

PPPM Kokkos initialization ...
  using 12-bit tables for long-range coulomb (../kspace.cpp:342)
  G vector (1/distance) = 0.33317903
  grid = 2880 2880 2880
  ...
  using double precision hipFFT
  3d grid and FFT values/proc = 49430863 49766400
  ...
MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
...
Kspace  | 244.56     | 245.26     | 245.67     |   1.4 | 87.74
...

FFT time (% of Kspce) = 174.284 (71.06)
FFT Gflps 3d (1d only) = 6550.695 45354.741

512-node:

PPPM Kokkos initialization ...
  using 12-bit tables for long-range coulomb (../kspace.cpp:342)
  G vector (1/distance) = 0.33317903
  grid = 5760 5760 5760
  ...
  using double precision hipFFT
  3d grid and FFT values/proc = 49430863 66355200
  ...
MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
...
Kspace  | 364.03     | 366.18     | 367.6      |   2.8 | 90.75
...

FFT time (% of Kspce) = 290.224 (79.26)
FFT Gflps 3d (1d only) = 34208.806 1560932.6

4096-node:

PPPM Kokkos initialization ...
  using 12-bit tables for long-range coulomb (../kspace.cpp:342)
  G vector (1/distance) = 0.33317903
  grid = 11520 11520 11520
  ...
  using double precision hipFFT
  3d grid and FFT values/proc = 49430863 46656000
  ...
MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
...
Kspace  | 283.05     | 285.4      | 288.32     |   5.0 | 87.68
...

FFT time (% of Kspce) = 246.071 (86.22)
FFT Gflps 3d (1d only) = 348614.09 16778149

Interesting -- the FFT time dropped pretty substantially from 512 to 4096 nodes.

@stanmoore1
Contributor

@hagertnl look at this line:

64 nodes: 3d grid and FFT values/proc = 49430863 49766400
512 nodes: 3d grid and FFT values/proc = 49430863 66355200
4096 nodes: 3d grid and FFT values/proc = 49430863 46656000

Here is the code:

    mesg += fmt::format("  3d grid and FFT values/proc = {} {}\n",
                       ngrid_max,nfft_both_max);

  // nfft_brick = FFT points in 3d brick-decomposition on this proc
  //              same as count of owned grid cells
  // nfft = FFT points in x-pencil FFT decomposition on this proc
  // nfft_both = greater of nfft and nfft_brick

@hagertnl
Contributor Author

Well, that makes the FFT time drop much less interesting :) In a weak-scaling run, should the FFT values per proc stay constant per rank? I would think so, since the FFT grid weak-scales with the system size, but I don't know much about the inner workings of the 3D FFTs.

@stanmoore1
Contributor

I think it should be constant for weak scaling, or at the very least increase monotonically with node count. I will have to dig through the code and see what is going on.

@akohlmey
Member

I think it should be constant for weak scaling, or at the very least increase monotonically with node count. I will have to dig through the code and see what is going on.

I would suspect the list of prime factors available for the FFT plan. The planner will choose more grid points when it cannot decompose the optimal number, and which factors are available depends on the FFT interface.
I've seen similar things happen where FFTW on IBM POWER CPUs outperformed the native IBM FFTs, because FFTW supports prime factors up to 13 (?) while the IBM FFTs were power-of-two only. So in some cases FFTW was using a much smaller grid and thus was faster for the operation, even though the IBM FFT was significantly faster on compatible grids.

@stanmoore1
Contributor

Interesting. Since this is a per-proc "max" value, it looks like load imbalance, i.e. one proc had more FFT work than the others, which increases the FFT time. I wonder why the code can't keep the same decomposition when weak scaling -- the total FFT grid is exactly 8x larger for 512 vs 64 nodes.

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@stanmoore1
Contributor

8 MPI tasks per node

@hagertnl
Contributor Author

I also use the processors command to set a 2x2x2 local grid for the 8 local ranks. We're using a cubic system and always use perfect-cube number of nodes.
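(Something along these lines -- the exact keyword values here are from memory, so double-check against https://docs.lammps.org/processors.html:)

processors * * * grid twolevel 8 2 2 2   # 8 ranks per node arranged as a 2x2x2 block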

@stanmoore1
Contributor

64*8 = 512 ranks --> 16x32
512*8 = 4096 ranks --> 64x64
4096*8 = 32768 ranks --> 128x256

Seems odd that 512 nodes has the best 2D decomposition but is also the slowest?
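For reference, roughly the kind of near-square split those numbers come from (a standalone sketch, not the actual procs2grid2d code):

// Sketch: pick npey*npez = nprocs so the per-proc pencil cross-section
// (ny/npey x nz/npez) is as close to square as possible. Not LAMMPS code,
// but it reproduces the 16x32, 64x64, and 128x256 splits above.
#include <cmath>
#include <cstdio>

static void factor2d(int nprocs, int ny, int nz, int &npey, int &npez)
{
  double best = 1.0e30;
  for (int py = 1; py <= nprocs; py++) {
    if (nprocs % py) continue;
    int pz = nprocs / py;
    if (py > ny || pz > nz) continue;                     // no empty pencils
    double skew = std::fabs(double(ny)/py - double(nz)/pz);
    if (skew < best) { best = skew; npey = py; npez = pz; }
  }
}

int main()
{
  int py, pz;
  factor2d(512,   2880,  2880,  py, pz); std::printf("512 ranks   -> %dx%d\n", py, pz);
  factor2d(4096,  5760,  5760,  py, pz); std::printf("4096 ranks  -> %dx%d\n", py, pz);
  factor2d(32768, 11520, 11520, py, pz); std::printf("32768 ranks -> %dx%d\n", py, pz);
  return 0;
}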

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@stanmoore1
Contributor

64 nodes: grid = 2880 2880 2880
512 nodes: grid = 5760 5760 5760
4096 nodes: grid = 11520 11520 11520

@stanmoore1
Contributor

2880^3/512 = 46656000
5760^3/4096 = 46656000
11520^3/32768 = 46656000

However only 4096 nodes gets the ideal perfectly balanced case where every rank has 46656000 FFT data points:

64 nodes: 3d grid and FFT values/proc = 49430863 49766400
512 nodes: 3d grid and FFT values/proc = 49430863 66355200
4096 nodes: 3d grid and FFT values/proc = 49430863 46656000

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@stanmoore1
Contributor

I'm pulling the PPPM code into a standalone reproducer which will be easy to debug. This is Kokkos PPPM, I believe using the default order = 5.

@stanmoore1
Contributor

stanmoore1 commented Apr 16, 2024

I can reproduce the issue in my standalone. Here are the values for the first 11 ranks on 512 nodes = 4096 ranks:

printf("%i: %i %i %i %i %i %i --> %i\n",me,nxlo_fft,nxhi_fft,nylo_fft,nyhi_fft,nzlo_fft,nzlo_fft,nfft);

0: 0 5759 0 5759 0 0 --> 33177600
1: 0 5759 0 5759 1 1 --> 33177600
2: 0 5759 0 5759 2 3 --> 66355200
3: 0 5759 0 5759 4 4 --> 33177600
4: 0 5759 0 5759 5 6 --> 66355200
5: 0 5759 0 5759 7 7 --> 33177600
6: 0 5759 0 5759 8 8 --> 33177600
7: 0 5759 0 5759 9 10 --> 66355200
8: 0 5759 0 5759 11 11 --> 33177600
9: 0 5759 0 5759 12 13 --> 66355200
10: 0 5759 0 5759 14 14 --> 33177600

@stanmoore1
Contributor

OK, here is the issue. When nprocs is small enough (nz_pppm >= nprocs), each proc owns one or more entire xy planes; otherwise each proc owns a 2D sub-block of the yz plane. For 64 and 512 nodes the first condition is true, but for 4096 nodes the latter is, which is better load balanced.

  if (nz_pppm >= nprocs) {
    npey_fft = 1;
    npez_fft = nprocs;
  } else procs2grid2d(nprocs,ny_pppm,nz_pppm,&npey_fft,&npez_fft);
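A quick arithmetic check of that first branch for the 512-node case (4096 ranks, nz_pppm = 5760), assuming the usual integer split of planes over ranks:

// With whole xy planes per proc, 5760 planes over 4096 ranks means some ranks
// get 1 plane (5760*5760 = 33177600 points) and some get 2 (66355200 points),
// i.e. the biggest rank owns ~1.4x the balanced 46656000 and 2x the smallest.
#include <cstdio>

int main()
{
  const long nprocs = 4096, nx = 5760, ny = 5760, nz = 5760;
  long one = 0, two = 0;
  for (long me = 0; me < nprocs; me++) {
    long zlo = me * nz / nprocs;            // integer split of z planes over ranks
    long zhi = (me + 1) * nz / nprocs - 1;
    if (zhi - zlo + 1 == 1) one++; else two++;
  }
  std::printf("1-plane ranks: %ld (%ld pts), 2-plane ranks: %ld (%ld pts)\n",
              one, nx * ny, two, 2 * nx * ny);   // 2432 and 1664 respectively
  return 0;
}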

@stanmoore1
Contributor

stanmoore1 commented Apr 16, 2024

64 nodes:

0: 0 2879 0 2879 0 4 --> 41472000
1: 0 2879 0 2879 5 10 --> 49766400
2: 0 2879 0 2879 11 15 --> 41472000
3: 0 2879 0 2879 16 21 --> 49766400
4: 0 2879 0 2879 22 27 --> 49766400
5: 0 2879 0 2879 28 32 --> 41472000
6: 0 2879 0 2879 33 38 --> 49766400
7: 0 2879 0 2879 39 44 --> 49766400
8: 0 2879 0 2879 45 49 --> 41472000
9: 0 2879 0 2879 50 55 --> 49766400
10: 0 2879 0 2879 56 60 --> 41472000

512 nodes:

0: 0 5759 0 5759 0 0 --> 33177600
1: 0 5759 0 5759 1 1 --> 33177600
2: 0 5759 0 5759 2 3 --> 66355200
3: 0 5759 0 5759 4 4 --> 33177600
4: 0 5759 0 5759 5 6 --> 66355200
5: 0 5759 0 5759 7 7 --> 33177600
6: 0 5759 0 5759 8 8 --> 33177600
7: 0 5759 0 5759 9 10 --> 66355200
8: 0 5759 0 5759 11 11 --> 33177600
9: 0 5759 0 5759 12 13 --> 66355200
10: 0 5759 0 5759 14 14 --> 33177600

4096 nodes:

0: 0 11519 0 89 0 44 --> 46656000
1: 0 11519 90 179 0 44 --> 46656000
2: 0 11519 180 269 0 44 --> 46656000
3: 0 11519 270 359 0 44 --> 46656000
4: 0 11519 360 449 0 44 --> 46656000
5: 0 11519 450 539 0 44 --> 46656000
6: 0 11519 540 629 0 44 --> 46656000
7: 0 11519 630 719 0 44 --> 46656000
8: 0 11519 720 809 0 44 --> 46656000
9: 0 11519 810 899 0 44 --> 46656000
10: 0 11519 900 989 0 44 --> 46656000

@stanmoore1
Contributor

If I force it to always use procs2grid2d() then I get the expected behavior for 512 nodes:

0: 0 5759 0 89 0 89 --> 46656000
1: 0 5759 90 179 0 89 --> 46656000
2: 0 5759 180 269 0 89 --> 46656000
3: 0 5759 270 359 0 89 --> 46656000
4: 0 5759 360 449 0 89 --> 46656000
5: 0 5759 450 539 0 89 --> 46656000
6: 0 5759 540 629 0 89 --> 46656000
7: 0 5759 630 719 0 89 --> 46656000
8: 0 5759 720 809 0 89 --> 46656000
9: 0 5759 810 899 0 89 --> 46656000
10: 0 5759 900 989 0 89 --> 46656000

I'm not sure what the performance implications are, though. In Nick's case, perfect load balance appears to be much more important than each proc owning an entire xy plane.

@stanmoore1
Contributor

@hagertnl this patch should give the expected behavior for 512 nodes. We need to decide if we want to make this change in LAMMPS or not.

diff --git a/src/KSPACE/pppm.cpp b/src/KSPACE/pppm.cpp
index 4fe5075..06cbf11 100644
--- a/src/KSPACE/pppm.cpp
+++ b/src/KSPACE/pppm.cpp
@@ -1389,10 +1389,7 @@ void PPPM::set_grid_local()
   //   of the global FFT mesh that I own in x-pencil decomposition
 
   int npey_fft,npez_fft;
-  if (nz_pppm >= nprocs) {
-    npey_fft = 1;
-    npez_fft = nprocs;
-  } else procs2grid2d(nprocs,ny_pppm,nz_pppm,&npey_fft,&npez_fft);
+  procs2grid2d(nprocs,ny_pppm,nz_pppm,&npey_fft,&npez_fft);
 
   int me_y = me % npey_fft;
   int me_z = me / npey_fft;

@stanmoore1 stanmoore1 self-assigned this Apr 16, 2024
@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@stanmoore1
Contributor

@sjplimp sorry I had the data mixed up (now corrected above), it should have been:

512 nodes:

0: 0 5759 0 5759 0 0 --> 33177600
1: 0 5759 0 5759 1 1 --> 33177600
2: 0 5759 0 5759 2 3 --> 66355200
3: 0 5759 0 5759 4 4 --> 33177600
4: 0 5759 0 5759 5 6 --> 66355200
5: 0 5759 0 5759 7 7 --> 33177600
6: 0 5759 0 5759 8 8 --> 33177600
7: 0 5759 0 5759 9 10 --> 66355200
8: 0 5759 0 5759 11 11 --> 33177600
9: 0 5759 0 5759 12 13 --> 66355200
10: 0 5759 0 5759 14 14 --> 33177600

But looking at KSPACE/fft3d.cpp I don't see where the 2d FFTs w/out communication are being checked for or performed.

I don't think that is supported anywhere in LAMMPS?

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@stanmoore1
Contributor

Stan - does your FFT-only reproducer actually perform FFTs?

No, it just computes the decomposition without actually performing FFTs.

Can PPPM KOKKOS wrap the FFT lib from ORNL (can't recall its name)?

Yes, hipFFT and cuFFT both support 2D FFTs; adding this could give a speedup.

@hagertnl
Contributor Author

Does this also answer why there was the breakdown in communication at 4096+ nodes?

Frontier is busy decongesting from GB runs, but I am submitting jobs to test the patch you provided. I'll run 64, 512, and 4096 node jobs and report back. Thanks!

@stanmoore1
Contributor

stanmoore1 commented Apr 17, 2024

Does this also answer why there was the breakdown in communication at 4096+ nodes?

Not sure what you mean by "breakdown in communication"? The FFTs require a many-to-many proc communication pattern, so the more nodes you have, the more costly it will be. FFTs are not going to weak-scale as well as the point-to-point communication used for the force calculation. One way around this would be to limit the number of ranks doing FFTs using run_style verlet/split (https://docs.lammps.org/run_style.html); however, this isn't yet ported to Kokkos (though I don't think it would be hard to port).
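For reference, on the CPU side verlet/split is driven roughly like this (the rank counts below are placeholders; the Rspace partition must be an integer multiple of the Kspace partition):

# hedged sketch only -- verlet/split is not yet ported to Kokkos.
# Launch with two partitions, e.g. 12 ranks for Rspace and 4 for KSpace:
#   mpirun -np 16 lmp -partition 12 4 -in input.in
run_style verlet/split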

@hagertnl
Contributor Author

By breakdown, I meant processes that don't talk to each other. At <= 512 nodes, we find that every process talks to every other process at least 50 times (once per timestep). But at 4096+ nodes, we see that there is a large number of processes that aren't sending any data to some other processes, so the "all-to-all" behavior we were initially seeing changes to many-to-many.

@stanmoore1
Contributor

By breakdown, I meant processes that don't talk to each other. At <= 512 nodes, we find that every process talks to every other process at least 50 times (once per timestep). But at 4096+ nodes, we see that there is a large number of processes that aren't sending any data to some other processes, so the "all-to-all" behavior we were initially seeing changes to many-to-many.

Ah yes, I think that patch could switch the 64- and 512-node cases to more of a many-to-many comm pattern instead of all-to-all, but @sjplimp would know better.

@sjplimp
Contributor

sjplimp commented Apr 17, 2024 via email

@stanmoore1
Contributor

@hagertnl note that all-to-all isn't always worse than many-to-many. There is actually a kspace_modify collective yes command (see https://docs.lammps.org/kspace_modify.html) which uses all-to-all:

"The collective keyword applies only to PPPM. It is set to no by default, except on IBM BlueGene machines. If this option is set to yes, LAMMPS will use MPI collective operations to remap data for 3d-FFT operations instead of the default point-to-point communication. This is faster on IBM BlueGene machines, and may also be faster on other machines if they have an efficient implementation of MPI collective operations and adequate hardware."

However this option is not yet ported to the KOKKOS package.

@stanmoore1
Contributor

But the collective option uses an actual MPI_Alltoall call, not just sends to all procs with MPI_Send/Recv, so it is a different issue from the planar vs. pencil decomposition. The planar decomposition could be better if we used 2D instead of 1D FFTs.

@hagertnl
Contributor Author

I was going to ask about that -- especially at large scale, I would expect an Alltoall to outperform the Send+Irecv implementation most days of the week. In a hackathon I helped with recently, a team saw a >20x speedup by using Alltoall instead of Send+Irecv. Granted, different buffer sizes, different science domain, etc. Given the potential benefit, I'd be very interested to give that a shot. Is there a technical issue blocking the porting to the KOKKOS package? If it's an issue of time, I'm happy to try to help where I can.

@stanmoore1
Contributor

No technical issues, only lack of time. It would be great if you could work on this. You can see the collective code in the regular (non-Kokkos) version here:

https://github.com/lammps/lammps/blob/develop/src/KSPACE/remap.cpp#L113-L203

which is missing from the Kokkos version: https://github.com/lammps/lammps/blob/develop/src/KOKKOS/remap_kokkos.cpp#L106

The Kokkos version would need to support both GPU-aware MPI on and off, which is already done for the point-to-point version.
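For orientation (a sketch, not LAMMPS code): the core of the collective path is swapping the per-pair send/recv loop for a single MPI_Alltoallv on the remap communicator, with counts and displacements taken from the remap plan; with GPU-aware MPI off, the buffers would first be staged through host copies, as the point-to-point Kokkos path already does.

// Sketch of the collective exchange; sendcnts/recvcnts are hypothetical
// per-destination element counts from the plan, and the buffers hold FFT doubles.
#include <mpi.h>
#include <vector>

void remap_collective(const double *sendbuf, double *recvbuf,
                      const std::vector<int> &sendcnts,
                      const std::vector<int> &recvcnts, MPI_Comm comm)
{
  int nprocs;
  MPI_Comm_size(comm, &nprocs);

  std::vector<int> sdispls(nprocs), rdispls(nprocs);
  for (int p = 0, s = 0, r = 0; p < nprocs; p++) {   // prefix sums -> displacements
    sdispls[p] = s; s += sendcnts[p];
    rdispls[p] = r; r += recvcnts[p];
  }

  // one collective instead of nprocs point-to-point exchanges;
  // ranks that exchange nothing simply have zero counts
  MPI_Alltoallv(sendbuf, sendcnts.data(), sdispls.data(), MPI_DOUBLE,
                recvbuf, recvcnts.data(), rdispls.data(), MPI_DOUBLE, comm);
}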

@hagertnl
Contributor Author

I'd be happy to help with that. Is there an open feature request about it already? If not, I can create one to track it, just so work doesn't get duplicated.

@stanmoore1
Contributor

stanmoore1 commented Apr 17, 2024

I just checked and there is currently no existing feature request for this, thanks.

@akohlmey
Member

akohlmey commented Apr 17, 2024

Also, another reason FFTs don't scale as well as you might expect is that their cost is O(N log N), not O(N) like the rest of the MD algorithm.

Plus, for the non-bonded interactions you have a 3d domain decomposition, with pencils you are reduced to a 2d decomposition, and with slices you are distributing data in only 1d. So you quickly run out of work units when you have lots of MPI ranks. This was one major motivation to implement OpenMP parallelism in pair styles back around 2010: that way we could keep the number of MPI ranks smaller and still get decent strong scaling. It was the major "winner" in our INCITE proposal on the Cray XT5 back then at ORNL. I've put some numbers and graphs together in a talk for a LAMMPS symposium in 2011 (see pages 14 to 19 of the attached PDF). Run style verlet/split was invented later for the same purpose. It has more potential for pushing strong scaling even farther, if you do have plenty of hardware.
lammps-overdrive.pdf.

So having a working verlet/split for KOKKOS could be the path for the best scaling.

P.S.: at the strong-scaling limit you would also see a significant performance benefit from using single-precision FFTs over double-precision FFTs (not so much for the rest). Communication is already dominant there, and this way you would cut the required bandwidth in half.

@hagertnl
Contributor Author

Submitted #4140 to track. I'll hopefully make some progress on this in the coming few weeks.

@akohlmey
Member

But the collective option is using an actual MPI_Alltoall call, not just sending to all procs with MPI Send/Recv, so it is a different issue than the planar vs pencil decomposition. Planar decomposition could be better if we used 2D instead of 1D FFTs.

Isn't HeFFTe supposed to have some optimizations for that? But that is not yet KOKKOS compatible, right?

@stanmoore1
Contributor

stanmoore1 commented Apr 17, 2024

With the KOKKOS package you can also run KSpace and the Bond, Angle, etc. calculations on the host CPU, even threaded with OpenMP, while the Pair force calculation runs on the GPUs. If you don't use a barostat (e.g. fix npt), then the CPU and GPU time will overlap. But whether this would give a speedup on Frontier over running everything on the GPUs, I'm not sure.

Like Axel said, using single precision for FFTs could also help, and that should already work with the KOKKOS package; you just need to compile with a flag, see https://docs.lammps.org/Build_settings.html.
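(For reference, assuming a CMake build, the setting on that page is FFT_SINGLE; other options elided:)

cmake ../cmake -D FFT_SINGLE=yes   # plus the existing Kokkos/HIP options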

Isn't HeFFTe supposed to have some optimizations for that? But that is not yet KOKKOS compatible, right?

Probably (not sure), but HeFFTe is not yet KOKKOS compatible.

@hagertnl
Contributor Author

I don't have the 4096-node job back yet, but here's some data for the patch @stanmoore1 provided. We do see a good (20+%) speed-up of the 512-node case, and can confirm it's using a much better-balanced FFT decomposition.

64-node:

New

3d grid and FFT values/proc = 49430863 46656000
Performance: 0.015 ns/day, 1622.397 hours/ns, 0.342 timesteps/s, 1.103 Gatom-step/s
FFT time (% of Kspce) = 179.062 (69.84)
FFT Gflps 3d (1d only) = 6375.8953 46359.128

Old

3d grid and FFT values/proc = 49430863 49766400
Performance: 0.015 ns/day, 1552.893 hours/ns, 0.358 timesteps/s, 1.152 Gatom-step/s
FFT time (% of Kspce) = 174.284 (71.06)
FFT Gflps 3d (1d only) = 6550.695 45354.741

512-node:

New

3d grid and FFT values/proc = 49430863 46656000
Performance: 0.013 ns/day, 1821.530 hours/ns, 0.305 timesteps/s, 7.860 Gatom-step/s
FFT time (% of Kspce) = 209.756 (72.45)
FFT Gflps 3d (1d only) = 47332.296 1775078.1

Old

3d grid and FFT values/proc = 49430863 66355200
Performance: 0.011 ns/day, 2241.749 hours/ns, 0.248 timesteps/s, 6.386 Gatom-step/s
FFT time (% of Kspce) = 290.224 (79.26)
FFT Gflps 3d (1d only) = 34208.806 1560932.6

@stanmoore1
Contributor

stanmoore1 commented Apr 18, 2024

@hagertnl looks like the 64 node case was a little slower while the 512 node case was faster. The performance of the 4096 node case and higher node counts should be unchanged with the patch.

@hagertnl
Contributor Author

There were some network timeouts reported by the 64-node job, so I can't guarantee that the code is at fault for that timing. It was close enough to not fret over for the moment. Thanks!
