
[BUG] Strange communication lapse while scaling PPPM to extreme scale #4133

Open
hagertnl opened this issue Apr 11, 2024 · 53 comments

@hagertnl
Contributor

hagertnl commented Apr 11, 2024

Hi all

I have been scaling the PPPM long-range solver out to Frontier scale with an SPC/E water box (targeting 9261 nodes), using the Kokkos+HIP backend. I noticed that I got very good weak-scaling behavior, much better than expected, but with an odd "shoulder" in the scaling -- using 64 nodes as the base of the weak scaling, 512 nodes scaled much worse than 4096 nodes did (parallel efficiency went down, then back up). The plot is attached to this ticket. I found this strange, so I poked further.

I analyzed the bytes sent to/from each MPI rank during a 50-timestep run using an MPI tracing tool I built with LLNL's wrap utility. I simply dumped message size and destination for every MPI call. MPI_Send is the dominant MPI call that sends the majority of particle and PPPM data, so I focused on this.
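For reference, that wrapper boils down to a PMPI interposer; a minimal sketch of the idea (not the actual generated code, and only covering MPI_Send):

// Minimal PMPI interposer sketch: log the size and destination of every MPI_Send,
// then forward to the real implementation. Build as a shared library and
// LD_PRELOAD it, or link it into the executable.
#include <mpi.h>
#include <cstdio>

extern "C" int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                        int dest, int tag, MPI_Comm comm)
{
  int me, typesize;
  PMPI_Comm_rank(comm, &me);           // dest is relative to comm, so report the rank in comm
  PMPI_Type_size(datatype, &typesize);
  std::fprintf(stderr, "MPI_Send %d -> %d : %ld bytes\n",
               me, dest, (long)count * typesize);
  return PMPI_Send(buf, count, datatype, dest, tag, comm);
}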

I chose the rank with the maximum time in MPI_Send and, for each node count, computed the total number of MPI_Send calls and total bytes sent by that rank. Here is what I found:

| Node count | Max MPI_Send time | Total bytes | Total messages |
| --- | --- | --- | --- |
| 64 | 64 s | 453.1x10^9 | 41.2K |
| 512 | 115 s | 435.2x10^9 | 251.3K |
| 4096 | 89 s | 481.5x10^9 | 98.8K |
| 9261 | 188 s | 473.9x10^9 | 136.5K |

The 512-node point didn't look right, so next I looked at where the data was sent. I've attached four plots to this ticket named "r2a_n<N>_rank1_count.png". Each shows the number of MPI_Send calls made by rank 1 to each destination rank on the x-axis. From what I know of LAMMPS, there are two main reasons to communicate with MPI_Send: particle/neighbor-list exchange and PPPM FFT data exchange. I'd expect to see six "peak" ranks where particle exchange happens (one neighbor in each Cartesian direction), and then every other rank in the simulation receiving FFT data from rank 1. Looking at the plots, that holds at 64 and 512 nodes, but at 4096 nodes there is a large number of ranks that rank 1 never communicates with. Do you have any insight? I'm happy to put time towards figuring this out; I'm just not sure where to look.

I also ran this problem on Summit, but 4096 Summit nodes can only run the 512-Frontier-node problem due to HBM sizes. Summit showed similar scaling behavior up to the 512-node problem (run on 4096 nodes of Summit).

LAMMPS Version and Platform

LAMMPS commit 0849863 built for Kokkos+HIP with hipcc from ROCm 5.7.1 and cray-mpich/8.1.28; uses hipFFT.
Run on Frontier, but the behavior also seems to be trending in this direction on Summit.

Expected Behavior

I'd expect the MPI_Send message count to scale consistently with node count, not jump up sharply and then drop back down once the node count is large enough.

Actual Behavior

Rank 1 calls MPI_Send 41.2K, 251.3K, 98.8K, and 136.5K times at 64, 512, 4096, and 9261 nodes respectively. The 4096+ node points seem wrong.

Steps to Reproduce

I can provide input files as needed. Using an srun line like:

srun -N ${SLURM_NNODES} -n $(expr ${SLURM_NNODES} \* 8) -c 7 --gpus-per-task=1 --gpu-bind=closest \
    lmp -k on g 1 -sf kk -pk kokkos gpu/aware on neigh half comm device newton off neigh/transpose on sort device atom/map no \
    -in input.in -log none -v x ${x} -v y ${y} -v z ${z}

Further Information, Files, and Links

64-node:

r2a_n64_rank1_count

512-node:

r2a_n512_rank1_count

4096-node:

r2a_n4096_rank1_count

9261-node:

r2a_n9261_rank1_count

Parallel efficiency plot:

lammps_parallel_efficieny

@hagertnl hagertnl added the bug label Apr 11, 2024
@hagertnl
Contributor Author

@stanmoore1 I'm curious to get your thoughts on this, if you have a second. Since it only shows up at massive scale, I'm happy to put some time towards this; I just don't know yet where to start looking in LAMMPS.

@hagertnl
Contributor Author

Interestingly, our energy breakdowns are exactly what we expect them to be. So it's not like LAMMPS is giving horribly wrong answers -- the printed thermodynamic data is correct -- though we also don't run the system for very long.

@stanmoore1
Contributor

stanmoore1 commented Apr 11, 2024

@hagertnl, that is odd; a few questions:

  • What is the FFT grid nx*ny*nz breakdown that is produced for each case?
  • Is the shoulder repeatable across multiple subsequent runs?
  • Where is the extra time showing up in the log file timing breakdown? I.e. is there a load imbalance, like one node is running slower than others? I'd be curious to see the log files.

Here is where the FFT grid is printed out in the log file:

PPPM initialization ...
WARNING: Using 12-bit tables for long-range coulomb (../kspace.cpp:316)
  G vector (1/distance) = 0.248593
  grid = 48 60 36

@hagertnl
Contributor Author

@stanmoore1

  • FFT grid: the grid scales perfectly with problem size/node count -- 2880 x 2880 x 2880 at 64 nodes, then each dimension is multiplied by 2 when we go up to 512 nodes, and so on. So 512 is 5760 x 5760 x 5760. The G vector is constant at 0.33317903 for all sizes.
  • The shoulder is 100% repeatable. We have tried basically all possible node configurations -- specifying node layout to contain the run to the fewest number of Dragonfly groups as possible, randomizing the nodes, and running multiple jobs over several weeks.
  • The extra time shows up in the KSpace breakdown (columns: min time | avg time | max time | %varavg | %total):
    64-node: Kspace | 244.2 | 244.8 | 245.2 | 1.2 | 87.58
    512-node: Kspace | 355.08 | 356.92 | 358.19 | 2.9 | 90.56
    4096-node: Kspace | 282.12 | 284.62 | 286.73 | 4.8 | 87.74

@stanmoore1
Contributor

When you run another set of jobs, can you add kspace_modify fftbench yes to the input? It will do some additional timings at the end, see https://docs.lammps.org/kspace_modify.html, comparing the time for 3D vs 1D FFTs.
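In the input that's just one extra line; a sketch (the kspace_style accuracy below is a placeholder, not necessarily what input.in uses):

kspace_style    pppm 1.0e-4      # placeholder accuracy; keep whatever input.in already uses
kspace_modify   fftbench yes     # prints 3d vs 1d FFT timings at the end of the run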

I chose the rank with the maximum time in MPI_Send and computed the total MPI_Send's called and total number of bytes for that rank for each node count

Could you do the same thing but compare the same rank in each case, e.g. rank 0? Just want to make sure it is apples to apples; the variance is pretty low, so a little jitter could make any random rank the slowest in a given run.

@sjplimp any other ideas?

@hagertnl
Contributor Author

Will do!

Also, all the plots I showed were from rank 1, looking at the data it sent to all other ranks. The tabular data was only for the rank with the maximum MPI_Send time. I can grab the data for rank 1 to line up with the plots.

@hagertnl
Contributor Author

Results from FFT bench:

64-node:

PPPM Kokkos initialization ...
  using 12-bit tables for long-range coulomb (../kspace.cpp:342)
  G vector (1/distance) = 0.33317903
  grid = 2880 2880 2880
  ...
  using double precision hipFFT
  3d grid and FFT values/proc = 49430863 49766400
  ...
MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
...
Kspace  | 244.56     | 245.26     | 245.67     |   1.4 | 87.74
...

FFT time (% of Kspce) = 174.284 (71.06)
FFT Gflps 3d (1d only) = 6550.695 45354.741

512-node:

PPPM Kokkos initialization ...
  using 12-bit tables for long-range coulomb (../kspace.cpp:342)
  G vector (1/distance) = 0.33317903
  grid = 5760 5760 5760
  ...
  using double precision hipFFT
  3d grid and FFT values/proc = 49430863 66355200
  ...
MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
...
Kspace  | 364.03     | 366.18     | 367.6      |   2.8 | 90.75
...

FFT time (% of Kspce) = 290.224 (79.26)
FFT Gflps 3d (1d only) = 34208.806 1560932.6

4096-node:

PPPM Kokkos initialization ...
  using 12-bit tables for long-range coulomb (../kspace.cpp:342)
  G vector (1/distance) = 0.33317903
  grid = 11520 11520 11520
  ...
  using double precision hipFFT
  3d grid and FFT values/proc = 49430863 46656000
  ...
MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
...
Kspace  | 283.05     | 285.4      | 288.32     |   5.0 | 87.68
...

FFT time (% of Kspce) = 246.071 (86.22)
FFT Gflps 3d (1d only) = 348614.09 16778149

Interesting -- the FFT time dropped pretty substantially from 512 to 4096 nodes.

@stanmoore1
Contributor

@hagertnl look at this line:

64 nodes: 3d grid and FFT values/proc = 49430863 49766400
512 nodes: 3d grid and FFT values/proc = 49430863 66355200
4096 nodes: 3d grid and FFT values/proc = 49430863 46656000

Here is the code:

    mesg += fmt::format("  3d grid and FFT values/proc = {} {}\n",
                       ngrid_max,nfft_both_max);

  // nfft_brick = FFT points in 3d brick-decomposition on this proc
  //              same as count of owned grid cells
  // nfft = FFT points in x-pencil FFT decomposition on this proc
  // nfft_both = greater of nfft and nfft_brick

@hagertnl
Contributor Author

Well, that makes the FFT time drop much less interesting :) In a weak-scaling run, should the FFT values per proc stay constant per rank? I would think so, since the FFT grid weak-scales with the system size, but I don't know much about the inner workings of the 3D FFTs.

@stanmoore1
Contributor

I think it should be constant for weak scaling, or at the very least increase monotonically with node count. I will have to dig through the code and see what is going on.

@akohlmey
Member

I think it should be constant for weak scaling, or at the very least increase monotonically with node count. I will have to dig through the code and see what is going on.

I would suspect the list of prime factors available for the FFT plan. The planner will choose more grid points when it cannot decompose the optimal number, and which factors are available depends on the FFT interface.
I've seen similar things happen where FFTW on IBM POWER CPUs outperformed the native IBM FFTs, because FFTW supports prime factors up to 13 (?) while the IBM FFTs were power-of-two only. So in some cases FFTW was using a much smaller grid and thus was faster for the operation, even though the IBM FFT was significantly faster on compatible grids.

@stanmoore1
Contributor

Interesting. Since this is a per-proc "max" value, it looks like load imbalance, i.e. one proc had more FFT work than the others, which increases the FFT time. I wonder why the code can't keep the same decomposition when weak scaling -- the total FFT grid is exactly 8x larger for 512 vs 64 nodes.

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@stanmoore1
Contributor

8 MPI tasks per node

@hagertnl
Contributor Author

I also use the processors command to set a 2x2x2 local grid for the 8 local ranks. We're using a cubic system and always use perfect-cube number of nodes.
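(Something along these lines -- the exact keyword values here are from memory, so double-check against https://docs.lammps.org/processors.html:)

processors * * * grid twolevel 8 2 2 2   # 8 ranks per node arranged as a 2x2x2 block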

@stanmoore1
Contributor

64*8 = 512 ranks --> 16x32
512*8 = 4096 ranks --> 64x64
4096*8 = 32768 ranks --> 128x256

Seems odd that 512 nodes has the best 2D decomposition but is also the slowest?
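For reference, roughly the kind of near-square split those numbers come from (a standalone sketch, not the actual procs2grid2d code):

// Sketch: pick npey*npez = nprocs so the per-proc pencil cross-section
// (ny/npey x nz/npez) is as close to square as possible. Not LAMMPS code,
// but it reproduces the 16x32, 64x64, and 128x256 splits above.
#include <cmath>
#include <cstdio>

static void factor2d(int nprocs, int ny, int nz, int &npey, int &npez)
{
  double best = 1.0e30;
  for (int py = 1; py <= nprocs; py++) {
    if (nprocs % py) continue;
    int pz = nprocs / py;
    if (py > ny || pz > nz) continue;                     // no empty pencils
    double skew = std::fabs(double(ny)/py - double(nz)/pz);
    if (skew < best) { best = skew; npey = py; npez = pz; }
  }
}

int main()
{
  int py, pz;
  factor2d(512,   2880,  2880,  py, pz); std::printf("512 ranks   -> %dx%d\n", py, pz);
  factor2d(4096,  5760,  5760,  py, pz); std::printf("4096 ranks  -> %dx%d\n", py, pz);
  factor2d(32768, 11520, 11520, py, pz); std::printf("32768 ranks -> %dx%d\n", py, pz);
  return 0;
}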

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@stanmoore1
Contributor

64 nodes: grid = 2880 2880 2880
512 nodes: grid = 5760 5760 5760
4096 nodes: grid = 11520 11520 11520

@stanmoore1
Contributor

2880^3/512 = 46656000
5760^3/4096 = 46656000
11520^3/32768 = 46656000

However only 4096 nodes gets the ideal perfectly balanced case where every rank has 46656000 FFT data points:

64 nodes: 3d grid and FFT values/proc = 49430863 49766400
512 nodes: 3d grid and FFT values/proc = 49430863 66355200
4096 nodes: 3d grid and FFT values/proc = 49430863 46656000

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@stanmoore1
Contributor

I'm pulling the PPPM code into a standalone reproducer which will be easy to debug. This is Kokkos PPPM, I believe using the default order = 5.

@stanmoore1
Contributor

stanmoore1 commented Apr 16, 2024

I can reproduce the issue in my standalone. Here are the values for the first 11 ranks on 512 nodes = 4096 ranks:

printf("%i: %i %i %i %i %i %i --> %i\n",me,nxlo_fft,nxhi_fft,nylo_fft,nyhi_fft,nzlo_fft,nzlo_fft,nfft);

0: 0 5759 0 5759 0 0 --> 33177600
1: 0 5759 0 5759 1 1 --> 33177600
2: 0 5759 0 5759 2 3 --> 66355200
3: 0 5759 0 5759 4 4 --> 33177600
4: 0 5759 0 5759 5 6 --> 66355200
5: 0 5759 0 5759 7 7 --> 33177600
6: 0 5759 0 5759 8 8 --> 33177600
7: 0 5759 0 5759 9 10 --> 66355200
8: 0 5759 0 5759 11 11 --> 33177600
9: 0 5759 0 5759 12 13 --> 66355200
10: 0 5759 0 5759 14 14 --> 33177600

@stanmoore1
Contributor

OK, here is the issue. When nprocs is small enough (nz_pppm >= nprocs), each proc owns one or more entire xy planes; otherwise each proc owns a 2D sub-block of the yz plane. For 64 and 512 nodes the first condition is true, but for 4096 nodes the latter is, which is better load balanced.

  if (nz_pppm >= nprocs) {
    npey_fft = 1;
    npez_fft = nprocs;
  } else procs2grid2d(nprocs,ny_pppm,nz_pppm,&npey_fft,&npez_fft);
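A quick arithmetic check of that first branch for the 512-node case (4096 ranks, nz_pppm = 5760), assuming the usual integer split of planes over ranks:

// With whole xy planes per proc, 5760 planes over 4096 ranks means some ranks
// get 1 plane (5760*5760 = 33177600 points) and some get 2 (66355200 points),
// i.e. the biggest rank owns ~1.4x the balanced 46656000 and 2x the smallest.
#include <cstdio>

int main()
{
  const long nprocs = 4096, nx = 5760, ny = 5760, nz = 5760;
  long one = 0, two = 0;
  for (long me = 0; me < nprocs; me++) {
    long zlo = me * nz / nprocs;            // integer split of z planes over ranks
    long zhi = (me + 1) * nz / nprocs - 1;
    if (zhi - zlo + 1 == 1) one++; else two++;
  }
  std::printf("1-plane ranks: %ld (%ld pts), 2-plane ranks: %ld (%ld pts)\n",
              one, nx * ny, two, 2 * nx * ny);   // 2432 and 1664 respectively
  return 0;
}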

@stanmoore1
Contributor

stanmoore1 commented Apr 16, 2024

64 nodes:

0: 0 2879 0 2879 0 4 --> 41472000
1: 0 2879 0 2879 5 10 --> 49766400
2: 0 2879 0 2879 11 15 --> 41472000
3: 0 2879 0 2879 16 21 --> 49766400
4: 0 2879 0 2879 22 27 --> 49766400
5: 0 2879 0 2879 28 32 --> 41472000
6: 0 2879 0 2879 33 38 --> 49766400
7: 0 2879 0 2879 39 44 --> 49766400
8: 0 2879 0 2879 45 49 --> 41472000
9: 0 2879 0 2879 50 55 --> 49766400
10: 0 2879 0 2879 56 60 --> 41472000

512 nodes:

0: 0 5759 0 5759 0 0 --> 33177600
1: 0 5759 0 5759 1 1 --> 33177600
2: 0 5759 0 5759 2 3 --> 66355200
3: 0 5759 0 5759 4 4 --> 33177600
4: 0 5759 0 5759 5 6 --> 66355200
5: 0 5759 0 5759 7 7 --> 33177600
6: 0 5759 0 5759 8 8 --> 33177600
7: 0 5759 0 5759 9 10 --> 66355200
8: 0 5759 0 5759 11 11 --> 33177600
9: 0 5759 0 5759 12 13 --> 66355200
10: 0 5759 0 5759 14 14 --> 33177600

4096 nodes:

0: 0 11519 0 89 0 44 --> 46656000
1: 0 11519 90 179 0 44 --> 46656000
2: 0 11519 180 269 0 44 --> 46656000
3: 0 11519 270 359 0 44 --> 46656000
4: 0 11519 360 449 0 44 --> 46656000
5: 0 11519 450 539 0 44 --> 46656000
6: 0 11519 540 629 0 44 --> 46656000
7: 0 11519 630 719 0 44 --> 46656000
8: 0 11519 720 809 0 44 --> 46656000
9: 0 11519 810 899 0 44 --> 46656000
10: 0 11519 900 989 0 44 --> 46656000

@stanmoore1
Contributor

If I force it to always use procs2grid2d() then I get the expected behavior for 512 nodes:

0: 0 5759 0 89 0 89 --> 46656000
1: 0 5759 90 179 0 89 --> 46656000
2: 0 5759 180 269 0 89 --> 46656000
3: 0 5759 270 359 0 89 --> 46656000
4: 0 5759 360 449 0 89 --> 46656000
5: 0 5759 450 539 0 89 --> 46656000
6: 0 5759 540 629 0 89 --> 46656000
7: 0 5759 630 719 0 89 --> 46656000
8: 0 5759 720 809 0 89 --> 46656000
9: 0 5759 810 899 0 89 --> 46656000
10: 0 5759 900 989 0 89 --> 46656000

I'm not sure what the performance implications are, though. In Nick's case, perfect load balance appears to be much more important than each proc owning an entire xy plane.

@stanmoore1
Contributor

@hagertnl this patch should give the expected behavior for 512 nodes. We need to decide if we want to make this change in LAMMPS or not.

diff --git a/src/KSPACE/pppm.cpp b/src/KSPACE/pppm.cpp
index 4fe5075..06cbf11 100644
--- a/src/KSPACE/pppm.cpp
+++ b/src/KSPACE/pppm.cpp
@@ -1389,10 +1389,7 @@ void PPPM::set_grid_local()
   //   of the global FFT mesh that I own in x-pencil decomposition
 
   int npey_fft,npez_fft;
-  if (nz_pppm >= nprocs) {
-    npey_fft = 1;
-    npez_fft = nprocs;
-  } else procs2grid2d(nprocs,ny_pppm,nz_pppm,&npey_fft,&npez_fft);
+  procs2grid2d(nprocs,ny_pppm,nz_pppm,&npey_fft,&npez_fft);
 
   int me_y = me % npey_fft;
   int me_z = me / npey_fft;

@stanmoore1 stanmoore1 self-assigned this Apr 16, 2024
@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@stanmoore1
Contributor

@sjplimp sorry I had the data mixed up (now corrected above), it should have been:

512 nodes:

0: 0 5759 0 5759 0 0 --> 33177600
1: 0 5759 0 5759 1 1 --> 33177600
2: 0 5759 0 5759 2 3 --> 66355200
3: 0 5759 0 5759 4 4 --> 33177600
4: 0 5759 0 5759 5 6 --> 66355200
5: 0 5759 0 5759 7 7 --> 33177600
6: 0 5759 0 5759 8 8 --> 33177600
7: 0 5759 0 5759 9 10 --> 66355200
8: 0 5759 0 5759 11 11 --> 33177600
9: 0 5759 0 5759 12 13 --> 66355200
10: 0 5759 0 5759 14 14 --> 33177600

But looking at KSPACE/fft3d.cpp I don't see where the 2d FFTs w/out communication are being checked for or performed.

I don't think that is supported anywhere in LAMMPS?

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@sjplimp
Contributor

sjplimp commented Apr 16, 2024 via email

@stanmoore1
Contributor

Stan - does your FFT-only reproducer actually perform FFTs?

No, it just computes the decomposition without actually performing FFTs.

Can PPPM KOKKOS wrap the FFT lib from ORNL (can't recall its name)?

Yes, hipFFT and cuFFT both support 2D FFTs; adding this could give a speedup.

@hagertnl
Contributor Author

Does this also answer why there was the breakdown in communication at 4096+ nodes?

Frontier is busy decongesting from GB runs, but I am submitting jobs to test the patch you provided. I'll run 64, 512, and 4096 node jobs and report back. Thanks!

@stanmoore1
Contributor

stanmoore1 commented Apr 17, 2024

Does this also answer why there was the breakdown in communication at 4096+ nodes?

Not sure what you mean by "breakdown in communication"? The FFTs require a many-to-many proc communication pattern, so the more nodes you have, the more costly it will be. FFTs are not going to weak-scale as well as the point-to-point communication used for the force calculation. One way around this would be to limit the number of ranks doing FFTs using run_style verlet/split (https://docs.lammps.org/run_style.html); however, this isn't yet ported to Kokkos (though I don't think it would be hard to port).
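For reference, on the CPU side verlet/split is driven roughly like this (the rank counts below are placeholders; the Rspace partition must be an integer multiple of the Kspace partition):

# hedged sketch only -- verlet/split is not yet ported to Kokkos.
# Launch with two partitions, e.g. 12 ranks for Rspace and 4 for KSpace:
#   mpirun -np 16 lmp -partition 12 4 -in input.in
run_style verlet/split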

@hagertnl
Contributor Author

By breakdown, I meant processes that don't talk to each other. At <= 512 nodes, we find that every process talks to every other process at least 50 times (once per timestep). But at 4096+ nodes, we see that there is a large number of processes that aren't sending any data to some other processes, so the "all-to-all" behavior we were initially seeing changes to many-to-many.

@stanmoore1
Contributor

By breakdown, I meant processes that don't talk to each other. At <= 512 nodes, we find that every process talks to every other process at least 50 times (once per timestep). But at 4096+ nodes, we see that there is a large number of processes that aren't sending any data to some other processes, so the "all-to-all" behavior we were initially seeing changes to many-to-many.

Ah yes, I think that patch could switch the 64- and 512-node cases to more of a many-to-many comm pattern instead of all-to-all, but @sjplimp would know better.

@sjplimp
Contributor

sjplimp commented Apr 17, 2024 via email

@stanmoore1
Contributor

@hagertnl note that all-to-all isn't always worse than many-to-many. There is actually a kspace_modify collective yes command (see https://docs.lammps.org/kspace_modify.html) which uses all-to-all:

"The collective keyword applies only to PPPM. It is set to no by default, except on IBM BlueGene machines. If this option is set to yes, LAMMPS will use MPI collective operations to remap data for 3d-FFT operations instead of the default point-to-point communication. This is faster on IBM BlueGene machines, and may also be faster on other machines if they have an efficient implementation of MPI collective operations and adequate hardware."

However this option is not yet ported to the KOKKOS package.

@stanmoore1
Contributor

But the collective option uses an actual MPI_Alltoall call, not just sends to all procs with MPI_Send/Recv, so it is a different issue from the planar vs. pencil decomposition. The planar decomposition could be better if we used 2D instead of 1D FFTs.

@hagertnl
Contributor Author

I was going to ask about that -- especially at large scale, I would expect an Alltoall to outperform the Send+Irecv implementation most days of the week. In a hackathon I helped with recently, a team saw a >20x speedup by using Alltoall instead of Send+Irecv. Granted, different buffer sizes, different science domain, etc. Given the potential benefit, I'd be very interested to give that a shot. Is there a technical issue blocking the porting to the KOKKOS package? If it's an issue of time, I'm happy to try to help where I can.

@stanmoore1
Contributor

No technical issues, only lack of time. It would be great if you could work on this. You can see the collective code in the regular (non-Kokkos) version here:

https://github.com/lammps/lammps/blob/develop/src/KSPACE/remap.cpp#L113-L203

which is missing from the Kokkos version: https://github.com/lammps/lammps/blob/develop/src/KOKKOS/remap_kokkos.cpp#L106

The Kokkos version would need to support both GPU-aware MPI on and off, which is already done for the point-to-point version.
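For orientation (a sketch, not LAMMPS code): the core of the collective path is swapping the per-pair send/recv loop for a single MPI_Alltoallv on the remap communicator, with counts and displacements taken from the remap plan; with GPU-aware MPI off, the buffers would first be staged through host copies, as the point-to-point Kokkos path already does.

// Sketch of the collective exchange; sendcnts/recvcnts are hypothetical
// per-destination element counts from the plan, and the buffers hold FFT doubles.
#include <mpi.h>
#include <vector>

void remap_collective(const double *sendbuf, double *recvbuf,
                      const std::vector<int> &sendcnts,
                      const std::vector<int> &recvcnts, MPI_Comm comm)
{
  int nprocs;
  MPI_Comm_size(comm, &nprocs);

  std::vector<int> sdispls(nprocs), rdispls(nprocs);
  for (int p = 0, s = 0, r = 0; p < nprocs; p++) {   // prefix sums -> displacements
    sdispls[p] = s; s += sendcnts[p];
    rdispls[p] = r; r += recvcnts[p];
  }

  // one collective instead of nprocs point-to-point exchanges;
  // ranks that exchange nothing simply have zero counts
  MPI_Alltoallv(sendbuf, sendcnts.data(), sdispls.data(), MPI_DOUBLE,
                recvbuf, recvcnts.data(), rdispls.data(), MPI_DOUBLE, comm);
}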

@hagertnl
Contributor Author

I'd be happy to help with that. Is there an open feature request about it already? If not, I can create one to track it, just so work doesn't get duplicated.

@stanmoore1
Contributor

stanmoore1 commented Apr 17, 2024

I just checked and there is currently no existing feature request for this, thanks.

@akohlmey
Member

akohlmey commented Apr 17, 2024

Also, another reason FFTs don't scale as well as you might expect is that their cost is O(N log N), not O(N) like the rest of the MD algorithm.

Plus, for the non-bonded interactions you have a 3d domain decomposition, with pencils you are reduced to a 2d decomposition, and with slices you are distributing data in only 1d. So you quickly run out of work units when you have lots of MPI ranks. This was one major motivation to implement OpenMP parallelism in pair styles back around 2010: that way we could keep the number of MPI ranks smaller and still get decent strong scaling. It was the major "winner" in our INCITE proposal on the Cray XT5 back then at ORNL. I've put some numbers and graphs together in a talk for a LAMMPS symposium in 2011 (see pages 14 to 19 of the attached PDF). Run style verlet/split was invented later for the same purpose. It has more potential for pushing strong scaling even farther, if you do have plenty of hardware.
lammps-overdrive.pdf.

So having a working verlet/split for KOKKOS could be the path for the best scaling.

P.S.: at the strong-scaling limit you would also see a significant performance benefit from using single-precision FFTs over double-precision FFTs (not so much for the rest). Communication is already dominant there, and this way you would cut the required bandwidth in half.

@hagertnl
Contributor Author

Submitted #4140 to track. I'll hopefully make some progress on this in the coming few weeks.

@akohlmey
Member

But the collective option is using an actual MPI_Alltoall call, not just sending to all procs with MPI Send/Recv, so it is a different issue than the planar vs pencil decomposition. Planar decomposition could be better if we used 2D instead of 1D FFTs.

Isn't HeFFTe supposed to have some optimizations for that? But that is not yet KOKKOS compatible, right?

@stanmoore1
Contributor

stanmoore1 commented Apr 17, 2024

With the KOKKOS package you can also run KSpace and the Bond, Angle, etc. calculations on the host CPU, even threaded with OpenMP, while the Pair force calculation runs on the GPUs. If you don't use a barostat (e.g. fix npt), then the CPU and GPU time will overlap. But whether this would give a speedup on Frontier over running everything on the GPUs, I'm not sure.

Like Axel said, using single precision for FFTs could also help, and that should already work with the KOKKOS package; you just need to compile with a flag, see https://docs.lammps.org/Build_settings.html.
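(For reference, assuming a CMake build, the setting on that page is FFT_SINGLE; other options elided:)

cmake ../cmake -D FFT_SINGLE=yes   # plus the existing Kokkos/HIP options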

Isn't HeFFTe supposed to have some optimizations for that? But that is not yet KOKKOS compatible, right?

Probably (not sure), but HeFFTe is not yet KOKKOS compatible.

@hagertnl
Contributor Author

I don't have the 4096-node job back yet, but here's some data for the patch @stanmoore1 provided. We do see a good (20+%) speed-up of the 512-node case, and can confirm it's using a much better-balanced FFT decomposition.

64-node:

New

3d grid and FFT values/proc = 49430863 46656000
Performance: 0.015 ns/day, 1622.397 hours/ns, 0.342 timesteps/s, 1.103 Gatom-step/s
FFT time (% of Kspce) = 179.062 (69.84)
FFT Gflps 3d (1d only) = 6375.8953 46359.128

Old

3d grid and FFT values/proc = 49430863 49766400
Performance: 0.015 ns/day, 1552.893 hours/ns, 0.358 timesteps/s, 1.152 Gatom-step/s
FFT time (% of Kspce) = 174.284 (71.06)
FFT Gflps 3d (1d only) = 6550.695 45354.741

512-node:

New

3d grid and FFT values/proc = 49430863 46656000
Performance: 0.013 ns/day, 1821.530 hours/ns, 0.305 timesteps/s, 7.860 Gatom-step/s
FFT time (% of Kspce) = 209.756 (72.45)
FFT Gflps 3d (1d only) = 47332.296 1775078.1

Old

3d grid and FFT values/proc = 49430863 66355200
Performance: 0.011 ns/day, 2241.749 hours/ns, 0.248 timesteps/s, 6.386 Gatom-step/s
FFT time (% of Kspce) = 290.224 (79.26)
FFT Gflps 3d (1d only) = 34208.806 1560932.6

@stanmoore1
Contributor

stanmoore1 commented Apr 18, 2024

@hagertnl looks like the 64 node case was a little slower while the 512 node case was faster. The performance of the 4096 node case and higher node counts should be unchanged with the patch.

@hagertnl
Contributor Author

There were some network timeouts reported by the 64-node job, so I can't guarantee that the code is at fault for that timing. It was close enough to not fret over for the moment. Thanks!
