[BUG] Strange communication lapse while scaling PPPM to extreme scale #4133
Comments
@stanmoore1 I'm curious to get your thoughts on this, if you have a second... Since it's only at massive scale, I'm happy to put some time towards this, I just don't initially know where to go looking in LAMMPS yet.
Interestingly, our energy breakdowns are exactly what we expect them to be. So it's not like LAMMPS is giving horribly wrong answers -- the printed thermodynamic data is correct, but we also don't run the system for very long.
@hagertnl, that is odd, a few questions:
Here is where the FFT grid is printed out in the log file:
When you run another set of jobs, can you add
Could you do the same thing but compare the same rank in each case, e.g. rank 0? Just want to make sure it is apples to apples; the variance is pretty low, so a little jitter could make any random rank the slowest in a given run. @sjplimp any other ideas?
Will do! Also, all the plots I showed were from rank 1, looking at data it sent to all other ranks. The tabular data was all for just the maximum rank. I can grab the data for rank 1 to line up with the plots.
Results from FFT bench: 64-node:
512-node:
4096-node:
Interesting, the FFT time dropped pretty substantially from 512 to 4096
@hagertnl look at this line: 64 nodes: 3d grid and FFT values/proc = 49430863 49766400 Here is the code:
Well that makes the FFT time drop much less interesting :) In a weak-scaling simulation, do you know whether the values per proc should be constant per rank? I would think that since the FFT grid weak-scales with the system size, it should remain constant, but I don't know much about the inner workings of the 3D FFTs.
I think it should be constant for weak scaling, or at the very least increase monotonically with node count. I will have to dig through the code and see what is going on.
I would suspect the list of available prime factors for the FFT plan. The planner will choose more grid points when it cannot decompose the optimal number, and what is available depends on the FFT interface.
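For illustration, this is roughly what "rounding up to a factorable size" means -- a sketch only, not LAMMPS or FFT-library code, and the actual set of supported prime factors depends on the FFT interface:

#include <cstdio>

// return true if n can be fully decomposed into the given prime factors
static bool factorable(int n, const int *factors, int nfactors) {
  for (int i = 0; i < nfactors; i++)
    while (n % factors[i] == 0) n /= factors[i];
  return n == 1;
}

int main() {
  const int factors[] = {2, 3, 5};   // hypothetical supported prime factors
  int n = 2881;                      // hypothetical requested grid dimension
  while (!factorable(n, factors, 3)) n++;   // planner bumps the size upward
  std::printf("planner would use %d grid points instead of 2881\n", n);
  return 0;
}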
Interesting. Since this is a per-proc "max" value, it looks like load imbalance, i.e. one proc had more FFT work than the others, which increases the FFT time. I wonder why the code can't keep the same decomposition when weak scaling -- the total FFT grid is exactly 8x larger for 512 vs 64 nodes.
Might be related to the fact that for a 3d FFT, the grid goes through a sequence of 2d decompositions across the procs. I.e. all the procs own 1d pencils of the grid in the x dir, then y dir, then z dir. So that is a 2d decomp in the yz plane, the xz plane, the xy plane.
For the 2d decomp, 64 procs = 8x8 and 4096 procs = 64x64, but for 512 procs it is 16x32.
What are the grid dimensions?
Steve
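For reference, here is a rough sketch of the idea of picking a 2d processor grid to match a plane of the FFT grid (illustrative only, not LAMMPS's procs2grid2d):

#include <cstdio>
#include <cmath>

// choose npey x npez = nprocs so the per-proc chunks of an ny x nz plane
// are as close to square as possible
void grid2d(int nprocs, int ny, int nz, int *npey, int *npez) {
  double best = 1.0e30;
  for (int py = 1; py <= nprocs; py++) {
    if (nprocs % py) continue;
    int pz = nprocs / py;
    double score = std::fabs((double) ny / py - (double) nz / pz);
    if (score < best) { best = score; *npey = py; *npez = pz; }
  }
}

int main() {
  int py, pz;
  grid2d(512, 2880, 2880, &py, &pz);    // 64 nodes x 8 ranks/node
  std::printf("512 procs  -> %d x %d\n", py, pz);   // 16 x 32
  grid2d(4096, 5760, 5760, &py, &pz);   // 512 nodes x 8 ranks/node
  std::printf("4096 procs -> %d x %d\n", py, pz);   // 64 x 64
  return 0;
}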
I see now you said 512 nodes, not procs. How many MPI tasks per node?
Steve
8 MPI tasks per node
I also use the processors command to set a 2x2x2 local grid for the 8 local ranks. We're using a cubic system and always use a perfect-cube number of nodes.
Seems odd that 512 nodes has the best 2D decomposition but is also the slowest?
So then
64 nodes = 2^(6+3) MPI tasks = 16x32 2d decomp
512 nodes = 2^(9+3) MPI tasks = 64x64 2d decomp
4096 nodes = 2^(12+3) MPI tasks = 128x256 decomp
So that is the opposite of what I hypothesized.
What is the 3d size of the FFT grid for the 3 cases?
Steve
64 nodes: grid = 2880 2880 2880
512 nodes: grid = 5760 5760 5760
4096 nodes: grid = 11520 11520 11520
2880^3/512 = 46656000. However, only 4096 nodes gets the ideal perfectly balanced case where every rank has 46656000 FFT data points: 64 nodes: 3d grid and FFT values/proc = 49430863 49766400
Hmmm. I think all 3 of those should give 46656000 for FFT values/proc, since the numbers for the 2d decomps are evenly divisible in each case. Total grid size / total MPI count = 46656000 for all 3 sizes. But only the last one gives that. I'd like to understand that.
I suggest putting in some debug statements where that output is calculated and printed.
For the 3d grid decomp, each proc's portion of the grid is 360^3 = 46656000. But the output numbers include ghost cells for that = 367^3 = 49430863, which has to do with the PPPM charge mapping stencil size. What is it for these runs? Again, same for all 3 problem sizes.
Steve
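As a back-of-the-envelope check of those numbers (illustrative only; the total of 7 ghost layers per dimension is inferred from 367 - 360, and how they split between the low and high sides depends on the stencil order):

#include <cstdio>

int main() {
  long owned = 2880 / 8;            // 360 owned grid points per dimension (64-node case)
  long ghost = 7;                   // inferred total ghost layers per dimension
  long with_ghost = owned + ghost;  // 367
  std::printf("owned      = %ld^3 = %ld\n", owned, owned * owned * owned);                      // 46656000
  std::printf("with ghost = %ld^3 = %ld\n", with_ghost, with_ghost * with_ghost * with_ghost);  // 49430863
  return 0;
}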
These are the lines below in pppm.cpp to instrument. What is printed to the screen (later) is the max of nfft_both across all procs. I'm pretty sure nfft_brick is the same on all procs since it is related to ngrid.
So I'd like to know what nfft (and the 6 values that determine it) are on each proc. Actually nxhi_fft and nxlo_fft are not interesting; they should just be 1 to Nx. So just the 4 yz values, and really just the differences. So procID + 3 values:
procID nfft (nyhi_fft-nylo_fft+1) (nzhi_fft-nzlo_fft+1)
You could do this for the smallest problem since it seems off. Doing it for the medium problem would be a bonus since it is dramatically wrong.
Thanks,
Steve
-------------
// tally local grid sizes
// ngrid = count of owned+ghost grid cells on this proc
// nfft_brick = FFT points in 3d brick-decomposition on this proc
//              same as count of owned grid cells
// nfft = FFT points in x-pencil FFT decomposition on this proc
// nfft_both = greater of nfft and nfft_brick

ngrid = (nxhi_out-nxlo_out+1) * (nyhi_out-nylo_out+1) * (nzhi_out-nzlo_out+1);
nfft_brick = (nxhi_in-nxlo_in+1) * (nyhi_in-nylo_in+1) * (nzhi_in-nzlo_in+1);
nfft = (nxhi_fft-nxlo_fft+1) * (nyhi_fft-nylo_fft+1) * (nzhi_fft-nzlo_fft+1);
nfft_both = MAX(nfft,nfft_brick);
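For example, the requested per-proc debug line could be dropped in right after nfft_both is computed, along these lines (an illustrative snippet, assuming me holds this proc's MPI rank as used elsewhere in pppm.cpp):

// one line per proc: procID, nfft, y-extent and z-extent of its x-pencil
printf("FFTDBG %d %ld %d %d\n", me, (long) nfft,
       nyhi_fft - nylo_fft + 1, nzhi_fft - nzlo_fft + 1);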
Here is a simpler thing you can output first, just from a single proc, after this line in PPPM::set_grid_local():
procs2grid2d(nprocs,ny_pppm,nz_pppm,&npey_fft,&npez_fft);
Print out all the args of this function (inputs and outputs), just on 1 proc.
Stan - is this benchmark running with just vanilla PPPM - not staggered or some other optimization?
And what is the order param for the stencil?
Thanks,
Steve
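Something like this one-liner right after the procs2grid2d() call would cover it (illustrative; assumes me and nprocs are the rank and size members used elsewhere in pppm.cpp):

// print the inputs and outputs of the 2d FFT proc-grid choice, on rank 0 only
if (me == 0)
  printf("procs2grid2d: nprocs=%d ny_pppm=%d nz_pppm=%d -> npey_fft=%d npez_fft=%d\n",
         nprocs, ny_pppm, nz_pppm, npey_fft, npez_fft);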
I'm pulling the PPPM code into a standalone reproducer which will be easy to debug. This is Kokkos PPPM, I believe using the default order = 5.
I can reproduce the issue in my standalone. Here are the values for the first 11 ranks on 512 nodes = 4096 ranks:
OK here is the issue. When nz_pppm >= nprocs, PPPM::set_grid_local() skips procs2grid2d() and instead assigns npey_fft = 1 and npez_fft = nprocs, i.e. it decomposes the FFT grid into whole xy planes along z (see the patch below).
If I force it to always use procs2grid2d(), the decomposition comes out balanced.
Not sure what the performance implications are though? In Nick's case, perfect load balance appears to be much more important than each proc having an entire xy plane.
@hagertnl this patch should give the expected behavior for 512 nodes. We need to decide if we want to make this change in LAMMPS or not.

diff --git a/src/KSPACE/pppm.cpp b/src/KSPACE/pppm.cpp
index 4fe5075..06cbf11 100644
--- a/src/KSPACE/pppm.cpp
+++ b/src/KSPACE/pppm.cpp
@@ -1389,10 +1389,7 @@ void PPPM::set_grid_local()
   // of the global FFT mesh that I own in x-pencil decomposition

   int npey_fft,npez_fft;
-  if (nz_pppm >= nprocs) {
-    npey_fft = 1;
-    npez_fft = nprocs;
-  } else procs2grid2d(nprocs,ny_pppm,nz_pppm,&npey_fft,&npez_fft);
+  procs2grid2d(nprocs,ny_pppm,nz_pppm,&npey_fft,&npez_fft);

   int me_y = me % npey_fft;
   int me_z = me / npey_fft;
Now it makes sense - thanks for figuring it out, Stan.
So the rationale for switching to decomposing by 2d planes instead of 1d pencils (when you have fewer processes than grid points in a dimension) is to avoid communication. I.e. you can either do true 2d FFTs in memory (on each proc) or do a series of 1d FFTs in one dim and then another series of strided 1d FFTs in the other dim, without needing to communicate/transpose the data in between. In principle that should be a win versus communicating between every set of 1d FFTs.
But looking at KSPACE/fft3d.cpp I don't see where the 2d FFTs without communication are being checked for or performed. Does this ring any bells for you, Stan?
Also, these numbers for 512 nodes:
0: 0 2879 0 2879 0 4 --> 41472000
1: 0 2879 0 2879 5 10 --> 49766400
2: 0 2879 0 2879 11 15 --> 41472000
3: 0 2879 0 2879 16 21 --> 49766400
4: 0 2879 0 2879 22 27 --> 49766400
5: 0 2879 0 2879 28 32 --> 41472000
6: 0 2879 0 2879 33 38 --> 49766400
7: 0 2879 0 2879 39 44 --> 49766400
8: 0 2879 0 2879 45 49 --> 41472000
9: 0 2879 0 2879 50 55 --> 49766400
10: 0 2879 0 2879 56 60 --> 41472000
do not make it seem like any proc will have 50% more FFT grid points?
I.e., max = 66355200.
The only imbalance is 5 versus 6 planes.
Steve
@sjplimp sorry I had the data mixed up (now corrected above), it should have been:
I don't think that is supported anywhere in LAMMPS?
oh, I see now that in Stan's email the 64 vs 512 node values were flipped. So on 512 nodes it is just 1 versus 2 planes, which leads to the 66355200 (2 planes).
If we are not leveraging the planar decomp in the fft3d.cpp code, then I don't think we should ever be choosing to perform the planar decomp in the PPPM code.
I'm curious if removing that option in pppm.cpp will improve both the 64 and 512 node timings.
Steve
oops - our emails crossed
Thanks,
Steve
Stan - does your FFT-only reproducer actually perform FFTs? If so, I could work to upgrade fft3d.cpp to detect and perform 2d FFTs on a planar decomp, and we could test them for correctness and speed.
Can PPPM KOKKOS wrap the FFT lib from ORNL (can't recall its name)? That might already have options for using the 2d planar idea when applicable. If so you could see how it does on these 3 problems.
Steve
No, it just computes the decomposition without actually performing FFTs.
Yes
Does this also answer why there was the breakdown in communication at 4096+ nodes? Frontier is busy decongesting from GB runs, but I am submitting jobs to test the patch you provided. I'll run 64, 512, and 4096 node jobs and report back. Thanks!
Not sure what you mean by "breakdown in communication"? The FFTs require a many-to-many proc communication pattern, so the more nodes you have, the more costly it will be. FFTs are not going to weak scale as well as the point-to-point communication used for the force calculation. One way around this would be to limit the number of ranks for FFTs using
By breakdown, I meant processes that don't talk to each other. At <= 512 nodes, we find that all processes talk to all other processes at least 50 times (once per timestep). But at 4096+ nodes, we see that there's a large number of processes that aren't sending any data to some processes, so the "all-to-all" behavior we were initially seeing changes to many-to-many.
Ah yes, I think that patch could switch the 64 and 512 node cases to more of a many-to-many comm pattern instead of all-to-all, but @sjplimp would know better.
I think in principle the FFT comm (for the data transposes) should be many-to-many for the pencil decomp. For the planar decomp (which Stan just changed) it could be all-to-all.
Also, another reason FFTs don't scale as well as you might expect is that their computation is N log N, not O(N) like the rest of the MD algorithm.
Steve
@hagertnl note that all-to-all isn't always worse than many-to-many. There is actually a collective option for this in LAMMPS: "The collective keyword applies only to PPPM. It is set to no by default, except on IBM BlueGene machines. If this option is set to yes, LAMMPS will use MPI collective operations to remap data for 3d-FFT operations instead of the default point-to-point communication. This is faster on IBM BlueGene machines, and may also be faster on other machines if they have an efficient implementation of MPI collective operations and adequate hardware." However this option is not yet ported to the KOKKOS package.
But the
I was going to ask about that -- especially at large scale, I would expect an Alltoall to outperform the Send+Irecv implementation most days of the week. In a hackathon I helped with recently, a team saw a >20x speedup by using Alltoall instead of Send+Irecv. Granted, different buffer sizes, different science domain, etc. Given the potential benefit, I'd be very interested to give that a shot. Is there a technical issue blocking the porting to the KOKKOS package? If it's an issue of time, I'm happy to try to help where I can.
No technical issues, only lack of time. It would be great if you could work on this. You can see the collective code in the regular (non-Kokkos) version here: https://github.com/lammps/lammps/blob/develop/src/KSPACE/remap.cpp#L113-L203 which is missing from the Kokkos version: https://github.com/lammps/lammps/blob/develop/src/KOKKOS/remap_kokkos.cpp#L106 The Kokkos version would need to support both GPU-aware MPI on and off, which is already done for the point-to-point version.
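For orientation, the collective remap amounts to packing each destination's chunk into one contiguous buffer and replacing the loop of point-to-point sends/receives with a single MPI_Alltoallv. A very rough, generic sketch of that shape is below; the names and buffer layout are illustrative, not the remap_kokkos.cpp API, and with GPU-aware MPI disabled the buffers would additionally be staged through host copies:

#include <mpi.h>
#include <vector>

// outgoing[p] holds the grid values this rank must send to rank p;
// incoming[p] is filled with the values received from rank p
void remap_collective(const std::vector<std::vector<double>> &outgoing,
                      std::vector<std::vector<double>> &incoming,
                      MPI_Comm comm) {
  int nprocs;
  MPI_Comm_size(comm, &nprocs);

  std::vector<int> sendcounts(nprocs), sdispls(nprocs);
  std::vector<int> recvcounts(nprocs), rdispls(nprocs);

  // pack: concatenate the per-destination chunks into one send buffer
  std::vector<double> sendbuf;
  for (int p = 0; p < nprocs; p++) {
    sendcounts[p] = (int) outgoing[p].size();
    sdispls[p] = (int) sendbuf.size();
    sendbuf.insert(sendbuf.end(), outgoing[p].begin(), outgoing[p].end());
  }

  // exchange counts so every rank knows how much it receives from whom
  MPI_Alltoall(sendcounts.data(), 1, MPI_INT, recvcounts.data(), 1, MPI_INT, comm);

  int total = 0;
  for (int p = 0; p < nprocs; p++) { rdispls[p] = total; total += recvcounts[p]; }
  std::vector<double> recvbuf(total);

  // one collective call replaces the loop of MPI_Send/MPI_Irecv
  MPI_Alltoallv(sendbuf.data(), sendcounts.data(), sdispls.data(), MPI_DOUBLE,
                recvbuf.data(), recvcounts.data(), rdispls.data(), MPI_DOUBLE, comm);

  // unpack the per-source chunks
  incoming.assign(nprocs, {});
  for (int p = 0; p < nprocs; p++)
    incoming[p].assign(recvbuf.begin() + rdispls[p],
                       recvbuf.begin() + rdispls[p] + recvcounts[p]);
}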
I'd be happy to help with that. Is there an open feature request about it already? I can create one to track, if not, just so work doesn't get duplicated.
I just checked and there is currently no existing feature request for this, thanks.
Plus, for the non-bonded interactions you have a 3d domain decomposition, with pencils you are reduced to a 2d domain decomposition, and with slices you are distributing data only in 1d. So you quickly run out of work units when you have lots of MPI ranks. This was one major motivation for adding OpenMP parallelism to pair styles back around 2010: that way we could keep the number of MPI ranks smaller and still get decent strong scaling. It was the major "winner" in our INCITE proposal on the Cray XT5 at ORNL back then. I've put some numbers and graphs together in a talk for a LAMMPS symposium in 2011 (see pages 14 to 19 of the attached PDF). Run style verlet/split was invented for the same purpose later. It has more potential for pushing strong scaling even farther, if you do have plenty of hardware. So having a working verlet/split for KOKKOS could be the path to the best scaling. P.S.: at the strong-scaling limit you would also see a significant performance benefit from using single-precision FFTs over double-precision FFTs (not so much for the rest), because communication is already dominant and this way you would reduce the required bandwidth by half.
Submitted #4140 to track. I'll hopefully make some progress on this in the coming few weeks.
Isn't HeFFTe supposed to have some optimizations for that? But that is not yet KOKKOS compatible, right?
With the KOKKOS package you can also run Like Axel said, using single precision for FFTs could also help, which should already work with the KOKKOS package, you just need to compile with a flag, see https://docs.lammps.org/Build_settings.html.
Probably (not sure), but HeFFTe is not yet KOKKOS compatible.
I don't have the 4096-node job back yet, but here's some data for the patch @stanmoore1 provided. We do see a good (20+%) speed-up of the 512-node case, and confirm it's using a much better FFT grid.
64-node:
New: 3d grid and FFT values/proc = 49430863 46656000
Old: 3d grid and FFT values/proc = 49430863 49766400
512-node:
New: 3d grid and FFT values/proc = 49430863 46656000
Old: 3d grid and FFT values/proc = 49430863 66355200
@hagertnl looks like the 64 node case was a little slower while the 512 node case was faster. The performance of the 4096 node case and higher node counts should be unchanged with the patch.
There were some network timeouts reported by the 64-node job, so I can't guarantee that the code is at fault for that timing. It was close enough to not fret over for the moment. Thanks!
Hi all
I have been scaling the PPPM long-range potential out to Frontier scale with an SPC/E water box (targeting 9261 nodes). I am using Kokkos+HIP backend. I noticed that I got very good weak-scaling behavior, much better than expected, but with an odd "shoulder" in the scaling -- using 64 nodes as the base of the weak-scaling, 512 nodes had much worse scaling than 4096 nodes did (parallel efficiency went down, then back up). I attached the plot of this to this ticket. I found this strange, so I poked further.
I analyzed the bytes sent to/from each MPI rank during a 50-timestep run using an MPI tracing tool I built with LLNL's wrap utility. I simply dumped the message size and destination for every MPI call. MPI_Send is the dominant MPI call that sends the majority of the particle and PPPM data, so I focused on it. I chose the rank with the maximum time in MPI_Send and computed the total number of MPI_Sends called and the total number of bytes for that rank at each node count; here is what I found:
Node count: max MPI_Send time, total bytes, total messages
64-node: 64 seconds, 453.1x10^9 bytes, 41.2K messages
512-node: 115 seconds, 435.2x10^9 bytes, 251.3K messages
4096-node: 89 seconds, 481.5x10^9 bytes, 98.8K messages
9261-node: 188 seconds, 473.9x10^9 bytes, 136.5K messages
The 512-node point didn't look right, so next I looked at where the data was sent. I've attached 4 plots to this ticket named "r2a_n_rank1_count.png". These show the number of MPI_Sends called by rank 1 to each of the destination ranks on the x-axis. From what I know of LAMMPS, there are two main reasons to communicate with MPI_Send: particle/neighbor list exchange and PPPM FFT data exchange. I'd expect to see 6 "peak" ranks where particle exchange happens (one neighbor in each Cartesian direction), and then I'd expect every other rank in the simulation to receive data from rank 1 for FFTs. Looking at the plots, this holds at 64 and 512 nodes, but at 4096 nodes there's a large number of ranks that rank 1 never communicates with. Do you have any insight? I am happy to help put time towards figuring this out, I'm just not sure where to look.
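For context, the tracing boils down to a PMPI-style interposer around MPI_Send; a minimal hand-written sketch of the same idea (the actual tool was generated with LLNL's wrap, so this is only an illustration) looks like:

#include <mpi.h>
#include <stdio.h>

// log the destination and byte count of every MPI_Send, then forward the call
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
  int rank, size_bytes;
  PMPI_Comm_rank(comm, &rank);
  PMPI_Type_size(datatype, &size_bytes);
  fprintf(stderr, "MPITRACE rank %d -> %d : %lld bytes\n",
          rank, dest, (long long) count * size_bytes);
  return PMPI_Send(buf, count, datatype, dest, tag, comm);
}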
I also ran this problem on Summit, but 4096 Summit nodes can only run the 512-Frontier-node problem due to HBM sizes. Summit showed similar scaling behavior up to the 512-node problem (run on 4096 nodes of Summit).
LAMMPS Version and Platform
LAMMPS commit 0849863 built for Kokkos+HIP with hipcc from ROCm/5.7.1 + cray-mpich/8.1.28, uses hipFFT.
Run on Frontier, but also seems to be trending in this direction on Summit.
Expected Behavior
I'd expect the message count for MPI_Send to scale consistently, not spike at an intermediate node count and then drop back down at larger node counts.
Actual Behavior
Rank 1 calls MPI_Send 41.2K, 251.3K, 98.8K, and 136.5K times at 64, 512, 4096, and 9261 nodes respectively. The 4096+ node points seem wrong.
Steps to Reproduce
I can provide input files as needed. Using an srun line like:
Further Information, Files, and Links
64-node:
512-node:
4096-node:
9261-node:
Parallel efficiency plot: