Implement kspace_modify collective yes in KOKKOS package #4143

Open
wants to merge 12 commits into develop

Conversation

hagertnl
Contributor

@hagertnl hagertnl commented Apr 19, 2024

Summary

Implements the kspace_modify collective yes option in the KOKKOS package with KSPACE enabled.

Related Issue(s)

closes #4140
closes #4133

Author(s)

Nick Hagerty, Oak Ridge National Laboratory

Licensing

By submitting this pull request, I agree that my contribution will be included in LAMMPS and redistributed under either the GNU General Public License version 2 (GPL v2) or the GNU Lesser General Public License version 2.1 (LGPL v2.1).

Backward Compatibility

No change in backwards compatibility

Implementation Notes

A simple water box with PPPM enabled was run on 1 and 8 ranks on a single node, and on 512, 4096, and 32K ranks for scaling analysis.

Post Submission Checklist

  • The feature or features in this pull request is complete
  • Licensing information is complete
  • Corresponding author information is complete
  • The source code follows the LAMMPS formatting guidelines
  • Suitable new documentation files and/or updates to the existing docs are included
  • The added/updated documentation is integrated and tested with the documentation build system
  • The feature has been verified to work with the conventional build system
  • The feature has been verified to work with the CMake based build system
  • Suitable tests have been added to the unittest tree.
  • A package specific README file has been included or updated
  • One or more example input decks are included

Further Information, Files, and Links

@stanmoore1
Contributor

A simple water box with PPPM enabled was run on 1 and 8 ranks on a single node, and on 512, 4096, and 32K ranks for scaling analysis.

@hagertnl this wasn't actually being used in PPPMKokkos; you may need to rerun the tests.

@stanmoore1
Contributor

I tested this on 40 MPI ranks on CPU and on 4 V100 GPUs and it seems to give the correct numerics; nice work @hagertnl. Really curious what your scaling analysis will show, especially at large scale.

@hagertnl
Contributor Author

I figured that out when my performance numbers turned out to be exactly the same, hah! I still need to tie the kspace_modify keyword in to control the setting (right now I've just locally hard-coded a 1 for collectives).

Preliminarily, at 512 nodes (4096 ranks), I see about a 13% speedup, with correct answers. I'd like to do a quick round of optimization; I think the setup is doing double the work it needs to for the collectives, creating excessive overhead for the collective approach.

Not surprisingly, small-scale runs did not see any benefit, and the case I tried showed some performance regression. This is why I wanted to do a round of optimization; it seems like there may be excessive overhead.

@hagertnl
Contributor Author

Here are some preliminary numbers from Frontier, 8 ranks per node (MPI timings from rank 1):

# Nodes   coll/pt2pt   timesteps/s   Total MPI time   MPI_Send time   MPI_Alltoallv time
64        coll         0.316         26.91            4.06            19.40
64        pt2pt        0.360         65.15            58.30           0.17
512       coll         0.288         52.30            3.76            37.02
512       pt2pt        0.258         121.64           111.13          0.53
4096      coll         0.161         119.88           2.92            76.71
4096      pt2pt        0.300         112.38           76.03           16.56

Some of these jobs hit network timeouts, and these timings aren't averaged, so these numbers may change slightly run-to-run. These runs are also without the patch from #4133, so attempting to draw conclusions across job sizes is tricky.

The conclusions I draw from this currently are:

  • At 64 and 512 nodes, there's a >50% drop in total MPI time when using collectives relative to pt2pt, but the collective run is actually slower per timestep at 64 nodes. I suspect there's a lot of overhead in the way I'm currently packing the buffer for Alltoallv; see the sketch after this list.
  • In spite of the possible overhead, we still see a >10% speedup at 512 nodes.
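
For context on the packing overhead mentioned in the first bullet, here is a minimal host-side sketch of the general pack-then-MPI_Alltoallv pattern under discussion. It is not the PR's actual code: the function name, the `outgoing` data layout, and the use of `std::vector` are illustrative assumptions.

```cpp
#include <mpi.h>
#include <vector>

// Illustrative only: pack one contiguous slice per destination rank,
// then exchange everything with a single MPI_Alltoallv.
void remap_collective_sketch(const std::vector<std::vector<double>> &outgoing,
                             std::vector<double> &recvbuf, MPI_Comm comm)
{
  int nprocs;
  MPI_Comm_size(comm, &nprocs);

  std::vector<int> sendcounts(nprocs), sdispls(nprocs);
  std::vector<int> recvcounts(nprocs), rdispls(nprocs);

  // Build send counts/displacements and pack the send buffer.
  int offset = 0;
  std::vector<double> sendbuf;
  for (int p = 0; p < nprocs; ++p) {
    sendcounts[p] = static_cast<int>(outgoing[p].size());
    sdispls[p] = offset;
    sendbuf.insert(sendbuf.end(), outgoing[p].begin(), outgoing[p].end());
    offset += sendcounts[p];
  }

  // Everyone tells everyone how much data to expect.
  MPI_Alltoall(sendcounts.data(), 1, MPI_INT, recvcounts.data(), 1, MPI_INT, comm);

  int total = 0;
  for (int p = 0; p < nprocs; ++p) { rdispls[p] = total; total += recvcounts[p]; }
  recvbuf.resize(total);

  MPI_Alltoallv(sendbuf.data(), sendcounts.data(), sdispls.data(), MPI_DOUBLE,
                recvbuf.data(), recvcounts.data(), rdispls.data(), MPI_DOUBLE, comm);
}
```

The extra copy into the packed send buffer (and the matching unpack on the receive side, omitted here) is the kind of overhead that can erase the benefit of the collective at small scale.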

@stanmoore1 currently, the collective code for kokkos and non-kokkos PPPM is fairly apples-to-apples. From your end, would you prefer any of the following, with respect to the scope of work in this PR? (1) finish verifying the correctness of the implementation we have now, then merge, optimizing both kokkos and non-kokkos PPPM collectives in a new PR later, (2) optimize just kokkos PPPM now (diverging the Kokkos implementation of remap from non-Kokkos), (3) optimize kokkos & non-kokkos PPPM collectives now, or something else?

@stanmoore1
Contributor

stanmoore1 commented Apr 23, 2024

@hagertnl it is odd that the 512 node case is slower than 4096. Do these results include the patch with better load balancing by not using full xy planes, like we saw previously?

From your end, would you prefer any of the following, with respect to the scope of work in this PR? (1) finish verifying the correctness of the implementation we have now, then merge, optimizing both kokkos and non-kokkos PPPM collectives in a new PR later, (2) optimize just kokkos PPPM now (diverging the Kokkos implementation of remap from non-Kokkos), (3) optimize kokkos & non-kokkos PPPM collectives now, or something else?

This is mostly up to you and how much time you have to work on this. If you have more time it would be great to optimize more, otherwise we can merge now as-is.

@hagertnl
Contributor Author

This does not include the patch for better load balancing. Is that something you'd like me to put in this PR? I thought it might've been in another PR already.

@stanmoore1
Contributor

I think the load balancing issue is confounding the pt2pt vs coll comparison in the table above. It sounded like Steve agreed that we should apply that patch since we aren't using 2D FFTs when we have full xy planes. It is not in a PR yet.

@hagertnl
Contributor Author

Sounds good. I'll apply the patch in this PR and try out some optimization.

@hagertnl
Contributor Author

This is close to what I hope is the final version. I am currently running scaling studies on Frontier & Summit to test. At small scale, pt2pt is 10%+ faster than collective, but that's to be expected. I'm very interested in what happens at 512+ nodes.

Note -- I replaced a ~100-line section of code, which determined what the other ranks needed to communicate in order to build commringlist, with a single Allreduce. I assume that if the user wants to use Alltoallv, then the Allreduce should be performant on the platform as well, and it saves running a nested for-loop that gets very expensive at scale. I did some testing of this vs the old 100-line section of code and it is performance-neutral for 1-64 nodes. I did not have a chance to run at 512+, but unless the Allreduce is prohibitively expensive, I don't see a reason the local commringlist building would be faster or preferred.
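
A minimal sketch of that replacement idea, assuming each rank can flag the ranks it overlaps with; the names (`build_commringlist_sketch`, `my_partners`) are hypothetical and not the PR's actual code:

```cpp
#include <mpi.h>
#include <vector>

// Illustrative sketch: each rank flags the ranks it communicates with,
// then one MPI_Allreduce(MAX) yields the global membership of the
// communication ring, replacing a nested loop over all rank pairs.
std::vector<int> build_commringlist_sketch(const std::vector<int> &my_partners,
                                           MPI_Comm comm)
{
  int nprocs, me;
  MPI_Comm_size(comm, &nprocs);
  MPI_Comm_rank(comm, &me);

  std::vector<int> local(nprocs, 0), global(nprocs, 0);
  local[me] = 1;                          // this rank participates
  for (int p : my_partners) local[p] = 1; // ranks it sends to / receives from

  MPI_Allreduce(local.data(), global.data(), nprocs, MPI_INT, MPI_MAX, comm);

  std::vector<int> commringlist;
  for (int p = 0; p < nprocs; ++p)
    if (global[p]) commringlist.push_back(p);
  return commringlist;
}
```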

@stanmoore1
Contributor

I assume that if the user wants to use Alltoallv, then the Allreduce should be performant on the platform as well, and it saves running a nested for-loop that gets very expensive at scale.

I agree with this.

@hagertnl
Contributor Author

In profiling (to validate message sizes), I noticed that ranks send themselves a significant amount of data. For example, in the Alltoallv call below, about 746 MB were sent in total by the MPI_Alltoallv, and half of that -- 373 MB -- was sent by the current process to itself:

[Rank 0] MPI_Alltoallv started 1714096596.363301992, ended 1714096596.480638027 (elapsed 0.117336035), sent 746496000 bytes total, max of 373248000 to rank 0, received 746496000 bytes total, max of 373248000 from rank 0, to MPI_COMM_WORLD

I am thinking it is worth checking the performance impact of implementing plan->self like the non-collective strategy does, since it can substantially decrease the total bytes sent through MPI. Ideally, MPI implementations are smart enough to use a local copy for Alltoallv messages to self, but that doesn't appear to be true with cray-mpich.
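
A minimal host-side sketch of the self-copy idea (the buffer layout and function name are illustrative assumptions; in the Kokkos path the pack/copy would operate on device-accessible buffers):

```cpp
#include <mpi.h>
#include <cstring>

// Illustrative sketch only: copy the rank-to-self portion of the buffer
// locally and zero its counts so MPI_Alltoallv never transfers it.
// Assumes the self send and receive counts are equal, as they are when
// a rank simply keeps part of its own remapped data.
void alltoallv_skip_self(double *sendbuf, int *sendcounts, int *sdispls,
                         double *recvbuf, int *recvcounts, int *rdispls,
                         MPI_Comm comm)
{
  int me;
  MPI_Comm_rank(comm, &me);

  // Local copy replaces the self-message.
  std::memcpy(recvbuf + rdispls[me], sendbuf + sdispls[me],
              (size_t)sendcounts[me] * sizeof(double));

  // Temporarily zero the self counts for the collective.
  const int save_send = sendcounts[me];
  const int save_recv = recvcounts[me];
  sendcounts[me] = 0;
  recvcounts[me] = 0;

  MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                recvbuf, recvcounts, rdispls, MPI_DOUBLE, comm);

  // Restore counts in case the plan is reused.
  sendcounts[me] = save_send;
  recvcounts[me] = save_recv;
}
```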

@stanmoore1
Contributor

I am thinking it is worth checking the performance impact of implementing plan->self like the non-collective strategy does, since it can substantially decrease the total bytes sent through MPI. Ideally, MPI implementations are smart enough to use a local copy for Alltoallv messages to self, but that doesn't appear to be true with cray-mpich.

Interesting, yes that would be good to check.

@hagertnl
Contributor Author

hagertnl commented May 6, 2024

Just a status update: currently I'm debugging an issue on InfiniBand networks, where the NIC appears to report that it's out of physical memory while doing the Alltoallv. Once I have that worked out, I'll do a quick scaling study with small & large system sizes on both Summit and Frontier.

For separating self-communication out of MPI, I observed a small performance improvement at small node counts (still slower than pt2pt), but no observable improvement at large scale on Frontier. I was trying to test this on Summit when I hit the InfiniBand error. I'll test both implementations on Frontier & Summit when I do the scaling study, and we can figure out which implementation we want based on the data.

@stanmoore1
Contributor

@hagertnl very interesting, thanks for working on this.

@hagertnl
Contributor Author

Performance data from Frontier (Slingshot 11, Dragonfly topology) and Summit (Infiniband, Fat-tree topology) for small & large systems. Summit's "large" problem is 8x smaller per-node than Frontier's, due to HBM limits. The small problems are 8x smaller per-node than large for each system. In all cases, I ran 3 jobs and took the average of timesteps/sec, excluding obvious outliers.

Regarding the issues I saw at 4096 nodes on Summit, they appear to be due to this: https://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages. Basically, we're just overrunning the available resources. There are some parameters we can tune to get it to work, but I don't know them off the top of my head. Smaller systems were able to run with collectives without a problem.

[Screenshot, 2024-05-11: table of performance results for Frontier and Summit]

@hagertnl
Contributor Author

Just pushed the source for the collective+self optimization. I still need to clean up the memory deallocations. The remap destructor is not organized like the allocations, and I think I'm probably allocating more arrays than necessary, or I can re-use some of the existing pt2pt variables.

@stanmoore1
Contributor

@hagertnl looks like the collective option can give significant speedup on Frontier, which is great.

@hagertnl
Contributor Author

@stanmoore1 it's actually the opposite (the metric reported was timesteps/sec) -- it's quite a slowdown on Frontier at scale. But it is quite an improvement on full-system Summit (>2x faster for small systems).

Frontier has a much higher message injection rate into the fabric due to the 4 NICs per node (only 1 NIC per node on Summit), so I'm not very surprised by these results. Also, cray-mpich is still being tuned for AMD GPU support, so some lesser-used collectives may not be fully optimized.

Also note that Summit's pt2pt results are tuned with Infiniband network settings that give a ~50% speedup from default settings, so it's impressive that collectives still beat that.

All that to say, I think the benefit of these optimizations depends more heavily on the node design than I originally thought. If the NIC's injection rate is not limiting the MPI_Sends, then it's entirely possible that collectives won't be optimal. One interesting thing to try could be to use MPI_Isend instead of MPI_Send, using the full send buffer as we do with collectives so that each rank has an independent portion of the buffer, but that's an addition for later, I think.
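
A minimal sketch of that MPI_Isend idea, assuming the same single-buffer layout as the collective path; the names and the separate receive side are assumptions, not the PR's code:

```cpp
#include <mpi.h>
#include <vector>

// Illustrative sketch only: post one MPI_Isend per destination from
// disjoint slices of a single send buffer (the layout the collective
// path already uses), instead of blocking MPI_Send calls.
void post_isends_from_shared_buffer(const double *sendbuf, const int *sendcounts,
                                    const int *sdispls, const std::vector<int> &dests,
                                    int tag, MPI_Comm comm)
{
  std::vector<MPI_Request> reqs;
  reqs.reserve(dests.size());

  for (int p : dests) {
    if (sendcounts[p] == 0) continue;   // nothing to send to this rank
    reqs.emplace_back();
    MPI_Isend(sendbuf + sdispls[p], sendcounts[p], MPI_DOUBLE,
              p, tag, comm, &reqs.back());
  }

  // Matching MPI_Irecv calls would be posted elsewhere; here we only
  // wait for the sends to complete.
  MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
}
```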

@stanmoore1
Contributor

@stanmoore1 it's actually the opposite (the metric reported was timesteps/sec) -- it's quite a slowdown on Frontier at scale. But it is quite an improvement on full-system Summit (>2x faster for small systems).

Ah got it, interesting...

@hagertnl
Contributor Author

I believe this code is ready to go then -- I looked at the docs to see if they need updates, but it doesn't look like it.

What I currently have pushed is the implementation where the current processor's data is locally copied and is not sent through MPI. This was faster or comparable in all cases above, and is especially beneficial at small scale.

@stanmoore1 stanmoore1 marked this pull request as ready for review May 13, 2024 20:01
@stanmoore1 stanmoore1 self-requested a review as a code owner May 13, 2024 20:01
@stanmoore1 stanmoore1 assigned stanmoore1 and unassigned hagertnl May 13, 2024
@stanmoore1
Contributor

@hagertnl thanks for all your effort on this. How beneficial do you think it would be to port the changes to Remap to non-Kokkos? (Doesn't need to be done by you or in this PR, just curious...).

@hagertnl
Contributor Author

That's a good question... The main performance change I could see is in that nested for-loop that locally builds the commringlist, which I replaced with MPI_Allreduce, and it's hard to say what will be faster: an Allreduce across a lot of ranks, or a doubly nested for-loop with a lot of iterations. I think as the number of ranks goes up, the Allreduce should be faster. And since most processes send a lot of data to themselves, pulling that communication out of the Alltoallv into a local pack/unpack could be beneficial at all scales.

It would also contribute to code maintainability, I think. There are a few spots where I felt the logic was tricky to follow; I think the Kokkos implementation is more readable than the non-Kokkos version was, mainly from having larger conditional blocks for if (!plan->use_collective) { ... } else { ... } (see the sketch below). When the conditional blocks get interleaved, it's tricky to follow.
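
A minimal sketch of the structure being described; the struct and helper functions are hypothetical stand-ins, not the actual remap code:

```cpp
// Illustrative structure only: keeping the point-to-point and collective
// paths in two separate blocks, rather than interleaving them, is what
// makes the control flow easier to follow.
struct remap_plan_sketch {
  bool use_collective;
};

static void remap_pt2pt(remap_plan_sketch *)      { /* irecv / send / unpack */ }
static void remap_collective(remap_plan_sketch *) { /* pack / MPI_Alltoallv / unpack */ }

void remap_sketch(remap_plan_sketch *plan)
{
  if (!plan->use_collective) {
    remap_pt2pt(plan);        // classic pairwise exchange
  } else {
    remap_collective(plan);   // single collective over the comm ring
  }
}
```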

Given that there's an existing implementation, it would be neat to see if there's any performance difference from the changes!
