Implement kspace_modify collective yes in KOKKOS package #4143

Draft: wants to merge 10 commits into base branch develop.

Conversation

hagertnl (Contributor)

Summary

Implements the kspace_modify collective yes option in the KOKKOS package with KSPACE enabled.
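
For reference, enabling the feature from an input script looks like the following minimal sketch (the accuracy value and the command-line switches are placeholders, not taken from this PR's test inputs):

```
# run with the KOKKOS package active, e.g.  lmp -k on g 1 -sf kk -in in.water
kspace_style    pppm 1.0e-4        # the kk suffix selects pppm/kk
kspace_modify   collective yes     # use MPI collectives for the FFT remap
```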

Related Issue(s)

closes #4140 ([Feature Request] Implement MPI collective operations in KOKKOS implementation of KSPACE)

Author(s)

Nick Hagerty, Oak Ridge National Laboratory

Licensing

By submitting this pull request, I agree, that my contribution will be included in LAMMPS and redistributed under either the GNU General Public License version 2 (GPL v2) or the GNU Lesser General Public License version 2.1 (LGPL v2.1).

Backward Compatibility

No change in backward compatibility.

Implementation Notes

A simple water box with PPPM enabled was run on 1 and 8 ranks on a single node, and on 512, 4096, and 32K ranks for scaling analysis.

Post Submission Checklist

  • The feature or features in this pull request is complete
  • Licensing information is complete
  • Corresponding author information is complete
  • The source code follows the LAMMPS formatting guidelines
  • Suitable new documentation files and/or updates to the existing docs are included
  • The added/updated documentation is integrated and tested with the documentation build system
  • The feature has been verified to work with the conventional build system
  • The feature has been verified to work with the CMake based build system
  • Suitable tests have been added to the unittest tree.
  • A package specific README file has been included or updated
  • One or more example input decks are included

Further Information, Files, and Links

@stanmoore1 (Contributor)

A simple water box with PPPM enabled was run on 1 and 8 ranks on a single node, and on 512, 4096, and 32K ranks for scaling analysis.

@hagertnl this wasn't actually being used in PPPMKokkos; you may need to rerun the tests.

@stanmoore1 (Contributor)

I tested this on 40 MPI ranks on CPU and on 4 V100 GPUs and it seems to give the correct numerics; nice work @hagertnl. Really curious what your scaling analysis will show, especially at large scale.

@hagertnl (Contributor, Author)

I figured that out when my performance numbers were turning out to be the exact same, hah! I still need to tie in the kspace_modify keyword to control the setting (right now I just locally hard-coded in a 1 for collectives).

As a preliminary result, at 512 nodes (4096 ranks) I see about a 13% speedup, with correct answers. I'd like to do a quick round of optimization; I think the setup is doing double the work it needs to for the collectives, creating excessive overhead for the collective approach.

Not surprisingly, small-scale runs did not see any benefit, and the case I tested showed some performance regression. This is why I wanted to do a round of optimization; it seems like there may be excessive overhead.

@hagertnl (Contributor, Author)

Here are some preliminary numbers from Frontier, 8 ranks per node (MPI timings from rank 1):

# Nodes | coll/pt2pt | timesteps/s | Total MPI time | MPI_Send time | MPI_Alltoallv time
64      | coll       | 0.316       | 26.91          | 4.06          | 19.40
64      | pt2pt      | 0.360       | 65.15          | 58.30         | 0.17
512     | coll       | 0.288       | 52.30          | 3.76          | 37.02
512     | pt2pt      | 0.258       | 121.64         | 111.13        | 0.53
4096    | coll       | 0.161       | 119.88         | 2.92          | 76.71
4096    | pt2pt      | 0.300       | 112.38         | 76.03         | 16.56

Some of these jobs hit network timeouts, and these timings aren't averaged, so these numbers may change slightly run-to-run. These runs are also without the patch from #4133, so attempting to draw conclusions across job sizes is tricky.

The conclusions I draw from this currently are:

  • At 64 and 512 nodes, there's a >50% drop in total MPI time when using collectives relative to pt2pt, yet at 64 nodes the collective is still slower per-timestep. I suspect there's a lot of overhead in the way I'm currently packing the buffer for Alltoallv (see the sketch after this list).
  • In spite of the possible overhead, we still see a >10% speedup at 512 nodes.
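
To make the suspected packing cost concrete, here is a self-contained sketch of the pack / MPI_Alltoallv / unpack pattern a collective remap follows (illustrative only, not the PR's remap code; counts and values are faked). Every rank gathers all outgoing data into one contiguous buffer ordered by destination before a single collective call, so any inefficiency in that packing is paid on every FFT remap:

```cpp
// Illustrative MPI_Alltoallv exchange (not LAMMPS code).
// Build with: mpicxx -std=c++17 alltoallv_sketch.cpp -o alltoallv_sketch
#include <mpi.h>
#include <cstdio>
#include <numeric>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int me, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  // Fake per-destination message sizes (stand-in for grid-overlap sizes).
  std::vector<int> sendcounts(nprocs), sdispls(nprocs);
  std::vector<int> recvcounts(nprocs), rdispls(nprocs);
  for (int p = 0; p < nprocs; ++p) sendcounts[p] = 4 + (me + p) % 3;

  // All ranks must agree on counts before the collective can be posted.
  MPI_Alltoall(sendcounts.data(), 1, MPI_INT,
               recvcounts.data(), 1, MPI_INT, MPI_COMM_WORLD);
  std::exclusive_scan(sendcounts.begin(), sendcounts.end(), sdispls.begin(), 0);
  std::exclusive_scan(recvcounts.begin(), recvcounts.end(), rdispls.begin(), 0);

  // The packing step: copy everything, in destination order, into one buffer.
  const int nsend = sdispls.back() + sendcounts.back();
  const int nrecv = rdispls.back() + recvcounts.back();
  std::vector<double> sendbuf(nsend), recvbuf(nrecv);
  for (int p = 0; p < nprocs; ++p)
    for (int i = 0; i < sendcounts[p]; ++i)
      sendbuf[sdispls[p] + i] = 100.0 * me + p;   // fake grid values

  // One collective replaces the per-neighbor send/recv exchange.
  MPI_Alltoallv(sendbuf.data(), sendcounts.data(), sdispls.data(), MPI_DOUBLE,
                recvbuf.data(), recvcounts.data(), rdispls.data(), MPI_DOUBLE,
                MPI_COMM_WORLD);

  if (me == 0) std::printf("rank 0 received %d doubles\n", nrecv);
  MPI_Finalize();
  return 0;
}
```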

@stanmoore1 currently, the collective code for kokkos and non-kokkos PPPM is fairly apples-to-apples. From your end, what would you prefer with respect to the scope of work in this PR? (1) finish verifying the correctness of the implementation we have now, then merge, and optimize both kokkos and non-kokkos PPPM collectives in a later PR; (2) optimize just kokkos PPPM now (diverging the Kokkos implementation of remap from non-Kokkos); (3) optimize kokkos & non-kokkos PPPM collectives now; or something else?

@stanmoore1 (Contributor), Apr 23, 2024

@hagertnl it is odd that the 512 node case is slower than 4096. Do these results include the patch with better load balancing by not using full xy planes, like we saw previously?

From your end, what would you prefer with respect to the scope of work in this PR? (1) finish verifying the correctness of the implementation we have now, then merge, and optimize both kokkos and non-kokkos PPPM collectives in a later PR; (2) optimize just kokkos PPPM now (diverging the Kokkos implementation of remap from non-Kokkos); (3) optimize kokkos & non-kokkos PPPM collectives now; or something else?

This is mostly up to you and how much time you have to work on this. If you have more time it would be great to optimize more, otherwise we can merge now as-is.

@hagertnl (Contributor, Author)

This does not include the patch for better load balancing. Is that something you'd like me to put in this PR? I thought it might've been in another PR already.

@stanmoore1 (Contributor)

I think the load balancing issue is confounding the pt2pt vs coll comparison in the table above. It sounded like Steve agreed that we should apply that patch since we aren't using 2D FFTs when we have full xy planes. It is not in a PR yet.

@hagertnl (Contributor, Author)

Sounds good. I'll apply the patch in this PR and try out some optimization.

@hagertnl (Contributor, Author)

This is close to what I hope is the final version. I am currently running scaling studies on Frontier & Summit to test. At small scale, pt2pt is 10%+ faster than collective, but that's to be expected. I'm very interested in what happens at 512+ nodes.

Note -- I replaced a ~100-line section of code that worked out what other ranks needed to communicate when building commringlist with a single Allreduce. I assume that if the user wants to use Alltoallv, then Allreduce should be performant on the platform as well, and it saves running a nested for-loop that gets very expensive at scale. I did some testing of this vs. the old 100-line section of code and it is performance-neutral for 1-64 nodes. I did not have a chance to run at 512+, but unless the Allreduce is prohibitively expensive, I don't see a reason the local commringlist building would be faster or preferred.
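
For illustration, the shape of that replacement is roughly the following (a generic sketch, not the PR's code; the participation test is faked, whereas the real remap would derive it from grid extents). Each rank flags its own participation locally, one MPI_Allreduce merges the flags, and the ring list is read off the merged array instead of every rank re-deriving every other rank's extents in a nested loop:

```cpp
// Illustrative Allreduce-based construction of a rank list (not LAMMPS code).
// Build with: mpicxx commring_sketch.cpp -o commring_sketch
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int me, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  // Fake participation rule: pretend only the lower half of the ranks own
  // FFT grid data and therefore take part in the remap.
  std::vector<int> flags(nprocs, 0);
  if (me < (nprocs + 1) / 2) flags[me] = 1;

  // One collective replaces the nested loop over all ranks.
  std::vector<int> merged(nprocs, 0);
  MPI_Allreduce(flags.data(), merged.data(), nprocs, MPI_INT, MPI_MAX,
                MPI_COMM_WORLD);

  // Every rank can now build the same ring list locally.
  std::vector<int> commringlist;
  for (int p = 0; p < nprocs; ++p)
    if (merged[p]) commringlist.push_back(p);

  if (me == 0)
    std::printf("ring has %zu of %d ranks\n", commringlist.size(), nprocs);
  MPI_Finalize();
  return 0;
}
```

The reduction moves O(nprocs) integers per rank, but it avoids the nested loop over all ranks that the old ~100-line section performed, which is the part that gets expensive at scale.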

@stanmoore1 (Contributor)

I assume that if the user wants to use Alltoallv, then the Allreduce should be performant on the platform as well, and it saves running a nested for-loop that gets very expensive at scale.

I agree with this.

@hagertnl (Contributor, Author)

In profiling (to validate message sizes), I noticed that ranks send themselves a significant amount of data. For example, in the Alltoallv call below, about 746 MB were sent in total by the MPI_Alltoallv, and half of that -- 373 MB -- was sent by the process to itself:

[Rank 0] MPI_Alltoallv started 1714096596.363301992, ended 1714096596.480638027 (elapsed 0.117336035), sent 746496000 bytes total, max of 373248000 to rank 0, received 746496000 bytes total, max of 373248000 from rank 0, to MPI_COMM_WORLD

I am thinking it is worth checking the performance impact of implementing plan->self like the non-collective strategy does, since it can substantially decrease the total bytes sent by MPI. Ideally, MPI implementations are smart enough to use a local copy for Alltoallv messages to self, but that doesn't appear to be true with cray-mpich.
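
A minimal sketch of that check (illustrative, not LAMMPS code): copy the on-rank block directly from the send buffer to the receive buffer and zero the self entries in the count arrays, which is roughly what the pt2pt path's plan->self handling amounts to. In the profiled call above, about half of the 746 MB was a self-message, so excluding it would roughly halve the data handed to MPI_Alltoallv.

```cpp
// Illustrative "handle self locally, exclude it from the collective" pattern
// (not LAMMPS code).  Build with: mpicxx -std=c++17 self_sketch.cpp -o self_sketch
#include <mpi.h>
#include <cstdio>
#include <cstring>
#include <numeric>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int me, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  // Fake layout: every rank exchanges 8 doubles with every rank, itself included.
  std::vector<int> sendcounts(nprocs, 8), sdispls(nprocs);
  std::vector<int> recvcounts(nprocs, 8), rdispls(nprocs);
  std::exclusive_scan(sendcounts.begin(), sendcounts.end(), sdispls.begin(), 0);
  std::exclusive_scan(recvcounts.begin(), recvcounts.end(), rdispls.begin(), 0);
  std::vector<double> sendbuf(8 * nprocs, double(me)), recvbuf(8 * nprocs, -1.0);

  // Move the self portion with a plain copy instead of routing it through MPI ...
  std::memcpy(&recvbuf[rdispls[me]], &sendbuf[sdispls[me]],
              sendcounts[me] * sizeof(double));
  const int kept_local = sendcounts[me];

  // ... and tell the collective there is nothing to exchange with ourselves.
  sendcounts[me] = 0;
  recvcounts[me] = 0;

  MPI_Alltoallv(sendbuf.data(), sendcounts.data(), sdispls.data(), MPI_DOUBLE,
                recvbuf.data(), recvcounts.data(), rdispls.data(), MPI_DOUBLE,
                MPI_COMM_WORLD);

  if (me == 0)
    std::printf("kept %d doubles local, exchanged the rest collectively\n",
                kept_local);
  MPI_Finalize();
  return 0;
}
```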

@stanmoore1 (Contributor)

I am thinking it is worth checking the performance impact of implementing plan->self like the non-collective strategy does, since it can substantially decrease the total bytes sent by MPI. Ideally, MPI implementations are smart enough to use a local copy for Alltoallv messages to self, but that doesn't appear to be true with cray-mpich.

Interesting, yes that would be good to check.
