-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#2074: support sparse OfflineLB maps #2145
base: develop
Are you sure you want to change the base?
Conversation
744a987
to
09b0807
Compare
Pipelines resultsPR tests (gcc-12, ubuntu, mpich) Build for 30cc7f4 (2024-04-22 15:38:34 UTC)
PR tests (clang-9, ubuntu, mpich) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (gcc-10, ubuntu, openmpi, no LB) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (gcc-9, ubuntu, mpich, zoltan, json schema test) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (clang-12, ubuntu, mpich) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (clang-11, ubuntu, mpich) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (clang-13, ubuntu, mpich) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (clang-14, ubuntu, mpich) Build for 30cc7f4 (2024-04-22 15:38:34 UTC)
PR tests (gcc-8, ubuntu, mpich, address sanitizer) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (clang-10, ubuntu, mpich) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (intel icpc, ubuntu, mpich) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (clang-13, alpine, mpich) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (intel icpx, ubuntu, mpich) Build for 30cc7f4 (2024-04-22 15:38:34 UTC)
PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich) Build for 30cc7f4 (2024-04-22 15:38:34 UTC)
PR tests (intel icpx, ubuntu, mpich, verbose) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (gcc-12, ubuntu, mpich, verbose) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (clang-14, ubuntu, mpich, verbose) Build for f076ec9 (2024-05-24 12:04:00 UTC)
PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich, verbose) Build for f076ec9 (2024-05-24 12:04:00 UTC)
|
3fa3511
to
5b75e31
Compare
5b75e31
to
f9c5f29
Compare
void OfflineLB::runLB(TimeType) { | ||
auto const& distro = theLBDataReader()->getDistro(phase_ + 1); | ||
for (auto&& elm : distro) { | ||
auto nextPhase = theLBDataReader()->findNextPhase(phase_); | ||
auto const distro = theLBDataReader()->getDistro(nextPhase); | ||
for (auto&& elm : *distro) { | ||
migrateObjectTo(elm, theContext()->getNode()); | ||
} | ||
theLBDataReader()->clearDistro(phase_ + 1); | ||
theLBDataReader()->clearDistro(nextPhase); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If phase_ + 1
was skipped, does this new logic grab phase_ + 2
if it exists (whether explicitly described in the file or listed as identical) as a surrogate?
@lifflander Is this the desired behavior or do we want to disallow running OfflineLB when we're missing information about the phase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly, it will grab phase_ + 2
or whatever is the next existing phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nlslatt I changed the behavior to disallow running of OfflineLB when the phase is skipped.
if (local_changed_distro[i]) { | ||
PhaseType curr = 0, next; | ||
for (;curr < num_phases_ - 1;) { | ||
next = findNextPhase(curr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to my other question, do we want to allow OfflineLB to operate over data from non-consecutive phases (where one is explicitly listed as skipped as opposed to just identical)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nlslatt I updated that to operate only on consecutive phases.
0d2eb73
to
044d7c4
Compare
ef91bad
to
8b20790
Compare
56592af
to
c9120ba
Compare
c9120ba
to
3f0a613
Compare
@nlslatt, @lifflander - I rebased this PR and fixed all conflicts. PR is ready for review. |
3f0a613
to
b3e07a4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The testing still needs work.
tests/unit/lb/test_offlinelb.cc
Outdated
"0 OfflineLB\n" | ||
"1 NoLB\n" | ||
"2 NoLB\n" | ||
"3 NoLB\n" | ||
"4 OfflineLB\n" | ||
"5 OfflineLB\n" | ||
"6 NoLB\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this configuration, coupled with your choice of which phases are identical to previous, means that OfflineLB
by definition has nothing to migrate (for example, phase 1 is said to be identical to phase 0, so nothing should migrate when OfflineLB
runs on phase 0). As such, this does not appear to be a good test of the sparse functionality.
Also, I think the phases that you have defined as identical to previous contradict what was loaded into the LBDataHolder
(e.g., phase 1 is marked as identical to phase 0, but the LBDataHolder
content indicates that they are not identical).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I updated the configuration for this test so the OfflineLB
will be able to do some migrations. Also I removed unnecessary phases from LBDataHolder
.
If you have more suggestions, please let me know.
8a72a9f
to
77911e8
Compare
dac04e0
to
8eedc00
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of it looks great, but I think there's an issue with some of the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your recent changes look good. One last thing to address, I think. It segfaults when OfflineLB
is requested but the user forgets to also use the command-line argument --vt_lb_data_in
. We need to give a usable error message instead.
Thanks for catching that issue! Looks like it also happens on the Instead of showing the error message I changed the code to allow for more relaxed
Please let me know if that is ok to you or do you prefer to show a message that vt is not configured properly to run |
@lifflander and I discussed it, and we prefer just showing an error message when |
Closes #2074
Depends on #2137Depends on #2250