Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#2074: support sparse OfflineLB maps #2145

Open
wants to merge 27 commits into
base: develop
Choose a base branch
from

Conversation

thearusable
Copy link
Contributor

@thearusable thearusable commented May 9, 2023

Closes #2074

Depends on #2137
Depends on #2250

@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch from 744a987 to 09b0807 Compare May 9, 2023 16:26
@github-actions
Copy link

github-actions bot commented May 9, 2023

Pipelines results

PR tests (gcc-12, ubuntu, mpich)

Build for 30cc7f4 (2024-04-22 15:38:34 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-9, ubuntu, mpich)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-10, ubuntu, openmpi, no LB)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-9, ubuntu, mpich, zoltan, json schema test)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-12, ubuntu, mpich)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-11, ubuntu, mpich)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-13, ubuntu, mpich)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-14, ubuntu, mpich)

Build for 30cc7f4 (2024-04-22 15:38:34 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-8, ubuntu, mpich, address sanitizer)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-10, ubuntu, mpich)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (intel icpc, ubuntu, mpich)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get full report use -qopt-report=4 -qopt-report-phase ipo
remark #11074: Inlining inhibited by limit max-size 
remark #11074: Inlining inhibited by limit max-total-size 
remark #11076: To get %0D%0A%0D%0A%0D%0A ==> And there is more. Read log. <==

Build log


PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]"
          detected during:
            instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]" 
/vt/src/vt/objgroup/proxy/proxy_objgroup.impl.h(204): here
            instantiation of "vt::objgroup::proxy::Proxy<ObjT>::PendingSendType vt::objgroup::proxy::Proxy<ObjT>::reduce<f,Op,Target,Args...>(Target, Args &&...) const [with ObjT=vt::vrt::collection::lb::GreedyLB, f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Op=vt::collective::PlusOp, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>, Args=<vt::vrt::collection::lb::GreedyPayload>]" 
/vt/src/vt/vrt/collection/balance/greedylb/greedylb.cc(222): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(135): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153%0D%0A%0D%0A%0D%0A ==> And there is more. Read log. <==

Build log


PR tests (clang-13, alpine, mpich)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (intel icpx, ubuntu, mpich)

Build for 30cc7f4 (2024-04-22 15:38:34 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich)

Build for 30cc7f4 (2024-04-22 15:38:34 UTC)

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "double" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, double> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "int" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, int> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

Testing - passed

Build log


PR tests (intel icpx, ubuntu, mpich, verbose)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-12, ubuntu, mpich, verbose)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-14, ubuntu, mpich, verbose)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich, verbose)

Build for 2bc70b0 (2024-06-04 15:49:10 UTC)

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "double" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, double> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "int" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, int> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

/vt/tests/perf/send_cost.cc(169): warning #177-D: variable "prevNode" was declared but never referenced
    auto const prevNode = (thisNode - 1 + num_nodes_) % num_nodes_;
               ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

Testing - passed

Build log


@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch 2 times, most recently from 3fa3511 to 5b75e31 Compare May 16, 2023 16:33
@thearusable thearusable self-assigned this May 16, 2023
@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch from 5b75e31 to f9c5f29 Compare May 22, 2023 10:07
@thearusable thearusable marked this pull request as ready for review May 22, 2023 16:39
tests/unit/lb/test_lb_reader.nompi.cc Outdated Show resolved Hide resolved
tests/unit/lb/test_offlinelb.cc Outdated Show resolved Hide resolved
Comment on lines 57 to 63
void OfflineLB::runLB(TimeType) {
auto const& distro = theLBDataReader()->getDistro(phase_ + 1);
for (auto&& elm : distro) {
auto nextPhase = theLBDataReader()->findNextPhase(phase_);
auto const distro = theLBDataReader()->getDistro(nextPhase);
for (auto&& elm : *distro) {
migrateObjectTo(elm, theContext()->getNode());
}
theLBDataReader()->clearDistro(phase_ + 1);
theLBDataReader()->clearDistro(nextPhase);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If phase_ + 1 was skipped, does this new logic grab phase_ + 2 if it exists (whether explicitly described in the file or listed as identical) as a surrogate?

@lifflander Is this the desired behavior or do we want to disallow running OfflineLB when we're missing information about the phase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly, it will grab phase_ + 2 or whatever is the next existing phase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nlslatt I changed the behavior to disallow running of OfflineLB when the phase is skipped.

if (local_changed_distro[i]) {
PhaseType curr = 0, next;
for (;curr < num_phases_ - 1;) {
next = findNextPhase(curr);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my other question, do we want to allow OfflineLB to operate over data from non-consecutive phases (where one is explicitly listed as skipped as opposed to just identical)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nlslatt I updated that to operate only on consecutive phases.

@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch from 0d2eb73 to 044d7c4 Compare June 5, 2023 11:13
@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch 2 times, most recently from ef91bad to 8b20790 Compare June 12, 2023 14:56
@thearusable thearusable requested a review from nlslatt June 13, 2023 11:56
@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch from 56592af to c9120ba Compare June 20, 2023 17:08
@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch from c9120ba to 3f0a613 Compare October 19, 2023 14:00
@thearusable
Copy link
Contributor Author

@nlslatt, @lifflander - I rebased this PR and fixed all conflicts. PR is ready for review.

@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch from 3f0a613 to b3e07a4 Compare October 31, 2023 14:50
Copy link
Collaborator

@nlslatt nlslatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The testing still needs work.

tests/unit/lb/test_lb_reader.nompi.cc Outdated Show resolved Hide resolved
tests/unit/lb/test_offlinelb.cc Outdated Show resolved Hide resolved
tests/unit/lb/test_offlinelb.cc Outdated Show resolved Hide resolved
tests/unit/lb/test_offlinelb.cc Outdated Show resolved Hide resolved
Comment on lines 227 to 230
"0 OfflineLB\n"
"1 NoLB\n"
"2 NoLB\n"
"3 NoLB\n"
"4 OfflineLB\n"
"5 OfflineLB\n"
"6 NoLB\n";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this configuration, coupled with your choice of which phases are identical to previous, means that OfflineLB by definition has nothing to migrate (for example, phase 1 is said to be identical to phase 0, so nothing should migrate when OfflineLB runs on phase 0). As such, this does not appear to be a good test of the sparse functionality.

Also, I think the phases that you have defined as identical to previous contradict what was loaded into the LBDataHolder (e.g., phase 1 is marked as identical to phase 0, but the LBDataHolder content indicates that they are not identical).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I updated the configuration for this test so the OfflineLB will be able to do some migrations. Also I removed unnecessary phases from LBDataHolder.
If you have more suggestions, please let me know.

src/vt/vrt/collection/balance/lb_data_restart_reader.h Outdated Show resolved Hide resolved
@thearusable thearusable marked this pull request as draft November 29, 2023 13:54
@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch 2 times, most recently from 8a72a9f to 77911e8 Compare November 30, 2023 18:05
@thearusable thearusable marked this pull request as ready for review November 30, 2023 18:17
@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch from cf157d9 to 2bc70b0 Compare June 4, 2024 15:49
@thearusable
Copy link
Contributor Author

@nlslatt Ok, I changed it to abort when VT is misconfigured.

@nlslatt
Copy link
Collaborator

nlslatt commented Jun 4, 2024

@nlslatt Ok, I changed it to abort when VT is misconfigured.

@thearusable Your check only covers one of two mechanisms for enabling OfflineLB. Perhaps first check if OfflineLB is enabled by any mechanism, and then abort if the necessary flag was not passed.

@thearusable
Copy link
Contributor Author

@nlslatt Ah, yes, right! I'll update that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow for sparse OfflineLB maps that may be used with and LB configuration
2 participants