Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#2074: support sparse OfflineLB maps #2145

Open
wants to merge 25 commits into
base: develop
Choose a base branch
from

Conversation

thearusable
Copy link
Contributor

@thearusable thearusable commented May 9, 2023

Closes #2074

Depends on #2137
Depends on #2250

@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch from 744a987 to 09b0807 Compare May 9, 2023 16:26
@github-actions
Copy link

github-actions bot commented May 9, 2023

Pipelines results

PR tests (gcc-12, ubuntu, mpich)

Build for 30cc7f4 (2024-04-22 15:38:34 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-9, ubuntu, mpich)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-10, ubuntu, openmpi, no LB)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-9, ubuntu, mpich, zoltan, json schema test)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-12, ubuntu, mpich)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-11, ubuntu, mpich)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-13, ubuntu, mpich)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-14, ubuntu, mpich)

Build for 30cc7f4 (2024-04-22 15:38:34 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-8, ubuntu, mpich, address sanitizer)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-10, ubuntu, mpich)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (intel icpc, ubuntu, mpich)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

/vt/src/vt/pipe/pipe_manager.impl.h(133): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]"
          detected during:
            instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>]" 
/vt/src/vt/objgroup/proxy/proxy_objgroup.impl.h(202): here
            instantiation of "vt::objgroup::proxy::Proxy<ObjT>::PendingSendType vt::objgroup::proxy::Proxy<ObjT>::reduce<f,Op,Target,Args...>(Target, Args &&...) const [with ObjT=vt::vrt::collection::lb::GreedyLB, f=&vt::vrt::collection::lb::GreedyLB::collectHandler, Op=vt::collective::PlusOp, Target=vt::objgroup::proxy::ProxyElm<vt::vrt::collection::lb::GreedyLB>, Args=<vt::vrt::collection::lb::GreedyPayload>]" 
/vt/src/vt/vrt/collection/balance/greedylb/greedylb.cc(222): here

/vt/src/vt/pipe/pipe_manager.impl.h(133): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(133): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153): here

/vt/src/vt/pipe/pipe_manager.impl.h(133): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&MyObj::handler, Target=vt::objgroup::proxy::ProxyElm<MyObj>]" 
/vt/examples/callback/callback.cc(147): here

/vt/src/vt/pipe/pipe_manager.impl.h(133): warning: missing return statement at end of non-void function "vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]"
          detected during instantiation of "auto vt::pipe::PipeManager::makeSend<f,Target>(Target) [with f=&colHan, Target=vt::vrt::collection::VrtElmProxy<MyCol, vt::Index1D>]" 
/vt/examples/callback/callback.cc(153%0D%0A%0D%0A%0D%0A ==> And there is more. Read log. <==

Build log


PR tests (clang-13, alpine, mpich)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (intel icpx, ubuntu, mpich)

Build for 30cc7f4 (2024-04-22 15:38:34 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich)

Build for 30cc7f4 (2024-04-22 15:38:34 UTC)

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "double" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, double> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "int" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, int> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

Testing - passed

Build log


PR tests (intel icpx, ubuntu, mpich, verbose)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (gcc-12, ubuntu, mpich, verbose)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (clang-14, ubuntu, mpich, verbose)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

Compilation - successful

Testing - passed

Build log


PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich, verbose)

Build for f076ec9 (2024-05-24 12:04:00 UTC)

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "double" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, double> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=double]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/vt/lib/CLI/CLI/CLI11.hpp(1029): warning #2361-D: invalid narrowing conversion from "int" to "unsigned long"
          TT { std::declval<CC>() }
               ^
          detected during:
            instantiation of "vt::CLI::detail::is_direct_constructible<T, C>::test [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" based on template arguments <std::vector<std::string, std::allocator<std::string>>, int> at line 1041
            instantiation of class "vt::CLI::detail::is_direct_constructible<T, C> [with T=std::vector<std::string, std::allocator<std::string>>, C=int]" at line 5005
            instantiation of "void vt::CLI::Option::results(T &) const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 5034
            instantiation of "T vt::CLI::Option::as<T>() const [with T=std::vector<std::string, std::allocator<std::string>>]" at line 7315

Testing - passed

Build log


@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch 2 times, most recently from 3fa3511 to 5b75e31 Compare May 16, 2023 16:33
@thearusable thearusable self-assigned this May 16, 2023
@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch from 5b75e31 to f9c5f29 Compare May 22, 2023 10:07
@thearusable thearusable marked this pull request as ready for review May 22, 2023 16:39
tests/unit/lb/test_lb_reader.nompi.cc Outdated Show resolved Hide resolved
tests/unit/lb/test_offlinelb.cc Outdated Show resolved Hide resolved
Comment on lines 57 to 63
void OfflineLB::runLB(TimeType) {
auto const& distro = theLBDataReader()->getDistro(phase_ + 1);
for (auto&& elm : distro) {
auto nextPhase = theLBDataReader()->findNextPhase(phase_);
auto const distro = theLBDataReader()->getDistro(nextPhase);
for (auto&& elm : *distro) {
migrateObjectTo(elm, theContext()->getNode());
}
theLBDataReader()->clearDistro(phase_ + 1);
theLBDataReader()->clearDistro(nextPhase);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If phase_ + 1 was skipped, does this new logic grab phase_ + 2 if it exists (whether explicitly described in the file or listed as identical) as a surrogate?

@lifflander Is this the desired behavior or do we want to disallow running OfflineLB when we're missing information about the phase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly, it will grab phase_ + 2 or whatever is the next existing phase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nlslatt I changed the behavior to disallow running of OfflineLB when the phase is skipped.

if (local_changed_distro[i]) {
PhaseType curr = 0, next;
for (;curr < num_phases_ - 1;) {
next = findNextPhase(curr);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my other question, do we want to allow OfflineLB to operate over data from non-consecutive phases (where one is explicitly listed as skipped as opposed to just identical)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nlslatt I updated that to operate only on consecutive phases.

@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch from 0d2eb73 to 044d7c4 Compare June 5, 2023 11:13
@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch 2 times, most recently from ef91bad to 8b20790 Compare June 12, 2023 14:56
@thearusable thearusable requested a review from nlslatt June 13, 2023 11:56
@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch from 56592af to c9120ba Compare June 20, 2023 17:08
@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch from c9120ba to 3f0a613 Compare October 19, 2023 14:00
@thearusable
Copy link
Contributor Author

@nlslatt, @lifflander - I rebased this PR and fixed all conflicts. PR is ready for review.

@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch from 3f0a613 to b3e07a4 Compare October 31, 2023 14:50
Copy link
Collaborator

@nlslatt nlslatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The testing still needs work.

tests/unit/lb/test_lb_reader.nompi.cc Outdated Show resolved Hide resolved
tests/unit/lb/test_offlinelb.cc Outdated Show resolved Hide resolved
tests/unit/lb/test_offlinelb.cc Outdated Show resolved Hide resolved
tests/unit/lb/test_offlinelb.cc Outdated Show resolved Hide resolved
Comment on lines 227 to 230
"0 OfflineLB\n"
"1 NoLB\n"
"2 NoLB\n"
"3 NoLB\n"
"4 OfflineLB\n"
"5 OfflineLB\n"
"6 NoLB\n";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this configuration, coupled with your choice of which phases are identical to previous, means that OfflineLB by definition has nothing to migrate (for example, phase 1 is said to be identical to phase 0, so nothing should migrate when OfflineLB runs on phase 0). As such, this does not appear to be a good test of the sparse functionality.

Also, I think the phases that you have defined as identical to previous contradict what was loaded into the LBDataHolder (e.g., phase 1 is marked as identical to phase 0, but the LBDataHolder content indicates that they are not identical).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I updated the configuration for this test so the OfflineLB will be able to do some migrations. Also I removed unnecessary phases from LBDataHolder.
If you have more suggestions, please let me know.

src/vt/vrt/collection/balance/lb_data_restart_reader.h Outdated Show resolved Hide resolved
@thearusable thearusable marked this pull request as draft November 29, 2023 13:54
@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch 2 times, most recently from 8a72a9f to 77911e8 Compare November 30, 2023 18:05
@thearusable thearusable marked this pull request as ready for review November 30, 2023 18:17
@thearusable thearusable force-pushed the 2074-support-sparse-offlinelb-maps branch from dac04e0 to 8eedc00 Compare May 21, 2024 14:01
Copy link
Collaborator

@nlslatt nlslatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of it looks great, but I think there's an issue with some of the tests.

tests/unit/runtime/test_initialization.cc Outdated Show resolved Hide resolved
Copy link
Collaborator

@nlslatt nlslatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your recent changes look good. One last thing to address, I think. It segfaults when OfflineLB is requested but the user forgets to also use the command-line argument --vt_lb_data_in. We need to give a usable error message instead.

@thearusable
Copy link
Contributor Author

Your recent changes look good. One last thing to address, I think. It segfaults when OfflineLB is requested but the user forgets to also use the command-line argument --vt_lb_data_in. We need to give a usable error message instead.

Thanks for catching that issue! Looks like it also happens on the develop.

Instead of showing the error message I changed the code to allow for more relaxed OfflineLB related configuration.
With latest changes we will check if we need LBDataRestartReader to be created when we supply one or more of those parameters:
--vt_lb_data_in, --vt_lb_name, --vt_lb_file_name.
Then we will check if the --vt_lb_file_name exists and if it requires the OfflineLB to be run.

bool data_in = arg_config_->config_.vt_lb_data_in;
bool requested_offline_lb = arg_config_->config_.vt_lb_name == get_lb_names()[LBType::OfflineLB];
bool has_file = arg_config_->config_.vt_lb_file_name != "";

if (data_in || requested_offline_lb || has_file) {
  auto& config_file = arg_config_->config_.vt_lb_file_name;
  if (config_file != "") {
    bool const has_spec = ReadLBConfig::openConfig(config_file);
    if (has_spec) {
      return ReadLBConfig::hasOfflineLB();
    }
  }
  return true;
}

Please let me know if that is ok to you or do you prefer to show a message that vt is not configured properly to run OfflineLB.

@nlslatt
Copy link
Collaborator

nlslatt commented May 29, 2024

Your recent changes look good. One last thing to address, I think. It segfaults when OfflineLB is requested but the user forgets to also use the command-line argument --vt_lb_data_in. We need to give a usable error message instead.

Thanks for catching that issue! Looks like it also happens on the develop.

Instead of showing the error message I changed the code to allow for more relaxed OfflineLB related configuration. With latest changes we will check if we need LBDataRestartReader to be created when we supply one or more of those parameters: --vt_lb_data_in, --vt_lb_name, --vt_lb_file_name. Then we will check if the --vt_lb_file_name exists and if it requires the OfflineLB to be run.

[snip]

Please let me know if that is ok to you or do you prefer to show a message that vt is not configured properly to run OfflineLB.

@lifflander and I discussed it, and we prefer just showing an error message when OfflineLB is requested but --vt_lb_data_in is not specified. Automatically enabling --vt_lb_data_in has undesirable consequences.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow for sparse OfflineLB maps that may be used with and LB configuration
2 participants