
Parcelport fails to initialize when multiple jobs run on the same cluster #6097

Open
antoniupop opened this issue Dec 7, 2022 · 4 comments



antoniupop commented Dec 7, 2022

Expected Behavior

Expected is that multiple independent jobs (e.g. SLURM job array) can run concurrently on the same cluster (on disjoint sets of nodes, not co-scheduled).
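For concreteness, the kind of submission meant here is a job array where each element is an independent HPX run (the script name and array size below are placeholders):

# submit four independent instances of the same HPX job; SLURM is free to
# place them on disjoint sets of nodes and run them concurrently
sbatch --array=0-3 run_hpx_job.sh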

Actual Behavior

Only one (or none) of the jobs is able to run while all others crash at initialization with the following errors:

the bootstrap parcelport (tcp) has failed to initialize on locality 0:
<unknown>: HPX(network_error),
bailing out
terminate called without an active exception
srun: error: queue1-dy-m5a2xlarge-1: task 0: Exited with exit code 255
the bootstrap parcelport (tcp) has failed to initialize on locality 4294967295:
<unknown>: HPX(network_error),
bailing out
terminate called without an active exception
the bootstrap parcelport (tcp) has failed to initialize on locality 4294967295:
<unknown>: HPX(network_error),
bailing out

Steps to Reproduce the Problem

Schedule multiple jobs on a SLURM cluster without dependencies, each using only a subset of the nodes (allowing the SLURM scheduler to start multiple instances concurrently on disjoint sets of nodes).
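A minimal job script for this, assuming an HPX application called my_hpx_app and two nodes per array task (both placeholders), would look roughly like:

#!/bin/bash
#SBATCH --job-name=hpx-array
# four independent array tasks, no dependencies between them
#SBATCH --array=0-3
# each task only uses a small subset of the cluster's nodes
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# each array task gets its own disjoint node set, so several of them
# end up initializing HPX on the cluster at the same time
srun ./my_hpx_app

Submitting this while the cluster has enough free nodes for more than one array task is enough to trigger the failure.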

I also tried using the MPI parcelport and disabling TCP, to no avail (the error changes, but initialization still fails).

Specifications

  • HPX Version: 1.7.1 and 1.8.1 (both tried)
  • Platform (compiler, OS): Ubuntu / GCC

hkaiser commented Dec 7, 2022

Disabling the TCP parcelport should help. How did you disable it?

antoniupop (author) commented:

> Disabling the TCP parcelport should help. How did you disable it?

I've used -DHPX_WITH_PARCELPORT_MPI=ON -DHPX_WITH_PARCELPORT_TCP=OFF when building HPX.
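For reference, the full configure step looked roughly like this (the source path and build type are placeholders):

cmake -DCMAKE_BUILD_TYPE=Release \
      -DHPX_WITH_PARCELPORT_MPI=ON \
      -DHPX_WITH_PARCELPORT_TCP=OFF \
      /path/to/hpx-source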


hkaiser commented Dec 7, 2022

> Disabling the TCP parcelport should help. How did you disable it?

> I've used -DHPX_WITH_PARCELPORT_MPI=ON -DHPX_WITH_PARCELPORT_TCP=OFF when building HPX.

Could you give us the error message you see in this case, please?


antoniupop commented Dec 8, 2022

> I've used -DHPX_WITH_PARCELPORT_MPI=ON -DHPX_WITH_PARCELPORT_TCP=OFF when building HPX.

> Could you give us the error message you see in this case, please?

I was previously getting an error message along the lines of "failed to initialize the parcelport", but now it crashes with the following backtrace:

0x7f5be6dfc3c0  : /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f5be6dfc3c0] in /lib/x86_64-linux-gnu/libpthread.so.0
0x7f5be6437513  : /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1(+0x6ec513) [0x7f5be6437513] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be5f8e5f7  : hpx::parcelset::detail::parcel_await_apply(hpx::parcelset::parcel&&, hpx::util::function<void (std::error_code const&, hpx::parcelset::parcel const&), false>&&, unsigned int, hpx::util::unique_function<void (hpx::parcelset::parcel&&, hpx::util::function<void (std::error_code const&, hpx::parcelset::parcel const&), false>&&), false>) [0xc7] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be643cbc2  : void hpx::agas::big_boot_barrier::apply<hpx::actions::direct_action<void (*)(hpx::agas::registration_header const&), &hpx::agas::register_worker, hpx::actions::detail::this_type>, hpx::agas::registration_header>(unsigned int, unsigned int, hpx::parcelset::locality, hpx::actions::direct_action<void (*)(hpx::agas::registration_header const&), &hpx::agas::register_worker, hpx::actions::detail::this_type>, hpx::agas::registration_header&&) [0x1a2] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be64362cc  : hpx::agas::big_boot_barrier::wait_hosted(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) [0x4fc] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be644c563  : hpx::runtime_distributed::initialize_agas() [0x283] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be644fd47  : hpx::runtime_distributed::runtime_distributed(hpx::util::runtime_configuration&, int (*)(hpx::runtime_mode)) [0xf17] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be62e176e  : hpx::detail::run_or_start(hpx::util::function<int (hpx::program_options::variables_map&), false> const&, int, char**, hpx::init_params const&, bool) [0xd8e] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1

I'm not quite sure what changed for it to crash now instead; I'm still trying to reproduce the previous behaviour, but there is no difference between the code now being run with the MPI parcelport and the initial code using TCP.
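For completeness, both builds are launched the same way (the binary name and node counts are placeholders); if I read the HPX runtime-configuration docs correctly, the parcelports can also be toggled per run via --hpx:ini, which I can try as well:

# same launch line for the TCP and the MPI builds; only the HPX build
# configuration differs between the two runs
srun -N 2 -n 2 ./my_hpx_app

# possibly also selectable at run time instead of at build time (untested here):
srun -N 2 -n 2 ./my_hpx_app --hpx:ini=hpx.parcel.mpi.enable=1 --hpx:ini=hpx.parcel.tcp.enable=0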
