
--hpx:queuing=shared fails for distributed runs #6190

Open

hkaiser opened this issue Mar 7, 2023 · 6 comments

Comments

@hkaiser (Member) commented Mar 7, 2023

From IRC:

[16:37]	beojan: I've noticed that if I use the `--hpx:queuing=shared` option to enable a shared queue across hardware threads, my program crashes when I run it through mpirun with -n >= 2.
[16:38]	beojan: I originally noticed this with my Gaudi port, but it also happens with my toy demo: https://github.com/beojan/HPXDemo
[16:39]	beojan: Here's the error:
[16:39]	beojan: {os-thread}: locality#1/worker-thread#1
[16:39]	beojan: {thread-description}: <unknown>
[16:39]	beojan: {state}: not running
[16:39]	beojan: {auxinfo}: 
[16:39]	beojan: {file}: /home/beojan/Development/src/hpx/src/hpx-1.8.1/libs/core/schedulers/include/hpx/schedulers/thread_queue_mc.hpp
[16:39]	beojan: {line}: 247
[16:39]	beojan: {function}: thread_queue_mc::create_thread
[16:39]	beojan: {what}: staged tasks must have 'pending' as their initial state: HPX(bad_parameter)
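
To make the failure mode concrete: per the report, a distributed HPX program hits this on the non-root locality when the shared-priority scheduler is selected, without any user code being involved. Below is a minimal sketch of the kind of program in question, assuming HPX was built with the MPI parcelport; the file and program names are illustrative and not taken from HPXDemo.

// shared_queue_repro.cpp -- illustrative sketch only, not code from HPXDemo
#include <hpx/hpx.hpp>
#include <hpx/hpx_init.hpp>
#include <hpx/iostream.hpp>

int hpx_main(int, char**)
{
    // Nothing scheduler-specific is needed here; the reported failure occurs
    // inside the scheduler on locality 1, not in user code.
    hpx::cout << "hello from locality " << hpx::get_locality_id() << "\n"
              << std::flush;
    return hpx::finalize();
}

int main(int argc, char* argv[])
{
    // The scheduler is chosen on the command line, e.g.:
    //   mpirun -n 2 ./shared_queue_repro --hpx:queuing=shared
    return hpx::init(argc, argv);
}

Per the report, the failure only appears when --hpx:queuing=shared is combined with mpirun -n >= 2.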
@beojan commented Mar 8, 2023

Here's the full stacktrace for that thread (now with hpx-1.9.0-rc1):

{stack-trace}: 13 frames:
0x7f06aeeb12bb  : /usr/lib/libhpx.so.1(+0x4b12bb) [0x7f06aeeb12bb] in /usr/lib/libhpx.so.1
0x7f06ae7387ec  : std::__exception_ptr::exception_ptr hpx::detail::get_exception<hpx::exception>(hpx::exception const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [0xac] in /usr/lib/libhpx_core.so
0x7f06ae738906  : void hpx::detail::throw_exception<hpx::exception>(hpx::exception const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) [0x76] in /usr/lib/libhpx_core.so
0x7f06ae73e3c1  : hpx::detail::throw_exception(hpx::error, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) [0xd1] in /usr/lib/libhpx_core.so
0x7f06ae86a70b  : /usr/lib/libhpx_core.so(+0x26a70b) [0x7f06ae86a70b] in /usr/lib/libhpx_core.so
0x7f06ae804c54  : hpx::threads::detail::create_background_thread(hpx::threads::policies::scheduler_base&, unsigned long, hpx::threads::detail::scheduling_callbacks&, std::shared_ptr<bool>&, long&) [0x1a4] in /usr/lib/libhpx_core.so
0x7f06ae86bb0e  : /usr/lib/libhpx_core.so(+0x26bb0e) [0x7f06ae86bb0e] in /usr/lib/libhpx_core.so
0x7f06ae86c795  : hpx::threads::detail::scheduled_thread_pool<hpx::threads::policies::shared_priority_queue_scheduler<std::mutex, hpx::threads::policies::concurrentqueue_fifo, hpx::threads::policies::lockfree_lifo> >::thread_func(unsigned long, unsigned long, std::shared_ptr<hpx::util::barrier>) [0x4f5] in /usr/lib/libhpx_core.so
0x7f06ae816695  : std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (hpx::threads::detail::scheduled_thread_pool<hpx::threads::policies::shared_priority_queue_scheduler<std::mutex, hpx::threads::policies::concurrentqueue_fifo, hpx::threads::policies::lockfree_lifo> >::*)(unsigned long, unsigned long, std::shared_ptr<hpx::util::barrier>), hpx::threads::detail::scheduled_thread_pool<hpx::threads::policies::shared_priority_queue_scheduler<std::mutex, hpx::threads::policies::concurrentqueue_fifo, hpx::threads::policies::lockfree_lifo> >*, unsigned long, unsigned long, std::shared_ptr<hpx::util::barrier> > > >::_M_run() [0x55] in /usr/lib/libhpx_core.so
0x7f067cad72c3  : /usr/lib/libstdc++.so.6(+0xd72c3) [0x7f067cad72c3] in /usr/lib/libstdc++.so.6
0x7f067c89ebb5  : /usr/lib/libc.so.6(+0x85bb5) [0x7f067c89ebb5] in /usr/lib/libc.so.6
0x7f067c920d90  : /usr/lib/libc.so.6(+0x107d90) [0x7f067c920d90] in /usr/lib/libc.so.6
{locality-id}: 1
{hostname}: [ (mpi:1) (tcp:127.0.0.1:7911) ]
{process-id}: 68100
{os-thread}: locality#1/worker-thread#5
{thread-description}: <unknown>
{state}: state::pre_main
{auxinfo}: 
{file}: /home/beojan/Development/src/hpx/src/hpx-1.9.0-rc1/libs/core/schedulers/include/hpx/schedulers/thread_queue_mc.hpp
{line}: 249
{function}: thread_queue_mc::create_thread
{what}: staged tasks must have 'pending' as their initial state: HPX(bad_parameter)
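
For context, the cited line is the initial-state validation in thread_queue_mc::create_thread, reached here via create_background_thread while locality 1 is still in pre_main. Roughly, the check amounts to the following sketch, which is a paraphrase built from the error text above and not the verbatim HPX source:

// Paraphrase of the check reported at thread_queue_mc.hpp:249 -- illustrative
// only, not the verbatim HPX source. 'data' stands for the thread_init_data
// describing the task being created.
if (data.initial_state != hpx::threads::thread_schedule_state::pending)
{
    HPX_THROWS_IF(ec, hpx::error::bad_parameter,
        "thread_queue_mc::create_thread",
        "staged tasks must have 'pending' as their initial state");
    return;
}

The open question is why a staged (background) task reaches this queue with a non-pending initial state only when the shared scheduler is used on more than one locality.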

@hkaiser (Member, Author) commented Mar 9, 2023

@beojan I'm not able to reproduce this issue locally. What application did you run?

@beojan commented Mar 9, 2023

My demo app is at https://github.com/beojan/HPXDemo.

@beojan commented Mar 9, 2023

If I use the Intel mpirun executable (with the demo linked against OpenMPI), it doesn't crash, but that is clearly a faulty setup because of the mismatch between the mpirun version and the libmpi version.

My Gaudi port understandably crashes during MPI initialization with such a setup.

@hkaiser (Member, Author) commented Mar 21, 2023

@beojan would you have more information on how we could reproduce this issue? Are you using any specific environment?

@beojan commented Mar 21, 2023

With the demo, I'm running on my laptop (Arch Linux) with HPX 1.9.0-rc1 and OpenMPI 4.1.

You can comment out the TBB and CUDA demos in the CMake file, though you do need oneMKL available to build it.
