You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
i'm seeing a hang in the logcabin client while the cluster is being initialized.
thread 73
(gdb) bt
#0 0x00007fd69050bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fd690507d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007fd690507c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000a0c57c in __gthread_mutex_lock (__mutex=0x7fd508018580) at /usr/include/c++/4.8.2/x86_64-redhat-linux/bits/gthr-default.h:748
#4 0x0000000000a0ed3e in std::mutex::lock (this=0x7fd508018580) at /usr/include/c++/4.8.2/mutex:134
#5 0x0000000000a0f774 in std::lock_guard<std::mutex>::lock_guard (this=0x7fd627ffeb10, __m=...) at /usr/include/c++/4.8.2/mutex:414
#6 0x0000000000aa1c2c in LogCabin::Event::File::Monitor::disableForever (this=0x7fd508018570) at build/Event/File.cc:55
#7 0x0000000000a9ed94 in LogCabin::RPC::MessageSocket::disconnect (this=0x7fd508018420) at build/RPC/MessageSocket.cc:231
#8 0x0000000000a9ee1e in LogCabin::RPC::MessageSocket::readable (this=0x7fd508018420) at build/RPC/MessageSocket.cc:248
#9 0x0000000000a9e61f in LogCabin::RPC::MessageSocket::ReceiveSocket::handleFileEvent (this=0x7fd508018500, events=25) at build/RPC/MessageSocket.cc:102
#10 0x0000000000aa2802 in LogCabin::Event::Loop::runForever (this=0x33c2510) at build/Event/Loop.cc:162
#11 0x0000000000a218ed in std::_Mem_fn<void (LogCabin::Event::Loop::*)()>::operator()<, void>(LogCabin::Event::Loop*) const (this=0x33c2498, __object=0x33c2510) at /usr/include/c++/4.8.2/functional:601
#12 0x0000000000a21747 in std::_Bind_simple<std::_Mem_fn<void (LogCabin::Event::Loop::*)()> (LogCabin::Event::Loop*)>::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x33c2490) at /usr/include/c++/4.8.2/functional:1732
#13 0x0000000000a215e7 in std::_Bind_simple<std::_Mem_fn<void (LogCabin::Event::Loop::*)()> (LogCabin::Event::Loop*)>::operator()() (this=0x33c2490) at /usr/include/c++/4.8.2/functional:1720
#14 0x0000000000a21518 in std::thread::_Impl<std::_Bind_simple<std::_Mem_fn<void (LogCabin::Event::Loop::*)()> (LogCabin::Event::Loop*)> >::_M_run() (this=0x33c2478) at /usr/include/c++/4.8.2/thread:115
#15 0x00007fd690ce7220 in ?? () from /lib64/libstdc++.so.6
#16 0x00007fd690505dc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007fd690232ced in clone () from /lib64/libc.so.6
(gdb) fr 5
#5 0x0000000000a0f774 in std::lock_guard<std::mutex>::lock_guard (this=0x7fd627ffeb10, __m=...) at /usr/include/c++/4.8.2/mutex:414
414 /usr/include/c++/4.8.2/mutex: No such file or directory.
(gdb) print __m
$1 = (std::lock_guard<std::mutex>::mutex_type &) @0x7fd508018580: {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 2, __count = 0, __owner = 37030, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\002\000\000\000\000\000\000\000\246\220\000\000\001", '\000' <repeats 26 times>, __align = 2}}, <No data fields>}
mutex held by 37030 (thread 12)
(gdb) thread 12
[Switching to thread 12 (Thread 0x7fd54b7fe700 (LWP 37030))]
#0 0x00007fd6905096d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007fd6905096d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000aa3a13 in LogCabin::Core::ConditionVariable::wait (this=0x33c2570, lockGuard=...) at build/Core/ConditionVariable.cc:84
#2 0x0000000000aa2279 in LogCabin::Event::Loop::Lock::Lock (this=0x7fd54b7fcc40, eventLoop=...) at build/Event/Loop.cc:62
#3 0x0000000000aa1c5a in LogCabin::Event::File::Monitor::disableForever (this=0x7fd508018570) at build/Event/File.cc:58
#4 0x0000000000aa1bca in LogCabin::Event::File::Monitor::~Monitor (this=0x7fd508018570, __in_chrg=<optimized out>) at build/Event/File.cc:49
#5 0x0000000000a9eaf0 in LogCabin::RPC::MessageSocket::~MessageSocket (this=0x7fd508018420, __in_chrg=<optimized out>) at build/RPC/MessageSocket.cc:190
#6 0x0000000000a9c482 in std::default_delete<LogCabin::RPC::MessageSocket>::operator() (this=0x7fd5080194b0, __ptr=0x7fd508018420) at /usr/include/c++/4.8.2/bits/unique_ptr.h:67
#7 0x0000000000a9bdfa in std::unique_ptr<LogCabin::RPC::MessageSocket, std::default_delete<LogCabin::RPC::MessageSocket> >::reset (this=0x7fd5080194b0, __p=0x7fd508018420) at /usr/include/c++/4.8.2/bits/unique_ptr.h:262
#8 0x0000000000a9aa2e in LogCabin::RPC::ClientSession::~ClientSession (this=0x7fd508019350, __in_chrg=<optimized out>) at build/RPC/ClientSession.cc:372
#9 0x0000000000a9e114 in std::_Sp_counted_ptr<LogCabin::RPC::ClientSession*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x7fd508021560) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:290
#10 0x00000000005df2e9 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fd508021560) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:144
#11 0x0000000000a0f58b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fd54b7fcf98, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:546
#12 0x0000000000a19d3e in std::__shared_ptr<LogCabin::RPC::ClientSession, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fd54b7fcf90, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:781
#13 0x0000000000a19d58 in std::shared_ptr<LogCabin::RPC::ClientSession>::~shared_ptr (this=0x7fd54b7fcf90, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr.h:93
#14 0x0000000000a25e79 in LogCabin::Client::SessionManager::createSession (this=0x33c26b0, address=..., timeout=..., clusterUUID=0x33c2678, serverId=0x0) at build/Client/SessionManager.cc:122
#15 0x0000000000a2295f in LogCabin::Client::LeaderRPC::getSession (this=0x33c1e90, timeout=...) at build/Client/LeaderRPC.cc:266
#16 0x0000000000a21d1f in LogCabin::Client::LeaderRPC::Call::start (this=0x7fd54b7fd3d0, opCode=LogCabin::Protocol::Client::GET_CONFIGURATION, request=..., timeout=...) at build/Client/LeaderRPC.cc:88
#17 0x0000000000a22460 in LogCabin::Client::LeaderRPC::call (this=0x33c1e90, opCode=LogCabin::Protocol::Client::GET_CONFIGURATION, request=..., response=..., timeout=...) at build/Client/LeaderRPC.cc:189
#18 0x0000000000a14e07 in LogCabin::Client::ClientImpl::getConfiguration (this=0x33c24c8, timeout=...) at build/Client/ClientImpl.cc:497
#19 0x0000000000a0e625 in LogCabin::Client::Cluster::getConfiguration2 (this=0x33c1d00, timeoutNanoseconds=10000000000) at build/Client/Client.cc:546
#20 0x0000000000a0e65e in LogCabin::Client::Cluster::getConfiguration2Ex (this=0x33c1d00, timeoutNanoseconds=10000000000) at build/Client/Client.cc:552
#21 0x000000000086d38d in LogCabinCluster::GetConfiguration (this=this@entry=0x33bf8f0, timeoutNanoseconds=timeoutNanoseconds@entry=10000000000) at LogCabinCluster.cpp:189
#22 0x00000000005793b7 in LogCabinCollector::collectLocked (this=0x3423220) at LogCabinCollector.cpp:73
#23 0x0000000000960fb3 in Forte::Collector::Collect (this=this@entry=0x3423220) at Collector.cpp:30
#24 0x00000000004f0285 in CollectorReqHandler::Handler (this=this@entry=0x33ef890, event=event@entry=0x7fd51c000a90) at CollectorManagerImpl.cpp:302
#25 0x00000000009e7e61 in Forte::ThreadPoolDispatcherWorker::run (this=0x7fd608004680) at ThreadPoolDispatcher.cpp:259
#26 0x00000000009e5a30 in Forte::Thread::startThread (obj=0x7fd608004680) at Thread.cpp:65
#27 0x00007fd690505dc5 in start_thread () from /lib64/libpthread.so.0
#28 0x00007fd690232ced in clone () from /lib64/libc.so.6
(gdb) fr 1
#1 0x0000000000aa3a13 in LogCabin::Core::ConditionVariable::wait (this=0x33c2570, lockGuard=...) at build/Core/ConditionVariable.cc:84
84 in build/Core/ConditionVariable.cc
(gdb) print mutex
$2 = (pthread_mutex_t *) 0x33c2530
(gdb) print *mutex
$3 = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = 0}
haven't dug in fully but i believe the logcabin instance it is attempting to connect to is not yet running so GetConfiguration is timing out. haven't had a chance to fully dig in but from the context it seems that maybe when we don't get a session, we don't get a mutex and the following code is blocked trying to wait on a condition variable with an empty mutex
} else {
pthread_mutex_t* mutex = lockGuard.mutex()->native_handle();
int r = pthread_cond_wait(&cv, mutex);
if (r != 0)
PANIC("pthread_cond_wait failed: %s", strerror(r));
}
a potential fix would be to check the mutex before waiting but i'm not sure if this indicates a deeper problem. potentially related to #82 ?
The text was updated successfully, but these errors were encountered:
fwiw, this has happened once in probably 10s of thousands of installs so it's not pressing all that much. wanted to file it mainly for tracking purposes.
i'm seeing a hang in the logcabin client while the cluster is being initialized.
thread 73
mutex held by 37030 (thread 12)
haven't dug in fully but i believe the logcabin instance it is attempting to connect to is not yet running so GetConfiguration is timing out. haven't had a chance to fully dig in but from the context it seems that maybe when we don't get a session, we don't get a mutex and the following code is blocked trying to wait on a condition variable with an empty mutex
a potential fix would be to check the mutex before waiting but i'm not sure if this indicates a deeper problem. potentially related to #82 ?
The text was updated successfully, but these errors were encountered: