Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deadlock in client near LogCabin::Event::File::Monitor::~Monitor on GetConfiguration timeout #226

Open
nhardt opened this issue Jan 20, 2017 · 2 comments

Comments

@nhardt
Copy link
Contributor

nhardt commented Jan 20, 2017

i'm seeing a hang in the logcabin client while the cluster is being initialized.

thread 73

(gdb) bt
#0  0x00007fd69050bf4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fd690507d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007fd690507c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000a0c57c in __gthread_mutex_lock (__mutex=0x7fd508018580) at /usr/include/c++/4.8.2/x86_64-redhat-linux/bits/gthr-default.h:748
#4  0x0000000000a0ed3e in std::mutex::lock (this=0x7fd508018580) at /usr/include/c++/4.8.2/mutex:134
#5  0x0000000000a0f774 in std::lock_guard<std::mutex>::lock_guard (this=0x7fd627ffeb10, __m=...) at /usr/include/c++/4.8.2/mutex:414
#6  0x0000000000aa1c2c in LogCabin::Event::File::Monitor::disableForever (this=0x7fd508018570) at build/Event/File.cc:55
#7  0x0000000000a9ed94 in LogCabin::RPC::MessageSocket::disconnect (this=0x7fd508018420) at build/RPC/MessageSocket.cc:231
#8  0x0000000000a9ee1e in LogCabin::RPC::MessageSocket::readable (this=0x7fd508018420) at build/RPC/MessageSocket.cc:248
#9  0x0000000000a9e61f in LogCabin::RPC::MessageSocket::ReceiveSocket::handleFileEvent (this=0x7fd508018500, events=25) at build/RPC/MessageSocket.cc:102
#10 0x0000000000aa2802 in LogCabin::Event::Loop::runForever (this=0x33c2510) at build/Event/Loop.cc:162
#11 0x0000000000a218ed in std::_Mem_fn<void (LogCabin::Event::Loop::*)()>::operator()<, void>(LogCabin::Event::Loop*) const (this=0x33c2498, __object=0x33c2510) at /usr/include/c++/4.8.2/functional:601
#12 0x0000000000a21747 in std::_Bind_simple<std::_Mem_fn<void (LogCabin::Event::Loop::*)()> (LogCabin::Event::Loop*)>::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x33c2490) at /usr/include/c++/4.8.2/functional:1732
#13 0x0000000000a215e7 in std::_Bind_simple<std::_Mem_fn<void (LogCabin::Event::Loop::*)()> (LogCabin::Event::Loop*)>::operator()() (this=0x33c2490) at /usr/include/c++/4.8.2/functional:1720
#14 0x0000000000a21518 in std::thread::_Impl<std::_Bind_simple<std::_Mem_fn<void (LogCabin::Event::Loop::*)()> (LogCabin::Event::Loop*)> >::_M_run() (this=0x33c2478) at /usr/include/c++/4.8.2/thread:115
#15 0x00007fd690ce7220 in ?? () from /lib64/libstdc++.so.6
#16 0x00007fd690505dc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007fd690232ced in clone () from /lib64/libc.so.6
(gdb) fr 5
#5  0x0000000000a0f774 in std::lock_guard<std::mutex>::lock_guard (this=0x7fd627ffeb10, __m=...) at /usr/include/c++/4.8.2/mutex:414
414	/usr/include/c++/4.8.2/mutex: No such file or directory.
(gdb) print __m
$1 = (std::lock_guard<std::mutex>::mutex_type &) @0x7fd508018580: {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 2, __count = 0, __owner = 37030, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = "\002\000\000\000\000\000\000\000\246\220\000\000\001", '\000' <repeats 26 times>, __align = 2}}, <No data fields>}

mutex held by 37030 (thread 12)

(gdb) thread 12
[Switching to thread 12 (Thread 0x7fd54b7fe700 (LWP 37030))]
#0  0x00007fd6905096d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fd6905096d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000aa3a13 in LogCabin::Core::ConditionVariable::wait (this=0x33c2570, lockGuard=...) at build/Core/ConditionVariable.cc:84
#2  0x0000000000aa2279 in LogCabin::Event::Loop::Lock::Lock (this=0x7fd54b7fcc40, eventLoop=...) at build/Event/Loop.cc:62
#3  0x0000000000aa1c5a in LogCabin::Event::File::Monitor::disableForever (this=0x7fd508018570) at build/Event/File.cc:58
#4  0x0000000000aa1bca in LogCabin::Event::File::Monitor::~Monitor (this=0x7fd508018570, __in_chrg=<optimized out>) at build/Event/File.cc:49
#5  0x0000000000a9eaf0 in LogCabin::RPC::MessageSocket::~MessageSocket (this=0x7fd508018420, __in_chrg=<optimized out>) at build/RPC/MessageSocket.cc:190
#6  0x0000000000a9c482 in std::default_delete<LogCabin::RPC::MessageSocket>::operator() (this=0x7fd5080194b0, __ptr=0x7fd508018420) at /usr/include/c++/4.8.2/bits/unique_ptr.h:67
#7  0x0000000000a9bdfa in std::unique_ptr<LogCabin::RPC::MessageSocket, std::default_delete<LogCabin::RPC::MessageSocket> >::reset (this=0x7fd5080194b0, __p=0x7fd508018420) at /usr/include/c++/4.8.2/bits/unique_ptr.h:262
#8  0x0000000000a9aa2e in LogCabin::RPC::ClientSession::~ClientSession (this=0x7fd508019350, __in_chrg=<optimized out>) at build/RPC/ClientSession.cc:372
#9  0x0000000000a9e114 in std::_Sp_counted_ptr<LogCabin::RPC::ClientSession*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x7fd508021560) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:290
#10 0x00000000005df2e9 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fd508021560) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:144
#11 0x0000000000a0f58b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fd54b7fcf98, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:546
#12 0x0000000000a19d3e in std::__shared_ptr<LogCabin::RPC::ClientSession, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fd54b7fcf90, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:781
#13 0x0000000000a19d58 in std::shared_ptr<LogCabin::RPC::ClientSession>::~shared_ptr (this=0x7fd54b7fcf90, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/shared_ptr.h:93
#14 0x0000000000a25e79 in LogCabin::Client::SessionManager::createSession (this=0x33c26b0, address=..., timeout=..., clusterUUID=0x33c2678, serverId=0x0) at build/Client/SessionManager.cc:122
#15 0x0000000000a2295f in LogCabin::Client::LeaderRPC::getSession (this=0x33c1e90, timeout=...) at build/Client/LeaderRPC.cc:266
#16 0x0000000000a21d1f in LogCabin::Client::LeaderRPC::Call::start (this=0x7fd54b7fd3d0, opCode=LogCabin::Protocol::Client::GET_CONFIGURATION, request=..., timeout=...) at build/Client/LeaderRPC.cc:88
#17 0x0000000000a22460 in LogCabin::Client::LeaderRPC::call (this=0x33c1e90, opCode=LogCabin::Protocol::Client::GET_CONFIGURATION, request=..., response=..., timeout=...) at build/Client/LeaderRPC.cc:189
#18 0x0000000000a14e07 in LogCabin::Client::ClientImpl::getConfiguration (this=0x33c24c8, timeout=...) at build/Client/ClientImpl.cc:497
#19 0x0000000000a0e625 in LogCabin::Client::Cluster::getConfiguration2 (this=0x33c1d00, timeoutNanoseconds=10000000000) at build/Client/Client.cc:546
#20 0x0000000000a0e65e in LogCabin::Client::Cluster::getConfiguration2Ex (this=0x33c1d00, timeoutNanoseconds=10000000000) at build/Client/Client.cc:552
#21 0x000000000086d38d in LogCabinCluster::GetConfiguration (this=this@entry=0x33bf8f0, timeoutNanoseconds=timeoutNanoseconds@entry=10000000000) at LogCabinCluster.cpp:189
#22 0x00000000005793b7 in LogCabinCollector::collectLocked (this=0x3423220) at LogCabinCollector.cpp:73
#23 0x0000000000960fb3 in Forte::Collector::Collect (this=this@entry=0x3423220) at Collector.cpp:30
#24 0x00000000004f0285 in CollectorReqHandler::Handler (this=this@entry=0x33ef890, event=event@entry=0x7fd51c000a90) at CollectorManagerImpl.cpp:302
#25 0x00000000009e7e61 in Forte::ThreadPoolDispatcherWorker::run (this=0x7fd608004680) at ThreadPoolDispatcher.cpp:259
#26 0x00000000009e5a30 in Forte::Thread::startThread (obj=0x7fd608004680) at Thread.cpp:65
#27 0x00007fd690505dc5 in start_thread () from /lib64/libpthread.so.0
#28 0x00007fd690232ced in clone () from /lib64/libc.so.6
(gdb) fr 1
#1  0x0000000000aa3a13 in LogCabin::Core::ConditionVariable::wait (this=0x33c2570, lockGuard=...) at build/Core/ConditionVariable.cc:84
84	in build/Core/ConditionVariable.cc
(gdb) print mutex
$2 = (pthread_mutex_t *) 0x33c2530
(gdb) print *mutex
$3 = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = 0}

haven't dug in fully but i believe the logcabin instance it is attempting to connect to is not yet running so GetConfiguration is timing out. haven't had a chance to fully dig in but from the context it seems that maybe when we don't get a session, we don't get a mutex and the following code is blocked trying to wait on a condition variable with an empty mutex

    } else {
        pthread_mutex_t* mutex = lockGuard.mutex()->native_handle();
        int r = pthread_cond_wait(&cv, mutex);
        if (r != 0)
            PANIC("pthread_cond_wait failed: %s", strerror(r));
    }

a potential fix would be to check the mutex before waiting but i'm not sure if this indicates a deeper problem. potentially related to #82 ?

@ongardie
Copy link
Member

ongardie commented Feb 3, 2017

Sorry for the delay. I'll try to look into this early next week.

@nhardt
Copy link
Contributor Author

nhardt commented Feb 3, 2017

fwiw, this has happened once in probably 10s of thousands of installs so it's not pressing all that much. wanted to file it mainly for tracking purposes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants