Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data race in peer #493

Open
as51340 opened this issue Mar 21, 2024 · 3 comments
Open

Data race in peer #493

as51340 opened this issue Mar 21, 2024 · 3 comments

Comments

@as51340
Copy link

as51340 commented Mar 21, 2024

I found data race using TSAN in my code. The issue seems to be in peer.

WARNING: ThreadSanitizer: data race (pid=182953)
  Write of size 8 at 0x7b5800000e50 by thread T5 (mutexes: write M2013, write M1987):
    #0 operator delete(void*) /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_new_delete.cpp:126 (coordinator_instance+0x2c2d215)
    #1 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/shared_ptr_base.h:184 (coordinator_instance+0x486baa4)
    #2 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/shared_ptr_base.h:705 (coordinator_instance+0x486baa4)
    #3 std::__shared_ptr<nuraft::peer, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/shared_ptr_base.h:1154 (coordinator_instance+0x486baa4)
    #4 std::__shared_ptr<nuraft::peer, (__gnu_cxx::_Lock_policy)2>::reset() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/shared_ptr_base.h:1272 (coordinator_instance+0x486baa4)
    #5 nuraft::raft_server::reconfigure(std::shared_ptr<nuraft::cluster_config> const&) /home/andi/Memgraph/code/memgraph/libs/nuraft/src/handle_commit.cxx:717 (coordinator_instance+0x486baa4)

  Previous atomic read of size 1 at 0x7b5800000e50 by thread T4:
    #0 pthread_mutex_lock /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4249 (coordinator_instance+0x2ba6914)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/x86_64-linux-gnu/bits/gthr-default.h:749 (coordinator_instance+0x4894ad7)
    #2 std::mutex::lock() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/std_mutex.h:100 (coordinator_instance+0x4894ad7)
    #3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/std_mutex.h:229 (coordinator_instance+0x4894ad7)
    #4 nuraft::timer_helper::set_duration_us(unsigned long) /home/andi/Memgraph/code/memgraph/libs/nuraft/include/libnuraft/internal_timer.hxx:57 (coordinator_instance+0x4894ad7)
    #5 nuraft::timer_helper::set_duration_ms(unsigned long) /home/andi/Memgraph/code/memgraph/libs/nuraft/include/libnuraft/internal_timer.hxx:62 (coordinator_instance+0x4894ad7)
    #6 nuraft::peer::handle_rpc_result(std::shared_ptr<nuraft::peer>, std::shared_ptr<nuraft::rpc_client>, std::shared_ptr<nuraft::req_msg>&, std::shared_ptr<nuraft::cmd_result<std::shared_ptr<nuraft::resp_msg>, std::shared_ptr<nuraft::rpc_exception> > >&, std::shared_ptr<nuraft::resp_msg>&, std::shared_ptr<nuraft::rpc_exception>&) /home/andi/Memgraph/code/memgraph/libs/nuraft/src/peer.cxx:146 (coordinator_instance+0x4894ad7)

  Mutex M2013 (0x7b6c00001298) created at:
    #0 pthread_mutex_trylock /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1296 (coordinator_instance+0x2b9a324)
    #1 __gthread_mutex_trylock(pthread_mutex_t*) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/x86_64-linux-gnu/bits/gthr-default.h:758 (coordinator_instance+0x4867b9f)
    #2 std::mutex::try_lock() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/std_mutex.h:111 (coordinator_instance+0x4867b9f)
    #3 std::unique_lock<std::mutex>::unique_lock(std::mutex&, std::try_to_lock_t) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/unique_lock.h:78 (coordinator_instance+0x4867b9f)
    #4 nuraft::raft_server::commit_in_bg_exec(unsigned long) /home/andi/Memgraph/code/memgraph/libs/nuraft/src/handle_commit.cxx:161 (coordinator_instance+0x4867b9f)

  Mutex M1987 (0x7b6c000011f0) created at:
    #0 pthread_mutex_lock /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4249 (coordinator_instance+0x2ba6914)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/x86_64-linux-gnu/bits/gthr-default.h:749 (coordinator_instance+0x4880979)
    #2 __gthread_recursive_mutex_lock(pthread_mutex_t*) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/x86_64-linux-gnu/bits/gthr-default.h:811 (coordinator_instance+0x4880979)
    #3 std::recursive_mutex::lock() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/mutex:108 (coordinator_instance+0x4880979)
    #4 std::unique_lock<std::recursive_mutex>::lock() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/unique_lock.h:139 (coordinator_instance+0x4880979)
    #5 std::unique_lock<std::recursive_mutex>::unique_lock(std::recursive_mutex&) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/unique_lock.h:69 (coordinator_instance+0x4880979)
    #6 nuraft::raft_server::restart_election_timer() /home/andi/Memgraph/code/memgraph/libs/nuraft/src/handle_timeout.cxx:164 (coordinator_instance+0x4880979)
    #7 memgraph::coordination::RaftState::MakeRaftState(memgraph::coordination::CoordinatorInstanceInitConfig const&, std::function<void ()>&&, std::function<void ()>&&) /home/andi/Memgraph/code/memgraph/src/coordination/raft_state.cpp:104 (coordinator_instance+0x2dc1ace)
    #8 memgraph::coordination::CoordinatorInstance::CoordinatorInstance(memgraph::coordination::CoordinatorInstanceInitConfig const&) /home/andi/Memgraph/code/memgraph/src/coordination/coordinator_instance.cpp:36 (coordinator_instance+0x2cb89eb)
    #9 CoordinatorInstanceTest_ConnectCoordinators_Test::TestBody() /home/andi/Memgraph/code/memgraph/tests/unit/coordinator_instance.cpp:67 (coordinator_instance+0x2c30777)
    #10 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ??:? (coordinator_instance+0x2fa05b5)
    #11 main /home/andi/Memgraph/code/memgraph/tests/unit/main.cpp:19 (coordinator_instance+0x2cb83be)

  Thread T5 'nuraft_commit' (tid=182959, running) created by main thread at:
    #0 pthread_create /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:977 (coordinator_instance+0x2bbb63a)
    #1 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) /root/memgraph/environment/toolchain/build/gcc-11.2.0/build/x86_64-linux-gnu/libstdc++-v3/include/x86_64-linux-gnu/bits/gthr-default.h:663 (coordinator_instance+0x60720b9)
    #2 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) /root/memgraph/environment/toolchain/build/gcc-11.2.0/build/x86_64-linux-gnu/libstdc++-v3/src/c++11/../../../../../libstdc++-v3/src/c++11/thread.cc:147 (coordinator_instance+0x60720b9)
    #3 memgraph::coordination::RaftState::MakeRaftState(memgraph::coordination::CoordinatorInstanceInitConfig const&, std::function<void ()>&&, std::function<void ()>&&) /home/andi/Memgraph/code/memgraph/src/coordination/raft_state.cpp:104 (coordinator_instance+0x2dc1ace)
    #4 memgraph::coordination::CoordinatorInstance::CoordinatorInstance(memgraph::coordination::CoordinatorInstanceInitConfig const&) /home/andi/Memgraph/code/memgraph/src/coordination/coordinator_instance.cpp:36 (coordinator_instance+0x2cb89eb)
    #5 CoordinatorInstanceTest_ConnectCoordinators_Test::TestBody() /home/andi/Memgraph/code/memgraph/tests/unit/coordinator_instance.cpp:67 (coordinator_instance+0x2c30777)
    #6 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ??:? (coordinator_instance+0x2fa05b5)
    #7 main /home/andi/Memgraph/code/memgraph/tests/unit/main.cpp:19 (coordinator_instance+0x2cb83be)

  Thread T4 'nuraft_w_0' (tid=182958, running) created by main thread at:
    #0 pthread_create /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:977 (coordinator_instance+0x2bbb63a)
    #1 __gthread_create /root/memgraph/environment/toolchain/build/gcc-11.2.0/build/x86_64-linux-gnu/libstdc++-v3/include/x86_64-linux-gnu/bits/gthr-default.h:663 (coordinator_instance+0x60720b9)
    #2 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) /root/memgraph/environment/toolchain/build/gcc-11.2.0/build/x86_64-linux-gnu/libstdc++-v3/src/c++11/../../../../../libstdc++-v3/src/c++11/thread.cc:147 (coordinator_instance+0x60720b9)
    #3 memgraph::coordination::RaftState::MakeRaftState(memgraph::coordination::CoordinatorInstanceInitConfig const&, std::function<void ()>&&, std::function<void ()>&&) /home/andi/Memgraph/code/memgraph/src/coordination/raft_state.cpp:104 (coordinator_instance+0x2dc1ace)
    #4 memgraph::coordination::CoordinatorInstance::CoordinatorInstance(memgraph::coordination::CoordinatorInstanceInitConfig const&) /home/andi/Memgraph/code/memgraph/src/coordination/coordinator_instance.cpp:36 (coordinator_instance+0x2cb89eb)
    #5 CoordinatorInstanceTest_ConnectCoordinators_Test::TestBody() /home/andi/Memgraph/code/memgraph/tests/unit/coordinator_instance.cpp:67 (coordinator_instance+0x2c30777)
    #6 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ??:? (coordinator_instance+0x2fa05b5)
    #7 main /home/andi/Memgraph/code/memgraph/tests/unit/main.cpp:19 (coordinator_instance+0x2cb83be)

SUMMARY: ThreadSanitizer: data race /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_new_delete.cpp:126 in operator delete(void*)
=

Relevant code that is using NuRaft here is related to adding servers, we do:

srv_config const srv_config_to_add(static_cast<int>(config.coordinator_id), endpoint);

  auto cmd_result = raft_server_->add_srv(srv_config_to_add);

  if (cmd_result->get_result_code() == nuraft::cmd_result_code::OK) {
    spdlog::info("Request to add server {} to the cluster accepted", endpoint);
  } else {
    throw RaftAddServerException("Failed to accept request to add server {} to the cluster with error code {}",
                                 endpoint, int(cmd_result->get_result_code()));
  }

  // Waiting for server to join
  constexpr int max_tries{10};
  auto maybe_stop = utils::ResettableCounter<max_tries>();
  constexpr int waiting_period{200};
  bool added{false};
  while (!maybe_stop()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(waiting_period));
    const auto server_config = raft_server_->get_srv_config(static_cast<nuraft::int32>(config.coordinator_id));
    if (server_config) {
      spdlog::trace("Server with id {} added to cluster", config.coordinator_id);
      added = true;
      break;
    }
  }

  if (!added) {
    throw RaftAddServerException("Failed to add server {} to the cluster in {}ms", endpoint,
                                 max_tries * waiting_period);
  }
@greensky00
Copy link
Contributor

From what you provided, it seems there is a race between srv_to_join_.reset() and peer::lock_. However, that is not possible as peer::handle_rpc_result has myself not to make the reference counter of the shared pointer become 0. I haven't seen TSAN alert during the normal server add/removal process.

  • Can you please provide the exact commit hash of NuRaft that you used?
  • Also, do you get this TSAN alert every time you run the above code? If so, it will be great to share the logs generated by NuRaft to see what happened.

@as51340
Copy link
Author

as51340 commented Mar 28, 2024

Thanks for the reply!

We are using NuRaft v2.1.0
Yes, TSAN alerts on every run.

Can you instruct me on how to get logs?

@greensky00
Copy link
Contributor

I tried the same code with the same NuRaft version, still didn't get TSAN alert.

Could you share your build environment? Such as OS version, compiler version, etc. Also please share Raft parameters (heartbeat, election lower/upper timeout, ...).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants