AAE - race condition causes nodes to go down #983

Open
nsaadouni opened this issue Jul 2, 2019 · 0 comments

A race condition in AAE causes the same eleveldb iterator to be used in two different Erlang processes. This can result in a general protection fault or a segfault, which terminates the Erlang beam process immediately.

The condition is caused by several compounding bugs. One of the main ones is in the terminate clause of riak_kv_vnode, which updates all hashtrees on node shutdown. With a large amount of data in the AAE store, this causes the vnode to take longer than 60 seconds to stop, so it crashes on its way down.
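For illustration, here is a simplified sketch of that pattern (this is not the literal riak_kv_vnode code; the state field name and the exact update call are assumptions):

terminate(_Reason, #state{hashtrees = Trees}) ->
    %% Synchronously flush every hashtree before the vnode exits. With a
    %% large AAE store this loop can run well past the 60-second shutdown
    %% timeout, so the vnode crashes on its way down instead of stopping
    %% cleanly.
    _ = [riak_kv_index_hashtree:update(Id, Tree) || {Id, Tree} <- Trees],
    ok.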

The race condition is triggered (and has been replicated) via the following:

  1. Trigger an AAE exchange for Preflist P1 between Node 1 and Node 2
  2. Stop Node 1 (once it has fired off the riak_kv_index_hashtree:compare/5 call in the riak_kv_exchange_fsm)
  3. Trigger an AAE exchange for Preflist P1 between Node 2 and Node 3
  4. Node 2 now has two processes using the same eleveldb iterator

The race condition occurs because the riak_kv_exchange_fsm stops on Node 1's shutdown, which causes the riak_kv_index_hashtree locks to be released on Node 1 and Node 2. However, the riak_kv_index_hashtree:compare/5 call runs in a spawned process, which is still alive due to the bug mentioned above that keeps the riak_kv_vnode up for 60 seconds on node shutdown.

The comparison on Node 1 therefore stays active for up to 60 seconds and sends riak_kv_index_hashtree:exchange_segment/2 calls to Node 2. That call uses the eleveldb iterator stored in the riak_kv_index_hashtree state. Meanwhile, the new exchange between Node 2 and Node 3 causes a spawned process to update the eleveldb iterator, save it to state, and then update the hashtree. When an exchange_segment call arrives after this, we have two processes using the same eleveldb iterator.
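As a minimal illustration of why this is fatal (DbRef stands in for an already-open eleveldb handle, and the two spawns stand in for the two exchange processes):

%% eleveldb iterators are stateful NIF resources and must only ever be
%% driven by one process at a time.
{ok, Itr} = eleveldb:iterator(DbRef, []),
%% "Process A": the stale exchange with Node 1, still issuing
%% exchange_segment calls against the iterator.
spawn(fun() -> eleveldb:iterator_move(Itr, first) end),
%% "Process B": the new Node 2/Node 3 exchange, reusing the iterator
%% held in the riak_kv_index_hashtree state.
spawn(fun() -> eleveldb:iterator_move(Itr, first) end).

Both iterator_move/2 calls mutate the same underlying LevelDB iterator concurrently, which is what produces the fault and takes down the whole VM rather than just the two processes.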

This causes the general protection fault or segfault, taking down a different node from the one that was asked to stop.

The number of nodes this could potentially take down is the lower of:

  1. the highest n_val
  2. the anti-entropy concurrency limit

For example, with a highest n_val of 3 and a concurrency limit of 2, up to two nodes could be taken down.

While the edge case is extremely difficult to hit, we can mitigate the race condition by stopping exchanges before stopping any node in the cluster.


To stop exchanges, do the following:

riak attach

riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, set_mode, [manual], 10000).
riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, cancel_exchanges, [], 10000).

To start the exchanges again:

riak attach

riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, set_mode, [automatic], 10000).
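Both snippets are run from the Erlang shell that riak attach opens. riak_core_util:rpc_every_member_ann/4 applies the call on every cluster member (here with a 10 second timeout) and returns the results annotated per node, so it is worth confirming that no node is reported as down or timed out before going ahead with the node stop.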
