A race condition in AAE causes the same eleveldb iterator to be used in two different Erlang processes. This can result in a general protection fault or a segfault, which terminates the Erlang BEAM process immediately.
The condition is caused by several compounding bugs. One of the main ones is in the terminate clause of riak_kv_vnode, which updates all hashtrees on a node's shutdown. With a large amount of data in the AAE store, this causes the vnode to take longer than 60 seconds and crash on its way down.
The race condition is triggered (and has been replicated) via the following:
1. Trigger an AAE exchange for Preflist P1 between Node 1 and Node 2.
2. Stop Node 1 (after it has fired off the riak_kv_index_hashtree:compare/5 call in the riak_kv_exchange_fsm).
3. Trigger an AAE exchange for Preflist P1 between Node 2 and Node 3.
4. Node 2 now has two processes using the same eleveldb iterator.
The race condition occurs because riak_kv_exchange_fsm stops on Node 1's shutdown, which causes the riak_kv_index_hashtree locks to be released on both Node 1 and Node 2. However, the riak_kv_index_hashtree:compare/5 call runs in a spawned process, which is still alive due to the bug mentioned above that keeps the riak_kv_vnode up for up to 60 seconds on a node's shutdown.
The comparison on Node 1 therefore stays active for up to 60 seconds and sends riak_kv_index_hashtree:exchange_segment/2 calls to Node 2. These calls use the eleveldb iterator stored in the riak_kv_index_hashtree state. Meanwhile, the new exchange between Node 2 and Node 3 causes a spawned process to update the eleveldb iterator, save it to state, and then update the hashtree. With an exchange_segment call coming in after this, we now have two processes using the same eleveldb iterator.
This causes the general protection fault or segfault, taking down a node other than the one that was requested to stop.
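The crash mechanism, two callers driving one stateful iterator, can be sketched in Python. All names here are hypothetical stand-ins; the real fault happens inside eleveldb's C++ iterator, which has no guard and simply corrupts memory instead of raising:

```python
import threading
import time

class LevelDbIterator:
    """Stand-in for an eleveldb iterator: stateful and not safe for
    concurrent use. Unlike the real C++ iterator, this one detects
    concurrent access instead of crashing the VM."""
    def __init__(self):
        self._in_use = False

    def next(self, hold=0.0):
        if self._in_use:
            raise RuntimeError("iterator already in use by another process")
        self._in_use = True
        time.sleep(hold)   # time spent walking the tree
        self._in_use = False

# One iterator held in riak_kv_index_hashtree state on Node 2
shared_iter = LevelDbIterator()

# Node 1's still-running compare/5 drives the iterator for a while...
stale_compare = threading.Thread(target=shared_iter.next, kwargs={"hold": 0.3})
stale_compare.start()
time.sleep(0.05)           # let the stale compare grab the iterator

# ...then the new Node 2 / Node 3 exchange touches the same iterator:
try:
    shared_iter.next()
    crashed = False
except RuntimeError:
    crashed = True         # the analogue of the segfault
stale_compare.join()
print("crashed:", crashed)
```

In the real system there is no such check, so the second caller mutates iterator state the first caller is still reading, and the BEAM dies.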
The number of nodes this could potentially take down is the lower of:
- the highest n_val
- the anti-entropy concurrency limit
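That bound is just a minimum of the two values; for example, with a highest n_val of 3 and an anti-entropy concurrency of 2 (hypothetical figures for illustration):

```python
def max_nodes_at_risk(highest_n_val, aae_concurrency):
    # Per the write-up, the blast radius is bounded by whichever
    # of the two limits is smaller.
    return min(highest_n_val, aae_concurrency)

print(max_nodes_at_risk(3, 2))  # -> 2
```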
While this edge case is extremely difficult to hit, we can mitigate the race condition by stopping exchanges before stopping any node in the cluster.
To stop exchanges, do the following:
riak attach
riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, set_mode, [manual], 10000).
riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, cancel_exchanges, [], 10000).
To start the exchanges again:
riak attach
riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, set_mode, [automatic], 10000).