
Partitions waiting to handoff indefinitely #1135

Open
patrickkokou opened this issue Apr 27, 2023 · 4 comments

@patrickkokou

I'm running a cluster of 24 nodes with 1024 partitions:
riak_kv_version : <<"2.1.7-226">>
riak version : <<"2.0.5">>

I have 142 partitions that have been waiting to hand off for more than 30 days. There's no ongoing transfer in the cluster.
On the node riak@0037-internal.xx.com, I can see error messages like these:

<0.30120.441>@riak_core_handoff_sender:start_fold:282 hinted transfer of riak_kv_vnode from 'riak@0037-internal.xx.com' 994791641816054037097625320706298110058774396928 to 'riak@0029-internal.xx.com' 994791641816054037097625320706298110058774396928 failed because of error:{badmatch,{error,closed}} [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,132}]}]
<0.9143.441>@riak_core_handoff_sender:start_fold:282 hinted transfer of riak_kv_vnode from 'riak@0037-internal.xx.com' 616571003248974668617179538802181898917346541568 to 'riak@0035-internal.xx.com' 616571003248974668617179538802181898917346541568 failed because of error:{badmatch,{error,closed}} [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,132}]}]

When I check the partition list (riak-admin cluster partitions), I notice that all partitions waiting for handoff are marked as secondary. I was expecting those partitions to be of type primary.

Any idea how to fix this issue?

@martinsumner
Contributor

martinsumner commented Apr 27, 2023

Given that these are hinted handoffs, I think it would be expected that they are handoffs from secondary partitions (i.e. fallback vnodes that were temporarily created to maintain n_val during an outage).

There's been a lot of work done in the last few versions of Riak to try and improve handoff reliability, as there were a lot of problems with handoff timeouts, particularly when handoffs occur during busy periods or when vnodes are very large.

In your version, the first thing is probably to reduce the riak_core handoff_acksync_threshold across your cluster. This reduces the number of batches between acknowledgements.

There may also be value in increasing the riak_core handoff_timeout across the cluster.

There may also be value in increasing the riak_core handoff_receive_vnode_timeout.

These changes can all be made via riak attach and application set_env (they will take effect for the next handoff). You can also add the settings to advanced.config (which will take effect following a restart).

Finally, if you have increased the riak_core handoff_concurrency from the default setting, there may be value in reducing back to the default again.
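
Roughly, the runtime changes from riak attach would look something like the below. This is an untested sketch: the values shown are illustrative only (check the defaults for your riak_core version first), and pushing them cluster-wide with rpc:multicall is just one way of doing it.

```erlang
%% From `riak attach` on a node. Values are examples, not recommendations.
application:set_env(riak_core, handoff_acksync_threshold, 1).          %% fewer batches between acks
application:set_env(riak_core, handoff_timeout, 120000).               %% ms
application:set_env(riak_core, handoff_receive_vnode_timeout, 120000). %% ms
application:set_env(riak_core, handoff_concurrency, 2).                %% back toward the default if you raised it

%% To apply one of these settings to every node in the cluster in one call:
rpc:multicall(application, set_env, [riak_core, handoff_timeout, 120000]).
```

To make the same settings survive a restart, they would go in the riak_core section of advanced.config, e.g.:

```erlang
[
 {riak_core,
  [
   {handoff_acksync_threshold, 1},
   {handoff_timeout, 120000},
   {handoff_receive_vnode_timeout, 120000}
  ]}
].
```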

Monitoring of these handoffs has been improved in recent versions, as working out what exactly is going wrong in older Riak versions is hard. When a handoff fails, it starts to re-send all the data from the beginning, so if the fallback vnodes were created as part of an extended outage (and are quite large) then continuous failures are possible.

If you are confident that all the data is sufficiently covered in your cluster (due to other replicas and anti-entropy mechanisms), in the worst case scenario you can stop each node in turn and manually delete the fallback vnodes. Obviously though, it would be more sustainable to find a configuration which will work for future handoffs.
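
If it helps to work out which running vnodes on a node are fallbacks before going down that route, something like this from riak attach should show them (a rough sketch, assuming riak_core_vnode_manager:all_vnodes/1 and riak_core_ring:my_indices/1 behave the same in your version):

```erlang
%% List riak_kv vnode indices running on this node that the ring does not
%% assign to it - i.e. fallback vnodes still waiting to hand off.
{ok, Ring} = riak_core_ring_manager:get_my_ring(),
Owned = riak_core_ring:my_indices(Ring),
Running = [Idx || {riak_kv_vnode, Idx, _Pid}
                      <- riak_core_vnode_manager:all_vnodes(riak_kv_vnode)],
Running -- Owned.
```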

@patrickkokou
Author

Thanks Martin, I'll try these config changes and see how it goes. Will keep you updated.

@patrickkokou
Author

I made some changes via riak attach and application set_env, and restarted Riak.
That kicked off the transfers again, but now I'm seeing a different error in the Riak error logs:

2023-05-03 01:34:09.787 [error] <0.304.0>@riak_core_ring:check_tainted:263 Error: riak_core_ring/ring_ready called on tainted ring
2023-05-03 01:34:09.787 [error] <0.304.0>@riak_core_ring:check_tainted:263 Error: riak_core_ring/ring_ready called on tainted ring

The transfers seem to be in progress, but I don't understand how to fix this riak_core_ring:check_tainted error.

I need your help again, thanks

@martinsumner
Contributor

martinsumner commented May 3, 2023

I don't really know. I believe the tainted flag was added so that, before a read-only cache of the ring is exported (using mochiglobal), it is marked as tainted; that makes it possible to confirm that such a cached ring is never mistakenly used as the basis for an updated ring - i.e. that code updates the ring from get_raw_ring, not get_my_ring.

So the tainted state and the error messages are a check to make sure this never happens. But clearly, in some rare circumstance it can. Because of this, the unset_tainted function was added so that the state could be fixed from remote_console ... but that isn't available in older versions of Riak.

If the error logs don't go away, there might be another method to clear this status. I don't think it will work, but perhaps riak_core_ring_manager:force_update/0 might be worth a shot. Otherwise, you could compile a new version of the riak_core_ring module with the exported unset_tainted function added, hot code load it, and then use that function to unset the tainted flag.
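
For completeness, the sequence I have in mind from riak attach would be roughly this - untested guesswork, and the path and the patched module are assumptions:

```erlang
%% 1. The long shot - ask the ring manager to force a ring update.
riak_core_ring_manager:force_update().

%% 2. If the errors persist: hot code load a locally patched riak_core_ring
%%    that exports unset_tainted (assumes you have built the patched .beam
%%    and copied it to /tmp/patched on the node).
code:add_patha("/tmp/patched").
code:purge(riak_core_ring).
code:load_file(riak_core_ring).
%% ...then call the exported unset_tainted on the ring and persist the result
%% via the ring manager; the exact call depends on how the module was patched.
```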
