
Partitions waiting to handoff indefinitely #1135

Open
patrickkokou opened this issue Apr 27, 2023 · 4 comments

@patrickkokou

I'm running a cluster of 24 nodes with 1024 partitions:
riak_kv_version : <<"2.1.7-226">>
riak version : <<"2.0.5">>

I have 142 partitions that have been waiting to hand off for more than 30 days. There's no ongoing transfer in the cluster.
On the node riak@0037-internal.xx.com, I can see error messages like these:

<0.30120.441>@riak_core_handoff_sender:start_fold:282 hinted transfer of riak_kv_vnode from 'riak@0037-internal.xx.com' 994791641816054037097625320706298110058774396928 to 'riak@0029-internal.xx.com' 994791641816054037097625320706298110058774396928 failed because of error:{badmatch,{error,closed}} [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,132}]}]
<0.9143.441>@riak_core_handoff_sender:start_fold:282 hinted transfer of riak_kv_vnode from 'riak@0037-internal.xx.com' 616571003248974668617179538802181898917346541568 to 'riak@0035-internal.xx.com' 616571003248974668617179538802181898917346541568 failed because of error:{badmatch,{error,closed}} [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,132}]}]

When I check the partition list (riak-admin cluster partitions), I notice that all partitions waiting for handoff are marked as secondary. I was expecting those partitions to be of type primary.

Any idea how to fix this issue?

@martinsumner
Contributor

martinsumner commented Apr 27, 2023

Given that these are hinted handoffs, I think it would be expected that they are handoffs from secondary partitions (i.e. fallback vnodes that were temporarily created to maintain n_val during an outage).

There's been a lot of work done in the last few versions of Riak to try and improve handoff reliability, as there were a lot of problems with handoff timeouts, particularly when handoffs occur during busy periods or when vnodes are very large.

In your version, the first thing is probably to reduce the riak_core handoff_acksync_threshold across your cluster. This reduces the number of batches between acknowledgements.

There may also be value in increasing the riak_core handoff_timeout across the cluster.

There may also be value in increasing the riak_core handoff_receive_vnode_timeout.

These changes can all be made via riak attach and application set_env (they will take effect for the next handoff). You can also add the settings to advanced.config (which will take effect following a restart).

Finally, if you have increased the riak_core handoff_concurrency from the default setting, there may be value in reducing back to the default again.
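
Roughly, the runtime changes from riak attach would look something like the below. This is an untested sketch: the values shown are illustrative only (check the defaults for your riak_core version first), and pushing them cluster-wide with rpc:multicall is just one way of doing it.

```erlang
%% From `riak attach` on a node. Values are examples, not recommendations.
application:set_env(riak_core, handoff_acksync_threshold, 1).          %% fewer batches between acks
application:set_env(riak_core, handoff_timeout, 120000).               %% ms
application:set_env(riak_core, handoff_receive_vnode_timeout, 120000). %% ms
application:set_env(riak_core, handoff_concurrency, 2).                %% back toward the default if you raised it

%% To apply one of these settings to every node in the cluster in one call:
rpc:multicall(application, set_env, [riak_core, handoff_timeout, 120000]).
```

To make the same settings survive a restart, they would go in the riak_core section of advanced.config, e.g.:

```erlang
[
 {riak_core,
  [
   {handoff_acksync_threshold, 1},
   {handoff_timeout, 120000},
   {handoff_receive_vnode_timeout, 120000}
  ]}
].
```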

Monitoring of these handoffs has been improved in recent versions, as working out what exactly is going wrong in older Riak versions is hard. When a handoff fails, it starts to re-send all the data from the beginning, so if the fallback vnodes were created as part of an extended outage (and are quite large) then continuous failures are possible.

If you are confident that all the data is sufficiently covered in your cluster (due to other replicas and anti-entropy mechanisms), in the worst case scenario you can stop each node in turn and manually delete the fallback vnodes. Obviously though, it would be more sustainable to find a configuration which will work for future handoffs.
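
If it helps to work out which running vnodes on a node are fallbacks before going down that route, something like this from riak attach should show them (a rough sketch, assuming riak_core_vnode_manager:all_vnodes/1 and riak_core_ring:my_indices/1 behave the same in your version):

```erlang
%% List riak_kv vnode indices running on this node that the ring does not
%% assign to it - i.e. fallback vnodes still waiting to hand off.
{ok, Ring} = riak_core_ring_manager:get_my_ring(),
Owned = riak_core_ring:my_indices(Ring),
Running = [Idx || {riak_kv_vnode, Idx, _Pid}
                      <- riak_core_vnode_manager:all_vnodes(riak_kv_vnode)],
Running -- Owned.
```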

@patrickkokou
Author

Thanks Martin, I'll try these config changes and see how it goes. Will keep you updated.

@patrickkokou
Author

I made some changes via riak attach and application set_env, and restarted Riak.
That kicked off the transfers again, but now I'm seeing a different error in the Riak error logs:

2023-05-03 01:34:09.787 [error] <0.304.0>@riak_core_ring:check_tainted:263 Error: riak_core_ring/ring_ready called on tainted ring
2023-05-03 01:34:09.787 [error] <0.304.0>@riak_core_ring:check_tainted:263 Error: riak_core_ring/ring_ready called on tainted ring

The transfers seem to be in progress, but I don't understand how to fix this riak_core_ring:check_tainted error.

I need your help again, thanks

@martinsumner
Contributor

martinsumner commented May 3, 2023

I don't really know. I believe the tainted flag was added so that, before a read-only cache of the ring is exported (using mochiglobal), it is marked as tainted; that makes it possible to confirm that such a cached ring is never mistakenly used as the basis for an updated ring - i.e. that code updates the ring from get_raw_ring, not get_my_ring.

So the tainted state and the error messages are a check to make sure this never happens. But clearly, in some rare circumstance it can. Because of this, the unset_tainted function was added so that the state could be fixed from remote_console ... but that isn't available in older versions of Riak.

If the error logs don't go away, there might be another method to clear this status. I don't think it will work, but perhaps riak_core_ring_manager:force_update/0 might be worth a shot. Otherwise, you could compile a new version of the riak_core_ring module with the exported unset_tainted function added, hot code load it, and then use that function to unset the tainted flag.
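
For completeness, the sequence I have in mind from riak attach would be roughly this - untested guesswork, and the path and the patched module are assumptions:

```erlang
%% 1. The long shot - ask the ring manager to force a ring update.
riak_core_ring_manager:force_update().

%% 2. If the errors persist: hot code load a locally patched riak_core_ring
%%    that exports unset_tainted (assumes you have built the patched .beam
%%    and copied it to /tmp/patched on the node).
code:add_patha("/tmp/patched").
code:purge(riak_core_ring).
code:load_file(riak_core_ring).
%% ...then call the exported unset_tainted on the ring and persist the result
%% via the ring manager; the exact call depends on how the module was patched.
```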
