
Cluster is not recovering after failure "Could not find healthy leader ip, aborting remove and delete operation..." #150

Open
doronl opened this issue Jul 31, 2022 · 3 comments

@doronl

doronl commented Jul 31, 2022

The cluster is not recovering; the operator is stuck in a loop looking for a leader, and only one node is left out of 3 (no followers). See the attached operator log.

cluster_failure.log

Steps taken (not shown in the log), which didn't help:

  1. Delete the 3rd node.
  2. Delete the config map.

Any idea?

@voltbit @NataliAharoniPayu

@NataliAharoniPayu
Collaborator

NataliAharoniPayu commented Jul 31, 2022

Hi, the cluster API exposes entry points like /fix and /rebalance.
Try triggering them and see if that helps.
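
For example, with the operator API port-forwarded as in the routine below, triggering the two entry points is a couple of curl calls. This is only a sketch: the local port 8090 follows the port-forward example in this thread, and the HTTP method the operator expects is an assumption.

    # forward the operator API locally (8080 in the pod, per the routine below)
    kubectl port-forward operator-pod 8090:8080 &

    # trigger a cluster fix, then a rebalance (HTTP method is an assumption;
    # check the operator docs/logs if a plain GET is rejected)
    curl -s localhost:8090/fix
    curl -s localhost:8090/rebalance
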
My recommended troubleshooting routine is to open several terminals so you get a full view of the event:

  • watch kubectl get pods
  • watch kubectl describe configmap redis-cluster-state-map
  • kubectl port-forward chosen-pod 6379:6379
  • watch redis-cli cluster nodes
  • kubectl port-forward operator-pod 8090:8080
  • terminal to manage curl requests to api
  • kubectl logs -f operator-pod
  • If there is a need to reset the cluster, edit the redis-operator-config configmap and look for the flag ExposeSensitiveEntryPoints; set it to true. Once the operator logs show that the new configuration has been loaded, use the /reset entry point to reset the cluster (see the sketch after this list).
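
A sketch of that reset flow, assuming the port-forward from the routine above is still running. The configmap name, flag name, and /reset entry point come from this thread; the exact key layout inside the configmap is an assumption, so adjust to whatever format it actually uses.

    # 1. expose the sensitive entry points
    kubectl edit configmap redis-operator-config
    #    ... set ExposeSensitiveEntryPoints to "true" in the config data (layout is an assumption)

    # 2. wait until the operator logs confirm the new configuration was loaded
    kubectl logs -f operator-pod

    # 3. trigger the reset through the port-forwarded operator API
    curl -s localhost:8090/reset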

General note: clusters that have only masters, without followers, are at risk of losing sync between their configurations. This is something we noticed not long ago, and we are still looking for a good method to detect and handle it properly.

I looked into the log file, and it seems the cluster lost quorum, which is a point of failure for the cluster. The way to detect it is to check whether a watcher terminal running the redis-cli cluster nodes query shows a list of masters with a question mark next to their identifiers; by Redis design, this is a case that cannot recover. If you find that this is the case, a reset is required.
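
If you keep a watcher on the cluster nodes output, unreachable masters show up with a fail flag in their flags column. A quick check along these lines may help (the "fail" / "fail?" flags are standard redis-cli cluster nodes output; the rest is just a convenience wrapper):

    # with a data pod port-forwarded on 6379 (see the routine above)
    redis-cli -p 6379 cluster nodes

    # count masters the cluster can no longer reach ("fail" / "fail?" flags)
    redis-cli -p 6379 cluster nodes | grep master | grep -c 'fail'
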

We are working on a good way to trigger an auto-reset for those cases. Currently we are not convinced we can reliably distinguish between cases that can be recovered and cases that cannot, which is why we enable the /reset entry point and expose a counter of how many reconcile loops have run in a row since the last healthy one; it can serve as a metric to alert on.

@doronl
Author

doronl commented Jul 31, 2022

Thanks Natali!
In production we will be using replicas; this is a staging environment on Azure that we are testing...
BTW, when I ran redis-cli cluster nodes on the last remaining node I got a very long list of keyslot ranges... something I had never seen before... anyway, /reset is something I was not aware of, so it's good to know.

Anyway, a question: are there any operator configurations that control when the operator detects "Lost nodes detected on some of cluster nodes" (e.g. retries, timeouts, etc.)?

@NataliAharoniPayu
Collaborator

NataliAharoniPayu commented Jul 31, 2022

Regarding the image of the long list of keyslot ranges: when the cluster loses its handle on some of the slots (for example, a reshard that was interrupted, or the loss of a master and all of its followers), the operator changes the cluster state to "Fix". It will "trap" the reconcile loop into applying fix and rebalance until both succeed, since a fixed cluster is required by Redis in order to perform any other operation on the cluster, and a rebalanced cluster is required by us in order to perform any operation on it (link to the full article we published about the general design and our decision process).
Performing a cluster fix in many cases leaves the cluster in this state of mixed-up keyslot tables, with each node holding arbitrary slot ranges.
[screenshot: redis-cli cluster nodes output showing fragmented keyslot ranges]
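
For reference, the fix and rebalance the operator applies correspond roughly to the standard redis-cli cluster commands; running them by hand against a port-forwarded node looks like this (a sketch only — the operator drives this internally, and the exact options it uses are not shown in this thread):

    # repair slot coverage and any open/migrating slots
    redis-cli --cluster fix 127.0.0.1:6379

    # spread the slots evenly across the masters again
    redis-cli --cluster rebalance 127.0.0.1:6379 --cluster-use-empty-masters
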

You will see the log line that implies the loss of a Redis node whenever the "cluster view" (a map that lists each of the pods that actually exist in the cluster) differs from the "state view" (a map that lists each of the pods theoretically expected according to the spec). When a node that is expected to be a leader is missing from the cluster view, this triggers a search for a Redis node that appears in the cluster view and whose leader name matches the missing leader from the state map. If no such node is found, the operator declares "loss of master and all of its followers" in the logs. Sometimes that is not really the case (for example: we have only masters, and during an upgrade the rolled pod starts to reshard its slots to another Redis node; it is kept in the state map but deleted as an actual pod), but we still declare it as a lost pod set, because it requires the same handling as a real loss and there is currently no good way to distinguish the two cases. In any case, running fix and rebalance after such an operation is always recommended, as it is a sensitive process.
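
At the user level you can compare the two views yourself with the tools already listed above (the label selector for the Redis pods is hypothetical; use whatever labels your deployment applies):

    # the operator's "state view": pods expected according to the spec
    kubectl describe configmap redis-cluster-state-map

    # the "cluster view": pods that actually exist (label selector is hypothetical)
    kubectl get pods -l app=redis-cluster
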

At the heart of the idea of having the operator perform "self recovery / self maintenance", we purposely did not implement a per-operation retry mechanism; instead we save the state and attempt the mitigation in the next reconcile loop. A good example that helps explain the rationale is a loss of node(s) that leads to misalignment between the nodes' tables: at that point it doesn't matter how hard we try, we cannot add new nodes until the tables are cleared of non-responsive nodes and the cluster is fixed and rebalanced, with proper waiting for the nodes to agree on the new configuration. That routine can only be guaranteed to complete before the next attempt to add a node if it is deferred to the next reconcile loop.

So, as we see it, we don't want to follow a hard rule of "page an SRE for any case of 3 failures in a row", since it could be a case that will fix itself within a finite number of reconcile loops; but we also don't want to stay blind to a case where the operator and the cluster cannot return to a healthy state. This is why we maintain the NumOfReconcileLoopsSinceHealthyCluster counter; we assume that a value greater than 250 implies a concerning state and requires an SRE to look at the logs and perform mitigation steps manually if needed.
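
As a loosely sketched example of alerting on that counter: how the counter is actually exposed (metrics endpoint, state configmap, or logs) is not spelled out in this thread, so the /state path below is hypothetical; substitute wherever your deployment surfaces NumOfReconcileLoopsSinceHealthyCluster.

    # poll the counter through the port-forwarded operator API (path is hypothetical)
    # and flag when it crosses the suggested threshold of 250
    while true; do
      loops=$(curl -s localhost:8090/state | grep -o 'NumOfReconcileLoopsSinceHealthyCluster[^,}]*' | grep -o '[0-9]\+')
      [ "${loops:-0}" -gt 250 ] && echo "ALERT: $loops reconcile loops since last healthy cluster"
      sleep 60
    done
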
