
Cluster is not recovering after failure "Could not find healthy leader ip, aborting remove and delete operation..." #150

Open
doronl opened this issue Jul 31, 2022 · 3 comments

@doronl

doronl commented Jul 31, 2022

The cluster is not recovering; the operator is stuck in a loop looking for a leader, and only one node is left out of 3 (no followers). See the attached operator log.

cluster_failure.log

Steps taken (not shown in the log), which didn't help:

  1. Delete the 3rd node.
  2. Delete the config map.

Any idea?

@voltbit @NataliAharoniPayu

@NataliAharoniPayu
Collaborator

NataliAharoniPayu commented Jul 31, 2022

Hi, the cluster API exposes entry points like /fix and /rebalance.
Try triggering them and see if that helps.
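
For example, with the operator API port-forwarded as in the routine below, triggering the two entry points is a couple of curl calls. This is only a sketch: the local port 8090 follows the port-forward example in this thread, and the HTTP method the operator expects is an assumption.

    # forward the operator API locally (8080 in the pod, per the routine below)
    kubectl port-forward operator-pod 8090:8080 &

    # trigger a cluster fix, then a rebalance (HTTP method is an assumption;
    # check the operator docs/logs if a plain GET is rejected)
    curl -s localhost:8090/fix
    curl -s localhost:8090/rebalance
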
My recommended troubleshooting routine is to open several terminals so you get a full view of the event:

  • watch kubectl get pods
  • watch kubectl describe configmap redis-cluster-state-map
  • kubectl port-forward chosen-pod 6379:6379
  • watch redis-cli cluster nodes
  • kubectl port-forward operator-pod 8090:8080
  • terminal to manage curl requests to api
  • kubectl logs -f operator-pod
  • If there is a need to reset the cluster, edit the redis-operator-config configmap and look for the flag ExposeSensitiveEntryPoints; set it to true. Once the operator logs show that the new configuration has been loaded, use the /reset entry point to reset the cluster (see the sketch after this list).
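
A sketch of that reset flow, assuming the port-forward from the routine above is still running. The configmap name, flag name, and /reset entry point come from this thread; the exact key layout inside the configmap is an assumption, so adjust to whatever format it actually uses.

    # 1. expose the sensitive entry points
    kubectl edit configmap redis-operator-config
    #    ... set ExposeSensitiveEntryPoints to "true" in the config data (layout is an assumption)

    # 2. wait until the operator logs confirm the new configuration was loaded
    kubectl logs -f operator-pod

    # 3. trigger the reset through the port-forwarded operator API
    curl -s localhost:8090/reset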

General note: clusters that have only masters, without followers, are at risk of losing sync between their configurations. This is something we noticed not long ago, and we are still looking for a good method to detect and handle it properly.

I looked into the log file, and it seems the cluster lost quorum, which is a point of failure for the cluster. The way to detect it is to check whether a watcher terminal running the redis-cli cluster nodes query shows a list of masters with a question mark next to their identifiers; by Redis design, this is a case that cannot recover. If you find that this is the case, a reset is required.
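
If you keep a watcher on the cluster nodes output, unreachable masters show up with a fail flag in their flags column. A quick check along these lines may help (the "fail" / "fail?" flags are standard redis-cli cluster nodes output; the rest is just a convenience wrapper):

    # with a data pod port-forwarded on 6379 (see the routine above)
    redis-cli -p 6379 cluster nodes

    # count masters the cluster can no longer reach ("fail" / "fail?" flags)
    redis-cli -p 6379 cluster nodes | grep master | grep -c 'fail'
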

We are working on a good way to trigger an auto-reset for those cases. Currently we are not convinced we can reliably distinguish between cases that can be recovered and cases that cannot, which is why we enable the /reset entry point and expose a counter of how many reconcile loops have run in a row since the last healthy one; it can serve as a metric to alert on.

@doronl
Author

doronl commented Jul 31, 2022

Thanks Natali!
In production we will be using replicas; this is a staging environment on Azure that we are testing...
BTW, when I ran redis-cli cluster nodes on the last remaining node I got a very long list of keyslot ranges... something I had never seen before... anyway, /reset is something I was not aware of, so it's good to know.

Anyway, a question: are there any operator configurations that control when the operator detects "Lost nodes detected on some of cluster nodes" (e.g. retries, timeouts, etc.)?

@NataliAharoniPayu
Collaborator

NataliAharoniPayu commented Jul 31, 2022

Regarding the image of the long list of keyslot ranges: when the cluster loses its handle on some of the slots (for example, a reshard that was interrupted, or the loss of a master and all of its followers), the operator changes the cluster state to "Fix". It will "trap" the reconcile loop into applying fix and rebalance until both succeed, since a fixed cluster is required by Redis in order to perform any other operation on the cluster, and a rebalanced cluster is required by us in order to perform any operation on it (link to the full article we published about the general design and our decision process).
Performing a cluster fix in many cases leaves the cluster in this state of mixed-up keyslot tables, with each node holding arbitrary slot ranges.
[screenshot: redis-cli cluster nodes output showing fragmented keyslot ranges]
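
For reference, the fix and rebalance the operator applies correspond roughly to the standard redis-cli cluster commands; running them by hand against a port-forwarded node looks like this (a sketch only — the operator drives this internally, and the exact options it uses are not shown in this thread):

    # repair slot coverage and any open/migrating slots
    redis-cli --cluster fix 127.0.0.1:6379

    # spread the slots evenly across the masters again
    redis-cli --cluster rebalance 127.0.0.1:6379 --cluster-use-empty-masters
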

You will see the log line that implies the loss of a Redis node whenever the "cluster view" (a map that lists each of the pods that actually exist in the cluster) differs from the "state view" (a map that lists each of the pods theoretically expected according to the spec). When a node that is expected to be a leader is missing from the cluster view, this triggers a search for a Redis node that appears in the cluster view and whose leader name matches the missing leader from the state map. If no such node is found, the operator declares "loss of master and all of its followers" in the logs. Sometimes that is not really the case (for example: we have only masters, and during an upgrade the rolled pod starts to reshard its slots to another Redis node; it is kept in the state map but deleted as an actual pod), but we still declare it as a lost pod set, because it requires the same handling as a real loss and there is currently no good way to distinguish the two cases. In any case, running fix and rebalance after such an operation is always recommended, as it is a sensitive process.
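
At the user level you can compare the two views yourself with the tools already listed above (the label selector for the Redis pods is hypothetical; use whatever labels your deployment applies):

    # the operator's "state view": pods expected according to the spec
    kubectl describe configmap redis-cluster-state-map

    # the "cluster view": pods that actually exist (label selector is hypothetical)
    kubectl get pods -l app=redis-cluster
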

At the heart of the idea of having the operator perform "self recovery / self maintenance", we purposely did not implement a per-operation retry mechanism; instead we save the state and attempt the mitigation in the next reconcile loop. A good example that helps explain the rationale is a loss of node(s) that leads to misalignment between the nodes' tables: at that point it doesn't matter how hard we try, we cannot add new nodes until the tables are cleared of non-responsive nodes and the cluster is fixed and rebalanced, with proper waiting for the nodes to agree on the new configuration. That routine can only be guaranteed to complete before the next attempt to add a node if it is deferred to the next reconcile loop.

So, as we see it, we don't want to follow a hard rule of "page an SRE for any case of 3 failures in a row", since it could be a case that will fix itself within a finite number of reconcile loops; but we also don't want to stay blind to a case where the operator and the cluster cannot return to a healthy state. This is why we maintain the NumOfReconcileLoopsSinceHealthyCluster counter; we assume that a value greater than 250 implies a concerning state and requires an SRE to look at the logs and perform mitigation steps manually if needed.
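
As a loosely sketched example of alerting on that counter: how the counter is actually exposed (metrics endpoint, state configmap, or logs) is not spelled out in this thread, so the /state path below is hypothetical; substitute wherever your deployment surfaces NumOfReconcileLoopsSinceHealthyCluster.

    # poll the counter through the port-forwarded operator API (path is hypothetical)
    # and flag when it crosses the suggested threshold of 250
    while true; do
      loops=$(curl -s localhost:8090/state | grep -o 'NumOfReconcileLoopsSinceHealthyCluster[^,}]*' | grep -o '[0-9]\+')
      [ "${loops:-0}" -gt 250 ] && echo "ALERT: $loops reconcile loops since last healthy cluster"
      sleep 60
    done
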
