
Active-Active Zookeeper Operator Deployment #245

Open
hikhvar opened this issue Oct 19, 2021 · 4 comments
hikhvar (Contributor) commented Oct 19, 2021

Hello,
is it possible to run two instances of the zookeeper operator? I didn't find an option to enable any form of leader election. Many operators use leader election to allow deploying multiple instances and thereby mitigate node failures in the k8s cluster.
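For reference, the pattern I have in mind is the Lease-based leader election other operators ship with; a minimal sketch using client-go (Go, where the pattern is readily available; the lock name, namespace, and timings below are just placeholders) could look like this:

```go
// Minimal sketch of Lease-based leader election with client-go.
// Each replica runs this; only the current lease holder reconciles,
// the other replica stands by until the lease expires.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Each replica identifies itself, e.g. by pod hostname.
	id, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "zookeeper-operator-lock", // placeholder lock name
			Namespace: "default",                 // placeholder namespace
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second, // how long a lost lease blocks takeover
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// start the reconcile loop here
			},
			OnStoppedLeading: func() {
				klog.Info("lost leadership, shutting down")
			},
		},
	})
}
```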

nightkr (Member) commented Oct 20, 2021

Not currently (this is also blocked on an upstream issue, kube-rs/kube#485).

That said, the managed ZooKeeper should still be available, even if the operator managing it is down.

hikhvar (Contributor, Author) commented Oct 21, 2021

That is true. I think being able to deploy the zookeeper operator in an active-active fashion helps with these two design goals:

  • I want to recover faster in case of a single node failure. At the moment the pod is only rescheduled after the default 5 minutes. In a former project I observed this failure: the node became unready but did not disappear from the cluster. The pods were not deleted by the kubelet; they stayed in state Terminating, since the node was knocked out by a kernel error. As a result, the replicaset controller didn't create a new pod for our single-instance horizontal pod autoscaler (HPA). This effectively disabled horizontal pod autoscaling. If we had run the HPA active-active, the second, non-stuck instance would have become the active one.
  • I want to be sure that only a single operator is in control in case of an administrative error that creates two instances (e.g. scaling the operator deployment by accident).

I admit, both are edge cases. But from an operational safety perspective they are worth addressing.

nightkr (Member) commented Oct 21, 2021

Sadly, I'm not sure this would do much to help with goal 1, since you'd just end up stuck waiting for the lease to expire, rather than the pod being deleted.

Safe forced lease-takeover would require fencing, either at the networking layer or at the Kubernetes API layer. Neither of those is available out of the box in Kubernetes clusters.

hikhvar (Contributor, Author) commented Oct 21, 2021

Yes, but the lease will expire at some point. The pod was stuck in state Terminating for 11h, whereas a lease duration is in the range of minutes.
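To put numbers on it: a standby only has to wait until the current holder's Lease stops being renewed, and that wait is bounded by the lease duration (seconds to minutes), not by how long the dead pod lingers. Roughly, the check a standby performs looks like this (a sketch against the coordination API, reusing the placeholder lock name and namespace from the sketch above):

```go
package election

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// leaseExpired reports whether the named Lease is no longer being renewed,
// i.e. whether a standby may attempt takeover. Lock name and namespace are
// placeholders, e.g. "default"/"zookeeper-operator-lock".
func leaseExpired(ctx context.Context, client kubernetes.Interface, ns, name string) (bool, error) {
	lease, err := client.CoordinationV1().Leases(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	if lease.Spec.RenewTime == nil || lease.Spec.LeaseDurationSeconds == nil {
		return true, nil
	}
	// The takeover delay is bounded by leaseDurationSeconds, not by how long
	// the stuck pod stays in Terminating.
	age := time.Since(lease.Spec.RenewTime.Time)
	return age > time.Duration(*lease.Spec.LeaseDurationSeconds)*time.Second, nil
}
```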

nightkr self-assigned this Feb 28, 2022