
Deploy PodDisruptionBudgets for Cassandra nodes #349

Open
linki opened this issue Feb 26, 2020 · 5 comments

Comments

@linki

linki commented Feb 26, 2020

To limit the number of voluntary pod terminations, e.g. during Kubernetes cluster updates, we should deploy a PodDisruptionBudget that limits how many Cassandra nodes can be unavailable at the same time.
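For illustration, a minimal sketch of what the operator could create per data center, assuming a default of `maxUnavailable: 1`; the package, helper name and pod labels are made up and this is not existing operator code:

```go
package pdb

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// newPodDisruptionBudget builds a PDB that allows at most one Cassandra pod
// of the given data center to be voluntarily disrupted at a time.
func newPodDisruptionBudget(cdcName, namespace string) *policyv1beta1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt(1)
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name:      cdcName + "-pdb", // hypothetical naming scheme
			Namespace: namespace,
		},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				// assumed pod label; the operator's real labels may differ
				MatchLabels: map[string]string{"cassandra-datacenter": cdcName},
			},
		},
	}
}
```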

@smiklosovic
Collaborator

smiklosovic commented Feb 26, 2020

@linki interesting, thank you for this idea, I will look into it tomorrow. If I understand it correctly, you would like to see this configurable directly in the cdc spec?

@linki
Author

linki commented Feb 28, 2020

@smiklosovic I can also look into it 😃 It's not urgent but I didn't want to forget about it.

We could default the PDB to allow at most one pod to be down at any point in time. Do you see any use cases where it's useful to allow more disruptions?

@smiklosovic
Collaborator

smiklosovic commented Feb 28, 2020

@linki I think this is an interesting mix of two concepts. Imagine you have a cluster of 5 nodes, RF is 5 and you have CL QUORUM. That means that, in theory, you should be fine with having two nodes down ... so why just one?

I am not sure I am getting that completely, but I would make the number in the PDB dynamically computed based on how many nodes I have and which CL I am interested in.
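For illustration, a rough sketch of the kind of computation I mean, assuming plain QUORUM and that only the replication factor matters (the helper names are made up, not operator API):

```go
// quorum is the number of replicas a QUORUM write or read needs.
func quorum(rf int) int {
	return rf/2 + 1
}

// maxTolerableDisruptions is how many replicas can be down while still
// reaching QUORUM; this could feed the PDB's maxUnavailable.
func maxTolerableDisruptions(rf int) int {
	return rf - quorum(rf) // e.g. RF 5 -> 5 - 3 = 2
}
```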

If it is overly complicated or I am flat out wrong here, just implement it as you wish and I'll take a look afterwards. I am glad you want to help! You are warmly welcome.

@linki
Author

linki commented Mar 9, 2020

Oh yes, that makes a lot of sense. I totally forgot that we can tolerate more than one node failure. In this case it's probably best to allow the user to override the value in case she knows better.

IIRC from the etcd docs: a minimum of 5 nodes was recommended because one might be down intentionally (for updates) and another might fail unintentionally exactly while the first one is down, which still leaves us with a happy 3-node cluster. A PDB in this setup would say minAvailable: 4 to allow that one intentional node update while still keeping one spare. Of course, for larger clusters the number can be higher, but it should always be high enough to allow for unintended failures as well.
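As a purely illustrative sketch of that etcd-style default (the helper is hypothetical, not operator code):

```go
// defaultMinAvailable allows exactly one voluntary disruption and leaves the
// rest of the failure budget for unplanned outages.
func defaultMinAvailable(replicas int) int {
	return replicas - 1 // e.g. 5 nodes -> minAvailable: 4
}
```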

At Zalando we respect the PDBs when draining nodes and we use default PDBs that allow 1 disruption.

I wonder, are those quorum settings per Cassandra cluster or per Dataset? If they are per Dataset, then it'll be tricky to decide how many failed nodes we can tolerate on the pod level.

@smiklosovic
Collaborator

smiklosovic commented Mar 9, 2020

It's more complicated than that (regarding your last question).

Let's say you have a 2-DC cluster with 5 nodes in each DC, 10 in total. In your application, you decide to have a replication factor of 3 (per DC) and your application writes with consistency level QUORUM. In that case, you expect a majority of the replicas across all DCs to acknowledge the write, which is (3 + 3) / 2 + 1 = 4. So you can, in theory, have one DC fully up with 5 nodes (including your three replicas) while the other DC is almost totally down with only one node up, and if that node happens to be a replica for your data, your write succeeds. But if all three nodes that are replicas (out of the 5 in the second DC) are down, you are out of luck and your write will fail because you cannot achieve quorum. (Read the note at the end of this post.)
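A worked version of that arithmetic, as a hypothetical helper rather than anything in the operator:

```go
// crossDCQuorum: a QUORUM write needs a majority of all replicas across every DC.
func crossDCQuorum(rfPerDC map[string]int) int {
	total := 0
	for _, rf := range rfPerDC {
		total += rf
	}
	return total/2 + 1
}

// crossDCQuorum(map[string]int{"dc1": 3, "dc2": 3}) == 4
```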

If everything is the same as above but your CL is LOCAL_QUORUM, you only expect the DC that is local to the coordinator your write request is sent to to be up, irrespective of what the other DC is doing. In that case, the whole second DC can be down, and as long as you achieve LOCAL_QUORUM in dc1, you are fine. (It does not mean the data would not be replicated to the second DC if it were up; it would. LOCAL_ is just about the fact that you do not care whether that DC is down, and if it was down and comes back up after a while, hints would be sent to those nodes and so on.)
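And the LOCAL_QUORUM counterpart, again just illustrative:

```go
// localQuorum only counts replicas in the coordinator's own DC.
func localQuorum(rfLocalDC int) int {
	return rfLocalDC/2 + 1 // RF 3 per DC -> 2, so the other DC can be fully down
}
```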

It is good that you are respecting PDBs when draining nodes, but when you drain (and maybe decommission?), just be sure that you are not on ALL or ONE or something similar, so that you do not cut off the branch you are sitting on if you happen to drain or decommission exactly that node.

There is a pod per "node" (each pod containing a C* container and a sidecar container), so if we ever support multiple DCs (which is not possible right now), your solution would probably have to take into account which node you are going to take down and which DC it is in (i.e. which DC you are scaling down). Hence I naively think that there should be a PDB per DC (if we are ever going to have more than one DC deployable).
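A naive sketch of that per-DC idea, reusing the hypothetical `newPodDisruptionBudget` helper sketched in the issue description (same assumed package; none of this is existing operator code):

```go
// perDCDisruptionBudgets builds one PDB per data center so that a drain in one
// DC cannot eat into another DC's disruption budget.
func perDCDisruptionBudgets(namespace string, dcNames []string) []*policyv1beta1.PodDisruptionBudget {
	budgets := make([]*policyv1beta1.PodDisruptionBudget, 0, len(dcNames))
	for _, dc := range dcNames {
		budgets = append(budgets, newPodDisruptionBudget(dc, namespace))
	}
	return budgets
}
```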

Note: keep in mind that this scenario would work too -> 5 nodes per DC, RF 3 per DC and CL QUORUM, with only two nodes up in each DC, so in total only 4 nodes out of 10 are up (2 per DC) and 6 are down (3 per DC). If 4 replicas out of 6 are up (why 6? because you have RF 3 per DC, so 3 replicas in DC1 and 3 in DC2), you still achieve QUORUM. So you can have 3 nodes down per DC, in both DCs, and all would still be fine ...

For these reasons, having CL QUORUM in a multi-DC setup is not a good idea, because you basically have to have the second DC available no matter what; without the second DC, you will never achieve QUORUM. So if you have a multi-DC setup and you require that one DC can go completely down (or you want to be prepared for such a situation), you should use LOCAL_QUORUM. You can then take the second DC down completely, and if you happen to have some smart traffic balancing in front of your apps, you can just reroute all traffic to the first DC for a while and all would still be fine.
