Blog post or documentation on how to replace broken (or potentially broken) nodes? #78
Comments
➤ John Sanda commented: Hi @arianvp. We definitely want to get these kinds of things documented; in fact, it was mentioned in k8ssandra/k8ssandra#501 (comment). Would you be interested in working on this? I would be happy to assist 😀
➤ Arian van Putten commented: Hmm, I don't know if I know enough (yet) about k8ssandra to contribute to this effectively, but I really do want to stress-test it in exactly these kinds of scenarios :-). I was hoping to get some feedback on whether my procedures were on the right track. For inspiration: the KUDO Cassandra operator already has some docs. Their manual node replacement procedure sounds very similar to what I wrote down. However, they also have a "Recovery controller" that automates large parts of this process; it would be nice if k8ssandra could add a similar feature. I see that cass-operator has a replaceNodes field in the spec: https://github.com/datastax/cass-operator/blob/6b05f8303c646932758a3aadaa0885f7f445e587/tests/testdata/cass-operator-1.1.0-chart/templates/customresourcedefinition.yaml#L100-L104 I suppose it is of use here? 😀
➤ John Sanda commented: Hey @arianvp, I spent some time going over the first scenario, where the worker node is still healthy but the Cassandra pod is not. The steps sound right when using local storage. Here are updated (disclaimer: untested) steps for use with cass-operator:
The pod should get scheduled to a different worker node since the original node is cordoned. If you are using attached storage, I think the process is different: there is no need to schedule the pod to a different worker node. It should be sufficient to delete the PVC and then delete the pod so that it gets a new PVC. Deleting the PVC will be blocked by a finalizer while it is in use by the pod, so you will need to clear the finalizer for the deletion to proceed.
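The steps above could be sketched roughly like this (untested; the datacenter, pod, and PVC names are placeholders following cass-operator's usual naming conventions, and your cluster's names will differ):

```shell
# Local storage: cordon the worker so the replacement pod lands elsewhere
kubectl cordon <worker-node>

# Tell cass-operator to run the replace procedure for this Cassandra node
kubectl patch cassandradatacenter dc1 --type=merge \
  -p '{"spec":{"replaceNodes":["cluster1-dc1-default-sts-0"]}}'

# Clear the finalizer that blocks PVC deletion while the pod still uses it,
# then delete the PVC
kubectl patch pvc server-data-cluster1-dc1-default-sts-0 --type=merge \
  -p '{"metadata":{"finalizers":null}}'
kubectl delete pvc server-data-cluster1-dc1-default-sts-0 --wait=false

# Delete the pod; the StatefulSet recreates it, and it gets a fresh PVC
kubectl delete pod cluster1-dc1-default-sts-0
```

With attached storage, the cordon step would be skipped, per the comment above.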
➤ Arian van Putten commented: Thanks so much! I'll test these procedures, see if they work as expected, and see if I can write some prose for the documentation! I had one more, semi-related question (it also has to do with node restarts, but this time in a non-disaster scenario where the pod keeps the same PV). When I originally explored Cassandra on Kubernetes I ran into one blocker, namely this issue: kubernetes/kubernetes#28969 A lot of commenters claimed (including, it seems, people from DataStax) that Cassandra does not particularly like its node IP changing when a node restarts, and pods in a StatefulSet will get a new IP when they restart. I also see that cass-operator automatically restarts pods that are not Ready. However, some comments suggest this is no longer an issue in Cassandra and that nodes will automatically detect that their IP address changed. The Cassandra docs themselves also seem to suggest this: https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsReplaceLiveNode.html
Is there something special happening in cass-operator that solves this concern? Or is it indeed just not a problem anymore in Cassandra itself? Or are the only remaining problems the issues with replacing nodes, which is handled by .spec.replaceNodes? (See also the discussion in https://github.com/rook/rook/blob/master/design/cassandra/design.md#major-pain-point-stable-pod-identity)
➤ John Sanda commented: There is another detail I left out of my previous steps. If the time to complete the replacement exceeds the hint window, then I think you will want to run a full repair on the replacement C* node. k8ssandra/cass-operator could further automate some of these steps, but I am not sure it makes sense for it to automate others. For example, we probably want an admin to cordon the worker node.
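For instance, the full repair could be run through the Cassandra container (untested; the pod and container names are placeholders):

```shell
# Run a full repair on the replacement node if the replacement took longer
# than the hint window (max_hint_window_in_ms, 3 hours by default)
kubectl exec cluster1-dc1-default-sts-0 -c cassandra -- nodetool repair --full
```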
This is correct. Here is some example output from my 3.11.10 cluster: WARN [GossipStage:1] 2021-04-07 12:14:41,664 StorageService.java:2491 - Not updating host ID aa61932f-7d14-4bc3-b210-38d11d433b68 for /10.40.5.15 because it's mine
cass-operator does indeed handle seed changes. It adds the following label to pods that run seed nodes: cassandra.datastax.com/seed-node: "true". When cass-operator relabels seed pods, it calls an endpoint on the management-api service in each pod to reload seeds. This way seed nodes are kept up to date.
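On a seed pod, that label shows up in the pod metadata roughly as follows (illustrative fragment):

```yaml
metadata:
  labels:
    cassandra.datastax.com/seed-node: "true"
```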
➤ Arian van Putten commented:
This is exactly the procedure done in https://github.com/datastax/cass-operator/blob/master/tests/node_replace/node_replace_suite_test.go, right? It sounds to me like we can use the same process for locally provisioned PersistentVolumes; we don't need to differentiate, except that the node does not need to be cordoned in the case of reattachable volumes. Note that we cordon the node, so when we delete the PVC (with the finalizer removed) and the pod, the new pod will be scheduled on a new Kubernetes node anyway. So there is no need to delete the pod twice as you described. A procedure like this should work (though, again, I haven't tested it yet), given that the StorageClass is marked WaitForFirstConsumer:
I'm going to experiment with these procedures a bit and see if I can create a pull request for this documentation. The question of how to do disaster recovery (when the node is actually already broken) is still open, but from https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsReplaceNode.html it sounds like it's basically identical to replacing a live node.
➤ John Sanda commented:
Yes, that looks right.
Agreed - the key distinction is that we do not have to cordon the worker node when using attached storage.
👍
That would be terrific 😄
➤ Arian van Putten commented: The k8ssandra helm chart does not seem to expose the replaceNodes option of the underlying CassandraDatacenter that it deploys. Should I expose it, or should the docs just use kubectl edit cassandradatacenter? I noticed that the k8ssandra.io docs don't mention anything about the schema of the CRD, only the helm charts themselves, so it sounds to me like exposing that option as a helm value might be preferred. I did notice that the operator modifies spec.replaceNodes after it has set status.nodeReplacements. This might be a bit awkward with helm, as you'd have to manually make sure you remove replaceNodes from your values.yaml after applying it to avoid accidentally starting the replace procedure twice... Thoughts on this?
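Until the chart exposes it, setting the field directly on the CassandraDatacenter is one option (resource and pod names below are placeholders):

```shell
# Interactively:
kubectl edit cassandradatacenter dc1

# Or non-interactively, naming the pod whose Cassandra node should be replaced:
kubectl patch cassandradatacenter dc1 --type=merge \
  -p '{"spec":{"replaceNodes":["cluster1-dc1-default-sts-1"]}}'
```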
➤ John Sanda commented: I believe there are a couple of properties in the spec that cass-operator will actually update, and I am not a fan of that. I agree that it is awkward and may potentially cause some problems. Say we expose replaceNodes in the chart properties and do a helm upgrade; cass-operator makes the changes and then removes .spec.replaceNodes, as you mentioned. Before another helm upgrade I would need to remove the replaceNodes chart property; otherwise, the operation will be performed again. To avoid this situation we might want to consider implementing this via a post-upgrade hook.
➤ Arian van Putten commented: Small update on this: I worked on a recipe for using EC2 instance storage (i3.2xlarge instances; upstreamed docs for that here: kubernetes-sigs/sig-storage-local-static-provisioner#252), which exposes local disks and sets all the expected sysfs tweaks that the Cassandra docs suggest for NVMe disks. This is probably already useful to have as docs on its own, next to the existing EKS EBS docs; I will follow up with a PR and am leaving notes here for now. I did some experiments with replacing nodes and it all worked well (by editing the CRD directly, not through helm yet). However, managed node groups in EKS might not be the best primitive for this. For example, a rolling replace of nodes is not really something you want when rolling out a new Kubernetes version, as you would need to set replaceNodes every time you do that; you want to controllably replace nodes one by one. I haven't figured out how to do that yet with EKS: the primitive here is an EC2 Auto Scaling group, and as soon as you start a new release it rolling-replaces all the nodes automatically, leaving no time for the replaceNodes dance. The cluster itself is installed using https://eksctl.io, with an eksctl config (apiVersion: eksctl.io/v1alpha5) that defines a cassandra node group.
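A minimal eksctl config along those lines might look as follows (a sketch only; the cluster name, region, and capacity are assumptions, not the config used in the experiment above):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: cassandra-test   # assumed name
  region: us-east-1      # assumed region
nodeGroups:
  - name: cassandra
    instanceType: i3.2xlarge   # exposes local NVMe instance storage
    desiredCapacity: 3
```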
➤ John Sanda commented: @arianvp This is great stuff! For replacing nodes, do you think the canaryUpgradeCount property ( https://github.com/k8ssandra/cass-operator/blob/master/operator/pkg/apis/cassandra/v1beta1/cassandradatacenter_types.go#L159 ) might be helpful?
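For reference, that property is set on the CassandraDatacenter spec alongside the canaryUpgrade flag; a fragment might look like this (untested sketch):

```yaml
spec:
  canaryUpgrade: true
  canaryUpgradeCount: 1   # only roll one node at a time
```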
Is your feature request related to a problem? Please describe.
As an operator of Cassandra, I want to replace nodes when, for example, their disks have degraded or show other signs of hardware failure. On the k8ssandra.io website I would like to see a guide explaining how this works in the Kubernetes world. Describing how to replace an already-failed node would also be nice.
Describe the solution you'd like
Describe the two scenarios in the docs: how to replace a non-broken node, and how to replace a broken node.
I think the instructions would look something like this:
Non-broken:
Broken node:
These are hypothetical steps that I didn't test, but it would be nice to describe these procedures. Especially when using local persistent volumes, some care has to be taken to make sure that the new pods get scheduled on new nodes. This is tricky and could use a good step-by-step guide.
Even better would be if we could somehow automate this in the operator, but I don't know what that would look like.
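A rough, untested sketch of how the two cases might differ (all names are placeholders; whether cass-operator needs additional steps here is exactly what the requested guide should pin down):

```shell
# Non-broken node: Cassandra is still reachable, so drain it gracefully
# before removing the pod
kubectl exec cluster1-dc1-default-sts-0 -c cassandra -- nodetool drain
kubectl cordon <worker-node>
kubectl delete pod cluster1-dc1-default-sts-0

# Broken node: the Cassandra process is already gone, so ask the operator
# to bootstrap a replacement in its place
kubectl patch cassandradatacenter dc1 --type=merge \
  -p '{"spec":{"replaceNodes":["cluster1-dc1-default-sts-0"]}}'
```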
Describe alternatives you've considered
None
Additional context
Add any other context or screenshots about the feature request here.