Blog post or documentation on how to replace broken (or potentially broken) nodes? #78

Closed
sync-by-unito bot opened this issue May 19, 2021 · 11 comments


Is your feature request related to a problem? Please describe.
As an operator of Cassandra, I want to replace nodes when, for example, their disks have degraded or show other signs of hardware failure. On the k8ssandra.io website I would like to see a guide explaining how this works in the Kubernetes world. Describing how to replace an already-failed node would also be nice.

Describe the solution you'd like

Describe the two scenarios in the docs on how to replace a non-broken node, and how to replace a broken node.

I think the instructions would look something like this:
Non-broken:

  1. Cordon the node
  2. Delete the pod
  3. Delete the PVC (and the PV) ❓ (This might be needed in the case of local storage: the PVC is bound to a specific node, and you want to make sure a new PVC is created (with WaitForFirstConsumer) so that the pod ends up on a new node rather than on the old one again)
  4. Delete the node, or replace the broken persistent volume
  5. Pod should now be Pending
  6. Add a new node, or uncordon the existing node (if the volume was replaced)
  7. The PVC binds to the new disk, and the pod should now be Running

Broken node:

  1. The pod is Pending because it is trying to reschedule onto the broken node, where its PV and PVC are bound (e.g. when using a local persistent volume)
  2. Delete the PVC (this needs to happen first; otherwise, if you delete the pod, it will bind to the existing PVC, be re-scheduled onto the broken node, and stay Pending forever)
  3. Delete the pod
  4. Pod and PVC should be recreated now, both Pending
  5. Add a new node

These are hypothetical steps; I didn't test them. But it would be nice to describe these procedures. Especially when using local persistent volumes, some care has to be taken to make sure that the new pods get scheduled on new nodes. This is tricky and can probably use a good step-by-step guide; a rough kubectl sketch of the broken-node flow follows.
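
A minimal kubectl sketch of the broken-node flow above, assuming a cass-operator-style StatefulSet; the pod and PVC names are placeholders for illustration:

```sh
# Placeholder names; substitute your own pod and PVC.
# Delete the PVC first (the delete stays queued until the pod releases it),
# so the replacement pod is not pinned to the broken node's local volume.
kubectl delete pvc server-data-mycluster-dc1-default-sts-2 --wait=false

# Then delete the pod; the StatefulSet recreates it, and a new PVC is created
# from the volumeClaimTemplate (it stays Pending with a WaitForFirstConsumer
# StorageClass until a suitable node exists).
kubectl delete pod mycluster-dc1-default-sts-2

# Finally, add a replacement Kubernetes node so the pod and PVC can bind.
```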

Even better would be if we could somehow automate this in the operator, but I don't know what that would look like.

Describe alternatives you've considered
None

Additional context
Add any other context or screenshots about the feature request here.

┆Issue is synchronized with this Jira Task by Unito


➤ John Sanda commented:

Hi arianvp

We definitely want to get these types of things documented. In fact it was mentioned in k8ssandra/k8ssandra#501 (comment). Would you be interested in working on this? I would be happy to assist 😀


➤ Arian van Putten commented:

Hmm, I don't know if I know enough (yet) about k8ssandra to be able to contribute to this effectively. But I really do want to stress test it in exactly these kinds of scenarios :-) I was hoping to get some feedback on whether my procedures were on the right track.

For inspiration; the KUDO cassandra operator has some docs already

https://github.com/kudobuilder/operators/blob/master/repository/cassandra/3.11/docs/managing.md#failure-handling

Their manual node replacement procedure sounds very similar to what I wrote down.

However, they also have a "Recovery controller" that automates large parts of this process. It would be nice if k8ssandra could add a similar feature.

I see that cass-operator has a replaceNodes field in the spec. https://github.com/datastax/cass-operator/blob/6b05f8303c646932758a3aadaa0885f7f445e587/tests/testdata/cass-operator-1.1.0-chart/templates/customresourcedefinition.yaml#L100-L104 I suppose it's of use here? 😀


➤ John Sanda commented:

Hey arianvp

I spent some time going over the first scenario, where the worker node is still healthy but the Cassandra pod is not. The steps sound right when using local storage. Here are updated (disclaimer: untested) steps for use with cass-operator:

  1. Cordon the node
  2. Delete the pod
  3. Wait for the new pod to be created
  • Note that it cannot be scheduled, since it is trying to use the PV that is bound to the cordoned node
  4. Delete the PVC
  • I don't think it is necessary to delete the PV
  • Definitely want to use WaitForFirstConsumer
  • If you want to keep the PV, make sure that the volume reclaim policy of the PV is not set to Delete
  5. Update the CassandraDatacenter by adding the down C* node's IP address to .spec.replaceNodes (see the sketch below)
  6. Delete the pod

The pod should get scheduled to a different worker node since the original node is cordoned.
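
For step 5, a rough sketch of that edit with kubectl patch; the CassandraDatacenter and pod names are placeholders, and whether the field expects the pod name or the node's IP is discussed above:

```sh
# Placeholder CassandraDatacenter and Cassandra node names; adjust for your cluster.
kubectl patch cassandradatacenter dc1 --type merge \
  -p '{"spec": {"replaceNodes": ["mycluster-dc1-default-sts-2"]}}'
```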

If you are using attached storage I think the process is different. There is no need to schedule the pod to a different worker node. It should be sufficient to delete the PVC and then delete the pod so it gets a new PVC. Deleting the PVC will be blocked with a finalizer since it is in use by the pod. You will need to clear the finalizer to allow for deletion to proceed.
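
A sketch of clearing that finalizer, again with placeholder names; kubectl delete simply blocks on the kubernetes.io/pvc-protection finalizer while the PVC is in use:

```sh
# Placeholder PVC/pod names for illustration.
kubectl delete pvc server-data-mycluster-dc1-default-sts-2 --wait=false
kubectl patch pvc server-data-mycluster-dc1-default-sts-2 \
  -p '{"metadata": {"finalizers": null}}'
kubectl delete pod mycluster-dc1-default-sts-2
```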


➤ Arian van Putten commented:

Thanks so much! I'll test these procedures and see if they work as expected and see if I can write some prose for the documentation!

I had one more question and it's semi-related (it also has to do with node restarts, but this time in a non-disaster scenario, where the pod keeps the same PV).

When I originally explored Cassandra on kubernetes I ran into one blocker which was the following issue: kubernetes/kubernetes#28969

A lot of commenters claimed (including, it seems, people from DataStax) that Cassandra does not particularly like its node IP changing when a node restarts. The problem is that pods in a StatefulSet will get a new IP when they restart. I also see that cass-operator automatically restarts pods that are not Ready.

However some comments also suggest it's not an issue in cassandra anymore and nodes will automatically detect that their IP address changed.

The cassandra docs itself also seem to suggest this:

https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsReplaceLiveNode.html

Note: To change the IP address of a node, simply change the IP of node and then restart Cassandra. If you change the IP address of a seed node, you must update the -seeds parameter in the seed_provider list in each node's cassandra.yaml file.

Is there something special happening in cass-operator that solves this concern? Or is it indeed just not a problem anymore in Cassandra itself? Or are the only problems that occur the ones with replacing nodes, which is handled by .spec.replaceNodes?

(See also the discussion in: https://github.com/rook/rook/blob/master/design/cassandra/design.md#major-pain-point-stable-pod-identity)


➤ John Sanda commented:

There is another detail I left out in my previous steps: if the time taken by the replace exceeds the hint window, then I think you will want to run a full repair on the replacement C* node.
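
For reference, a sketch of kicking off that full repair from outside the pod; the pod name is a placeholder, and the main container in cass-operator pods is assumed to be named cassandra:

```sh
# Placeholder pod name; run a full (non-incremental) repair on the replacement node.
kubectl exec mycluster-dc1-default-sts-2 -c cassandra -- nodetool repair --full
```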

k8ssandra/cass-operator could further automate some of these steps, but I am not sure if it makes sense for it to automate others. For example, we probably want an admin to cordon the worker node.

However some comments also suggest it's not an issue in cassandra anymore and nodes will automatically detect that their IP address changed.

This is correct. Here is some example output from my 3.11.10 cluster:

```
WARN [GossipStage:1] 2021-04-07 12:14:41,664 StorageService.java:2491 - Not updating host ID aa61932f-7d14-4bc3-b210-38d11d433b68 for /10.40.5.15 because it's mine
INFO [GossipStage:1] 2021-04-07 12:14:41,667 StorageService.java:2422 - Nodes () and /10.40.5.15 have the same token /10.40.5.16. Ignoring -1032437002298404031
```

10.40.5.15 is the old IP and 10.40.5.16 is the new one.

Is there something special happening in cass-operator that solves this concern? Or is it indeed just not a problem anymore in cassandra itself?

cass-operator does indeed handle seed changes. It adds the following label to pods that run seed nodes:

```
cassandra.datastax.com/seed-node: "true"
```

When cass-operator relabels seed pods, it calls an endpoint on the management-api service in each pod to reload seeds. This way seed nodes are kept up to date.
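
For example, the pods currently labeled as seeds can be listed with that label (the namespace is a placeholder):

```sh
kubectl get pods -n k8ssandra -l cassandra.datastax.com/seed-node=true
```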


➤ Arian van Putten commented:

If you are using attached storage I think the process is different. There is no need to schedule the pod to a different worker node. It should be sufficient to delete the PVC and then delete the pod so it gets a new PVC. Deleting the PVC will be blocked with a finalizer since it is in use by the pod. You will need to clear the finalizer to allow for deletion to proceed.

This is exactly the procedure as done in https://github.com/datastax/cass-operator/blob/master/tests/node_replace/node_replace_suite_test.go right?

It sounds to me like we can use the same process for locally provisioned PersistentVolumes. We don't need to differentiate, except for the fact that the node doesn't need to be cordoned in the case of reattachable volumes.

Note that we cordon the node, so when we delete the PVC (with the finalizer removed) and the pod, the new pod will be scheduled on a new Kubernetes node anyway. So there is no need to delete the pod twice as you described.

A procedure like this should work (though I haven't tested it yet, again), given the StorageClass is marked WaitForFirstConsumer (a rough kubectl sketch follows the list):

  1. Cordon the node
  2. set replaceNodes to the pod name
  3. Delete the PVC and then remove the finalizer
  4. Delete the pod
  5. Pod and PVC get scheduled to new node
  6. uncordon the node
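
As mentioned, a rough kubectl sketch of this sequence; all node/pod/PVC names below are placeholders, and a WaitForFirstConsumer StorageClass is assumed:

```sh
# All names below are placeholders; adapt to your cluster.
kubectl cordon <worker-node>                                         # 1. cordon the node
kubectl patch cassandradatacenter dc1 --type merge \
  -p '{"spec": {"replaceNodes": ["mycluster-dc1-default-sts-2"]}}'   # 2. mark the pod for replacement
kubectl delete pvc server-data-mycluster-dc1-default-sts-2 --wait=false
kubectl patch pvc server-data-mycluster-dc1-default-sts-2 \
  -p '{"metadata": {"finalizers": null}}'                            # 3. delete the PVC, remove the finalizer
kubectl delete pod mycluster-dc1-default-sts-2                       # 4. delete the pod
#                                                                      5. pod and PVC should land on a new node
kubectl uncordon <worker-node>                                       # 6. uncordon the node
```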

I'm going to experiment with these procedures a bit; and see if I can create a pull request for this documentation.

The question of how to do disaster recovery (the node is actually already broken) is still open, but from https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsReplaceNode.html it sounds like it's basically identical to replacing a live node.


➤ John Sanda commented:

This is exactly the procedure as done in https://github.com/datastax/cass-operator/blob/master/tests/node_replace/node_replace_suite_test.go right?

Yes, that looks right.

It sounds to me we can use the same process for locally provisioned PersistentVolumes. We don't need to differentiate; except for the fact that the node doesn't need to be cordoned in the case of reattachable volumes.

Agreed - the key distinction is that we do not have to cordon the worker node when using attached storage.

Note that we cordon the node, so when we delete the PVC (with the finalizer removed) and the pod, the new pod will be scheduled on a new Kubernetes node anyway. So there is no need to delete the pod twice as you described.

👍

I'm going to experiment with these procedures a bit; and see if I can create a pull request for this documentation.

That would be terrific 😄


➤ Arian van Putten commented:

The k8ssandra helm chart does not seem to expose the replaceNodes option of the underlying CassandraDatacenter that it deploys.

Should I expose it, or should the docs just use kubectl edit cassandradatacenter? I noticed that the k8ssandra.io docs don't mention anything about the schema of the CRD, but only about the helm charts themselves. So it sounds to me like exposing that option as a helm value might be preferred.

I did notice that the operator modifies spec.replaceNodes after it has set status.nodeReplacements. This might be a bit awkward with helm, as you'd have to manually make sure you remove replaceNodes from your values.yaml after applying it, to avoid accidentally starting the replace procedure twice...

Thoughts about this?


➤ John Sanda commented:

I believe that there are a couple of properties in the spec that cass-operator will actually update. I am not a fan of that. I agree that it is awkward and may potentially cause some problems. Suppose we expose replaceNodes in the chart properties and do a helm upgrade: cass-operator makes the changes and then removes .spec.replaceNodes as you mentioned. Before another helm upgrade I would need to remove the replaceNodes chart property; otherwise, the operation will be performed again. To avoid this situation we might want to consider implementing this via a post-upgrade hook.


➤ Arian van Putten commented:

Small update on this:

I worked on a recipe for using EC2 Instance Storage (i3.2xlarge instances) (upstreamed docs for that here: kubernetes-sigs/sig-storage-local-static-provisioner#252), which exposes the local disks and sets all the expected sysfs tweaks that the Cassandra docs suggest for NVMe disks.

This is probably already useful to have in the docs on its own, next to the existing EKS EBS docs. I will follow up with a PR; leaving notes here for now.

I did some experiments with replacing nodes etc. and it all worked well (by editing the CRD directly, not through helm yet). However, managed node groups in EKS might not be the best primitive for this. E.g. a rolling replace of nodes is not really something you want when rolling out a new Kubernetes version, as you would need to set replaceNodes every time you do that; you want to controllably replace nodes one by one. I haven't figured out how to do that yet with EKS. The primitive here is an EC2 autoscaling group, and as soon as you start a new release it rolling-replaces all the nodes automatically, leaving no time for doing the replaceNodes dance...

Install an EKS cluster using https://eksctl.io:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: k8ssandra-cluster
  region: eu-central-1
managedNodeGroups:
  - name: storage-nvme
    desiredCapacity: 3
    instanceType: i3.2xlarge
    preBootstrapCommands:
      - |
        cat <<EOF > /etc/udev/rules.d/90-kubernetes-discovery.rules
        # Discover Instance Storage disks so kubernetes local provisioner can pick them up from /dev/disk/kubernetes
        KERNEL=="nvme[0-9]n[0-9]", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon EC2 NVMe Instance Storage", ATTRS{serial}=="?*", SYMLINK+="disk/kubernetes/nvme-\$attr{model}_\$attr{serial}", OPTIONS="string_escape=replace"
        EOF
      - |
        cat <<EOT > /etc/udev/rules.d/90-cassandra-tweaks.rules
        # Tweak sysctls as per https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/config/configRecommendedSettings.html
        KERNEL=="nvme[0-9]n[0-9]", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon EC2 NVMe Instance Storage", ATTR{queue/scheduler}="deadline", ATTR{queue/nr_requests}="128", ATTR{queue/rotational}="0", ATTR{queue/read_ahead_kb}="8"
        KERNEL=="nvme[0-9]n[0-9]", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", ATTR{queue/read_ahead_kb}="32"
        EOT
      - udevadm control --reload && udevadm trigger
```

Install the local-static-storage-provisioner with the following helm values:

```yaml
# local-static-provisioner-helm-values.yaml
classes:
  - name: fast-disks
    hostDir: /dev/disk/kubernetes
    storageClass: true
```

Install k8ssandra with the fast-disks StorageClass:

```yaml
cassandra:
  version: "3.11.10"
  cassandraLibDirVolume:
    storageClass: fast-disks
    size: 1769Gi
  heap:
    size: 31G
    newGenSize: 31G
  # i3.2xlarge
  resources:
    requests:
      cpu: 7000m
      memory: 58Gi
    limits:
      cpu: 7000m
      memory: 58Gi
  datacenters:
    - name: dc1
      size: 3
      racks:
        - name: eu-central-1a
          affinityLabels:
            topology.kubernetes.io/zone: eu-central-1a
        - name: eu-central-1b
          affinityLabels:
            topology.kubernetes.io/zone: eu-central-1b
        - name: eu-central-1c
          affinityLabels:
            topology.kubernetes.io/zone: eu-central-1c
```


➤ John Sanda commented:

arianvp This is great stuff!

For replacing nodes, do you think the canaryUpgradeCount ( https://github.com/k8ssandra/cass-operator/blob/master/operator/pkg/apis/cassandra/v1beta1/cassandradatacenter_types.go#L159 ) property might be helpful?
