Blog post or documentation on how to replace broken (or potentially broken) nodes? #78

Closed
sync-by-unito bot opened this issue May 19, 2021 · 11 comments


Is your feature request related to a problem? Please describe.
As an operator of Cassandra, I want to replace nodes when, for example, their disks have degraded or show other signs of hardware failure. On the k8ssandra.io website I would like to see a guide explaining how this works in the Kubernetes world. Describing how to replace an already-failed node would also be nice.

Describe the solution you'd like

Describe the two scenarios in the docs on how to replace a non-broken node, and how to replace a broken node.

I think the instructions would look something like this:
Non-broken:

  1. Cordon the node
  2. Delete the pod
  3. Delete the PVC (and the PV) ❓ (This might be needed in the case of local storage: the PVC is bound to a specific node, and you want to make sure a new PVC is created (with WaitForFirstConsumer) so that the pod ends up on a new node rather than on the old one again)
  4. Delete the node, or replace the broken persistent volume
  5. Pod should now be Pending
  6. Add a new node, or uncordon the existing node (if the volume was replaced)
  7. The PVC binds to the new disk, and the pod should now be Running

Broken node:

  1. The pod is Pending because it is trying to reschedule onto the broken node, where its PV and PVC are bound (e.g. when using a local persistent volume)
  2. Delete the PVC (this needs to happen first; otherwise, if you delete the pod, it will bind to the existing PVC, be re-scheduled onto the broken node, and stay Pending forever)
  3. Delete the pod
  4. Pod and PVC should be recreated now, both Pending
  5. Add a new node

These are hypothetical steps; I didn't test them. But it would be nice to describe these procedures. Especially when using local persistent volumes, some care has to be taken to make sure that the new pods get scheduled on new nodes. This is tricky and can probably use a good step-by-step guide; a rough kubectl sketch of the broken-node flow follows.
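
A minimal kubectl sketch of the broken-node flow above, assuming a cass-operator-style StatefulSet; the pod and PVC names are placeholders for illustration:

```sh
# Placeholder names; substitute your own pod and PVC.
# Delete the PVC first (the delete stays queued until the pod releases it),
# so the replacement pod is not pinned to the broken node's local volume.
kubectl delete pvc server-data-mycluster-dc1-default-sts-2 --wait=false

# Then delete the pod; the StatefulSet recreates it, and a new PVC is created
# from the volumeClaimTemplate (it stays Pending with a WaitForFirstConsumer
# StorageClass until a suitable node exists).
kubectl delete pod mycluster-dc1-default-sts-2

# Finally, add a replacement Kubernetes node so the pod and PVC can bind.
```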

Even better would be if we could somehow automate this in the operator, but I don't know what that would look like.

Describe alternatives you've considered
None

Additional context
Add any other context or screenshots about the feature request here.

┆Issue is synchronized with this Jira Task by Unito


➤ John Sanda commented:

Hi arianvp

We definitely want to get these types of things documented. In fact it was mentioned in k8ssandra/k8ssandra#501 (comment). Would you be interested in working on this? I would be happy to assist 😀


➤ Arian van Putten commented:

Hmm, I don't know if I know enough (yet) about k8ssandra to be able to contribute to this effectively. But I really do want to stress test it in exactly these kinds of scenarios :-) I was hoping to get some feedback on whether my procedures were on the right track.

For inspiration; the KUDO cassandra operator has some docs already

https://github.com/kudobuilder/operators/blob/master/repository/cassandra/3.11/docs/managing.md#failure-handling

Their manual node replacement procedure sounds very similar to what I wrote down.

However, they also have a "Recovery controller" that automates large parts of this process. It would be nice if k8ssandra could add a similar feature.

I see that cass-operator has a replaceNodes field in the spec. https://github.com/datastax/cass-operator/blob/6b05f8303c646932758a3aadaa0885f7f445e587/tests/testdata/cass-operator-1.1.0-chart/templates/customresourcedefinition.yaml#L100-L104 I suppose it's of use here? 😀


➤ John Sanda commented:

Hey arianvp

I spent some time going over the first scenario, where the worker node is still healthy but the Cassandra pod is not. The steps sound right when using local storage. Here are updated (disclaimer: untested) steps for use with cass-operator:

  1. Cordon the node
  2. Delete the pod
  3. Wait for the new pod to be created
  • Note that it cannot be scheduled, since it is trying to use the PV that is bound to the cordoned node
  4. Delete the PVC
  • I don't think it is necessary to delete the PV
  • Definitely want to use WaitForFirstConsumer
  • If you want to keep the PV, make sure that the volume reclaim policy of the PV is not set to Delete
  5. Update the CassandraDatacenter by adding the down C* node's IP address to .spec.replaceNodes (see the sketch below)
  6. Delete the pod

The pod should get scheduled to a different worker node since the original node is cordoned.
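
For step 5, a rough sketch of that edit with kubectl patch; the CassandraDatacenter and pod names are placeholders, and whether the field expects the pod name or the node's IP is discussed above:

```sh
# Placeholder CassandraDatacenter and Cassandra node names; adjust for your cluster.
kubectl patch cassandradatacenter dc1 --type merge \
  -p '{"spec": {"replaceNodes": ["mycluster-dc1-default-sts-2"]}}'
```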

If you are using attached storage I think the process is different. There is no need to schedule the pod to a different worker node. It should be sufficient to delete the PVC and then delete the pod so it gets a new PVC. Deleting the PVC will be blocked with a finalizer since it is in use by the pod. You will need to clear the finalizer to allow for deletion to proceed.
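
A sketch of clearing that finalizer, again with placeholder names; kubectl delete simply blocks on the kubernetes.io/pvc-protection finalizer while the PVC is in use:

```sh
# Placeholder PVC/pod names for illustration.
kubectl delete pvc server-data-mycluster-dc1-default-sts-2 --wait=false
kubectl patch pvc server-data-mycluster-dc1-default-sts-2 \
  -p '{"metadata": {"finalizers": null}}'
kubectl delete pod mycluster-dc1-default-sts-2
```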


➤ Arian van Putten commented:

Thanks so much! I'll test these procedures and see if they work as expected and see if I can write some prose for the documentation!

I had one more question and it's semi-related (it also has to do with node restarts, but this time in a non-disaster scenario, where the pod keeps the same PV).

When I originally explored Cassandra on kubernetes I ran into one blocker which was the following issue: kubernetes/kubernetes#28969

A lot of commenters claimed (including, it seems, people from DataStax) that Cassandra does not particularly like its node IP changing when a node restarts. The problem is that pods in a StatefulSet will get a new IP when they restart. I also see that cass-operator automatically restarts pods that are not Ready.

However some comments also suggest it's not an issue in cassandra anymore and nodes will automatically detect that their IP address changed.

The cassandra docs itself also seem to suggest this:

https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsReplaceLiveNode.html

Note: To change the IP address of a node, simply change the IP of node and then restart Cassandra. If you change the IP address of a seed node, you must update the -seeds parameter in the seed_provider list in each node's cassandra.yaml file.

Is there something special happening in cass-operator that solves this concern? Or is it indeed just not a problem anymore in Cassandra itself? Or are the only problems that occur the ones with replacing nodes, which is handled by .spec.replaceNodes?

(See also the discussion in: https://github.com/rook/rook/blob/master/design/cassandra/design.md#major-pain-point-stable-pod-identity)


➤ John Sanda commented:

There is another detail I left out in my previous steps: if the time taken by the replace exceeds the hint window, then I think you will want to run a full repair on the replacement C* node.
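
For reference, a sketch of kicking off that full repair from outside the pod; the pod name is a placeholder, and the main container in cass-operator pods is assumed to be named cassandra:

```sh
# Placeholder pod name; run a full (non-incremental) repair on the replacement node.
kubectl exec mycluster-dc1-default-sts-2 -c cassandra -- nodetool repair --full
```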

k8ssandra/cass-operator could further automate some of these steps, but I am not sure if it makes sense for it to automate others. For example, we probably want an admin to cordon the worker node.

However some comments also suggest it's not an issue in cassandra anymore and nodes will automatically detect that their IP address changed.

This is correct. Here is some example output from my 3.11.10 cluster:

```
WARN [GossipStage:1] 2021-04-07 12:14:41,664 StorageService.java:2491 - Not updating host ID aa61932f-7d14-4bc3-b210-38d11d433b68 for /10.40.5.15 because it's mine
INFO [GossipStage:1] 2021-04-07 12:14:41,667 StorageService.java:2422 - Nodes () and /10.40.5.15 have the same token /10.40.5.16. Ignoring -1032437002298404031
```

10.40.5.15 is the old IP and 10.40.5.16 is the new one.

Is there something special happening in cass-operator that solves this concern? Or is it indeed just not a problem anymore in cassandra itself?

cass-operator does indeed handle seed changes. It adds the following label to pods that run seed nodes:

```
cassandra.datastax.com/seed-node: "true"
```

When cass-operator relabels seed pods, it calls an endpoint on the management-api service in each pod to reload seeds. This way seed nodes are kept up to date.
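
For example, the pods currently labeled as seeds can be listed with that label (the namespace is a placeholder):

```sh
kubectl get pods -n k8ssandra -l cassandra.datastax.com/seed-node=true
```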


➤ Arian van Putten commented:

If you are using attached storage I think the process is different. There is no need to schedule the pod to a different worker node. It should be sufficient to delete the PVC and then delete the pod so it gets a new PVC. Deleting the PVC will be blocked with a finalizer since it is in use by the pod. You will need to clear the finalizer to allow for deletion to proceed.

This is exactly the procedure as done in https://github.com/datastax/cass-operator/blob/master/tests/node_replace/node_replace_suite_test.go right?

It sounds to me like we can use the same process for locally provisioned PersistentVolumes. We don't need to differentiate, except for the fact that the node doesn't need to be cordoned in the case of reattachable volumes.

Note that we cordon the node, so when we delete the PVC (with the finalizer removed) and the pod, the new pod will be scheduled on a new Kubernetes node anyway. So there is no need to delete the pod twice as you described.

A procedure like this should work (though I haven't tested it yet, again), given the StorageClass is marked WaitForFirstConsumer (a rough kubectl sketch follows the list):

  1. Cordon the node
  2. set replaceNodes to the pod name
  3. Delete the PVC and then remove the finalizer
  4. Delete the pod
  5. Pod and PVC get scheduled to new node
  6. uncordon the node
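
As mentioned, a rough kubectl sketch of this sequence; all node/pod/PVC names below are placeholders, and a WaitForFirstConsumer StorageClass is assumed:

```sh
# All names below are placeholders; adapt to your cluster.
kubectl cordon <worker-node>                                         # 1. cordon the node
kubectl patch cassandradatacenter dc1 --type merge \
  -p '{"spec": {"replaceNodes": ["mycluster-dc1-default-sts-2"]}}'   # 2. mark the pod for replacement
kubectl delete pvc server-data-mycluster-dc1-default-sts-2 --wait=false
kubectl patch pvc server-data-mycluster-dc1-default-sts-2 \
  -p '{"metadata": {"finalizers": null}}'                            # 3. delete the PVC, remove the finalizer
kubectl delete pod mycluster-dc1-default-sts-2                       # 4. delete the pod
#                                                                      5. pod and PVC should land on a new node
kubectl uncordon <worker-node>                                       # 6. uncordon the node
```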

I'm going to experiment with these procedures a bit; and see if I can create a pull request for this documentation.

The question of how to do disaster recovery (the node is actually already broken) is still open, but from https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsReplaceNode.html it sounds like it's basically identical to replacing a live node.


➤ John Sanda commented:

This is exactly the procedure as done in https://github.com/datastax/cass-operator/blob/master/tests/node_replace/node_replace_suite_test.go right?

Yes, that looks right.

It sounds to me we can use the same process for locally provisioned PersistentVolumes. We don't need to differentiate; except for the fact that the node doesn't need to be cordoned in the case of reattachable volumes.

Agreed - the key distinction is that we do not have to cordon the worker node when using attached storage.

Note that we cordon the node, so when we delete the PVC (with the finalizer removed) and the pod, the new pod will be scheduled on a new Kubernetes node anyway. So there is no need to delete the pod twice as you described.

👍

I'm going to experiment with these procedures a bit; and see if I can create a pull request for this documentation.

That would be terrific 😄


➤ Arian van Putten commented:

The k8ssandra helm chart does not seem to expose the replaceNodes option of the underlying CassandraDatacenter that it deploys.

Should I expose it, or should the docs just use kubectl edit cassandradatacenter? I noticed that the k8ssandra.io docs don't mention anything about the schema of the CRD, but only about the helm charts themselves. So it sounds to me like exposing that option as a helm value might be preferred.

I did notice that the operator modifies spec.replaceNodes after it has set status.nodeReplacements. This might be a bit awkward with helm, as you'd have to manually make sure you remove replaceNodes from your values.yaml after applying it, to avoid accidentally starting the replace procedure twice...

Thoughts about this?


➤ John Sanda commented:

I believe that there are a couple of properties in the spec that cass-operator will actually update. I am not a fan of that. I agree that it is awkward and may potentially cause some problems. Suppose we expose replaceNodes in the chart properties and do a helm upgrade: cass-operator makes the changes and then removes .spec.replaceNodes as you mentioned. Before another helm upgrade I would need to remove the replaceNodes chart property; otherwise, the operation will be performed again. To avoid this situation we might want to consider implementing this via a post-upgrade hook.


➤ Arian van Putten commented:

Small update on this:

I worked on a recipe for using EC2 Instance Storage (i3.2xlarge instances) (upstreamed docs for that here: kubernetes-sigs/sig-storage-local-static-provisioner#252), which exposes the local disks and sets all the expected sysfs tweaks that the Cassandra docs suggest for NVMe disks.

This is probably already useful to have in the docs on its own, next to the existing EKS EBS docs. I will follow up with a PR; leaving notes here for now.

I did some experiments with replacing nodes etc. and it all worked well (by editing the CRD directly, not through helm yet). However, managed node groups in EKS might not be the best primitive for this. E.g. a rolling replace of nodes is not really something you want when rolling out a new Kubernetes version, as you would need to set replaceNodes every time you do that; you want to controllably replace nodes one by one. I haven't figured out how to do that yet with EKS. The primitive here is an EC2 autoscaling group, and as soon as you start a new release it rolling-replaces all the nodes automatically, leaving no time for doing the replaceNodes dance...

Install an EKS cluster using https://eksctl.io:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: k8ssandra-cluster
  region: eu-central-1
managedNodeGroups:
  - name: storage-nvme
    desiredCapacity: 3
    instanceType: i3.2xlarge
    preBootstrapCommands:
      - |
        cat <<EOF > /etc/udev/rules.d/90-kubernetes-discovery.rules
        # Discover Instance Storage disks so kubernetes local provisioner can pick them up from /dev/disk/kubernetes
        KERNEL=="nvme[0-9]n[0-9]", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon EC2 NVMe Instance Storage", ATTRS{serial}=="?*", SYMLINK+="disk/kubernetes/nvme-\$attr{model}_\$attr{serial}", OPTIONS="string_escape=replace"
        EOF
      - |
        cat <<EOT > /etc/udev/rules.d/90-cassandra-tweaks.rules
        # Tweak sysctls as per https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/config/configRecommendedSettings.html
        KERNEL=="nvme[0-9]n[0-9]", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon EC2 NVMe Instance Storage", ATTR{queue/scheduler}="deadline", ATTR{queue/nr_requests}="128", ATTR{queue/rotational}="0", ATTR{queue/read_ahead_kb}="8"
        KERNEL=="nvme[0-9]n[0-9]", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", ATTR{queue/read_ahead_kb}="32"
        EOT
      - udevadm control --reload && udevadm trigger
```

Install the local-static-storage-provisioner with the following helm values:

```yaml
# local-static-provisioner-helm-values.yaml
classes:
  - name: fast-disks
    hostDir: /dev/disk/kubernetes
    storageClass: true
```

Install k8ssandra with the fast-disks StorageClass:

```yaml
cassandra:
  version: "3.11.10"
  cassandraLibDirVolume:
    storageClass: fast-disks
    size: 1769Gi
  heap:
    size: 31G
    newGenSize: 31G
  # i3.2xlarge
  resources:
    requests:
      cpu: 7000m
      memory: 58Gi
    limits:
      cpu: 7000m
      memory: 58Gi
  datacenters:
    - name: dc1
      size: 3
      racks:
        - name: eu-central-1a
          affinityLabels:
            topology.kubernetes.io/zone: eu-central-1a
        - name: eu-central-1b
          affinityLabels:
            topology.kubernetes.io/zone: eu-central-1b
        - name: eu-central-1c
          affinityLabels:
            topology.kubernetes.io/zone: eu-central-1c
```


➤ John Sanda commented:

arianvp This is great stuff!

For replacing nodes, do you think the canaryUpgradeCount ( https://github.com/k8ssandra/cass-operator/blob/master/operator/pkg/apis/cassandra/v1beta1/cassandradatacenter_types.go#L159 ) property might be helpful?
