
Cannot detach volumes attached to deleted nodes #691

Closed
tksm opened this issue Jan 14, 2022 · 12 comments

Comments


tksm commented Jan 14, 2022

Describe the bug

We cannot detach volumes attached to deleted nodes in Trident 21.10.1. In Trident v21.07.2, these volumes would be detached automatically after a certain period. If I understand correctly, this force detachment is performed by the AttachDetachController once ReconcilerMaxWaitForUnmountDuration elapses.

This change appears to have been introduced in this commit, which makes Trident's ControllerUnpublishVolume check whether the node exists. If the node does not exist, ControllerUnpublishVolume now returns a NotFound error, so volume detachment always fails once the node has been deleted.
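For reference, the CSI spec expects ControllerUnpublishVolume to be idempotent: if the node can no longer be found and the volume can safely be regarded as unpublished from it, the plugin should return success rather than NotFound. The sketch below (hypothetical function and error names, not Trident's actual code) illustrates that spec-friendly behavior:

```go
package main

import "fmt"

// unpublishVolume sketches a spec-friendly ControllerUnpublishVolume: a
// missing node means the volume can no longer be attached to it, so the
// detach is treated as already complete instead of returning NotFound.
// nodeExists is a hypothetical lookup against the Kubernetes API.
func unpublishVolume(nodeExists func(name string) (bool, error), node string) error {
	ok, err := nodeExists(node)
	if err != nil {
		return err // transient lookup failure: let the orchestrator retry
	}
	if !ok {
		return nil // node deleted: idempotent success, detach considered done
	}
	// ... backend-specific unpublish against the live node would go here ...
	return nil
}

func main() {
	// Simulate a node object that has already been deleted.
	gone := func(string) (bool, error) { return false, nil }
	fmt.Println(unpublishVolume(gone, "worker-1")) // prints <nil>: detach succeeds
}
```

With the 21.10 behavior, the same missing-node case instead returned a NotFound error, so the external-attacher could never finish the detach.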

During a server failure, volume detachment might fail and we have no choice but to delete the node, so it is desirable for volumes attached to deleted nodes to be detached automatically.

Environment

  • Trident version: 21.10.1
  • Trident installation flags used: silenceAutosupport: true (Trident Operator)
  • Container runtime: Docker 20.10.11
  • Kubernetes version: 1.22.5
  • Kubernetes orchestrator: Kubernetes
  • Kubernetes enabled feature gates:
  • OS: Ubuntu 20.04.3 LTS
  • NetApp backend types: ONTAP AFF 9.7P13
  • Other:

To Reproduce

  • Create a StatefulSet that has an ontap-san volume
  • Delete the node object that the Pod is scheduled on by kubectl delete node
  • The StatefulSet controller recreates a new Pod on another node after a short time
  • The recreated Pod cannot be attached to the volume even after 1 hour
    • With Trident v21.07.2, the Pod will become Running after 6 to 8 minutes

In the VolumeAttachment, the following error can be found.

rpc error: code = NotFound desc = node <NODE_NAME> was not found

Expected behavior

Trident automatically detaches volumes attached to deleted nodes.

tksm added the bug label Jan 14, 2022

paalkr commented Jan 14, 2022

We run a 100+ node Kubernetes cluster on AWS that relies heavily on spot nodes. Spot nodes are terminated with just a few minutes' warning on AWS, which is expected to happen quite often. Even though we run the Node Termination Handler in SQS mode and react to spot termination notifications with automatic node draining, we usually end up in a situation where the detach process doesn't finish before the node is deleted.

In this scenario we often encounter the exact same issue as described by @tksm. This is a severe problem, as workloads get stuck in a crash-looping state because the PVC fails to attach after the pod is moved to a new node. I hope the problem can be hotfixed.

gnarl added the tracked label Jan 20, 2022

paalkr commented Jan 28, 2022

Any ETA on a fix?

gnarl (Contributor) commented Jan 28, 2022

@paalkr, the team is currently working on a fix. We will update this issue with a link to the commit once it merges.


paalkr commented Jan 28, 2022

Excellent, thank you very much.


paalkr commented Feb 3, 2022

I'm a little disappointed that this serious bug was not addressed in the 22.01.0 release. It's a major blocker for using NetApp in an elastic Kubernetes environment like EKS.

gnarl (Contributor) commented Feb 3, 2022

Hi @paalkr, this GitHub issue was opened right before the code freeze date for the 22.01 release. This bug was introduced in the 21.10 release so previous Trident releases will not have this issue. The team is working on the issue and we hope to have a fix that works with all Trident storage drivers when this issue is addressed. I hope that helps.


paalkr commented Feb 3, 2022

Hi @gnarl,

Thanks for the feedback. We are in the process of migrating from Rook Ceph to AWS FSx for NetApp ONTAP on a large production cluster in AWS, and this issue is blocking us from proceeding. I apologize for being impatient ;)

Ideally, when a node is terminated, ReadWriteOnce PVCs should reattach quickly on a new node. I believe node termination should be handled more gracefully by Trident in general; at the very least, volumes should not get stuck in a Multi-Attach error, still connected to a deleted node, if the detach process doesn't finish in time or fails for any other reason.

I'm looking forward to testing the fix NetApp is working on. Please let me know if I can help test any beta release. I have a test environment set up where this fails pretty consistently.

Elyytscha commented

This is an absolute blocker for running workloads on Kubernetes that rely on persistent volumes via Trident. Our only storage provider is a NetApp backend with Trident on OpenShift.

So we need a fix or a workaround ASAP to get pods with persistent volumes back online when a node is lost.
For now this is not possible without manual intervention.

Is there an ETA when the fix will be released?

gnarl (Contributor) commented Mar 18, 2022

This issue is fixed with the Trident 22.01.1 release.

gnarl closed this as completed Mar 18, 2022

paalkr commented Mar 21, 2022

Thx

tksm (Author) commented Mar 23, 2022

@gnarl Thank you for fixing the issue and releasing v22.01.1.

I confirmed that Trident v22.01.1 no longer reproduces this issue with the steps below. Now a recreated Pod after deleting a node will become Running after 6 minutes. 👍

  1. Create a StatefulSet that has an ontap-san volume
  2. Delete the node object that the Pod is scheduled on by kubectl delete node
  3. The StatefulSet controller recreates a new Pod on another node after a short time
  4. With Trident v22.01.1, the recreated Pod will become Running after 6 minutes. 🎉
$ kubectl get pods -w
NAME            READY   STATUS              RESTARTS   AGE
detach-test-0   0/1     ContainerCreating   0          4m30s
detach-test-0   1/1     Running             0          6m15s

gnarl (Contributor) commented Mar 23, 2022

@tksm,

Thank you for confirming that the fix works in your environment.
