
Cannot detach volumes attached to deleted nodes #691

Closed
tksm opened this issue Jan 14, 2022 · 12 comments

Comments


tksm commented Jan 14, 2022

Describe the bug

We cannot detach volumes attached to deleted nodes in Trident 21.10.1. In Trident v21.07.2, these volumes would be detached automatically after a certain period. If I understand correctly, this force detachment is performed by the AttachDetachController once ReconcilerMaxWaitForUnmountDuration elapses.

This change appears to have been introduced in this commit, which makes Trident's ControllerUnpublishVolume check whether the node exists. If the node does not exist, ControllerUnpublishVolume now returns a NotFound error, so volume detachment always fails once the node has been deleted.
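For reference, the CSI spec expects ControllerUnpublishVolume to be idempotent: if the node can no longer be found and the volume can safely be regarded as unpublished from it, the plugin should return success rather than NotFound. The sketch below (hypothetical function and error names, not Trident's actual code) illustrates that spec-friendly behavior:

```go
package main

import "fmt"

// unpublishVolume sketches a spec-friendly ControllerUnpublishVolume: a
// missing node means the volume can no longer be attached to it, so the
// detach is treated as already complete instead of returning NotFound.
// nodeExists is a hypothetical lookup against the Kubernetes API.
func unpublishVolume(nodeExists func(name string) (bool, error), node string) error {
	ok, err := nodeExists(node)
	if err != nil {
		return err // transient lookup failure: let the orchestrator retry
	}
	if !ok {
		return nil // node deleted: idempotent success, detach considered done
	}
	// ... backend-specific unpublish against the live node would go here ...
	return nil
}

func main() {
	// Simulate a node object that has already been deleted.
	gone := func(string) (bool, error) { return false, nil }
	fmt.Println(unpublishVolume(gone, "worker-1")) // prints <nil>: detach succeeds
}
```

With the 21.10 behavior, the same missing-node case instead returned a NotFound error, so the external-attacher could never finish the detach.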

During a server failure, volume detachment might fail and we have no choice but to delete the node, so it is desirable for volumes attached to deleted nodes to be detached automatically.

Environment

  • Trident version: 21.10.1
  • Trident installation flags used: silenceAutosupport: true (Trident Operator)
  • Container runtime: Docker 20.10.11
  • Kubernetes version: 1.22.5
  • Kubernetes orchestrator: Kubernetes
  • Kubernetes enabled feature gates:
  • OS: Ubuntu 20.04.3 LTS
  • NetApp backend types: ONTAP AFF 9.7P13
  • Other:

To Reproduce

  • Create a StatefulSet that has an ontap-san volume
  • Delete the node object that the Pod is scheduled on by kubectl delete node
  • The StatefulSet controller recreates a new Pod on another node after a short time
  • The recreated Pod cannot be attached to the volume even after 1 hour
    • With Trident v21.07.2, the Pod will become Running after 6 to 8 minutes

In the VolumeAttachment, the following error can be found.

rpc error: code = NotFound desc = node <NODE_NAME> was not found

Expected behavior

Trident automatically detaches volumes attached to deleted nodes.

tksm added the bug label Jan 14, 2022

paalkr commented Jan 14, 2022

We run a 100+ node Kubernetes cluster on AWS that relies heavily on spot nodes. Spot nodes are terminated with just a few minutes' warning on AWS, which is expected to happen quite often. Even though we run the Node Termination Handler in SQS mode and react to spot termination notifications with automatic node draining, we usually end up in a situation where the detach process doesn't finish before the node is deleted.

In this scenario we often encounter the exact same issue as described by @tksm. This is a severe problem, as workloads get stuck in a crash-looping state because the PVC fails to attach after the pod is moved to a new node. I hope the problem can be hotfixed.

gnarl added the tracked label Jan 20, 2022

paalkr commented Jan 28, 2022

Any ETA on a fix?

gnarl (Contributor) commented Jan 28, 2022

@paalkr, the team is currently working on a fix. We will update this issue with a link to the commit once it merges.


paalkr commented Jan 28, 2022

Excellent, thank you very much.


paalkr commented Feb 3, 2022

I'm a little disappointed that this serious bug was not addressed in the 22.01.0 release. It's a major blocker for using NetApp in an elastic Kubernetes environment like EKS.

gnarl (Contributor) commented Feb 3, 2022

Hi @paalkr, this GitHub issue was opened right before the code freeze date for the 22.01 release. This bug was introduced in the 21.10 release so previous Trident releases will not have this issue. The team is working on the issue and we hope to have a fix that works with all Trident storage drivers when this issue is addressed. I hope that helps.


paalkr commented Feb 3, 2022

Hi @gnarl,

Thanks for the feedback. We are in the process of migrating from Rook Ceph to AWS FSx for NetApp ONTAP on a large production cluster in AWS, and this issue is blocking us from proceeding. I apologize for being impatient ;)

Ideally, when a node is terminated, ReadWriteOnce PVCs should reattach quickly on a new node. I believe node termination should be handled more gracefully by Trident in general; at the very least, volumes should not get stuck in a Multi-Attach error, still connected to a deleted node, if the detach process doesn't finish in time or fails for any other reason.

I'm looking forward to testing the fix NetApp is working on. Please let me know if I can help test any beta release. I have a test environment set up where this fails pretty consistently.

Elyytscha commented

This is an absolute blocker for running workloads on Kubernetes that rely on persistent volumes via Trident. Our only storage provider is a NetApp backend with Trident on OpenShift.

So we need a fix or a workaround ASAP to get pods with persistent volumes back online when a node is lost.
For now this is not possible without manual intervention.

Is there an ETA when the fix will be released?

gnarl (Contributor) commented Mar 18, 2022

This issue is fixed with the Trident 22.01.1 release.

gnarl closed this as completed Mar 18, 2022

paalkr commented Mar 21, 2022

Thx

tksm (Author) commented Mar 23, 2022

@gnarl Thank you for fixing the issue and releasing v22.01.1.

I confirmed that Trident v22.01.1 no longer reproduces this issue with the steps below. Now a recreated Pod after deleting a node will become Running after 6 minutes. 👍

  1. Create a StatefulSet that has an ontap-san volume
  2. Delete the node object that the Pod is scheduled on by kubectl delete node
  3. The StatefulSet controller recreates a new Pod on another node after a short time
  4. With Trident v22.01.1, the recreated Pod will become Running after 6 minutes. 🎉
$ kubectl get pods -w
NAME            READY   STATUS              RESTARTS   AGE
detach-test-0   0/1     ContainerCreating   0          4m30s
detach-test-0   1/1     Running             0          6m15s

gnarl (Contributor) commented Mar 23, 2022

@tksm,

Thank you for confirming that the fix works in your environment.
