Cannot detach volumes attached to deleted nodes #691
Comments
We run a 100+ node Kubernetes cluster on AWS that relies heavily on spot nodes. Spot nodes are terminated with just a few minutes' warning on AWS, which is expected to happen quite often. Even though we run the Node Termination Handler in SQS mode and react to spot termination notifications with automatic node draining, we usually end up in a situation where the detach process doesn't finish before the node is deleted. In this scenario we often encounter the exact same issue as described by @tksm. This is a severe problem, as workloads get stuck in a crash-looping state because the PVC fails to attach after the pod is moved to a new node. I hope the problem can be hotfixed. |
Any ETA on a fix? |
@paalkr, the team is currently working on a fix. We will update this issue with a link to the commit once it merges. |
Excellent, thank you very much. |
I'm a little disappointed that this serious bug was not addressed in the 22.01.0 release. It's a complete and major blocker for using NetApp with an elastic Kubernetes environment like EKS. |
Hi @paalkr, this GitHub issue was opened right before the code freeze date for the 22.01 release. This bug was introduced in the 21.10 release so previous Trident releases will not have this issue. The team is working on the issue and we hope to have a fix that works with all Trident storage drivers when this issue is addressed. I hope that helps. |
Hi @gnarl, thanks for the feedback. We are in the process of migrating from Rook Ceph to AWS FSx for NetApp ONTAP on a large production cluster in AWS, and this issue is stopping us from proceeding. I apologize for being impatient ;) Ideally, when a node is terminated, ReadWriteOnce PVCs should reattach quickly on a new node. I believe Trident should handle node termination more gracefully in general, and at the very least a volume should not get stuck in a multi-attach error, connected to a deleted node, if the detach process doesn't finish in time or fails for any other reason. I'm looking forward to testing the fix NetApp is working on. Please let me know if I can help test any beta release. I have a test environment set up where this fails pretty consistently. |
This is an absolute blocker for running workloads on Kubernetes that rely on persistent volumes via Trident. Our only storage provider is NetApp with Trident on OpenShift, so we need a fix or a workaround ASAP for getting pods with persistent volumes back online when a node is lost. Is there an ETA for when the fix will be released? |
This issue is fixed with the Trident 22.01.1 release. |
Thx |
@gnarl Thank you for fixing the issue and releasing v22.01.1. I confirmed that Trident v22.01.1 no longer reproduces this issue with the steps below. Now a recreated Pod after deleting a node will become
|
Thank you for confirming that the fix works in your environment. |
Describe the bug
We cannot detach volumes attached to deleted nodes in Trident v21.10.1. In Trident v21.07.2, these volumes were automatically detached after a certain period. If I understand correctly, this forced detachment is performed by the AttachDetachController after ReconcilerMaxWaitForUnmountDuration has elapsed.
It seems that this change was introduced in this commit. The commit makes Trident's ControllerUnpublishVolume check whether the node exists. If the node does not exist, ControllerUnpublishVolume now returns a NotFound error, so volume detachment always fails once the node has been deleted.
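To make the difference concrete, here is a minimal Python sketch of the two behaviors. It is not Trident's actual code (Trident is written in Go); `FakeBackend`, `NotFoundError`, `unpublish_strict`, and `unpublish_tolerant` are hypothetical names used only for illustration, and the in-memory state stands in for whatever the controller really tracks.

```python
from dataclasses import dataclass, field


class NotFoundError(Exception):
    """Stands in for a gRPC NotFound status in this sketch."""


@dataclass
class FakeBackend:
    # Hypothetical in-memory state: known node names, and for each
    # volume the set of nodes it is currently published to.
    nodes: set = field(default_factory=set)
    publications: dict = field(default_factory=dict)


def unpublish_strict(backend, volume, node):
    """Behavior as described above: fail when the node no longer exists."""
    if node not in backend.nodes:
        raise NotFoundError(f"node {node} was not found")
    backend.publications.get(volume, set()).discard(node)


def unpublish_tolerant(backend, volume, node):
    """Assumed alternative: a missing node does not block the cleanup."""
    backend.publications.get(volume, set()).discard(node)


# "node-a" has been deleted, but "pvc-1" is still recorded as published to it.
backend = FakeBackend(nodes={"node-b"}, publications={"pvc-1": {"node-a"}})

try:
    unpublish_strict(backend, "pvc-1", "node-a")
    strict_result = "detached"
except NotFoundError:
    strict_result = "stuck"  # every retry fails the same way

unpublish_tolerant(backend, "pvc-1", "node-a")
tolerant_result = "detached" if not backend.publications["pvc-1"] else "stuck"
```

With the strict check, every retry of the detach hits the same error because the node never comes back; the tolerant variant lets the detach complete.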
When a server fails, volume detachment might fail as well, and we have no choice but to delete the node, so it is desirable for volumes attached to deleted nodes to be detached automatically.
Environment
silenceAutosupport: true (Trident Operator)
To Reproduce
kubectl delete node
In the VolumeAttachment, the following error can be found.
Expected behavior
Trident automatically detaches volumes attached to deleted nodes.
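For anyone trying to spot affected volumes, the following Python sketch lists VolumeAttachments whose node no longer exists. It assumes the input is the parsed JSON from `kubectl get volumeattachments -o json` and `kubectl get nodes -o json`; the function name `stale_attachments` and the sample data (including the placeholder error text) are made up for illustration.

```python
def stale_attachments(volume_attachments, nodes):
    """Return (attachment name, node name, detach error) for missing nodes."""
    live = {n["metadata"]["name"] for n in nodes["items"]}
    stale = []
    for va in volume_attachments["items"]:
        node = va["spec"]["nodeName"]
        if node not in live:
            # status.detachError is only present after a failed detach.
            detach_error = (va.get("status") or {}).get("detachError") or {}
            stale.append(
                (va["metadata"]["name"], node, detach_error.get("message", ""))
            )
    return stale


# Made-up sample data; "<error text>" is a placeholder, not a real message.
nodes = {"items": [{"metadata": {"name": "node-b"}}]}
volume_attachments = {
    "items": [
        {
            "metadata": {"name": "va-1"},
            "spec": {"nodeName": "node-a"},
            "status": {"attached": True, "detachError": {"message": "<error text>"}},
        },
        {
            "metadata": {"name": "va-2"},
            "spec": {"nodeName": "node-b"},
            "status": {"attached": True},
        },
    ]
}

result = stale_attachments(volume_attachments, nodes)
```

Here only `va-1` is reported, because `node-a` is absent from the node list.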