
Fix: Mounted volumes on downed nodes are de-facto unmounted, treat them as such #2270

Closed

Conversation

@briantopping commented Mar 7, 2023

Problem
If a node is running on a virtual machine that is shut down, halted, or crashed, Kubernetes is unaware and doesn't release a pod's claim (PVC) on the PV. This is a very common occurrence in clusters with default shutdown configurations. A node's systemd (for example) sees the node CRI as a monolithic process, does not know that containers need to shut down gracefully to release a PV claim, and typically just kills the CRI and all the containers far too quickly.

The PVC is the binding of the PV to a pod, recorded in etcd, and that state persists beyond the destruction of the node.

At some point, the vSphere CSI driver will be sent a ControllerUnpublishVolume call. Without this fix, the call fails and the CSI driver propagates the error that it was unable to unmount the volume at the ESX level. That is partially correct, but the call is retried indefinitely and halts graceful shutdown of the cluster.

What this PR does / why we need it:
A CSI volume mounted on a VM hosted on a downed node is in fact already unmounted. In controller.ControllerUnpublishVolume(), when an unmount request is received and the node that the volume should be mounted through cannot be found, the volume is de facto unmounted (albeit ungracefully) and should be reported as such without error. This lets the control plane properly delete the PVC so another pod can bind it, or the cluster can continue to shut down.

If the unmount is unsuccessful, this fix confirms the node is lost and returns that the volume is unmounted as requested.

If the node is backed with some form of HA and the node was migrated elsewhere, the ControllerUnpublishVolume event (once issued) will succeed without this fix because the node it was issued to will have been transparently recovered. This fix only activates when the unmount is unsuccessful, and then only if the ESX inquiry indicates the node (VM) is gone.

Works around kubernetes/kubernetes#51835 which appears to remain unresolved and administratively closed.
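
For illustration, here is a minimal, hypothetical sketch (not the driver's actual code) of the fallback described above. The stand-in names nodeLookup, errNodeNotFound, and unpublishVolumeResponse mirror the c.nodeMgr, node.ErrNodeNotFound, and csi.ControllerUnpublishVolumeResponse identifiers quoted later in this thread.

// Hypothetical sketch only, not the driver's implementation.
package sketch

import (
	"context"
	"errors"
	"log"
)

// errNodeNotFound stands in for node.ErrNodeNotFound in the driver.
var errNodeNotFound = errors.New("node wasn't found")

// nodeLookup is a stand-in for the subset of the node manager used here.
type nodeLookup interface {
	GetNodeByUuid(ctx context.Context, nodeUUID string) (interface{}, error)
}

// unpublishVolumeResponse stands in for csi.ControllerUnpublishVolumeResponse.
type unpublishVolumeResponse struct{}

// handleDetachFailure is called after a detach attempt fails. If the node VM is
// confirmed to be gone from the vCenter inventory, the volume is de facto
// detached, so success is returned instead of retrying the detach forever.
func handleDetachFailure(ctx context.Context, mgr nodeLookup, nodeID, volumeID string,
	detachErr error) (*unpublishVolumeResponse, error) {
	if _, err := mgr.GetNodeByUuid(ctx, nodeID); errors.Is(err, errNodeNotFound) {
		log.Printf("VM for node %q not found in VC inventory; marking ControllerUnpublishVolume for volume %q as successful", nodeID, volumeID)
		return &unpublishVolumeResponse{}, nil
	}
	// The node still exists (or its state is unknown): propagate the original
	// error so the container orchestrator retries the detach.
	return nil, detachErr
}

The essential property is that success is only reported after the missing VM is positively confirmed; any other lookup failure leaves the original detach error in place so the call is retried.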

Which issue this PR fixes
fixes #15

Release note:

Mounted volumes on downed nodes are de-facto unmounted, treat them as such

@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 7, 2023
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: briantopping
Once this PR has been reviewed and has the lgtm label, please assign shalini-b for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor)

Hi @briantopping. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 7, 2023
@briantopping (Author)

@gnufied, @lipingxue: This patch should close out Gardener's need to ship with a fork of the CSI driver. How can I help here? 😊

@briantopping (Author) commented Mar 9, 2023

cc: @chethanv28

Hopefully this is something that can make it into the 3.0.0 release. The new code simply searches for nodes and is only called by this check.

@briantopping (Author)

@divyenpatel / @chethanv28 Would still like to see if we can get this reviewed for 3.0.0 release. What are the chances? 🤓

@briantopping briantopping changed the title Mounted volumes on downed nodes are de-facto unmounted, treat them as such Fix: Mounted volumes on downed nodes are de-facto unmounted, treat them as such Mar 15, 2023
@briantopping (Author) commented Mar 15, 2023

@divyenpatel / @chethanv28 I have updated the PR problem statement. Could I get some feedback please?

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Mar 15, 2023
@briantopping (Author)

"Marking ControllerUnpublishVolume for Volume: %q as successful.", req.NodeId, req.VolumeId)
return &csi.ControllerUnpublishVolumeResponse{}, "", nil
} else {
if err == cnsnode.ErrNodeNotFound {
Member

When we get cnsvsphere.ErrVMNotFound at line 2268, it is already confirmed that VM is not present in the VC inventory. Why do we need to check it again using DNS name?

Member

When the UseCSINodeId feature is enabled and we get a call to detach the volume, we receive the Node VM UUID in the ControllerUnpublishVolume request. Using the Node VM UUID we can look up whether the Node VM is present on the system. If it is present we get the VM MoRef, with which we issue the detach call on the vCenter.

Using DNS to look up a node VM may not be reliable, especially if there are multiple VMs with the same DNS name on the vCenter. The VM UUID is a unique identifier for each VM, making it the best option for verifying the presence of a specific node VM on the vCenter. Therefore, it is recommended to use the VM UUID instead of the DNS name when looking up a node VM.

Author

Regarding the first issue, the transitive closure of calls under c.nodeMgr.GetNodeByName() may return errors for a large number of reasons outside of node failure. The code should not respond that the volume has been released unless it is confirmed that the node was actually lost.

I will update the patch with the UUID changes, thank you for that feedback!
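
To make that guard concrete, a tiny hypothetical sketch (stand-in names, not the driver's code): only the explicit not-found sentinel counts as confirmation that the node is gone, so transient lookup failures are never reported as a successful detach.

// Hypothetical sketch only.
package sketch

import "errors"

// errNodeNotFound is a stand-in for node.ErrNodeNotFound / cnsvsphere.ErrVMNotFound.
var errNodeNotFound = errors.New("node wasn't found")

// nodeConfirmedLost reports whether a lookup error proves the node VM is gone.
// Only the explicit not-found sentinel counts; transient errors (vCenter
// unreachable, auth failure, timeout) must not be treated as confirmation,
// otherwise a still-attached volume could be reported as detached.
func nodeConfirmedLost(lookupErr error) bool {
	return errors.Is(lookupErr, errNodeNotFound)
}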

Author

Regarding the use of DNS vs. node UUID, this code has been in production for some time and I forgot about the mechanics mentioned in line 2271: "As the VM UUID is not known anymore, need to search by using NodeID as DNS Name". Do you have some consideration of this state?

Member

If the Node cannot be returned using c.nodeMgr.GetNodeByName because the Node object has been deleted from the API server, we do a deeper lookup into vCenter using c.nodeMgr.GetNodeByUuid(ctx, req.NodeId).

Refer to the following implementation and let us know what is missing in checking Node VM existence before we mark the detach as a success.

nodevm, err = c.nodeMgr.GetNodeByName(ctx, req.NodeId)
			if err == node.ErrNodeNotFound {
				log.Infof("Performing node VM lookup using node VM UUID: %q", req.NodeId)
				nodevm, err = c.nodeMgr.GetNodeByUuid(ctx, req.NodeId)
			}
func (nodes *Nodes) GetNodeByUuid(ctx context.Context, nodeUuid string) (*cnsvsphere.VirtualMachine, error) {
	return nodes.cnsNodeManager.GetNode(ctx, nodeUuid, nil)
}
func (m *defaultManager) GetNode(ctx context.Context,
	nodeUUID string, dc *vsphere.Datacenter) (*vsphere.VirtualMachine, error) {
	log := logger.GetLogger(ctx)
	vmInf, discovered := m.nodeVMs.Load(nodeUUID)
	if !discovered {
		log.Infof("Node hasn't been discovered yet with nodeUUID %s", nodeUUID)
		var vm *vsphere.VirtualMachine
		var err error
		if dc != nil {
			vm, err = dc.GetVirtualMachineByUUID(ctx, nodeUUID, false)
			if err != nil {
				log.Errorf("failed to find node with nodeUUID %s on datacenter: %+v with err: %v", nodeUUID, dc, err)
				return nil, err
			}
			m.nodeVMs.Store(nodeUUID, vm)
		} else {
			if err = m.DiscoverNode(ctx, nodeUUID); err != nil {
				log.Errorf("failed to discover node with nodeUUID %s with err: %v", nodeUUID, err)
				return nil, err
			}

			vmInf, _ = m.nodeVMs.Load(nodeUUID)
			vm = vmInf.(*vsphere.VirtualMachine)
		}
		log.Infof("Node was successfully discovered with nodeUUID %s in vm %v", nodeUUID, vm)
		return vm, nil
	}

	vm := vmInf.(*vsphere.VirtualMachine)
	log.Debugf("Renewing virtual machine %v with nodeUUID %q", vm, nodeUUID)

	if err := vm.Renew(ctx, true); err != nil {
		log.Errorf("failed to renew VM %v with nodeUUID %q with err: %v", vm, nodeUUID, err)
		return nil, err
	}

	log.Debugf("VM %v was successfully renewed with nodeUUID %q", vm, nodeUUID)
	return vm, nil
}

Author

Thanks @divyenpatel I misread your initial response. Understand what you are after now and will get that resolved. Switching this to draft.

Signed-off-by: Brian Topping <brian.topping@sap.com>
@briantopping briantopping requested review from divyenpatel and removed request for gnufied and lipingxue March 16, 2023 05:45
@briantopping briantopping marked this pull request as draft March 17, 2023 00:40
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2023
@MartinWeindel (Contributor)

@briantopping
I tried to reproduce the issue with the latest release v3.0.0. The good news is that it has been resolved somehow!
The issue always occurred on hibernation, but this time the PV was detached from the missing node. See the "detached successfully from node" message in the log below.

{"level":"error","time":"2023-03-28T10:16:46.643922378Z","caller":"node/manager.go:226","msg":"failed to discover node with nodeUUID 4207bfa2-96c2-345b-90a1-5404602c4788 with err: virtual machine wasn't found","TraceId":"98fadc37-c821-44e9-bd53-b6c444992a77","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*defaultManager).GetNode\n\t/build/pkg/common/cns-lib/node/manager.go:226\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*Nodes).GetNodeByUuid\n\t/build/pkg/common/cns-lib/node/nodes.go:203\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2262\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:2285\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5725\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"info","time":"2023-03-28T10:16:46.643984979Z","caller":"vanilla/controller.go:2269","msg":"Virtual Machine for Node ID: 4207bfa2-96c2-345b-90a1-5404602c4788 is not present in the VC Inventory. Marking ControllerUnpublishVolume for Volume: \"e331b82f-48c6-439a-bf12-39fcb1f09e63\" as successful.","TraceId":"98fadc37-c821-44e9-bd53-b6c444992a77"}
{"level":"info","time":"2023-03-28T10:16:46.644047899Z","caller":"vanilla/controller.go:2294","msg":"Volume \"e331b82f-48c6-439a-bf12-39fcb1f09e63\" detached successfully from node \"4207bfa2-96c2-345b-90a1-5404602c4788\".","TraceId":"98fadc37-c821-44e9-bd53-b6c444992a77"}

@briantopping (Author)

@MartinWeindel this is excellent news. I was in the process of setting up a test rig for this and now I guess that won't be immediately necessary. Will set it up anyway for next time. Thanks a bunch for your efforts on this issue over the last year!!

@briantopping briantopping deleted the fix/stuck-pods branch March 28, 2023 20:32
Successfully merging this pull request may close these issues.

When a Vmware node goes down, Kubernetes is unaware and PV isn't released