
Fix: Mounted volumes on downed nodes are de-facto unmounted, treat them as such #2270

Closed

Conversation

@briantopping commented Mar 7, 2023

Problem
If a node is running on a virtual machine that is shut down, halted, or crashed, Kubernetes is unaware and doesn't release a pod's claim (PVC) on the PV. This is a very common occurrence in clusters with default shutdown configurations. A node's systemd (for example) sees the node CRI as a monolithic process, does not know that containers need to shut down gracefully to release a PV claim, and typically just kills the CRI and all the containers far too quickly.

The PVC is the binding of the PV to a pod, recorded in etcd, and that state persists beyond the destruction of the node.

At some point, the vSphere CSI driver will be sent a ControllerUnpublishVolume call. Without this fix, the call fails and the CSI driver propagates the error that it was unable to unmount the volume at the ESX level. That is partially correct, but the call is retried indefinitely and halts graceful shutdown of the cluster.

What this PR does / why we need it:
A CSI volume mounted on a VM hosted on a downed node is in fact already unmounted. In controller.ControllerUnpublishVolume(), when an unmount request is received and the node that the volume should be mounted through cannot be found, the volume is de facto unmounted (albeit ungracefully) and should be reported as such without error. This lets the control plane properly delete the PVC so another pod can bind it, or the cluster can continue to shut down.

If the unmount is unsuccessful, this fix confirms the node is lost and returns that the volume is unmounted as requested.

If the node is backed with some form of HA and the node was migrated elsewhere, the ControllerUnpublishVolume event (once issued) will succeed without this fix because the node it was issued to will have been transparently recovered. This fix only activates when the unmount is unsuccessful, and then only if the ESX inquiry indicates the node (VM) is gone.

Works around kubernetes/kubernetes#51835 which appears to remain unresolved and administratively closed.
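
For illustration, here is a minimal, hypothetical sketch (not the driver's actual code) of the fallback described above. The stand-in names nodeLookup, errNodeNotFound, and unpublishVolumeResponse mirror the c.nodeMgr, node.ErrNodeNotFound, and csi.ControllerUnpublishVolumeResponse identifiers quoted later in this thread.

// Hypothetical sketch only, not the driver's implementation.
package sketch

import (
	"context"
	"errors"
	"log"
)

// errNodeNotFound stands in for node.ErrNodeNotFound in the driver.
var errNodeNotFound = errors.New("node wasn't found")

// nodeLookup is a stand-in for the subset of the node manager used here.
type nodeLookup interface {
	GetNodeByUuid(ctx context.Context, nodeUUID string) (interface{}, error)
}

// unpublishVolumeResponse stands in for csi.ControllerUnpublishVolumeResponse.
type unpublishVolumeResponse struct{}

// handleDetachFailure is called after a detach attempt fails. If the node VM is
// confirmed to be gone from the vCenter inventory, the volume is de facto
// detached, so success is returned instead of retrying the detach forever.
func handleDetachFailure(ctx context.Context, mgr nodeLookup, nodeID, volumeID string,
	detachErr error) (*unpublishVolumeResponse, error) {
	if _, err := mgr.GetNodeByUuid(ctx, nodeID); errors.Is(err, errNodeNotFound) {
		log.Printf("VM for node %q not found in VC inventory; marking ControllerUnpublishVolume for volume %q as successful", nodeID, volumeID)
		return &unpublishVolumeResponse{}, nil
	}
	// The node still exists (or its state is unknown): propagate the original
	// error so the container orchestrator retries the detach.
	return nil, detachErr
}

The essential property is that success is only reported after the missing VM is positively confirmed; any other lookup failure leaves the original detach error in place so the call is retried.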

Which issue this PR fixes
fixes #15

Release note:

Mounted volumes on downed nodes are de-facto unmounted, treat them as such

@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 7, 2023
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: briantopping
Once this PR has been reviewed and has the lgtm label, please assign shalini-b for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor)

Hi @briantopping. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 7, 2023
@briantopping (Author)

@gnufied, @lipingxue: This patch should close out Gardener's need to ship with a fork of the CSI driver. How can I help here? 😊

@briantopping (Author) commented Mar 9, 2023

cc: @chethanv28

Hopefully this is something that can make it into the 3.0.0 release. The new code simply searches for nodes and is only called by this check.

@briantopping (Author)

@divyenpatel / @chethanv28 Would still like to see if we can get this reviewed for 3.0.0 release. What are the chances? 🤓

@briantopping briantopping changed the title Mounted volumes on downed nodes are de-facto unmounted, treat them as such Fix: Mounted volumes on downed nodes are de-facto unmounted, treat them as such Mar 15, 2023
@briantopping (Author) commented Mar 15, 2023

@divyenpatel / @chethanv28 I have updated the PR problem statement. Could I get some feedback please?

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Mar 15, 2023
@briantopping (Author)

"Marking ControllerUnpublishVolume for Volume: %q as successful.", req.NodeId, req.VolumeId)
return &csi.ControllerUnpublishVolumeResponse{}, "", nil
} else {
if err == cnsnode.ErrNodeNotFound {
Member

When we get cnsvsphere.ErrVMNotFound at line 2268, it is already confirmed that VM is not present in the VC inventory. Why do we need to check it again using DNS name?

Member

When the UseCSINodeId feature is enabled and we get a call to detach the volume, we receive the Node VM UUID in the ControllerUnpublishVolume request. Using the Node VM UUID we can look up whether the Node VM is present on the system. If it is present we get the VM MoRef, with which we issue the detach call on the vCenter.

Using DNS to look up a node VM may not be reliable, especially if there are multiple VMs with the same DNS name on the vCenter. The VM UUID is a unique identifier for each VM, making it the best option for verifying the presence of a specific node VM on the vCenter. Therefore, it is recommended to use the VM UUID instead of the DNS name when looking up a node VM.

Author

Regarding the first issue, the transitive closure of calls under c.nodeMgr.GetNodeByName() may return errors for a large number of reasons outside of node failure. The code should not respond that the volume has been released unless it is confirmed that the node was actually lost.

I will update the patch with the UUID changes, thank you for that feedback!
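
To make that guard concrete, a tiny hypothetical sketch (stand-in names, not the driver's code): only the explicit not-found sentinel counts as confirmation that the node is gone, so transient lookup failures are never reported as a successful detach.

// Hypothetical sketch only.
package sketch

import "errors"

// errNodeNotFound is a stand-in for node.ErrNodeNotFound / cnsvsphere.ErrVMNotFound.
var errNodeNotFound = errors.New("node wasn't found")

// nodeConfirmedLost reports whether a lookup error proves the node VM is gone.
// Only the explicit not-found sentinel counts; transient errors (vCenter
// unreachable, auth failure, timeout) must not be treated as confirmation,
// otherwise a still-attached volume could be reported as detached.
func nodeConfirmedLost(lookupErr error) bool {
	return errors.Is(lookupErr, errNodeNotFound)
}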

Author

Regarding the use of DNS vs. node UUID, this code has been in production for some time and I forgot about the mechanics mentioned in line 2271: "As the VM UUID is not known anymore, need to search by using NodeID as DNS Name". Do you have some consideration of this state?

Member

If the Node cannot be returned using c.nodeMgr.GetNodeByName because the Node object has been deleted from the API server, we do a deeper lookup into vCenter using c.nodeMgr.GetNodeByUuid(ctx, req.NodeId).

Refer to the following implementation and let us know what is missing in checking Node VM existence before we mark the detach as a success.

nodevm, err = c.nodeMgr.GetNodeByName(ctx, req.NodeId)
			if err == node.ErrNodeNotFound {
				log.Infof("Performing node VM lookup using node VM UUID: %q", req.NodeId)
				nodevm, err = c.nodeMgr.GetNodeByUuid(ctx, req.NodeId)
			}
func (nodes *Nodes) GetNodeByUuid(ctx context.Context, nodeUuid string) (*cnsvsphere.VirtualMachine, error) {
	return nodes.cnsNodeManager.GetNode(ctx, nodeUuid, nil)
}
func (m *defaultManager) GetNode(ctx context.Context,
	nodeUUID string, dc *vsphere.Datacenter) (*vsphere.VirtualMachine, error) {
	log := logger.GetLogger(ctx)
	vmInf, discovered := m.nodeVMs.Load(nodeUUID)
	if !discovered {
		log.Infof("Node hasn't been discovered yet with nodeUUID %s", nodeUUID)
		var vm *vsphere.VirtualMachine
		var err error
		if dc != nil {
			vm, err = dc.GetVirtualMachineByUUID(ctx, nodeUUID, false)
			if err != nil {
				log.Errorf("failed to find node with nodeUUID %s on datacenter: %+v with err: %v", nodeUUID, dc, err)
				return nil, err
			}
			m.nodeVMs.Store(nodeUUID, vm)
		} else {
			if err = m.DiscoverNode(ctx, nodeUUID); err != nil {
				log.Errorf("failed to discover node with nodeUUID %s with err: %v", nodeUUID, err)
				return nil, err
			}

			vmInf, _ = m.nodeVMs.Load(nodeUUID)
			vm = vmInf.(*vsphere.VirtualMachine)
		}
		log.Infof("Node was successfully discovered with nodeUUID %s in vm %v", nodeUUID, vm)
		return vm, nil
	}

	vm := vmInf.(*vsphere.VirtualMachine)
	log.Debugf("Renewing virtual machine %v with nodeUUID %q", vm, nodeUUID)

	if err := vm.Renew(ctx, true); err != nil {
		log.Errorf("failed to renew VM %v with nodeUUID %q with err: %v", vm, nodeUUID, err)
		return nil, err
	}

	log.Debugf("VM %v was successfully renewed with nodeUUID %q", vm, nodeUUID)
	return vm, nil
}

Author

Thanks @divyenpatel I misread your initial response. Understand what you are after now and will get that resolved. Switching this to draft.

Signed-off-by: Brian Topping <brian.topping@sap.com>
@briantopping briantopping requested review from divyenpatel and removed request for gnufied and lipingxue March 16, 2023 05:45
@briantopping briantopping marked this pull request as draft March 17, 2023 00:40
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2023
@MartinWeindel (Contributor)

@briantopping
I tried to reproduce the issue with the latest release v3.0.0. The good news is that it has been resolved somehow!
The issue always occurred on hibernation, but this time the PV was detached from the missing node. See the "detached successfully from node" message in the log below.

{"level":"error","time":"2023-03-28T10:16:46.643922378Z","caller":"node/manager.go:226","msg":"failed to discover node with nodeUUID 4207bfa2-96c2-345b-90a1-5404602c4788 with err: virtual machine wasn't found","TraceId":"98fadc37-c821-44e9-bd53-b6c444992a77","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*defaultManager).GetNode\n\t/build/pkg/common/cns-lib/node/manager.go:226\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*Nodes).GetNodeByUuid\n\t/build/pkg/common/cns-lib/node/nodes.go:203\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2262\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:2285\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5725\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"info","time":"2023-03-28T10:16:46.643984979Z","caller":"vanilla/controller.go:2269","msg":"Virtual Machine for Node ID: 4207bfa2-96c2-345b-90a1-5404602c4788 is not present in the VC Inventory. Marking ControllerUnpublishVolume for Volume: \"e331b82f-48c6-439a-bf12-39fcb1f09e63\" as successful.","TraceId":"98fadc37-c821-44e9-bd53-b6c444992a77"}
{"level":"info","time":"2023-03-28T10:16:46.644047899Z","caller":"vanilla/controller.go:2294","msg":"Volume \"e331b82f-48c6-439a-bf12-39fcb1f09e63\" detached successfully from node \"4207bfa2-96c2-345b-90a1-5404602c4788\".","TraceId":"98fadc37-c821-44e9-bd53-b6c444992a77"}

@briantopping (Author)

@MartinWeindel this is excellent news. I was in the process of setting up a test rig for this and now I guess that won't be immediately necessary. Will set it up anyway for next time. Thanks a bunch for your efforts on this issue over the last year!!

@briantopping briantopping deleted the fix/stuck-pods branch March 28, 2023 20:32
Successfully merging this pull request may close these issues.

When a Vmware node goes down, Kubernetes is unaware and PV isn't released