Improvements to reduce rate limiting with flex scalesets #4535

gcampbell12 · 2023-08-30T18:52:54Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

The implementation of Azure flexible scalesets in the cloud controller manager triggers heavy API rate limiting at scale because of the volume of calls made to retrieve instances. This has the knock-on effect of instances being deleted from Kubernetes, because the error is passed down and the instance is treated as not found:

cloud-provider-azure/pkg/provider/azure_vmssflex_cache.go

Lines 146 to 157 in 371c150

    
        _, err := fs.vmssFlexVMCache.Get(vmssFlexID, azcache.CacheReadTypeForceRefresh)
        if err != nil {
            klog.Errorf("failed to refresh vmss flex VM cache for vmssFlexID %s", vmssFlexID)
        }
        return true
    })

    cachedNodeName, isCached = fs.vmssFlexVMNameToNodeName.Load(vmName)
    if isCached {
        return fmt.Sprintf("%v", cachedNodeName), nil
    }
    return "", cloudprovider.InstanceNotFound

This PR implements a new NodeExistsByProviderID method on the scaleset interface. For the uniform and standard implementations it should behave the same as it does today. For flex, we extract the VM name from the provider ID and get the VM from the VM cache (which calls GetVirtualMachine if uncached). This saves us pulling the scaleset cache and listing every VM whenever InstanceExists is invoked by the node lifecycle controller. The PR also changes the power status and provisioning status checks to take a provider ID instead of a node name, for the same reason: in flex we can retrieve the desired VM just by taking the VM name from the provider ID.
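As an illustration of the "extract the VM name from the provider ID" step, here is a minimal sketch. It is not the code in this PR: `vmNameFromProviderID` is a hypothetical helper, and it assumes the provider ID is the `azure://` scheme followed by the ARM resource ID of a standalone `Microsoft.Compute/virtualMachines` resource, which is the shape flex instances use. IDs for uniform scaleset instances are deliberately rejected.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// providerIDPrefix is the scheme assumed to precede the ARM resource ID.
const providerIDPrefix = "azure://"

// vmNameFromProviderID is a hypothetical helper (not taken from the PR).
// It returns the VM name when the provider ID ends in
// .../providers/Microsoft.Compute/virtualMachines/<vm-name>.
func vmNameFromProviderID(providerID string) (string, error) {
	if !strings.HasPrefix(providerID, providerIDPrefix) {
		return "", errors.New("provider ID does not start with " + providerIDPrefix)
	}
	resourceID := strings.TrimPrefix(providerID, providerIDPrefix)
	parts := strings.Split(strings.Trim(resourceID, "/"), "/")
	n := len(parts)
	// Requiring Microsoft.Compute directly before virtualMachines excludes
	// uniform instances (.../virtualMachineScaleSets/<name>/virtualMachines/<id>).
	if n < 3 ||
		!strings.EqualFold(parts[n-3], "Microsoft.Compute") ||
		!strings.EqualFold(parts[n-2], "virtualMachines") ||
		parts[n-1] == "" {
		return "", errors.New("provider ID is not a standalone VM resource: " + providerID)
	}
	return parts[n-1], nil
}

func main() {
	id := "azure:///subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/flex-vm-0"
	name, err := vmNameFromProviderID(id)
	fmt.Println(name, err)
}
```

With the name in hand, a single-VM lookup can replace the per-scaleset list calls, which is the saving described above.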

Which issue(s) this PR fixes:

Fixes #2880

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2023-08-30T18:53:14Z

Hi @gcampbell12. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2023-08-30T18:53:31Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gcampbell12
Once this PR has been reviewed and has the lgtm label, please assign andyzhangx for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

odinuge · 2023-08-30T20:03:33Z

/ok-to-test

gcampbell12 · 2023-09-06T11:04:51Z

/retest

feiskyer · 2023-09-08T05:26:17Z

Thanks for the contribution! But I'm a little concerned about this PR, as it would impact all of the VM types.

Could you share the logs of CCM on VMSS flex nodes, especially why the Nodes were deleted? When cache refresh failed, the controller should continue to refresh until it succeeds.

gcampbell12 · 2023-09-08T06:54:06Z

@feiskyer I'll get some logs to you later but to explain what we are seeing:

If we look at GetNodeNameByProviderID, it calls fs.getNodeNameByVMName, which retrieves a list of every VMSS and ranges through them, calling ListVmssFlexVMsWithoutInstanceView and ListVmssFlexVMsWithOnlyInstanceView for each scaleset. If you have, say, 50 scalesets and a couple of hundred nodes, you soon get rate limited and Azure returns 429s (I believe the number of VMs returned in these responses also contributes to the rate limit being used up). This error is propagated from the vmclient back to here:

cloud-provider-azure/pkg/provider/azure_vmssflex_cache.go

Lines 84 to 87 in d9bb39d

    
    if rerr != nil {
        klog.Errorf("ListVmssFlexVMsWithoutInstanceView failed: %v", rerr)
        return nil, rerr.Error()
    }

And here it just ends up being logged, and we carry on looping over the rest of the VMSSs (using up even more of the rate limit):

cloud-provider-azure/pkg/provider/azure_vmssflex_cache.go

Lines 146 to 149 in d9bb39d

    
    _, err := fs.vmssFlexVMCache.Get(vmssFlexID, azcache.CacheReadTypeForceRefresh)
    if err != nil {
        klog.Errorf("failed to refresh vmss flex VM cache for vmssFlexID %s", vmssFlexID)
    }

Eventually the getter fails and is run again with a forced cache refresh, which also fails due to the volume of calls, and we return cloudprovider.InstanceNotFound:

cloud-provider-azure/pkg/provider/azure_vmssflex_cache.go

Line 157 in d9bb39d

return "", cloudprovider.InstanceNotFound

Here we return false to the node lifecycle controller

cloud-provider-azure/pkg/provider/azure_instances.go

Lines 205 to 207 in d9bb39d

    
    if errors.Is(err, cloudprovider.InstanceNotFound) {
        return false, nil
    }

which causes the node lifecycle controller to delete the node from k8s

cloud-provider-azure/vendor/k8s.io/cloud-provider/controllers/nodelifecycle/node_lifecycle_controller.go

Lines 161 to 178 in d9bb39d

    
    if !exists {
        // Current node does not exist, we should delete it, its taints do not matter anymore
        klog.V(2).Infof("deleting node since it is no longer present in cloud provider: %s", node.Name)
        ref := &v1.ObjectReference{
            Kind:      "Node",
            Name:      node.Name,
            UID:       types.UID(node.UID),
            Namespace: "",
        }
        c.recorder.Eventf(ref, v1.EventTypeNormal, deleteNodeEvent,
            "Deleting node %s because it does not exist in the cloud provider", node.Name)
        if err := c.kubeClient.CoreV1().Nodes().Delete(ctx, node.Name, metav1.DeleteOptions{}); err != nil {
            klog.Errorf("unable to delete node %q: %v", node.Name, err)
        }

Since I've made some improvements to the underlying cache methods in flex, I might be able to revert the change that adds new interfaces for all the scalesets and just fix the flex ones in place. I'll look at that later.
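To make the failure chain above concrete, here is a small self-contained sketch of the distinction that matters: only a definite "not found" should turn into `false, nil`, while a throttling error should be returned so the caller retries instead of deleting the node. Everything here is hypothetical and for illustration only: `errInstanceNotFound` stands in for `cloudprovider.InstanceNotFound`, `errThrottled` stands in for a 429 or client-side rate limit error, and `lookupFunc` stands in for a single-VM lookup.

```go
package main

import (
	"errors"
	"fmt"
)

// errInstanceNotFound stands in for cloudprovider.InstanceNotFound.
var errInstanceNotFound = errors.New("instance not found")

// errThrottled stands in for a 429 or a client-side rate limit error.
var errThrottled = errors.New("rate limited")

// lookupFunc is a hypothetical stand-in for a single-VM lookup.
type lookupFunc func(vmName string) error

// instanceExists reports false with a nil error only when the lookup
// positively says the VM is gone. Any other error is wrapped and returned,
// so a throttled lookup is never mistaken for a missing instance.
func instanceExists(lookup lookupFunc, vmName string) (bool, error) {
	err := lookup(vmName)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, errInstanceNotFound):
		return false, nil
	default:
		return false, fmt.Errorf("checking %q: %w", vmName, err)
	}
}

func main() {
	throttled := func(string) error { return errThrottled }
	missing := func(string) error { return errInstanceNotFound }

	exists, err := instanceExists(throttled, "flex-vm-0")
	fmt.Println(exists, err)

	exists, err = instanceExists(missing, "flex-vm-0")
	fmt.Println(exists, err)
}
```

In the first call the error is non-nil, so a caller shaped like the node lifecycle controller would skip the node rather than delete it. Only the second call yields the `false, nil` pair that leads to deletion.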

gcampbell12 · 2023-09-08T07:04:45Z

@feiskyer It's actually pretty easy to replicate this if you rate limit yourself on the client side, e.g.:

ERROR [2023-09-07T15:46:55.990557043Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-1] (pid: 10)
ERROR [2023-09-07T15:46:55.990567562Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990585285Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-2] (pid: 10)
ERROR [2023-09-07T15:46:55.990602967Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990630277Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-3] (pid: 10)
ERROR [2023-09-07T15:46:55.990645964Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990661875Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-4] (pid: 10)
ERROR [2023-09-07T15:46:55.990700825Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990714065Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-5] (pid: 10)
INFO  [2023-09-07T15:46:55.990728857Z] sigs.k8s.io/cloud-provider-azure/node_lifecycle_controller.go:164: deleting node since it is no longer present in cloud provider: [k8s-node-name] (pid: 10)
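The pattern in these logs can be reproduced with a toy model. This is a sketch under stated assumptions, not the provider's code: `budgetClient` is a hypothetical fake that allows a fixed number of calls and then fails, `refreshScaleset` assumes two list calls per scaleset with an early return on the first error (as in the snippet at lines 84 to 87), and `refreshAll` assumes the log-and-continue loop (as in the snippet at lines 146 to 149).

```go
package main

import (
	"errors"
	"fmt"
)

var errRateLimited = errors.New("rate limited")

// budgetClient is a hypothetical fake API client: it permits `remaining`
// calls and returns errRateLimited for every call after that.
type budgetClient struct {
	remaining int
	attempts  int
}

func (c *budgetClient) list() error {
	c.attempts++
	if c.remaining == 0 {
		return errRateLimited
	}
	c.remaining--
	return nil
}

// refreshScaleset models one cache refresh: two list calls (without and
// with instance view), returning as soon as one of them fails.
func refreshScaleset(c *budgetClient) error {
	for i := 0; i < 2; i++ {
		if err := c.list(); err != nil {
			return err
		}
	}
	return nil
}

// refreshAll models the log-and-continue loop: a failed refresh is counted
// (logged, in the real code) and the loop moves on to the next scaleset.
func refreshAll(c *budgetClient, scalesets int) (failed int) {
	for i := 0; i < scalesets; i++ {
		if err := refreshScaleset(c); err != nil {
			failed++
		}
	}
	return failed
}

func main() {
	healthy := &budgetClient{remaining: 1000}
	fmt.Println("healthy:", refreshAll(healthy, 50), "failed,", healthy.attempts, "calls")

	throttled := &budgetClient{remaining: 0}
	fmt.Println("throttled:", refreshAll(throttled, 50), "failed,", throttled.attempts, "calls")
}
```

In this model a single node lookup across 50 scalesets costs 100 list calls when nothing is throttled, and it still issues 50 calls when every call is rejected, because the loop never stops at the first rate limit error.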

k8s-ci-robot · 2023-10-31T03:25:35Z

@gcampbell12: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-cloud-provider-azure-e2e-ccm-vmss-capz	`b48a096`	link	true	`/test pull-cloud-provider-azure-e2e-ccm-vmss-capz`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot · 2024-01-13T08:37:25Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-triage-robot · 2024-04-12T09:13:11Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-05-12T10:11:37Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 30, 2023

k8s-ci-robot requested review from jwtty and MartinForReal August 30, 2023 18:53

gcampbell12 marked this pull request as ready for review August 30, 2023 19:33

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 30, 2023

k8s-ci-robot requested a review from feiskyer August 30, 2023 19:33

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 30, 2023

gcampbell12 force-pushed the gc/flex-ccm-improvements branch from 1359aa2 to 635d990 Compare September 1, 2023 16:27

k8s-ci-robot removed do-not-merge/contains-merge-commits do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Sep 1, 2023

gcampbell12 force-pushed the gc/flex-ccm-improvements branch from 39fe50c to f338523 Compare September 5, 2023 18:01

gcampbell12 force-pushed the gc/flex-ccm-improvements branch 2 times, most recently from 281b961 to 44a25eb Compare September 7, 2023 22:09

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 8, 2023

gcampbell12 force-pushed the gc/flex-ccm-improvements branch 2 times, most recently from 1cb822b to 9c69e00 Compare September 13, 2023 11:04

Improvements to reduce rate limiting with flex scalesets

b48a096

gcampbell12 force-pushed the gc/flex-ccm-improvements branch from 9c69e00 to b48a096 Compare September 13, 2023 11:05

gcampbell12 mentioned this pull request Sep 13, 2023

Support ARG #1493

Closed

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 13, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 12, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 12, 2024
