Improvements to reduce rate limiting with flex scalesets #4535

Open
wants to merge 1 commit into
base: master
from

Conversation

gcampbell12
Contributor

@gcampbell12 gcampbell12 commented Aug 30, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

The implementation of Azure flexible scalesets in the cloud controller manager causes a high rate of API rate limiting when used at scale, due to the volume of calls made to retrieve instances. This has the knock-on effect of causing instances to be deleted from Kubernetes, because the error is passed down and the instance is handled as not found:

		_, err := fs.vmssFlexVMCache.Get(vmssFlexID, azcache.CacheReadTypeForceRefresh)
		if err != nil {
			klog.Errorf("failed to refresh vmss flex VM cache for vmssFlexID %s", vmssFlexID)
		}
		return true
	})
	cachedNodeName, isCached = fs.vmssFlexVMNameToNodeName.Load(vmName)
	if isCached {
		return fmt.Sprintf("%v", cachedNodeName), nil
	}
	return "", cloudprovider.InstanceNotFound

This PR implements a new NodeExistsByProviderID method on the scaleset. For the uniform and standard implementations this should behave the same as it does currently. For flex, we extract the VM name from the provider ID and get the VM from the VM cache (which will call GetVirtualMachine if uncached); this saves us pulling the scaleset cache and listing every VM when InstanceExists is invoked by the node lifecycle controller. The PR also changes the power status and provisioning status checks to use a provider ID instead of node names, for the same reason: in flex we can retrieve the desired VM just by taking the VM name from the provider ID.
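To illustrate the flex lookup described above, here is a minimal Go sketch. It is not the code in this PR: vmNameFromProviderID and vmCache are hypothetical names, and the sketch assumes the provider ID has the usual Azure resource ID shape ending in virtualMachines/<name>. The toy cache only shows that a miss costs a single per-VM fetch rather than a listing of every scaleset.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// vmProviderIDRE matches provider IDs that point at a standalone VM resource.
// Assumption: flex instances use this shape; uniform VMSS instance IDs do not match.
var vmProviderIDRE = regexp.MustCompile(
	`(?i)^azure:///subscriptions/[^/]+/resourceGroups/[^/]+/providers/Microsoft\.Compute/virtualMachines/([^/]+)$`)

var errNotVMProviderID = errors.New("provider ID does not reference a virtual machine")

// vmNameFromProviderID (hypothetical helper) returns the trailing VM name.
func vmNameFromProviderID(providerID string) (string, error) {
	m := vmProviderIDRE.FindStringSubmatch(providerID)
	if m == nil {
		return "", fmt.Errorf("%w: %q", errNotVMProviderID, providerID)
	}
	return m[1], nil
}

// vmCache is a toy stand-in for a per-VM cache: a hit is served from memory,
// a miss costs exactly one call to fetch (think: one GET for one VM).
type vmCache struct {
	entries map[string]string // VM name -> provisioning state
	fetch   func(vmName string) (string, error)
	fetches int // number of times fetch was invoked
}

func (c *vmCache) get(vmName string) (string, error) {
	if state, ok := c.entries[vmName]; ok {
		return state, nil
	}
	c.fetches++
	state, err := c.fetch(vmName)
	if err != nil {
		return "", err // errors are not cached, so a later call retries
	}
	c.entries[vmName] = state
	return state, nil
}

func main() {
	cache := &vmCache{
		entries: map[string]string{},
		fetch:   func(string) (string, error) { return "Succeeded", nil },
	}
	id := "azure:///subscriptions/sub-id/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/flex-vm-0"
	name, err := vmNameFromProviderID(id)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Repeated lookups of the same VM hit the cache after the first fetch.
	for i := 0; i < 3; i++ {
		if _, err := cache.get(name); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Printf("vm=%s fetches=%d\n", name, cache.fetches)
}
```

In this sketch three lookups of the same VM trigger one fetch, and a uniform-style ID (one that contains virtualMachineScaleSets) is rejected instead of being silently mis-parsed.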

Which issue(s) this PR fixes:

Fixes #2880

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-kind needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 30, 2023
@k8s-ci-robot
Contributor

Hi @gcampbell12. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 30, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gcampbell12
Once this PR has been reviewed and has the lgtm label, please assign andyzhangx for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gcampbell12 gcampbell12 marked this pull request as ready for review August 30, 2023 19:33
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 30, 2023
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 30, 2023
@odinuge
Member

odinuge commented Aug 30, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. do-not-merge/contains-merge-commits and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 30, 2023
@k8s-ci-robot k8s-ci-robot removed do-not-merge/contains-merge-commits do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Sep 1, 2023
@gcampbell12
Contributor Author

/retest

@gcampbell12 gcampbell12 force-pushed the gc/flex-ccm-improvements branch 2 times, most recently from 281b961 to 44a25eb Compare September 7, 2023 22:09
@feiskyer
Member

feiskyer commented Sep 8, 2023

Thanks for the contribution! But I'm a little concerned about this PR, as it would impact all of the VM types.

Could you share the logs of CCM on VMSS flex nodes, especially why the Nodes were deleted? When cache refresh failed, the controller should continue to refresh until it succeeds.

@gcampbell12
Contributor Author

@feiskyer I'll get some logs to you later but to explain what we are seeing:

GetNodeNameByProviderID calls fs.getNodeNameByVMName, which retrieves a list of every VMSS and ranges over them, calling ListVmssFlexVMsWithoutInstanceView and ListVmssFlexVMsWithOnlyInstanceView for each scaleset. With, say, 50 scalesets and a couple of hundred nodes, you soon get rate limited and Azure returns 429s (I believe the number of VMs returned in these responses also contributes to rate limits being used up). This error is propagated from the vmclient back to here:

	if rerr != nil {
		klog.Errorf("ListVmssFlexVMsWithoutInstanceView failed: %v", rerr)
		return nil, rerr.Error()
	}

And here it just ends up being logged, and we carry on looping over the rest of the VMSSes (using up the rate limit even further):

	_, err := fs.vmssFlexVMCache.Get(vmssFlexID, azcache.CacheReadTypeForceRefresh)
	if err != nil {
		klog.Errorf("failed to refresh vmss flex VM cache for vmssFlexID %s", vmssFlexID)
	}

Eventually the getter fails and is run again with a forced cache refresh, which also fails due to the volume of calls, and we return cloudprovider.InstanceNotFound:

	return "", cloudprovider.InstanceNotFound

Here we return false to the node lifecycle controller:

	if errors.Is(err, cloudprovider.InstanceNotFound) {
		return false, nil
	}

This causes the node lifecycle controller to delete the node from k8s:

	if !exists {
		// Current node does not exist, we should delete it, its taints do not matter anymore
		klog.V(2).Infof("deleting node since it is no longer present in cloud provider: %s", node.Name)
		ref := &v1.ObjectReference{
			Kind:      "Node",
			Name:      node.Name,
			UID:       types.UID(node.UID),
			Namespace: "",
		}
		c.recorder.Eventf(ref, v1.EventTypeNormal, deleteNodeEvent,
			"Deleting node %s because it does not exist in the cloud provider", node.Name)
		if err := c.kubeClient.CoreV1().Nodes().Delete(ctx, node.Name, metav1.DeleteOptions{}); err != nil {
			klog.Errorf("unable to delete node %q: %v", node.Name, err)
		}

Since I've made some improvements to the underlying cache methods in flex, I might be able to revert the change that adds new interfaces for all the scalesets and just fix the flex ones in place. I'll look at that later.
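The failure chain walked through above comes down to a transient refresh error being reported as "not found". As a rough illustration (not the change made in this PR), the Go sketch below keeps the two outcomes apart. All names are hypothetical: errInstanceNotFound stands in for cloudprovider.InstanceNotFound, errRateLimited for a throttling error, and listVMsFunc for a per-scaleset list call. The sketch also assumes the caller treats a returned error as "unknown, try again later" rather than as grounds for deleting the node.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the real sentinel errors; both names are assumptions.
var (
	errInstanceNotFound = errors.New("instance not found")
	errRateLimited      = errors.New("rate limited")
)

// listVMsFunc is a hypothetical per-scaleset list call returning VM names.
type listVMsFunc func(vmssID string) ([]string, error)

// findScaleSet looks for vmName across the scalesets. It stops at the first
// list failure so a throttled client does not burn more quota, and it only
// reports "not found" when every scaleset was listed successfully.
func findScaleSet(vmName string, vmssIDs []string, list listVMsFunc) (string, error) {
	for _, id := range vmssIDs {
		vms, err := list(id)
		if err != nil {
			return "", fmt.Errorf("listing VMs in %s: %w", id, err)
		}
		for _, vm := range vms {
			if vm == vmName {
				return id, nil
			}
		}
	}
	return "", errInstanceNotFound
}

// instanceExists maps only a genuine "not found" to (false, nil); any other
// error is returned so the caller can retry instead of deleting the node.
func instanceExists(vmName string, vmssIDs []string, list listVMsFunc) (bool, error) {
	_, err := findScaleSet(vmName, vmssIDs, list)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, errInstanceNotFound):
		return false, nil
	default:
		return false, err
	}
}

func main() {
	calls := 0
	throttled := func(string) ([]string, error) {
		calls++
		return nil, errRateLimited
	}
	exists, err := instanceExists("flex-vm-0", []string{"vmss-1", "vmss-2", "vmss-3"}, throttled)
	fmt.Printf("exists=%v err=%v calls=%d\n", exists, err, calls)
}
```

Stopping at the first failed list call is a design choice for the sketch: it trades completeness of the scan for fewer calls while throttled. Whether to stop or continue is a separate question from the main point, which is that a failed refresh never turns into InstanceNotFound.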

@gcampbell12
Contributor Author

gcampbell12 commented Sep 8, 2023

@feiskyer It's actually pretty easy to replicate this if you rate limit yourself on the client side, e.g.:

ERROR [2023-09-07T15:46:55.990557043Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-1] (pid: 10)
ERROR [2023-09-07T15:46:55.990567562Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990585285Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-2] (pid: 10)
ERROR [2023-09-07T15:46:55.990602967Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990630277Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-3] (pid: 10)
ERROR [2023-09-07T15:46:55.990645964Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990661875Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-4] (pid: 10)
ERROR [2023-09-07T15:46:55.990700825Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:85: ListVmssFlexVMsWithoutInstanceView failed: &{true 0 0001-01-01 00:00:00 +0000 UTC azure cloud provider rate limited(read) for operation "VMList"} (pid: 10)
ERROR [2023-09-07T15:46:55.990714065Z] sigs.k8s.io/cloud-provider-azure/azure_vmssflex_cache.go:148: failed to refresh vmss flex VM cache for vmssFlexID /subscriptions/[subscription-id]/resourceGroups/[resource-group-name]/providers/Microsoft.Compute/virtualMachineScaleSets/[vmss-name-5] (pid: 10)
INFO  [2023-09-07T15:46:55.990728857Z] sigs.k8s.io/cloud-provider-azure/node_lifecycle_controller.go:164: deleting node since it is no longer present in cloud provider: [k8s-node-name] (pid: 10)

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 8, 2023
@gcampbell12 gcampbell12 force-pushed the gc/flex-ccm-improvements branch 2 times, most recently from 1cb822b to 9c69e00 Compare September 13, 2023 11:04
@k8s-ci-robot
Contributor

@gcampbell12: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-cloud-provider-azure-e2e-ccm-vmss-capz | b48a096 | link | true | /test pull-cloud-provider-azure-e2e-ccm-vmss-capz

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 13, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 12, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 12, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Nodes getting deleted after a rate-limited cache refresh
5 participants