Improvements to reduce rate limiting with flex scalesets #4535
Conversation
Hi @gcampbell12. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: gcampbell12. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/ok-to-test
/retest
Thanks for the contribution! But I'm a little concerned about this PR, as it would impact all of the VM types. Could you share the logs of CCM on VMSS Flex nodes, especially why the Nodes were deleted? When a cache refresh fails, the controller should continue to refresh until it succeeds.
@feiskyer I'll get some logs to you later, but to explain what we are seeing: if we look at cloud-provider-azure/pkg/provider/azure_vmssflex_cache.go, lines 84 to 87 in d9bb39d:
And here it just ends up being logged and we carry on looping over the rest of the VMSSes (further using up rate limits): cloud-provider-azure/pkg/provider/azure_vmssflex_cache.go, lines 146 to 149 in d9bb39d
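Roughly, the failure mode being described looks like the following simplified sketch (not the actual repo code; listVMs and the cache map are hypothetical stand-ins):

```go
package sketch

import (
	"context"
	"log"
)

// refreshAllVmssFlexCaches illustrates the pattern described above (a
// simplified sketch, not the actual repo code). Every VMSS Flex is listed in
// turn; when one list call fails (for example because ARM is throttling us),
// the error is only logged and the loop keeps issuing calls for the remaining
// scale sets, consuming even more of the rate-limit budget.
func refreshAllVmssFlexCaches(
	ctx context.Context,
	vmssFlexIDs []string,
	listVMs func(ctx context.Context, vmssFlexID string) ([]string, error),
	cache map[string][]string,
) {
	for _, id := range vmssFlexIDs {
		vms, err := listVMs(ctx, id)
		if err != nil {
			log.Printf("failed to list VMs for vmss flex %q: %v", id, err)
			continue // the failure is swallowed; we carry on calling the API
		}
		cache[id] = vms
	}
}
```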
Eventually the getter fails and is run again with a forced cache refresh, which also fails due to the volume of calls, and we return cloudprovider.InstanceNotFound.
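A sketch of that getter behaviour, assuming the lookup is simply retried with a forced refresh and the refresh failure is collapsed into a not-found error (readCache is a hypothetical stand-in for the real cache helper, not the repo code):

```go
package sketch

import (
	"context"

	cloudprovider "k8s.io/cloud-provider"
)

// getVMThroughCache sketches the retry-with-forced-refresh behaviour described
// above. The lookup is attempted against the cache first and then retried
// bypassing the cache; if the forced refresh is also throttled, the failure is
// reported as cloudprovider.InstanceNotFound even though the VM still exists.
func getVMThroughCache(
	ctx context.Context,
	vmName string,
	readCache func(ctx context.Context, vmName string, forceRefresh bool) (interface{}, error),
) (interface{}, error) {
	if vm, err := readCache(ctx, vmName, false); err == nil && vm != nil {
		return vm, nil
	}
	vm, err := readCache(ctx, vmName, true) // forced refresh goes to ARM directly
	if err != nil || vm == nil {
		// A rate-limited refresh ends up here and is collapsed into "not found".
		return nil, cloudprovider.InstanceNotFound
	}
	return vm, nil
}
```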
Here we return false to the node lifecycle controller: cloud-provider-azure/pkg/provider/azure_instances.go, lines 205 to 207 in d9bb39d
Lines 161 to 178 in d9bb39d
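A simplified sketch of how that surfaces to the node lifecycle controller, assuming the getter reports cloudprovider.InstanceNotFound as described (getVM is a hypothetical stand-in; this is not the exact repo code):

```go
package sketch

import (
	"context"
	"errors"

	cloudprovider "k8s.io/cloud-provider"
)

// instanceExistsByProviderID sketches how the error path above turns into a
// "node no longer exists" answer for the node lifecycle controller.
func instanceExistsByProviderID(
	ctx context.Context,
	providerID string,
	getVM func(ctx context.Context, providerID string) error,
) (bool, error) {
	if err := getVM(ctx, providerID); err != nil {
		if errors.Is(err, cloudprovider.InstanceNotFound) {
			// Returning (false, nil) tells the node lifecycle controller that
			// the instance is gone, so it deletes the Node object.
			return false, nil
		}
		return false, err
	}
	return true, nil
}
```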
Since I've made some improvements to the underlying cache methods in flex, I might be able to revert the change that adds new interfaces for all the scale sets and just fix the flex ones in place; I'll look at that later.
@feiskyer It's actually pretty easy to replicate this if you rate limit yourself on the client side, e.g.:
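As one illustration (not necessarily the exact settings used), very low client-side limits in the cloud provider config (azure.json) make the limiter kick in almost immediately; the field names below follow the Azure cloud provider rate-limit config, and the values are only illustrative:

```json
{
  "cloudProviderRateLimit": true,
  "cloudProviderRateLimitQPS": 1,
  "cloudProviderRateLimitBucket": 1,
  "cloudProviderRateLimitQPSWrite": 1,
  "cloudProviderRateLimitBucketWrite": 1
}
```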
@gcampbell12: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
What type of PR is this?
/kind bug
What this PR does / why we need it:
The implementation of Azure flexible scalesets in the cloud controller manager is causing a high rate of API rate limiting when used at scale, due to the volume of calls made to retrieve instances. This has the knock-on effect of causing instances to be deleted from Kubernetes, because the error is passed down and the instance is handled as not found:
cloud-provider-azure/pkg/provider/azure_vmssflex_cache.go, lines 146 to 157 in 371c150
This PR implements a new NodeExistsByProviderID method on the scaleset. For the uniform and standard implementations this should behave the same as it does currently; for flex we extract the VM name from the provider ID and get the VM from the VM cache (which will call GetVirtualMachine if uncached). This saves us from pulling the scaleset cache and listing every VM when InstanceExists is invoked by the node lifecycle controller. It also changes the power status and provisioning status checks to use a provider ID instead of node names for the same reason, so we can easily retrieve the desired VM just by taking the VM name from the provider ID in flex.
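For illustration, a minimal sketch of the flex lookup idea (assuming the usual azure:///.../virtualMachines/<vmName> provider ID format for standalone VMs; this is not the exact code in the PR):

```go
package sketch

import (
	"fmt"
	"strings"
)

// vmNameFromProviderID takes the VM resource name from the last path segment
// of a provider ID such as
// azure:///subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vmName>.
func vmNameFromProviderID(providerID string) (string, error) {
	trimmed := strings.TrimSuffix(providerID, "/")
	idx := strings.LastIndex(trimmed, "/")
	if idx < 0 || idx == len(trimmed)-1 {
		return "", fmt.Errorf("unexpected provider ID %q", providerID)
	}
	return trimmed[idx+1:], nil
}
```

The returned name can then be used as the key into the existing VM cache instead of listing every VMSS Flex.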
Which issue(s) this PR fixes:
Fixes #2880
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: