Improve scalability of the EC2/Route53 controllers #2029

Open
justinmir opened this issue Apr 1, 2024 · 0 comments
Labels
enhancement New feature or request

Reddit uses provider-aws primarily to manage EC2 and Route53 resources. In the future we may adopt it for interacting with various AWS managed services. The provider-aws controllers each currently manage several thousand resources: ~3000+ EC2 instances and ~6000+ Route53 records.

We are currently running provider-aws version 0.46.

What problem are you facing?

At this scale we run into issues with high queue depth in our controllers, even with 20+ active workers (provider-aws --max-reconcile-rate=20).

Figure 1: The Crossplane controller is unable to work through its queue of reconcile requests even with 20 active workers at our scale.

Reconciles for EC2 instances typically take more than one second at the median and up to 6 seconds at p99.
Figure 2: Reconcile time for instance resources

Without any jitter, ResourceRecordSet observations can cause a backlog that takes up to an hour to resolve. AWS rate limits Route53 API requests to 5 requests / second / account, which makes it extremely easy to hit the rate limit when observing Route53 resources. These queue depths persist even with the poll interval raised to 30 minutes (up from 1 minute) in our fork.

Figure 3: Resource record set controller is backlogged during instance observation.
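
To put the rate limit in perspective, here is a rough back-of-the-envelope sketch in Go. It assumes each ResourceRecordSet observation costs exactly one Route53 API call, which the numbers above do not confirm; `minSweepDuration` and its parameters are illustrative names, not part of provider-aws.

```go
package main

import (
	"fmt"
	"time"
)

// minSweepDuration returns the shortest time in which every record can be
// observed once without exceeding the API rate limit.
// callsPerObservation is an assumption: the issue does not state how many
// Route53 API calls a single observation makes.
func minSweepDuration(records, callsPerObservation int, limitPerSecond float64) time.Duration {
	seconds := float64(records*callsPerObservation) / limitPerSecond
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	sweep := minSweepDuration(6000, 1, 5) // ~6000 records, 5 requests / second
	pollInterval := 30 * time.Minute      // poll interval used in the fork

	fmt.Printf("minimum sweep: %v\n", sweep)
	fmt.Printf("share of poll interval: %.0f%%\n",
		100*sweep.Seconds()/pollInterval.Seconds())
}
```

Under that assumption, one pass over ~6000 records needs at least 20 minutes of API budget, about two thirds of a 30-minute poll interval, before any create, update, or retry traffic is counted. If observations are not spread out, they compete for that budget all at once.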

How could Crossplane help solve your problem?

Allow configuring per-resource poll interval / jitter
Introduce jitter for resource record sets. Jitter smooths the request rate caused by observations and minimizes the impact of the rate limit. We currently hard-code jitter in our provider-aws fork.
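
As a minimal sketch of what we mean by jitter (this is not the code from our fork, and not an existing provider-aws option), each requeue could pick its delay uniformly around the base poll interval. The function name `jitteredInterval` and the 5-minute window are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Illustrative values: the 30-minute interval matches the fork described
// above; the jitter window is an assumption chosen for the example.
const (
	pollInterval = 30 * time.Minute
	maxJitter    = 5 * time.Minute
)

// jitteredInterval returns base plus a uniformly random offset in
// [-jitter, +jitter], so resources created together drift apart over time
// instead of being observed in the same burst.
func jitteredInterval(base, jitter time.Duration) time.Duration {
	if jitter <= 0 {
		return base
	}
	// Int63n(n) yields [0, n), so this offset covers [-jitter, +jitter].
	offset := time.Duration(rand.Int63n(2*int64(jitter)+1)) - jitter
	return base + offset
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println("requeue after", jitteredInterval(pollInterval, maxJitter))
	}
}
```

Making both the base interval and the jitter window configurable per resource kind would let rate-limited APIs such as Route53 use a wide window while other resources keep a short poll interval.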


Reduce reconcile time for EC2 instance resources
Reduce the time spent on an EC2 instance observation by (1) removing unnecessary API calls that fetch duplicate data, and (2) parallelizing API calls where possible.
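
A rough sketch of point (2), using only the Go standard library: independent describe calls run concurrently, so the observation takes about as long as the slowest call rather than the sum of all calls. `describeFunc`, `observeParallel`, and the call names are hypothetical placeholders; which calls the instance controller actually makes, and which of them are independent, would need to be checked against the code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// describeFunc stands in for one independent AWS describe call. It is a
// hypothetical placeholder, not a provider-aws or AWS SDK type.
type describeFunc func() (string, error)

// observeParallel runs every call in its own goroutine and returns the
// results in input order, or the first error found in input order.
func observeParallel(calls []describeFunc) ([]string, error) {
	results := make([]string, len(calls))
	errs := make([]error, len(calls))

	var wg sync.WaitGroup
	for i, call := range calls {
		wg.Add(1)
		go func(i int, call describeFunc) {
			defer wg.Done()
			// Each goroutine writes only to its own slot, so no lock is needed.
			results[i], errs[i] = call()
		}(i, call)
	}
	wg.Wait()

	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return results, nil
}

// slowCall simulates an API round trip with the given latency.
func slowCall(name string, latency time.Duration) describeFunc {
	return func() (string, error) {
		time.Sleep(latency)
		return name, nil
	}
}

func main() {
	calls := []describeFunc{
		slowCall("instance", 100*time.Millisecond),
		slowCall("attributes", 100*time.Millisecond),
		slowCall("tags", 100*time.Millisecond),
	}
	start := time.Now()
	results, err := observeParallel(calls)
	fmt.Println(results, err, "elapsed:", time.Since(start))
}
```

Parallelizing shortens each reconcile but does not reduce the number of requests, so it pairs best with point (1) on APIs that are close to their rate limit.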
