-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Exporting Provider-Specific Metrics in External-DNS #4135
base: master
Are you sure you want to change the base?
Proposal: Exporting Provider-Specific Metrics in External-DNS #4135
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Welcome @DerekTBrown! |
Hi @DerekTBrown. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
It's interesting, it can clearly help and in the same time I'm unsure of this proposal 🤔 . |
Ultimately, this proposal is more of a guideline for creating per-provider metrics, rather than a rigid specification. There are benefits if we all follow the guidelines (because we get a consistent experience), but there isn't a technical reason we need to do so. My expectations would be:
Because we lack maintainers, it is unrealistic for us to expect that all 30+ in-tree providers have metrics added. My proposal would be that we should accept metrics PRs to any of the existing in-tree providers, so long as they follow the guidelines in this proposal. I am committed to adding these features for AWS, and I would hope that others add metric functionalities for providers as they see the need.
I think adding metrics and moving providers out-of-tree should be kept as distinct projects and goals:
I am open to feedback on the flag aspect of this specification; my default assumption was that people would want to customize these metrics. However, I think there are many alternatives (such as using environment variables, which might be cleaner). Over time, I would expect the CLI options to migrate into separate webhook provider repositories, so this wouldn't be a huge change in the long-term. |
docs/proposal/provider-metrics.md
Outdated
- The `slidingWindowDurationSec` is the period over which the client-side aggregation finds the maximum. This window should be larger than the Prometheus scrape window, so that spikes are guaranteed to be collected by Prometheus. The default value is 60s. | ||
- Type: gauge | ||
- Annotations: | ||
- `method` - The method being invoked (eg. `ListTagsForResourceWithContext`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe func
or function
instead of method
. I thought of method
like http methods/verbs GET,POST,PUT
@DerekTBrown in general I like the idea. The question would be if this is only "guidance" for providers that use the webhook or if external-dns should collect the metrics. |
What do you think about the proposal, would this help you, too @jbilliau-rcd ? |
We only use external-dns with AWS and don't currently collect any metrics, so not any immediate use or me personally but yeah, seems helpful if and when we ever start. So yeah....no strong opinion on this one, not like the "crash on error" issue I got a little overzealous about :) |
The proposal should also have a section how the webhook provider would be doing this. This should be also part of the first change. |
@szuecs Can you clarify what you mean here? It looks like the webhook provider already has its own metrics scheme: https://github.com/kubernetes-sigs/external-dns/blob/master/provider/webhook/webhook.go#L41 |
@DerekTBrown can you check what is different or if something is missing? |
I added a note about this. I think we are good to land. |
- Labels: | ||
- `method` - The method being invoked (eg. `ListTagsForResourceWithContext`). | ||
- `error` - Whether the invocation of this method resulted in an error. | ||
- `error_code` - Error code returned from the method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we always get an error_code?
I think the SDKs do not return an error code but an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will clarify in the doc- Not all SDKs will return all labels, but if the SDK returns that specific label, it should be exported.
- `method` - The method being invoked (eg. `ListTagsForResourceWithContext`). | ||
- `error` - Whether the invocation of this method resulted in an error. | ||
- `error_code` - Error code returned from the method. | ||
- `external_dns_provider_request_latency_ms` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if latencies are useful for controllers.
latency buckets create a lot of data that seems not really useful, because who cares if it's 20ms or 200ms?
- Type: Histogram | ||
- Labels: | ||
- `method` - The method being invoked (eg. `ListTagsForResourceWithContext`). | ||
- `external_dns_provider_request_rate` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Prometheus you can query rates if you expose a counter, so I would propose to have only counters and do the rate calculations like this sum(rate(external_dns_provider_requests_count{}[1m]))
in prometheus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@szuecs I explained this in the documentation below. DNS providers potentially have ratelimiting windows that are more granular than the Prometheus scrape window:
Ratelimits are often enforced on a more granular time resolution than Prometheus metrics are collected. This metric pre-computes a peak request rate over a sliding window, to give users observability into bursty API requests.
For instance, suppose the Provider limits you to 5 requests per second, and Prometheus scrapes only every 30s. If the Prometheus count for that 30s window is 20, we have no way of knowing if rate-limiting is close to occurring, because its possible that we sent a peak of 1 request per second over 20 different seconds in the window, or alternatively, we would have sent a peak of 20 requests in a single second.
- The `slidingWindowDurationSec` is the period over which the client-side aggregation finds the maximum. This window should be larger than the Prometheus scrape window, so that spikes are guaranteed to be collected by Prometheus. The default value is 60s. | ||
- Type: gauge | ||
- Labels: | ||
- `function` - The function being invoked (eg. `ListTagsForResourceWithContext`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would likely use this as label to the request counter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand this?
/assign @szuecs |
This PR introduces a new proposal to allow for provider-specific metrics in external-dns.