
Proposal: Exporting Provider-Specific Metrics in External-DNS #4135

Open
wants to merge 4 commits into master

Conversation

DerekTBrown

This PR introduces a new proposal to allow for provider-specific metrics in external-dns.

@k8s-ci-robot added the `cncf-cla: yes` label (indicates the PR's author has signed the CNCF CLA) on Dec 22, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign szuecs for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @DerekTBrown!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the `needs-ok-to-test` label (indicates a PR that requires an org member to verify it is safe to test) on Dec 22, 2023
@k8s-ci-robot
Contributor

Hi @DerekTBrown. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the `size/M` label (denotes a PR that changes 30-99 lines, ignoring generated files) on Dec 22, 2023
@mloiseleur
Contributor

This is interesting: it could clearly help, yet at the same time I'm unsure about this proposal 🤔.
How would this work with webhook providers? How would the in-tree providers be updated? With the current number of in-tree providers, it would mean a massive addition of CLI switches.
To put it differently: an important issue this project currently has is the lack of maintainers per provider.
It feels like this should be a feature of the webhook provider, and also a source of motivation to move providers out of tree.

@mloiseleur
Contributor

cc @Raffo @szuecs @johngmyers

@DerekTBrown
Author

How would this work with webhook providers?

Ultimately, this proposal is more of a guideline for creating per-provider metrics, rather than a rigid specification. There are benefits if we all follow the guidelines (because we get a consistent experience), but there isn't a technical reason we need to do so. My expectations would be:

  • Webhook providers must follow the `external_dns_provider_{provider_name}_{metric_name}` metric format (see the sketch after this list).
  • Helm charts for external-dns must include capabilities to export/scrape metrics from all webhook containers.
  • Webhook providers ideally should export the canonical metrics (`request_count`, `request_latency`, `request_rate`).
  • Webhook providers ideally should follow the CLI flag guidelines, so that users get a consistent experience.
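
For concreteness, here is a minimal sketch of the first guideline for a webhook provider, using the Prometheus Go client. The provider name `exampledns`, the label set, and the listen port are illustrative assumptions, not part of the proposal:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Canonical name: external_dns_provider_{provider_name}_{metric_name},
// with "exampledns" standing in for a real provider name.
var requestCount = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "external_dns_provider_exampledns_request_count",
		Help: "Number of requests made to the exampledns API.",
	},
	[]string{"method", "error", "error_code"},
)

func main() {
	// Record one successful call; a real provider would wrap its SDK client.
	requestCount.WithLabelValues("ListTagsForResourceWithContext", "false", "").Inc()

	// Expose /metrics so the Helm chart can scrape the webhook container.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9162", nil))
}
```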

How would the in-tree providers be updated?

Because we lack maintainers, it is unrealistic to expect that all 30+ in-tree providers will have metrics added. My proposal is that we accept metrics PRs for any existing in-tree provider, so long as they follow the guidelines in this proposal. I am committed to adding these features for AWS, and I hope that others add metrics for providers as they see the need.

It feels like this should be a feature of the webhook provider, and also a source of motivation to move providers out of tree.

I think adding metrics and moving providers out-of-tree should be kept as distinct projects and goals:

  1. The majority of users still rely on in-tree providers. We shouldn't make metrics support (which can be a critical reliability feature) conditional on migrating to out-of-tree providers.

  2. No re-work is required. If we implement metrics for an in-tree provider and later migrate the provider out-of-tree, no additional changes or work is required; the provider already complies with the metrics proposal.

it would mean a massive addition of CLI switches.

I am open to feedback on the flag aspect of this specification; my default assumption was that people would want to customize these metrics. However, I think there are many alternatives (such as using environment variables, which might be cleaner).

Over time, I would expect the CLI options to migrate into separate webhook provider repositories, so this wouldn't be a huge change in the long-term.

- The `slidingWindowDurationSec` is the period over which the client-side aggregation finds the maximum. This window should be larger than the Prometheus scrape window, so that spikes are guaranteed to be collected by Prometheus. The default value is 60s.
- Type: gauge
- Annotations:
- `method` - The method being invoked (eg. `ListTagsForResourceWithContext`).
Contributor

Maybe `func` or `function` instead of `method`. I read `method` as HTTP methods/verbs: GET, POST, PUT.

@szuecs
Contributor

szuecs commented Jan 10, 2024

@DerekTBrown in general I like the idea. The question is whether this is only "guidance" for providers that use the webhook, or whether external-dns itself should collect the metrics.
Maybe you can add pros/cons of why this should stay in-tree. On the other hand, some of the metrics could also be collected by the webhook provider itself. We could also think of a flag like `--metrics-provider="<aws|gcp|..>"` and then use its value for the metrics label that I wrote before:
`external_dns_provider__$metric_name${"provider"="$provider_name$"}`
so `--metrics-provider="foo"` would create metrics of the kind `external_dns_provider__$metric_name${"provider"="foo"}` (see the sketch below).
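
A minimal sketch of this flag-plus-label alternative, assuming a plain Go flag and the Prometheus Go client; the actual flag wiring in external-dns would differ:

```go
package main

import (
	"flag"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical flag; external-dns uses its own flag-parsing machinery.
var metricsProvider = flag.String("metrics-provider", "", "value for the provider label, e.g. aws or gcp")

// One shared metric for all providers, partitioned by a provider label.
var requestCount = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "external_dns_provider_request_count",
		Help: "Number of provider API requests, partitioned by provider.",
	},
	[]string{"provider", "method"},
)

func main() {
	flag.Parse()
	// --metrics-provider="foo" yields
	// external_dns_provider_request_count{provider="foo",method="..."}.
	requestCount.WithLabelValues(*metricsProvider, "ListTagsForResourceWithContext").Inc()
}
```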

@szuecs
Contributor

szuecs commented Jan 10, 2024

What do you think about the proposal; would this help you too, @jbilliau-rcd?
Maybe you have input as well.

@jbilliau-rcd

We only use external-dns with AWS and don't currently collect any metrics, so there's no immediate use for me personally, but it seems helpful if and when we ever start.

So yeah... no strong opinion on this one, unlike the "crash on error" issue I got a little overzealous about :)

@szuecs
Contributor

szuecs commented Jan 11, 2024

The proposal should also have a section on how the webhook provider would do this. This should also be part of the first change.

@DerekTBrown
Author

DerekTBrown commented Jan 11, 2024

The proposal should also have a section on how the webhook provider would do this. This should also be part of the first change.

@szuecs Can you clarify what you mean here?

It looks like the webhook provider already has its own metrics scheme:

https://github.com/kubernetes-sigs/external-dns/blob/master/provider/webhook/webhook.go#L41

@szuecs
Contributor

szuecs commented Jan 12, 2024

@DerekTBrown can you check what is different or whether something is missing?
Add this to the proposal, so that there is at least awareness of it.

@DerekTBrown
Author

@DerekTBrown can you check what is different or whether something is missing? Add this to the proposal, so that there is at least awareness of it.

I added a note about this. I think we are good to land.

- Labels:
- `method` - The method being invoked (eg. `ListTagsForResourceWithContext`).
- `error` - Whether the invocation of this method resulted in an error.
- `error_code` - Error code returned from the method.
Contributor

Do we always get an `error_code`?
I think the SDKs return an error, not an error code.

Author

I will clarify in the doc: not all SDKs will return all labels, but if an SDK returns that specific label, it should be exported.
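
A sketch of that rule, using the AWS SDK for Go's `awserr.Error` as one example of an SDK that does expose an error code; other SDKs would need their own type assertions, and the helper name and empty-string fallback are assumptions of this sketch:

```go
package main

import (
	"errors"
	"fmt"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// errorLabels (hypothetical helper) derives the error/error_code label
// values from an SDK error, exporting error_code only when the SDK has one.
func errorLabels(err error) (errLabel, codeLabel string) {
	if err == nil {
		return "false", ""
	}
	var aerr awserr.Error
	if errors.As(err, &aerr) {
		// This SDK exposes an error code, so export it as a label.
		return "true", aerr.Code()
	}
	// No code available from this SDK; export only the error flag.
	return "true", ""
}

func main() {
	errLabel, codeLabel := errorLabels(awserr.New("Throttling", "rate exceeded", nil))
	fmt.Println(errLabel, codeLabel) // true Throttling
}
```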

- `method` - The method being invoked (eg. `ListTagsForResourceWithContext`).
- `error` - Whether the invocation of this method resulted in an error.
- `error_code` - Error code returned from the method.
- `external_dns_provider_request_latency_ms`
Contributor

szuecs commented Feb 13, 2024

Not sure latencies are useful for controllers.
Latency buckets create a lot of data that doesn't seem very useful, because who cares whether it's 20ms or 200ms?

- Type: Histogram
- Labels:
- `method` - The method being invoked (eg. `ListTagsForResourceWithContext`).
- `external_dns_provider_request_rate`
Contributor

In Prometheus you can query rates if you expose a counter, so I would propose having only counters and doing the rate calculation in Prometheus, like `sum(rate(external_dns_provider_requests_count{}[1m]))` (see the sketch below).
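
A minimal sketch of this counter-only approach, taking the metric name from the query above; the rate is then derived at query time rather than exported:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var requestsCount = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "external_dns_provider_requests_count",
		Help: "Total provider API requests; rates are computed in PromQL.",
	},
	[]string{"method"},
)

func main() {
	requestsCount.WithLabelValues("ListTagsForResourceWithContext").Inc()
	// Dashboard/alert side (PromQL, not Go):
	//   sum(rate(external_dns_provider_requests_count{}[1m]))
}
```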

Author

@szuecs I explained this in the documentation below. DNS providers potentially have rate-limiting windows that are more granular than the Prometheus scrape window:

Rate limits are often enforced at a more granular time resolution than Prometheus metrics are collected at. This metric pre-computes a peak request rate over a sliding window, to give users observability into bursty API requests.

For instance, suppose the provider limits you to 5 requests per second, and Prometheus scrapes only every 30s. If the Prometheus count for that 30s window is 20, we have no way of knowing whether rate-limiting is close to occurring: it's possible that we sent a peak of 1 request per second over 20 different seconds in the window, or alternatively, that we sent a peak of 20 requests in a single second.
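
A minimal sketch of the client-side aggregation described above: bucket requests per second, and export the per-second peak over the last `slidingWindowDurationSec` seconds as a gauge, so a burst is visible even with a 30s scrape interval. The types and names are illustrative assumptions, not the proposal's implementation:

```go
package main

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var peakRate = promauto.NewGauge(prometheus.GaugeOpts{
	Name: "external_dns_provider_request_rate",
	Help: "Peak requests per second over the sliding window.",
})

type slidingWindow struct {
	mu      sync.Mutex
	buckets map[int64]int // request count per one-second bucket
	size    int64         // slidingWindowDurationSec, default 60
}

// observe records one request and re-exports the window's per-second peak.
func (w *slidingWindow) observe() {
	w.mu.Lock()
	defer w.mu.Unlock()
	now := time.Now().Unix()
	w.buckets[now]++
	peak := 0
	for sec, n := range w.buckets {
		if sec <= now-w.size {
			delete(w.buckets, sec) // evict buckets outside the window
			continue
		}
		if n > peak {
			peak = n
		}
	}
	peakRate.Set(float64(peak))
}

func main() {
	w := &slidingWindow{buckets: map[int64]int{}, size: 60}
	w.observe() // a provider would call this from its request hook
}
```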

- The `slidingWindowDurationSec` is the period over which the client-side aggregation finds the maximum. This window should be larger than the Prometheus scrape window, so that spikes are guaranteed to be collected by Prometheus. The default value is 60s.
- Type: gauge
- Labels:
- `function` - The function being invoked (eg. `ListTagsForResourceWithContext`).
Contributor

I would likely use this as a label on the request counter.

Author

I am not sure I understand this.

@mloiseleur
Contributor

/assign @szuecs
