
Proposal: Exporting Provider-Specific Metrics in External-DNS #4135

Open
wants to merge 4 commits into master

Conversation

DerekTBrown

This PR introduces a new proposal to allow for provider-specific metrics in external-dns.

@k8s-ci-robot added the `cncf-cla: yes` label (indicates the PR's author has signed the CNCF CLA) on Dec 22, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign szuecs for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @DerekTBrown!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the `needs-ok-to-test` label (indicates a PR that requires an org member to verify it is safe to test) on Dec 22, 2023
@k8s-ci-robot
Contributor

Hi @DerekTBrown. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the `size/M` label (denotes a PR that changes 30-99 lines, ignoring generated files) on Dec 22, 2023
@mloiseleur
Contributor

This is interesting: it could clearly help, yet at the same time I'm unsure about this proposal 🤔.
How would this work with webhook providers? How would the in-tree providers be updated? With the current number of in-tree providers, it would mean a massive addition of CLI switches.
To put it differently: an important issue this project currently has is the lack of maintainers per provider.
It feels like this should be a feature of the webhook provider, and also a source of motivation to move providers out of tree.

@mloiseleur
Contributor

cc @Raffo @szuecs @johngmyers

@DerekTBrown
Author

How would this work with webhook providers?

Ultimately, this proposal is more of a guideline for creating per-provider metrics, rather than a rigid specification. There are benefits if we all follow the guidelines (because we get a consistent experience), but there isn't a technical reason we need to do so. My expectations would be:

  • Webhook providers must follow the `external_dns_provider_{provider_name}_{metric_name}` metric format (see the sketch after this list).
  • Helm charts for external-dns must include capabilities to export/scrape metrics from all webhook containers.
  • Webhook providers ideally should export the canonical metrics (`request_count`, `request_latency`, `request_rate`).
  • Webhook providers ideally should follow the CLI flag guidelines, so that users get a consistent experience.
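
For concreteness, here is a minimal sketch of the first guideline for a webhook provider, using the Prometheus Go client. The provider name `exampledns`, the label set, and the listen port are illustrative assumptions, not part of the proposal:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Canonical name: external_dns_provider_{provider_name}_{metric_name},
// with "exampledns" standing in for a real provider name.
var requestCount = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "external_dns_provider_exampledns_request_count",
		Help: "Number of requests made to the exampledns API.",
	},
	[]string{"method", "error", "error_code"},
)

func main() {
	// Record one successful call; a real provider would wrap its SDK client.
	requestCount.WithLabelValues("ListTagsForResourceWithContext", "false", "").Inc()

	// Expose /metrics so the Helm chart can scrape the webhook container.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9162", nil))
}
```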

How would the in-tree providers be updated?

Because we lack maintainers, it is unrealistic to expect that all 30+ in-tree providers will have metrics added. My proposal is that we accept metrics PRs for any existing in-tree provider, so long as they follow the guidelines in this proposal. I am committed to adding these features for AWS, and I hope that others add metrics for providers as they see the need.

It feels like this should be a feature of the webhook provider, and also a source of motivation to move providers out of tree.

I think adding metrics and moving providers out-of-tree should be kept as distinct projects and goals:

  1. The majority of users still rely on in-tree providers. We shouldn't make metrics support (which can be a critical reliability feature) conditional on migrating to out-of-tree providers.

  2. No re-work is required. If we implement metrics for an in-tree provider and later migrate the provider out-of-tree, no additional changes or work is required; the provider already complies with the metrics proposal.

it would mean a massive addition of CLI switches.

I am open to feedback on the flag aspect of this specification; my default assumption was that people would want to customize these metrics. However, I think there are many alternatives (such as using environment variables, which might be cleaner).

Over time, I would expect the CLI options to migrate into separate webhook provider repositories, so this wouldn't be a huge change in the long-term.

- The `slidingWindowDurationSec` is the period over which the client-side aggregation finds the maximum. This window should be larger than the Prometheus scrape window, so that spikes are guaranteed to be collected by Prometheus. The default value is 60s.
- Type: gauge
- Annotations:
- `method` - The method being invoked (eg. `ListTagsForResourceWithContext`).
Contributor

Maybe `func` or `function` instead of `method`. I read `method` as HTTP methods/verbs: GET, POST, PUT.

@szuecs
Contributor

szuecs commented Jan 10, 2024

@DerekTBrown in general I like the idea. The question is whether this is only "guidance" for providers that use the webhook, or whether external-dns itself should collect the metrics.
Maybe you can add pros/cons of why this should stay in-tree. On the other hand, some of the metrics could also be collected by the webhook provider itself. We could also think of a flag like `--metrics-provider="<aws|gcp|..>"` and then use its value for the metrics label that I wrote before:
`external_dns_provider__$metric_name${"provider"="$provider_name$"}`
so `--metrics-provider="foo"` would create metrics of the kind `external_dns_provider__$metric_name${"provider"="foo"}` (see the sketch below).
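
A minimal sketch of this flag-plus-label alternative, assuming a plain Go flag and the Prometheus Go client; the actual flag wiring in external-dns would differ:

```go
package main

import (
	"flag"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical flag; external-dns uses its own flag-parsing machinery.
var metricsProvider = flag.String("metrics-provider", "", "value for the provider label, e.g. aws or gcp")

// One shared metric for all providers, partitioned by a provider label.
var requestCount = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "external_dns_provider_request_count",
		Help: "Number of provider API requests, partitioned by provider.",
	},
	[]string{"provider", "method"},
)

func main() {
	flag.Parse()
	// --metrics-provider="foo" yields
	// external_dns_provider_request_count{provider="foo",method="..."}.
	requestCount.WithLabelValues(*metricsProvider, "ListTagsForResourceWithContext").Inc()
}
```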

@szuecs
Contributor

szuecs commented Jan 10, 2024

What do you think about the proposal; would this help you too, @jbilliau-rcd?
Maybe you have input as well.

@jbilliau-rcd

We only use external-dns with AWS and don't currently collect any metrics, so there's no immediate use for me personally, but it seems helpful if and when we ever start.

So yeah... no strong opinion on this one, unlike the "crash on error" issue I got a little overzealous about :)

@szuecs
Contributor

szuecs commented Jan 11, 2024

The proposal should also have a section on how the webhook provider would do this. This should also be part of the first change.

@DerekTBrown
Author

DerekTBrown commented Jan 11, 2024

The proposal should also have a section on how the webhook provider would do this. This should also be part of the first change.

@szuecs Can you clarify what you mean here?

It looks like the webhook provider already has its own metrics scheme:

https://github.com/kubernetes-sigs/external-dns/blob/master/provider/webhook/webhook.go#L41

@szuecs
Contributor

szuecs commented Jan 12, 2024

@DerekTBrown can you check what is different or whether something is missing?
Add this to the proposal, so that there is at least awareness of it.

@DerekTBrown
Author

@DerekTBrown can you check what is different or whether something is missing? Add this to the proposal, so that there is at least awareness of it.

I added a note about this. I think we are good to land.

- Labels:
- `method` - The method being invoked (eg. `ListTagsForResourceWithContext`).
- `error` - Whether the invocation of this method resulted in an error.
- `error_code` - Error code returned from the method.
Contributor

Do we always get an `error_code`?
I think the SDKs return an error, not an error code.

Author

I will clarify in the doc: not all SDKs will return all labels, but if an SDK returns that specific label, it should be exported.
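
A sketch of that rule, using the AWS SDK for Go's `awserr.Error` as one example of an SDK that does expose an error code; other SDKs would need their own type assertions, and the helper name and empty-string fallback are assumptions of this sketch:

```go
package main

import (
	"errors"
	"fmt"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// errorLabels (hypothetical helper) derives the error/error_code label
// values from an SDK error, exporting error_code only when the SDK has one.
func errorLabels(err error) (errLabel, codeLabel string) {
	if err == nil {
		return "false", ""
	}
	var aerr awserr.Error
	if errors.As(err, &aerr) {
		// This SDK exposes an error code, so export it as a label.
		return "true", aerr.Code()
	}
	// No code available from this SDK; export only the error flag.
	return "true", ""
}

func main() {
	errLabel, codeLabel := errorLabels(awserr.New("Throttling", "rate exceeded", nil))
	fmt.Println(errLabel, codeLabel) // true Throttling
}
```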

- `method` - The method being invoked (eg. `ListTagsForResourceWithContext`).
- `error` - Whether the invocation of this method resulted in an error.
- `error_code` - Error code returned from the method.
- `external_dns_provider_request_latency_ms`
Contributor

szuecs commented Feb 13, 2024

Not sure latencies are useful for controllers.
Latency buckets create a lot of data that doesn't seem very useful, because who cares whether it's 20ms or 200ms?

- Type: Histogram
- Labels:
- `method` - The method being invoked (eg. `ListTagsForResourceWithContext`).
- `external_dns_provider_request_rate`
Contributor

In Prometheus you can query rates if you expose a counter, so I would propose having only counters and doing the rate calculation in Prometheus, like `sum(rate(external_dns_provider_requests_count{}[1m]))` (see the sketch below).
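
A minimal sketch of this counter-only approach, taking the metric name from the query above; the rate is then derived at query time rather than exported:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var requestsCount = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "external_dns_provider_requests_count",
		Help: "Total provider API requests; rates are computed in PromQL.",
	},
	[]string{"method"},
)

func main() {
	requestsCount.WithLabelValues("ListTagsForResourceWithContext").Inc()
	// Dashboard/alert side (PromQL, not Go):
	//   sum(rate(external_dns_provider_requests_count{}[1m]))
}
```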

Author

@szuecs I explained this in the documentation below. DNS providers potentially have rate-limiting windows that are more granular than the Prometheus scrape window:

Rate limits are often enforced at a more granular time resolution than Prometheus metrics are collected at. This metric pre-computes a peak request rate over a sliding window, to give users observability into bursty API requests.

For instance, suppose the provider limits you to 5 requests per second, and Prometheus scrapes only every 30s. If the Prometheus count for that 30s window is 20, we have no way of knowing whether rate-limiting is close to occurring: it's possible that we sent a peak of 1 request per second over 20 different seconds in the window, or alternatively, that we sent a peak of 20 requests in a single second.
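
A minimal sketch of the client-side aggregation described above: bucket requests per second, and export the per-second peak over the last `slidingWindowDurationSec` seconds as a gauge, so a burst is visible even with a 30s scrape interval. The types and names are illustrative assumptions, not the proposal's implementation:

```go
package main

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var peakRate = promauto.NewGauge(prometheus.GaugeOpts{
	Name: "external_dns_provider_request_rate",
	Help: "Peak requests per second over the sliding window.",
})

type slidingWindow struct {
	mu      sync.Mutex
	buckets map[int64]int // request count per one-second bucket
	size    int64         // slidingWindowDurationSec, default 60
}

// observe records one request and re-exports the window's per-second peak.
func (w *slidingWindow) observe() {
	w.mu.Lock()
	defer w.mu.Unlock()
	now := time.Now().Unix()
	w.buckets[now]++
	peak := 0
	for sec, n := range w.buckets {
		if sec <= now-w.size {
			delete(w.buckets, sec) // evict buckets outside the window
			continue
		}
		if n > peak {
			peak = n
		}
	}
	peakRate.Set(float64(peak))
}

func main() {
	w := &slidingWindow{buckets: map[int64]int{}, size: 60}
	w.observe() // a provider would call this from its request hook
}
```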

- The `slidingWindowDurationSec` is the period over which the client-side aggregation finds the maximum. This window should be larger than the Prometheus scrape window, so that spikes are guaranteed to be collected by Prometheus. The default value is 60s.
- Type: gauge
- Labels:
- `function` - The function being invoked (eg. `ListTagsForResourceWithContext`).
Contributor

I would likely use this as a label on the request counter.

Author

I am not sure I understand this.

@mloiseleur
Contributor

/assign @szuecs
