Add metrics documentation #339

Open
saad-ali opened this issue Jun 19, 2020 · 13 comments
Labels
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@saad-ali
Member

I need to add documentation to https://kubernetes-csi.github.io/docs/sidecar-containers.html

Background:

A new CSI Metrics Library was added to csi-lib-utils in and is part of the v0.7.0 release. This library can be used to automatically generate Prometheus metrics for all CSI operations, including total count, error count, and call latency. This library was integrated into the following CSI sidecar containers:

New flags “--metrics-address” and “--metrics-path” are now part of all 4 of those sidecars. Driver deployments should set those flags to ensure the metrics are emitted.

@pohly
Collaborator

pohly commented Jun 19, 2020

It would be good to have a short example of how those metrics can be used. Not sure whether that belongs in that documentation (which is probably more reference-oriented) or in a blog post.
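A minimal sketch of what such a usage example might look like, assuming the sidecar exposes its metrics in the Prometheus text format. The metric name prefix `csi_sidecar_operations_seconds` appears elsewhere in this thread; the label names (`driver_name`, `method_name`, `grpc_status_code`), the driver name `example.csi`, and all sample values are assumptions made up for illustration. It derives the total count, error count, and summed latency per CSI method from one scrape:

```python
import re
from collections import defaultdict

# Sample scrape output. Label names and values are illustrative assumptions,
# not captured from a real sidecar.
SAMPLE = """\
# TYPE csi_sidecar_operations_seconds histogram
csi_sidecar_operations_seconds_sum{driver_name="example.csi",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume"} 1.5
csi_sidecar_operations_seconds_count{driver_name="example.csi",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume"} 3
csi_sidecar_operations_seconds_sum{driver_name="example.csi",grpc_status_code="Internal",method_name="/csi.v1.Controller/CreateVolume"} 0.5
csi_sidecar_operations_seconds_count{driver_name="example.csi",grpc_status_code="Internal",method_name="/csi.v1.Controller/CreateVolume"} 1
"""

# One labelled sample per line: name{labels} value
LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Return a list of (metric name, labels dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # this sketch ignores samples without labels
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL.findall(raw_labels)), float(value)))
    return samples


def summarize(samples, prefix="csi_sidecar_operations_seconds"):
    """Aggregate call count, error count, and total seconds per method."""
    stats = defaultdict(lambda: {"total": 0.0, "errors": 0.0, "seconds": 0.0})
    for name, labels, value in samples:
        method = labels.get("method_name", "unknown")
        if name == prefix + "_count":
            stats[method]["total"] += value
            if labels.get("grpc_status_code") != "OK":
                stats[method]["errors"] += value
        elif name == prefix + "_sum":
            stats[method]["seconds"] += value
    return dict(stats)


if __name__ == "__main__":
    for method, s in summarize(parse(SAMPLE)).items():
        print(method, s, "mean latency:", s["seconds"] / s["total"])
```

In a real deployment the same questions would more likely be answered with Prometheus queries than with a script; the point here is only to show what information the three metric types carry.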

@pohly
Collaborator

pohly commented Jun 19, 2020

For a full example, integration with Prometheus and a Grafana dashboard would be useful. While investigating this, I found: https://github.com/helm/charts/tree/master/stable/prometheus#scraping-pod-metrics-via-annotations

But that only works for a single metrics endpoint per pod. When external-provisioner, external-attacher, external-snapshotter and external-resizer all run in the same StatefulSet, and thus in the same pod, it won't be that easy, right?

@pohly
Collaborator

pohly commented Jun 19, 2020

See prometheus/prometheus#3756

@pohly
Collaborator

pohly commented Jun 19, 2020

CSI calls issued by kubelet are not exported yet?

@pohly
Collaborator

pohly commented Jun 19, 2020

Would it make sense for CSI drivers to export the same function count metric?

The code in https://github.com/saad-ali/csi-lib-utils/blob/e9a22428988a90ba8d833b5e235fcd22d16cd5fa/metrics/metrics.go currently doesn't support that:

  • only has an interceptor for the gRPC client, but not the server
  • hard-codes "csi_sidecar" as subsystem

The subsystem string then appears in metrics names like csi_sidecar_operations_seconds_count.

I could imagine that correlating those different counts may be useful, for example to detect when calls have problems at the transport level and don't reach the CSI driver.
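The correlation idea can be sketched as follows, under stated assumptions: the `csi_sidecar` subsystem and the `csi_sidecar_operations_seconds_count` name come from this thread, while the `csi_driver` subsystem, the server-side counter it implies, and all numbers are hypothetical (the library does not provide a server interceptor, as noted above). The sketch compares per-method call counts seen by the sidecar (gRPC client) with counts seen by the driver (gRPC server) and reports calls that never arrived:

```python
def count_metric_name(subsystem):
    """Build the call-count metric name for a given subsystem string."""
    return f"{subsystem}_operations_seconds_count"


def unreached_calls(client_counts, server_counts):
    """Return, per method, how many client calls the server never counted."""
    gaps = {}
    for method, sent in client_counts.items():
        received = server_counts.get(method, 0)
        if sent > received:
            gaps[method] = sent - received
    return gaps


# Hypothetical scraped values, keyed by CSI method name.
SIDECAR_COUNTS = {"CreateVolume": 10, "DeleteVolume": 4}
DRIVER_COUNTS = {"CreateVolume": 8, "DeleteVolume": 4}

if __name__ == "__main__":
    print("client metric:", count_metric_name("csi_sidecar"))
    print("server metric (hypothetical):", count_metric_name("csi_driver"))
    print("calls lost in transport:", unreached_calls(SIDECAR_COUNTS, DRIVER_COUNTS))
```

A non-empty result would point at the transport between sidecar and driver rather than at the driver itself, which is the kind of signal described above.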

@pohly
Collaborator

pohly commented Jun 19, 2020

After having read through the config documentation I believe I understand enough of it to replace or extend the example configuration such that it scrapes each sidecar container individually.

But then the problem remains that admins will have to add that to their Prometheus configuration. I don't see an easy way to do that when deploying through helm. If I understand it right, one can replace the entire default config, but not add to it.
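To make "scrapes each sidecar container individually" concrete, here is a rough model in plain Python of the selection logic such a scrape configuration would need: one target per container port that is marked as a metrics port. This is not Prometheus configuration. The `-metrics` port-name suffix, the port numbers, the pod IP, and the `/metrics` path are all assumptions chosen for illustration:

```python
def scrape_targets(pod_ip, containers, suffix="-metrics", path="/metrics"):
    """Return one (container name, URL) pair per port named '*<suffix>'."""
    targets = []
    for container in containers:
        for port in container.get("ports", []):
            if port.get("name", "").endswith(suffix):
                url = f"http://{pod_ip}:{port['containerPort']}{path}"
                targets.append((container["name"], url))
    return targets


# Hypothetical pod with four sidecars and the driver itself.
CONTAINERS = [
    {"name": "external-provisioner",
     "ports": [{"name": "prov-metrics", "containerPort": 8080}]},
    {"name": "external-attacher",
     "ports": [{"name": "attach-metrics", "containerPort": 8081}]},
    {"name": "external-snapshotter",
     "ports": [{"name": "snap-metrics", "containerPort": 8082}]},
    {"name": "external-resizer",
     "ports": [{"name": "resize-metrics", "containerPort": 8083}]},
    {"name": "driver",
     "ports": [{"name": "healthz", "containerPort": 9898}]},
]

if __name__ == "__main__":
    for name, url in scrape_targets("10.0.0.5", CONTAINERS):
        print(name, url)
```

The annotation-based approach linked earlier yields a single endpoint per pod; a per-port selection like this yields four for the pod above and skips the driver's non-metrics port.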

@pohly
Collaborator

pohly commented Jun 22, 2020

> If I understand it right, one can replace the entire default config, but not add to it.

That turned out to be wrong. There is some limited support for extending the default configuration.

I found a solution with an additional, generic scrape config and filed helm/charts#22899 to figure out whether that is something that should be supported by the Helm chart out-of-the-box.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 20, 2020

@pohly
Collaborator

pohly commented Sep 21, 2020

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Sep 21, 2020

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 20, 2020

@pohly
Collaborator

pohly commented Dec 20, 2020

/remove-lifecycle stale
/lifecycle frozen

@msau42
Collaborator

msau42 commented Aug 5, 2022

/help

@k8s-ci-robot
Contributor

@msau42:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the help wanted label on Aug 5, 2022

5 participants