-
Notifications
You must be signed in to change notification settings - Fork 833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modification of Karpenter Grafana dashboards capacity and performance #5935
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for karpenter-docs-prod ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't pulled this down and applied it to my cluster, but I'd expect to see instances of karpenter_deprovisioning_actions_performed
replaced with karpenter_disruption_actions_performed
. But I don't see that. Can we just do a find all replace with the existing deprovisioning metrics and add the new metrics/changes in another PR?
@njtran I used metric karpenter_disruption_nodes_disrupted_total instead of metric karpenter_disruption_actions_performed_total. |
Hey @youwalther65, the number of actions taken is different from the number of nodes disrupted. Actions taken can provide information on the number of batches of disruptions. Do you mind adding the number of disruption actions back in as well? |
Hey @njtran , that's interesting, thank you. So it would look something like that: |
While disruption can be controlled at the nodepool level (TTLs, budgets, etc) the actual disruption actions are NodePool agnostic. For example, a single multi-node consolidation action could act upon nodes owned by multiple NodePools. We don't consider any action tied to a particular nodepool for that reason. |
Sorry it took me a moment to get around to pulling this down and reviewed, overall really like the changes! Some nits on the chart naming for consistency / spelling:
I was going to suggest "Nodes Disrupted" for the last one to be consistent with the other naming, but I realized that's really ambiguous since we also have "Nodes Terminated". Since these dashboards are really intended for newcomers to Karpenter, I don't think we want to rely on terminology that requires knowledge of Karpenter's node lifecycle to differentiate between. I think voluntary node disruptions hits the mark since it filters out involuntary disruptions which aren't driven by the disruption controller (e.g. spot interruption, kubectl delete node, etc). Definitely open to other ideas here though. Some other nitty feedback:
I did also notice some funky numbers for nodes eligible for disruption, turns out there's a bug in this metric that should be fixed with the following PR: kubernetes-sigs/karpenter#1187. |
@jmdeal Thank you for the feedback. I like the ideas and have incorporated all of them in a new version. Please check again, thx. |
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity. |
/unstale I'll try and take a look by the end of the day or tomorrow, I got caught up with some other tasks last week and this slipped through the cracks. Need to get the upstream metric fix in as well. |
Sorry for the delays here, two last comments and this looks good to me:
|
@jmdeal Thx, will follow up after my vacation end of May. |
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity. |
/remove-lifecycle stale |
Done with the requested changes @jmdeal |
...ent/en/docs/getting-started/getting-started-with-karpenter/karpenter-capacity-dashboard.json
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the contribution 🚀
Fixes Karpenter 0.35.2 Grafana dashboards do not work properly #5934
Description
How was this change tested?
Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.