Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modification of Karpenter Grafana dashboards capacity and performance #5935

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

youwalther65
Copy link

@youwalther65 youwalther65 commented Mar 26, 2024

Fixes Karpenter 0.35.2 Grafana dashboards do not work properly #5934

Description

  • Capacity dashboard: removed deprecated metric "karpenter_deprovisioning_actions_performed" and provides new visualizations using metrics "karpenter_disruption_nodes_disrupted_total" and "karpenter_disruption_eligible_nodes"
  • Performance dashboard: filtered for job="karpenter" to avoid getting controllers from other scarep jobsof K8s components using controller-runtime metrics as well

How was this change tested?

  • Karpenter serviceMonitor object for kube-prometheus-stack created, which creates a scarep job "karpenter"
  • provisioned dashboards to local Grafana instance

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@youwalther65 youwalther65 requested a review from a team as a code owner March 26, 2024 15:55
Copy link

netlify bot commented Mar 26, 2024

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit f9e38e7
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/66604b731cf8ef0007f28957
😎 Deploy Preview https://deploy-preview-5935--karpenter-docs-prod.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link
Contributor

@njtran njtran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't pulled this down and applied it to my cluster, but I'd expect to see instances of karpenter_deprovisioning_actions_performed replaced with karpenter_disruption_actions_performed. But I don't see that. Can we just do a find all replace with the existing deprovisioning metrics and add the new metrics/changes in another PR?

@youwalther65
Copy link
Author

youwalther65 commented Apr 8, 2024

@njtran I used metric karpenter_disruption_nodes_disrupted_total instead of metric karpenter_disruption_actions_performed_total.
Both provide drill-down by label disruption action, method, and consolidation type, but karpenter_disruption_nodes_disrupted_total gives these details even by NodePool, which I believe is a clear advantage over karpenter_disruption_actions_performed_total.
So I feel, directly moving to this will provide better visibility for Karpenter users.

@njtran
Copy link
Contributor

njtran commented Apr 8, 2024

Hey @youwalther65, the number of actions taken is different from the number of nodes disrupted. Actions taken can provide information on the number of batches of disruptions. Do you mind adding the number of disruption actions back in as well?

@youwalther65
Copy link
Author

youwalther65 commented Apr 9, 2024

Hey @njtran , that's interesting, thank you. So it would look something like that:
Screenshot 2024-04-09 at 12 19 25
with the other Visualizations from the original dashboard still in place but selectable by nodepool as well:
Screenshot 2024-04-09 at 14 45 53
If you like that I will go ahead and put it into my PR.
I wonder why karpenter_disruption_actions_performed_total doesn't have a label NodePool. Because my current understanding is, that disruption actions are configured at the NodePoll level , so action should tie to that as well. Am I wrong with this?

@jmdeal
Copy link
Contributor

jmdeal commented Apr 12, 2024

While disruption can be controlled at the nodepool level (TTLs, budgets, etc) the actual disruption actions are NodePool agnostic. For example, a single multi-node consolidation action could act upon nodes owned by multiple NodePools. We don't consider any action tied to a particular nodepool for that reason.

@jmdeal
Copy link
Contributor

jmdeal commented Apr 16, 2024

Sorry it took me a moment to get around to pulling this down and reviewed, overall really like the changes!

Some nits on the chart naming for consistency / spelling:

  • Nodes eligable for Disruptions -> Nodes Eligible for Disruption
  • Number of disruption actions performed -> Disruption Actions Performed
  • Node Disruptions -> Voluntary Node Disruptions

I was going to suggest "Nodes Disrupted" for the last one to be consistent with the other naming, but I realized that's really ambiguous since we also have "Nodes Terminated". Since these dashboards are really intended for newcomers to Karpenter, I don't think we want to rely on terminology that requires knowledge of Karpenter's node lifecycle to differentiate between. I think voluntary node disruptions hits the mark since it filters out involuntary disruptions which aren't driven by the disruption controller (e.g. spot interruption, kubectl delete node, etc). Definitely open to other ideas here though.

Some other nitty feedback:

  • IMO nodes eligible for disruption shouldn't be stacked since the same node can be counted in multiple categories. Maybe I'm wrong but I've always read stacked graphs as the different sections being mutually exclusive.
  • Now that we have multiple charts regarding disruption here, I'm wondering if it would be good to add a dashboard filter for the disruption method? Seems like it would be a nice addition if you're interested in adding it.

I did also notice some funky numbers for nodes eligible for disruption, turns out there's a bug in this metric that should be fixed with the following PR: kubernetes-sigs/karpenter#1187.

@youwalther65
Copy link
Author

@jmdeal Thank you for the feedback. I like the ideas and have incorporated all of them in a new version. Please check again, thx.

Copy link
Contributor

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@jmdeal
Copy link
Contributor

jmdeal commented Apr 30, 2024

/unstale I'll try and take a look by the end of the day or tomorrow, I got caught up with some other tasks last week and this slipped through the cracks. Need to get the upstream metric fix in as well.

@jmdeal
Copy link
Contributor

jmdeal commented May 7, 2024

Sorry for the delays here, two last comments and this looks good to me:

  • Looks like there's still a typo in "Nodes Eligibale for Disruption" (should be Eligible)
  • We should probably copy these dashboards into the v0.35, v0.36, and preview doc folders as well. Shouldn't need any additional changes there, just copies.

@youwalther65
Copy link
Author

@jmdeal Thx, will follow up after my vacation end of May.

Copy link
Contributor

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@jmdeal
Copy link
Contributor

jmdeal commented May 22, 2024

/remove-lifecycle stale

@youwalther65
Copy link
Author

Done with the requested changes @jmdeal

Copy link
Contributor

@jmdeal jmdeal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the contribution 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Karpenter 0.35.2 Grafana dashboards do not work properly
3 participants