
[Tracking/Action] Repair: how broken Kubernetes workloads lead to higher emissions #365

Open
1 of 4 tasks
xamebax opened this issue Apr 3, 2024 · 2 comments

Comments


xamebax commented Apr 3, 2024

(ticket is part of sustainable k8s practices project work)

Description

What is the carbon cost of leaving broken workloads to run on Kubernetes? What is the untapped potential of making sure workloads repair themselves better, or that broken workloads aren't allowed to run for a long time? Is there a good "Kubernetes hygiene" around repairing workloads that can lead to lowering a cluster's carbon cost?

Outcome

A recommendation in our working document that helps the reader decide how to repair their workloads, with an effort estimate (small, medium, large), plus optional extra reading material for readers who want more context.

To-Do

  • add relevant labels to this issue when possible,
  • research if this is a worthy recommendation,
  • if yes, write a recommendation,
  • share it for review, implement feedback.

Comments

  • Only public cloud is in scope here.
  • I'm gonna work on writing this recommendation. 🙂

@mkorbi I'd love your input on this issue description; do you feel it captures the fullness of what we talked about?

(cc @JacobValdemar)


xamebax commented May 14, 2024

Just an update that I did start working on this and should hopefully have a draft by the end of the week.


mkorbi commented May 21, 2024

It's relevant to help the reader identify broken workloads, and we have to differentiate here.
There is sprawl, i.e. workloads that got "lost" and that nobody takes care of anymore, and there are idle workloads that "misbehave".

I think for both there is a fairly easy approach: compare network traffic against resource consumption.

  • No traffic but continuous "high" consumption: something is wrong.

There are also other cases where, for example, old programming-language runtimes or a wrong configuration demand too many resources.
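The "no traffic but high consumption" heuristic above could be sketched roughly as follows. This is a minimal illustration, not part of the recommendation itself: the thresholds, metric names, and `flag_suspect_workloads` helper are all hypothetical, and in practice the per-workload numbers would come from a metrics backend such as Prometheus (e.g. container network and CPU usage rates).

```python
# Illustrative thresholds -- these are assumptions, not recommendations.
TRAFFIC_FLOOR_BPS = 1_000   # below this, treat the workload as receiving "no traffic"
CPU_CEILING_CORES = 0.5     # above this, treat consumption as "high"

def flag_suspect_workloads(samples):
    """Return names of workloads with ~no network traffic but high CPU use.

    `samples` is a list of dicts like:
      {"name": str, "net_bytes_per_s": float, "cpu_cores": float}
    """
    return [
        s["name"]
        for s in samples
        if s["net_bytes_per_s"] < TRAFFIC_FLOOR_BPS
        and s["cpu_cores"] > CPU_CEILING_CORES
    ]

if __name__ == "__main__":
    samples = [
        # Busy and serving traffic: fine.
        {"name": "api", "net_bytes_per_s": 50_000.0, "cpu_cores": 0.8},
        # Almost no traffic but burning CPU (e.g. a hot crash loop): suspect.
        {"name": "zombie", "net_bytes_per_s": 120.0, "cpu_cores": 1.9},
        # Idle but cheap: fine for this particular heuristic.
        {"name": "cron", "net_bytes_per_s": 0.0, "cpu_cores": 0.01},
    ]
    print(flag_suspect_workloads(samples))  # ['zombie']
```

Note the third case: a workload with no traffic *and* no consumption is not caught by this heuristic; that is more the "sprawl" category, which needs a different signal (e.g. ownership or last-deploy metadata).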

Projects
Status: In Progress