feat: per-instance type VM memory overhead percentages #6146

Open · wants to merge 1 commit into base: main

Conversation


@pH14 commented May 3, 2024

A bandaid for #5161, without trying to solve the problem globally for all EC2 instance types.

Description

As originally discussed in kubernetes-sigs/karpenter#716, the global vmMemoryOverheadPercent option is limiting when running on a heterogeneous set of nodes. For instance, a small EC2 instance may require ~7% overhead, while larger nodes may only require 1-2%. Undersizing this value risks provisioning nodes that cannot actually run the pods they are intended for, and oversizing is inefficient and expensive.
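To make the tradeoff concrete (illustrative numbers, not measurements from this change): Karpenter withholds roughly capacity × vmMemoryOverheadPercent from an instance's advertised memory when estimating allocatable capacity. With a global setting of 7%, a hypothetical 8 GiB instance has ~0.56 GiB withheld, which may be about right, while a 256 GiB instance has ~17.9 GiB withheld even though its real overhead may be only 1-2% (~2.5-5 GiB). Lowering the global value to 2% would fit the large instance but understate the small one's overhead and risk provisioning nodes that come up short.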

This PR is a strawman proposal for solving this problem: allow Karpenter to accept a mapping of instance type to memory overhead percentage, falling back to the global vmMemoryOverheadPercent when no specific mapping exists. For our deployment this would let us create a mapping for the instance types we care about, and it potentially sets up the suggestion in #5161 of "We could launch instance types, check their capacity and diff the reported capacity from the actual capacity in a generated file that can be shipped with Karpenter on release so we always have accurate measurements on the instance type overhead".
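For readers skimming the diff, here is a minimal sketch of the lookup-with-fallback behavior described above. The type and field names are illustrative assumptions, not the actual settings surface added by this commit:

```go
package main

import "fmt"

// Settings loosely mirrors the knobs discussed in this PR; the field names
// are hypothetical, not the real Karpenter API surface.
type Settings struct {
	// Global default, matching today's vmMemoryOverheadPercent behavior.
	VMMemoryOverheadPercent float64
	// Proposed per-instance-type overrides.
	VMMemoryOverheadPercentByInstanceType map[string]float64
}

// overheadPercentFor returns the per-type override when one exists,
// otherwise the global percentage.
func (s Settings) overheadPercentFor(instanceType string) float64 {
	if pct, ok := s.VMMemoryOverheadPercentByInstanceType[instanceType]; ok {
		return pct
	}
	return s.VMMemoryOverheadPercent
}

func main() {
	s := Settings{
		VMMemoryOverheadPercent: 0.075,
		VMMemoryOverheadPercentByInstanceType: map[string]float64{
			"t3.micro":    0.07,
			"m5.24xlarge": 0.015,
		},
	}
	fmt.Println(s.overheadPercentFor("m5.24xlarge")) // 0.015: explicit override
	fmt.Println(s.overheadPercentFor("c5.large"))    // 0.075: global fallback
}
```

Any instance type without an explicit entry keeps today's behavior, so the override map is purely additive.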

How was this change tested?

It hasn't been yet! Before going further on tests, I wanted to gather input on whether this change might be accepted at all.

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@pH14 requested a review from a team as a code owner May 3, 2024 20:31
@pH14 requested a review from jonathan-innis May 3, 2024 20:31

netlify bot commented May 3, 2024

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit a578fa2
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/6635499ab9d90300080c7ffe
😎 Deploy Preview https://deploy-preview-6146--karpenter-docs-prod.netlify.app

@engedaam
Contributor

@pH14 Thank you for the contribution! I have a few questions and suggestions on the next steps you can take. The team mainly hopes to have an RFC for long-standing problems such as this one. Within the RFC, I hope you can address questions such as: What are all the options for solving this problem? What would be the best short- and long-term solutions? What is preventing us from implementing the long-term solution? Why should Karpenter have a bandaid solution for the short term? Do we want to expose API surface to customers if we don't intend to keep it long term? The team is heading toward v1 of the project, and we are thinking critically about the API surface we want to expose to users in order to avoid introducing a breaking change.

The team is definitely open to accepting contributions for VM memory overhead percentages, but we just want to make sure we have evaluated all the possible solutions. Here are examples of RFCs we have used in the past: https://github.com/kubernetes-sigs/karpenter/tree/main/designs and https://github.com/aws/karpenter-provider-aws/tree/main/designs. I also recommend coming to the working group meetings to discuss your ideas for possible solutions. Here is where you can find the schedule for the meetings: https://karpenter.sh/docs/contributing/working-group/.


This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.
