ensure memory requests/limits are reasonable #32175

Open. upodroid wants to merge 1 commit into base: master

upodroid (Member) commented Mar 6, 2024

Required for kubernetes/k8s.io#6525

/cc @BenTheElder @dims @ameukam

Some jobs had an outrageous amount of memory configured on them. I tweaked those down to sane values.

We want to make sure that jobs don't request more than 30Gi of memory so they can fit on the modern 8core 32gb VMs.

Before:

$ grep -rh memory: config/ | sed -e 's/^[ \t]*//' | tr -d "'\"" | sort | uniq -c
   1 #             memory: 1Gi
   4 memory: 1.2Gi
   1 memory: 100Mi
 154 memory: 10Gi
   8 memory: 1288490188800m
  42 memory: 12Gi
 118 memory: 14Gi
  20 memory: 15Gi
 113 memory: 16Gi
  28 memory: 1Gi
   6 memory: 2000Mi
  12 memory: 20Gi
  14 memory: 24Gi
   2 memory: 2500Mi
   6 memory: 256Mi
 124 memory: 2Gi
 164 memory: 32Gi
  10 memory: 34Gi
  34 memory: 36Gi
   2 memory: 39Gi
  70 memory: 3Gi
  12 memory: 40Gi
   2 memory: 41Gi
   2 memory: 48Gi
1055 memory: 4Gi
   1 memory: 50Mi
  20 memory: 512Mi
  10 memory: 64Gi
2136 memory: 6Gi
   6 memory: 8000Mi
 160 memory: 8Gi
 201 memory: 9000Mi
 631 memory: 9Gi
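
The counts above mix units (`Gi`, `Mi`, and milli-byte values such as `1288490188800m`), so comparing them against the 30Gi target by eye is error-prone. A minimal Python sketch of that check follows. The helper names `parse_memory` and `over_limit` are hypothetical and not part of this PR, and only the common suffixes are handled.

```python
from decimal import Decimal

# Multipliers (in bytes) for common Kubernetes quantity suffixes.
# This is a simplified subset, not the full quantity grammar.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "m": Decimal("0.001"),
}


def parse_memory(quantity: str) -> Decimal:
    """Convert a memory quantity string such as '9000Mi' to bytes."""
    q = quantity.strip().strip("'\"")
    # Check two-character suffixes before one-character ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if q.endswith(suffix):
            return Decimal(q[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(q)  # no suffix: plain bytes


# The 30Gi ceiling comes from the PR description.
LIMIT = parse_memory("30Gi")


def over_limit(values):
    """Return the quantities that exceed the 30Gi ceiling."""
    return [v for v in values if parse_memory(v) > LIMIT]


sample = ["1.2Gi", "1288490188800m", "9000Mi", "30Gi", "32Gi", "64Gi"]
print(over_limit(sample))
```

In this sketch `1.2Gi` and `1288490188800m` parse to the same number of bytes, which suggests those two rows in the listing describe the same value written in two ways.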

k8s-ci-robot added labels Mar 6, 2024: cncf-cla: yes, size/L, approved, area/config, area/jobs, area/release-eng, sig/cluster-lifecycle, sig/k8s-infra, sig/release, sig/scalability, sig/testing

dims (Member) commented Mar 6, 2024

/approve
/lgtm
/hold until you need to land this!

k8s-ci-robot added the do-not-merge/hold label Mar 6, 2024
k8s-ci-robot added the lgtm label Mar 6, 2024

ameukam (Member) commented Mar 6, 2024

FYI @kubernetes/release-engineering

cpanato (Member) left a comment

/lgtm

k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cpanato, dims, upodroid

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

upodroid (Member, Author) commented Mar 6, 2024

/hold cancel

k8s-ci-robot removed the do-not-merge/hold label Mar 6, 2024

BenTheElder (Member)
/hold

k8s-ci-robot added the do-not-merge/hold label Mar 6, 2024

BenTheElder (Member)
> Some jobs had an outrageous amount of memory configured on them. I tweaked those down to sane values.

Outrageous because they are large, or because we know they have no reason to request that much?

Some of these are set very large to help ensure they're max QoS (scalability), but some jobs actually do use a LOT of RAM.

I would like to make sure we actually check these before merging.

> We want to make sure that jobs don't request more than 30Gi of memory so they can fit on the modern 8core 32gb VMs.

This isn't the only option on the table, though: n2-highmem is also available.
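
On the "max QoS" point: Kubernetes assigns a pod the Guaranteed class only when every container sets CPU and memory limits and its requests equal those limits. A rough Python sketch of that classification follows, assuming this is the mechanism the comment refers to. The `qos_class` helper is hypothetical, and it compares quantities as written, so `"16Gi"` and `"16384Mi"` would count as different.

```python
def qos_class(containers):
    """Classify a pod from a list of container resource dicts.

    Each item looks like {"requests": {...}, "limits": {...}}.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Limits only: requests default to the limits, so the pod is Guaranteed.
print(qos_class([{"limits": {"cpu": "5", "memory": "16Gi"}}]))
# Requests only: no limits, so the pod is Burstable.
print(qos_class([{"requests": {"cpu": "6", "memory": "16Gi"}}]))
```

Under this rule, lowering a memory value only keeps a job in the Guaranteed class if its request and limit are lowered together.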

@@ -23,10 +23,10 @@ presubmits:
       resources:
         limits:
           cpu: 5
-          memory: 32Gi
+          memory: 16Gi

Member

This seems a bit low, considering that even the 32 GB / 8-core machine has 4 GB per core and typecheck is in fact memory intensive. We should not aim to use 100% of system memory, but we can do, say, 18Gi.

We're usually CPU-bound for autoscaling anyhow (i.e., no jobs request a memory-to-core ratio in excess of the host's).

Member

Same for other jobs like this one.

Member

Actually, given we won't use all cores, we can go 1:1 with the host ratio, which would be 20Gi.
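
The 20Gi figure follows from the host shape mentioned in this thread: 32 GB across 8 cores is 4 GB per core, and the job asks for 5 CPUs. A small Python sketch of that arithmetic, where `memory_at_host_ratio` is a hypothetical helper and the host defaults are taken from the comments above:

```python
def memory_at_host_ratio(cpu_request, host_cores=8, host_memory_gi=32):
    """Memory (in Gi) that matches the host's memory-per-core ratio."""
    return cpu_request * host_memory_gi / host_cores


# The typecheck job in the hunk above requests 5 CPUs.
print(memory_at_host_ratio(5))  # 20.0
```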

@@ -51,14 +51,14 @@ periodics:
       resources:
         requests:
           cpu: 6
-          memory: "39Gi"
+          memory: "16Gi"

Member

Scale jobs are over-provisioned to guarantee they're not eviction candidates, because evicting them costs us a ton of wasted external resources. We could use a priority class instead, but this works fine for our purposes.

And again, isn't this well below the memory-to-CPU ratio of the target hosts? We're using 50% of memory but 75% of cores.
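
The 50% and 75% figures can be reproduced from the proposed request (6 CPUs, 16Gi) and the 8-core, 32 GB host discussed above. The Python sketch below uses hypothetical helpers `host_share` and `pods_per_node`, and ignores the capacity a node reserves for system daemons, so real allocatable numbers would be somewhat lower.

```python
def host_share(cpu, memory_gi, host_cores=8, host_memory_gi=32):
    """Fraction of the host's CPU and memory that one pod requests."""
    return cpu / host_cores, memory_gi / host_memory_gi


def pods_per_node(cpu, memory_gi, host_cores=8, host_memory_gi=32):
    """How many such pods fit on one host, and which resource limits that."""
    by_cpu = host_cores // cpu
    by_memory = host_memory_gi // memory_gi
    limiting = "cpu" if by_cpu <= by_memory else "memory"
    return min(by_cpu, by_memory), limiting


print(host_share(6, 16))     # (0.75, 0.5)
print(pods_per_node(6, 16))  # (1, 'cpu')
```

With these numbers CPU is already the limiting resource, so under the same assumptions a larger memory request would not change how many of these pods fit on a node.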

Member Author

I'll apply the standard 7 CPU and 30Gi so the job runs on a node of its own.

Also, if the scheduler can't find a free node, we should be spinning up new nodes rather than evicting pods.

Member

Preemption happened to a job I was checking yesterday. Scale jobs are some of the few where even rare preemption is realllly expensive both in wasted compute and in additional time to get signal.

k8s-ci-robot added the needs-rebase label Mar 11, 2024

k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -92,10 +92,10 @@ presubmits:
       resources:
         requests:
           cpu: "10"
-          memory: "40Gi"
+          memory: "24Gi"

Contributor

cc @tenzen-y @alculquicondor for kueue.

Member

Thanks for the heads-up. We reduced the limits in later versions. Changing it for the older version SGTM.

Member

LGTM
Thanks!

k8s-ci-robot (Contributor)

@upodroid: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-test-infra-misc-image-build-test | 74020c3 | link | true | /test pull-test-infra-misc-image-build-test |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Labels: approved, area/config, area/jobs, area/release-eng, cncf-cla: yes, do-not-merge/hold, lgtm, needs-rebase, sig/cluster-lifecycle, sig/k8s-infra, sig/release, sig/scalability, sig/testing, size/L

9 participants