
KEP: Add local queue weight #963

Open
wants to merge 6 commits into main

Conversation

KunWuLuan
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add a new plugin to enable local queue sorting and filtering

Which issue(s) this PR fixes:

Fixes #752

Special notes for your reviewer:

This PR is also related to #10.

Does this PR introduce a user-facing change?


@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 10, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: KunWuLuan
Once this PR has been reviewed and has the lgtm label, please assign denkensk for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify

netlify bot commented Jul 10, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 72fe311
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/64c0943f6660c500078a8bc4

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 10, 2023
@k8s-ci-robot
Contributor

Hi @KunWuLuan. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 10, 2023
@alculquicondor
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 10, 2023
@alculquicondor
Contributor

cc @mwielgus


Currently, for queues inside one CQ, we can submit as many workloads as we want until we hit the CQ's capacity, and one queue can occupy all of the resources with higher-priority workloads. A limit for a local queue would be helpful for resource partitioning.
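
For concreteness, a minimal sketch of what a weight field could look like on the LocalQueue spec. The `Weight` field, its type, and its default are assumptions for discussion, not a settled API:

```go
// Hypothetical API sketch only: a weight field on LocalQueue's spec.
// The field name, its type, and the defaulting rule are assumptions
// for discussion, not the API proposed or merged by this KEP.
package v1beta1

// LocalQueueSpec is a trimmed-down stand-in for Kueue's LocalQueueSpec.
type LocalQueueSpec struct {
	// ClusterQueue names the ClusterQueue backing this LocalQueue.
	ClusterQueue string `json:"clusterQueue,omitempty"`

	// Weight is this LocalQueue's relative share of its ClusterQueue's
	// capacity when competing with sibling LocalQueues: a queue with
	// weight 2 is entitled to twice the share of a queue with weight 1.
	// Assumed to default to 1 when unset.
	Weight *int32 `json:"weight,omitempty"`
}
```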
Contributor

IMHO, there are two options here:

  • stay with the 2-level hierarchy that Kueue offers today - have each local queue point to a different cluster queue with some resources, and allow some borrowing/eviction between queues via cohorts.
  • explicitly introduce a multi-level hierarchy.

This KEP is somewhere between the two: it introduces a 3-level hierarchy, but with a different functionality set on level 1 (local queues in a cluster queue) and level 2 (cluster queues in a cohort).

Contributor

To Marcin's point, one thing to consider is that LocalQueues are namespaced. As such, administrators could only define weights between users that are in the same namespace.

This could be quite limiting for some organizations that require a namespace for each end user.

Contributor Author

Sorry, maybe I didn't totally understand you. Do you mean we should consider adding weight to cluster queues instead of local queues?

Contributor Author

I think weight in a local queue is a resource consumption policy that controls whose priority is higher when scheduling; the resources in the queue will not be further segmented. We would still stay with the 2-level hierarchy that Kueue offers today.
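
As an illustration of that policy, here is a minimal sketch of weight-aware ordering inside one CQ; the types and fields are hypothetical stand-ins, not Kueue's real queueing interfaces. The idea is that workloads from the queue with the smallest usage-to-weight ratio pop first:

```go
// Illustrative sketch only: ordering pending workloads by their
// LocalQueue's usage-to-weight ratio. All types here are hypothetical
// stand-ins, not Kueue's real plugin interfaces.
package main

import (
	"fmt"
	"sort"
)

type pendingWorkload struct {
	name        string
	queueUsage  float64 // resources already admitted for this workload's LocalQueue
	queueWeight float64 // the LocalQueue's weight, assumed > 0
}

// less prefers the workload whose LocalQueue has consumed the smallest
// share relative to its weight, so higher-weight queues are admitted
// more often without hard-partitioning the ClusterQueue's resources.
func less(a, b pendingWorkload) bool {
	return a.queueUsage/a.queueWeight < b.queueUsage/b.queueWeight
}

func main() {
	pending := []pendingWorkload{
		{name: "team-a-job", queueUsage: 8, queueWeight: 1},
		{name: "team-b-job", queueUsage: 8, queueWeight: 2}, // same usage, double weight
	}
	sort.Slice(pending, func(i, j int) bool { return less(pending[i], pending[j]) })
	fmt.Println(pending[0].name) // team-b-job pops first
}
```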

Contributor

Today, fair sharing and borrowing semantics happen for CQs within a cohort.

By adding weight to LQs, we are adding fair sharing at that level, implicitly adding one more level of hierarchy. The question is: do we want to do that, or should we rather implement arbitrary hierarchies, where CQs can belong to other CQs, or Cohort is a CRD and cohorts can belong to cohorts?

I'm not saying that one option or the other is better. I'm saying that we need to ponder all the options. Kueue APIs offer backwards compatibility, meaning that if we add an API, we need to maintain it forever or offer a deprecation process.

> To Marcin's point, one thing to consider is that LocalQueues are namespaced. As such, administrators could only define weights between users that are in the same namespace.
> This could be quite limiting for some organizations that require a namespace for each end user.

Never mind about that comment. I realize it's wrong, as multiple namespaces can point to the same CQ and we could respect weight among them.

Contributor Author

Thanks for explaining!
In my cloud scenario, my customers create a quota node for each department and a leaf quota node for each user, and each user has their own namespace to submit workloads. Unused resources are shared among users via min and max, like elastic quota in scheduler-plugins.
So I think a 3-level tree is enough for these cases.

I have three ideas:

  1. Add a CRD to define weights for local queues in a certain namespace, and allow users to create and modify that resource. This may meet Abdullah's requirement, but it would still add one more level of hierarchy.

  2. Allow the admin to define a parent cluster queue in a cluster queue's spec. A cohort could only be set for cluster queues that have the same parent, and the sum of the children CQs' nominal resources must not exceed the parent CQ's nominal resources (see the sketch below). In this case, the functionality in this KEP converts into multi-level CQs. But because users cannot create or modify cluster queues, they would not be able to manage the resource consumption of their own workloads.

  3. Combine 1 and 2.
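
A minimal sketch of the quota invariant in idea 2, using hypothetical stand-in types rather than Kueue's API:

```go
// Illustrative sketch of the invariant in idea 2: for every resource,
// the summed nominal quota of the child ClusterQueues must not exceed
// the parent's. The types are hypothetical, not Kueue's API.
package main

import "fmt"

type clusterQueue struct {
	name     string
	nominal  map[string]int64 // resource name -> nominal quota (e.g. milli-CPU)
	children []*clusterQueue
}

// validate returns an error if, for any resource, the children's summed
// nominal quota exceeds the parent's nominal quota.
func validate(parent *clusterQueue) error {
	sums := map[string]int64{}
	for _, c := range parent.children {
		for res, q := range c.nominal {
			sums[res] += q
		}
	}
	for res, sum := range sums {
		if sum > parent.nominal[res] {
			return fmt.Errorf("children of %s request %d %s, parent only has %d",
				parent.name, sum, res, parent.nominal[res])
		}
	}
	return nil
}

func main() {
	dept := &clusterQueue{
		name:    "dept-a",
		nominal: map[string]int64{"cpu": 10000},
		children: []*clusterQueue{
			{name: "user-1", nominal: map[string]int64{"cpu": 6000}},
			{name: "user-2", nominal: map[string]int64{"cpu": 6000}},
		},
	}
	fmt.Println(validate(dept)) // children request 12000 cpu, parent only has 10000
}
```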

HDYT?

Contributor Author

@alculquicondor @ahg-g Hi, if you have time, please take a look and let me know your opinions. I will continue working on this issue. Thanks very much! :D

Contributor

I think the better approach would be something along the lines of 2, which is also similar to #79.

And here we have two high-level alternatives:
a. Make a cohort a CRD. Then you could make cohorts be part of other cohorts. In this case, cohorts themselves don't have quota (like today), but they could still have weights that only have semantics around borrowing (see the sketch below).
b. Make CQs either belong to a cohort (today's mode) or to another CQ. In this case, the child CQ cannot have limits, but can only have weights.
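
For reference, a minimal sketch of what alternative (a) could look like; the `Cohort` kind and its `Parent` and `BorrowingWeight` fields are all assumptions, not a merged Kueue API:

```go
// Illustrative sketch of alternative (a): Cohort as a CRD that can nest
// under another Cohort and carry a borrowing weight. Every name and
// field here is a hypothetical assumption, not a merged Kueue API.
package v1beta1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Cohort groups ClusterQueues, and optionally child Cohorts, for
// borrowing purposes.
type Cohort struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec CohortSpec `json:"spec,omitempty"`
}

type CohortSpec struct {
	// Parent optionally nests this cohort under another cohort,
	// allowing an arbitrary-depth hierarchy.
	Parent string `json:"parent,omitempty"`

	// BorrowingWeight is this cohort's relative entitlement when
	// borrowing unused capacity within its parent; cohorts themselves
	// hold no quota, matching today's semantics.
	BorrowingWeight *int32 `json:"borrowingWeight,omitempty"`
}
```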

It might be useful to look into how systems like YARN or Slurm do multi-level hierarchy.

Question about your use-case: why not create a cohort per department and one CQ per user?

Contributor

Also #62 is related.

Contributor Author

OK, thanks. If there is any progress, I will add it to the KEP.

@KunWuLuan KunWuLuan changed the title Kep/local queue weight KEP: add local queue weight Jul 26, 2023
@KunWuLuan
Contributor Author

Hi, I have added some more designs to reach the goals. Please take a look if you have time. Thank you very much! @alculquicondor 👍

@KunWuLuan KunWuLuan changed the title KEP: add local queue weight KEP: Add local queue weight Sep 4, 2023
@KunWuLuan
Contributor Author

I think this can be considered after #1093.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 20, 2024
@mwielgus
Contributor

This will be largely handled by #1531 and #1773.

@k8s-ci-robot
Contributor

@KunWuLuan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-kueue-verify-main | 72fe311 | link | true | /test pull-kueue-verify-main
pull-kueue-test-multikueue-e2e-main-1-29 | 72fe311 | link | true | /test pull-kueue-test-multikueue-e2e-main-1-29

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 18, 2024