
Support Weight in localQueue for better resource partitioning #752

Closed
3 tasks done
kerthcet opened this issue May 9, 2023 · 12 comments · May be fixed by #963
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@kerthcet
Contributor

kerthcet commented May 9, 2023

What would you like to be added:

Currently, for LocalQueues inside one ClusterQueue, we can submit workloads without limit until we hit the ClusterQueue's capacity, and a single queue with higher-priority workloads can occupy all of the resources. Since a LocalQueue represents a tenant, this is not reasonable. We should have hard restrictions on how many resources each tenant can use. Instead of defining resource requests inconveniently, we can have a weight field.

One thing to consider, for better resource utilization, is how to break through the hard restriction of weight when resources are plentiful, and how to handle preemption when resources are not enough.
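Something like this, as a rough sketch only (the field name and semantics are illustrative, not a settled design and not part of the current Kueue API):

```go
// Sketch only: a hypothetical Weight field on a simplified LocalQueue spec.
package sketch

type LocalQueueSpec struct {
	// ClusterQueue is the ClusterQueue this LocalQueue submits workloads to.
	ClusterQueue string `json:"clusterQueue"`

	// Weight is this LocalQueue's relative share of its ClusterQueue's
	// resources: a queue with weight 2 is entitled to twice the share of a
	// queue with weight 1. Absolute shares would be computed dynamically
	// from the weights of the queues that are currently active.
	// +optional
	Weight *int32 `json:"weight,omitempty"`
}
```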

Why is this needed:

Multi-tenant requirement and better resource utilization.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@kerthcet kerthcet added the kind/feature Categorizes issue or PR as related to a new feature. label May 9, 2023
@KunWuLuan
Contributor

If no one is working on this, I will try to create a KEP to track this issue and work on it.
/assign

@KunWuLuan
Contributor

Because each ClusterQueue can borrow resources from other ClusterQueues, the total amount of resources available to each queue is uncertain, so it is not easy to design this around weights alone.

I think we can use 'max and min' or 'nominal and borrowingLimit' to define the limit for a LocalQueue.
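For comparison, a rough sketch of that alternative shape (field names just mirror the ClusterQueue quota terminology for illustration, they are not an agreed API):

```go
// Alternative sketch: absolute per-LocalQueue limits instead of a relative weight.
package sketch

import "k8s.io/apimachinery/pkg/api/resource"

type LocalQueueQuota struct {
	// NominalQuota is the amount of a resource guaranteed to this LocalQueue.
	NominalQuota resource.Quantity `json:"nominalQuota"`

	// BorrowingLimit caps how much of the resource this LocalQueue may use
	// beyond its nominal quota when the ClusterQueue has spare capacity.
	// +optional
	BorrowingLimit *resource.Quantity `json:"borrowingLimit,omitempty"`
}
```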

@alculquicondor
Contributor

Ideally we should be using just weight so that new LocalQueues can be easily added without changing the configuration for other queues.

In systems that use weights, the fair-sharing calculations are always dynamic and depend on how many queues have running or pending jobs and on the total resources of the parent queue.
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
https://karthiksharma1227.medium.com/deep-dive-into-yarn-scheduler-options-cf3f29e1d20d

@kerthcet
Contributor Author

A weight is a relative value, unlike max or min, which are absolute values. So, as Aldo mentioned, fair sharing is dynamic. For example, if queueA's weight is a and queueB's weight is b, then queueA can get a/(a+b) of the resources. One concern: if a queue already holds resources and then another queue with a higher weight joins, will the original queue be preempted?
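To make the arithmetic concrete, a minimal sketch of that calculation over the queues that currently have pending or running workloads (hypothetical names, not Kueue code):

```go
// Minimal sketch of a weight-based fair-share calculation. Only queues that
// currently have pending or running workloads are passed in; empty queues are
// simply omitted, so shares are recomputed as the set of active queues changes.
package main

import "fmt"

func fairShares(weights map[string]int64) map[string]float64 {
	var total int64
	for _, w := range weights {
		total += w
	}
	shares := make(map[string]float64, len(weights))
	if total == 0 {
		return shares
	}
	for q, w := range weights {
		shares[q] = float64(w) / float64(total)
	}
	return shares
}

func main() {
	// queueA weight a=1, queueB weight b=3: queueA gets a/(a+b) = 0.25.
	fmt.Println(fairShares(map[string]int64{"queueA": 1, "queueB": 3}))
}
```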

@KunWuLuan
Contributor

If we just use weight, how does the weight work with borrowing? Should we use nominal + borrowingLimit as the total resources? Or we could calculate the proportion of resources used by each queue's admitted workloads and only pop workloads from those queues whose proportion is under their weight.
If this is not enough, we can add min and max to guarantee minimum resources for each queue; that can be a separate question.

We can achieve the goals in two steps.
In the first step, we add an interface in the Kueue scheduler to determine whether a queue has used too many resources and, if so, reject its next workload. In this way, each queue gets its desired share as long as the set of queues doesn't change.
In the second step, we add a controller in Kueue that preempts workloads when a queue has used too many resources. In this way, each queue gets its desired share even when queues change.
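A rough sketch of what the first-step check could look like (all names are hypothetical, none of this exists in Kueue today):

```go
// Hypothetical interface for step one: the scheduler consults it before
// admitting the next workload popped from a LocalQueue.
package sketch

import corev1 "k8s.io/api/core/v1"

type LocalQueueFairness interface {
	// WithinFairShare reports whether admitting a workload that requests
	// `request` would keep localQueue's usage at or below its weighted share
	// of clusterQueue's capacity. If it returns false, the scheduler skips
	// this queue and tries the next one.
	WithinFairShare(clusterQueue, localQueue string, request corev1.ResourceList) bool
}
```

The second step could then reuse the same accounting from a separate controller to pick preemption victims in queues that are over their share.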

@alculquicondor
Contributor

Or we could calculate the proportion of resources used by each queue's admitted workloads and only pop workloads from those queues whose proportion is under their weight.

This sounds like a better idea, and closer to what I've seen other systems do. Another thing to consider is that if a queue is 100% empty and has no running workloads, there is no need to consider it in the calculations.

One concern: if a queue already holds resources and then another queue with a higher weight joins, will the original queue be preempted?

That is a very important question. Maybe it should be configurable whether or not to preempt.

I think the best step at this point is to do a review of existing systems. Then we can start collaborating in a google doc or KEP PR.

@alculquicondor
Contributor

cc @mwielgus

@KunWuLuan
Contributor

@alculquicondor Hi, I have submitted a new KEP for this; we can discuss the details there.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2024
@kerthcet
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2024
@alculquicondor
Contributor

/close
in favor of #1714

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

/close
in favor of #1714

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
