Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Issue 2262: add priority capability for reclaim action #3340

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

lowang-bh
Copy link
Member

@lowang-bh lowang-bh commented Mar 9, 2024

Fixes #2262

  1. priority plugin support reclaimable switch and add UT about priority plugin
  2. priority plugin docs

Note: reclaimableFn usually is used in reclaim action to reclaim a queue's deserved resource when cluster has not enough resource to allocate new coming tasks in this queue. So please be carefully to set enableReclaimable to true in priority plugin, in case that a queue's resource owned by high priority jobs can not be released. And enableReclaimable is disabled by default for compatibility

@volcano-sh-bot volcano-sh-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 9, 2024
@lowang-bh lowang-bh changed the title Issue2262 Fix Issue 2262: add priority capability for reclaim action Mar 9, 2024
@lowang-bh lowang-bh force-pushed the issue2262 branch 6 times, most recently from e468089 to 0590c26 Compare March 9, 2024 10:28
@lowang-bh
Copy link
Member Author

/assign @Monokaix @william-wang

@Monokaix
Copy link
Member

If other queue's jobs all have higher priority,current queue can not reclaim their resources, reclaim will not happen, is this reasonable?

@lowang-bh
Copy link
Member Author

If other queue's jobs all have higher priority,current queue can not reclaim their resources, reclaim will not happen, is this reasonable?

We just support this feature and close it by default。 How to use it is depended on the cluster admin.

One solutions is to limit the higher priority jobs used resource not exceed queue's deserved in application layer.

@Monokaix
Copy link
Member

If other queue's jobs all have higher priority,current queue can not reclaim their resources, reclaim will not happen, is this reasonable?

We just support this feature and close it by default。 How to use it is depended on the cluster admin.

One solutions is to limit the higher priority jobs used resource not exceed queue's deserved in application layer.

But this seems places high demands on administrators and limits the job priority of the queue, we'd better add a desige doc and give some user-guide.

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign william-wang
You can assign the PR to them by writing /assign @william-wang in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 11, 2024
@lowang-bh lowang-bh closed this Mar 11, 2024
@lowang-bh lowang-bh reopened this Mar 11, 2024
@lowang-bh
Copy link
Member Author

we'd better add a desige doc and give some user-guide.

@Monokaix docs is added.
cc @wangyang0616 @hwdef please help to review. thanks.

@lowang-bh
Copy link
Member Author

/close

@volcano-sh-bot
Copy link
Contributor

@lowang-bh: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lowang-bh
Copy link
Member Author

/reopen

@volcano-sh-bot volcano-sh-bot reopened this Apr 1, 2024
@volcano-sh-bot
Copy link
Contributor

@lowang-bh: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 3, 2024
@lowang-bh
Copy link
Member Author

/assign @william-wang

@volcano-sh-bot volcano-sh-bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 13, 2024
@volcano-sh-bot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 13, 2024
@Monokaix
Copy link
Member

It conflicts with the existing allocation logic between queues. When we allocate tasks, we do not consider the priority of jobs between queues, but only consider the priority at the queue level.

@zhoushuke
Copy link

I think there may be a bug in Reclaimable in session_plugins.go, for this configmap

    tiers:
    - plugins:
      - name: priority
      - name: proportion

as reclaim enabled in priority plugin, let us image that theres some victims return from priority, but if no victims return from proportion plugin , Reclaimable function will return no victims, and reclaim not work any more, put proportion in front of priority can fix this, but it may be a bug? @hwdef

@lowang-bh
Copy link
Member Author

I think there may be a bug in Reclaimable in session_plugins.go, for this configmap

    tiers:
    - plugins:
      - name: priority
      - name: proportion

as reclaim enabled in priority plugin, let us image that theres some victims return from priority, but if no victims return from proportion plugin , Reclaimable function will return no victims, and reclaim not work any more, put proportion in front of priority can fix this, but it may be a bug? @hwdef

That is a problem of your config. You'd better to put plugins about resource in a same tier, but a tier different from gang/priority, eg:

tiers:
- plugins:
  - name: priority
    enableReclaimable: false
  - name: gang
    enablePreemptable: false
  - name: conformance
- plugins:
  - name: overcommit
  - name: drf
    enablePreemptable: false
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack

@zhoushuke
Copy link

I think there may be a bug in Reclaimable in session_plugins.go, for this configmap

    tiers:
    - plugins:
      - name: priority
      - name: proportion

as reclaim enabled in priority plugin, let us image that theres some victims return from priority, but if no victims return from proportion plugin , Reclaimable function will return no victims, and reclaim not work any more, put proportion in front of priority can fix this, but it may be a bug? @hwdef

That is a problem of your config. You'd better to put plugins about resource in a same tier, but a tier different from gang/priority, eg:

tiers:
- plugins:
  - name: priority
    enableReclaimable: false
  - name: gang
    enablePreemptable: false
  - name: conformance
- plugins:
  - name: overcommit
  - name: drf
    enablePreemptable: false
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack

ok, make sense

@zhoushuke
Copy link

zhoushuke commented Apr 26, 2024

@lowang-bh , there is another promble as following:

actions: "enqueue,allocate,backfill,preempt,reclaim"
tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
        enableJobStarving: false
        enableReclaimable: false
        enabledQueueScoreOrder: false
      - name: conformance
    - plugins:
      - name: predicates
      - name: proportion

gang enablePreemptable: false( make it true also don't work too), so it's cant do any preempt/reclaim, when a high priority job comes, if there no enough resource to meet gang constraint, I hope get resources by reclaim, but as not meet gang constraint, this job will pending, and reclaim action is just skip pending job, so reclaim not happend, no reclaim, no release resource to get job running, like deadlock, delete enqueue may work, but in my scene,enqueue is a must, and
put reclaim in front of enqueue is not good ideal.
and overcommit won't use for me.
what better soluation to fix it?

@lowang-bh
Copy link
Member Author

lowang-bh commented Apr 27, 2024

@zhoushuke There are two kind of evicts:

  1. evicts in same queue: this is for high priority job to evict low priority jobs when your queue's deserved resource is used up
  2. evicts between different queues: a queue's coming job to reclaim their queue's deserved resource from other overused queue. In this scene, the overused resources in other queues are borrowed from this queue, and resource should be give back when need.

If you have any problems about how to use volcano, please file a issue to describe it.

…y-plugin

note: set priority plugin conf: enableReclaimable default to false

Signed-off-by: lowang-bh <lhui_wang@163.com>
Signed-off-by: lowang-bh <lhui_wang@163.com>
Signed-off-by: lowang-bh <lhui_wang@163.com>
@zhoushuke
Copy link

I have some volcano practice and use it in a production to support about 30k~50k pods a day.
I think you don't get my point, anyway, I will try to realize it by my own.

@Monokaix
Copy link
Member

Monokaix commented May 8, 2024

enablePreemptable: false

So can you explain why set enablePreemptable: false of gang?

@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2024
@volcano-sh-bot
Copy link
Contributor

@lowang-bh: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

added priority capability for reclaim action
6 participants