
victims in reclaim.go don't sort by ssn.TaskOrderFn #2807

Closed
zhaizhch opened this issue Apr 20, 2023 · 8 comments · Fixed by #3389
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@zhaizhch

What happened:
The victims in reclaim.go are not sorted by ssn.TaskOrderFn.

What you expected to happen:
When a reclaim happens, lower-priority tasks should be reclaimed first.

How to reproduce it (as minimally and precisely as possible):
1. Create two queues: queue1 with weight 5 and queue2 with weight 3. (There is only one node in the cluster, and it has 8 GPU cards.)
2. In queue2, submit 5 low-priority jobs and 3 medium-priority jobs, each requiring 1 GPU card, and wait until all of them are running.
3. In queue1, submit 1 high-priority job that requires 4 GPU cards.
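
To make the numbers in these steps concrete, here is a rough Go sketch of the arithmetic. It is not Volcano's proportion plugin: it assumes a purely weight-proportional split of the 8 cards and ignores per-queue requests and capability limits. `weightedShare` and `cardsToReclaim` are hypothetical helper names used only for illustration.

```go
package main

import "fmt"

// weightedShare splits total units across queues in proportion to their weights.
// Assumption: a plain proportional split; requests and capability caps are ignored.
func weightedShare(total int, weights map[string]int) map[string]int {
	sum := 0
	for _, w := range weights {
		sum += w
	}
	shares := make(map[string]int, len(weights))
	if sum == 0 {
		return shares // no weights, nothing to distribute
	}
	for name, w := range weights {
		shares[name] = total * w / sum
	}
	return shares
}

// cardsToReclaim returns how many cards must be freed to satisfy a request,
// given the number of cards that are currently free.
func cardsToReclaim(request, free int) int {
	if request <= free {
		return 0
	}
	return request - free
}

func main() {
	shares := weightedShare(8, map[string]int{"queue1": 5, "queue2": 3})
	fmt.Println(shares["queue1"], shares["queue2"]) // prints "5 3"

	// queue2 occupies all 8 cards, so no card is free when queue1 asks for 4.
	fmt.Println(cardsToReclaim(4, 0)) // prints "4"
}
```

Under these assumptions queue1 is entitled to 5 cards and queue2 to 3. With all 8 cards in use by queue2, a 4-card request from queue1 means 4 cards have to be reclaimed from queue2.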

Expected:
4 of the low-priority jobs in queue2 are killed and their resources are reclaimed, and the high-priority job in queue1 starts running.

Actual:
4 jobs in queue2 are killed at random.
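
The expected ordering can be sketched in Go as below. This is a standalone illustration, not the code in reclaim.go: `Task`, `taskOrderFn`, and `pickVictims` are hypothetical names, and `taskOrderFn` only stands in for ssn.TaskOrderFn by assuming that a higher priority value is ordered first.

```go
package main

import (
	"fmt"
	"sort"
)

// Task is a hypothetical stand-in for a scheduler task; it is not Volcano's task type.
type Task struct {
	Name     string
	Priority int
}

// taskOrderFn reports whether l should be scheduled before r.
// Assumption: a higher priority value means "schedule first".
func taskOrderFn(l, r Task) bool { return l.Priority > r.Priority }

// pickVictims orders candidates in reverse scheduling order, so the task that
// would be scheduled last is evicted first, and returns the first n of them.
func pickVictims(candidates []Task, n int) []Task {
	sorted := append([]Task(nil), candidates...) // copy, leave the input untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		// sorted[i] is evicted before sorted[j] if sorted[j] would be scheduled first.
		return taskOrderFn(sorted[j], sorted[i])
	})
	if n > len(sorted) {
		n = len(sorted)
	}
	if n < 0 {
		n = 0
	}
	return sorted[:n]
}

func main() {
	// 5 low-priority (1) and 3 medium-priority (5) tasks, deliberately interleaved.
	running := []Task{
		{"m1", 5}, {"l1", 1}, {"m2", 5}, {"l2", 1},
		{"l3", 1}, {"m3", 5}, {"l4", 1}, {"l5", 1},
	}
	for _, v := range pickVictims(running, 4) {
		fmt.Println(v.Name, v.Priority) // prints l1, l2, l3, l4, each with priority 1
	}
}
```

With this ordering, reclaiming 4 cards only ever touches the low-priority tasks; without the sort, the victims are whatever 4 tasks happen to come first in the candidate list.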

Anything else we need to know?:
The config file looks like this:
volcano-scheduler.conf: |
  actions: "enqueue, reclaim, allocate, preempt, backfill"
  tiers:
  - plugins:
    - name: priority
    - name: gang
      enableJobStarving: false
      enableReclaimable: false
      enablePreemptable: false
    - name: conformance
    - name: sla
  - plugins:
    - name: drf
    - name: predicates
    - name: proportion
      enableJobEnqueued: false
    - name: nodeorder
    - name: binpack
      arguments:
        binpack.weight: 17
        binpack.cpu: 1
        binpack.memory: 1
        binpack.resources: nvidia.com/gpu, baidu.com/xpu, cambricon.com/mlu
        binpack.resources.nvidia.com/gpu: 5
        binpack.resources.baidu.com/xpu: 5
        binpack.resources.cambricon.com/mlu: 5

Environment:

  • Others:
    (image)
    The victims are not sorted by ssn.TaskOrderFn, as they are in preempt.go.
    (image)

@zhaizhch zhaizhch added the kind/bug Categorizes issue or PR as related to a bug. label Apr 20, 2023
@zhaizhch
Author

@william-wang

@wangyang0616
Member

/assign

@zhaizhch
Author

zhaizhch commented Jun 2, 2023

@wangyang0616

@wangyang0616
Member

I'm very sorry, things have been a little busy lately. I will fix this as soon as possible. @zhaizhch

@zhaizhch
Author

Do you have time to fix it now? @wangyang0616

@wangyang0616
Member

Sorry, a lot has been going on recently and we have not gotten around to fixing this yet. If any contributors are interested in this problem, we can fix it together.

@stale

stale bot commented Oct 15, 2023

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2023
@lowang-bh
Member

/remove lifecycle/stale

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 6, 2024
3 participants