Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

put back the queue to priority queue after job's resource allocating … #3413

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

panoswoo
Copy link
Contributor

Resolves #3407

Put back the queue to priority queue after job's resource allocating finished to ensure that the priority of the queue is calculated based on the latest resource allocation situation.

@volcano-sh-bot volcano-sh-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 15, 2024
@panoswoo
Copy link
Contributor Author

/retest

@volcano-sh-bot
Copy link
Contributor

@panoswoo: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lowang-bh
Copy link
Member

Hi, please add some ut to cover it. Thanks

@volcano-sh-bot volcano-sh-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 15, 2024
@panoswoo
Copy link
Contributor Author

Hi, please add some ut to cover it. Thanks

Done, but i haven't found a simple way to check the allocate order of tasks, so I made some extensions to TestCommonStruct to enable it to record the actual allocation order. This change is compatible with previous cases.
Looking forward to your suggestions

@lowang-bh
Copy link
Member

lowang-bh commented Apr 15, 2024

Hi, please add some ut to cover it. Thanks

Done, but i haven't found a simple way to check the allocate order of tasks, so I made some extensions to TestCommonStruct to enable it to record the actual allocation order. This change is compatible with previous cases. Looking forward to your suggestions

Don't need to do that. Just let the whole resource only enough for allocating first few jobs and check the left jobs are pending.

@panoswoo
Copy link
Contributor Author

Hi, please add some ut to cover it. Thanks

Done, but i haven't found a simple way to check the allocate order of tasks, so I made some extensions to TestCommonStruct to enable it to record the actual allocation order. This change is compatible with previous cases. Looking forward to your suggestions

Don't need to do that. Just let the whole resource only enough for allocating first few jobs and check the left jobs are pending.

I have considered doing this before, but due to the same weight for both queues, when I try to fill up all resources with tasks, they will be rejected by the proportion plugin during Overused and Allocable detection. (because we only have two queues, if we want to fill up all resources, must have one queue's resource allocated is more than half).
But I just realized that maybe I can disable the Overused and Allocatable checks directly in the plugin's config.

@lowang-bh
Copy link
Member

Don't need to do that. Just let the whole resource only enough for allocating first few jobs and check the left jobs are pending.

Add two cases:

  • total 5cpus, pod-small-1 use 1cpu, then pod-large-2 use 3cpu, and then pod-large-1 pending, indicate that q-2 is allocated first
  • total 5 cpus, pod-small-1 use 1cpu, then pod-large-2 use 2cpu, and then pod-large-1 use 2 cpus, pod-small-2 is pending, indicate that q-2 is first allocated and then q-1 is allocated

@lowang-bh
Copy link
Member

Wait #3408 fix the failed UT.

@lowang-bh
Copy link
Member

May be we need to adjust the e2e test case.

@panoswoo
Copy link
Contributor Author

May be we need to adjust the e2e test case.

let me take a look

@panoswoo
Copy link
Contributor Author

@lowang-bh Hi, I seem to have found some issues while searching for the volcano component logs generated by e2e testing.

  1. when e2e test failed, we will backup volcano component logs by
    function generate-log {
    echo "Generating volcano log files"
    kubectl logs deployment/${CLUSTER_NAME}-admission -n kube-system > volcano-admission.log
    kubectl logs deployment/${CLUSTER_NAME}-controllers -n kube-system > volcano-controller.log
    kubectl logs deployment/${CLUSTER_NAME}-scheduler -n kube-system > volcano-scheduler.log
    }
    we are trying to get log from namespace kube-system but those component are installed in volcano-system actually
    helm install ${CLUSTER_NAME} installer/helm/chart/volcano --namespace volcano-system --kubeconfig ${KUBECONFIG} \
  2. it seems that we didn't upload the log file at the end of the workflow
    - name: Run E2E Tests
    run: |
    make e2e-test-schedulingbase CC=/usr/local/musl/bin/musl-gcc

@lowang-bh
Copy link
Member

  1. we are trying to get log from namespace kube-system but those component are installed in volcano-system actually

That is a issue, now we use volcano-system. You can file another PR an merge it first. @Monokaix

@lowang-bh
Copy link
Member

/retest

@panoswoo panoswoo force-pushed the fix/3407 branch 2 times, most recently from cfb0334 to 4624db7 Compare April 20, 2024 13:01
@panoswoo
Copy link
Contributor Author

May be we need to adjust the e2e test case.

I triggered the test again and it passed.
I haven't made any modifications. Have we recently merged any fixes?

@lowang-bh
Copy link
Member

May be we need to adjust the e2e test case.

I triggered the test again and it passed. I haven't made any modifications. Have we recently merged any fixes?

No update. Maybe it is randomly failed. We'd better check it.

@panoswoo
Copy link
Contributor Author

May be we need to adjust the e2e test case.

I triggered the test again and it passed. I haven't made any modifications. Have we recently merged any fixes?

No update. Maybe it is randomly failed. We'd better check it.

I am unable to reproduce the problem now.
I don't have detailed logs of the previous exception cases, and this error is a large process (involving multiple components) that has timed out. I can't determine which part of the problem went wrong without detailed logs :(

if [[ $? -ne 0 ]]; then
generate-log

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will generate-log be put here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the modifications made in run-e2e-kind.sh are temporary and were used for debugging
I will revert these changes in this PR and create a new one

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign shinytang6
You can assign the PR to them by writing /assign @shinytang6 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lowang-bh
Copy link
Member

@panoswoo please make the ci succeed and so we can make this pr merged.

@panoswoo
Copy link
Contributor Author

/retest

@volcano-sh-bot
Copy link
Contributor

@panoswoo: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@panoswoo
Copy link
Contributor Author

@panoswoo please make the ci succeed and so we can make this pr merged.

I will fix it ASAP

@lowang-bh
Copy link
Member

/ok-to-test

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Apr 27, 2024
@lowang-bh
Copy link
Member

/retest

@lowang-bh
Copy link
Member

lowang-bh commented Apr 27, 2024

/assign @Monokaix @hwdef
Please also help to review it. Thanks.

@panoswoo
Copy link
Contributor Author

/retest

Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 28, 2024
@lowang-bh
Copy link
Member

HI, @panoswoo , You can close and reopen it to trigger the CI.

…finished

Signed-off-by: Panos Woo <panoswoo@outlook.com>
@volcano-sh-bot volcano-sh-bot removed the lgtm Indicates that a PR is ready to be merged. label May 6, 2024
@volcano-sh-bot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

queue is put back before a job's resource allocating
4 participants