DRA: scheduler event handlers via assume cache #124595

pohly · 2024-04-28T12:53:34Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Events that make pods schedulable were triggered by the informer cache, not the assume cache. For "claim was deallocated", this led to a small, unlikely race if a pod got scheduled and stopped so quickly that the informer cache never saw the "claim is allocated" state. The event handler now reacts to changes in the assume cache because that cache is guaranteed to receive the "claim is allocated" state that causes some pod to not get scheduled: by definition, the cache must have listed some other claim as using resources needed for that pod.

Which issue(s) this PR fixes:

Fixes #123698

Does this PR introduce a user-facing change?

DRA: fix some small, unlikely race condition during pod scheduling

/assign @kerthcet

Do you have time to review?

/cc @towca

This is related to the work that you are doing for the cluster autoscaler.

k8s-ci-robot · 2024-04-28T12:53:42Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2024-04-29T06:35:50Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please ask for approval from kerthcet. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/scheduler/OWNERS
~~test/OWNERS~~ [pohly]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pohly · 2024-04-29T11:35:40Z

/retest

This enables connecting the event handler for ResourceClaim to the assume cache, which addresses a theoretical race condition. It may also be useful for implementing autoscaler support, because the autoscaler can now modify the content of the cache.

pohly added 2 commits April 26, 2024 17:57

scheduler: add FIFO queue

1dc55af

This is a basic implementation of a first-in-first-out queue with unbounded size. It's useful for cases where a channel with fixed size might deadlock. The caller is responsible for locking.
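
To illustrate the idea in the commit message above, here is a minimal sketch of an unbounded FIFO queue whose caller does the locking. The type name `fifo` and its methods are assumptions made for this example and are not taken from the commit; the actual implementation may use a different layout (for example a ring buffer).

```go
package main

import (
	"fmt"
	"sync"
)

// fifo is a hypothetical unbounded first-in-first-out queue.
// It does no locking itself: the caller must serialize all calls.
type fifo[T any] struct {
	items []T
}

// Push appends an item at the tail. It never blocks, unlike a send
// on a channel whose fixed-size buffer is full.
func (q *fifo[T]) Push(item T) {
	q.items = append(q.items, item)
}

// Pop removes and returns the item at the head.
// The boolean is false when the queue is empty.
func (q *fifo[T]) Pop() (T, bool) {
	var zero T
	if len(q.items) == 0 {
		return zero, false
	}
	item := q.items[0]
	q.items[0] = zero // drop the reference so it can be garbage collected
	q.items = q.items[1:]
	return item, true
}

// Len returns the number of queued items.
func (q *fifo[T]) Len() int {
	return len(q.items)
}

func main() {
	var mu sync.Mutex // owned by the caller, not by the queue
	var q fifo[string]

	mu.Lock()
	q.Push("claim-a")
	q.Push("claim-b")
	mu.Unlock()

	mu.Lock()
	defer mu.Unlock()
	for {
		item, ok := q.Pop()
		if !ok {
			break
		}
		fmt.Println(item) // prints claim-a, then claim-b
	}
}
```

Because `Push` only appends to a slice, a producer that already holds the lock can always enqueue without waiting for a consumer, which is the deadlock scenario a fixed-size channel could run into.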

ktesting: add Step

c6f2a55

Step simplifies using WithStep because it creates a local scope where the same tCtx variable is the one with the step name.
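
The scoping pattern described above can be sketched as follows. This is a self-contained illustration, not the ktesting code: `stepContext`, `withStep`, `step`, and `message` are hypothetical stand-ins for the test context type and its helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// stepContext is a hypothetical stand-in for a test context that
// remembers the names of the steps it is nested in.
type stepContext struct {
	steps []string
}

// withStep returns a copy of the context with one more step name.
func withStep(c stepContext, what string) stepContext {
	steps := append(append([]string(nil), c.steps...), what)
	return stepContext{steps: steps}
}

// step runs cb with a context that carries the step name. Inside cb the
// parameter can reuse the name tCtx, shadowing the outer variable, so the
// body cannot accidentally use the context without the step name.
func step(c stepContext, what string, cb func(tCtx stepContext)) {
	cb(withStep(c, what))
}

// message prefixes msg with the names of all enclosing steps.
func (c stepContext) message(msg string) string {
	if len(c.steps) == 0 {
		return msg
	}
	return strings.Join(c.steps, ": ") + ": " + msg
}

func main() {
	tCtx := stepContext{}
	step(tCtx, "create claim", func(tCtx stepContext) {
		fmt.Println(tCtx.message("starting")) // prints "create claim: starting"
	})
	fmt.Println(tCtx.message("done")) // prints "done"
}
```

Without such a helper, the caller would have to assign the result of `withStep` to a new variable (or overwrite `tCtx`) and remember to use the right one for the rest of the block.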

k8s-ci-robot assigned kerthcet Apr 28, 2024

k8s-ci-robot requested a review from towca April 28, 2024 12:53

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 28, 2024

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 28, 2024

pohly changed the title ~~DRA: schedule event handlers via assume cache~~ DRA: scheduler event handlers via assume cache Apr 28, 2024

pohly force-pushed the dra-scheduler-assume-cache-eventhandlers branch from 3cc6fe2 to 7d9abd5 Compare April 29, 2024 06:35

scheduler: AddEventHandler for assume cache

75dd31d

This enables using the assume cache for cluster events.
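
As a rough sketch of why handlers attached to the assume cache close the race described in the PR text: the cache below notifies handlers for assumed states as well as informer updates, so a handler still observes the "allocated" to "deallocated" transition even when the informer never delivered the allocated state. All names (`claim`, `assumeCache`, `informerUpdate`, and so on) are hypothetical, handlers are called synchronously, and there is no locking, all of which are simplifications relative to the real scheduler code.

```go
package main

import "fmt"

// claim is a hypothetical, heavily reduced stand-in for a ResourceClaim.
type claim struct {
	name      string
	allocated bool
}

// handler is called with the previous (possibly nil) and the new state.
type handler func(oldObj, newObj *claim)

// assumeCache stores the latest known state per claim, whether that state
// was assumed by the scheduler or reported by the informer.
type assumeCache struct {
	objs     map[string]*claim
	handlers []handler
}

func newAssumeCache() *assumeCache {
	return &assumeCache{objs: map[string]*claim{}}
}

// AddEventHandler registers a handler for all future state changes.
func (c *assumeCache) AddEventHandler(h handler) {
	c.handlers = append(c.handlers, h)
}

// set stores the new state and notifies every handler.
func (c *assumeCache) set(obj *claim) {
	old := c.objs[obj.name]
	c.objs[obj.name] = obj
	for _, h := range c.handlers {
		h(old, obj)
	}
}

// Assume records a state the scheduler expects before the informer confirms it.
func (c *assumeCache) Assume(obj *claim) { c.set(obj) }

// informerUpdate records a state delivered by the informer.
func (c *assumeCache) informerUpdate(obj *claim) { c.set(obj) }

func main() {
	c := newAssumeCache()
	c.AddEventHandler(func(oldObj, newObj *claim) {
		if oldObj != nil && oldObj.allocated && !newObj.allocated {
			fmt.Println(newObj.name, "deallocated, retry pending pods")
		}
	})

	// The scheduler assumes the allocation; the pod then stops so quickly
	// that the informer only ever reports the deallocated state.
	c.Assume(&claim{name: "claim-a", allocated: true})
	c.informerUpdate(&claim{name: "claim-a", allocated: false})
	// prints "claim-a deallocated, retry pending pods"
}
```

A handler attached only to the informer would see a single unallocated state in this sequence and would have no transition to react to.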

pohly force-pushed the dra-scheduler-assume-cache-eventhandlers branch from 7d9abd5 to 2d66ba2 Compare April 29, 2024 08:59

bart0sh added this to Triage in SIG Node PR Triage Apr 29, 2024

pohly force-pushed the dra-scheduler-assume-cache-eventhandlers branch from 2d66ba2 to 0b0e8e3 Compare April 29, 2024 12:43

SergeyKanzhelev added this to Triage in SIG Node CI/Test Board May 1, 2024

bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage May 8, 2024

haircommander moved this from Triage to Archive-it in SIG Node CI/Test Board May 8, 2024
