Backoff when no successful schedules #2102

gabesaba · 2024-04-30T12:54:47Z

What type of PR is this?

/kind bug
/kind failing-test

What this PR does / why we need it:

In the test environment, we observed excessive logging because the scheduling loop completed very quickly and was re-entered immediately. This change adds a backoff when a scheduling loop produces no admissions.

Which issue(s) this PR fixes:

Fixes #2097

Special notes for your reviewer:

Does this PR introduce a user-facing change?

None

k8s-ci-robot · 2024-04-30T12:54:56Z

Hi @gabesaba. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

netlify · 2024-04-30T12:55:04Z

✅ Deploy Preview for kubernetes-sigs-kueue canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | `b49d9b7` |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66421ba7f45a2600082e8b5c |

tenzen-y · 2024-05-01T12:33:42Z

As Aldo mentioned here, I prefer to use NewItemExponentialFailureRateLimiter.

tenzen-y · 2024-05-01T12:44:03Z

As Aldo mentioned here, I prefer to use NewItemExponentialFailureRateLimiter.

The implementation of the kube-controller manager might help in understanding how to use the library.

alculquicondor · 2024-05-03T19:12:58Z

Alternatively, could we make the wait time equal to 1 second / QPS?

The non-sliding behavior might then meet our needs: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#NonSlidingUntilWithContext

But I would also feel more confident using an exponential backoff, to guarantee that we schedule as fast as possible on startup.

gabesaba · 2024-05-07T10:26:08Z

I uploaded a simple implementation using NewItemExponentialFailureRateLimiter. The downside with this approach is that Sleep will not notice an interrupt (ctx.Done). On the other hand, we are only sleeping for up to 100ms.

A slightly more involved approach would be to create a custom BackoffManager which implements the above logic. We would then call BackoffUntil rather than UntilWithContext. BackoffUntil handles the ctx.Done signal properly when backing off.

If you prefer the more involved approach, please let me know.

alculquicondor · 2024-05-07T15:09:16Z

/ok-to-test

mimowo · 2024-05-07T15:25:27Z

> I uploaded a simple implementation using NewItemExponentialFailureRateLimiter. The downside with this approach is that Sleep will not notice an interrupt (ctx.Done). On the other hand, we are only sleeping for up to 100ms.

No strong view here, but it feels better to react to ctx.Done.

As you say, this is just 100ms, but I can imagine we would like to extend this in the future to, say, 1s or even more, as long as we can trigger scheduling on a change in the cache.

mimowo

/lgtm
As a quick fix lgtm, but at some point it would be good to have a solution that reacts to ctx.Done (and possibly uses longer maxDelay, but reacts to a change in cache).

mimowo · 2024-05-07T15:48:07Z

pkg/scheduler/scheduler.go

 		}
 	}
+	s.handleBackoff(shouldBackoff)


Suggested change

-	s.handleBackoff(shouldBackoff)
+	// we backoff if there are no successful admissions
+	s.handleBackoff(result != metrics.AdmissionResultSuccess)

non blocking nit, just to remove the redundant var

thanks, done!

k8s-ci-robot · 2024-05-07T15:50:19Z

LGTM label has been added.

Git tree hash: b3ef89869eb26dff4ffe7f89a43b4546c55c6650

alculquicondor · 2024-05-07T17:31:39Z

I synced with @gabesaba and decided to implement the BackoffManager to be able to use BackoffUntil

gabesaba · 2024-05-08T15:36:01Z

/assign @PBundyra

alculquicondor · 2024-05-08T18:29:27Z

pkg/util/wait/backoff.go

+// UntilWithBackoff runs f in a loop until context indicates finished. It
+// applies backoff depending on the SpeedSignal f returns.  Backoff increases
+// exponentially, ranging from 1ms to 100ms.
+func UntilWithBackoff(ctx context.Context, f func(context.Context) SpeedSignal) {


Suggested change

-func UntilWithBackoff(ctx context.Context, f func(context.Context) SpeedSignal) {
+func UntilWithBackoff(ctx context.Context, f func(context.Context) bool) {

The function can simply return whether to backoff or not. Or from the scheduler's perspective: whether it successfully scheduled anything.

Is a bool better than a specific type? I defined this type so that the API would be self-explanatory (and harder to mistakenly flip the return type)

We can use bools when there is no ambiguity about what the return value means. But in this case, it is a bit ambiguous.

@gabesaba Are there other advantages over using a bool?

For type safety: to make UntilWithBackoff harder to use incorrectly. When users read the code, it will be clear that this return value says whether to back off, rather than having to look up the definition of UntilWithBackoff to learn when true or false should be returned.

Especially since there's some indirection in the usage: Scheduler.schedule implements the API, and this function is provided to UntilWithBackoff by Scheduler.Start
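
To make the readability argument concrete, here is a small sketch of a named signal type. The definition of SpeedSignal is not shown in this thread, so the type and the constant names `KeepGoing` and `SlowDown` below are assumptions for illustration, as are the helpers `scheduleOnce` and `describe`.

```go
package main

import "fmt"

// SpeedSignal is an assumed shape for the type discussed above: a named
// boolean whose constants say what the value means at the call site.
type SpeedSignal bool

const (
	KeepGoing SpeedSignal = true  // assumed name: run the next loop immediately
	SlowDown  SpeedSignal = false // assumed name: apply backoff before the next loop
)

// scheduleOnce is a hypothetical stand-in for one scheduling pass. It reports
// KeepGoing when at least one workload was admitted.
func scheduleOnce(admitted int) SpeedSignal {
	if admitted > 0 {
		return KeepGoing
	}
	return SlowDown
}

// describe shows how a consumer of the signal reads without consulting the
// library documentation.
func describe(s SpeedSignal) string {
	if s == SlowDown {
		return "back off"
	}
	return "run again immediately"
}

func main() {
	fmt.Println(describe(scheduleOnce(0))) // back off
	fmt.Println(describe(scheduleOnce(2))) // run again immediately
}
```

Note that the protection is partial: a variable of type bool cannot be returned as a SpeedSignal without an explicit conversion, but the untyped constants true and false still can.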

mimowo

LGTM

I think this is a perfectly fine solution, but I'm still thinking about how it compares with the job_controller approach (ref), which uses AddAfter.

I guess it could make it slightly simpler to implement triggering of the immediate schedule based on a change in the cache. However, we can return to this if we have a use-case (and if after consideration it really proves simpler).

mimowo · 2024-05-09T15:49:28Z

pkg/util/wait/backoff_test.go

+	return s.Timer.Reset(0)
+}
+
+func makeSpyTimer() SpyTimer {


This is interesting, I didn't see this used before. Usually we use fakeClocks. I guess one advantage of a fake clock is that you don't need to wait at all, but given these times are short I think it is ok.

But if we can use the fake clock instead of this mock function, I would prefer to use it.

Note that I call Timer.Reset(0) on the real timer, so there is no waiting. I am open to using a fake clock, but I suppose I will still need to implement some functionality to capture the history (as I did with SpyTimer)?

> Note that I call Timer.Reset(0) on the real timer, so there is no waiting.

Cool.

> I will still need to implement some functionality to capture the history (as I did with SpyTimer)?

In most of the tests I have seen, we just use sleep to let an arbitrary amount of time pass, like here.

Maybe fakeClock is more appropriate to test at the Scheduler level, where the internal details are more encapsulated. By analogy to the Job controller we could keep the clock at the Scheduler level, and use real for prod, but fake for testing.

Yeah, not quite the same purpose as the fake clock.

But why do we even keep a timer inside the SpyTimer? It seems rather unused.

> But why do we even keep a timer inside the SpyTimer? It seems rather unused.

To avoid implementing the rest of the clock.Timer interface

tenzen-y · 2024-05-10T06:16:55Z

pkg/util/wait/backoff_test.go

+	return s.Timer.Reset(0)
+}
+
+func makeSpyTimer() SpyTimer {


But if we can use the fake clock instead of this mock function, I would prefer to use it.

pkg/util/wait/backoff_test.go

inline context.WithCancel; use cmp.Diff

alculquicondor · 2024-05-10T14:48:13Z

pkg/util/wait/backoff.go

+// UntilWithBackoff runs f in a loop until context indicates finished. It
+// applies backoff depending on the SpeedSignal f returns.  Backoff increases
+// exponentially, ranging from 1ms to 100ms.
+func UntilWithBackoff(ctx context.Context, f func(context.Context) SpeedSignal) {


We can use bools when there is no ambiguity about what the return value means. But in this case, it is a bit ambiguous.

alculquicondor · 2024-05-10T14:51:05Z

pkg/util/wait/backoff.go

+	}
+	wait.BackoffUntil(func() {
+		mgr.toggleBackoff(f(ctx))
+	}, mgr, false, ctx.Done())


Suggested change

-	}, mgr, false, ctx.Done())
+	}, &mgr, false, ctx.Done())

This should remove the need for a boolean pointer.

BackoffManager is an interface, so it can accept a pointer or a value.

alculquicondor · 2024-05-10T15:00:02Z

pkg/util/wait/backoff_test.go

+	return s.Timer.Reset(0)
+}
+
+func makeSpyTimer() SpyTimer {


Yeah, not quite the same purpose as the fake clock.

But why do we even keep a timer inside the SpyTimer? It seems rather unused.

alculquicondor · 2024-05-10T17:15:03Z

/approve

Only some minor comments left

mimowo · 2024-05-13T10:48:08Z

LGTM

tenzen-y · 2024-05-13T12:30:17Z

pkg/util/wait/backoff.go

+// UntilWithBackoff runs f in a loop until context indicates finished. It
+// applies backoff depending on the SpeedSignal f returns.  Backoff increases
+// exponentially, ranging from 1ms to 100ms.
+func UntilWithBackoff(ctx context.Context, f func(context.Context) SpeedSignal) {


@gabesaba Are there other advantages over using a bool?

alculquicondor

/lgtm
/approve

k8s-ci-robot · 2024-05-13T17:44:40Z

LGTM label has been added.

Git tree hash: 1e8a7a4907a37d091a14927eb0044d9a3d2c7077

k8s-ci-robot · 2024-05-13T17:44:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, gabesaba

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/OWNERS~~ [alculquicondor]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the release-note-none, kind/bug, and kind/failing-test labels Apr 30, 2024

k8s-ci-robot requested review from denkensk and mimowo April 30, 2024 12:54

k8s-ci-robot added the cncf-cla: yes label Apr 30, 2024

k8s-ci-robot added the needs-ok-to-test and size/XS labels Apr 30, 2024

gabesaba force-pushed the schedule_less_often branch from 0719198 to e09117f Compare May 7, 2024 10:14

k8s-ci-robot added size/S and removed size/XS labels May 7, 2024

gabesaba changed the title ~~run schedule at most once every 10ms~~ Backoff when no successful schedules May 7, 2024

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels May 7, 2024

mimowo reviewed May 7, 2024

k8s-ci-robot assigned mimowo May 7, 2024

k8s-ci-robot added the lgtm label May 7, 2024

Implement UntilWithBackoff

ef4677a

gabesaba force-pushed the schedule_less_often branch from e09117f to ef4677a Compare May 8, 2024 15:32

k8s-ci-robot removed the lgtm label May 8, 2024

k8s-ci-robot requested a review from mimowo May 8, 2024 15:32

k8s-ci-robot added size/L and removed size/S labels May 8, 2024

gabesaba commented May 8, 2024

pkg/util/wait/backoff_test.go

k8s-ci-robot assigned PBundyra May 8, 2024

alculquicondor reviewed May 8, 2024

tenzen-y reviewed May 9, 2024

gabesaba added 2 commits May 9, 2024 11:05

address comments; add tests

926782c

use initialBackoff and maxBackoff

148635d

mimowo reviewed May 9, 2024

tenzen-y reviewed May 10, 2024

Clean up backoff_test

beb00e3

inline context.WithCancel; use cmp.Diff

alculquicondor reviewed May 10, 2024

k8s-ci-robot added the approved label May 10, 2024

gabesaba added 2 commits May 13, 2024 07:06

manager pointer; fix test error message

bb110db

Remove RateLimiter dependency

b3b0b2f

tenzen-y reviewed May 13, 2024

add license

b49d9b7

alculquicondor reviewed May 13, 2024

k8s-ci-robot assigned alculquicondor May 13, 2024

k8s-ci-robot added the lgtm label May 13, 2024

k8s-ci-robot merged commit 1f63a55 into kubernetes-sigs:main May 13, 2024
15 checks passed

k8s-ci-robot added this to the v0.7 milestone May 13, 2024

gabesaba deleted the schedule_less_often branch May 16, 2024 08:06
