New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Backoff when no successful schedules #2102
Backoff when no successful schedules #2102
Conversation
Hi @gabesaba. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
As Aldo mentioned here, I prefer to use |
The implementation of the kube-controller manager might help in understanding how to use the library. |
Alternatively... we can make the wait time equal to The But I would also feel more confident by using an exponential backoff, to guarantee that we are scheduling as fast as possible on startup. |
0719198
to
e09117f
Compare
I uploaded a simple implementation using NewItemExponentialFailureRateLimiter. The downside with this approach is that Sleep will not notice an interrupt (ctx.Done). On the other hand, we are only sleeping for up to 100ms. A slightly more involved approach would be to create a custom BackoffManager which implements the above logic. We would then call BackoffUntil rather than UntilWithContext. BackoffUntil handles the ctx.Done signal properly when backing off. If you prefer the more involved approach, please let me know. |
/ok-to-test |
No strong view here, but it feels better to react to ctx.Done. As you say, this is just 100ms, but I can imagine we would like to extend this in the future to say 1s or even more, as long as we can trigger scheduling on a change in the cache. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
As a quick fix lgtm, but at some point it would be good to have a solution that reacts to ctx.Done (and possibly uses longer maxDelay, but reacts to a change in cache).
pkg/scheduler/scheduler.go
Outdated
} | ||
} | ||
s.handleBackoff(shouldBackoff) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s.handleBackoff(shouldBackoff) | |
// we backoff if there are no successful admissions | |
s.handleBackoff(result != metrics.AdmissionResultSuccess) |
non blocking nit, just to remove the redundant var
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, done!
LGTM label has been added. Git tree hash: b3ef89869eb26dff4ffe7f89a43b4546c55c6650
|
I synced with @gabesaba and decided to implement the BackoffManager to be able to use |
e09117f
to
ef4677a
Compare
/assign @PBundyra |
// UntilWithBackoff runs f in a loop until context indicates finished. It | ||
// applies backoff depending on the SpeedSignal f returns. Backoff increases | ||
// exponentially, ranging from 1ms to 100ms. | ||
func UntilWithBackoff(ctx context.Context, f func(context.Context) SpeedSignal) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func UntilWithBackoff(ctx context.Context, f func(context.Context) SpeedSignal) { | |
func UntilWithBackoff(ctx context.Context, f func(context.Context) bool) { |
The function can simply return whether to backoff or not. Or from the scheduler's perspective: whether it successfully scheduled anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a bool better than a specific type? I defined this type so that the API would be self-explanatory (and harder to mistakenly flip the return type)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use bools when there is no ambiguity of what the return value is. But in this case, it is a bit ambiguos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gabesaba Do you have more advantages rather than using bool
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For type safety: to make UntilWithBackoff harder to use incorrectly. When users read code, it will be clear that this return value means to backoff or not, rather than having to look up in library definition of UntilWithBackoff when true/false should be returned.
Especially since there's some indirection in the usage: Scheduler.schedule
implements the API, and this function is provided to UntilWithBackoff
by Scheduler.Start
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I think this is perfectly fine solution, but I'm thinking yet how does it compare against the job_controller approach ref, using the AddAfter
.
I guess it could make it slightly simpler to implement triggering of the immediate schedule based on a change in the cache. However, we can return to this if we have a use-case (and if after consideration it really proves simpler).
return s.Timer.Reset(0) | ||
} | ||
|
||
func makeSpyTimer() SpyTimer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting, I didn't see this used before. Usually we use fakeClocks. I guess one advantage of a fake clock is that you don't need to wait at all, but given these times are short I think it is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if we can use the fake clock instead of this mock function, I would prefer to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I call Timer.Reset(0) on the real timer, so there is no waiting. I am open to using a fake clock, but I suppose I will still need to implement some functionality to capture the history (as I did with SpyTimer)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I call Timer.Reset(0) on the real timer, so there is no waiting.
Cool.
I will still need to implement some functionality to capture the history (as I did with SpyTimer)?
Mostly in the tests I saw we are just using sleep to pass the arbitrary amount of time, like here.
Maybe fakeClock is more appropriate to test at the Scheduler level, where the internal details are more encapsulated. By analogy to the Job controller we could keep the clock at the Scheduler level, and use real for prod, but fake for testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, not quite the same purpose as the fake clock.
But why do we even keep a timer inside the SpyTimer? It seems rather unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why do we even keep a timer inside the SpyTimer? It seems rather unused.
To avoid implementing the rest of the clock.Timer interface
return s.Timer.Reset(0) | ||
} | ||
|
||
func makeSpyTimer() SpyTimer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if we can use the fake clock instead of this mock function, I would prefer to use it.
inline context.WithCancel; use cmp.Diff
// UntilWithBackoff runs f in a loop until context indicates finished. It | ||
// applies backoff depending on the SpeedSignal f returns. Backoff increases | ||
// exponentially, ranging from 1ms to 100ms. | ||
func UntilWithBackoff(ctx context.Context, f func(context.Context) SpeedSignal) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use bools when there is no ambiguity of what the return value is. But in this case, it is a bit ambiguos.
pkg/util/wait/backoff.go
Outdated
} | ||
wait.BackoffUntil(func() { | ||
mgr.toggleBackoff(f(ctx)) | ||
}, mgr, false, ctx.Done()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
}, mgr, false, ctx.Done()) | |
}, &mgr, false, ctx.Done()) |
This should remove the need for a boolean pointer.
BackoffManager is an interface, so it can accept a pointer or a value.
return s.Timer.Reset(0) | ||
} | ||
|
||
func makeSpyTimer() SpyTimer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, not quite the same purpose as the fake clock.
But why do we even keep a timer inside the SpyTimer? It seems rather unused.
/approve Only some minor comments left |
LGTM |
// UntilWithBackoff runs f in a loop until context indicates finished. It | ||
// applies backoff depending on the SpeedSignal f returns. Backoff increases | ||
// exponentially, ranging from 1ms to 100ms. | ||
func UntilWithBackoff(ctx context.Context, f func(context.Context) SpeedSignal) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gabesaba Do you have more advantages rather than using bool
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
LGTM label has been added. Git tree hash: 1e8a7a4907a37d091a14927eb0044d9a3d2c7077
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, gabesaba The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind bug
/kind failing-test
What this PR does / why we need it:
In the test environment, we observed many logs due to completing the scheduling loop very fast and entering it again immediately. This change adds a backoff if there were no admissions during a scheduling loop
Which issue(s) this PR fixes:
Fixes #2097
Special notes for your reviewer:
Does this PR introduce a user-facing change?