Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backoff when no successful schedules #2102

Merged
merged 7 commits into from May 13, 2024

Conversation

gabesaba
Copy link
Contributor

@gabesaba gabesaba commented Apr 30, 2024

What type of PR is this?

/kind bug
/kind failing-test

What this PR does / why we need it:

In the test environment, we observed many logs due to completing the scheduling loop very fast and entering it again immediately. This change adds a backoff if there were no admissions during a scheduling loop

Which issue(s) this PR fixes:

Fixes #2097

Special notes for your reviewer:

Does this PR introduce a user-facing change?

None

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. labels Apr 30, 2024
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 30, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @gabesaba. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 30, 2024
Copy link

netlify bot commented Apr 30, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit b49d9b7
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66421ba7f45a2600082e8b5c

@tenzen-y
Copy link
Member

tenzen-y commented May 1, 2024

As Aldo mentioned here, I prefer to use NewItemExponentialFailureRateLimiter.

@tenzen-y
Copy link
Member

tenzen-y commented May 1, 2024

As Aldo mentioned here, I prefer to use NewItemExponentialFailureRateLimiter.

The implementation of the kube-controller manager might help in understanding how to use the library.

@alculquicondor
Copy link
Contributor

Alternatively... we can make the wait time equal to 1second / qps?

The NonSliding nature can then maybe meet our needs https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#NonSlidingUntilWithContext

But I would also feel more confident by using an exponential backoff, to guarantee that we are scheduling as fast as possible on startup.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 7, 2024
@gabesaba gabesaba changed the title run schedule at most once every 10ms Backoff when no successful schedules May 7, 2024
@gabesaba
Copy link
Contributor Author

gabesaba commented May 7, 2024

I uploaded a simple implementation using NewItemExponentialFailureRateLimiter. The downside with this approach is that Sleep will not notice an interrupt (ctx.Done). On the other hand, we are only sleeping for up to 100ms.

A slightly more involved approach would be to create a custom BackoffManager which implements the above logic. We would then call BackoffUntil rather than UntilWithContext. BackoffUntil handles the ctx.Done signal properly when backing off.

If you prefer the more involved approach, please let me know.

@alculquicondor
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 7, 2024
@mimowo
Copy link
Contributor

mimowo commented May 7, 2024

I uploaded a simple implementation using NewItemExponentialFailureRateLimiter. The downside with this approach is that Sleep will not notice an interrupt (ctx.Done). On the other hand, we are only sleeping for up to 100ms.

No strong view here, but it feels better to react to ctx.Done.

As you say, this is just 100ms, but I can imagine we would like to extend this in the future to say 1s or even more, as long as we can trigger scheduling on a change in the cache.

Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
As a quick fix lgtm, but at some point it would be good to have a solution that reacts to ctx.Done (and possibly uses longer maxDelay, but reacts to a change in cache).

}
}
s.handleBackoff(shouldBackoff)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s.handleBackoff(shouldBackoff)
// we backoff if there are no successful admissions
s.handleBackoff(result != metrics.AdmissionResultSuccess)

non blocking nit, just to remove the redundant var

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, done!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: b3ef89869eb26dff4ffe7f89a43b4546c55c6650

@alculquicondor
Copy link
Contributor

I synced with @gabesaba and decided to implement the BackoffManager to be able to use BackoffUntil

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 8, 2024
@k8s-ci-robot k8s-ci-robot requested a review from mimowo May 8, 2024 15:32
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 8, 2024
@gabesaba
Copy link
Contributor Author

gabesaba commented May 8, 2024

/assign @PBundyra

pkg/scheduler/scheduler.go Outdated Show resolved Hide resolved
pkg/util/wait/backoff.go Outdated Show resolved Hide resolved
pkg/util/wait/backoff.go Outdated Show resolved Hide resolved
pkg/util/wait/backoff.go Outdated Show resolved Hide resolved
// UntilWithBackoff runs f in a loop until context indicates finished. It
// applies backoff depending on the SpeedSignal f returns. Backoff increases
// exponentially, ranging from 1ms to 100ms.
func UntilWithBackoff(ctx context.Context, f func(context.Context) SpeedSignal) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func UntilWithBackoff(ctx context.Context, f func(context.Context) SpeedSignal) {
func UntilWithBackoff(ctx context.Context, f func(context.Context) bool) {

The function can simply return whether to backoff or not. Or from the scheduler's perspective: whether it successfully scheduled anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a bool better than a specific type? I defined this type so that the API would be self-explanatory (and harder to mistakenly flip the return type)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use bools when there is no ambiguity of what the return value is. But in this case, it is a bit ambiguos.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gabesaba Do you have more advantages rather than using bool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For type safety: to make UntilWithBackoff harder to use incorrectly. When users read code, it will be clear that this return value means to backoff or not, rather than having to look up in library definition of UntilWithBackoff when true/false should be returned.

Especially since there's some indirection in the usage: Scheduler.schedule implements the API, and this function is provided to UntilWithBackoff by Scheduler.Start

pkg/util/wait/backoff.go Outdated Show resolved Hide resolved
pkg/util/wait/backoff.go Show resolved Hide resolved
pkg/util/wait/backoff.go Outdated Show resolved Hide resolved
pkg/util/wait/backoff.go Outdated Show resolved Hide resolved
pkg/util/wait/backoff.go Outdated Show resolved Hide resolved
Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I think this is perfectly fine solution, but I'm thinking yet how does it compare against the job_controller approach ref, using the AddAfter.

I guess it could make it slightly simpler to implement triggering of the immediate schedule based on a change in the cache. However, we can return to this if we have a use-case (and if after consideration it really proves simpler).

return s.Timer.Reset(0)
}

func makeSpyTimer() SpyTimer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting, I didn't see this used before. Usually we use fakeClocks. I guess one advantage of a fake clock is that you don't need to wait at all, but given these times are short I think it is ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we can use the fake clock instead of this mock function, I would prefer to use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I call Timer.Reset(0) on the real timer, so there is no waiting. I am open to using a fake clock, but I suppose I will still need to implement some functionality to capture the history (as I did with SpyTimer)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I call Timer.Reset(0) on the real timer, so there is no waiting.

Cool.

I will still need to implement some functionality to capture the history (as I did with SpyTimer)?

Mostly in the tests I saw we are just using sleep to pass the arbitrary amount of time, like here.

Maybe fakeClock is more appropriate to test at the Scheduler level, where the internal details are more encapsulated. By analogy to the Job controller we could keep the clock at the Scheduler level, and use real for prod, but fake for testing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, not quite the same purpose as the fake clock.

But why do we even keep a timer inside the SpyTimer? It seems rather unused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why do we even keep a timer inside the SpyTimer? It seems rather unused.

To avoid implementing the rest of the clock.Timer interface

pkg/scheduler/scheduler.go Show resolved Hide resolved
pkg/util/wait/backoff_test.go Outdated Show resolved Hide resolved
pkg/util/wait/backoff_test.go Outdated Show resolved Hide resolved
pkg/util/wait/backoff_test.go Outdated Show resolved Hide resolved
return s.Timer.Reset(0)
}

func makeSpyTimer() SpyTimer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we can use the fake clock instead of this mock function, I would prefer to use it.

pkg/util/wait/backoff_test.go Outdated Show resolved Hide resolved
inline context.WithCancel; use cmp.Diff
// UntilWithBackoff runs f in a loop until context indicates finished. It
// applies backoff depending on the SpeedSignal f returns. Backoff increases
// exponentially, ranging from 1ms to 100ms.
func UntilWithBackoff(ctx context.Context, f func(context.Context) SpeedSignal) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use bools when there is no ambiguity of what the return value is. But in this case, it is a bit ambiguos.

}
wait.BackoffUntil(func() {
mgr.toggleBackoff(f(ctx))
}, mgr, false, ctx.Done())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}, mgr, false, ctx.Done())
}, &mgr, false, ctx.Done())

This should remove the need for a boolean pointer.

BackoffManager is an interface, so it can accept a pointer or a value.

pkg/util/wait/backoff.go Outdated Show resolved Hide resolved
return s.Timer.Reset(0)
}

func makeSpyTimer() SpyTimer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, not quite the same purpose as the fake clock.

But why do we even keep a timer inside the SpyTimer? It seems rather unused.

@alculquicondor
Copy link
Contributor

/approve

Only some minor comments left

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 10, 2024
@mimowo
Copy link
Contributor

mimowo commented May 13, 2024

LGTM

pkg/util/wait/backoff.go Show resolved Hide resolved
// UntilWithBackoff runs f in a loop until context indicates finished. It
// applies backoff depending on the SpeedSignal f returns. Backoff increases
// exponentially, ranging from 1ms to 100ms.
func UntilWithBackoff(ctx context.Context, f func(context.Context) SpeedSignal) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gabesaba Do you have more advantages rather than using bool?

pkg/util/wait/backoff_test.go Show resolved Hide resolved
Copy link
Contributor

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 13, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 1e8a7a4907a37d091a14927eb0044d9a3d2c7077

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, gabesaba

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 1f63a55 into kubernetes-sigs:main May 13, 2024
15 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.7 milestone May 13, 2024
@gabesaba gabesaba deleted the schedule_less_often branch May 16, 2024 08:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Testgrid for integration tests is broken
6 participants