Scheduler throughput reduced when there are many gated pods #124384

Open
gabesaba opened this issue Apr 18, 2024 · 33 comments · Fixed by #124618
Assignees
gabesaba, sanposhiho
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@gabesaba
Contributor

gabesaba commented Apr 18, 2024

What happened?

When handling a pod delete event, we attempt to remove pods from the unschedulablePods pool [1], taking a lock while doing so [2]. Gated pods reside in this unschedulable pool, and we process all of them for each delete event. Scheduling throughput is affected, likely because ScheduleOne also takes the lock [3].

[1]

logger.V(3).Info("Delete event for scheduled pod", "pod", klog.KObj(pod))
if err := sched.Cache.RemovePod(logger, pod); err != nil {
    logger.Error(err, "Scheduler cache RemovePod failed", "pod", klog.KObj(pod))
}
sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(logger, queue.AssignedPodDelete, pod, nil, nil)

[2]

func (p *PriorityQueue) MoveAllToActiveOrBackoffQueue(logger klog.Logger, event framework.ClusterEvent, oldObj, newObj interface{}, preCheck PreEnqueueCheck) {
    p.lock.Lock()
    defer p.lock.Unlock()
    p.moveAllToActiveOrBackoffQueue(logger, event, oldObj, newObj, preCheck)
}

[3]

podInfo, err := sched.NextPod(logger)

What did you expect to happen?

Scheduling throughput should not be reduced when there are many gated pods. We do eventually decide against moving these pods, but only after some processing:

schedulingHint := p.isPodWorthRequeuing(logger, pInfo, event, oldObj, newObj)
if schedulingHint == queueSkip {
    // QueueingHintFn determined that this Pod isn't worth putting to activeQ or backoffQ by this event.
    logger.V(5).Info("Event is not making pod schedulable", "pod", klog.KObj(pInfo.Pod), "event", event.Label)
    continue
}

How can we reproduce it (as minimally and precisely as possible)?

Create many (~10000) gated pods. Then create non-gated pods which eventually terminate. As these pods finish and the pod delete events are processed, scheduling throughput decreases.
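
For reference, a minimal client-go sketch of the first step (bulk-creating gated pods). The gate name example.com/hold, the namespace, the image, the pod count, and the kubeconfig location are illustrative assumptions, not details from the report:

// Hypothetical reproduction helper: creates many pods held back by a scheduling gate.
// Assumes a reachable cluster and a kubeconfig at the default location.
package main

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    for i := 0; i < 10000; i++ {
        pod := &corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("gated-%d", i), Namespace: "default"},
            Spec: corev1.PodSpec{
                // The scheduling gate keeps the pod in the scheduler's unschedulable pool
                // until the gate is removed from the pod spec.
                SchedulingGates: []corev1.PodSchedulingGate{{Name: "example.com/hold"}},
                Containers: []corev1.Container{{
                    Name:  "pause",
                    Image: "registry.k8s.io/pause:3.9",
                }},
            },
        }
        if _, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
            panic(err)
        }
    }
}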

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Server Version: v1.29.1-gke.1589017

Cloud provider

Google Cloud - GKE

OS version

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

@gabesaba gabesaba added the kind/bug Categorizes issue or PR as related to a bug. label Apr 18, 2024
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Apr 18, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 18, 2024
@gabesaba
Contributor Author

cc @alculquicondor

@gabesaba
Contributor Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 18, 2024
@alculquicondor
Member

@Huang-Wei, do you remember what the justification was to put the gated pods in the same unschedulable pool of pods?

I think we should just keep them entirely separate, so that gated pods truly don't affect scheduling throughput.

@AxeZhan
Member

AxeZhan commented Apr 19, 2024

/cc

@sanposhiho
Member

sanposhiho commented Apr 19, 2024

I don't think we need to take a lock before checking isPodWorthRequeuing.
Can we just refactor movePodsToActiveOrBackoffQueue so that it takes the lock after isPodWorthRequeuing is done, rather than requiring lock before movePodsToActiveOrBackoffQueue is called?

@chengjoey
Contributor

chengjoey commented Apr 19, 2024

I tried creating 10,000 gated pods, and then, whether I changed the gated pods to non-gated pods or created normal pods, I did not find that throughput was reduced. Is there something I am missing? Scheduling seems to be working well.

[screenshots]

This should not be a bug, because for the lock to affect throughput, the number of pods would have to be much larger than 10,000.

@Huang-Wei
Member

@Huang-Wei, do you remember what the justification was to put the gated pods in the same unschedulable pool of pods?

It's mainly for keeping the queueing code maintainable. Given that the transitions among activeQ/backoffQ/unschedulableQ are already complex, adding another internal queue didn't sound like a good idea at that time.

I think we should just keep them entirely separate, so that gated pods truly don't affect scheduling throughput.

I doubt that throughput is really impacted... As for the re-queuing check on gated pods (I haven't gone through the latest code with schedulingHint), the original version is fairly straightforward: a PodDeletion event won't interest a gated pod at all, so it simply returns. Even if throughput is degraded, it's questionable whether the dominant factor is iterating over gated pods. Needless to say, all re-queueing checks happen in a separate goroutine.

So if throughput is proven to be impacted, I'd suggest narrowing down the real problem; otherwise, moving gated pods to a separate struct won't benefit that much.

Can we just refactor movePodsToActiveOrBackoffQueue so that it takes the lock after isPodWorthRequeuing is done, rather than requiring lock before movePodsToActiveOrBackoffQueue is called?

We should pinpoint the specific issue by crafting a reproducible test, and then work on the solution. Maybe it's the lock, maybe it's something else.

@gabesaba
Contributor Author

Here is a test I ran which reproduces the issue:

  • Create 10k gated pods
  • Create 30k pods at a rate of 50/s, each of which sleeps for 5 minutes

Observe pods scheduling at ~50/s for the first 5 minutes. When pods start terminating, scheduler throughput decreases and a backlog of pods builds up.

[graph: gated_performance_10k]

@gabesaba
Contributor Author

60s pprof sample with the same experiment setup as the comment directly above, after pods started terminating:
[pprof profile]

61.7% of samples are from deletePodFromCache, with 55.3% attributable to these 4 lines:

The smallest change would be to filter gated pods inline here. Could also use a preCheck function, but that is more than one line :)
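
For illustration only, a rough sketch of what such an inline filter might look like in the requeueing loop (not necessarily the change in #124618; it assumes the queued pod info exposes a Gated flag, and the surrounding code and exact call sites are elided):

for _, pInfo := range unschedulablePods {
    // Gated pods cannot be made schedulable by any cluster event, so skip the
    // comparatively expensive queueing-hint check for them up front.
    if pInfo.Gated {
        continue
    }
    schedulingHint := p.isPodWorthRequeuing(logger, pInfo, event, oldObj, newObj)
    if schedulingHint == queueSkip {
        continue
    }
    // ...requeue the pod to activeQ/backoffQ as before...
}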

/assign gabesaba

@sanposhiho
Member

sanposhiho commented Apr 23, 2024

The point is that this issue shows a general possibility that too many Pods in unschedQ can have a negative impact on scheduling cycles.
We can easily imagine a similar degradation happening when numerous Pods are in the unschedulable Pod pool, regardless of whether the reason they are there is a scheduling gate or something else.

So, I'd suggest moving isPodWorthRequeuing outside the lock (#124384 (comment)) as a general solution, rather than having a fix for this specific scenario.

@alculquicondor
Member

The point is that this issue shows a general possibility that too many Pods in unschedQ can have a negative impact on scheduling cycles.

I agree. Just note that the motivation for scheduling gates is to be able to have a lot of gated pods without affecting scheduling performance. So this issue shows a weakness of the current implementation.

OTOH, having a lot of pending pods has always been a problem.

So, I'd suggest moving isPodWorthRequeuing outside the lock (#124384 (comment)) as a general solution, rather than having a fix for this specific scenario.

Are you sure that is possible? In order to run isPodWorthRequeuing, we need to obtain the list of pods from the queues, for which we need a lock. If we release the lock at this point, then the list might change, and I'm not sure if we could run into more problematic race conditions.

@AxeZhan
Member

AxeZhan commented Apr 23, 2024

Are you sure that is possible? In order to run isPodWorthRequeuing, we need to obtain the list of pods from the queues, for which we need a lock. If we release the lock at this point, then the list might change, and I'm not sure if we could run into more problematic race conditions.

I think what sanposhiho means is to release the lock before isPodWorthRequeuing and acquire it again as soon as isPodWorthRequeuing finishes?

Maybe something like this:

p.lock.Lock()
podsToMove := p.unschedulablePods // snapshot the candidates while holding the lock
p.lock.Unlock()

for _, pInfo := range podsToMove {
    schedulingHint := p.isPodWorthRequeuing(...) // run the (potentially slow) check without the lock
    p.lock.Lock()
    p.requeuePodViaQueueingHint(...)
    p.lock.Unlock()
}

We only read queueingHintMap during isPodWorthRequeuing, so it should be fine(?). But I'm not sure if this will be very helpful.

@alculquicondor
Member

Right, I get that. My point is that it sounds very risky.

Definitely not a change that I would be comfortable backporting. And to make the change in master, I would like to see integration tests that exercise parallel pod deletions and creations. But I'm not really sure what the test should look like.

A fix targeting gated pods specifically can be much less risky.

@Huang-Wei
Member

61.7% of samples are from deletePodFromCache, with 55.3% attributable to these 4 lines:

It seems the major contributing factors are related to queueingHint's (core) bits, which means that although the feature gate is disabled, the core logic cannot be disabled.

Given that it's not easy to switch back to the state before the queueingHint core logic was introduced, I agree with Aldo that handling gated pods specifically is a portable fix.

@gabesaba FYI, the original (v1.27) impl. is here, which merely depends on a lightweight podMatchesEvent() function.

@sanposhiho
Member

OK, so we're going to cherry-pick. Then I agree with the quick patch for gated pods proposed earlier.

@sanposhiho
Member

/assign

I'll explore a general fix, including the tests suggested by Aldo.

@AxeZhan
Member

AxeZhan commented Apr 24, 2024

So what's our conclusion now?

Make a quick fix that separates gated pods from the unschedulable pool, and in the meantime try to find a general fix?
If my understanding is correct, I can help with the quick fix while sanposhiho works on a general fix.

@Huang-Wei
Member

@AxeZhan I think @gabesaba earlier self-assigned to explore a back-portable fix.

@AxeZhan
Member

AxeZhan commented Apr 24, 2024

@AxeZhan I think @gabesaba earlier self-assigned to explore a back-portable fix.

Ah, didn't notice that, sry.

@Huang-Wei
Member

Btw: I think the title is a bit vague - "throughput" here means the rate at which the main scheduling goroutine handles pods, while in this issue it's more about the rate at which the scheduling queue moves pods. @alculquicondor @sanposhiho @gabesaba I'd suggest rewording the title to avoid confusion.

@gabesaba
Contributor Author

gabesaba commented Apr 24, 2024

Btw: I think the title is a bit vague - "throughput" here means the rate at which the main scheduling goroutine handles pods, while in this issue it's more about the rate at which the scheduling queue moves pods. @alculquicondor @sanposhiho @gabesaba I'd suggest rewording the title to avoid confusion.

Though this issue is impacting the rate at which we schedule pods - see the graphs in #124384 (comment). Perhaps due (edit: see #124384 (comment) below)

@alculquicondor
Member

We observed increased scheduling latency. And this is because both the event handlers and the scheduling cycle need to access the lock (once for getting the head of the queue, and once for assuming).

@gabesaba
Contributor Author

We acquire the scheduling queue's lock in these two places - though the second is probably irrelevant, as it is done asynchronously:

podInfo, err := sched.NextPod(logger)

sched.SchedulingQueue.Done(assumedPodInfo.Pod.UID)

@sanposhiho
Member

sanposhiho commented Apr 25, 2024

From my view:

This problem highlights two aspects:

  • Scheduler's event handling takes longer than in the previous version (which didn't have QHint).
  • Scheduling latency is negatively impacted when event handling takes time (event handling throughput is slow, there are too many incoming events, etc.).

Based on them, action items:

  • Decouple the scheduling throughput from event handling throughput.
    • A quick fix: filter out gated Pods before isPodWorthRequeuing, which will fix the issue in this particular scenario. @gabesaba
    • A general fix: moving isPodWorthRequeuing outside the lock, or maybe exploring another option. @sanposhiho
  • Observe and maybe improve event handling throughput. After the decoupling above, scheduling throughput and event handling throughput would be two different metrics. (I mean, one is supposed to have little impact on the other.) We need to determine how fast event handling should be at a certain large scale, like the 300 pods/s target we have for scheduling throughput. Then we may or may not find that event handling throughput needs to be improved accordingly.
    • 1. Add observability for event handling throughput.
    • 2. Benchmark and decide the ideal throughput.
    • 3. Improve the throughput based on (2). (possible actions: simplify slower QHint(s), etc)

(I should probably create a separate issue for the second item, rather than putting too much into this one.)

@alculquicondor
Member

Scheduler's event handling takes longer than in the previous version (which didn't have QHint).

TBH, we don't have proof of that. We didn't test 1.25. But it's still worth investigating how to improve the performance overall.

  1. Add observability for event handling throughput.
  2. Benchmark and decide the ideal throughput.
  3. Improve the throughput based on (2). (possible actions: simplify slower QHint(s), etc)

I agree with all of this. TBH, profiling seems to be the best tool for finding the hotspots. And then metrics can help us monitor in the future.

moving isPodWorthRequeuing outside the lock.

Maybe this is a good step forward, and we could potentially guard it with the SchedulingHints feature gate.

@Huang-Wei
Member

We need to determine how fast event handling should be at a certain large scale, like the 300 pods/s target we have for scheduling throughput

and also the duration each moveAllToActiveOrBackoffQueue() operation takes, and the number of times it gets called per second/minute.
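
A hypothetical instrumentation sketch of how that could be measured, using plain prometheus/client_golang (the real scheduler uses k8s.io/component-base/metrics, and the metric name below is made up for illustration):

import "github.com/prometheus/client_golang/prometheus"

// Invented metric name, for illustration only.
var moveAllDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "scheduler_move_all_to_active_or_backoff_duration_seconds",
    Help:    "How long each MoveAllToActiveOrBackoffQueue call takes.",
    Buckets: prometheus.ExponentialBuckets(0.0001, 2, 16),
})

func init() { prometheus.MustRegister(moveAllDuration) }

func (p *PriorityQueue) MoveAllToActiveOrBackoffQueue(logger klog.Logger, event framework.ClusterEvent, oldObj, newObj interface{}, preCheck PreEnqueueCheck) {
    // The histogram's _count series also gives the number of calls per second/minute.
    defer prometheus.NewTimer(moveAllDuration).ObserveDuration()

    p.lock.Lock()
    defer p.lock.Unlock()
    p.moveAllToActiveOrBackoffQueue(logger, event, oldObj, newObj, preCheck)
}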

@sanposhiho
Member

TBH, we don't have proof of that. We didn't test 1.25. But it's still worth investigating how to improve the performance overall.

Ah, okay. I thought it was a degradation from the past version.
But, yeah, we can see the difference from the past based on the benchmark we will add.

And then metrics can help us monitor in the future.

Maybe this is a good step forward, and we could potentially guard it with the SchedulingHints feature gate.

and also the duration each moveAllToActiveOrBackoffQueue() operation takes

Agree with all.

@sanposhiho
Member

created: #124566

@alculquicondor
Member

Oh wait, was 1.28 the version that added hints for the first time? I got confused with this line https://github.com/kubernetes/enhancements/blob/3feb698013b4c227686a972e1da7e93e5fd0dcee/keps/sig-scheduling/4247-queueinghint/kep.yaml#L24

@gabesaba
Contributor Author

I reran the experiment from #124384 (comment) with v1.27.11-gke.1062000, and indeed saw better performance (though I still observed some slowdown).

[graph: experiment_127]

@sanposhiho
Member

/reopen

I'll work on a general fix mentioned in #124384 (comment).

@k8s-ci-robot k8s-ci-robot reopened this May 11, 2024
@k8s-ci-robot
Contributor

@sanposhiho: Reopened this issue.

In response to this:

/reopen

I'll work on a general fix mentioned in #124384 (comment).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
