Scheduler throughput reduced when there are many gated pods #124384

Open
gabesaba opened this issue Apr 18, 2024 · 33 comments · Fixed by #124618
Assignees
gabesaba, sanposhiho
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@gabesaba
Contributor

gabesaba commented Apr 18, 2024

What happened?

When handling a pod delete event, we attempt to remove pods from the unschedulablePods pool [1], taking a lock while doing so [2]. Gated pods reside in this unschedulable pool, and we process all of them for each delete event. Scheduling throughput is affected, likely because ScheduleOne also takes the lock [3].

[1]

logger.V(3).Info("Delete event for scheduled pod", "pod", klog.KObj(pod))
if err := sched.Cache.RemovePod(logger, pod); err != nil {
    logger.Error(err, "Scheduler cache RemovePod failed", "pod", klog.KObj(pod))
}
sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(logger, queue.AssignedPodDelete, pod, nil, nil)

[2]

func (p *PriorityQueue) MoveAllToActiveOrBackoffQueue(logger klog.Logger, event framework.ClusterEvent, oldObj, newObj interface{}, preCheck PreEnqueueCheck) {
    p.lock.Lock()
    defer p.lock.Unlock()
    p.moveAllToActiveOrBackoffQueue(logger, event, oldObj, newObj, preCheck)
}

[3]

podInfo, err := sched.NextPod(logger)

What did you expect to happen?

Scheduling throughput should not be reduced when there are many gated pods. We do eventually decide against moving these pods, but only after some processing:

schedulingHint := p.isPodWorthRequeuing(logger, pInfo, event, oldObj, newObj)
if schedulingHint == queueSkip {
    // QueueingHintFn determined that this Pod isn't worth putting to activeQ or backoffQ by this event.
    logger.V(5).Info("Event is not making pod schedulable", "pod", klog.KObj(pInfo.Pod), "event", event.Label)
    continue
}

How can we reproduce it (as minimally and precisely as possible)?

Create many (~10000) gated pods. Then create non-gated pods which eventually terminate. As these pods finish and the pod delete events are processed, scheduling throughput decreases.
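
For reference, a minimal client-go sketch of the first step (bulk-creating gated pods). The gate name example.com/hold, the namespace, the image, the pod count, and the kubeconfig location are illustrative assumptions, not details from the report:

// Hypothetical reproduction helper: creates many pods held back by a scheduling gate.
// Assumes a reachable cluster and a kubeconfig at the default location.
package main

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    for i := 0; i < 10000; i++ {
        pod := &corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("gated-%d", i), Namespace: "default"},
            Spec: corev1.PodSpec{
                // The scheduling gate keeps the pod in the scheduler's unschedulable pool
                // until the gate is removed from the pod spec.
                SchedulingGates: []corev1.PodSchedulingGate{{Name: "example.com/hold"}},
                Containers: []corev1.Container{{
                    Name:  "pause",
                    Image: "registry.k8s.io/pause:3.9",
                }},
            },
        }
        if _, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
            panic(err)
        }
    }
}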

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Server Version: v1.29.1-gke.1589017

Cloud provider

Google Cloud - GKE

OS version

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

@gabesaba gabesaba added the kind/bug Categorizes issue or PR as related to a bug. label Apr 18, 2024
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Apr 18, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 18, 2024
@gabesaba
Contributor Author

cc @alculquicondor

@gabesaba
Contributor Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 18, 2024
@alculquicondor
Member

@Huang-Wei, do you remember what the justification was to put the gated pods in the same unschedulable pool of pods?

I think we should just keep them entirely separate, so that gated pods truly don't affect scheduling throughput.

@AxeZhan
Member

AxeZhan commented Apr 19, 2024

/cc

@sanposhiho
Member

sanposhiho commented Apr 19, 2024

I don't think we need to take a lock before checking isPodWorthRequeuing.
Can we just refactor movePodsToActiveOrBackoffQueue so that it takes the lock after isPodWorthRequeuing is done, rather than requiring lock before movePodsToActiveOrBackoffQueue is called?

@chengjoey
Contributor

chengjoey commented Apr 19, 2024

I tried creating 10,000 gated pods, and then, whether I changed the gated pods to non-gated pods or created normal pods, I did not find that throughput was reduced. Is there something I am missing? Scheduling seems to be working well.

[screenshots]

This should not be a bug, because for the lock to affect throughput, the number of pods would have to be much larger than 10,000.

@Huang-Wei
Member

@Huang-Wei, do you remember what the justification was to put the gated pods in the same unschedulable pool of pods?

It's mainly for keeping the queueing code maintainable. Given that the transitions among activeQ/backoffQ/unschedulableQ are already complex, adding another internal queue didn't sound like a good idea at that time.

I think we should just keep them entirely separate, so that gated pods truly don't affect scheduling throughput.

I doubt that throughput is really impacted... As for the re-queuing check on gated pods (I haven't gone through the latest code with schedulingHint), the original version is fairly straightforward: a PodDeletion event won't interest a gated pod at all, so it simply returns. Even if throughput is degraded, it's questionable whether the dominant factor is iterating over gated pods. Needless to say, all re-queueing checks happen in a separate goroutine.

So if throughput is proven to be impacted, I'd suggest narrowing down the real problem; otherwise, moving gated pods to a separate struct won't benefit that much.

Can we just refactor movePodsToActiveOrBackoffQueue so that it takes the lock after isPodWorthRequeuing is done, rather than requiring lock before movePodsToActiveOrBackoffQueue is called?

We should pinpoint the specific issue by crafting a reproducible test, and then work on the solution. Maybe it's the lock, maybe it's something else.

@gabesaba
Contributor Author

Here is a test I ran which reproduces the issue:

  • Create 10k gated pods
  • Create 30k pods at a rate of 50/s, each of which sleeps for 5 minutes

Observe pods scheduling at ~50/s for the first 5 minutes. When pods start terminating, scheduler throughput decreases and a backlog of pods builds up.

[graph: gated_performance_10k]

@gabesaba
Contributor Author

60s pprof sample with the same experiment setup as the comment directly above, after pods started terminating:
[pprof profile]

61.7% of samples are from deletePodFromCache, with 55.3% attributable to these 4 lines:

The smallest change would be to filter gated pods inline here. Could also use a preCheck function, but that is more than one line :)
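
For illustration only, a rough sketch of what such an inline filter might look like in the requeueing loop (not necessarily the change in #124618; it assumes the queued pod info exposes a Gated flag, and the surrounding code and exact call sites are elided):

for _, pInfo := range unschedulablePods {
    // Gated pods cannot be made schedulable by any cluster event, so skip the
    // comparatively expensive queueing-hint check for them up front.
    if pInfo.Gated {
        continue
    }
    schedulingHint := p.isPodWorthRequeuing(logger, pInfo, event, oldObj, newObj)
    if schedulingHint == queueSkip {
        continue
    }
    // ...requeue the pod to activeQ/backoffQ as before...
}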

/assign gabesaba

@sanposhiho
Member

sanposhiho commented Apr 23, 2024

The point is that this issue shows a general possibility that too many Pods in unschedQ can have a negative impact on scheduling cycles.
We can easily imagine a similar degradation happening when numerous Pods are in the unschedulable Pod pool, regardless of whether the reason they are there is a scheduling gate or something else.

So, I'd suggest moving isPodWorthRequeuing outside the lock (#124384 (comment)) as a general solution, rather than having a fix for this specific scenario.

@alculquicondor
Member

The point is that this issue shows a general possibility that too many Pods in unschedQ can have a negative impact on scheduling cycles.

I agree. Just note that the motivation for scheduling gates is to be able to have a lot of gated pods without affecting scheduling performance. So this issue shows a weakness of the current implementation.

OTOH, having a lot of pending pods has always been a problem.

So, I'd suggest moving isPodWorthRequeuing outside the lock (#124384 (comment)) as a general solution, rather than having a fix for this specific scenario.

Are you sure that is possible? In order to run isPodWorthRequeuing, we need to obtain the list of pods from the queues, for which we need a lock. If we release the lock at this point, then the list might change, and I'm not sure if we could run into more problematic race conditions.

@AxeZhan
Member

AxeZhan commented Apr 23, 2024

Are you sure that is possible? In order to run isPodWorthRequeuing, we need to obtain the list of pods from the queues, for which we need a lock. If we release the lock at this point, then the list might change, and I'm not sure if we could run into more problematic race conditions.

I think what sanposhiho means is to release the lock before isPodWorthRequeuing and acquire it again as soon as isPodWorthRequeuing finishes?

Maybe something like this:

p.lock.Lock()
podsToMove := p.unschedulablePods // snapshot the candidates while holding the lock
p.lock.Unlock()

for _, pInfo := range podsToMove {
    schedulingHint := p.isPodWorthRequeuing(...) // run the (potentially slow) check without the lock
    p.lock.Lock()
    p.requeuePodViaQueueingHint(...)
    p.lock.Unlock()
}

We only read queueingHintMap during isPodWorthRequeuing, so it should be fine(?). But I'm not sure if this will be very helpful.

@alculquicondor
Member

Right, I get that. My point is that it sounds very risky.

Definitely not a change that I would be comfortable backporting. And to make the change in master, I would like to see integration tests that exercise parallel pod deletions and creations. But I'm not really sure what the test should look like.

A fix targeting gated pods specifically can be much less risky.

@Huang-Wei
Member

61.7% of samples are from deletePodFromCache, with 55.3% attributable to these 4 lines:

It seems the major contributing factors are related to queueingHint's (core) bits, which means that although the feature gate is disabled, the core logic cannot be disabled.

Given that it's not easy to switch back to the state before the queueingHint core logic was introduced, I agree with Aldo that handling gated pods specifically is a portable fix.

@gabesaba FYI, the original (v1.27) impl. is here, which merely depends on a lightweight podMatchesEvent() function.

@sanposhiho
Member

OK, so we're going to cherry-pick. Then I agree with the quick patch for gated pods proposed earlier.

@sanposhiho
Member

/assign

I'll explore a general fix, including the tests suggested by Aldo.

@AxeZhan
Member

AxeZhan commented Apr 24, 2024

So what's our conclusion now?

Make a quick fix that separates gated pods from the unschedulable pool, and in the meantime try to find a general fix?
If my understanding is correct, I can help with the quick fix while sanposhiho works on a general fix.

@Huang-Wei
Member

@AxeZhan I think @gabesaba earlier self-assigned to explore a back-portable fix.

@AxeZhan
Member

AxeZhan commented Apr 24, 2024

@AxeZhan I think @gabesaba earlier self-assigned to explore a back-portable fix.

Ah, didn't notice that, sry.

@Huang-Wei
Member

Btw: I think the title is a bit vague - "throughput" here means the rate at which the main scheduling goroutine handles pods, while in this issue it's more about the rate at which the scheduling queue moves pods. @alculquicondor @sanposhiho @gabesaba I'd suggest rewording the title to avoid confusion.

@gabesaba
Contributor Author

gabesaba commented Apr 24, 2024

Btw: I think the title is a bit vague - "throughput" here means the rate at which the main scheduling goroutine handles pods, while in this issue it's more about the rate at which the scheduling queue moves pods. @alculquicondor @sanposhiho @gabesaba I'd suggest rewording the title to avoid confusion.

Though this issue is impacting the rate at which we schedule pods - see the graphs in #124384 (comment). Perhaps due (edit: see #124384 (comment) below)

@alculquicondor
Member

We observed increased scheduling latency. And this is because both the event handlers and the scheduling cycle need to access the lock (once for getting the head of the queue, and once for assuming).

@gabesaba
Contributor Author

We acquire the scheduling queue's lock in these two places - though the second is probably irrelevant, as it is done asynchronously:

podInfo, err := sched.NextPod(logger)

sched.SchedulingQueue.Done(assumedPodInfo.Pod.UID)

@sanposhiho
Member

sanposhiho commented Apr 25, 2024

From my view:

This problem highlights two aspects:

  • Scheduler's event handling takes longer than in the previous version (which didn't have QHint).
  • Scheduling latency is negatively impacted when event handling takes time (event handling throughput is slow, there are too many incoming events, etc.).

Based on them, action items:

  • Decouple the scheduling throughput from event handling throughput.
    • A quick fix: filter out gated Pods before isPodWorthRequeuing, which will fix the issue in this particular scenario. @gabesaba
    • A general fix: moving isPodWorthRequeuing outside the lock, or maybe exploring another option. @sanposhiho
  • Observe and maybe improve event handling throughput. After the decoupling above, scheduling throughput and event handling throughput would be two different metrics. (I mean, one is supposed to have little impact on the other.) We need to determine how fast event handling should be at a certain large scale, like the 300 pods/s target we have for scheduling throughput. Then we may or may not find that event handling throughput needs to be improved accordingly.
    • 1. Add observability for event handling throughput.
    • 2. Benchmark and decide the ideal throughput.
    • 3. Improve the throughput based on (2). (possible actions: simplify slower QHint(s), etc)

(I should probably create a separate issue for the second item, rather than putting too much into this one.)

@alculquicondor
Member

Scheduler's event handling takes longer than in the previous version (which didn't have QHint).

TBH, we don't have proof of that. We didn't test 1.25. But it's still worth investigating how to improve the performance overall.

  1. Add observability for event handling throughput.
  2. Benchmark and decide the ideal throughput.
  3. Improve the throughput based on (2). (possible actions: simplify slower QHint(s), etc)

I agree with all of this. TBH, profiling seems to be the best tool for finding the hotspots. And then metrics can help us monitor in the future.

moving isPodWorthRequeuing outside the lock.

Maybe this is a good step forward, and we could potentially guard it with the SchedulingHints feature gate.

@Huang-Wei
Member

We need to determine how fast event handling should be at a certain large scale, like the 300 pods/s target we have for scheduling throughput

and also the duration each moveAllToActiveOrBackoffQueue() operation takes, and the number of times it gets called per second/minute.
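
A hypothetical instrumentation sketch of how that could be measured, using plain prometheus/client_golang (the real scheduler uses k8s.io/component-base/metrics, and the metric name below is made up for illustration):

import "github.com/prometheus/client_golang/prometheus"

// Invented metric name, for illustration only.
var moveAllDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "scheduler_move_all_to_active_or_backoff_duration_seconds",
    Help:    "How long each MoveAllToActiveOrBackoffQueue call takes.",
    Buckets: prometheus.ExponentialBuckets(0.0001, 2, 16),
})

func init() { prometheus.MustRegister(moveAllDuration) }

func (p *PriorityQueue) MoveAllToActiveOrBackoffQueue(logger klog.Logger, event framework.ClusterEvent, oldObj, newObj interface{}, preCheck PreEnqueueCheck) {
    // The histogram's _count series also gives the number of calls per second/minute.
    defer prometheus.NewTimer(moveAllDuration).ObserveDuration()

    p.lock.Lock()
    defer p.lock.Unlock()
    p.moveAllToActiveOrBackoffQueue(logger, event, oldObj, newObj, preCheck)
}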

@sanposhiho
Member

TBH, we don't have proof of that. We didn't test 1.25. But it's still worth investigating how to improve the performance overall.

Ah, okay. I thought it was a degradation from the past version.
But, yeah, we can see the difference from the past based on the benchmark we will add.

And then metrics can help us monitor in the future.

Maybe this is a good step forward, and we could potentially guard it with the SchedulingHints feature gate.

and also the duration each moveAllToActiveOrBackoffQueue() operation takes

Agree with all.

@sanposhiho
Member

created: #124566

@alculquicondor
Member

Oh wait, was 1.28 the version that added hints for the first time? I got confused with this line https://github.com/kubernetes/enhancements/blob/3feb698013b4c227686a972e1da7e93e5fd0dcee/keps/sig-scheduling/4247-queueinghint/kep.yaml#L24

@gabesaba
Contributor Author

I reran the experiment from #124384 (comment) with v1.27.11-gke.1062000, and indeed saw better performance (though I still observed some slowdown).

[graph: experiment_127]

@sanposhiho
Member

/reopen

I'll work on a general fix mentioned in #124384 (comment).

@k8s-ci-robot k8s-ci-robot reopened this May 11, 2024
@k8s-ci-robot
Contributor

@sanposhiho: Reopened this issue.

In response to this:

/reopen

I'll work on a general fix mentioned in #124384 (comment).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
