
NodeAffinity/NodeUnschedulable QueueingHint may miss Node related events that make Pod schedulable #122284

Closed
sanposhiho opened this issue Dec 13, 2023 · 27 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@sanposhiho
Member

What happened?

/kind bug
/triage accepted
/priority urgent
/sig scheduling
/assign

The NodeAffinity QueueingHint may miss Node-related events that make a Pod schedulable because of preCheck.
It's similar to: #119177 (comment)

So:

  1. A Node is added, but preCheck filters it out (because the Node isn't ready yet, for example), so the scheduling queue doesn't receive NodeAdded. (This is how noderesourcefit also failed to receive NodeAdded.)
  2. The Node is later updated and becomes ready, but that arrives as a NodeUpdated event.

In such a scenario, NodeAffinity may return QueueSkip for the event in (2), even though the Node can now accommodate the Pod (the simplified sketch after this link illustrates the pattern):
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodeaffinity/node_affinity.go#L94-L131
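A minimal, standalone Go sketch of that strict pattern (not the real plugin code; the Pod/Node types and the nodeMatchesAffinity helper below are hypothetical stand-ins for the framework's types and matching logic):

package hintsketch

// QueueingHint mimics the scheduler framework's hint result: Queue requeues
// the pod, QueueSkip leaves it in the unschedulable pool.
type QueueingHint int

const (
	QueueSkip QueueingHint = iota
	Queue
)

// Pod and Node are hypothetical stand-ins carrying only what the sketch needs.
type Pod struct{ RequiredLabels map[string]string }
type Node struct{ Labels map[string]string }

// nodeMatchesAffinity stands in for the plugin's required-affinity matching.
func nodeMatchesAffinity(p *Pod, n *Node) bool {
	for k, v := range p.RequiredLabels {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

// strictHintOnNodeUpdate mirrors the problematic pattern: the pod is queued
// only when the update itself flips the node from non-matching to matching.
// If preCheck swallowed the NodeAdded event and a later update only changes
// readiness (labels untouched), oldNode already matches, so this returns
// QueueSkip and the pod is never requeued even though it is now schedulable.
func strictHintOnNodeUpdate(p *Pod, oldNode, newNode *Node) QueueingHint {
	if !nodeMatchesAffinity(p, oldNode) && nodeMatchesAffinity(p, newNode) {
		return Queue
	}
	return QueueSkip
}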

What did you expect to happen?

We have to loosen the filtering in isSchedulableAfterNodeChange until preCheck is removed (#110175).

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a Pod with NodeAffinity while no existing Node can accommodate it.
  2. Create a new Node that the Pod's affinity matches.

This may trigger the problematic scenario above if the new Node is initially filtered out by preCheck.

Anything else we need to know?

No response

Kubernetes version

master

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@sanposhiho sanposhiho added the kind/bug label Dec 13, 2023
@k8s-ci-robot k8s-ci-robot added the triage/accepted label Dec 13, 2023
@k8s-ci-robot
Contributor

@sanposhiho: The label(s) priority/urgent cannot be applied, because the repository doesn't have them.


@k8s-ci-robot k8s-ci-robot added the sig/scheduling label Dec 13, 2023
@sanposhiho
Member Author

/priority critical-urgent

@sanposhiho
Member Author

/assign @carlory

I'll open a PR that just reverts the QHint since the release date is close, but I'll leave the QHint re-implementation to @carlory.

@sanposhiho
Member Author

/cc @kubernetes/sig-scheduling-leads

@sanposhiho
Member Author

sanposhiho commented Dec 13, 2023

Wait... It's the same for NodeUnschedulable. I should have noticed during the review 🤦

/retitle NodeAffinity/NodeUnschedulable QueueingHint may miss Node related events that make Pod schedulable

@k8s-ci-robot k8s-ci-robot changed the title NodeAffinity QueueingHint may miss Node related events that make Pod schedulable NodeAffinity/NodeUnschedulable QueueingHint may miss Node related events that make Pod schedulable Dec 13, 2023
@sanposhiho
Member Author

/assign @wackxu
For the NodeUnschedulable QueueingHint re-implementation.

@Vyom-Yadav
Member

@sanposhiho The release cut is in progress; are we reverting #119396 and #119155?

@sanposhiho
Member Author

sanposhiho commented Dec 13, 2023

@Vyom-Yadav
Yes, I submitted the PRs.

But, unfortunately, I suppose our leads aren't around; they're all in the US timezone and it's late at night there now.

@kerthcet
Member

All Node-related plugins are affected.

@sanposhiho
Member Author

sanposhiho commented Dec 13, 2023

Yes, I just went through all of them, and it seems these two are the only ones that got a QHint in the last release.

Other QHints:

@sanposhiho
Member Author

sanposhiho commented Dec 13, 2023

Wait... I think #109437 has silently come back since it was closed. (We should have mentioned it in a comment somewhere...)
Regardless of whether QHint is implemented or not, we have to make sure that every plugin that registers the Node Add event also registers the Node Update event. I see that some plugins don't follow this. (We must mention it in the docs somewhere.)

@kerthcet
Member

If you mean newly added plugins like DRA, I think yes; let me check that.

@sanposhiho
Member Author

sanposhiho commented Dec 13, 2023

Yes, I meant that. Newly added plugins like DRA register NodeAdd but not NodeUpdated.
I haven't checked all of them, but NodeUnschedulable is also one where reverting the QHint isn't enough; the Node Update event also needs to be added to completely fix the issue.

@sanposhiho
Member Author

We discussed this on Slack:
https://kubernetes.slack.com/archives/C2C40FMNF/p1702438650646789

We decided to make the feature gate for QHint disabled by default. (There was actually an earlier discussion about disabling it because we observed a memory consumption issue: #120622.) This issue reinforces that idea. We'll make it enabled by default once all the issues here plus the memory consumption issue are addressed.

@sanposhiho
Member Author

Also, one general problem is that we have few integration (or e2e) tests for requeueing scenarios.

@kerthcet
Member

Related problems we face now:

  1. preCheck has some potential bugs, as "scheduler: move all preCheck to QueueingHint" #110175 describes, so we're planning to remove preCheck someday (without obvious performance degradation).
  2. Currently, Node-related plugins like NodeUnschedulable, NodeAffinity, and NodeResourceFit apply tight validation in their hint functions, which conflicts with preCheck as this issue describes.
  3. Also because of preCheck, node plugins that register the nodeAdd event should also register the nodeUpdate event; DRA is one plugin we missed. (I'll raise a quick fix following the policy we currently apply.)

For the 2nd point, a feasible approach is to check only the newObj (we currently validate both the oldObj and the newObj, and only return Queue when the oldObj doesn't match but the newObj does). That is still better than before, when we only watched for the event. The disadvantage is that it's not as efficient: say we have two plugins A and B and two nodes Node1 and Node2. When the pod is scheduled, Node1 is rejected by pluginA and Node2 is rejected by pluginB; when Node2 is later updated, pluginA will pass its validation even though the pod is still unschedulable. A rough sketch of this newObj-only check follows.
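As an illustration of that newObj-only approach, here is a minimal sketch reusing the hypothetical stand-in types and the nodeMatchesAffinity helper from the sketch earlier in this issue; it is not the real plugin code:

// loosenedHintOnNodeUpdate checks only the new state of the node. The old
// state is deliberately ignored because, until preCheck is removed, the
// scheduling queue may never have seen this node's Add event, so the old
// state cannot be trusted when deciding whether the pod became schedulable.
func loosenedHintOnNodeUpdate(p *Pod, oldNode, newNode *Node) QueueingHint {
	_ = oldNode // intentionally unused
	if nodeMatchesAffinity(p, newNode) {
		return Queue
	}
	return QueueSkip
}

The inefficiency described above follows directly: any update to a node that satisfies this plugin requeues the pod, even when a different plugin is the real reason the pod is unschedulable, so the pod may go through extra scheduling attempts that still fail.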

@sanposhiho
Member Author

sanposhiho commented Dec 13, 2023

Regarding (3), what do you think about this idea: when a plugin registers nodeAdd, we automatically register nodeUpdated as well (until preCheck is removed), so that we could technically prevent #109437 from happening again in any plugin (both in-tree and external). A rough sketch of the idea follows.
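The following is a minimal sketch of that automatic widening, using simplified stand-in types (ActionType, ClusterEvent, NodeKind, and widenNodeAddEvents are hypothetical names, not the framework's real EventsToRegister/ClusterEvent API):

package eventsketch

// ActionType and ClusterEvent are simplified stand-ins for the scheduler
// framework's event-registration types.
type ActionType uint

const (
	Add ActionType = 1 << iota
	Update
)

type GVK string

const NodeKind GVK = "Node"

type ClusterEvent struct {
	Resource   GVK
	ActionType ActionType
}

// widenNodeAddEvents returns a copy of a plugin's registered events in which
// every Node|Add registration also subscribes to Node|Update, so a Node that
// preCheck hid at Add time is still noticed once it is later updated. The
// framework could apply this to every plugin's registrations (in-tree or
// external) until preCheck is removed.
func widenNodeAddEvents(events []ClusterEvent) []ClusterEvent {
	out := make([]ClusterEvent, 0, len(events))
	for _, ev := range events {
		if ev.Resource == NodeKind && ev.ActionType&Add != 0 {
			ev.ActionType |= Update
		}
		out = append(out, ev)
	}
	return out
}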

@sanposhiho
Member Author

The action items in my mind (correct me if I'm mistaken or missed something):

I'll probably create separate issues for each of them later, so we don't pile a lot of context into this issue.

@kerthcet
Member

kerthcet commented Dec 13, 2023

What I hope is that we can land a quick fix for No.3 and cherry-pick it, then work out the rest in the next release, so we no longer need to patch anything additional.

@alculquicondor
Member

alculquicondor commented Dec 13, 2023

Also, one general problem is that we have few integration (or e2e) tests for requeueing scenarios.

Let's make it a requirement to increase the integration coverage of requeueing before re-enabling the feature.

@sanposhiho
Member Author

sanposhiho commented Dec 14, 2023

@carlory
Member

carlory commented Dec 14, 2023

@sanposhiho here's the re-implementation of the NodeAffinity plugin; please take a look.

@sanposhiho
Member Author

The reverts are done and we have issues for the other action items, so I believe we can close this.
Feel free to reopen if anyone disagrees.

/close

@k8s-ci-robot
Contributor

@sanposhiho: Closing this issue.


@Huang-Wei
Member

@sanposhiho do you want to create (or do you already have) an umbrella issue tracking all ongoing items needed to re-enable the QueueingHint feature? In particular, we don't really want to run into an issue like #109437 again.

@sanposhiho
Member Author

We have all the issues individually, but we don't have an umbrella ☔ one. Let me create it for better tracking.

@sanposhiho
Member Author

#122597
I'll organize the list later.
