
NodeAffinity/NodeUnschedulable QueueingHint may miss Node related events that make Pod schedulable #122284

Closed
sanposhiho opened this issue Dec 13, 2023 · 27 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@sanposhiho
Member

What happened?

/kind bug
/triage accepted
/priority urgent
/sig scheduling
/assign

The NodeAffinity QueueingHint may miss Node-related events that make a Pod schedulable because of preCheck.
It's similar to: #119177 (comment)

So:

  1. A Node is added, but preCheck filters it out (because the Node isn't ready yet, for example), so the scheduling queue doesn't receive NodeAdded. (This is how noderesourcefit also failed to receive NodeAdded.)
  2. The Node is later updated and becomes ready, but that arrives as a NodeUpdated event.

In such a scenario, NodeAffinity may return QueueSkip for the event in (2), even though the Node can now accommodate the Pod (the simplified sketch after this link illustrates the pattern):
https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodeaffinity/node_affinity.go#L94-L131
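A minimal, standalone Go sketch of that strict pattern (not the real plugin code; the Pod/Node types and the nodeMatchesAffinity helper below are hypothetical stand-ins for the framework's types and matching logic):

package hintsketch

// QueueingHint mimics the scheduler framework's hint result: Queue requeues
// the pod, QueueSkip leaves it in the unschedulable pool.
type QueueingHint int

const (
	QueueSkip QueueingHint = iota
	Queue
)

// Pod and Node are hypothetical stand-ins carrying only what the sketch needs.
type Pod struct{ RequiredLabels map[string]string }
type Node struct{ Labels map[string]string }

// nodeMatchesAffinity stands in for the plugin's required-affinity matching.
func nodeMatchesAffinity(p *Pod, n *Node) bool {
	for k, v := range p.RequiredLabels {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

// strictHintOnNodeUpdate mirrors the problematic pattern: the pod is queued
// only when the update itself flips the node from non-matching to matching.
// If preCheck swallowed the NodeAdded event and a later update only changes
// readiness (labels untouched), oldNode already matches, so this returns
// QueueSkip and the pod is never requeued even though it is now schedulable.
func strictHintOnNodeUpdate(p *Pod, oldNode, newNode *Node) QueueingHint {
	if !nodeMatchesAffinity(p, oldNode) && nodeMatchesAffinity(p, newNode) {
		return Queue
	}
	return QueueSkip
}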

What did you expect to happen?

We have to loosen the filtering in isSchedulableAfterNodeChange until preCheck is removed (#110175).

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a Pod with NodeAffinity while no existing Node can accommodate it.
  2. Create a new Node that the Pod's affinity matches.

This may trigger the problematic scenario above if the new Node is initially filtered out by preCheck.

Anything else we need to know?

No response

Kubernetes version

master

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@sanposhiho sanposhiho added the kind/bug label Dec 13, 2023
@k8s-ci-robot k8s-ci-robot added the triage/accepted label Dec 13, 2023
@k8s-ci-robot
Contributor

@sanposhiho: The label(s) priority/urgent cannot be applied, because the repository doesn't have them.


@k8s-ci-robot k8s-ci-robot added the sig/scheduling label Dec 13, 2023
@sanposhiho
Member Author

/priority critical-urgent

@sanposhiho
Member Author

/assign @carlory

I'll open a PR that just reverts the QHint since the release date is close, but I'll leave the QHint re-implementation to @carlory.

@sanposhiho
Member Author

/cc @kubernetes/sig-scheduling-leads

@sanposhiho
Member Author

sanposhiho commented Dec 13, 2023

Wait... It's the same for NodeUnschedulable. I should have noticed during the review 🤦

/retitle NodeAffinity/NodeUnschedulable QueueingHint may miss Node related events that make Pod schedulable

@k8s-ci-robot k8s-ci-robot changed the title NodeAffinity QueueingHint may miss Node related events that make Pod schedulable NodeAffinity/NodeUnschedulable QueueingHint may miss Node related events that make Pod schedulable Dec 13, 2023
@sanposhiho
Member Author

/assign @wackxu
For the NodeUnschedulable QueueingHint re-implementation.

@Vyom-Yadav
Member

@sanposhiho The release cut is in progress; are we reverting #119396 and #119155?

@sanposhiho
Member Author

sanposhiho commented Dec 13, 2023

@Vyom-Yadav
Yes, I submitted the PRs.

But, unfortunately, I suppose our leads aren't around; they're all in the US timezone and it's late at night there now.

@kerthcet
Member

All Node-related plugins are affected.

@sanposhiho
Member Author

sanposhiho commented Dec 13, 2023

Yes, I just went through all of them, and it seems these two are the only ones that got a QHint in the last release.

Other QHints:

@sanposhiho
Member Author

sanposhiho commented Dec 13, 2023

Wait... I think #109437 has silently come back since it was closed. (We should have mentioned it in a comment somewhere...)
Regardless of whether QHint is implemented or not, we have to make sure that every plugin that registers the Node Add event also registers the Node Update event. I see that some plugins don't follow this. (We must mention it in the docs somewhere.)

@kerthcet
Member

If you mean newly added plugins like DRA, I think yes; let me check that.

@sanposhiho
Member Author

sanposhiho commented Dec 13, 2023

Yes, I meant that. Newly added plugins like DRA register NodeAdd but not NodeUpdated.
I haven't checked all of them, but NodeUnschedulable is also one where reverting the QHint isn't enough; the Node Update event also needs to be added to completely fix the issue.

@sanposhiho
Member Author

We discussed this on Slack:
https://kubernetes.slack.com/archives/C2C40FMNF/p1702438650646789

We decided to make the feature gate for QHint disabled by default. (There was actually an earlier discussion about disabling it because we observed a memory consumption issue: #120622.) This issue reinforces that idea. We'll make it enabled by default once all the issues here plus the memory consumption issue are addressed.

@sanposhiho
Member Author

Also, one general problem is that we have few integration (or e2e) tests for requeueing scenarios.

@kerthcet
Member

Related problems we face now:

  1. preCheck has some potential bugs, as "scheduler: move all preCheck to QueueingHint" #110175 describes, so we're planning to remove preCheck someday (without obvious performance degradation).
  2. Currently, Node-related plugins like NodeUnschedulable, NodeAffinity, and NodeResourceFit apply tight validation in their hint functions, which conflicts with preCheck as this issue describes.
  3. Also because of preCheck, node plugins that register the nodeAdd event should also register the nodeUpdate event; DRA is one plugin we missed. (I'll raise a quick fix following the policy we currently apply.)

For the 2nd point, a feasible approach is to check only the newObj (we currently validate both the oldObj and the newObj, and only return Queue when the oldObj doesn't match but the newObj does). That is still better than before, when we only watched for the event. The disadvantage is that it's not as efficient: say we have two plugins A and B and two nodes Node1 and Node2. When the pod is scheduled, Node1 is rejected by pluginA and Node2 is rejected by pluginB; when Node2 is later updated, pluginA will pass its validation even though the pod is still unschedulable. A rough sketch of this newObj-only check follows.
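As an illustration of that newObj-only approach, here is a minimal sketch reusing the hypothetical stand-in types and the nodeMatchesAffinity helper from the sketch earlier in this issue; it is not the real plugin code:

// loosenedHintOnNodeUpdate checks only the new state of the node. The old
// state is deliberately ignored because, until preCheck is removed, the
// scheduling queue may never have seen this node's Add event, so the old
// state cannot be trusted when deciding whether the pod became schedulable.
func loosenedHintOnNodeUpdate(p *Pod, oldNode, newNode *Node) QueueingHint {
	_ = oldNode // intentionally unused
	if nodeMatchesAffinity(p, newNode) {
		return Queue
	}
	return QueueSkip
}

The inefficiency described above follows directly: any update to a node that satisfies this plugin requeues the pod, even when a different plugin is the real reason the pod is unschedulable, so the pod may go through extra scheduling attempts that still fail.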

@sanposhiho
Member Author

sanposhiho commented Dec 13, 2023

Regarding (3), what do you think about this idea: when a plugin registers nodeAdd, we automatically register nodeUpdated as well (until preCheck is removed), so that we could technically prevent #109437 from happening again in any plugin (both in-tree and external). A rough sketch of the idea follows.
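The following is a minimal sketch of that automatic widening, using simplified stand-in types (ActionType, ClusterEvent, NodeKind, and widenNodeAddEvents are hypothetical names, not the framework's real EventsToRegister/ClusterEvent API):

package eventsketch

// ActionType and ClusterEvent are simplified stand-ins for the scheduler
// framework's event-registration types.
type ActionType uint

const (
	Add ActionType = 1 << iota
	Update
)

type GVK string

const NodeKind GVK = "Node"

type ClusterEvent struct {
	Resource   GVK
	ActionType ActionType
}

// widenNodeAddEvents returns a copy of a plugin's registered events in which
// every Node|Add registration also subscribes to Node|Update, so a Node that
// preCheck hid at Add time is still noticed once it is later updated. The
// framework could apply this to every plugin's registrations (in-tree or
// external) until preCheck is removed.
func widenNodeAddEvents(events []ClusterEvent) []ClusterEvent {
	out := make([]ClusterEvent, 0, len(events))
	for _, ev := range events {
		if ev.Resource == NodeKind && ev.ActionType&Add != 0 {
			ev.ActionType |= Update
		}
		out = append(out, ev)
	}
	return out
}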

@sanposhiho
Member Author

The action items in my mind (correct me if I'm mistaken or missed something):

I'll probably create separate issues for each of them later, so we don't pile a lot of context into this issue.

@kerthcet
Member

kerthcet commented Dec 13, 2023

What I hope is that we can land a quick fix for No.3 and cherry-pick it, then work out the rest in the next release, so we no longer need to patch anything additional.

@alculquicondor
Member

alculquicondor commented Dec 13, 2023

Also, one general problem is that we have few integration (or e2e) tests for requeueing scenarios.

Let's make it a requirement to increase the integration coverage of requeueing before re-enabling the feature.

@sanposhiho
Member Author

sanposhiho commented Dec 14, 2023

@carlory
Member

carlory commented Dec 14, 2023

@sanposhiho here's the re-implementation of the NodeAffinity plugin; please take a look.

@sanposhiho
Member Author

The reverts are done and we have issues for the other action items, so I believe we can close this.
Feel free to reopen if anyone disagrees.

/close

@k8s-ci-robot
Contributor

@sanposhiho: Closing this issue.


@Huang-Wei
Member

@sanposhiho do you want to create (or do you already have) an umbrella issue tracking all ongoing items needed to re-enable the QueueingHint feature? In particular, we don't really want to run into an issue like #109437 again.

@sanposhiho
Member Author

We have all the issues individually, but we don't have an umbrella ☔ one. Let me create it for better tracking.

@sanposhiho
Member Author

#122597
I'll organize the list later.
