KEP-4212: Declarative Node Maintenance #4213
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: atiratree. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Thank you very much for opening the issue and PR!
I know this is a draft; I've got some feedback already, and I hope it's helpful.
🤔 should this be SIG Node rather than SIG Apps?
Not sure yet; the API itself is SIG Node, I guess, but it has a lot of implications for SIG Apps. Let's delay this decision until after I redo the KEP and we have additional rounds of discussions.
termination. NodeMaintenance could then be used even by spot instances.

The NodeMaintenance object could survive kubelet restarts, and the kubelet would always know if the node is under shutdown (maintenance). The cluster admin would have to remove the NodeMaintenance
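A minimal sketch of how a restarted kubelet could re-discover an in-progress maintenance, assuming a hypothetical NodeMaintenance client; none of these names are prescribed by the KEP:

```go
package sketch

import (
	"context"
	"fmt"
)

// NodeMaintenance mirrors the rough shape discussed in this KEP (assumed fields).
type NodeMaintenance struct {
	Name  string
	Stage string   // "Idle", "Cordon", "Drain" or "Complete"
	Nodes []string // nodes matched by .spec.nodeSelector, already resolved here
}

// maintenanceLister is a stand-in for a generated clientset or informer.
type maintenanceLister interface {
	List(ctx context.Context) ([]NodeMaintenance, error)
}

// nodeUnderMaintenance reports whether nodeName is covered by any active
// NodeMaintenance, letting the kubelet recover this knowledge after a restart.
func nodeUnderMaintenance(ctx context.Context, c maintenanceLister, nodeName string) (bool, error) {
	nms, err := c.List(ctx)
	if err != nil {
		return false, fmt.Errorf("listing NodeMaintenance objects: %w", err)
	}
	for _, nm := range nms {
		if nm.Stage == "Idle" || nm.Stage == "Complete" {
			continue // not actively cordoning or draining
		}
		for _, n := range nm.Nodes {
			if n == nodeName {
				return true, nil
			}
		}
	}
	return false, nil
}
```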
Would the shutdown cancellation work in the same way as it currently is under Graceful Node Shutdown? I assume, yes, it would.
This KEP also supports node maintenance for Windows nodes, where graceful shutdown handling is (AIUI) not implemented.
No, it's not; it is very much Linux-centric (not even covering other *nix systems). Would this be a blocker for this KEP to move forward, then?
IMO you can mark a node as needing maintenance even if it doesn't run systemd (either Linux without systemd, or Windows which doesn't really have an init process in the POSIX sense). So this KEP can move forward whether or not your nodes gracefully handle a shutdown signal.
Note:
There are cases when Node termination was cancelled by the system (or perhaps manually by an administrator). In either of those situations the Node will return to the Ready state. However, Pods which already started the process of termination will not be restored by kubelet and will need to be re-scheduled.
Is the node termination cancellation reliable? Can we use it for deletion of the NodeMaintenance?
Also, the behavior when the NodeMaintenance is cancelled/deleted is application specific (please see the Evacuation KEP). The application/evacuator can terminate the pod, or it can recover.
If there is no connection to the apiserver (apiserver down, network issues, etc.) and the NodeMaintenance object cannot be created, we would fall back to the original behavior of the Graceful Node Shutdown feature. If the connection is restored, we would stop the Graceful Node Shutdown and
It might be too late to fall back to something else. We should attempt to add the NodeMaintenance object should it be missing, but it might be too late to stop a Graceful Node Shutdown that is already too far gone, so to speak. This would be a one-way (hard?) fallback.
Alternatively, perhaps doing nothing or refusing to carry out a shutdown would be a safer option?
That said, it depends on how the shutdown was triggered: if it was due to external factors (the default modus operandi Graceful Node Shutdown has been designed for), then it would be potentially unsafe to interrupt the ongoing shutdown, or there might not be enough time, etc.
It would be too late for the pods that are already terminating. But it might be useful for the running non-terminated pods, right?
It seems to me that by default it would be good to carry out the Graceful Node Shutdown in the disconnected scenario. At least to have a softer landing in bad situations. I wonder if it would make sense to make this behavior configurable?
> we would fall back to the original behavior of Graceful Node Shutdown feature

Does this imply that Graceful Node Shutdown will be used the way it is now? Or does it mean DNM will internally reproduce the original Graceful Node Shutdown behavior and just perform that itself?
AIUI, DNM is about marking the node as needing maintenance and barely overlaps with node shutdown handling.
The only overlap is that both things trigger a node drain; but other things could also trigger that, such as node problem detector deciding the node is irreparably faulty.
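A sketch of the disconnected fallback debated in this thread; the client interface and function names are assumptions, since the KEP only says the kubelet falls back to Graceful Node Shutdown when the NodeMaintenance cannot be created:

```go
package sketch

import "context"

// nodeMaintenanceCreator is a stand-in for whatever client the kubelet
// would use to record the shutdown as a NodeMaintenance object.
type nodeMaintenanceCreator interface {
	CreateForNode(ctx context.Context, nodeName string) error
}

// beginShutdown records the shutdown as a NodeMaintenance if it can; if the
// apiserver is unreachable, it runs the local Graceful Node Shutdown path.
// Once local termination starts this is effectively one-way: a late
// NodeMaintenance can only help pods that have not begun terminating yet.
func beginShutdown(ctx context.Context, c nodeMaintenanceCreator, nodeName string, localGracefulShutdown func()) {
	if err := c.CreateForNode(ctx, nodeName); err != nil {
		localGracefulShutdown()
	}
}
```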
One of the problems is that it would be difficult to get real-world adoption, and thus important feedback, on this feature. This is mainly due to the requirement that the feature be implemented and integrated with multiple components to observe the benefits. And those components are both cluster admin and application developer facing.
Using the Core APIs to get adoption is not a good argument. If we are assuming there is appetite, official CRDs will be equally desirable to users... Adding experimental features that intersect with the Pod and Node lifecycle and then don't get adoption is a big penalty for the project.
I have changed the section to better reflect the current state of the KEP, mainly the technical reasons why we prefer the core API. I have kept the adoption part there, but slimmed it down. The adoption argument dates from the time we had the Evacuation API in this KEP; since we moved that to a separate KEP, it is not as important anymore.
The KEP still lacks graduation criteria, tests, and PRR. We can condition GA graduation on reasonable adoption, to see if this feature brings a general benefit to the k8s project.
- Solve the node lifecycle management or automatic shutdown after the node drain is completed. Implementation of this is better suited for other cluster components and actors who can use the node maintenance as a building block to achieve their desired goals.
- Synchronize all pod termination mechanisms (see #8 in the [Motivation](#motivation) section), so that they do
This requires synchronizing the Endpoints/Services too. After a quick skim of the KEP, I wonder whether this approach will hit the same problem @smarterclayton describes in kubernetes/kubernetes#116965 (comment): the Pods should report their status so that traffic is not sent to them while they are not ready.
As this is a non-goal, I think the KEP can comfortably find an easy way to not deliver it.
- This KEP does not propose to solve the non-goals, nor does it suggest how or whether to implement them. It is simply a list of things that could be done.
- Even the goals section expects additional KEPs on the node and apps side to fully examine the problem. But for the sake of simplicity, I keep the NodeMaintenanceAwareDaemonSet and NodeMaintenanceAwareKubelet features here for now.

Regarding what @smarterclayton wrote, I think we have a lot of room to solve that by utilizing both the NodeMaintenance and Evacuation APIs and deciding which parts should be mandatory and which optional.
### Motivation Summary

To sum up. Some applications solve the disruption problem by introducing validating admission webhooks. This has some drawbacks. The webhooks are not easily discoverable by cluster admins. And
Is this not solved with the new addition of CEL?
CEL cannot do everything. For example, it cannot check additional resources or call other APIs. Also, the eviction API was not designed to support this, so it is still unexpected behavior for the clients.
A misconfigured .spec.nodeSelector could select all the nodes (or just all master nodes) in the cluster. This can cause the cluster to get into a degraded and unrecoverable state.

An admission plugin ([NodeMaintenance Admission](#nodemaintenance-admission)) is introduced to
But this will be unrecoverable, right? If the master nodes are not available, you will not be able to connect to the apiserver.
The admission restriction wouldn't prevent you from marking all the nodes as under maintenance. It'd just warn you that this was a bad idea.
(I'd prefer to have a ValidatingAdmissionPolicy rather than code, but we don't have lifecycle management for this kind of thing. In other words, kubeadm can't deploy recommended ValidatingAdmissionPolicies for us, so we shouldn't assume they are in place.)
- You can mark all the nodes as being covered by maintenance (and this may well be true!)
- If you as a cluster administrator then go and pull out the power cords or whatever, you can try to recover by plugging them back in.
- Static Pods don't stop during a node drain, right? So even if you do mark every node as under maintenance, you may still have a control plane. Perhaps no working cluster network and no DNS, but you still have a control plane.
Yes, it can still be a valid strategy. Assuming your control plane is deployed correctly - basically that it shuts down last or runs on the side (invisible to kube).
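Illustrative only: the proposed admission plugin could warn rather than reject when a selector covers the whole cluster, along the lines of this sketch (the function and its inputs are assumptions; the KEP only states that an admission check exists):

```go
package sketch

import "fmt"

// nodeMaintenanceWarnings returns admission warnings (not rejections) for
// selectors that cover every node or touch the control plane.
func nodeMaintenanceWarnings(selectedNodes, totalNodes int, selectsControlPlane bool) []string {
	var warnings []string
	if totalNodes > 0 && selectedNodes == totalNodes {
		warnings = append(warnings,
			fmt.Sprintf("nodeSelector matches all %d nodes in the cluster", totalNodes))
	}
	if selectsControlPlane {
		warnings = append(warnings, "nodeSelector matches control plane nodes")
	}
	return warnings
}
```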
What do we need to allow this KEP to merge as provisional?
Have to go now, dropping a few comments
List of cases where the current solution is not optimal:

1. Without extra manual effort, an application running with a single replica has to settle for
I am super confused by this. I mean, the answer is "yes". If you have 1 replica, taking that replica down will cause a downtime.
Some applications (e.g. Deployments) have a general ability to scale up first, so they do not have to experience downtime if configured correctly. This can be generalized to any number of replicas. There are more details in the linked issues. A minimal example of the single-replica deadlock is sketched below.
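For illustration, a sketch using the real policy/v1 types: with one replica and minAvailable: 1, the eviction API can never make progress, so a drain blocks until the application is scaled up first (names like "web" are made up for the example).

```go
package sketch

import (
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// singleReplicaPDB builds a PDB that deadlocks eviction for a one-replica app.
func singleReplicaPDB() *policyv1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "web", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			// With replicas == 1, any eviction would drop below minAvailable,
			// so every eviction request is rejected with a 429.
			MinAvailable: &minAvailable,
			Selector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}},
		},
	}
}
```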
## Proposal

Most of these issues stem from the lack of a standardized way of detecting a start of the node
@jpbetz - this feels like another case for CRX :)
// +required
NodeSelector *v1.NodeSelector

// The order of the stages is Idle -> Cordon -> Drain -> Complete.
Be really careful with this language. The way this is specced, you have a state machine, which means any future evolution will be very difficult.
Is it allowable for a cordoned node to become idle? This state machine doesn't allow it. I think you'd do better to describe this as a set of mutually exclusive intentions. Modes.
It is hard to offer this as modes, as it should be possible to manually switch between the states/stages, and also to have strict transitions between them (please see Supported Stage Transitions), as not all of them are valid. It is up to the admin or a controller to decide when these transitions should occur.
We also want to have a lifecycle for the NodeMaintenance. Once the NM completes, we can clean up and recover the node. If an additional NM is required, a new object should be created.
I would be happy to hear ideas on how to improve this.
If we decide on the current solution:
- It may be possible to add new stages to the API.
- The extensibility can be offered through a set of modifiers for each stage, as we do with the DrainPlan (see the sketch below).
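A sketch of how strict stage transitions could be encoded and validated; the exact transition set below is an assumption, not the KEP's normative table:

```go
package nodemaintenance

type Stage string

const (
	StageIdle     Stage = "Idle"
	StageCordon   Stage = "Cordon"
	StageDrain    Stage = "Drain"
	StageComplete Stage = "Complete"
)

// allowedTransitions encodes the default order Idle -> Cordon -> Drain ->
// Complete, plus assumed manual backwards moves (e.g. pausing a drain by
// returning to Cordon). Complete is terminal: a new object is created instead.
var allowedTransitions = map[Stage][]Stage{
	StageIdle:   {StageCordon},
	StageCordon: {StageIdle, StageDrain},
	StageDrain:  {StageCordon, StageComplete},
}

// validTransition reports whether moving from one stage to another is allowed.
func validTransition(from, to Stage) bool {
	for _, s := range allowedTransitions[from] {
		if s == to {
			return true
		}
	}
	return false
}
```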
I'm still to read the Evacuation API KEP, but this current one seems big, and I'm not sure that the approach of building on top of a set of problems you've described here, before addressing them, will help in the long run.
cluster, the admin has to make a potentially harmful decision.

From an application owner or developer perspective, the only standard tool they have is a PodDisruptionBudget. This is sufficient in a basic scenario with a simple multi-replica
Do we have a list of all PDB issues somewhere?
from one node to another. This has been discussed in the issue [kubernetes/kubernetes#66811](https://github.com/kubernetes/kubernetes/issues/66811) and in the issue [kubernetes/kubernetes#114877](https://github.com/kubernetes/kubernetes/issues/114877).
2. Similar to the first point, it is difficult to use PDBs for applications that can have a variable number of pods; for example applications with a configured horizontal pod autoscaler (HPA). These
Two possible alternatives would be:
- extend the HPA to update the associated PDB to maintain the required minimum?
- every application should have a reasonable minimum which would allow it to survive any kind of disruption (planned or unplanned).
2. Similar to the first point, it is difficult to use PDBs for applications that can have a variable number of pods; for example applications with a configured horizontal pod autoscaler (HPA). These applications cannot be disrupted during a low load when they have only one pod. However, it is possible to disrupt the pods during a high load without experiencing application downtime. If
I doubt any admin is interested in node maintenance during high load, unless in a critical situation, in which case I'm not sure any of the considerations from this document apply, because you're focused on bringing the cluster back to life.
This can also happen with other mechanisms (e.g., different types of evictions). This has been discussed in the issue [kubernetes/kubernetes#124448](https://github.com/kubernetes/kubernetes/issues/124448) and in the issue [kubernetes/kubernetes#72129](https://github.com/kubernetes/kubernetes/issues/72129).
9. There is not enough metadata about why the node drain was requested. This has been discussed in
A lot of the above points are a nice collection of issues with Graceful Node Shutdown, which is being worked on as part of #2000. Have you considered pointing the authors to this list?
not possible to rely solely on this feature as a go-to solution for graceful node and workload termination.

- The Graceful Node Shutdown feature is not application aware and may prematurely disrupt workloads
None of the solutions will ever be application aware; that is one of the reasons we have operators for specific applications, which know how to deal with all the intricacies of managing them.
To sum up. Some applications solve the disruption problem by introducing validating admission webhooks. This has some drawbacks. The webhooks are not easily discoverable by cluster admins. And they can block evictions for other applications if they are misconfigured or misbehave. The
Misconfigured webhooks are generally a problem for any kind of cluster operation, and as such I don't think this makes a good argument against them. There's an explicit cautionary warning in their official documentation: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#experimenting-with-admission-webhooks. Also, "With great power comes great responsibility" 😉
progress is lost. The NodeMaintenance implementation should also utilize the existing node's `.spec.unschedulable` field, which prevents new pods from being scheduled on such a node.

We will deprecate `kubectl drain` as the main mechanism for draining nodes and drive the whole
With my SIG CLI hat on, I will call out that this wasn't discussed with that group, so before proceeding with steps of that kind you'll need their approval.
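For reference, a sketch of cordoning through the existing `.spec.unschedulable` field with client-go, which is the behavior the NodeMaintenance controller would take over (error handling trimmed):

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// cordon marks the node unschedulable, so the scheduler places no new pods
// on it; existing pods keep running until they are drained separately.
func cordon(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	_, err := cs.CoreV1().Nodes().Patch(ctx, nodeName,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```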
We could implement the NodeMaintenance or Evacuation API out-of-tree first as a CRD.

The KEP aims to solve graceful termination of any pod in the cluster. This is not possible with a
I'm seeing this not as an alternative, but rather as a prerequisite for what this document proposes. You're building a solution on top of existing k8s components, so you first need to address the problems in the underlying pieces. I mentioned that above when asking whether you reported the problems with Graceful Node Shutdown to SIG Node.
during the node drain. To solve this, integration with NodeMaintenance is required. See [DaemonSet Controller](#daemonset-controller) for more details.

Also, one of the disadvantages of using a CRD is that it would be more difficult to get real-world
One of the reasons for introducing CRDs was to allow authors to iterate faster and provide feedback which can:
- improve core k8s;
- allow making an informed decision on whether a particular problem is worth integrating into core or not.

So far, the several different projects you've mentioned in this document were able to achieve the functionality you're proposing here outside of core.
- Taints are used as a mechanism for involuntary disruption; to get pods out of the node for some reason (e.g. node is not ready). Modifying the taint mechanism to be less harmful (e.g. by adding a PDB support) is not possible due to the original requirements.
- It is not possible to incrementally scale down according to pod priorities, labels, etc.
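For contrast, a sketch of the existing taint mechanism the quoted point refers to: a NoExecute taint (here the upstream out-of-service taint) evicts pods without consulting PDBs, which is why the KEP argues taints cannot be made "less harmful" after the fact.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// outOfServiceTaint is the upstream out-of-service taint; applying it to a
// Node triggers deletion of pods that do not tolerate it, bypassing
// PodDisruptionBudgets entirely.
var outOfServiceTaint = corev1.Taint{
	Key:    "node.kubernetes.io/out-of-service",
	Value:  "nodeshutdown",
	Effect: corev1.TaintEffectNoExecute,
}
```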
I don't see the alternative discussed in the SIG Apps meeting from 4/29, during which Clayton proposed extending PDBs with the missing functionality like:
- backpressure
- supporting single-replica apps
- others? (I've asked for a list of PDB problems above.)