
Operator does not cleanly allow node drains to complete #1701

Open
braunsonm opened this issue Feb 12, 2024 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@braunsonm commented Feb 12, 2024

Describe the bug
On AWS EKS, nodes are set to SchedulingDisabled and pods are evicted in batches (rather than a single kubectl drain). With Knative Serving deployed using the operator, some workloads will never drain when HA is set to 3.

Expected behavior
The Knative Operator should allow these components to drain without user interaction.

To Reproduce

  1. Deploy Knative Serving on EKS with HA set to 3 using the Knative Operator
  2. Upgrade the version of Kubernetes by updating the AMI template. This will trigger AWS to do a rolling upgrade of the nodes
  3. Notice that the knative-serving components cause the operation to hang indefinitely until a human forcibly deletes the pod in question.

Knative release version
1.13.0

Additional context
I have enough nodes that the PDB shouldn't be violated.

braunsonm added the kind/bug label on Feb 12, 2024
@braunsonm (Author) commented

I found the problem. When HA is set to 3, the operator creates a PDB with minAvailable set to 80%. With 3 replicas, evicting even one pod would leave only 2 of 3 (~67%) available, below the 80% minimum, so no pod can ever be evicted.
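
For reference, this is roughly what the generated PDB looks like (a sketch; the name activator-pdb is assumed from the upstream Serving manifests and may differ by version):

```yaml
# Sketch of the PDB shipped for Serving's critical components (the name
# activator-pdb is assumed from the upstream manifests; verify with
# `kubectl get pdb -n knative-serving`).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: activator-pdb
  namespace: knative-serving
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: activator
```

With 3 replicas, Kubernetes rounds the percentage up: ceil(0.8 × 3) = 3 pods must stay healthy, so the PDB reports ALLOWED DISRUPTIONS 0 and every eviction request is rejected.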

@houshengbo (Contributor) commented

@braunsonm Thanks for reporting the issue. Do you have any suggestions on how the operator could change or improve to avoid this issue?

@braunsonm (Author) commented

@houshengbo I think the operator should set maxUnavailable to 1 as a sensible default (rolling each of these critical components one at a time), and continue to allow the user to override that.
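
Something like this, as a sketch of the suggested default (same assumed PDB name as above):

```yaml
# Sketch of the suggested default: at most one pod of the component may be
# disrupted at a time, regardless of the replica count.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: activator-pdb
  namespace: knative-serving
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: activator
```

With HA at 3, a drain still keeps 2 replicas running, but evictions can always make progress one pod at a time.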

@houshengbo (Contributor) commented

Is maxUnavailable for Knative Serving or Knative Eventing? I would rather say this configuration belongs to them, right? The operator does not configure them by default; instead, it reads their manifests and uses the default values from Serving or Eventing. You can use the operator CRs to configure the PodDisruptionBudget.
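
For example, a CR-level override might look like this (a sketch; the podDisruptionBudgetOverride field and the PDB name are assumptions based on the operator docs, and support depends on the operator version):

```yaml
# Sketch of overriding a system PDB through the KnativeServing CR; the
# field name and PDB name below are assumptions, not verified against 1.13.
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  podDisruptionBudgetOverride:
    - name: activator-pdb
      minAvailable: 50%
```

With 3 replicas, 50% rounds up to 2 required healthy pods, so one eviction at a time would be allowed.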

@braunsonm (Author) commented

Yes and no. maxUnavailable would be set for Serving, I think. But if you're configuring HA, it would make sense for the operator to create a PDB so that HA is actually guaranteed; otherwise you could still have an outage if the pods are evicted at the same time.

I do agree with continuing to allow overrides, though, as you currently do.
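
For context, "configuring HA" here refers to the high-availability field on the operator CR, roughly:

```yaml
# Sketch of the HA setting being discussed: the operator scales the
# Serving control-plane deployments (activator, autoscaler, etc.) to this count.
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  high-availability:
    replicas: 3
```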
