Handle ELB instance deregistration #316

Open
sarahhodne opened this issue Dec 11, 2020 · 25 comments
Labels: Priority: Medium · stalebot-ignore · Type: Enhancement

@sarahhodne

We've noticed in our production environment that we have a need for something to deregister nodes from load balancers as part of the draining procedure, before the instance is terminated. We're currently using lifecycle-manager for this, but it would be nice if this was handled by the AWS Node Termination Handler instead.

The reason this is needed is that if the instance is terminated before it's deregistered from an ELB, a number of connections will fail until the health check starts failing. This is particularly noticeable on ELBv2 (NLB+ALB), which seem to take several minutes to react, so we need to have fairly high timeout times on the health checks.

The behaviour we're looking for is that the node termination handler finds the classic ELBs and target groups that the node is a member of, sends deregistration requests, and then waits for the deregistration to finish before marking the instance as ready to terminate.
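
For illustration, the per-instance sequence we have in mind looks roughly like this with the AWS CLI (the load balancer name, target group ARN and instance ID are placeholders; this is a sketch of the desired behaviour, not something NTH does today):

#!/bin/bash
# Sketch only: deregister an instance and wait before allowing termination.
INSTANCE_ID="i-0123456789abcdef0"                                                        # placeholder
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/my-tg/0123456789abcdef"  # placeholder

# Classic ELB: deregister and wait until the instance leaves the pool.
aws elb deregister-instances-from-load-balancer \
  --load-balancer-name my-classic-elb --instances "$INSTANCE_ID"
aws elb wait instance-deregistered \
  --load-balancer-name my-classic-elb --instances "$INSTANCE_ID"

# ELBv2 (ALB/NLB): deregister the target and wait for draining to finish.
aws elbv2 deregister-targets \
  --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID"
aws elbv2 wait target-deregistered \
  --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID"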

@yuri-1987

Same issue here. In addition, when the termination handler cordons a node, the node is marked as unschedulable and the service controller removes cordoned nodes from the LB pools, which can potentially drop in-flight requests. There should be a better process for node draining (rough sketch after the list):

  1. Taint the node (don't cordon).
  2. Find the ELBs/target groups and safely deregister the node.
  3. Cordon.
  4. Drain.
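
A rough manual approximation of that order, with kubectl and the AWS CLI (the node name and taint key are placeholders; neither NTH nor the service controller does this automatically today):

NODE="ip-10-0-1-23.ec2.internal"   # placeholder

# 1. Taint only; don't cordon, so the service controller doesn't
#    immediately pull the node out of the LB pools.
kubectl taint node "$NODE" example.com/draining=true:NoSchedule   # hypothetical taint key

# 2. Deregister the instance from its ELBs/target groups and wait,
#    e.g. with `aws elbv2 deregister-targets` followed by
#    `aws elbv2 wait target-deregistered`, as in the sketch above.

# 3. + 4. Only then cordon and drain.
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data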

relevant issues:
kubernetes/autoscaler#1907
kubernetes/kubernetes#65013
kubernetes/kubernetes#44997

and a partial bug fix in 1.19
kubernetes/kubernetes#90823

@bwagner5 bwagner5 added the Type: Enhancement label Dec 16, 2020
@bwagner5
Contributor

I'm definitely interested in looking into this more. I've asked @kishorj, who works on the aws-load-balancer-controller, for his thoughts, since there needs to be a careful dance between the LB controller and NTH during the draining process. There might be more we can do in that controller without involving NTH as much, but if we need to add this logic to NTH, then I'm not opposed.

@yuri-1987

Hi Brandon, thank you for the quick response. I think an external tool such as NTH is suitable for handling this logic. Even if Kubernetes contributors solve it internally, that won't cover all cases, such as draining due to spot interruptions, AZ rebalance, or spot recommendations. The current bug of removing cordoned nodes immediately from the load balancer is four years old; even if the service controller is enhanced someday, it could take a long time before we can use it. I really hope to see this functionality in NTH.

@bwagner5
Contributor

bwagner5 commented Feb 8, 2021

Linking taint-effect issue, since I think that would mitigate this: #273

@sarahhodne
Author

I'm not sure it would really do what we need. The problem is that draining instances from an ELBv2 load balancer is quite slow (usually 4-5 minutes in our experience), and, at least for our nodes, draining the containers is much, much faster.

lifecycle-manager is nice because it polls to make sure the instance is removed from the load balancer before it continues. If I'm reading the taint-effect issue right, it would apply a taint, which could cause an ELB drain to start, but there's not really anything that then waits for the drain to finish before the instances are terminated?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale label Oct 17, 2021
@jillmon jillmon added the Priority: Medium and stalebot-ignore labels and removed the Priority: High and stale labels Oct 19, 2021
@danquack
Contributor

Can we get an update on this? This would be a cool feature!

@infa-ddeore

We are trying to solve the same problem for Cluster Autoscaler. Kubernetes 1.18 and earlier used to remove a node from the LBs when it was cordoned, and we want to retain similar behaviour on 1.19+. One option is to have Cluster Autoscaler add the label below to the worker node; alternatively, deleting the worker node with kubectl delete node removes it from all associated Kubernetes LBs.

node.kubernetes.io/exclude-from-external-load-balancers=true (the value doesn't matter; it can be true/false or anything)
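
For example (the node name is a placeholder):

# Exclude the node from load balancer target pools without deleting it:
kubectl label node ip-10-0-1-23.ec2.internal \
  node.kubernetes.io/exclude-from-external-load-balancers=true

# Or delete the Node object entirely, which also removes it from the
# LBs managed by the in-tree service controller:
kubectl delete node ip-10-0-1-23.ec2.internal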

@farooqashraf1

With the custom termination policy supported by EC2 Auto Scaling, you can specify a Lambda function that drains the node and deregisters it from an ELB. This can be a solution until ELB deregistration is natively supported.

Refer to the following links for more details:

@snay2
Contributor

snay2 commented Feb 11, 2022

Interested to hear from contributors here whether the solution in #582, which adds the label node.kubernetes.io/exclude-from-external-load-balancers to nodes undergoing cordon-and-drain operations, is sufficient for your needs here.

Does that solve your problem, or do we need to do additional work to support your use cases?

@sarahhodne
Author

It does not, in our case. The problem is that all the pods can be drained off the node faster than the node can be deregistered from the load balancer. So something like this happens:

  1. Instance drain starts. The node.kubernetes.io/exclude-from-external-load-balancers label is added, and the load balancer controller starts the deregistration process.
  2. The node termination handler evicts all the pods from the node.
  3. Once the pods are all evicted, termination proceeds, even though the node is not yet deregistered from the ELB.
  4. The instance is terminated, but the ELB continues to send requests to it until either the deregistration finishes or the health check trips.
  5. Finally, the ELB deregistration finishes.

In our experience, and after working with AWS support, the shortest we've been able to get load balancer deregistration down to is 2-3 minutes. Meanwhile, we can usually evict all pods in less than a minute.

@tjs-intel
Contributor

tjs-intel commented Feb 14, 2022

@sarahhodne admittedly I haven't done very comprehensive tests, but what I have observed is that if a target in a target group is draining before the associated instance is terminated, then there is a much higher chance that the termination will not result in request errors. In fact, I was not able to cause any request errors in my testing this way.

I use the aws-load-balancer-controller to provision my load balancers.

@infa-ddeore

Interested to hear from contributors here whether the solution in #582, which adds the label node.kubernetes.io/exclude-from-external-load-balancers to nodes undergoing cordon-and-drain operations, is sufficient for your needs here.

Does that solve your problem, or do we need to do additional work to support your use cases?

We updated Cluster Autoscaler to add the node.kubernetes.io/exclude-from-external-load-balancers label to the nodes, which removes the nodes from all LBs.

In addition, we have an ASG lifecycle hook that waits 300 seconds before terminating the node, and the ELB has 300 seconds of connection draining; this way we avoid 5xx issues.
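
Roughly what that looks like, for anyone reproducing the setup (the ASG and ELB names are placeholders):

# ASG lifecycle hook: hold the instance in Terminating:Wait for up to 300s.
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name drain-wait \
  --auto-scaling-group-name my-asg \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 300 \
  --default-result CONTINUE

# Classic ELB: enable 300s of connection draining.
aws elb modify-load-balancer-attributes \
  --load-balancer-name my-classic-elb \
  --load-balancer-attributes '{"ConnectionDraining":{"Enabled":true,"Timeout":300}}'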

@kristofferahl

kristofferahl commented May 3, 2022

I use the aws-load-balancer-controller to provision my load balancers.

Do you run it in IP or Instance mode @tjs-intel ?

@tjs-intel
Contributor

tjs-intel commented May 3, 2022

@kristofferahl I switched from Instance to IP mode because of the general lack of support for node draining by NTH and brupop.

@kristofferahl

@sarahhodne admittedly I haven't done very comprehensive tests, but what I have observed is that if a target in a target group is draining before the associated instance is terminated, then there is a much higher chance that the termination will not result in request errors. In fact, I was not able to cause any request errors in my testing this way.

Thanks @tjs-intel! We use IP mode as well, so I was wondering if you could explain your setup a bit further, as it seems you're not having any issues with dropped requests when using aws-load-balancer-controller and NTH? How do you achieve draining before the target/underlying instance is terminated?

DWSR added a commit to DWSR/karpenter that referenced this issue Sep 16, 2022
Currently, when Karpenter drains and then deletes a Node from the
cluster, if that node is registered in a Target Group for an ALB/NLB the
corresponding EC2 instance is not removed. This leads to the potential
for increased errors when deleting nodes via Karpenter.

In order to help resolve this issue, this change adds the well-known
`node.kubernetes.io/exclude-from-external-balancers` label, which will
cause the AWS LB controller to remove the node from the Target Group
while Karpenter is draining the node. This is similar to how the AWS
Node Termination Handler works (see
aws/aws-node-termination-handler#316).

In the future, Karpenter might be enhanced to wait for a configurable
period before deleting the Node and terminating the associated instance,
as currently there is a race condition between the Pods being drained
off of the Node and the EC2 instance being removed from the target group.
DWSR added a commit to DWSR/karpenter that referenced this issue Sep 16, 2022
DWSR added a commit to DWSR/karpenter that referenced this issue Sep 17, 2022
DWSR added a commit to DWSR/karpenter that referenced this issue Sep 20, 2022
dewjam pushed a commit to aws/karpenter-provider-aws that referenced this issue Sep 20, 2022
@TaylorChristie

We found a pretty nice way to handle this with Graceful Node Shutdown and preStop hooks on DaemonSets. Essentially you set the kubelet parameters (in our case we use Karpenter, so we specified userData in the EC2NodeClass) as follows:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  userData: |
    #!/bin/bash -xe
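    # Enable kubelet Graceful Node Shutdown: 400s total shutdown grace period,
    # of which the last 100s is reserved for critical pods (e.g. DaemonSets).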
    echo "$(jq '.shutdownGracePeriod="400s"' /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
    echo "$(jq '.shutdownGracePeriodCriticalPods="100s"' /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json

and then deploy a DaemonSet on all Karpenter nodes with a high terminationGracePeriodSeconds and a preStop hook:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: karpenter-termination-waiter
  namespace: kube-system
  labels:
    k8s-app: karpenter-termination-waiter
spec:
  selector:
    matchLabels:
      name: karpenter-termination-waiter
  template:
    metadata:
      labels:
        name: karpenter-termination-waiter
    spec:
      nodeSelector:
        karpenter.sh/registered: "true"
      containers:
        - name: alpine
          image: alpine:latest
          command: ["sleep", "infinity"]
          # wait for the node to be completely deregistered from the load balancer
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "300"]
          resources:
            limits:
              cpu: 5m
              memory: 10Mi
            requests:
              cpu: 2m
              memory: 5Mi
      priorityClassName: high-priority
      terminationGracePeriodSeconds: 300

The node is still running aws-node and kube-proxy behind the scenes, so it can properly direct requests from the load balancer until it's completely drained. It's important that the grace period and the preStop sleep are longer than the deregistration delay on the ALB, so the node isn't terminated before being fully drained.
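
For reference, that deregistration delay is a target group attribute; something like this (the target group ARN is a placeholder) shows or adjusts it so the preStop sleep and terminationGracePeriodSeconds above can be sized to exceed it:

# Check the current deregistration delay on the target group.
aws elbv2 describe-target-group-attributes \
  --target-group-arn "$TARGET_GROUP_ARN" \
  --query 'Attributes[?Key==`deregistration_delay.timeout_seconds`]'

# Optionally lower it so the 300s preStop sleep comfortably covers it.
aws elbv2 modify-target-group-attributes \
  --target-group-arn "$TARGET_GROUP_ARN" \
  --attributes Key=deregistration_delay.timeout_seconds,Value=120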

@deepakdeore2004

@TaylorChristie a similar issue with Karpenter is being discussed at aws/karpenter-provider-aws#4673.

In your workaround, Karpenter removes the node from the LB during draining, then all the pods get deleted, but the karpenter-termination-waiter DaemonSet keeps waiting for its preStop hook to complete, which indirectly keeps the worker node around for some time after it has been removed from the LB?

We are waiting for an out-of-the-box solution from Karpenter, but your workaround makes sense to try and use until there is one.

@TaylorChristie

@TaylorChristie a similar issue with Karpenter is being discussed at aws/karpenter-provider-aws#4673.

In your workaround, Karpenter removes the node from the LB during draining, then all the pods get deleted, but the karpenter-termination-waiter DaemonSet keeps waiting for its preStop hook to complete, which indirectly keeps the worker node around for some time after it has been removed from the LB?

We are waiting for an out-of-the-box solution from Karpenter, but your workaround makes sense to try and use until there is one.

Yep, because of the shutdownGracePeriod set in the kubelet, it won't drain DaemonSets like kube-proxy or aws-node (since they are higher priority), so the nodes can still properly forward NodePort traffic to other endpoints. I agree a native Karpenter solution would be much better, but in our testing this eliminates the LB 5XX issues we were experiencing.

@oridool

oridool commented Mar 20, 2024

Interested to hear from contributors here whether the solution in #582, which adds the label node.kubernetes.io/exclude-from-external-load-balancers to nodes undergoing cordon-and-drain operations, is sufficient for your needs here.
Does that solve your problem, or do we need to do additional work to support your use cases?

We updated Cluster Autoscaler to add the node.kubernetes.io/exclude-from-external-load-balancers label to the nodes, which removes the nodes from all LBs.

In addition, we have an ASG lifecycle hook that waits 300 seconds before terminating the node, and the ELB has 300 seconds of connection draining; this way we avoid 5xx issues.

@infa-ddeore, is there any official PR/fix to the CA for adding the node.kubernetes.io/exclude-from-external-load-balancers label?
And another question for my understanding: assuming this label is added, what makes the ALB remove the node and stop sending it requests? Do we need the aws-load-balancer-controller for that?
Currently, we occasionally experience 502 errors when the CA scales in a node.
Thanks.

@deepakdeore2004

Interested to hear from contributors here whether the solution in #582, which adds the label node.kubernetes.io/exclude-from-external-load-balancers to nodes undergoing cordon-and-drain operations, is sufficient for your needs here.
Does that solve your problem, or do we need to do additional work to support your use cases?

We updated Cluster Autoscaler to add the node.kubernetes.io/exclude-from-external-load-balancers label to the nodes, which removes the nodes from all LBs.
In addition, we have an ASG lifecycle hook that waits 300 seconds before terminating the node, and the ELB has 300 seconds of connection draining; this way we avoid 5xx issues.

@infa-ddeore, is there any official PR/fix to the CA for adding the node.kubernetes.io/exclude-from-external-load-balancers label? And another question for my understanding: assuming this label is added, what makes the ALB remove the node and stop sending it requests? Do we need the aws-load-balancer-controller for that? Currently, we occasionally experience 502 errors when the CA scales in a node. Thanks.

There isn't an official PR for this; our devs made these changes and provided us with a custom Cluster Autoscaler image.
During the node draining process the label is added, and the EKS control plane removes that node from all associated ELBs, since we use the in-tree controller.

I haven't tested this for ALB or with the aws-load-balancer-controller, but I believe the ALB controller must also honor the label; you can try adding the label manually to see whether the node gets removed from the ALB's target group (see the sketch below).
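
Something like this, as a manual check (the node name and target group ARN are placeholders):

# Label the node by hand...
kubectl label node ip-10-0-1-23.ec2.internal \
  node.kubernetes.io/exclude-from-external-load-balancers=true

# ...then watch whether the instance moves to "draining" and drops out
# of the ALB target group.
aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN"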

@oridool

oridool commented Mar 21, 2024

Interested to hear from contributors here whether the solution in #582, which adds the label node.kubernetes.io/exclude-from-external-load-balancers to nodes undergoing cordon-and-drain operations, is sufficient for your needs here.
Does that solve your problem, or do we need to do additional work to support your use cases?

We updated Cluster Autoscaler to add the node.kubernetes.io/exclude-from-external-load-balancers label to the nodes, which removes the nodes from all LBs.
In addition, we have an ASG lifecycle hook that waits 300 seconds before terminating the node, and the ELB has 300 seconds of connection draining; this way we avoid 5xx issues.

@infa-ddeore, is there any official PR/fix to the CA for adding the node.kubernetes.io/exclude-from-external-load-balancers label? And another question for my understanding: assuming this label is added, what makes the ALB remove the node and stop sending it requests? Do we need the aws-load-balancer-controller for that? Currently, we occasionally experience 502 errors when the CA scales in a node. Thanks.

There isn't an official PR for this; our devs made these changes and provided us with a custom Cluster Autoscaler image. During the node draining process the label is added, and the EKS control plane removes that node from all associated ELBs, since we use the in-tree controller.

I haven't tested this for ALB or with the aws-load-balancer-controller, but I believe the ALB controller must also honor the label; you can try adding the label manually to see whether the node gets removed from the ALB's target group.

@infa-ddeore I checked, and indeed the ALB removes the node from the target group when I set the node.kubernetes.io/exclude-from-external-load-balancers label.
Any chance you (or your developers) can publish a PR for that? I think a lot of people would need it.

@oridool

oridool commented Apr 10, 2024

Hi @deepakdeore2004 and all, I'm writing up my findings here after I was able to resolve the issue without any code changes.
It might be helpful to other people.
What you need to do (among other things) is add a 60s AutoScaler delay after taint by setting
--node-delete-delay-after-taint=60s
You can read more about it here

Explanation:
When the AutoScaler concludes that a node needs to be drained and eventually removed from Kubernetes, it sets a special taint on the node with the key "ToBeDeletedByClusterAutoscaler". The aws-load-balancer-controller recognizes this and asks the ALB to remove the node by calling the DeregisterTargets API, causing the ALB to drain connections to this node (more about the ALB draining process here). The default ALB draining time is 300s.
5 seconds after that (the default AutoScaler delay), the AutoScaler calls the TerminateInstanceInAutoScalingGroup API, causing the ASG to terminate the node by calling the TerminateInstances API.
The ALB is not aware that the node is about to be terminated by the ASG.
Even though the ALB sends no new requests to the target while it is draining, requests that are still in flight might last more than 5s. When the node is terminated, those requests end with 502 errors because the connection is interrupted.
To avoid this interruption, the ALB needs some delay so that in-flight requests can finish before the node is terminated. This is achieved by setting the delay with the node-delete-delay-after-taint parameter: Cluster AutoScaler waits 60s before it tells the ASG to terminate the node.

To summarize, the parameter effectively introduces a delay between the DeregisterTargets call and the TerminateInstances call, letting the ALB gracefully drain the connections.

[diagram: with-node-delete-delay-after-taint-mode]
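
For anyone applying this who deploys Cluster Autoscaler with the community Helm chart, the flag can be passed via extraArgs; a minimal sketch (cluster name and region are placeholders, and I'm assuming the chart from kubernetes.github.io/autoscaler):

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-east-1 \
  --set "extraArgs.node-delete-delay-after-taint=60s"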

@deepakdeore2004

Hi @deepakdeore2004 and all, I'm writing up my findings here after I was able to resolve the issue without any code changes. It might be helpful to other people. What you need to do (among other things) is add a 60s AutoScaler delay after taint by setting --node-delete-delay-after-taint=60s. You can read more about it here

Explanation: When the AutoScaler concludes that a node needs to be drained and eventually removed from Kubernetes, it sets a special taint on the node with the key "ToBeDeletedByClusterAutoscaler". The aws-load-balancer-controller recognizes this and asks the ALB to remove the node by calling the DeregisterTargets API, causing the ALB to drain connections to this node. The default ALB draining time is 300s. 5 seconds after that (the default AutoScaler delay), the AutoScaler calls the TerminateInstanceInAutoScalingGroup API, causing the ASG to terminate the node by calling the TerminateInstances API. The ALB is not aware that the node is about to be terminated by the ASG. Even though the ALB sends no new requests to the target while it is draining, requests that are still in flight might last more than 5s. When the node is terminated, those requests end with 502 errors because the connection is interrupted. To avoid this interruption, the ALB needs some delay so that in-flight requests can finish before the node is terminated. This is achieved by setting the delay with the node-delete-delay-after-taint parameter: Cluster AutoScaler waits 60s before it tells the ASG to terminate the node.

To summarize, the parameter effectively introduces a delay between the DeregisterTargets call and the TerminateInstances call, letting the ALB gracefully drain the connections.

Thanks for the details @oridool. I see the AWS LB controller understands ToBeDeletedByClusterAutoscaler and removes the node from the LB when the taint is added, and the --node-delete-delay-after-taint option keeps the node alive for the specified duration. This is a perfect solution for that setup.

However, we use the in-tree controller, which doesn't understand this taint, so the Cluster Autoscaler customization is still needed on our side.
