Handle ELB instance deregistration #316

Open
sarahhodne opened this issue Dec 11, 2020 · 25 comments
Labels: Priority: Medium · stalebot-ignore · Type: Enhancement

@sarahhodne

We've noticed in our production environment that we have a need for something to deregister nodes from load balancers as part of the draining procedure, before the instance is terminated. We're currently using lifecycle-manager for this, but it would be nice if this was handled by the AWS Node Termination Handler instead.

The reason this is needed is that if the instance is terminated before it's deregistered from an ELB, a number of connections will fail until the health check starts failing. This is particularly noticeable on ELBv2 (NLB+ALB), which seem to take several minutes to react, so we need to have fairly high timeout times on the health checks.

The behaviour we're looking for is that the node termination handler finds the classic ELBs and target groups that the node is a member of, sends deregistration requests, and then waits for the deregistration to finish before marking the instance as ready to terminate.
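
For illustration, the per-instance sequence we have in mind looks roughly like this with the AWS CLI (the load balancer name, target group ARN and instance ID are placeholders; this is a sketch of the desired behaviour, not something NTH does today):

#!/bin/bash
# Sketch only: deregister an instance and wait before allowing termination.
INSTANCE_ID="i-0123456789abcdef0"                                                        # placeholder
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/my-tg/0123456789abcdef"  # placeholder

# Classic ELB: deregister and wait until the instance leaves the pool.
aws elb deregister-instances-from-load-balancer \
  --load-balancer-name my-classic-elb --instances "$INSTANCE_ID"
aws elb wait instance-deregistered \
  --load-balancer-name my-classic-elb --instances "$INSTANCE_ID"

# ELBv2 (ALB/NLB): deregister the target and wait for draining to finish.
aws elbv2 deregister-targets \
  --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID"
aws elbv2 wait target-deregistered \
  --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID"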

@yuri-1987

Same issue here. In addition, when the termination handler cordons a node, the node is marked as unschedulable and the service controller removes cordoned nodes from the LB pools, which can potentially drop in-flight requests. There should be a better process for node draining (rough sketch after the list):

  1. Taint the node (don't cordon).
  2. Find the ELBs/target groups and safely deregister the node.
  3. Cordon.
  4. Drain.
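
A rough manual approximation of that order, with kubectl and the AWS CLI (the node name and taint key are placeholders; neither NTH nor the service controller does this automatically today):

NODE="ip-10-0-1-23.ec2.internal"   # placeholder

# 1. Taint only; don't cordon, so the service controller doesn't
#    immediately pull the node out of the LB pools.
kubectl taint node "$NODE" example.com/draining=true:NoSchedule   # hypothetical taint key

# 2. Deregister the instance from its ELBs/target groups and wait,
#    e.g. with `aws elbv2 deregister-targets` followed by
#    `aws elbv2 wait target-deregistered`, as in the sketch above.

# 3. + 4. Only then cordon and drain.
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data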

relevant issues:
kubernetes/autoscaler#1907
kubernetes/kubernetes#65013
kubernetes/kubernetes#44997

and a partial bug fix in 1.19
kubernetes/kubernetes#90823

@bwagner5 bwagner5 added the Type: Enhancement label Dec 16, 2020
@bwagner5
Contributor

I'm definitely interested in looking into this more. I've asked @kishorj, who works on the aws-load-balancer-controller, for his thoughts, since there needs to be a careful dance between the LB controller and NTH during the draining process. There might be more we can do in that controller without involving NTH as much, but if we need to add this logic to NTH, then I'm not opposed.

@yuri-1987

Hi Brandon, thank you for the quick response. I think an external tool such as NTH is suitable for handling this logic. Even if Kubernetes contributors solve it internally, that won't cover all cases, such as draining due to spot interruptions, AZ rebalance, or spot recommendations. The current bug of removing cordoned nodes immediately from the load balancer is four years old; even if the service controller is enhanced someday, it could take a long time before we can use it. I really hope to see this functionality in NTH.

@bwagner5
Contributor

bwagner5 commented Feb 8, 2021

Linking taint-effect issue, since I think that would mitigate this: #273

@sarahhodne
Author

I'm not sure it would really do what we need. The problem is that draining instances from an ELBv2 load balancer is quite slow (usually 4-5 minutes in our experience), and, at least for our nodes, draining the containers is much, much faster.

lifecycle-manager is nice because it polls to make sure the instance is removed from the load balancer before it continues. If I'm reading the taint-effect issue right, it would apply a taint, which could cause an ELB drain to start, but there's not really anything that then waits for the drain to finish before the instances are terminated?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale label Oct 17, 2021
@jillmon jillmon added the Priority: Medium and stalebot-ignore labels and removed the Priority: High and stale labels Oct 19, 2021
@danquack
Contributor

Can we get an update on this? This would be a cool feature!

@infa-ddeore

We are trying to solve the same problem for Cluster Autoscaler. Kubernetes 1.18 and earlier used to remove a node from the LBs when it was cordoned, and we want to retain similar behaviour on 1.19+. One option is to have Cluster Autoscaler add the label below to the worker node; alternatively, deleting the worker node with kubectl delete node removes it from all associated Kubernetes LBs.

node.kubernetes.io/exclude-from-external-load-balancers=true (the value doesn't matter; it can be true/false or anything)
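
For example (the node name is a placeholder):

# Exclude the node from load balancer target pools without deleting it:
kubectl label node ip-10-0-1-23.ec2.internal \
  node.kubernetes.io/exclude-from-external-load-balancers=true

# Or delete the Node object entirely, which also removes it from the
# LBs managed by the in-tree service controller:
kubectl delete node ip-10-0-1-23.ec2.internal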

@farooqashraf1

With the custom termination policy supported by EC2 Auto Scaling, you can specify a Lambda function that drains the node and deregisters it from an ELB. This can be a solution until ELB deregistration is natively supported.

Refer to the following links for more details:

@snay2
Contributor

snay2 commented Feb 11, 2022

Interested to hear from contributors here whether the solution in #582, which adds the label node.kubernetes.io/exclude-from-external-load-balancers to nodes undergoing cordon-and-drain operations, is sufficient for your needs here.

Does that solve your problem, or do we need to do additional work to support your use cases?

@sarahhodne
Author

It does not, in our case. The problem is that all the pods can be drained off the node faster than the node can be deregistered from the load balancer. So something like this happens:

  1. Instance drain starts. The node.kubernetes.io/exclude-from-external-load-balancers label is added, and the load balancer controller starts the deregistration process.
  2. The node termination handler evicts all the pods from the node.
  3. Once the pods are all evicted, termination proceeds, even though the node is not yet deregistered from the ELB.
  4. The instance is terminated, but the ELB continues to send requests to it until either the deregistration finishes or the health check trips.
  5. Finally, the ELB deregistration finishes.

In our experience, and after working with AWS support, the shortest we've been able to get load balancer deregistration down to is 2-3 minutes. Meanwhile, we can usually evict all pods in less than a minute.

@tjs-intel
Contributor

tjs-intel commented Feb 14, 2022

@sarahhodne admittedly I haven't done very comprehensive tests, but what I have observed is that if a target in a target group is draining before the associated instance is terminated, then there is a much higher chance that the termination will not result in request errors. In fact, I was not able to cause any request errors in my testing this way.

I use the aws-load-balancer-controller to provision my load balancers.

@infa-ddeore

Interested to hear from contributors here whether the solution in #582, which adds the label node.kubernetes.io/exclude-from-external-load-balancers to nodes undergoing cordon-and-drain operations, is sufficient for your needs here.

Does that solve your problem, or do we need to do additional work to support your use cases?

We updated Cluster Autoscaler to add the node.kubernetes.io/exclude-from-external-load-balancers label to the nodes, which removes the nodes from all LBs.

In addition, we have an ASG lifecycle hook that waits 300 seconds before terminating the node, and the ELB has 300 seconds of connection draining; this way we avoid 5xx issues.
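
Roughly what that looks like, for anyone reproducing the setup (the ASG and ELB names are placeholders):

# ASG lifecycle hook: hold the instance in Terminating:Wait for up to 300s.
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name drain-wait \
  --auto-scaling-group-name my-asg \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 300 \
  --default-result CONTINUE

# Classic ELB: enable 300s of connection draining.
aws elb modify-load-balancer-attributes \
  --load-balancer-name my-classic-elb \
  --load-balancer-attributes '{"ConnectionDraining":{"Enabled":true,"Timeout":300}}'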

@kristofferahl

kristofferahl commented May 3, 2022

I use the aws-load-balancer-controller to provision my load balancers.

Do you run it in IP or Instance mode @tjs-intel ?

@tjs-intel
Contributor

tjs-intel commented May 3, 2022

@kristofferahl I switched from Instance to IP mode because of the general lack of support for node draining by NTH and brupop.

@kristofferahl

@sarahhodne admittedly I haven't done very comprehensive tests, but what I have observed is that if a target in a target group is draining before the associated instance is terminated, then there is a much higher chance that the termination will not result in request errors. In fact, I was not able to cause any request errors in my testing this way.

Thanks @tjs-intel! We use IP mode as well, so I was wondering if you could explain your setup a bit further, as it seems you're not having any issues with dropped requests when using aws-load-balancer-controller and NTH? How do you achieve draining before the target/underlying instance is terminated?

DWSR added a commit to DWSR/karpenter that referenced this issue Sep 16, 2022
Currently, when Karpenter drains and then deletes a Node from the
cluster, if that node is registered in a Target Group for an ALB/NLB the
corresponding EC2 instance is not removed. This leads to the potential
for increased errors when deleting nodes via Karpenter.

In order to help resolve this issue, this change adds the well-known
`node.kubernetes.io/exclude-from-external-balancers` label, which will
cause the AWS LB controller to remove the node from the Target Group
while Karpenter is draining the node. This is similar to how the AWS
Node Termination Handler works (see
aws/aws-node-termination-handler#316).

In the future, Karpenter might be enhanced to wait for a configurable
period before deleting the Node and terminating the associated instance,
as currently there is a race condition between the Pods being drained
off of the Node and the EC2 instance being removed from the target group.
DWSR added a commit to DWSR/karpenter that referenced this issue Sep 16, 2022
DWSR added a commit to DWSR/karpenter that referenced this issue Sep 17, 2022
DWSR added a commit to DWSR/karpenter that referenced this issue Sep 20, 2022
dewjam pushed a commit to aws/karpenter-provider-aws that referenced this issue Sep 20, 2022
@TaylorChristie

We found a pretty nice way to handle this with Graceful Node Shutdown and preStop hooks on DaemonSets. Essentially you set the kubelet parameters (in our case we use Karpenter, so we specified userData in the EC2NodeClass) as follows:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  userData: |
    #!/bin/bash -xe
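    # Enable kubelet Graceful Node Shutdown: 400s total shutdown grace period,
    # of which the last 100s is reserved for critical pods (e.g. DaemonSets).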
    echo "$(jq '.shutdownGracePeriod="400s"' /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
    echo "$(jq '.shutdownGracePeriodCriticalPods="100s"' /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json

and then deploy a DaemonSet on all Karpenter nodes with a high terminationGracePeriodSeconds and a preStop hook:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: karpenter-termination-waiter
  namespace: kube-system
  labels:
    k8s-app: karpenter-termination-waiter
spec:
  selector:
    matchLabels:
      name: karpenter-termination-waiter
  template:
    metadata:
      labels:
        name: karpenter-termination-waiter
    spec:
      nodeSelector:
        karpenter.sh/registered: "true"
      containers:
        - name: alpine
          image: alpine:latest
          command: ["sleep", "infinity"]
          # wait for the node to be completely deregistered from the load balancer
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "300"]
          resources:
            limits:
              cpu: 5m
              memory: 10Mi
            requests:
              cpu: 2m
              memory: 5Mi
      priorityClassName: high-priority
      terminationGracePeriodSeconds: 300

The node is still running aws-node and kube-proxy behind the scenes, so it can properly direct requests from the load balancer until it's completely drained. It's important that the grace period and the preStop sleep are longer than the deregistration delay on the ALB, so the node isn't terminated before being fully drained.
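
For reference, that deregistration delay is a target group attribute; something like this (the target group ARN is a placeholder) shows or adjusts it so the preStop sleep and terminationGracePeriodSeconds above can be sized to exceed it:

# Check the current deregistration delay on the target group.
aws elbv2 describe-target-group-attributes \
  --target-group-arn "$TARGET_GROUP_ARN" \
  --query 'Attributes[?Key==`deregistration_delay.timeout_seconds`]'

# Optionally lower it so the 300s preStop sleep comfortably covers it.
aws elbv2 modify-target-group-attributes \
  --target-group-arn "$TARGET_GROUP_ARN" \
  --attributes Key=deregistration_delay.timeout_seconds,Value=120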

@deepakdeore2004

@TaylorChristie a similar issue with Karpenter is being discussed at aws/karpenter-provider-aws#4673.

In your workaround, Karpenter removes the node from the LB during draining, then all the pods get deleted, but the karpenter-termination-waiter DaemonSet keeps waiting for its preStop hook to complete, which indirectly keeps the worker node around for some time after it has been removed from the LB?

We are waiting for an out-of-the-box solution from Karpenter, but your workaround makes sense to try and use until there is one.

@TaylorChristie

@TaylorChristie a similar issue with Karpenter is being discussed at aws/karpenter-provider-aws#4673.

In your workaround, Karpenter removes the node from the LB during draining, then all the pods get deleted, but the karpenter-termination-waiter DaemonSet keeps waiting for its preStop hook to complete, which indirectly keeps the worker node around for some time after it has been removed from the LB?

We are waiting for an out-of-the-box solution from Karpenter, but your workaround makes sense to try and use until there is one.

Yep, because of the shutdownGracePeriod set in the kubelet, it won't drain DaemonSets like kube-proxy or aws-node (since they are higher priority), so the nodes can still properly forward NodePort traffic to other endpoints. I agree a native Karpenter solution would be much better, but in our testing this eliminates the LB 5XX issues we were experiencing.

@oridool

oridool commented Mar 20, 2024

Interested to hear from contributors here whether the solution in #582, which adds the label node.kubernetes.io/exclude-from-external-load-balancers to nodes undergoing cordon-and-drain operations, is sufficient for your needs here.
Does that solve your problem, or do we need to do additional work to support your use cases?

We updated Cluster Autoscaler to add the node.kubernetes.io/exclude-from-external-load-balancers label to the nodes, which removes the nodes from all LBs.

In addition, we have an ASG lifecycle hook that waits 300 seconds before terminating the node, and the ELB has 300 seconds of connection draining; this way we avoid 5xx issues.

@infa-ddeore, is there any official PR/fix to the CA for adding the node.kubernetes.io/exclude-from-external-load-balancers label?
And another question for my understanding: assuming this label is added, what makes the ALB remove the node and stop sending it requests? Do we need the aws-load-balancer-controller for that?
Currently, we occasionally experience 502 errors when the CA scales in a node.
Thanks.

@deepakdeore2004

Interested to hear from contributors here whether the solution in #582, which adds the label node.kubernetes.io/exclude-from-external-load-balancers to nodes undergoing cordon-and-drain operations, is sufficient for your needs here.
Does that solve your problem, or do we need to do additional work to support your use cases?

We updated Cluster Autoscaler to add the node.kubernetes.io/exclude-from-external-load-balancers label to the nodes, which removes the nodes from all LBs.
In addition, we have an ASG lifecycle hook that waits 300 seconds before terminating the node, and the ELB has 300 seconds of connection draining; this way we avoid 5xx issues.

@infa-ddeore, is there any official PR/fix to the CA for adding the node.kubernetes.io/exclude-from-external-load-balancers label? And another question for my understanding: assuming this label is added, what makes the ALB remove the node and stop sending it requests? Do we need the aws-load-balancer-controller for that? Currently, we occasionally experience 502 errors when the CA scales in a node. Thanks.

There isn't an official PR for this; our devs made these changes and provided us with a custom Cluster Autoscaler image.
During the node draining process the label is added, and the EKS control plane removes that node from all associated ELBs, since we use the in-tree controller.

I haven't tested this for ALB or with the aws-load-balancer-controller, but I believe the ALB controller must also honor the label; you can try adding the label manually to see whether the node gets removed from the ALB's target group (see the sketch below).
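
Something like this, as a manual check (the node name and target group ARN are placeholders):

# Label the node by hand...
kubectl label node ip-10-0-1-23.ec2.internal \
  node.kubernetes.io/exclude-from-external-load-balancers=true

# ...then watch whether the instance moves to "draining" and drops out
# of the ALB target group.
aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN"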

@oridool

oridool commented Mar 21, 2024

Interested to hear from contributors here whether the solution in #582, which adds the label node.kubernetes.io/exclude-from-external-load-balancers to nodes undergoing cordon-and-drain operations, is sufficient for your needs here.
Does that solve your problem, or do we need to do additional work to support your use cases?

We updated Cluster Autoscaler to add the node.kubernetes.io/exclude-from-external-load-balancers label to the nodes, which removes the nodes from all LBs.
In addition, we have an ASG lifecycle hook that waits 300 seconds before terminating the node, and the ELB has 300 seconds of connection draining; this way we avoid 5xx issues.

@infa-ddeore, is there any official PR/fix to the CA for adding the node.kubernetes.io/exclude-from-external-load-balancers label? And another question for my understanding: assuming this label is added, what makes the ALB remove the node and stop sending it requests? Do we need the aws-load-balancer-controller for that? Currently, we occasionally experience 502 errors when the CA scales in a node. Thanks.

There isn't an official PR for this; our devs made these changes and provided us with a custom Cluster Autoscaler image. During the node draining process the label is added, and the EKS control plane removes that node from all associated ELBs, since we use the in-tree controller.

I haven't tested this for ALB or with the aws-load-balancer-controller, but I believe the ALB controller must also honor the label; you can try adding the label manually to see whether the node gets removed from the ALB's target group.

@infa-ddeore I checked, and indeed the ALB removes the node from the target group when I set the node.kubernetes.io/exclude-from-external-load-balancers label.
Any chance you (or your developers) can publish a PR for that? I think a lot of people would need it.

@oridool

oridool commented Apr 10, 2024

Hi @deepakdeore2004 and all, I'm writing up my findings here after I was able to resolve the issue without any code changes.
It might be helpful to other people.
What you need to do (among other things) is add a 60s AutoScaler delay after taint by setting
--node-delete-delay-after-taint=60s
You can read more about it here

Explanation:
When the AutoScaler concludes that a node needs to be drained and eventually removed from Kubernetes, it sets a special taint on the node with the key "ToBeDeletedByClusterAutoscaler". The aws-load-balancer-controller recognizes this and asks the ALB to remove the node by calling the DeregisterTargets API, causing the ALB to drain connections to this node (more about the ALB draining process here). The default ALB draining time is 300s.
5 seconds after that (the default AutoScaler delay), the AutoScaler calls the TerminateInstanceInAutoScalingGroup API, causing the ASG to terminate the node by calling the TerminateInstances API.
The ALB is not aware that the node is about to be terminated by the ASG.
Even though the ALB sends no new requests to the target while it is draining, requests that are still in flight might last more than 5s. When the node is terminated, those requests end with 502 errors because the connection is interrupted.
To avoid this interruption, the ALB needs some delay so that in-flight requests can finish before the node is terminated. This is achieved by setting the delay with the node-delete-delay-after-taint parameter: Cluster AutoScaler waits 60s before it tells the ASG to terminate the node.

To summarize, the parameter effectively introduces a delay between the DeregisterTargets call and the TerminateInstances call, letting the ALB gracefully drain the connections.

[diagram: with-node-delete-delay-after-taint-mode]
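
For anyone applying this who deploys Cluster Autoscaler with the community Helm chart, the flag can be passed via extraArgs; a minimal sketch (cluster name and region are placeholders, and I'm assuming the chart from kubernetes.github.io/autoscaler):

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-east-1 \
  --set "extraArgs.node-delete-delay-after-taint=60s"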

@deepakdeore2004

Hi @deepakdeore2004 and all, I'm writing up my findings here after I was able to resolve the issue without any code changes. It might be helpful to other people. What you need to do (among other things) is add a 60s AutoScaler delay after taint by setting --node-delete-delay-after-taint=60s. You can read more about it here

Explanation: When the AutoScaler concludes that a node needs to be drained and eventually removed from Kubernetes, it sets a special taint on the node with the key "ToBeDeletedByClusterAutoscaler". The aws-load-balancer-controller recognizes this and asks the ALB to remove the node by calling the DeregisterTargets API, causing the ALB to drain connections to this node. The default ALB draining time is 300s. 5 seconds after that (the default AutoScaler delay), the AutoScaler calls the TerminateInstanceInAutoScalingGroup API, causing the ASG to terminate the node by calling the TerminateInstances API. The ALB is not aware that the node is about to be terminated by the ASG. Even though the ALB sends no new requests to the target while it is draining, requests that are still in flight might last more than 5s. When the node is terminated, those requests end with 502 errors because the connection is interrupted. To avoid this interruption, the ALB needs some delay so that in-flight requests can finish before the node is terminated. This is achieved by setting the delay with the node-delete-delay-after-taint parameter: Cluster AutoScaler waits 60s before it tells the ASG to terminate the node.

To summarize, the parameter effectively introduces a delay between the DeregisterTargets call and the TerminateInstances call, letting the ALB gracefully drain the connections.

Thanks for the details @oridool. I see the AWS LB controller understands ToBeDeletedByClusterAutoscaler and removes the node from the LB when the taint is added, and the --node-delete-delay-after-taint option keeps the node alive for the specified duration. This is a perfect solution for that setup.

However, we use the in-tree controller, which doesn't understand this taint, so the Cluster Autoscaler customization is still needed on our side.
