NTH in Queue Processor mode isn't able to respond to Instance State Change Events correctly #874

CGPrakashW · 2023-08-04T12:13:52Z

Describe the bug
NTH in Queue Processor mode isn't able to respond to Instance State Change Events correctly.

After following the setup described for Queue Processor mode and creating EventBridge rules for aws.ec2 - EC2 Instance State-change Notification and aws.autoscaling - EC2 Instance-terminate Lifecycle Action, my understanding is that the following scenarios should happen:

1. When a node is terminated via the ec2 console, the EventBridge rule should fire and add a message to the sqs queue, triggering the node being drained before being terminated.
2. When a node is terminated via a scale-in event (via asg autoscaling or the aws autoscaling terminate-instance-in-auto-scaling-group cli command), the EventBridge rule should fire and add a message to the sqs queue, triggering the node being drained before being terminated.
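
For reference, the two rules above can be sketched as EventBridge event patterns. This is only an illustration in Python: the patterns use the `source` and `detail-type` values named in this issue, while the `matches` helper is hypothetical and mimics nothing more than exact-match filtering on those two fields.

```python
# Event patterns for the two EventBridge rules named in this issue.
STATE_CHANGE_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
}
LIFECYCLE_PATTERN = {
    "source": ["aws.autoscaling"],
    "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
}


def matches(pattern, event):
    """Hypothetical helper: True if each pattern key lists the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())
```

Both rules deliver to the same sqs queue, so a message arriving there says nothing by itself about whether the instance is being held back from termination.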

In practice, only scenario 2 works as expected. With scenario 1, I'm seeing that the drain is scheduled and begins, but the node can (and generally does) terminate before it has been drained successfully.

With scenario 2, my understanding is that because the terminate action is triggered via the autoscaling api, the lifecycle hook kicks in which prevents the node from terminating immediately without either reaching a timeout or further input from nth. This grace period (assuming pods are able to evict within the period) means that the node is only terminated once the node has been drained.
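
To make that flow concrete, here is a rough sketch of "drain first, then release the hook". It is not NTH's actual code (NTH is written in Go). It assumes `asg_client` behaves like a boto3 `autoscaling` client, that `detail` is the `detail` object of an EC2 Instance-terminate Lifecycle Action event, and that `drain` is a hypothetical callable that cordons and drains the node for a given instance id.

```python
def drain_then_complete(asg_client, detail, drain):
    """Drain the node, then tell the ASG it may proceed with termination.

    asg_client: assumed to expose complete_lifecycle_action like boto3's
                autoscaling client.
    detail:     the "detail" object of the lifecycle action event.
    drain:      hypothetical callable taking an instance id.
    """
    instance_id = detail["EC2InstanceId"]

    # The lifecycle hook keeps the instance alive while this runs
    # (until the hook's timeout expires).
    drained = drain(instance_id)

    # Only now is the ASG told it may continue, after which EC2 terminates.
    asg_client.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return drained
```

There is no equivalent "release" call in the ec2 console case, which is why the drain there races against the termination.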

However, as the lifecycle hook is only relevant to, and triggered by, termination requests made via the autoscaling api, the same behaviour is not seen when terminating through the ec2 console. When terminating through the ec2 console, nodes are frequently terminated before they have been successfully/fully drained.

Steps to reproduce
Deploy and configure NTH in queue processor mode with EventBridge rules to monitor autoscaling events and instance state change events.

1. Terminate a node through the ec2 console and monitor nodes and pods being terminated.
2. Scale down a node via autoscaling (or aws cli) and monitor nodes and pods being terminated.

Expected outcome
In both scenarios, the node should finish draining before it is terminated.

Environment

NTH App Version: 1.20.0
NTH Mode (IMDS/Queue processor): Queue processor
OS/Arch: bottlerocket
Kubernetes version: 1.24
Installation method: helm


CGPrakashW · 2023-08-04T12:20:14Z

Sounds like I'm seeing similar behaviour described here: #354

cjerad · 2023-08-23T14:17:56Z

Hi @CGPrakashW

The differences you are seeing are due to the different actions taken by AWS EC2 in each case.

- When using the aws autoscaling terminate-instance-in-auto-scaling-group command, the ASG first sends the lifecycle action notification, then waits until it has been completed or times out. This allows time for NTH to receive the notification via SQS, cordon and drain the node, and then complete the lifecycle action. Once the ASG receives the completion, it instructs EC2 to terminate the instance.
- When using the EC2 Console, a state-change notification is sent and the instance termination is started, i.e. EC2 does not wait for a "continue" signal before beginning to terminate the instance. As described in #354 ("How to handle EC2 termination via console?"), the instance termination triggers systemd to stop all running applications on the EC2 instance, e.g. the kubernetes kubelet.
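
The two cases can be told apart from the event that lands in the SQS queue. The following is a minimal sketch, not NTH's implementation: `describe_event` and the `termination_waits` key are hypothetical names, and the `detail` fields read here (`EC2InstanceId`, `instance-id`, `state`) are assumed to follow the usual shape of these two EventBridge events.

```python
LIFECYCLE_ACTION = "EC2 Instance-terminate Lifecycle Action"
STATE_CHANGE = "EC2 Instance State-change Notification"


def describe_event(event):
    """Summarise an event and whether termination waits for a drain.

    Returns None for any other event type.
    """
    detail = event.get("detail", {})
    kind = event.get("detail-type")

    if kind == LIFECYCLE_ACTION:
        # The ASG holds the instance until the lifecycle action is
        # completed or its timeout expires.
        return {
            "instance_id": detail.get("EC2InstanceId"),
            "termination_waits": True,
        }

    if kind == STATE_CHANGE:
        # EC2 has already started the state change; nothing is waiting.
        return {
            "instance_id": detail.get("instance-id"),
            "state": detail.get("state"),
            "termination_waits": False,
        }

    return None
```

With a state-change event, any drain is best effort: it runs while the instance is already shutting down.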

CGPrakashW · 2023-08-23T14:32:46Z

Hi, thanks for the response!

Can I suggest that the documentation is updated to make this clearer? The IMDS vs Queue Processor matrix in the readme suggests that ASG Termination Lifecycle Hooks and Instance State Change Events are handled equally well by the NTH implementation described. Without a mechanism in place to deal with the different actions you've described, I'd argue that these events are not handled equally.

LikithaVemulapalli · 2024-05-23T18:27:05Z

Updated the README documentation, thanks for the issue :)

cjerad added the docs and stalebot-ignore labels Aug 23, 2023

LikithaVemulapalli mentioned this issue May 23, 2024

Modified ReadMe Documentation #1014

Merged

LikithaVemulapalli closed this as completed May 23, 2024
