Node stuck in Pending:Wait and eventually hit timeout (ABANDON) #1007

Closed
Typositoire opened this issue May 16, 2024 · 4 comments

Comments

@Typositoire

Typositoire commented May 16, 2024

I'll be honest, this is a last-ditch attempt to understand why and how this is happening...

Here's the bug: we have one nodegroup that seems to fail to launch randomly, or rather succeeds in launching randomly. We still can't pinpoint why it affects this nodegroup specifically, since we haven't been able to replicate it so far.

A new node comes up and stays in Pending:Wait until the Heartbeat Timeout triggers ABANDON and the node gets recycled (Terminated). Version 1.22 fixed the Termination taking forever, but launch still seems to be bugged.

Here's the weird part: it eventually works! If I leave it alone, at some point it will hit an instance that works! We first thought it was timing-based, so we raised the timeout to 600s, but it just takes longer to fail launching now :p

Steps to reproduce
No idea... We tried to replicate it with other nodegroups and couldn't. We're still trying to figure out what's different about this nodegroup, but we haven't found anything.

Expected outcome
Node launches and Pending:Proceed event is sent after the node is ready.

Application Logs
This loops until the node becomes ready, then nothing is logged until the timeout.

2024/05/16 13:28:32 INF EC2 instance found, but not ready instanceID=i-079f4940a8cc693f4
2024/05/16 13:28:51 INF Adding new event to the event store event={"AutoScalingGroupName":"NAME","Description":"ASG Lifecycle Launch event received. Instance was started at 2024-05-16 13:24:10 +0000 UTC \n","EndTime":"0001-01-01T00:00:00Z","EventID":"asg-lifecycle-term-61366464663035322d306633362d393863332d326564662d653438656239306265636664","InProgress":false,"InstanceID":"i-079f4940a8cc693f4","IsManaged":true,"Kind":"ASG_LAUNCH_LIFECYCLE","Monitor":"SQS_MONITOR","NodeLabels":null,"NodeName":"ip-10-11-129-159.ec2.internal","NodeProcessed":false,"Pods":null,"ProviderID":"aws:///us-east-1b/i-079f4940a8cc693f4","StartTime":"2024-05-16T13:24:10Z","State":""}
2024/05/16 13:28:52 INF Requesting instance drain event-id=asg-lifecycle-term-61366464663035322d306633362d393863332d326564662d653438656239306265636664 instance-id=i-079f4940a8cc693f4 kind=ASG_LAUNCH_LIFECYCLE node-name=ip-10-11-129-159.ec2.internal provider-id=aws:///us-east-1b/i-079f4940a8cc693f4
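
To make the timing in these log lines easier to read, here is a minimal Python sketch that pulls the JSON payload out of an "Adding new event to the event store" line, measures how long after the instance start the event was stored, and decodes the hex suffix of the EventID. The sample line is trimmed to a few fields from the log above. The helper name `parse_event_line` and the assumption that the log timestamp is in UTC are illustrative, not taken from NTH.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the "Adding new event" line above (only a few JSON fields kept).
SAMPLE = (
    '2024/05/16 13:28:51 INF Adding new event to the event store event='
    '{"EventID":"asg-lifecycle-term-61366464663035322d306633362d393863332d326564662d653438656239306265636664",'
    '"InstanceID":"i-079f4940a8cc693f4","Kind":"ASG_LAUNCH_LIFECYCLE",'
    '"StartTime":"2024-05-16T13:24:10Z"}'
)


def parse_event_line(line):
    """Hypothetical helper: split one log line into its timestamp and event JSON."""
    prefix, _, payload = line.partition(" event=")
    # Assumption: the log timestamp is UTC, like the event's StartTime.
    logged_at = datetime.strptime(prefix[:19], "%Y/%m/%d %H:%M:%S").replace(tzinfo=timezone.utc)
    event = json.loads(payload)
    started_at = datetime.strptime(event["StartTime"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    # The EventID suffix is hex-encoded ASCII; decode it for readability.
    hex_suffix = event["EventID"].rsplit("-", 1)[1]
    return {
        "instance_id": event["InstanceID"],
        "kind": event["Kind"],
        "seconds_after_start": int((logged_at - started_at).total_seconds()),
        "decoded_id": bytes.fromhex(hex_suffix).decode("ascii"),
    }


info = parse_event_line(SAMPLE)
print(info)
```

For the line above, this reports 281 seconds between the instance start and the event being stored, and the EventID suffix decodes to a6ddf052-0f36-98c3-2edf-e48eb90becfd.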

Environment

  • NTH App Version: 1.22
  • NTH Mode (IMDS/Queue processor): Queue Processor
  • OS/Arch: Ubuntu 22.04
  • Kubernetes version: 1.23
  • Installation method: Helm Chart from https://aws.github.io/eks-charts
@Typositoire
Author

More details from this morning's testing:

The "Completed ASG Lifecycle Hook" log entry seems to show up only from time to time. These events are processed through a channel; is it possible that some events are dropped in this channel? I just had one that failed to complete on ec2_launch_hook but succeeded on ec2_term_hook after reaching ABANDON.
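
The question about dropped events can be illustrated in general terms. NTH is written in Go, and nothing in this thread confirms how its channel is configured, so the sketch below is only a Python analogue (using `queue.Queue` in place of a Go channel) of how a bounded buffer with non-blocking sends can lose events when the consumer falls behind. The names `offer` and `dropped` are hypothetical.

```python
import queue

events = queue.Queue(maxsize=2)  # bounded buffer, standing in for a buffered channel
dropped = []


def offer(event_id):
    """Non-blocking send: if the buffer is full, the event is lost unless handled."""
    try:
        events.put_nowait(event_id)
        return True
    except queue.Full:
        dropped.append(event_id)  # a real producer should log or retry here
        return False


# The producer outpaces the consumer: three events arrive before anything is read.
results = [offer(e) for e in ("launch-1", "launch-2", "launch-3")]
print(results, dropped)
```

If the producer blocks on send instead, nothing is dropped but the producer stalls. Which behavior, if either, applies to NTH would have to be checked in its source.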

@LikithaVemulapalli
Contributor

Hello @Typositoire, thanks for opening the issue. We need a few details: why were you not able to replicate it? Did you test it with a new node group, and were you able to successfully trigger the launch lifecycle hook and complete the lifecycle action when it fired? It will be difficult for us to investigate if the issue cannot be replicated. Could you please provide additional details so we can take a look at it sooner? Thanks :)

@Typositoire
Author

That's exactly it... All of our other nodegroups can successfully proceed and put their nodes InService. And even for this nodegroup it doesn't happen 100% of the time. The only thing I found is what I described in my previous comment: the event seems to never get processed by the PostDrain loop, as I never see the "Completed ASG Lifecycle Hook" log entry when the nodes are not put InService. For now we've disabled the Launch Hook.
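
One way to check this from the logs is to list instances that got a launch lifecycle event but never a matching completion entry. The sketch below is a minimal Python example under two assumptions: that the completion line contains the text `Completed ASG Lifecycle Hook` (as quoted above) and that it mentions the instance ID somewhere on the same line. The lines for the second instance are invented to show a healthy case and are not real NTH output.

```python
import re

# EC2 instance IDs: "i-" followed by 8 or 17 hex characters.
INSTANCE_RE = re.compile(r"\bi-[0-9a-f]{8,17}\b")


def unfinished_launches(lines):
    """Return instance IDs with a launch event but no completion log line."""
    launched, completed = set(), set()
    for line in lines:
        ids = INSTANCE_RE.findall(line)
        if "ASG_LAUNCH_LIFECYCLE" in line and "Adding new event" in line:
            launched.update(ids)
        elif "Completed ASG Lifecycle Hook" in line:
            completed.update(ids)
    return sorted(launched - completed)


sample = [
    '2024/05/16 13:28:51 INF Adding new event to the event store '
    'event={"InstanceID":"i-079f4940a8cc693f4","Kind":"ASG_LAUNCH_LIFECYCLE"}',
    # Invented lines for a second, healthy instance (the format is an assumption).
    '2024/05/16 13:30:00 INF Adding new event to the event store '
    'event={"InstanceID":"i-0123456789abcdef0","Kind":"ASG_LAUNCH_LIFECYCLE"}',
    '2024/05/16 13:31:00 INF Completed ASG Lifecycle Hook for instance i-0123456789abcdef0',
]
stuck = unfinished_launches(sample)
print(stuck)
```

With the sample lines, only i-079f4940a8cc693f4 is reported, since it has a launch event and no completion entry.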

We originally thought it was a timing issue, but raising the Timeout value to 10 or even 30 minutes just leaves the node Ready in K8S and Pending:Wait in AWS until the timeout is hit, at which point the ASG proceeds with ABANDON and terminates the node.
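
To confirm this mismatch from the AWS side, the ASG's view of an instance can be queried directly. The sketch below uses the boto3 Auto Scaling client's `describe_auto_scaling_instances` call. The helper name `stuck_in_pending_wait` is hypothetical, and the client is passed in so the function can be tried with a stub; running it against a real account assumes boto3 is installed and credentials are configured.

```python
def stuck_in_pending_wait(autoscaling_client, instance_ids):
    """Return the instance IDs whose ASG lifecycle state is still Pending:Wait."""
    resp = autoscaling_client.describe_auto_scaling_instances(
        InstanceIds=list(instance_ids)
    )
    return [
        inst["InstanceId"]
        for inst in resp.get("AutoScalingInstances", [])
        if inst.get("LifecycleState") == "Pending:Wait"
    ]


class StubClient:
    """Stand-in for boto3.client("autoscaling"), returning a canned response."""

    def describe_auto_scaling_instances(self, InstanceIds):
        return {"AutoScalingInstances": [
            {"InstanceId": "i-079f4940a8cc693f4", "LifecycleState": "Pending:Wait"},
            {"InstanceId": "i-0123456789abcdef0", "LifecycleState": "InService"},
        ]}


waiting = stuck_in_pending_wait(
    StubClient(), ["i-079f4940a8cc693f4", "i-0123456789abcdef0"]
)
print(waiting)

# Real usage (assumes boto3 and AWS credentials are available):
#   import boto3
#   stuck_in_pending_wait(boto3.client("autoscaling"), ["i-079f4940a8cc693f4"])
```

Comparing this list with the nodes that Kubernetes already reports as Ready would show which instances are waiting on a lifecycle completion that never arrived.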

I'll keep you posted if we ever hit this again; I just wanted you to know this exists. It bugs me that this doesn't always happen and that it never happens on other nodegroups.

Feel free to close this issue for now or add proper labels for future debugging.

@LikithaVemulapalli
Contributor

Thanks for the quick response. As per your comment, I will close this issue, since the behavior you are observing is not consistent. Please feel free to open another issue and link this one if you see it consistently in your node groups. Thank you.
