Node stuck in Pending:Wait and eventually hit timeout (ABANDON) #1007
Comments
More details from this morning testing:
Hello @Typositoire, thanks for opening the issue. We need a few details: why were you not able to replicate it? Did you test it with a new node group, and were you able to launch the lifecycle hook and complete the lifecycle action when the launch hook triggered? It will be difficult for us to understand the issue if it cannot be replicated. Could you please provide additional details so we can take a look at it sooner? Thanks :)
That's exactly this... All of our other nodegroups can successfully proceed in putting the nodes to We originally thought it was a timing issue but raising the Timeout value to 10 or even 30 minutes just leaves the node Ready in K8S and I'll keep you posted if we ever hit this again, but just so you know, this exists. It bugs me that this doesn't always happen and that it never happens on other nodegroups. Feel free to close this issue for now or add proper labels for future debugging.
Thanks for the quick response. As per your comment, I will close this issue since the behavior you are observing is not consistent. Please feel free to open another issue and link this one if you see it consistently in your node groups. Thank you.
I'll be honest this is a last stretch to be able to understand why it's happening and how...
Here's the bug: we have one nodegroup which seems to fail to launch randomly, or more like succeed in launching randomly. We are still not able to pinpoint why it is this nodegroup specifically, since we can't replicate it so far.
A new node comes up, stays in Pending:Wait until the Heartbeat Timeout triggers Abandon, and the node gets recycled (Terminated). Version 1.22 fixed the Termination taking forever, but launch still seems to be bugged.
Here's the weird part: it eventually works! If I leave the instance be, at some point it will hit one that works! We first thought it was timing based, so we raised the timeout to 600s, but it just takes longer to fail launching now :p
Steps to reproduce
No idea... We tried to replicate with other nodegroups and we can't. We're still trying to figure out what's different with this nodegroup but we can't find anything.
Expected outcome
Node launches and Pending:Proceed event is sent after the node is ready.
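To make the expected versus observed behavior concrete, here is a rough model of a launch lifecycle hook, not the project's actual code. The function name, its parameters, and the inclusive deadline check are all assumptions made for illustration; the only facts taken from this issue are the Pending:Wait / Pending:Proceed states, the heartbeat timeout, and the ABANDON result.

```python
from typing import Optional, Tuple


def launch_hook_outcome(
    heartbeat_timeout_s: int,
    continue_sent_at_s: Optional[int],
) -> Tuple[str, int]:
    """Return (final result, seconds spent in Pending:Wait) for one instance.

    Hypothetical model: continue_sent_at_s is the time at which a CONTINUE
    completion reaches the Auto Scaling group, or None if it is never sent
    (the failure described in this issue).
    """
    # Expected path: the completion arrives before the heartbeat timeout,
    # so the instance moves on to Pending:Proceed.
    if continue_sent_at_s is not None and continue_sent_at_s <= heartbeat_timeout_s:
        return ("CONTINUE", continue_sent_at_s)
    # Observed path: nothing arrives in time, the timeout fires, and the
    # hook falls back to ABANDON, which terminates the instance.
    return ("ABANDON", heartbeat_timeout_s)


# Expected: node is ready after 90s and the completion is sent.
print(launch_hook_outcome(300, 90))
# Observed: the completion is never sent; a longer timeout only delays ABANDON.
print(launch_hook_outcome(300, None))
print(launch_hook_outcome(600, None))
```

In this model, raising the timeout from 300s to 600s changes nothing about the result when the completion is never sent, which matches what we saw: it just takes longer to fail.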
Application Logs
This loops until the node becomes ready, then nothing until the timeout.
Environment