Add custom delay for instance refresh actions #465

Open
stevehipwell opened this issue Jul 8, 2021 · 5 comments
Labels: Priority: Medium (This issue will be seen by about half of users) · stalebot-ignore (To NOT let the stalebot update or close the Issue / PR) · Type: Enhancement (New feature or request)

Comments
@stevehipwell
Contributor

When using instance refresh to update ASGs, the events appear to come through with a start time of now, which triggers the node-termination handler to start cordoning and draining the node immediately. This works correctly if the ASG healthy percentage is set to 100% and all pods have replicas and PDBs (for NTH itself we need #463 to satisfy this); but single-replica pods such as Prometheus will often be unschedulable for a short period while the new node boots up.

To make this whole process function without any downtime, a custom duration to wait on ASG termination events could be adopted, defaulted to something like 90 seconds. Assuming this wait time was longer than the time it takes to start a node and join it to the cluster, there would be no unschedulable pods, and a non-100% ASG healthy percentage could be used. Combined with the ASG lifecycle hook timeout, this would support a high level of customisation without much extra complexity.
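As a rough sketch of the idea (entirely hypothetical; NTH has no such option today, and the names below are made up for illustration):

```go
// Hypothetical sketch only: delay cordon/drain for instance-refresh
// termination events by a configurable duration (defaulting to ~90s) so
// the replacement node has time to boot and join the cluster first.
package sketch

import (
	"log"
	"time"
)

// DefaultInstanceRefreshDelay is the default suggested in this issue.
const DefaultInstanceRefreshDelay = 90 * time.Second

// HandleInstanceRefreshEvent waits out the delay, then hands off to the
// existing cordon-and-drain path (passed in as a callback here).
func HandleInstanceRefreshEvent(nodeName string, delay time.Duration, cordonAndDrain func(string) error) error {
	log.Printf("instance refresh event for node %s; waiting %s before cordon/drain", nodeName, delay)
	time.Sleep(delay) // assumes the replacement instance becomes Ready within this window
	return cordonAndDrain(nodeName)
}
```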

@bwagner5
Contributor

Sorry for the delay in responding to this! I see the PDB field in the Helm chart has been released, so hopefully that helps a bit. Could a pod grace period / node grace period be used to make the pod wait until the new node is up (catching SIGTERM in the container and waiting a bit)?
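A minimal sketch of that workaround, assuming the workload can tolerate delaying its shutdown (the 60-second wait is illustrative and must stay below the pod's terminationGracePeriodSeconds):

```go
// The container traps SIGTERM and keeps running for a grace window,
// giving the replacement node time to come up before the pod exits.
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	<-sigs // the kubelet sends SIGTERM when the pod is evicted during a drain
	log.Println("SIGTERM received; waiting before shutting down")
	time.Sleep(60 * time.Second) // illustrative; must be < terminationGracePeriodSeconds
	// flush state / finish in-flight work here, then exit cleanly
}
```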

@bwagner5 added the Type: Enhancement (New feature or request) label Aug 16, 2021
@stevehipwell
Contributor Author

@bwagner5 AFAIK with instance refresh there is no need to complete the whole process in 120s as there is for spot termination. Since this is the mode in which all nodes in an ASG are most likely to be replaced, it would be really useful to block NTH from draining the node until the new nodes are ready, or, if that's not possible, until a predefined time has elapsed.
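For illustration, the "wait until new nodes are ready, with a timeout fallback" behaviour could look roughly like this with client-go (all names hypothetical; this is not NTH code):

```go
// Poll until a node created after the refresh event reports Ready, or
// give up after a predefined timeout and drain anyway.
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// WaitForReplacementNode returns true once a Ready node newer than `since`
// appears, or false when the timeout (the predefined fallback) elapses.
func WaitForReplacementNode(ctx context.Context, client kubernetes.Interface, since time.Time, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err == nil {
			for i := range nodes.Items {
				n := &nodes.Items[i]
				if n.CreationTimestamp.Time.After(since) && isReady(n) {
					return true
				}
			}
		}
		time.Sleep(10 * time.Second)
	}
	return false // timed out; the caller cordons and drains anyway
}

func isReady(n *corev1.Node) bool {
	for _, c := range n.Status.Conditions {
		if c.Type == corev1.NodeReady && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```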

@stevehipwell
Contributor Author

@bwagner5 was there any progress made on this?

@bwagner5
Contributor

There has not been progress, but I did have a conversation about something similar earlier today, about how to handle capacity-rebalance events for spot on ASG. The TL;DR is that ASG will launch a replacement instance before sending the Terminate lifecycle hook, which then triggers NTH to cordon and drain the node being replaced.

It would be nice if instance refresh had that launch-before-terminate functionality. In the case where you specify 100% healthy for an instance refresh, does that cause ASG to provision a new node above the desired capacity (if your max is set higher)? That seems like it would be the best way to handle this in general, or maybe I'm missing some issue there too.
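For reference, the healthy percentage (and the instance warmup that decides when a new instance counts as healthy) is set when starting the refresh, e.g. with the AWS SDK for Go v1; whether ASG will go above desired capacity at 100% is exactly the open question here ("my-asg" and the values are illustrative):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	out, err := svc.StartInstanceRefresh(&autoscaling.StartInstanceRefreshInput{
		AutoScalingGroupName: aws.String("my-asg"),
		Preferences: &autoscaling.RefreshPreferences{
			MinHealthyPercentage: aws.Int64(100), // keep every node in service during the refresh
			InstanceWarmup:       aws.Int64(300), // seconds before a new instance counts as healthy
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("instance refresh started:", aws.StringValue(out.InstanceRefreshId))
}
```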

Delaying NTH from draining a node certainly sounds doable. The only hesitation I have is the complexity that has grown in NTH from adding a bunch of these small knobs. If we absolutely need it, then sure, we can add it; I just want to make sure all avenues have been explored first.

@stevehipwell
Contributor Author

@bwagner5 for capacity rebalance, does the ASG wait for the instance to be ready, or just pending/initialising? For instance refresh with 100% healthy, if the ASG starts a new instance it isn't waiting for that instance to be ready.

The specific use case I have is that my clusters get updated via a single process, and when there are service and node changes I get failures due to nodes being cordoned. We have a system node pool made up of 3 (actually 6, but let's ignore that for now) ASGs in separate regions, so on an upgrade NTH would get 3 refresh events. Only 1 of these events would be actionable due to PDBs, but I think all nodes would be cordoned.

Once the hostPort change has been released I can re-test; due to this issue my clusters have needed to scale up the system node pool whenever NTH or the LB controller is on a refreshing node, which might be the cause of my issues. FYI, I had to uninstall the LB controller chart to remove the host ports automatically created by accidentally using host networking; the NTH chart might need uninstalling for this fix too, but I hope not, given the different way they're set.

@jillmon added the stalebot-ignore (To NOT let the stalebot update or close the Issue / PR) label Oct 19, 2021
@jillmon added the Priority: Medium (This issue will be seen by about half of users) label Nov 17, 2021