Add custom delay for instance refresh actions #465

Open
stevehipwell opened this issue Jul 8, 2021 · 5 comments
Labels: Priority: Medium (This issue will be seen by about half of users) · stalebot-ignore (To NOT let the stalebot update or close the Issue / PR) · Type: Enhancement (New feature or request)

Comments
@stevehipwell
Contributor

When using instance refresh to update ASGs, the events appear to come through with a start time of now, which triggers the node-termination handler to start cordoning and draining the node immediately. This works correctly if the ASG healthy percentage is set to 100% and all pods have replicas and PDBs (for NTH itself we need #463 to satisfy this); but single-replica pods such as Prometheus will often be unschedulable for a short period while the new node boots up.

To make this whole process function without any downtime, a custom duration to wait on ASG termination events could be adopted, defaulted to something like 90 seconds. Assuming this wait time was longer than the time it takes to start a node and join it to the cluster, there would be no unschedulable pods, and a non-100% ASG healthy percentage could be used. Combined with the ASG lifecycle hook timeout, this would support a high level of customisation without much extra complexity.
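As a rough sketch of the idea (entirely hypothetical; NTH has no such option today, and the names below are made up for illustration):

```go
// Hypothetical sketch only: delay cordon/drain for instance-refresh
// termination events by a configurable duration (defaulting to ~90s) so
// the replacement node has time to boot and join the cluster first.
package sketch

import (
	"log"
	"time"
)

// DefaultInstanceRefreshDelay is the default suggested in this issue.
const DefaultInstanceRefreshDelay = 90 * time.Second

// HandleInstanceRefreshEvent waits out the delay, then hands off to the
// existing cordon-and-drain path (passed in as a callback here).
func HandleInstanceRefreshEvent(nodeName string, delay time.Duration, cordonAndDrain func(string) error) error {
	log.Printf("instance refresh event for node %s; waiting %s before cordon/drain", nodeName, delay)
	time.Sleep(delay) // assumes the replacement instance becomes Ready within this window
	return cordonAndDrain(nodeName)
}
```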

@bwagner5
Contributor

Sorry for the delay in responding to this! I see the PDB field in the Helm chart has been released, so hopefully that helps a bit. Could a pod grace period / node grace period be used to make the pod wait until the new node is up (catching SIGTERM in the container and waiting a bit)?
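A minimal sketch of that workaround, assuming the workload can tolerate delaying its shutdown (the 60-second wait is illustrative and must stay below the pod's terminationGracePeriodSeconds):

```go
// The container traps SIGTERM and keeps running for a grace window,
// giving the replacement node time to come up before the pod exits.
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	<-sigs // the kubelet sends SIGTERM when the pod is evicted during a drain
	log.Println("SIGTERM received; waiting before shutting down")
	time.Sleep(60 * time.Second) // illustrative; must be < terminationGracePeriodSeconds
	// flush state / finish in-flight work here, then exit cleanly
}
```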

@bwagner5 added the Type: Enhancement (New feature or request) label Aug 16, 2021
@stevehipwell
Contributor Author

@bwagner5 AFAIK with instance refresh there is no need to complete the whole process in 120s as there is for spot termination. Since this is the mode in which all nodes in an ASG are most likely to be replaced, it would be really useful to block NTH from draining the node until the new nodes are ready, or, if that's not possible, until a predefined time has elapsed.
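For illustration, the "wait until new nodes are ready, with a timeout fallback" behaviour could look roughly like this with client-go (all names hypothetical; this is not NTH code):

```go
// Poll until a node created after the refresh event reports Ready, or
// give up after a predefined timeout and drain anyway.
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// WaitForReplacementNode returns true once a Ready node newer than `since`
// appears, or false when the timeout (the predefined fallback) elapses.
func WaitForReplacementNode(ctx context.Context, client kubernetes.Interface, since time.Time, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err == nil {
			for i := range nodes.Items {
				n := &nodes.Items[i]
				if n.CreationTimestamp.Time.After(since) && isReady(n) {
					return true
				}
			}
		}
		time.Sleep(10 * time.Second)
	}
	return false // timed out; the caller cordons and drains anyway
}

func isReady(n *corev1.Node) bool {
	for _, c := range n.Status.Conditions {
		if c.Type == corev1.NodeReady && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```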

@stevehipwell
Contributor Author

@bwagner5 was there any progress made on this?

@bwagner5
Contributor

There has not been progress, but I did have a conversation about something similar earlier today, about how to handle capacity-rebalance events for spot on ASG. The TL;DR is that ASG will launch a replacement instance before sending the Terminate lifecycle hook, which then triggers NTH to cordon and drain the node being replaced.

It would be nice if instance refresh had that launch-before-terminate functionality. In the case where you specify 100% healthy for an instance refresh, does that cause ASG to provision a new node above the desired capacity (if your max is set higher)? That seems like it would be the best way to handle this in general, or maybe I'm missing some issue there too.
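For reference, the healthy percentage (and the instance warmup that decides when a new instance counts as healthy) is set when starting the refresh, e.g. with the AWS SDK for Go v1; whether ASG will go above desired capacity at 100% is exactly the open question here ("my-asg" and the values are illustrative):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	out, err := svc.StartInstanceRefresh(&autoscaling.StartInstanceRefreshInput{
		AutoScalingGroupName: aws.String("my-asg"),
		Preferences: &autoscaling.RefreshPreferences{
			MinHealthyPercentage: aws.Int64(100), // keep every node in service during the refresh
			InstanceWarmup:       aws.Int64(300), // seconds before a new instance counts as healthy
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("instance refresh started:", aws.StringValue(out.InstanceRefreshId))
}
```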

Delaying NTH from draining a node certainly sounds doable. The only hesitation I have is the complexity that has grown in NTH from adding a bunch of these small knobs. If we absolutely need it, then sure, we can add it; I just want to make sure all avenues have been explored first.

@stevehipwell
Contributor Author

@bwagner5 for capacity rebalance, does the ASG wait for the instance to be ready, or just pending/initialising? For instance refresh with 100% healthy, if the ASG starts a new instance it isn't waiting for that instance to be ready.

The specific use case I have is that my clusters get updated via a single process, and when there are service and node changes I get failures due to nodes being cordoned. We have a system node pool made up of 3 (actually 6, but let's ignore that for now) ASGs in separate regions, so on an upgrade NTH would get 3 refresh events. Only 1 of these events would be actionable due to PDBs, but I think all nodes would be cordoned.

Once the hostPort change has been released I can re-test; due to this issue my clusters have needed to scale up the system node pool whenever NTH or the LB controller is on a refreshing node, which might be the cause of my issues. FYI, I had to uninstall the LB controller chart to remove the host ports automatically created by accidentally using host networking; the NTH chart might need uninstalling for this fix too, but I hope not, given the different way they're set.

@jillmon added the stalebot-ignore (To NOT let the stalebot update or close the Issue / PR) label Oct 19, 2021
@jillmon added the Priority: Medium (This issue will be seen by about half of users) label Nov 17, 2021