
Feat/replace on spot termination events #475

Draft · wants to merge 1 commit into base: master

Conversation

@cristim (Member) commented Dec 3, 2021

Issue Type

  • Bugfix Pull Request

Summary

Implement instance replacements as a result of Spot termination and rebalancing events.

Benefits:

  • faster replacement, reducing the time spent running with reduced capacity
  • reduced instance churn
  • makes AutoSpotting ICE-proof, by failing over to On-Demand instances as part of the instance replacement process in case no Spot capacity is available.

In addition, as a non-functional change I also switched the Docker image, Lambda and Fargate to ARM64, for faster builds on my M1 Mac and lower runtime costs for the users.
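For context, a minimal sketch of how a Lambda entry point could route the two EventBridge event types mentioned above (Spot interruption warnings and rebalance recommendations); the handler structure and the replaceInstance helper are illustrative assumptions, not the actual AutoSpotting code:

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

// ec2EventDetail holds the only field we need from the EC2 event payload.
type ec2EventDetail struct {
	InstanceID string `json:"instance-id"`
}

func handler(ctx context.Context, event events.CloudWatchEvent) error {
	switch event.DetailType {
	case "EC2 Spot Instance Interruption Warning",
		"EC2 Instance Rebalance Recommendation":
		var detail ec2EventDetail
		if err := json.Unmarshal(event.Detail, &detail); err != nil {
			return err
		}
		log.Printf("replacing instance %s ahead of interruption", detail.InstanceID)
		return replaceInstance(ctx, detail.InstanceID) // hypothetical helper
	default:
		// Other event sources (cron, autoscaling) keep their existing handling.
		return nil
	}
}

// replaceInstance is a placeholder: launch a new Spot instance (falling back
// to On-Demand on ICE) and swap it into the ASG for the terminating one.
func replaceInstance(ctx context.Context, instanceID string) error {
	return nil
}

func main() { lambda.Start(handler) }
```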

Code contribution checklist

  1. The contribution fixes a single existing GitHub issue, and it is linked
    to it.
  2. The code is as simple as possible, readable and follows the idiomatic Go
    guidelines.
  3. All new functionality is covered by automated test cases so the overall
    test coverage doesn't decrease.
  4. No issues are reported when running make full-test.
  5. Functionality not applicable to all users should be configurable.
  6. Configurations should be exposed through Lambda function environment
    variables which are also passed as parameters to the CloudFormation and
    Terraform stacks defined as infrastructure code.
  7. Global configurations set from the infrastructure stack level should also
    support per-group overrides using tags (see the sketch after this list).
  8. Tag names and expected values should be similar to the other existing
    configurations.
  9. Both global and tag-based configuration mechanisms should be tested and
    proven to work using log output from various test runs.
  10. The logs should be kept as clean as possible (use log levels as
    appropriate) and formatted consistently to the existing log output.
  11. The documentation is updated to cover the new behavior, as well as the
    new configuration options for both stack parameters and tag overrides.
  12. A code reviewer reproduced the problem and can confirm the code
    contribution actually resolves it.
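As a purely illustrative example for checklist items 6 and 7, a setting could be resolved first from a per-group tag and then from a Lambda environment variable; the names below are hypothetical, not AutoSpotting's actual identifiers:

```go
package config

import "os"

// asgTags maps tag keys to values for a single AutoScaling group.
type asgTags map[string]string

// boolSetting returns the per-group tag override if present, otherwise the
// global value from the Lambda environment variable, otherwise the default.
func boolSetting(tags asgTags, tagKey, envVar string, def bool) bool {
	if v, ok := tags[tagKey]; ok {
		return v == "true"
	}
	if v, ok := os.LookupEnv(envVar); ok {
		return v == "true"
	}
	return def
}
```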

@mello7tre (Contributor)

Hi @cristim, unfortunately I think we could have a concurrency problem.
Currently, if enabled, all replacement actions are triggered by messages in the SQS FIFO queue.
This way we ensure that actions for the same ASG are handled sequentially.

But rebalance/termination events can be triggered at any time, and if we directly execute swapWithGroupMember we have no guarantee that another Lambda, perhaps triggered by an autoscaling event, is not acting on the same ASG at the same time.

If, on the other hand, we send a message to the SQS queue to replace the Spot instance with a new Spot/On-Demand one, there could be other messages regarding the same ASG that still need to be processed.
If the queue is long, the message may only be processed after the 2-minute warning has elapsed, by which time the Spot instance has already been terminated, so the replacement will fail.

I think that unfortunately we need to go back and find another solution/workflow.
Concurrency problems can be very nasty, because they can permanently alter the ASG Max/Desired size, and that should be avoided.
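For readers following along: the per-ASG serialization described above is presumably achieved by using the ASG name as the FIFO MessageGroupId, roughly as in this sketch (queue URL, message body and helper name are illustrative, not AutoSpotting's actual ones):

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// enqueueReplacement queues a replacement request; messages sharing the same
// MessageGroupId (the ASG name) are delivered by SQS in order, one at a time,
// which is what serializes actions per ASG.
func enqueueReplacement(queueURL, asgName, instanceID string) error {
	svc := sqs.New(session.Must(session.NewSession()))
	_, err := svc.SendMessage(&sqs.SendMessageInput{
		QueueUrl:               aws.String(queueURL),
		MessageBody:            aws.String(instanceID),
		MessageGroupId:         aws.String(asgName),    // serializes per ASG
		MessageDeduplicationId: aws.String(instanceID), // required unless content-based dedup is enabled
	})
	return err
}
```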

@cristim (Member, Author) commented Dec 7, 2021

Yes, I was thinking about this as well, but because of the timing-critical nature of these events I think we can't use the queue, just like you said.

@cristim (Member, Author) commented Dec 13, 2021

@mello7tre I've been thinking more about this, and I'm considering changing the handling of the maximum ASG capacity instead of incrementing it gradually:

  • temporarily double the maximum, but only in case of repeated (say 3) attach errors with random/exponential back-off retries in between, saving the previous maximum in a tag set on the ASG
  • bring it back to the original value in the next cron run if the desired capacity is less than the previous maximum saved in the tag

What do you think about this approach?
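A rough sketch of how this could look, assuming aws-sdk-go v1; the tag key and function name are illustrative and the retry/back-off logic is omitted:

```go
package main

import (
	"strconv"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

const previousMaxTag = "autospotting-previous-max-size" // hypothetical tag key

// temporarilyRaiseMax remembers the current MaxSize in a tag on the ASG and
// then doubles it, so the next cron run can restore the saved value once the
// desired capacity has dropped back below it.
func temporarilyRaiseMax(svc autoscalingiface.AutoScalingAPI, asgName string, currentMax int64) error {
	_, err := svc.CreateOrUpdateTags(&autoscaling.CreateOrUpdateTagsInput{
		Tags: []*autoscaling.Tag{{
			ResourceId:        aws.String(asgName),
			ResourceType:      aws.String("auto-scaling-group"),
			Key:               aws.String(previousMaxTag),
			Value:             aws.String(strconv.FormatInt(currentMax, 10)),
			PropagateAtLaunch: aws.Bool(false),
		}},
	})
	if err != nil {
		return err
	}
	_, err = svc.UpdateAutoScalingGroup(&autoscaling.UpdateAutoScalingGroupInput{
		AutoScalingGroupName: aws.String(asgName),
		MaxSize:              aws.Int64(currentMax * 2),
	})
	return err
}
```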

@mello7tre (Contributor)

Hmm, honestly I don't like it that much.
With that approach the max capacity could stay doubled for up to more than 4 minutes (considering the default 5-minute schedule); during that time autoscaling might start more instances (which would be replaced by AutoSpotting), and then on the next cron run half of the running instances would be terminated.
And what should happen in the rare case of another Spot termination before the next cron run that triggers another attach error? Would it keep the original tag value?
Or maybe we manually change the max capacity during that interval, and the cron run would put back the wrong/old one.

But even setting aside those problems, all related to the Desired Capacity = Max Capacity state and the subsequent increase of the maximum, there are the ones related to two Lambdas that can act on the same ASG at the same time.
In particular a concurrent execution of the cron-scheduled Lambda; just think of the needReplaceOnDemandInstances checks.

If we take into account "only" the Max Capacity problem, maybe it would be better, for termination/rebalance events, to invert the attach/terminate/detach order.
What I mean is (sketched after this list):

  • Lambda is triggered by a termination/rebalance event
  • Lambda tries to launch another Spot instance, with fallback to On-Demand
  • Lambda detaches the terminating instance (decreasing the Desired Capacity).
  • Lambda attaches the new Spot/On-Demand instance, with fallback to re-attaching the terminating one (maybe only for rebalance; for Spot termination it does not make much sense)
  • Lambda terminates the terminating Spot instance, unless the previous fallback was used.

This way we have no need to change the ASG Max Size.
But some of the above problems related to a concurrent scheduled run may still persist (I need to think more deeply about this).
Maybe the launched Spot instance gets attached by a concurrent AutoSpotting scheduled run even though it was launched by the AutoSpotting event Lambda; what happens then?
The event Lambda will detach the On-Demand instance, but will get an error when trying to attach the Spot one, because it was already attached by the scheduled run, and the scheduled run will get an error when trying to execute asg.terminateInstanceInAutoScalingGroup because the instance will no longer be in the ASG.
So we should take all this into account, and maybe other problems too.
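To make the proposed detach-then-attach order concrete, here is a rough sketch under the same assumptions as above (aws-sdk-go v1); runReplacementInstance and terminateInstance are hypothetical helpers, and the re-attach fallback is only indicated in a comment:

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

func replaceWithoutRaisingMax(svc autoscalingiface.AutoScalingAPI, asgName, oldID string) error {
	// 1. Launch the replacement first (Spot, falling back to On-Demand).
	newID, err := runReplacementInstance(asgName) // hypothetical helper
	if err != nil {
		return err
	}

	// 2. Detach the terminating instance, decrementing DesiredCapacity so the
	//    ASG does not immediately launch its own replacement.
	if _, err := svc.DetachInstances(&autoscaling.DetachInstancesInput{
		AutoScalingGroupName:           aws.String(asgName),
		InstanceIds:                    []*string{aws.String(oldID)},
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	}); err != nil {
		return err
	}

	// 3. Attach the new instance; this raises DesiredCapacity back by one and
	//    never requires MaxSize to change.
	if _, err := svc.AttachInstances(&autoscaling.AttachInstancesInput{
		AutoScalingGroupName: aws.String(asgName),
		InstanceIds:          []*string{aws.String(newID)},
	}); err != nil {
		// Fallback discussed above: try to re-attach the old instance instead.
		return err
	}

	// 4. Finally terminate the old Spot instance outside the ASG.
	return terminateInstance(oldID) // hypothetical helper
}

func runReplacementInstance(asgName string) (string, error) { return "", nil }
func terminateInstance(id string) error                     { return nil }
```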
