Comparison to EC2 Auto Scaling Capacity Rebalancing #455

gabegorelick · 2021-02-12T20:36:31Z

GitHub issue

Issue type

Documentation Report

Summary

As of #448, AutoSpotting responds to instance rebalance recommendation notifications. AWS also has a native solution, "EC2 Auto Scaling Capacity Rebalancing", which responds to these events in a similar manner to AutoSpotting. It would be nice to highlight the differences between the two solutions in AutoSpotting's docs.

How it works:

Amazon EC2 Auto Scaling is aware of EC2 instance rebalance recommendation notifications. The Amazon EC2 Spot service emits these notifications when Spot Instances are at elevated risk of interruption. When Capacity Rebalancing is enabled for an Auto Scaling group, Amazon EC2 Auto Scaling attempts to proactively replace Spot Instances in the group that have received a rebalance recommendation, providing the opportunity to rebalance your workload to new Spot Instances that are not at elevated risk of interruption.
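For reference, Capacity Rebalancing is a per-group setting. Here is a minimal Python sketch of turning it on, assuming a boto3-style `autoscaling` client is passed in; the function name `enable_capacity_rebalancing` and the group name in the usage comment are illustrative only.

```python
def enable_capacity_rebalancing(asg_client, asg_name):
    """Enable native Capacity Rebalancing on one Auto Scaling group.

    asg_client is assumed to behave like boto3.client("autoscaling").
    """
    asg_client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        CapacityRebalance=True,  # react to rebalance recommendations
    )


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   enable_capacity_rebalancing(boto3.client("autoscaling"), "my-asg")
```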

The main difference seems to be that AutoSpotting launches on-demand instances and then tries to replace them with spot instances, while Capacity Rebalancing seems to only attempt to launch spot instances. In theory, it's possible that AutoSpotting can do a better job at launching an on-demand instance than Capacity Rebalancing can do in finding a spot instance, but it seems like AWS's service should be pretty good at finding spare capacity (feel free to chime in if anyone has empirical data on this).

Are there any other differences between AutoSpotting and native autoscaling that should be documented?

cristim · 2021-02-12T22:25:41Z

Thanks for reporting this.

At the moment I don't have any time to look into this, but by all means please test it and report back your findings, preferably in a pull request that updates the documentation.

If you don't like the way AutoSpotting handles this, pull requests to change it for the better are always welcome 😁

cristim · 2022-02-17T19:55:06Z

> The main difference seems to be that AutoSpotting launches OnDemand instances and then tries to replace them with spot instances, while Capacity Rebalancing seems to only attempt to launch spot instances. In theory, it's possible that AutoSpotting can do a better job at launching an OnDemand instance than Capacity Rebalancing can do in finding a spot instance, but it seems like AWS's service should be pretty good at finding spare capacity (feel free to chime in if anyone has empirical data on this).

The OnDemand instances are not launched by AutoSpotting, but by the ASG itself. When the event arrives (regardless of whether it's a termination or a rebalancing event, as both are handled the same way), AutoSpotting will currently do one of the following:

1. Proactively detach the terminating Spot instance from the ASG and leave it running outside the ASG for up to 14 minutes (we have a 15-minute Lambda timeout), then terminate it if EC2 Spot hasn't already done so. Spot will terminate the instance after 2 minutes if it was a termination notification, but rebalancing events may not always result in terminations, which is why we terminate it ourselves.

   The ASG will then notice that it is running with reduced capacity and will attempt to launch an OnDemand instance to recover the desired capacity. Within seconds of launch, AutoSpotting replaces this new OnDemand instance with a new Spot instance and terminates it, so the new Spot instance boots up inside the ASG.

or...

2. Terminate the instance while it's still in the ASG, telling the ASG to replace it immediately with a new OnDemand instance, which AutoSpotting then replaces with a Spot instance in the same way as described at the end of option 1.
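The two paths map onto a handful of Auto Scaling and EC2 API calls. What follows is a minimal Python sketch of that sequence, not AutoSpotting's actual code; the function names, the polling loop, and the `is_running` callback are illustrative assumptions, and the clients passed in are expected to behave like boto3 `autoscaling` and `ec2` clients.

```python
import time

GRACE_SECONDS = 14 * 60  # stay under the 15-minute Lambda timeout


def detach_then_terminate(asg_client, ec2_client, asg_name, instance_id,
                          is_running, grace=GRACE_SECONDS, poll=15,
                          sleep=time.sleep):
    """Option 1: detach the Spot instance, let it run outside the ASG,
    then terminate it if EC2 Spot has not already done so."""
    # Desired capacity is left unchanged so the ASG launches a replacement.
    asg_client.detach_instances(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ShouldDecrementDesiredCapacity=False,
    )
    waited = 0
    # is_running is a hypothetical callback that reports instance state.
    while waited < grace and is_running(instance_id):
        sleep(poll)
        waited += poll
    if is_running(instance_id):
        ec2_client.terminate_instances(InstanceIds=[instance_id])


def terminate_in_asg(asg_client, instance_id):
    """Option 2: terminate inside the ASG so that termination lifecycle
    hooks fire and the ASG replaces the instance right away."""
    asg_client.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=False,
    )
```

In both cases `ShouldDecrementDesiredCapacity=False` is what makes the ASG launch a replacement, which is the OnDemand instance that AutoSpotting later swaps for Spot.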

The default behavior depends on whether the ASG has lifecycle hooks configured:

- if there are no termination lifecycle hooks configured, the instance will be detached and then terminated after the 14-minute timeout (option 1)
- otherwise AutoSpotting will terminate the instance within the ASG, so that the termination lifecycle hooks are triggered (option 2)

There is also a configuration flag that enforces either of the above behaviors regardless of whether the ASG has lifecycle hooks, as you can see in the CloudFormation stack parameters:

    TerminationNotificationAction:
      AllowedValues:
        - "auto"
        - "detach"
        - "terminate"
      Default: "auto"
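The way this flag combines with the lifecycle hook check can be sketched as a small Python function. This is an illustration of the rule described above rather than AutoSpotting's implementation; `resolve_action` and `has_termination_hooks` are hypothetical names.

```python
VALID_ACTIONS = ("auto", "detach", "terminate")


def resolve_action(setting, has_termination_hooks):
    """Map a TerminationNotificationAction value to "detach" or "terminate"."""
    if setting not in VALID_ACTIONS:
        raise ValueError(f"unsupported TerminationNotificationAction: {setting!r}")
    if setting != "auto":
        # Explicit override: lifecycle hooks are not consulted.
        return setting
    # "auto": terminate in the ASG only when hooks need to be triggered.
    return "terminate" if has_termination_hooks else "detach"
```

One way to compute `has_termination_hooks` would be to call the Auto Scaling `DescribeLifecycleHooks` API and look for hooks whose `LifecycleTransition` is `autoscaling:EC2_INSTANCE_TERMINATING`.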

> Are there any other differences between AutoSpotting and native autoscaling that should be documented?

Yes. With native Capacity Rebalancing, the ASG won't run any temporary OnDemand capacity. It first attempts to launch the replacement Spot instance, and only terminates the instance that received the rebalancing event after the new Spot instance is ready and passes the EC2/ELB health checks.

I've been working on a similar implementation in #475, but it's not ready yet. In addition, it will fall back to OnDemand capacity, with fallback across instance types, if we fail to launch Spot across all the suitable Spot instance types in the AZ.

I'm looking for people who can help me test/refine #475 to get it merged.

cristim added the Type: Question label Mar 6, 2023
