
AutoSpotting should bring replacement instance online before detaching or terminating spot instance #401

Open
gabegorelick opened this issue Jan 29, 2020 · 5 comments

Comments

@gabegorelick
Contributor

GitHub issue

Issue type

Feature idea

Build number

master

Summary

By default, when AutoSpotting gets a CloudWatch Event signaling that a spot instance is due to be terminated soon (a "2-minute warning"), it detaches the instance. This ensures that the ASG will start bringing up a replacement instance before the spot instance stops. The spot instance stays running (but unattached to the ASG) so that it can still, hopefully, do useful work in the meantime.

However, detaching the instance before its replacement is online means diminished capacity in the ASG. For example, if the spot instance is attached to a load balancer, then that unattached instance is not serving traffic, so you can potentially have an extended period with fewer usable instances.

When an ASG has a lifecycle hook, AutoSpotting terminates the instance instead of detaching it. But this can lead to similar downtime. AutoSpotting assumes that the lifecycle hooks will block termination of the instance until new capacity is online. In practice, I think most lifecycle hooks simply drain work off the terminating instances. But if there isn't enough capacity to shift that work onto existing instances, then you'd suffer downtime. I guess the lifecycle hook could launch a new instance, but that seems like it would cause a lot of problems (e.g. ASG updates that launch a new instance and then terminate the old one wouldn't work).

All this can be extra dangerous if a significant portion of your spot instances get interrupted at the same time: AutoSpotting can terminate all your instances at once.

A similar issue happens when AutoSpotting detects the ratio of spot to on-demand instances is too high. It terminates a random spot instance and lets the ASG bring up a replacement afterwards.
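For concreteness, the default detach-on-warning behavior described in this summary might be sketched roughly as follows. This is a minimal illustration, not AutoSpotting's actual code (`handle_spot_interruption` and the client wiring are made up here), though `DetachInstances` with `ShouldDecrementDesiredCapacity=False` is the real Auto Scaling API call that leaves the ASG short of its desired capacity and thus makes it launch a replacement:

```python
def handle_spot_interruption(event, asg_name, asg_client):
    """Sketch of the detach-on-warning flow (illustrative, not real code).

    On a spot interruption warning, detach the instance WITHOUT
    decrementing desired capacity: the ASG then sees itself one
    instance below desired capacity and launches a replacement, while
    the detached (but still running) instance keeps doing useful work
    until EC2 reclaims it.
    """
    # CloudWatch Events delivers the 2-minute warning with this detail-type.
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return None

    instance_id = event["detail"]["instance-id"]

    # ShouldDecrementDesiredCapacity=False is the key parameter here.
    asg_client.detach_instances(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ShouldDecrementDesiredCapacity=False,
    )
    return instance_id
```

The gap this issue describes is exactly the window between this detach call and the moment the ASG's replacement instance actually comes into service.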

Steps to reproduce

  1. Have an ASG with a spot instance
  2. Somehow get spot instance interrupted
  3. Notice the ASG size is < desired capacity during the interval between the spot instance warning and the launch of the on-demand instance

Expected results

AutoSpotting launches on-demand instance before terminating or detaching spot instance.

Actual results

AutoSpotting terminates or detaches spot instance before replacement is online.

@gabegorelick
Contributor Author

I guess the lifecycle hook could launch a new instance, but that seems like it would cause a lot of problems (e.g. ASG updates that launch a new instance then terminate the old one wouldn't work).

It may be possible to get this to work. I'm curious if anyone has done that.

@cristim
Member

cristim commented Jan 30, 2020

Thanks for reporting this. I was actually contemplating attempting to launch a new spot instance, with fallback to launching an on-demand instance, even of a different (potentially more expensive) instance type, when handling the termination event. This would have to be the first one we attempt to replace with spot afterwards.

It should not be too hard to implement, but I would like to do this after merging the event-based replacement and porting the spot termination handling to the model used for handling the other events.

The benefit of this would be especially visible when handling ICE (insufficient capacity) events, which have been reported a few times in the past.

@gabegorelick
Contributor Author

I was actually contemplating attempting to launch a new spot instance, with fallback to launching an on-demand instance, even of a different (potentially more expensive) instance type, when handling the termination event.

Why would we do that instead of launching an on-demand instance and then letting a subsequent invocation of AutoSpotting switch it out for a spot instance? Wouldn't that make it more likely that no instance is brought online in time? Or can we fairly quickly determine that the spot request failed and then fall back to on-demand well within 2 minutes?

@cristim
Member

cristim commented Jan 30, 2020

Why would we do that instead of launching an on-demand instance and then letting a subsequent invocation of AutoSpotting switch it out for a spot instance?

Reducing the window of reduced capacity/downtime, and reducing churn.

Wouldn't that make it more likely that no instance is brought online in time? Or can we fairly quickly determine that the spot request failed and then fall back to on-demand well within 2 minutes?

The RunInstances API call that we use to launch spot instances fails fairly quickly with insufficient capacity; if I remember correctly it was a matter of seconds, so we have time to iterate over multiple instance types.

@gabegorelick
Contributor Author

The RunInstances API call that we use to launch spot instances fails fairly quickly with insufficient capacity; if I remember correctly it was a matter of seconds, so we have time to iterate over multiple instance types.

Awesome! If that's the case, then your approach makes sense.

Thanks.
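The fallback loop discussed above might be sketched roughly like this. This is only an illustration of the idea under discussion, not a committed design: `launch_with_fallback`, `InsufficientCapacityError`, and the `launch` callable are all hypothetical stand-ins (a real implementation would call RunInstances and catch its InsufficientInstanceCapacity error):

```python
class InsufficientCapacityError(Exception):
    """Stand-in for RunInstances failing with InsufficientInstanceCapacity."""


def launch_with_fallback(launch, instance_types):
    """Try spot capacity across several instance types, then fall back
    to on-demand.

    `launch(instance_type, spot=...)` is a hypothetical RunInstances
    wrapper that raises InsufficientCapacityError when capacity is
    unavailable. Since that error comes back within seconds, the whole
    loop should fit comfortably inside the 2-minute interruption warning.
    """
    for itype in instance_types:
        try:
            return launch(itype, spot=True)
        except InsufficientCapacityError:
            continue  # no spot capacity for this type, try the next one

    # No spot capacity for any type: launch on-demand (potentially more
    # expensive), and let a later AutoSpotting run replace it with spot.
    return launch(instance_types[0], spot=False)
```

The newly launched on-demand instance would then be the first candidate for replacement with spot on a subsequent run, as suggested above.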
