
Feat/replace on spot termination events #475

Draft · wants to merge 1 commit into base: master

Conversation

@cristim (Member) commented Dec 3, 2021

Issue Type

  • Bugfix Pull Request

Summary

Implement instance replacements as a result of Spot termination and rebalancing events.

Benefits:

  • faster replacement, reducing the time spent running with reduced capacity
  • reduced instance churn
  • makes AutoSpotting ICE-proof, by failing over to On-Demand instances as part of the instance replacement process in case no Spot capacity is available.

In addition, as a non-functional change I also switched the Docker image, Lambda and Fargate to ARM64, for faster builds on my M1 Mac and lower runtime costs for the users.
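For context, a minimal sketch of how a Lambda entry point could route the two EventBridge event types mentioned above (Spot interruption warnings and rebalance recommendations); the handler structure and the replaceInstance helper are illustrative assumptions, not the actual AutoSpotting code:

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

// ec2EventDetail holds the only field we need from the EC2 event payload.
type ec2EventDetail struct {
	InstanceID string `json:"instance-id"`
}

func handler(ctx context.Context, event events.CloudWatchEvent) error {
	switch event.DetailType {
	case "EC2 Spot Instance Interruption Warning",
		"EC2 Instance Rebalance Recommendation":
		var detail ec2EventDetail
		if err := json.Unmarshal(event.Detail, &detail); err != nil {
			return err
		}
		log.Printf("replacing instance %s ahead of interruption", detail.InstanceID)
		return replaceInstance(ctx, detail.InstanceID) // hypothetical helper
	default:
		// Other event sources (cron, autoscaling) keep their existing handling.
		return nil
	}
}

// replaceInstance is a placeholder: launch a new Spot instance (falling back
// to On-Demand on ICE) and swap it into the ASG for the terminating one.
func replaceInstance(ctx context.Context, instanceID string) error {
	return nil
}

func main() { lambda.Start(handler) }
```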

Code contribution checklist

  1. The contribution fixes a single existing GitHub issue, and it is linked
    to it.
  2. The code is as simple as possible, readable and follows the idiomatic Go
    guidelines.
  3. All new functionality is covered by automated test cases so the overall
    test coverage doesn't decrease.
  4. No issues are reported when running make full-test.
  5. Functionality not applicable to all users should be configurable.
  6. Configurations should be exposed through Lambda function environment
    variables which are also passed as parameters to the CloudFormation and
    Terraform stacks defined as infrastructure code.
  7. Global configurations set from the infrastructure stack level should also
    support per-group overrides using tags (see the sketch after this list).
  8. Tag names and expected values should be similar to the other existing
    configurations.
  9. Both global and tag-based configuration mechanisms should be tested and
    proven to work using log output from various test runs.
  10. The logs should be kept as clean as possible (use log levels as
    appropriate) and formatted consistently to the existing log output.
  11. The documentation is updated to cover the new behavior, as well as the
    new configuration options for both stack parameters and tag overrides.
  12. A code reviewer reproduced the problem and can confirm the code
    contribution actually resolves it.
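As a purely illustrative example for checklist items 6 and 7, a setting could be resolved first from a per-group tag and then from a Lambda environment variable; the names below are hypothetical, not AutoSpotting's actual identifiers:

```go
package config

import "os"

// asgTags maps tag keys to values for a single AutoScaling group.
type asgTags map[string]string

// boolSetting returns the per-group tag override if present, otherwise the
// global value from the Lambda environment variable, otherwise the default.
func boolSetting(tags asgTags, tagKey, envVar string, def bool) bool {
	if v, ok := tags[tagKey]; ok {
		return v == "true"
	}
	if v, ok := os.LookupEnv(envVar); ok {
		return v == "true"
	}
	return def
}
```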

@mello7tre (Contributor)

Hi @cristim, unfortunately I think we could have a concurrency problem.
Currently, if enabled, all replacement actions are triggered by messages in the SQS FIFO queue.
This way we ensure that actions for the same ASG are handled sequentially.

But rebalance/termination events can be triggered at any time, and if we directly execute swapWithGroupMember we have no guarantee that another Lambda, perhaps triggered by an autoscaling event, is not acting on the same ASG at the same time.

If, on the other hand, we send a message to the SQS queue to replace the Spot instance with a new Spot/On-Demand one, there could be other messages regarding the same ASG that still need to be processed.
If the queue is long, the message may only be processed after the 2-minute warning has elapsed, by which time the Spot instance has already been terminated, so the replacement will fail.

I think that unfortunately we need to go back and find another solution/workflow.
Concurrency problems can be very nasty, because they can permanently alter the ASG Max/Desired size, and that should be avoided.
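For readers following along: the per-ASG serialization described above is presumably achieved by using the ASG name as the FIFO MessageGroupId, roughly as in this sketch (queue URL, message body and helper name are illustrative, not AutoSpotting's actual ones):

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// enqueueReplacement queues a replacement request; messages sharing the same
// MessageGroupId (the ASG name) are delivered by SQS in order, one at a time,
// which is what serializes actions per ASG.
func enqueueReplacement(queueURL, asgName, instanceID string) error {
	svc := sqs.New(session.Must(session.NewSession()))
	_, err := svc.SendMessage(&sqs.SendMessageInput{
		QueueUrl:               aws.String(queueURL),
		MessageBody:            aws.String(instanceID),
		MessageGroupId:         aws.String(asgName),    // serializes per ASG
		MessageDeduplicationId: aws.String(instanceID), // required unless content-based dedup is enabled
	})
	return err
}
```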

@cristim (Member, Author) commented Dec 7, 2021

Yes, I was thinking about this as well, but because of the timing-critical nature of these events I think we can't use the queue, just like you said.

@cristim (Member, Author) commented Dec 13, 2021

@mello7tre I've been thinking more about this, and I'm considering changing the handling of the maximum ASG capacity instead of incrementing it gradually:

  • temporarily double the maximum, but only in case of repeated (say 3) attach errors with random/exponential back-off retries in between, saving the previous maximum in a tag set on the ASG
  • bring it back to the original value in the next cron run if the desired capacity is less than the previous maximum saved in the tag

What do you think about this approach?
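A rough sketch of how this could look, assuming aws-sdk-go v1; the tag key and function name are illustrative and the retry/back-off logic is omitted:

```go
package main

import (
	"strconv"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

const previousMaxTag = "autospotting-previous-max-size" // hypothetical tag key

// temporarilyRaiseMax remembers the current MaxSize in a tag on the ASG and
// then doubles it, so the next cron run can restore the saved value once the
// desired capacity has dropped back below it.
func temporarilyRaiseMax(svc autoscalingiface.AutoScalingAPI, asgName string, currentMax int64) error {
	_, err := svc.CreateOrUpdateTags(&autoscaling.CreateOrUpdateTagsInput{
		Tags: []*autoscaling.Tag{{
			ResourceId:        aws.String(asgName),
			ResourceType:      aws.String("auto-scaling-group"),
			Key:               aws.String(previousMaxTag),
			Value:             aws.String(strconv.FormatInt(currentMax, 10)),
			PropagateAtLaunch: aws.Bool(false),
		}},
	})
	if err != nil {
		return err
	}
	_, err = svc.UpdateAutoScalingGroup(&autoscaling.UpdateAutoScalingGroupInput{
		AutoScalingGroupName: aws.String(asgName),
		MaxSize:              aws.Int64(currentMax * 2),
	})
	return err
}
```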

@mello7tre (Contributor)

Hmm, honestly I don't like it that much.
With that approach the max capacity could stay doubled for up to more than 4 minutes (considering the default 5-minute schedule); during that time autoscaling might start more instances (which would be replaced by AutoSpotting), and then on the next cron run half of the running instances would be terminated.
And what should happen in the rare case of another Spot termination before the next cron run that triggers another attach error? Would it keep the original tag value?
Or maybe we manually change the max capacity during that interval, and the cron run would put back the wrong/old one.

But even setting aside those problems, all related to the Desired Capacity = Max Capacity state and the subsequent increase of the maximum, there are the ones related to two Lambdas that can act on the same ASG at the same time.
In particular a concurrent execution of the cron-scheduled Lambda; just think of the needReplaceOnDemandInstances checks.

If we take into account "only" the Max Capacity problem, maybe it would be better, for termination/rebalance events, to invert the attach/terminate/detach order.
What I mean is (sketched after this list):

  • Lambda is triggered by a termination/rebalance event
  • Lambda tries to launch another Spot instance, with fallback to On-Demand
  • Lambda detaches the terminating instance (decreasing the Desired Capacity).
  • Lambda attaches the new Spot/On-Demand instance, with fallback to re-attaching the terminating one (maybe only for rebalance; for Spot termination it does not make much sense)
  • Lambda terminates the terminating Spot instance, unless the previous fallback was used.

This way we have no need to change the ASG Max Size.
But some of the above problems related to a concurrent scheduled run may still persist (I need to think more deeply about this).
Maybe the launched Spot instance gets attached by a concurrent AutoSpotting scheduled run even though it was launched by the AutoSpotting event Lambda; what happens then?
The event Lambda will detach the On-Demand instance, but will get an error when trying to attach the Spot one, because it was already attached by the scheduled run, and the scheduled run will get an error when trying to execute asg.terminateInstanceInAutoScalingGroup because the instance will no longer be in the ASG.
So we should take all this into account, and maybe other problems too.
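To make the proposed detach-then-attach order concrete, here is a rough sketch under the same assumptions as above (aws-sdk-go v1); runReplacementInstance and terminateInstance are hypothetical helpers, and the re-attach fallback is only indicated in a comment:

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

func replaceWithoutRaisingMax(svc autoscalingiface.AutoScalingAPI, asgName, oldID string) error {
	// 1. Launch the replacement first (Spot, falling back to On-Demand).
	newID, err := runReplacementInstance(asgName) // hypothetical helper
	if err != nil {
		return err
	}

	// 2. Detach the terminating instance, decrementing DesiredCapacity so the
	//    ASG does not immediately launch its own replacement.
	if _, err := svc.DetachInstances(&autoscaling.DetachInstancesInput{
		AutoScalingGroupName:           aws.String(asgName),
		InstanceIds:                    []*string{aws.String(oldID)},
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	}); err != nil {
		return err
	}

	// 3. Attach the new instance; this raises DesiredCapacity back by one and
	//    never requires MaxSize to change.
	if _, err := svc.AttachInstances(&autoscaling.AttachInstancesInput{
		AutoScalingGroupName: aws.String(asgName),
		InstanceIds:          []*string{aws.String(newID)},
	}); err != nil {
		// Fallback discussed above: try to re-attach the old instance instead.
		return err
	}

	// 4. Finally terminate the old Spot instance outside the ASG.
	return terminateInstance(oldID) // hypothetical helper
}

func runReplacementInstance(asgName string) (string, error) { return "", nil }
func terminateInstance(id string) error                     { return nil }
```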
