Deregister Runner Application when Spot Interruption signal is received #804

Open
dimisjim opened this issue Apr 29, 2021 · 9 comments
Labels
help wanted Extra attention is needed stale:exempt

Comments

@dimisjim
Contributor

dimisjim commented Apr 29, 2021

This would prevent issues like #84.

@npalm npalm added the help wanted Extra attention is needed label May 10, 2021
@github-actions
Contributor

github-actions bot commented Jun 2, 2022

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the Stale label Jun 2, 2022
@dimisjim
Contributor Author

dimisjim commented Jun 2, 2022

Any news regarding this?

@github-actions github-actions bot removed the Stale label Jun 3, 2022
@github-actions
Contributor

github-actions bot commented Jul 3, 2022

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the Stale label Jul 3, 2022
@dimisjim
Contributor Author

dimisjim commented Jul 3, 2022

bump

@ScottGuymer
Member

Hi

Do you have any thoughts on how you might see this working?

@dimisjim
Contributor Author

Hi @ScottGuymer,

Yeah, one idea would be to install a cron job on the runners that regularly checks for spot interruption notices (see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html). When a notice arrives, the runner would deregister itself from the pool of available runners, or a Lambda would be delegated to perform this action.

Ideally it would wait until the job is finished, if it is currently running one. Of course, it is difficult to predict when a given job will finish, so it would be enough to wait for a reasonable amount of time (ending before the termination timestamp given in the notice) so that the job has a better chance of finishing.

This would hopefully minimize the number of runners that get shut down while running a job.
What do you think?
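
For illustration, the polling idea could be sketched roughly like this in Python. It reads the `spot/instance-action` path of the instance metadata service (using an IMDSv2 session token) and works out how long the runner could still wait for a running job before it has to deregister. The function names and the 30-second safety margin are assumptions made for this sketch, not anything the module provides, and the actual deregistration call is left out.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 2.0) -> Optional[str]:
    """Return the raw spot/instance-action body, or None if no notice is pending."""
    # IMDSv2: request a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # 404 means no interruption is scheduled
            return None
        raise


def parse_termination_time(body: str) -> datetime:
    """Parse a notice such as {"action": "terminate", "time": "2017-09-18T08:22:00Z"}."""
    notice = json.loads(body)
    parsed = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def seconds_to_wait(body: str, now: datetime, safety_margin_s: int = 30) -> float:
    """Seconds a running job may still be given before deregistering.

    The safety margin (an assumed value) keeps the deadline ahead of the
    termination timestamp from the notice.
    """
    deadline = parse_termination_time(body) - timedelta(seconds=safety_margin_s)
    return max(0.0, (deadline - now).total_seconds())
```

A cron entry or small loop on the runner would call `fetch_instance_action()` every few seconds and, once it returns a body, give the current job at most `seconds_to_wait(...)` seconds before deregistering.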

@lmeynberg

lmeynberg commented Aug 4, 2022

If I understand this correctly, part of this module could subscribe to the SNS topic, listen for these events, and then shut down the runner using the normal runner shutdown.
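
As a rough sketch of that event-driven variant, a Lambda handler in Python could look like the following. It assumes the `EC2 Spot Instance Interruption Warning` event from EventBridge reaches the Lambda either directly or wrapped in an SNS message. The `shutdown_runner` callback is a hypothetical placeholder for whatever the module's normal shutdown path is; nothing here reflects the module's actual code.

```python
import json
from typing import Any, Callable, Dict, List

SPOT_WARNING = "EC2 Spot Instance Interruption Warning"


def interrupted_instance_ids(event: Dict[str, Any]) -> List[str]:
    """Collect instance IDs from spot interruption warnings in a Lambda event."""
    if "Records" in event:
        # SNS delivery: each record carries the original event as a JSON string.
        events = [
            json.loads(record["Sns"]["Message"])
            for record in event["Records"]
            if "Sns" in record
        ]
    else:
        # Direct EventBridge delivery: the event is the warning itself.
        events = [event]
    return [
        e["detail"]["instance-id"]
        for e in events
        if e.get("detail-type") == SPOT_WARNING
    ]


def handler(
    event: Dict[str, Any],
    context: Any,
    shutdown_runner: Callable[[str], None] = print,  # hypothetical hook
) -> List[str]:
    """Invoke the (assumed) shutdown hook for every interrupted instance."""
    instance_ids = interrupted_instance_ids(event)
    for instance_id in instance_ids:
        shutdown_runner(instance_id)
    return instance_ids
```

Unlike polling from inside the runner, this keeps the logic outside the instance, but the handler still has to finish well within the two-minute warning window.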

@tedparagon

We went months with only a handful of evictions, so it was definitely worth rerunning a couple of jobs to get those huge spot instance savings. However, over the last month or so, evictions have increased significantly, to the point where we are considering other options. I think this feature would help a lot.

These are just some of our evictions in the last 30 minutes. They are becoming more the norm than the exception.
image

@npalm
Member

npalm commented Mar 13, 2024

In PR #3789 we are adding a first step to handle spot termination events.

5 participants