Add re-submission of tasks during spot interruption disconnects #516

Open
wants to merge 8 commits into
base: master


@blanked blanked commented Oct 16, 2020

This PR adds a new feature: re-submission of tasks for agents that are disconnected due to a spot interruption event in AWS.
Whenever an agent is disconnected, the plugin checks whether the disconnect was unexpected and whether it was caused by a spot interruption event. If both are true, the tasks that were running on the agent are re-submitted to the queue.
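
The two-step check described above can be sketched in plain Java. This is a minimal, hypothetical model and not the code in this PR: `Agent`, `OfflineReason`, `spotInterrupted` and `resubmitIfSpotInterrupted` are illustrative stand-ins invented here, and the real change works against the Jenkins `Computer` and `OfflineCause` types instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class ResubmitSketch {
    /** Hypothetical stand-in for the reason an agent went offline. */
    enum OfflineReason { NONE, USER_REQUESTED, CHANNEL_TERMINATION }

    /** Hypothetical, simplified view of an agent; not the Jenkins Computer API. */
    static final class Agent {
        final boolean offline;
        final OfflineReason reason;
        final boolean spotInterrupted;   // assumed to be determined elsewhere
        final List<String> runningTasks; // names of tasks that were in flight

        Agent(boolean offline, OfflineReason reason,
              boolean spotInterrupted, List<String> runningTasks) {
            this.offline = offline;
            this.reason = reason;
            this.spotInterrupted = spotInterrupted;
            this.runningTasks = runningTasks;
        }
    }

    /** Check 1: the agent is offline because its channel was terminated. */
    static boolean isUnexpectedDisconnection(Agent agent) {
        return agent.offline && agent.reason == OfflineReason.CHANNEL_TERMINATION;
    }

    /**
     * Re-queues the agent's tasks only when both checks pass.
     * Returns the tasks that were put back on the queue.
     */
    static List<String> resubmitIfSpotInterrupted(Agent agent, Queue<String> buildQueue) {
        List<String> resubmitted = new ArrayList<>();
        // Check 2: the disconnect must also be a spot interruption.
        if (isUnexpectedDisconnection(agent) && agent.spotInterrupted) {
            for (String task : agent.runningTasks) {
                buildQueue.add(task);
                resubmitted.add(task);
            }
        }
        return resubmitted;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        Agent interrupted = new Agent(true, OfflineReason.CHANNEL_TERMINATION,
                true, Arrays.asList("build-42"));
        Agent manual = new Agent(true, OfflineReason.USER_REQUESTED,
                false, Arrays.asList("build-43"));
        System.out.println(resubmitIfSpotInterrupted(interrupted, queue)); // [build-42]
        System.out.println(resubmitIfSpotInterrupted(manual, queue));      // []
    }
}
```

Requiring both conditions means an agent that was taken offline deliberately, or that dropped for a reason unrelated to spot capacity, does not have its tasks re-queued.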

Motivation

Builds can fail when spot instances are terminated. This PR aims to reduce the number of build failures caused by spot interruption events.

Notes

This may or may not prevent build failures. There doesn't seem to be any documentation on how tasks can be resubmitted. This PR is inspired by another Jenkins plugin that already implements the suggested behaviour: https://github.com/jenkinsci/ec2-fleet-plugin/blob/master/src/main/java/com/amazon/jenkins/ec2fleet/EC2FleetAutoResubmitComputerLauncher.java

@blanked
Author

blanked commented Oct 21, 2020

Can someone help review this PR to see if it's OK? It's actually identical to #485, but I opened a new PR so that it's eligible for Hacktoberfest 😅

Contributor

@res0nance res0nance left a comment

The feature looks very interesting. AFAICT this seems to do what it says, but it is a hard-to-test feature.

@blanked
Author

blanked commented Oct 22, 2020

Yeah, I'll have a think on how to mock a spot interruption event and see if it's possible using the AWS SDK. If anyone has any idea on how to do so, that'll be super helpful!

Comment on lines +109 to +110
final boolean isUnexpectedDisconnection = computer.isOffline()
        && computer.getOfflineCause() instanceof OfflineCause.ChannelTermination;

Some customers have complained that OfflineCause.ChannelTermination is not always triggered for spot interruptions. You may be able to dig into this further here: jenkinsci/ec2-fleet-plugin#121

@dgiffordaudio

This seems to have been approved in October 2020. Is this going to be merged soon? This would be really helpful for us.

@opajonk

opajonk commented May 15, 2023

Yes, this would be really awesome to add - any plans?

@minhnnhat-urbanise

Hello, we're also looking forward to this feature.

@schottsfired
Contributor

schottsfired commented Sep 29, 2023

> AFAICT this seems to do what it says, but it is a hard-to-test feature.

> I'll have a think on how to mock a spot interruption event and see if it's possible using the AWS SDK. If anyone has any idea on how to do so, that'll be super helpful!

It should be possible to test it now with this new-ish AWS feature:
AWS Fault Injection Simulator now injects Spot Instance Interruptions
