Update delay before killing old tasks #2249

Closed
pgkelley4 opened this issue Sep 14, 2015 · 11 comments

@pgkelley4
Contributor

My team would like to have more control over the speed of app updates. I would like to add support for a delay between when a new task is healthy and an old task is taken down. Something like Aurora's watch_secs:
http://aurora.apache.org/documentation/latest/configuration-tutorial/#defining-job-objects

This is somewhat similar to #1504, but a different use case.

Would the Marathon team be open to a PR adding this feature? If so, would TaskReplaceActor be the right place to add it?
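To make the idea concrete, here is a rough sketch of the mechanism (all names and messages below are made up for illustration, not Marathon's actual actor protocol): when a new task turns healthy, the kill of the corresponding old task is scheduled for later instead of being issued immediately.

```scala
import scala.concurrent.duration._
import akka.actor.{ Actor, ActorSystem, Props }

// Hypothetical messages for illustration only.
final case class NewTaskHealthy(newTaskId: String, oldTaskId: String)
final case class KillOldTask(oldTaskId: String)

// Minimal sketch: when a new task becomes healthy, the kill of the old task
// is scheduled `updateDelay` in the future rather than sent right away.
class DelayedReplaceActor(updateDelay: FiniteDuration, kill: String => Unit) extends Actor {
  import context.dispatcher

  def receive: Receive = {
    case NewTaskHealthy(_, oldTaskId) =>
      context.system.scheduler.scheduleOnce(updateDelay, self, KillOldTask(oldTaskId))
    case KillOldTask(oldTaskId) =>
      kill(oldTaskId)
  }
}

object DelayedReplaceDemo extends App {
  val system = ActorSystem("demo")
  val actor  = system.actorOf(Props(new DelayedReplaceActor(30.seconds, id => println(s"killing $id"))))
  actor ! NewTaskHealthy("app.new-1", "app.old-1")
}
```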

@kolloch
Contributor

kolloch commented Sep 16, 2015

Thanks for submitting this. That makes sense. It is also related to #1111 as part of the bigger picture.

We would definitely accept PRs for this but we would have to collaborate closely and have to discuss this some more internally. Would you also be willing to write a medium sized document about how to deal with failure recovery, rolling updates and starting up tasks?

@pgkelley4
Contributor Author

I put together some more information on how we were thinking this would work. It didn't end up being that long, so if there is anything you want more detail on, let me know and I will expand it.

Overall

When a task is marked healthy, the removal of an old task is scheduled for updateDelay seconds in the future.

Failure recovery

This change has no effect on failure recovery. The updateDelay only delays the removal of old instances.

Rolling updates

The updateDelay would slow down rolling updates. After each batch of the update succeeds, the removal of old tasks is delayed. In total, the update is delayed by updateDelay * the number of update batches, where a batch is the set of tasks updated at once during a rolling update.
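For example (hypothetical numbers): updating an app with 10 instances in batches of 2 gives 5 batches, so an updateDelay of 30 seconds adds roughly 5 × 30 = 150 seconds to the total deployment time.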

Starting up tasks

This change has no effect on starting up tasks. There are no old tasks to delay the removal of.

pgkelley4 added a commit to pgkelley4/marathon that referenced this issue Sep 29, 2015
…tegy

Added the killOldTasksDelaySeconds option to UpgradeStrategy.
- The amount of time in seconds to wait before killing an old
task when a new task becomes healthy.

This only affects updates. It does not delay the immediate killing
of tasks to meet the minimumHealthCapacity at the beginning of
an update.
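As a rough sketch of where the option would live (the surrounding fields are written from memory and may not match Marathon's actual UpgradeStrategy exactly):

```scala
// Sketch only; Marathon's real UpgradeStrategy lives in mesosphere.marathon.state
// and its exact shape may differ.
case class UpgradeStrategy(
  minimumHealthCapacity: Double = 1.0,
  maximumOverCapacity: Double = 1.0,
  // Proposed addition: seconds to wait after a new task becomes healthy before
  // killing the old task it replaces (0 keeps the current behavior).
  killOldTasksDelaySeconds: Long = 0
)
```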
@kolloch
Contributor

kolloch commented Oct 2, 2015

Hey @pgkelley4, thanks for following up with code and sorry for being a bit slow to react.

I think that the issue might be a bit more complicated than your commit suggests. If Marathon fails over during an upgrade, your code will not work. It will also incorrectly kill an old task if the task that became healthy turns unhealthy again while the kill is delayed.

I think it would help if you think of the delay not as a "killOldTasksDelaySeconds" but as (just an example) a "minHealthyDuration". E.g. in the context of the rolling upgrade, we would only consider tasks healthy that have lived for at least this time.
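As a sketch of what a "minHealthyDuration" could mean inside the deployment logic (the names below are illustrative, not Marathon's actual health API):

```scala
import java.time.{ Duration, Instant }

// A task only counts as healthy for deployment purposes once it has been
// passing its health checks for at least minHealthyDuration. `firstSuccess`
// stands in for whatever timestamp the health check results carry.
def healthyLongEnough(
    isHealthy: Boolean,
    firstSuccess: Option[Instant],
    minHealthyDuration: Duration,
    now: Instant = Instant.now()): Boolean =
  isHealthy && firstSuccess.exists { since =>
    Duration.between(since, now).compareTo(minHealthyDuration) >= 0
  }
```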

Can you maybe provide documentation for your solution? Sometimes it helps to evaluate if the solution is nice to use or a bit clumsy.

@pgkelley4
Contributor Author

@kolloch, thanks for taking the time to look at my PR and for working with me on this.

If Marathon fails over during an upgrade, your code will not work.

Completely forgot to consider this. However, after taking a deeper look at the class, I think there are a few more issues that can occur when Marathon fails over during an update.

I put together a quick test to verify this. I started an update on a 2 instance app, then restarted the master. The new master picked up the deployment plan just fine, but it would kill too many tasks and fall below the minimum health capacity (Not every time).

I believe this is because the TaskReplaceActor.preStart method assumes it will be run once at the beginning of an update and then:

  • tasksToKill doesn't take into account that some tasks may already be the new version and might not yet be healthy (often the case during an update). Because of this, nrToKillImmediately is wrong and tasks are killed right away.
  • conciliateNewTasks doesn't know that tasks were previously started.
  • the backoff delay is reset every time.

This can be fixed by changing the preStart method to assume things could have already been started. I am happy to put up a separate PR for these issues. But let me know if I am completely missing something here in the code and I will file the functional issues I saw as a bug.
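A rough sketch of the kind of accounting I mean (names are made up for illustration, not the actual TaskReplaceActor code): count the new-version tasks that are already running before deciding how many old tasks can be killed immediately.

```scala
final case class RunningTask(id: String, version: String, healthy: Boolean)

// Only kill as many old tasks right away as we can spare without dropping
// below the minimum healthy capacity, counting tasks of both versions.
def tasksToKillImmediately(
    running: Seq[RunningTask],
    newVersion: String,
    targetInstances: Int,
    minimumHealthCapacity: Double): Seq[RunningTask] = {
  val oldTasks     = running.filterNot(_.version == newVersion)
  val healthyTotal = running.count(_.healthy)
  val minHealthy   = math.ceil(minimumHealthCapacity * targetInstances).toInt
  val spare        = math.max(0, healthyTotal - minHealthy)
  oldTasks.take(spare)
}
```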

Regarding my addition specifically: right now, health check information like firstSuccess isn't maintained during a failover, so I think it is reasonable to restart the kill delay after a failover. I can make this change if you agree.

It will also incorrectly kill a task, if the tasks that became healthy becomes unhealthy while delaying the kill.

I had added a comment on the PR to this effect. I wasn't 100% sure that this was the best behavior. After thinking about it a bit more I agree with what you describe. I can make this fix too.

I think it would help if you think of the delay not as a "killOldTasksDelaySeconds" but as (just an example) a "minHealthyDuration". E.g. in the context of the rolling upgrade, we would only consider tasks healthy that have lived for at least this time.

I can certainly change it to work that way if you think it is better. Just so we are clear we are talking about the same thing: I think you mean approaching this from the mesosphere.marathon.health package and doing something like not publishing HealthStatusChanged as Healthy events until a certain time period has elapsed (not sure if that is exactly how it would be implemented).

What I liked about the killOldTasksDelaySeconds approach is that it only affects updates and the user gets as much information as is available. I was thinking that it would be undesirable to slow down app starts and failure recovery. However, it may be more consistent to apply the delay to all new task creation.

Here is the documentation I had before:

killOldTasksDelaySeconds (Optional. Default: 0) - The amount of time in seconds to wait before killing an old task when a new task becomes healthy.

Here is what I am thinking the new way would be:

minHealthyDuration - The amount of time a task must be passing its health checks before Marathon considers it to be healthy.

Would you still like me to change to the minHealthyDuration approach? I am happy to put up a PR for it if so.

Sorry for throwing so much at you all at once and thanks again!

@pgkelley4
Contributor Author

@kolloch Any chance you could take a look at this soon? I don't want it to fall through the cracks... I'm happy to split the bug out into a separate ticket.

@kolloch
Contributor

kolloch commented Nov 18, 2015

Hey @pgkelley4, just a quick note: I am sorry that I missed these comments! I was very busy and archived a couple too many notifications from GitHub. I'll at least read through your comments now.

@kolloch
Contributor

kolloch commented Nov 18, 2015

Hi @pgkelley4, thanks for the awesome comment!

You are right about the bugs, I think (I would need a bit more time to verify that for sure). It would be great to create a separate ticket for them.

I can certainly change it to work that way if you think it is better. Just so we are clear we are talking about the same thing: I think you mean approaching this from the mesosphere.marathon.health package and doing something like not publishing HealthStatusChanged as Healthy events until a certain time period has elapsed (not sure if that is exactly how it would be implemented).

I think you are right that the health information should be propagated to consumers of the API as fast as possible. I was only thinking of treating it as a delayed health state inside the deployment logic. That would work better if the health status is flaky or the tasks die. The approach you take in the PR is a bit dangerous: if you have a configuration error that makes all tasks die after 10s and a killOldTasksDelaySeconds of 20s, you are still killing the old (working) tasks even though you don't have new tasks to replace them.
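A sketch of that guard (illustrative names only): when the delayed kill fires, kill the old task only if its replacement is still healthy.

```scala
// If the replacement died or turned unhealthy while we were waiting,
// keep the old, working task instead of killing it.
def onKillTimerFired(
    newTaskStillHealthy: String => Boolean,
    killTask: String => Unit)(newTaskId: String, oldTaskId: String): Unit =
  if (newTaskStillHealthy(newTaskId)) killTask(oldTaskId)
```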

@kevincox

kevincox commented Mar 8, 2016

I'm really excited about this work, but I just wanted to add one suggestion. It would be awesome if there were some way to tell from the API that a task is "defunct". That way it could be removed from load balancers and DNS before it is actually killed, which would make it easy to provide seamless transitions between versions.

@sgran

sgran commented May 25, 2016

Hello,

I was just about to file a feature request for exactly what @kevincox suggested; I'm glad I found this first :)

What I was naively hoping for was an event like TaskScheduledForRemoval. I realize this will be slightly difficult if the feature is implemented as a delay on the healthy state of the new instance, rather than as a scheduling event on the old instance.

Thanks,

@ogrodnek

ogrodnek commented Nov 2, 2016

I would also love to see something on the event bus as @sgran and @kevincox suggested to help schedule removal from LBs/etc.

We are trying to implement zero downtime app updates, and I think this would be a huge help.

Currently we're using marathon-consul/haproxy-consul to update the haproxy config. We've got it all working nicely, but I am seeing a blip when the old instances are killed. It seems the time between the tasks being killed and us receiving the event on the bus, processing it, and updating the config still causes some issues.

If we could get an event as suggested (TaskScheduledForRemoval) or similar, that would let us gracefully rotate the instances out of the config before they're killed.
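To illustrate what consumers could do with such an event (sketch only: "task_scheduled_for_removal" is the event type being asked for in this thread, it does not exist in Marathon today, and parsing of the event stream is assumed to have already happened):

```scala
// Drain an instance from the load balancer as soon as Marathon announces
// that it is scheduled for removal, well before the task is actually killed.
def handleEvent(drainFromLoadBalancer: String => Unit)(eventType: String, taskId: String): Unit =
  eventType match {
    case "task_scheduled_for_removal" => drainFromLoadBalancer(taskId)
    case _                            => () // ignore other event types
  }
```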

@meichstedt
Contributor

Note: This issue has been migrated to https://jira.mesosphere.com/browse/MARATHON-4535. For more information see https://groups.google.com/forum/#!topic/marathon-framework/khtvf-ifnp8.

mesosphere locked and limited conversation to collaborators Mar 27, 2017