
[ECS] [Proposal]: Container Ordering #123

Closed
petderek opened this issue Jan 18, 2019 · 17 comments
Labels
ECS (Amazon Elastic Container Service) · Proposed (Community submitted issue)

Comments

petderek commented Jan 18, 2019

The ECS team is planning on implementing container startup and shutdown
ordering for tasks. We would like to get feedback on our current plan.
Specifically, we'd like to know:

  1. Do the dependency conditions we propose solve your use cases for
    application startup?
  2. Is it reasonable for shutdown order to always be the inverse of startup
    order?

Thanks!

Problem Statement

ECS does not currently have an explicit mechanism to ensure that containers
start in any particular order. Yet, many applications and services have
cross-container prerequisite dependencies. Common examples may include:

  1. A container that gathers data or applies options before the rest of the
    application may start.

  2. An application that has a runtime expectation that another application
    defined within the task has already started.

  3. A container that reuses resources defined by other containers. Some
    resources, such as volumes, are implicitly handled in ECS today.

Overview of Solution

ECS will address these use cases by improving container dependency management.
We will introduce the following concepts into our task definition:

  • A means to explicitly declare dependencies on other containers within a task

  • A parameter to describe the condition the dependency must satisfy

  • Granular timeouts for container start and stop

These three components can be added to the container definition shape as follows:

"ContainerDefinitions": [
    {
        "Name": "containerOne",
        "DependsOn": [
            {
                "Container": "dependencyOne",
                "Condition": "COMPLETE"
            },
            {
                "Container": "dependencyTwo",
                "Condition": "START"
            }
        ],
        "StartTimeout": 30,
        "StopTimeout": 30
    }
]

Dependency Mappings

Within the container definition, we will make it possible to declare
dependencies on other named containers. A container may have zero, one, or
multiple dependencies. The chain of dependencies within a task will be used to
determine both start and stop order.

When starting up, the agent will guarantee that a container starts only after
the containers it depends on have started. Internally, the agent already
respects ordering when a task uses links or volumes between containers, but
otherwise it starts containers in parallel. This project will extend that
existing dependency logic and make it usable in more situations.

Currently, the agent does not enforce any ordering when a task is stopped, even
for links and volumes. We will amend the behavior of container stops to respect
the order provided via declared dependencies. The shutdown order will simply be
the inverse of start order. For example, let's say container A depends on
container B. B will start before A is started, but A will stop before B is
stopped. If a task is shut down due to an essential container failing in the
middle of the chain, we will adhere to the shutdown ordering where possible.

Dependency Conditions

There is already an implicit dependency condition for containers using links or
volumes. However, in both of these cases the only validation is that the
required container has started before the dependent container may start.
Merely having started does not provide a strong enough guarantee for many
application types. We will introduce dependency conditions as a way to support
these other kinds of applications.

A "condition" may be one of the three enumerated strings: "START", "COMPLETE",
"SUCCESS", or "HEALTHY". The behavior of these conditions follows:

  • "START" will emulate the behavior of links and volumes today. It will allow
    customers to specify that a dependent container needs to only be started before
    permitting other containers to start.

  • "COMPLETE" will validate that a dependent container runs to completion
    (exits) before permitting other containers to start. This can be useful for
    non-essential containers that run a script and then subsequently exit.

  • "SUCCESS" will be identical to "COMPLETE", but it will also require that the
    container exits with status zero.

  • "HEALTHY" will validate that the dependent container passes its Docker
    healthcheck before permitting other containers to start. This condition will
    only be confirmed at task startup.
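To illustrate how these conditions compose, here is a sketch in the proposal's notation (container names are hypothetical, and the Essential flag follows the same casing convention as the snippet above): a one-shot "setup" container must exit with status zero, and a "sidecar" container merely needs to have started, before "app" is permitted to start.

"ContainerDefinitions": [
    {
        "Name": "setup",
        "Essential": false
    },
    {
        "Name": "sidecar"
    },
    {
        "Name": "app",
        "DependsOn": [
            {
                "Container": "setup",
                "Condition": "SUCCESS"
            },
            {
                "Container": "sidecar",
                "Condition": "START"
            }
        ]
    }
]

Marking "setup" as non-essential matters here: a container awaited with "COMPLETE" or "SUCCESS" exits by design, and an essential container exiting stops the whole task.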

Granular Timeouts

Currently, the container start and stop timeouts are instance-level settings
configured in the ecs.config file, and they apply strictly to the Docker API
calls themselves.

Introducing container dependencies will add a set of potential startup failure
conditions that extend beyond the Docker API timeout. For example, waiting for
a container to complete or to become 'healthy' is not covered by the Docker
timeout as it stands. We will need to implement this feature to prevent tasks
from getting stuck in 'starting' forever.

Additionally, a global option is not going to give customers enough
flexibility, since different containers will have different conditions.

In order to give customers the most flexibility, we will need to enhance the
timeout feature in two ways (sketched after this list):

  1. Provide start and stop timeouts on a per-container basis

  2. Enhance the timeouts so that they can be applied to the "HEALTHY" and
    "COMPLETE" conditions described earlier

@petderek petderek added the Proposed Community submitted issue label Jan 18, 2019
@abby-fuller abby-fuller added the ECS Amazon Elastic Container Service label Jan 18, 2019
@abby-fuller abby-fuller added this to Researching in containers-roadmap Jan 18, 2019
@talawahtech

  1. Yes! This looks well thought out!

  2. Yeah... I think so. I don't currently have any use cases that would require a different shutdown order.

deleugpn commented Jan 20, 2019

I already moved away from designing this kind of dependency after reading moby/moby#31333, in particular comment moby/moby#31333 (comment).
Maybe this will be a great addition to ECS, but whoever relies on the HEALTHY model should carefully consider the implications of relying on orchestration-ordering state versus making the application fault tolerant.
The linked comment has a great example that applies here:

  • Container B should only start when Container A is healthy.
  • Container A is now healthy.
  • ECS starts Container B.
  • Container A is now unhealthy before Container B finishes starting.

Your orchestration is now out of order. Should ECS forcefully terminate B? What if there's no time?
My conclusion after carefully considering these scenarios is that it's a much safer approach to make Container B implement a pathway that is triggered if A is unavailable.
Each application knows what is best to do if its dependencies are out of reach.

That being said, the START, SUCCESS, and COMPLETE scenarios described here do seem safe to use, as they represent an immutable container state, e.g. a started container will never not have been started, and a container that has already exited will never not have run.

talawahtech commented Jan 20, 2019

@deleugpn your points are well taken and perhaps AWS should add a section to the documentation to warn users of the potential pitfalls and complexities of the feature, but overall I think it is a net positive for most situations.

Take for example a case where you have multiple containers that depend on a database or message queue being fully started. Currently, each container is responsible for implementing the exact same readiness logic in a custom entrypoint script. With this feature, the DB or message queue container can take on the responsibility of implementing the check and marking itself as healthy, and all other containers simply reuse that information.
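As a sketch of that pattern in the proposal's notation (the healthcheck command and container names are invented for illustration), the database owns its readiness check once, and every consumer declares the same HEALTHY dependency:

"ContainerDefinitions": [
    {
        "Name": "db",
        "HealthCheck": {
            "Command": ["CMD-SHELL", "pg_isready -U postgres || exit 1"],
            "Interval": 10,
            "Timeout": 5,
            "Retries": 6
        }
    },
    {
        "Name": "api",
        "DependsOn": [
            {
                "Container": "db",
                "Condition": "HEALTHY"
            }
        ]
    },
    {
        "Name": "worker",
        "DependsOn": [
            {
                "Container": "db",
                "Condition": "HEALTHY"
            }
        ]
    }
]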

Regarding your example, the developer can decide what happens by marking Container A as essential or not. If it is essential, everything restarts; if not, it doesn't.

So I agree, some people may get a false sense of security from this feature and may end up having trouble troubleshooting if they don't have adequate fallbacks and proper timeouts, but overall I think it is much better to have the functionality available (and off by default). Good documentation and practical examples will help as well.

@alexbilbie

This would be extremely useful for us. At the moment, if a container instance dies, there's a land rush for our ECS services to launch tasks, which causes lots of unnecessary logs to be written and alarms to go off.

For example, we have a Consul ECS service which launches a daemon task on each cluster instance. Our microservices (each expressed as their own ECS service + task definition) expect this container to be running when they launch, and inevitably fail until the Consul task is ready.

Our current solution for this involves bash for loops and silent curl calls in the entrypoint.sh, but it's inelegant and should ideally be fixed at the orchestrator level.

@petderek

Regarding 'healthy': this feature is designed for local coordination and should not be considered a solution for fault tolerance. The example that @deleugpn provided is a clear failure condition that this wouldn't catch by itself. The ordering won't report that your app is healthy or not -- rather, it is intended to replace the need to use a sleep / wait loop until resources are available.

Your applications will still need to handle the case where container A breaks, either during startup or hours after the task starts. If you are trying to implement self-healing architecture as described in the moby thread, you could use the ECS service abstraction paired with health checks on your essential containers.

Envoy proxy is the example we have been using to justify 'healthy' as a dependency condition. For Envoy, it is not enough to validate that the container has started. We also need to ensure that the container is ready to receive traffic. This means that containers that depend on Envoy can start knowing that Envoy has already finished its initialization sequences.

However, this doesn't mean an application depending on Envoy can assume that it will always be available. You would still need to implement a failure path, even if that failure path is reporting that the container is unhealthy and signaling the scheduler to restart the task.
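Concretely, the Envoy case maps to the same HEALTHY pattern (again a sketch; the healthcheck command and admin port are illustrative): Envoy's readiness is probed via its admin endpoint, and the application container waits on that signal at task startup only.

"ContainerDefinitions": [
    {
        "Name": "envoy",
        "HealthCheck": {
            "Command": ["CMD-SHELL", "curl -fs http://localhost:9901/ready || exit 1"],
            "Interval": 5,
            "Timeout": 2,
            "Retries": 3
        }
    },
    {
        "Name": "app",
        "DependsOn": [
            {
                "Container": "envoy",
                "Condition": "HEALTHY"
            }
        ]
    }
]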

@deleugpn

That is a wonderful positioning for the feature. Although I personally don't need it yet, I think you have the heart of the feature in the right place.

jpoley commented Jan 27, 2019

How does this relate to the k8s concept of init containers? Those might no longer be running, so reverse-order shutdown should skip over containers that have already completed or stopped.

elerch commented Jan 31, 2019

I have a need for this and agree the proposal looks well thought out. However, a colleague pointed out the documentation doesn't explicitly state that the dependencies would be run on the same instance. I think it's somewhat implied, but in reading the proposal, statements like "agent will ensure that dependencies are run" could imply that these containers are run but not necessarily on the local host.

In my case I would need dependencies run on the same host as the primary container, and it sounds like @alexbilbie has the same need. I think clarification of the proposal on this point would be good.

pagameba commented Feb 5, 2019

This could be helpful for one of my use cases. I recently had to implement custom entrypoint logic that pauses containers on startup if there are database migrations pending, plus a companion task that actually applies the pending migrations, orchestrated through CloudWatch Events and Lambdas. A couple of questions, though:

  1. Will it be possible to have multiple services await the success of a singleton task? If I deploy multiple services simultaneously (as would happen when scaling or replacing container instances, perhaps), can they trigger a single instance of a task to run and await its SUCCESS exit status?

  2. Will it work across clusters? We use different clusters for different workloads, but have a common database that migrations have to be evaluated against.

petderek commented Mar 8, 2019

This was shipped in today’s release.

@petderek petderek moved this from Researching to Just Shipped in containers-roadmap Mar 8, 2019
@petderek petderek closed this as completed Mar 8, 2019
petderek commented Mar 8, 2019

To answer some of the questions in the thread: this doesn't work across services or tasks. The ordering applies strictly within the task boundary. You will still need to employ other strategies to enforce dependencies across services / clusters.

coriet commented Mar 29, 2019

Very helpful feature. Is there a timeline for when this will be supported by CloudFormation?

@th31nitiate

I understand that this is bound to the task definition boundary, but how can we achieve this?

I am using Fargate and Terraform to provision my components. I am able to set this up, but when I look at the task definition in the AWS console, dependsOn is set to null. The platform version is 1.3.0, and I am using Fargate, which means there is no agent available.

(screenshot of the task definition attached)

raags commented Dec 2, 2019

@petderek Is this feature available via the ecs-cli? It does not seem to be, per the docs: https://docs.amazonaws.cn/en_us/AmazonECS/latest/developerguide/cmd-ecs-cli-compose-ecsparams.html

pocockn commented May 4, 2020

I need to configure the order of containers across different services. Has anyone worked out a way to do this?

@reganbaucke

Question 2 is not reasonable, and it cuts out the following use case involving two containers:

  1. Container 1 starts.
  2. Container 1 performs a workload.
  3. Container 1 writes a result file to a volume (shared with Container 2).
  4. Container 1 finishes with exit code 0.
  5. Container 2 starts.
  6. Container 2 pushes the result file to S3 (or whatever).
  7. Container 2 finishes with exit code 0.

Container 2 depends on Container 1 finishing (otherwise there is no result file for Container 2 to find).

This can be achieved with Docker Compose; however, it doesn't seem possible with an ECS task definition.

eblfo commented Feb 6, 2023

We have an nginx container which depends on a HEALTHY app container in a service.

This works fine for start-up.

On shut-down, we need nginx to stay up until the app has finished processing requests (while receiving a shutdown signal). With the current behavior ("for container shutdown it is reversed"), this breaks the logic, as the app waits to shut down until nginx is down.

We would need a dependsOnShutDown container definition.
