ECS Deployment Fails Due to Premature Resource Availability Reporting #4106
Comments
Hello @zahorniak, thank you for the detailed explanation. Would you be able to send the logs for further investigation to this email address: ecs-agent-external@amazon.com? Thank you.
Hey @hozkaya2000, I sent you the logs as soon as you provided your email. Sorry for forgetting to mention it here.
Hi @zahorniak, thanks for raising this issue.
Just to add a bit more context: this is the expected behavior for the agent. Reported host resources are only marked as free after the agent sends a stopped-task state change, and the agent will/should only send that state change (and clean up the task's resources) once the known status of the container is terminal (i.e. STOPPED).
May I ask which agent and AMI versions you're currently using? There was a change to our task launch behavior sometime last year (2023). Please check out the following public AWS document for more information: https://aws.amazon.com/blogs/containers/improvements-to-amazon-ecs-task-launch-behavior-when-tasks-have-prolonged-shutdown/
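For reference, the agent and cluster details can be read from the ECS Agent's introspection endpoint on the container instance itself (port 51678, per the agent documentation). A minimal sketch in Python, assuming it runs on the instance:

```python
import json
import urllib.request

# Query the ECS agent introspection API exposed on the container
# instance and print the agent version and registration details.
with urllib.request.urlopen("http://localhost:51678/v1/metadata") as resp:
    metadata = json.load(resp)

print(metadata["Version"])               # e.g. "Amazon ECS Agent - v1.x.x (...)"
print(metadata["Cluster"])               # cluster the instance is registered to
print(metadata["ContainerInstanceArn"])  # this container instance's ARN
```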
Hi @zahorniak. In addition to the information shared by @mye956 above, the following two configuration options might be useful for your use case. Have you tried these already?

1. The service deployment configuration option 'maximumPercent'. If you set it to 100%, I think the scheduler won't start a replacement task until a STOPPED task has been identified for it (see the sketch after this comment).
2. The ECS Agent's container stop timeout option. What's the value for this option on your container instances? The default is 10 minutes, so I don't follow why a container running a long job doesn't stop for several hours in your case.
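To illustrate the first option: a minimal boto3 sketch for setting 'maximumPercent' on a service. The cluster and service names here are hypothetical placeholders, not values from this issue:

```python
import boto3

ecs = boto3.client("ecs")

# Cap the service at 100% of desiredCount during deployments so the
# scheduler has to see a task reach STOPPED before starting its
# replacement. minimumHealthyPercent must stay below 100 for a
# rolling deployment to make progress.
ecs.update_service(
    cluster="my-cluster",  # hypothetical cluster name
    service="my-service",  # hypothetical service name
    deploymentConfiguration={
        "maximumPercent": 100,
        "minimumHealthyPercent": 50,
    },
)
```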
Also, on this point: the ECS Agent does not report available resources to the ECS backend at all. The ECS backend has its own resource accounting logic, which is independent of the ECS Agent's resource accounting logic.
Hi @amogh09, thanks for your input and questions. I'll try to answer, if that's okay with you. Yes, we are using that parameter; it is set to 200% for our service. The problem is not that ECS doesn't start new containers; the problem is that ECS tries to start a container on an EC2 instance that is already at full capacity.
We do not set this option for the ECS Agent; instead, we set the stop timeout at the task level. Right now, we're implementing a task scale-in protection mechanism for our ECS services, as @mye956 suggested (see the sketch after this comment). We will begin testing it this week and hopefully see positive results in a few days or weeks.
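For readers unfamiliar with that mechanism: ECS task scale-in protection can be toggled per task, for example via boto3's update_task_protection. The cluster name and task ARN below are hypothetical placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# Mark a task as protected so the service scheduler will not stop it
# during deployments or scale-in while its long-running job finishes.
ecs.update_task_protection(
    cluster="my-cluster",  # hypothetical cluster name
    tasks=[
        # hypothetical task ARN
        "arn:aws:ecs:us-east-1:123456789012:task/my-cluster/0123456789abcdef",
    ],
    protectionEnabled=True,
    expiresInMinutes=480,  # keep protection for up to 8 hours
)
```

In practice, the task would typically set and clear this protection itself around its long-running job (via the same API or the agent's task protection endpoint), so replacement proceeds as soon as the job completes.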
Summary
During ECS deployments with an EC2 capacity provider, tasks get stuck in the Pending state because the ECS Agent reports resources as available before old containers have fully stopped. This premature reporting breaks deployments, especially when containers run long-lived jobs: new tasks attempt to start on already busy instances, and the capacity provider never launches the new EC2 instances we expect.
Description
We are using an ECS cluster with an EC2 capacity provider. The capacity provider fully controls the ASG size. We always keep two tasks on each container instance, matching our EC2 instance size (r5.large) and task definition parameters (1024 CPU units and a 7372 MiB memory reservation); a sketch of the relevant task-definition fields follows.
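For reference, the sizing described above maps to these task-definition fields. A minimal boto3 sketch; the family, container name, and image are hypothetical placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# 1024 CPU units and a 7372 MiB memory reservation per task, so two
# tasks fill an r5.large (2 vCPU / 16 GiB) container instance.
ecs.register_task_definition(
    family="long-job-worker",  # hypothetical family name
    containerDefinitions=[
        {
            "name": "worker",  # hypothetical container name
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/worker:latest",
            "cpu": 1024,
            "memoryReservation": 7372,
            "essential": True,
        }
    ],
)
```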
Problems usually happen during deployments.
The most recent scenario was the following. We had two EC2 instances and four containers (two on each instance). A deployment started and sent a stop signal to all four containers. Three containers stopped and were replaced almost immediately, but the fourth was running a long-term job that usually takes a few hours, so it kept running (this is expected behavior).
As a result, one task was stuck in the Pending state for up to 8 hours. And what happens if all four old tasks are running long-term jobs and cannot be stopped immediately? We hit that scenario once, and in the end zero new containers started.
I discovered that the ECS Agent frees up container instance resources immediately after sending a stop signal to the old container. For example, when I monitor available CPU and memory for container instances in the ECS cluster's Infrastructure tab, I can see 1024 CPU units and 8359 MiB of memory reported as available on the second EC2 instance even though two active containers are still on it (one new, and one old that is still stopping). That's why ECS tries to place a new container on the same EC2 instance where two active containers are already running.
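The same numbers shown in the Infrastructure tab can also be pulled programmatically, which may help when capturing evidence of the mismatch. A minimal boto3 sketch, assuming the cluster (hypothetical name below) has registered container instances:

```python
import boto3

ecs = boto3.client("ecs")
cluster = "my-cluster"  # hypothetical cluster name

# Print what the ECS backend currently believes is free on each
# container instance, alongside its running-task count.
arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
for ci in ecs.describe_container_instances(
    cluster=cluster, containerInstances=arns
)["containerInstances"]:
    remaining = {r["name"]: r.get("integerValue") for r in ci["remainingResources"]}
    print(
        ci["ec2InstanceId"],
        "CPU:", remaining.get("CPU"),
        "MEMORY:", remaining.get("MEMORY"),
        "running tasks:", ci["runningTasksCount"],
    )
```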
Expected Behavior
The ECS Agent should free up resources only once the container has actually stopped. That would solve the problem: the capacity provider would deploy a new EC2 instance and place the new task there instead of trying to reuse an already busy container instance.
Observed Behavior
The ECS Agent incorrectly reports that the EC2 instance has the stopping container's resources available, while in reality that container is still active and running.
Environment Details
Supporting Log Snippets
I collected logs using ecs-logs-collector, but I am uncomfortable sharing them here. Can you provide me with another method for sending logs?