Containers

A deep dive into HAQM ECS task health and task replacement

Introduction

HAQM Elastic Container Service (HAQM ECS) is a container orchestration service that manages the lifecycle of billions of application containers on AWS every week. One of the core goals of HAQM ECS is to remove operational burden from human operators. HAQM ECS watches over your application containers 24/7 and can respond to unexpected changes faster than any human can. It reacts to undesired changes, such as application crashes and hardware failures, by continuously attempting to self-heal your container deployments back to your desired state. External factors, such as traffic spikes that cause an application brownout, can be more challenging to handle. This post dives deep into recent changes to how HAQM ECS handles task health issues and task replacement, and how these changes increase the availability of your HAQM ECS orchestrated applications.

Task health evaluation

HAQM ECS evaluates the health of a task based on a few criteria:

  1. First, for a task to be healthy, all containers that are marked as essential must be running. Every HAQM ECS task must have at least one essential container. As a best practice, a container runs a single application process, and if that process ends because of a critical runtime exception, then the container stops. If the stopped container was marked as essential, then the entire task is considered unhealthy and the task must be replaced.
  2. You can use the HAQM ECS task definition to configure an optional internal health check command that the HAQM ECS agent runs inside the container periodically (see the sketch after this list). A zero exit code indicates success; a non-zero exit code indicates failure and the container is considered unhealthy. An unhealthy essential container makes the whole task unhealthy, which causes HAQM ECS to replace the task.
  3. You can use the HAQM ECS service to configure attachments between your application container and other AWS services. For example, you can connect your container deployment to an Elastic Load Balancing (ELB) load balancer or AWS Cloud Map. These services perform their own external health checks. For example, the ELB periodically attempts to open a connection to your container and send a test request. If that connection can’t be opened, your container returns an unexpected response, or your container takes too long to respond, then the ELB considers the target container to be unhealthy. HAQM ECS also considers this external health status when deciding whether an HAQM ECS task is healthy or unhealthy. An unhealthy ELB health check causes the task to be replaced.

For a task to be healthy, all sources of health status must evaluate as healthy. If any of the sources return an unhealthy status, then the HAQM ECS task is considered unhealthy and it will be replaced.
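
To make these criteria concrete, the following is a minimal boto3 sketch of a task definition that marks its only container as essential and configures an internal health check command. The family, container name, image, port, and health check endpoint are all illustrative; substitute your own values.

  import boto3

  ecs = boto3.client("ecs")

  response = ecs.register_task_definition(
      family="web-app",
      requiresCompatibilities=["FARGATE"],
      networkMode="awsvpc",
      cpu="256",
      memory="512",
      containerDefinitions=[
          {
              "name": "web",
              # Illustrative image URI
              "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest",
              # Criterion 1: if this essential container stops, the task is unhealthy.
              "essential": True,
              # Criterion 2: the HAQM ECS agent runs this command in the container;
              # a non-zero exit code marks the container, and thus the task, unhealthy.
              "healthCheck": {
                  "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
                  "interval": 30,
                  "timeout": 5,
                  "retries": 3,
                  "startPeriod": 60,
              },
              "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
          }
      ],
  )
  print(response["taskDefinition"]["taskDefinitionArn"])

The third criterion, the external load balancer health check, is configured on the HAQM ECS service and its target group rather than in the task definition.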

Task replacement behavior

HAQM ECS replaces a task in two main circumstances:

  1. During a fresh deployment triggered by the UpdateService API call. Any existing tasks that are part of the previous deployment must be replaced by new tasks that are part of the new deployment.
  2. When an existing task inside an active deployment becomes unhealthy. Unhealthy tasks must be replaced in order to maintain the desired count of healthy tasks.

From early on in the history of HAQM ECS, the behavior of task replacement during rolling deployments has been configurable using two properties of the HAQM ECS service:

  • maximumPercent – This controls the upper limit on the number of running tasks, as a percentage of the service’s desired count, during a deployment or task replacement. For example, if the maximumPercent is 200% and the desired count for the service is eight tasks, then HAQM ECS can launch additional tasks up to a total of 16 tasks.
  • minimumHealthyPercent – This controls the lower limit on the number of running tasks, as a percentage of the desired count, that must stay running during a deployment. For example, if minimumHealthyPercent is 75% and the desired count for the service is eight tasks, then HAQM ECS can stop two tasks, reducing the service deployment down to six running tasks. Both settings are shown in the sketch below.

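Both settings live in the service’s deployment configuration. As a point of reference, here is a minimal boto3 sketch that applies the two example values above to an existing service; the cluster and service names are illustrative.

  import boto3

  ecs = boto3.client("ecs")

  ecs.update_service(
      cluster="production",          # illustrative cluster name
      service="web-app",             # illustrative service name
      desiredCount=8,
      deploymentConfiguration={
          # Allow up to 16 running tasks (200% of the desired count of eight).
          "maximumPercent": 200,
          # Keep at least six tasks (75% of eight) running during a deployment.
          "minimumHealthyPercent": 75,
      },
  )
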
The maximumPercent and minimumHealthyPercent settings have functioned for many years as effective controls for fine-tuning the behavior of rolling deployments when running HAQM ECS tasks on HAQM Elastic Compute Cloud (HAQM EC2) capacity. However, these deployment controls make less sense as more HAQM ECS users choose serverless AWS Fargate capacity. In most cases, modern applications don’t require HAQM ECS to go below the desired count of running tasks during a rolling deployment, or to reduce the number of additional tasks being launched, because AWS Fargate utilization isn’t constrained by how many underlying HAQM EC2 instances you have registered into your cluster.

Additionally, the maximumPercent and minimumHealthyPercent controls were originally ignored when it came to replacing unhealthy tasks. If tasks became unhealthy, then your service’s running task count could dip well below the threshold defined by minimumHealthyPercent. For example, if you were running eight tasks and four of them became unhealthy, then HAQM ECS would terminate the four unhealthy tasks and launch four replacement tasks. The number of running tasks would temporarily dip to 50% of the desired count.

Updates to how HAQM ECS replaces unhealthy tasks

As of October 20, 2023, HAQM ECS now uses your maximumPercent whenever possible when replacing unhealthy tasks. Let’s look at a few scenarios to understand how this works:

Crashing tasks

You’re running a service with a desired count of eight tasks and a maximum percent of 200%. Four of your eight tasks encounter critical runtime exceptions. Their processes crash and exit, which causes an essential container to exit. HAQM ECS observes that four of the eight tasks have gone unhealthy because their essential container exited. HAQM ECS can’t avoid the running task count dipping below the desired count in this case, because the containers stopped on their own. The running task count dips to 50% of the desired count briefly, but HAQM ECS launches four replacement tasks as quickly as possible to bring the number of running tasks back up to the desired count of eight tasks.
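
If you want to confirm after the fact why tasks in a scenario like this were replaced, you can inspect recently stopped tasks for the service. A minimal boto3 sketch follows; the cluster and service names are illustrative, and stopped tasks are only visible for a limited time after they stop.

  import boto3

  ecs = boto3.client("ecs")

  # List recently stopped tasks for the service and print why they stopped.
  stopped = ecs.list_tasks(cluster="production", serviceName="web-app", desiredStatus="STOPPED")
  if stopped["taskArns"]:
      tasks = ecs.describe_tasks(cluster="production", tasks=stopped["taskArns"])
      for task in tasks["tasks"]:
          print(task["taskArn"], task.get("stopCode"), task.get("stoppedReason"))
          for container in task["containers"]:
              # A non-zero exit code on an essential container is what made the task unhealthy.
              print("  ", container["name"], "exit code:", container.get("exitCode"))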

Frozen tasks

You’re running a service with a desired count of eight tasks and a maximum percent of 200%. Because of an endless loop in your code, four of your eight tasks freeze up, but the processes stay running. The attached load balancer that is sending health check requests to the service observes that the target containers are no longer responsive, so it marks those targets as unhealthy. HAQM ECS considers those four frozen tasks to be unhealthy. The maximum percent for the service allows it to go up to 16 tasks, so HAQM ECS launches four additional replacement tasks for the four unhealthy tasks, making a total of 12 running tasks. Once the four additional tasks have become healthy, HAQM ECS stops the four unhealthy tasks, which brings the running task count back down to the desired count of eight tasks.
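
To make the arithmetic in these scenarios easier to follow, here is a small illustrative helper that computes how many replacements can be launched in parallel under a given maximum percent. It mirrors the numbers used in this post, not the scheduler’s internal logic, and assumes simple integer rounding.

  def parallel_replacements(desired: int, running: int, unhealthy: int, maximum_percent: int) -> int:
      # Task ceiling allowed by maximumPercent.
      ceiling = desired * maximum_percent // 100
      # Headroom left for extra tasks before anything has to be stopped.
      headroom = max(ceiling - running, 0)
      # HAQM ECS can start at most this many replacements before stopping unhealthy tasks.
      return min(unhealthy, headroom)

  # Frozen tasks scenario: desired 8, 8 running, 4 unhealthy, maximum percent 200%.
  print(parallel_replacements(8, 8, 4, 200))  # 4 -> 12 tasks run until the replacements are healthy

  # With a maximum percent of 100% there is no headroom, so unhealthy tasks
  # must be stopped before replacements launch (see the later scenarios).
  print(parallel_replacements(8, 8, 4, 100))  # 0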

Overburdened tasks

You’re running a service with a desired count of eight tasks and a maximum percent of 150%. The service has autoscaling rules attached to it. It also has a load balancer attached, and a large spike of traffic arrives via the load balancer. The spike of traffic is so large that response time from the tasks rises dramatically. As a result of the high response time, the load balancer health check fails and the ELB marks all eight targets as unhealthy. Because there are no healthy targets left, the ELB fails open and continues distributing traffic to all targets.

HAQM ECS observes that all eight tasks are unhealthy and wants to replace them. The maximum percent of 150% allows the service to go up to 12 running tasks, so HAQM ECS avoids stopping the unhealthy running tasks immediately. Instead, it launches four replacement tasks in parallel with the existing eight unhealthy tasks. Fortunately, these four additional tasks give the ELB more targets to distribute traffic across, and all 12 of the running tasks stabilize in health because they are now able to handle the incoming traffic without timing out. HAQM ECS observes that there are now 12 healthy running tasks.

Meanwhile, an Application Auto Scaling rule kicks in based on high CPU utilization across the original eight running tasks. The rule updates the desired count for the HAQM ECS service from eight running tasks to 10 running tasks. Therefore, HAQM ECS only stops two of the 12 healthy running tasks, which reduces the task count back down to its current desired count of 10 running tasks.
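
A scenario like this assumes the service has an Application Auto Scaling target tracking policy attached. For reference, here is a minimal boto3 sketch of such a policy on average CPU utilization; the cluster name, service name, capacity bounds, target value, and cooldowns are illustrative.

  import boto3

  autoscaling = boto3.client("application-autoscaling")

  # Register the service's desired count as a scalable target.
  autoscaling.register_scalable_target(
      ServiceNamespace="ecs",
      ResourceId="service/production/web-app",
      ScalableDimension="ecs:service:DesiredCount",
      MinCapacity=8,
      MaxCapacity=32,
  )

  # Scale out when average CPU utilization across the service exceeds the target.
  autoscaling.put_scaling_policy(
      PolicyName="cpu-target-tracking",
      ServiceNamespace="ecs",
      ResourceId="service/production/web-app",
      ScalableDimension="ecs:service:DesiredCount",
      PolicyType="TargetTrackingScaling",
      TargetTrackingScalingPolicyConfiguration={
          "TargetValue": 60.0,
          "PredefinedMetricSpecification": {
              "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
          },
          "ScaleOutCooldown": 60,
          "ScaleInCooldown": 120,
      },
  )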

Limited maximum percent

You’re running a service with a desired count of eight tasks and, because of downstream limits or infrastructure constraints, you have set a maximum percent of 100%. This doesn’t allow HAQM ECS to launch any additional tasks in parallel with your eight running tasks. If a task from this deployment freezes, or becomes overburdened and starts failing health checks, then HAQM ECS needs to replace it. HAQM ECS stops the unhealthy task first, then launches a replacement task after the unhealthy task has stopped. This means the running task count still temporarily dips below the desired count.

Task fails health checks during a rolling deployment

You’re running a service with a desired count of eight tasks and a maximum percent of 150%. You’re doing a rolling deployment to update your running tasks to a new task definition. Because the maximum percent is 150%, HAQM ECS can launch additional tasks in parallel with your currently running tasks. The rolling deployment has already triggered four additional task launches. The service currently has 12 running tasks: eight old tasks and four new tasks.

During this rolling deployment, some of the old tasks begin failing a health check because of an unexpected bug. Because there’s an active rolling deployment occurring, HAQM ECS resorts to terminating the unhealthy tasks immediately and replacing them with instances of the new task as quickly as possible. During a rolling deployment, HAQM ECS always tries to replace failing tasks with tasks from the new active deployment.

Ongoing task failures because of external factors

You’re running a service with a desired count of eight tasks and a maximum percent of 150%. One of the downstream services that your code depends on starts returning an unexpected response, and this causes your code to start failing health checks. HAQM ECS sees that the eight tasks are unhealthy and need to be replaced, so it launches four additional replacement tasks in parallel with the eight initial tasks. At this point there are 12 tasks running: eight original tasks and four replacement tasks. Unfortunately, all 12 tasks are unhealthy, because the replacement tasks rely on the same unreliable downstream service as the original tasks.

Because the replacement tasks did not stabilize, and HAQM ECS sees that the number of unhealthy tasks is greater than the desired count for the service, HAQM ECS stops four of the unhealthy tasks at random to bring the number of unhealthy tasks back down to the desired count. HAQM ECS does not maintain stateful knowledge of which unhealthy tasks were “original” and which were “replacements”. Once enough of the excess unhealthy tasks have been stopped and there is room for additional tasks, HAQM ECS attempts to launch replacement tasks again. This cycle continues until the downstream service becomes reliable again, or until you make an UpdateService API call to roll out a code update that handles the failure condition more gracefully.
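
In a situation like this, the service’s event stream shows the repeated replacement attempts, and deploying a fixed task definition revision with UpdateService is what breaks the cycle. A minimal boto3 sketch, with an illustrative cluster name, service name, and task definition revision:

  import boto3

  ecs = boto3.client("ecs")

  # Inspect recent service events to spot the replace-and-fail cycle.
  service = ecs.describe_services(cluster="production", services=["web-app"])["services"][0]
  for event in service["events"][:10]:
      print(event["createdAt"], event["message"])

  # Roll out a task definition revision that handles the downstream failure gracefully.
  ecs.update_service(
      cluster="production",
      service="web-app",
      taskDefinition="web-app:43",  # illustrative revision
  )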

Health checks and responsive absorption of workload spikes

Previously, HAQM ECS always stopped an unhealthy task first, then launched a replacement task. This behavior made sense in a world where tasks were bin-packed densely onto a statically sized cluster of HAQM EC2 instances that had no room to launch a replacement task without first stopping an existing task. But many modern container workloads now run on serverless AWS Fargate capacity. There’s no need to stop an unhealthy running task to make room for its replacement, because AWS Fargate can supply as much on-demand container capacity as needed. Additionally, many customers of HAQM ECS on HAQM EC2 now use HAQM ECS capacity providers to launch additional HAQM EC2 instances on demand, rather than deploying to statically sized clusters of HAQM EC2 instances. Therefore, HAQM ECS now prioritizes using the maximumPercent for a service, and whenever possible it keeps unhealthy tasks running until after their replacements have become healthy.

Additionally, the new HAQM ECS task replacement behavior helps prevent runaway task termination. In some cases, a large workload spike caused a few tasks from the deployment to become unhealthy, which triggered their replacement. However, when HAQM ECS stopped the unhealthy tasks in order to launch replacements, the load balancer shifted more workload onto the remaining healthy tasks, which caused them to go unhealthy as well. In quick succession, all remaining healthy tasks would be overwhelmed with workload, causing a cascade of runaway health check failures until every task had gone unhealthy.

Eventually, Application Auto Scaling rules would kick in and scale up the deployment to a large enough size to handle the workload. But in most cases, a traffic spike causes the load balancer health checks to fail before it triggers aggregate resource consumption-based autoscaling. Auto scaling rules need to observe at least one minute of high average resource utilization before they react by scaling out the container deployment. However, an overburdened task may begin failing load balancer health checks immediately.

In the scenario where your tasks are unhealthy because they are dealing with a large spike of incoming workload, the new task replacement behavior of HAQM ECS dramatically improves the availability and reliability of your service. HAQM ECS catches health check failures and proactively launches parallel replacement tasks that can help absorb the incoming workload spike before autoscaling rules even trigger. Once autoscaling rules trigger, both the original tasks and the replacement tasks are retained, as long as they are healthy and fall within the service’s current desired task count.

Conclusion

In this post, we explained new HAQM ECS behavior when handling unhealthy tasks. As more customers adopt HAQM ECS for their mission critical applications, we are always happy to tackle challenging new orchestration problems at scale. This updated task replacement behavior is designed to help serve the needs of customers both small and large. It helps keep your container deployments online and available—even in adverse circumstances such as application failure or traffic spikes.

Please visit the HAQM ECS public roadmap for more info on additional upcoming features for HAQM ECS or to create your own issue to request a change or new feature.

For more info on HAQM ECS scheduler behavior, see the official documentation, under Service Scheduler Concepts.