HAQM ECS enables easier EC2 capacity management, with managed instance draining

Introduction

HAQM Elastic Container Service (ECS) deploys and manages your containerized tasks on AWS infrastructure. Customers can avoid the need to maintain compute instances by using HAQM ECS to deploy tasks on serverless AWS Fargate capacity. But some customers prefer to use HAQM ECS with HAQM Elastic Compute Cloud (HAQM EC2) as capacity. Using HAQM EC2 instances as capacity to run your container gives you greater control over the underlying compute infrastructure, but this comes with the downside of increased maintenance overhead. Traditionally, it has been necessary for cluster operators to build custom tooling for HAQM EC2 instance maintenance in order to avoid unexpectedly disrupting the container workloads running on the HAQM EC2 instances. HAQM ECS has the built-in ability to drain tasks that are running on an HAQM EC2 instance and move these tasks to another instance, in order to allow the original instance to be replaced or terminated. However, utilizing this draining feature required customers to implement a custom solution that relied on Auto Scaling lifecycle hooks to set container instances to draining while all tasks were drained.

HAQM ECS now provides managed instance draining as a built-in feature of HAQM ECS capacity providers. This new feature enables HAQM ECS to safely and automatically drain tasks from HAQM EC2 instances that are part of an HAQM EC2 Auto Scaling Group associated with an HAQM ECS capacity provider. This simplification will allow many HAQM ECS customers to eliminate custom lifecycle hooks that they were previously using to drain HAQM EC2 instances. Customers can now perform infrastructure updates such as rolling out a new version of the ECS agent by seamlessly using Auto Scaling Group instance refresh, with HAQM ECS ensuring workloads are not interrupted.

Understanding instance refresh and instance draining

EC2 Auto Scaling Groups are the primary mechanism for scaling out a fleet of EC2 instances. EC2 instances in an Auto Scaling Group are configured using a Launch Template, which defines what type of EC2 instance to launch, and what HAQM Machine Image (AMI) to base the EC2 instance on. For HAQM ECS you should launch HAQM EC2 instances that are based off of the ECS Optimized AMI. This special AMI comes with everything needed to connect the instance to an HAQM ECS cluster, and launch container workloads on the host. New versions of the ECS Optimized AMI are regularly released in order to provide new HAQM ECS features and patches to bugs or security vulnerabilities in the underlying host operating system. Auto Scaling Group instance refresh is one solution for updating the AMI that is powering the HAQM EC2 instances in your cluster. It helps you to relaunch HAQM EC2 instances in the Auto Scaling Group by providing a new Launch Template that references the latest version of the HAQM ECS optimized AMI.

There are a variety of ways that you can trigger an AMI update for your EC2 Auto Scaling Group. You may wish to automate instance refreshes by periodically triggering an instance refresh via EventBridge Scheduler and a Lambda function. If you prefer to use AWS CloudFormation or AWS Cloud Development Kit, you can also use the UpdatePolicy setting to configure an infrastructure as code driven rolling replacement of your EC2 instances.

No matter how you choose to refresh your EC2 instances, it will cause HAQM EC2 instances that are part of an Auto Scaling Group to be replaced. If an HAQM EC2 instance is stopped and replaced while it is running tasks, then those tasks will be stopped as well. HAQM ECS will detect the loss of any task that is part of an HAQM ECS service. In order to maintain the service’s desired count of tasks, HAQM ECS will launch a replacement task onto a different HAQM EC2 instance in the cluster. But this is a reactive fail-safe that only happens after a container has already been stopped, and the service’s running task count has already dropped below it’s desired count of tasks.

Managed instance draining is proactive, rather than reactive. Whenever an HAQM EC2 instance is set to be terminated by the Auto Scaling Group, HAQM ECS will temporarily delay the instance termination, and automatically place the instance into draining mode. This draining mode prevents any more task launches on the instance, and causes any service launched tasks running on the instance to be proactively replaced onto new hosts. The task replacement attempts to launch new replacement tasks prior to stopping existing tasks that are currently running on the draining HAQM EC2 instance. You can read more about draining behavior under Container Instance Draining in the official HAQM ECS documentation.

New managed instance draining interacts with the following HAQM ECS features that are also designed to keep your workloads running and available.

Managed termination protection

When you configure an HAQM ECS capacity provider attached to an Auto Scaling Group, one of the available options in HAQM ECS is managed termination protection, which sets scale in protection of HAQM EC2 instances that are running HAQM ECS tasks. Once enabled, any HAQM EC2 instance that is running one or more tasks will be ineligible to be stopped while the Auto Scaling Group is being scaled in. Managed scale in protection does not protect against all forms of HAQM EC2 instance termination, but it does prevent many of the situations where an HAQM EC2 instance would need to be drained. However, it is still good to enable both managed scale in protection and managed draining. When both features are enabled you have maximum protection against interruptions of your production workload. Managed scale in protection blocks many types of disruptive EC2 instance termination, and managed draining ensures that when an EC2 instance does need to be terminated, it’s running workloads are handled gracefully.

Stop timeout

When defining an HAQM ECS task, you can specify a stop timeout for the task. If not specified, this stop timeout defaults to 30 seconds. When HAQM ECS is draining an HAQM EC2 instance, it will send a SIGTERM stop signal to each task container that needs to be stopped, and then wait for the duration of the stop timeout to see if the container gracefully exits on its own. If the container does not gracefully exit by the time the stop timeout has passed, then HAQM ECS will send a SIGKILL signal to force stop the container’s process. If your task is doing some heavy work that you can’t complete quickly, you can configure a longer stop timeout on the task. Managed instance draining will keep the HAQM EC2 instance in a draining status until the task gracefully exits on its own, or until the stop timeout period is exceeded and HAQM ECS force stops the task. However, while HAQM ECS stop timeout for HAQM EC2 tasks can be set to wait for years if you wish, the HAQM EC2 draining period can not exceed 48 hours. Therefore it is not advisable to set task stop timeout to greater than 48 hours.

Task protection

HAQM ECS gives running tasks the ability to mark themselves as protected. This feature tells HAQM ECS that the task is busy doing important work, and the task should not be stopped. If you are using managed termination protection, then an instance that is running such a protected task is already partially protected from being stopped as part of an Auto Scaling Group scale-in. However, if managed termination protection is disabled, then the HAQM EC2 Auto Scaling Group is not aware of whether the HAQM EC2 instance has protected tasks on it. Additionally, an HAQM EC2 instance refresh can be configured to replace HAQM EC2 instances even if they are protected from scale-in by HAQM ECS managed termination protection. The Auto Scaling Group can still choose to terminate an HAQM EC2 instance that is hosting protected tasks. This will place the HAQM EC2 instance into a draining state, and any running tasks on the instance will immediately be sent a SIGTERM signal to stop, irrespective of their task protection status. You can use a stop timeout on the task to delay the task force quit, and hold the HAQM EC2 instance in a draining state for up to 48 hours.

In general it is considered best practice to set a stop timeout for any task that will also be using the task protection feature. Additionally, any mission critical tasks that will be hosted on HAQM EC2 may wish to use the HAQM ECS metadata endpoint to discover the ID of the instance that is hosting the task, then use the HAQM EC2 ModifyInstanceAttribute API to set the disableApiStop or disableApiTermination attributes on the HAQM EC2 instance that hosts the task. This will provide an additional layer of protection against any automated HAQM EC2 instance disrupting actions.

Standalone tasks launched with RunTask API

Service launched tasks can be safely drained from an HAQM EC2 instance because they are part of a replica set, and the service can always launch other tasks on other HAQM EC2 instances. However, HAQM ECS does not drain standalone tasks that were launched with the RunTask API. Rather HAQM ECS waits for these tasks to exit on their own. As long as these tasks are running they will keep the HAQM EC2 instance in a draining status. The HAQM EC2 instance will be allowed to remain in a draining state for up to 48 hours. After that point all tasks on the instance will be force stopped, whether they were RunTask launched, or CreateService launched.

Conclusion

The new HAQM ECS managed instance draining feature originated directly from the feature requests on our public container roadmap. The open source RFC received over 300 reactions on Github. We appreciate your continued interest and engagement with the future of HAQM ECS, and we are excited to continue delivering powerful features that make it easier to deploy containers at scale on AWS. If you have a feature request of your own please submit an issue to our public roadmap, or upvote an existing feature request to express your interest in it.

To get started with managed instance draining please visit the HAQM ECS documentation for step by step details on how to enable managed draining for your capacity provider and autoscaling group.

Containers