Implement HAQM ECS Anywhere enhanced workload resilience in disconnected scenarios

Introduction

HAQM Elastic Container Service (ECS) Anywhere is a feature of HAQM ECS that lets you run and manage container workloads on your infrastructure. This feature helps you meet compliance requirements and scale your business without sacrificing your on-premises investments.

When extending HAQM ECS to customer-managed infrastructure, external instances are registered to a managed HAQM ECS cluster hosted in an AWS Region. External instances are compute resources (i.e., hosts) external to an AWS region where HAQM ECS can schedule tasks to run. External instances are typically an on-premises server or virtual machine (VM).

HAQM ECS Anywhere currently supports operation in deployment scenarios where there’s consistent and reliable network connectivity between external instances and the HAQM ECS cluster. HAQM ECS monitors for errors or failures that occur to managed containers running on external instances, and restarts any containers that have stopped due to an error. Although HAQM ECS Anywhere doesn’t support a fully disconnected mode of operation, the ECS agent can manage container restarts locally if a restart policy is enabled, even without connectivity to the ECS cluster. However, if the ECS agent itself fails, containers won’t be restarted regardless of network status.

The open source HAQM ECS External Instance Network Sentry (eINS) has been developed to augment the function of HAQM ECS Anywhere, by providing an additional layer of resilience for HAQM ECS external instances in deployment scenarios where connectivity to the HAQM ECS control-plane may be unreliable or intermittent.

The eINS is designed to detect any loss of network connectivity between an external instance and the associated HAQM ECS cluster, and to ensure that, for the duration of the outage, any HAQM ECS-managed containers are restarted in the following circumstances:

the container exits due to an error, which manifests as a non-zero exit code;
the Docker daemon is restarted;
the external instance is rebooted.

This post describes how to implement the eINS to provide an additional layer of resilience for HAQM ECS external instances in deployment scenarios where connectivity to the associated HAQM ECS cluster may be unreliable or intermittent.

Note: The eINS isn’t an officially supported feature of HAQM ECS. Please submit an eINS GitHub issue for any feature requests, bugs, or documentation improvements.

Solution overview

The eINS is a Python application that can either be run manually or be configured to run as a service on HAQM ECS Anywhere external instances. See the Walkthrough section below for instructions covering both deployment scenarios.

eINS regular operation with region connectivity

When running on an HAQM ECS external instance, the function of the eINS is entirely automatic. The following Connected and Disconnected Operation scenarios provide a detailed description of how the eINS functions as the availability of the on-region HAQM ECS control plane changes over time.

Connected operation

This scenario describes eINS behavior during periods when the on-region HAQM ECS control plane is reachable.

The eINS periodically attempts to establish a network connection with the HAQM ECS on-region control-plane to determine region availability status, and the on-region HAQM ECS control-plane responds without error.

In reference to the diagram:

eINS network connection with the HAQM ECS on-region control-plane [1] completes successfully:
- eINS takes no further action.
In communication with the on-region control-plane [2] the HAQM ECS agent on the external instance orchestrates local managed container lifecycle, including restarting containers which exit due to error condition [3].
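
The periodic connectivity test described above can be sketched roughly as follows. This is a minimal illustration, not the eINS source: the function names are hypothetical, the hostname pattern ecs.<region>.amazonaws.com is an assumption about the regional public endpoint, and the sketch only checks that a TCP connection opens, whereas the eINS log output suggests the real check also sends and receives data.

```python
import socket


def ecs_endpoint(region: str) -> str:
    # Assumed hostname pattern for the regional ECS public endpoint.
    return f"ecs.{region}.amazonaws.com"


def endpoint_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the name and connects. DNS failures,
        # refused connections, and timeouts are all OSError subclasses.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A monitoring loop would call endpoint_reachable(ecs_endpoint("ap-southeast-2")) once per interval and take no further action while it returns True.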

Disconnected operation

This scenario describes eINS behavior during periods where the on-region HAQM ECS control plane is unreachable.

eINS operation with no region connectivity

In reference to the diagram:

eINS network connection with the HAQM ECS on-region control-plane [1] times out or returns an error condition:
- The HAQM ECS agent is paused [3] via the local Docker API [2]*.
- eINS updates the Docker restart policy to on-failure for each HAQM ECS-managed container [4]. This ensures that any HAQM ECS-managed container restarts if it exits due to an error, the Docker daemon is restarted, or the external instance is rebooted.
When the HAQM ECS control-plane becomes reachable:
- HAQM ECS-managed containers that have been automatically restarted by the Docker daemon during network outage are stopped and removed.**
- HAQM ECS-managed containers that haven’t been automatically restarted during the network outage have their Docker restart policy set back to no.
- The local HAQM ECS agent is un-paused.

At this point the operational environment has been restored back to the Connected Operation scenario. eINS continues to monitor for network outage or HAQM ECS control-plane error.
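
The two transitions above can be sketched with the Docker SDK for Python. This is an illustration under stated assumptions rather than the eINS implementation: the function names are hypothetical, the com.amazonaws.ecs.cluster label is assumed to identify HAQM ECS-managed containers, ecs-agent is assumed to be the agent container name, and a non-zero RestartCount in the container's inspect data is assumed to be how a Docker-initiated restart is detected.

```python
ECS_CLUSTER_LABEL = "com.amazonaws.ecs.cluster"  # assumed label on ECS-managed containers
ECS_AGENT_NAME = "ecs-agent"                     # assumed name of the agent container


def is_ecs_managed(container) -> bool:
    return ECS_CLUSTER_LABEL in (container.labels or {})


def go_offline(agent, containers, retries: int = 0) -> None:
    """Pause the agent, then let Docker restart managed containers on failure."""
    agent.pause()
    policy = {"Name": "on-failure", "MaximumRetryCount": retries}
    for c in filter(is_ecs_managed, containers):
        c.update(restart_policy=policy)


def go_online(agent, containers) -> None:
    """Remove restarted containers, reset the others, then unpause the agent."""
    for c in filter(is_ecs_managed, containers):
        c.reload()  # refresh attrs so RestartCount is current
        if c.attrs.get("RestartCount", 0) > 0:
            # Restarted by Docker during the outage: remove to avoid duplication.
            c.stop()
            c.remove()
        else:
            c.update(restart_policy={"Name": "no", "MaximumRetryCount": 0})
    agent.unpause()


def example_with_local_docker() -> None:
    import docker  # Docker SDK for Python

    client = docker.from_env()
    agent = client.containers.get(ECS_AGENT_NAME)
    go_offline(agent, client.containers.list(all=True), retries=0)
```

Keeping the Docker client out of go_offline and go_online means the logic can be exercised with stand-in objects before it is pointed at a real Docker daemon.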

Notes

*The HAQM ECS agent is paused because, if left running, it would detect and kill HAQM ECS-managed containers that have been restarted by the Docker daemon during the network outage.

**These containers are stopped and removed by eINS to avoid duplication:

Containers that have been restarted by the Docker daemon during a network outage become orphaned by HAQM ECS once back online.
The related HAQM ECS tasks are relaunched by HAQM ECS on the external instance once the HAQM ECS agent has established communication with the control-plane.

Configuration parameters

The eINS accepts configuration parameters as command-line arguments. Running the application with the --help parameter generates a summary of available parameters:

$ python3 ecs-external-instance-network-sentry.py --help
usage: ecs-external-instance-network-sentry [-h] -r REGION [-i INTERVAL] [-n RETRIES] [-l LOGFILE] [-k LOGLEVEL]

Purpose:
--------------
For use on ECS Anywhere external hosts:
Configures ECS orchestrated containers to automatically restart
on failure when on-region ecs control-plane is detected to be unreachable.

Configuration Parameters:
--------------
  -h, --help            Show this help message and exit.
  -r REGION, --region REGION
                        AWS region where ecs cluster is located.
  -i INTERVAL, --interval INTERVAL
                        Interval in seconds sentry will sleep between connectivity checks.
  -n RETRIES, --retries RETRIES
                        Number of times Docker will restart a crashing container.
  -l LOGFILE, --logfile LOGFILE
                        Logfile name & location.
  -k LOGLEVEL, --loglevel LOGLEVEL
                        Log data event severity.

Configuration parameters are described in further detail below:

--region

Provide the name of the AWS region where the HAQM ECS cluster that manages the external instance is hosted. eINS attempts to establish a network connection to the HAQM ECS public endpoint at the nominated region to evaluate HAQM ECS control-plane availability.

optional=no
default=""

$ python3 ecs-external-instance-network-sentry.py --region ap-southeast-2

--interval

Specify the number of seconds between connectivity tests.

optional=yes
default=20

$ python3 ecs-external-instance-network-sentry.py --region ap-southeast-2 --interval 15

--retries

Specify the number of times failing containers will be restarted during periods where the HAQM ECS control-plane is unavailable. The default setting is 0, which configures the Docker daemon to restart containers an unlimited number of times.

optional=yes
default=0

$ python3 ecs-external-instance-network-sentry.py --region ap-southeast-2 --interval 15 --retries 5

--logfile

Specify the log file name and file-system path. The default value is /tmp/ecs-external-instance-network-sentry.log.

optional=yes
default=/tmp/ecs-external-instance-network-sentry.log

$ python3 ecs-external-instance-network-sentry.py --region ap-southeast-2 --interval 15 --retries 5 --logfile /mypath/myfile.log

--loglevel

Specify log data event severity.

optional=yes
default=INFO

$ python3 ecs-external-instance-network-sentry.py --region ap-southeast-2 --interval 15 --retries 5 --logfile /mypath/myfile.log --loglevel DEBUG
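
For readers who want to see how the flags and defaults above fit together, the following argparse sketch mirrors the documented parameters. It is reconstructed from the help text and the defaults listed in this section, not copied from the eINS source, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ecs-external-instance-network-sentry")
    parser.add_argument("-r", "--region", required=True,
                        help="AWS region where ecs cluster is located.")
    parser.add_argument("-i", "--interval", type=int, default=20,
                        help="Interval in seconds sentry will sleep between connectivity checks.")
    parser.add_argument("-n", "--retries", type=int, default=0,
                        help="Number of times Docker will restart a crashing container.")
    parser.add_argument("-l", "--logfile",
                        default="/tmp/ecs-external-instance-network-sentry.log",
                        help="Logfile name & location.")
    parser.add_argument("-k", "--loglevel", default="INFO",
                        help="Log data event severity.")
    return parser


if __name__ == "__main__":
    # Region is the only required argument; everything else has a default.
    args = build_parser().parse_args(["--region", "ap-southeast-2"])
    print(args.interval, args.retries, args.logfile, args.loglevel)
```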

Walkthrough

Prerequisites

The following prerequisites should be implemented prior to deploying the eINS.

External instance host operating system

HAQM ECS Anywhere has been certified to run on a range of supported operating systems and system architectures. The eINS installation commands and procedure in this post have been tested on external instances provisioned with Ubuntu 20 as the host operating system. As the eINS is a Python application, it functions on the other supported Linux-based distributions and system architectures; however, the installation commands and procedure may vary.

HAQM ECS Anywhere

Each external instance you register with an HAQM ECS cluster requires the AWS Systems Manager Agent (SSM Agent), the HAQM ECS container agent, and Docker to be installed. To register the external instance to an HAQM ECS cluster, it must first be registered as an AWS Systems Manager managed instance. You can generate the comprehensive installation script in a few clicks on the HAQM ECS console. Follow the instructions as described here.

Python

The eINS has been developed and tested running on Python version 3.8.10.

Python Docker SDK

The eINS interacts with the Docker API, which requires installation of the Python Docker SDK on each external instance where the eINS will run. To install the Python Docker SDK, run the commands as follows:

# update package index files..
$ sudo apt-get update
# install python docker sdk..
$ python3 -m pip install docker

Clone the eINS git repository

On the HAQM ECS external instance, clone the ecs-external-instance-network-sentry repository:

# clone eins git repo..
$ git clone http://github.com/aws-samples/ecs-external-instance-network-sentry.git

Commands from this point forward assume that you’re in the root directory of the local git repository clone.

Manual operation

At this point, the external instance host operating system is ready to run the eINS. For testing or evaluation, the application can be launched manually according to the below procedure. However, it is recommended to configure the eINS as a Background Service in production deployment scenarios to ensure that the application is running at all times.

The eINS is located within the /python directory of the git repository. See the Configuration Parameters section for required and optional parameters to be submitted at runtime, and Logging to validate successful operation. Remember to provide the correct AWS region code:

# manual launch..
$ python3 python/ecs-external-instance-network-sentry.py --region ap-southeast-2

Background service

Configuring the application as an OS background service is an effective mechanism to ensure that the eINS remains running in the background at all times.

Service configuration requires the implementation of a unit configuration file, which encodes information about the process that will be controlled and supervised by systemd.

Configuration procedure

The following describes configuring the eINS as an OS background service.

Copy application and configuration files

Run the following commands to copy application and configuration files to the appropriate locations on the external instance file system:

# copy eins application file..
$ sudo cp python/ecs-external-instance-network-sentry.py /usr/bin
# copy eins service unit config file..
$ sudo cp config/ecs-external-instance-network-sentry.service /lib/systemd/system

Update service unit configuration file

Next, update the service unit configuration file /lib/systemd/system/ecs-external-instance-network-sentry.service.

$ cat /lib/systemd/system/ecs-external-instance-network-sentry.service
[Unit]
Description=HAQM ECS External Instance Network Service
Documentation=http://github.com/aws-samples/ecs-external-instance-network-sentry
Requires=docker.service
After=ecs.service

[Service]
Type=simple
Restart=on-failure
RestartSec=10s
ExecStart=python3 /usr/bin/ecs-external-instance-network-sentry.py --region <INSERT-REGION-NAME-HERE>

[Install]
WantedBy=multi-user.target

Make the necessary modifications to the ExecStart directive on line 11 of the service unit config file as follows:

Update the --region configuration parameter with the AWS region name where your on-region HAQM ECS cluster is provisioned.
Optionally, include any additional Configuration Parameters to suit the particular requirements of your deployment scenario.

Configure and start service

# reload systemd..
$ sudo systemctl daemon-reload
# enable eins service..
$ sudo systemctl enable ecs-external-instance-network-sentry.service
# start eins service..
$ sudo systemctl start ecs-external-instance-network-sentry

Check service status

To validate that the service has started successfully, run the following command. If the service has started correctly, the output should be similar to the following:

$ systemctl status ecs-external-instance-network-sentry

● ecs-external-instance-network-sentry.service - HAQM ECS External Instance Network Service
     Loaded: loaded (/lib/systemd/system/ecs-external-instance-network-sentry.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2021-07-30 07:57:08 UTC; 22min ago
       Docs: http://github.com/aws-samples/ecs-external-instance-network-sentry
   Main PID: 28366 (python3)
      Tasks: 1 (limit: 9412)
     Memory: 19.7M
     CGroup: /system.slice/ecs-external-instance-network-sentry.service
             └─28366 /usr/bin/python3 /usr/bin/ecs-external-instance-network-sentry.py --region ap-southeast-2 --interval 10 --retries 3 --logfile /tmp/ecs->

Jul 30 07:57:08 ubu20 systemd[1]: Started HAQM ECS External Instance Network Service.

Logging

The eINS has been configured to provide basic logging regarding its operation.

The default logfile location is /tmp/ecs-external-instance-network-sentry.log, which can be modified by submitting the --logfile configuration parameter.

Log level

By default, the loglevel is set to logging.INFO and can be updated at runtime using the --loglevel configuration parameter.

Log output

The following eINS log file excerpt illustrates:

A detected loss of connectivity to on-region control-plane, and associated Docker policy configuration actions for HAQM ECS managed containers;
Container cleanup and Docker policy configuration once HAQM ECS control-plane becomes reachable.

2021-07-10 09:00:01,200 INFO PID_713928 [startup] ecs-external-instance-network-sentry - starting..
2021-07-10 09:00:01,200 INFO PID_713928 [startup] arg - aws region: ap-southeast-2
2021-07-10 09:00:01,200 INFO PID_713928 [startup] arg - interval: 10
2021-07-10 09:00:01,201 INFO PID_713928 [startup] arg - retries: 0
2021-07-10 09:00:01,201 INFO PID_713928 [startup] arg - logfile: /tmp/ecs-external-instance-network-sentry.log
2021-07-10 09:00:01,201 INFO PID_713928 [startup] arg - loglevel: logging.INFO
......
2021-07-10 09:39:33,756 INFO PID_713928 [begin] connectivity test..
2021-07-10 09:39:33,757 INFO PID_713928 [connect] connecting to ecs at ap-southeast-2..
2021-07-10 09:39:33,757 INFO PID_713928 [connect] create network socket..
2021-07-10 09:39:43,764 ERROR PID_713928 [connect] error creating network socket: [Errno -3] Temporary failure in name resolution
2021-07-10 09:39:43,764 INFO PID_713928 [connect] connecting to host..
2021-07-10 09:39:43,765 INFO PID_713928 [ecs-offline] ecs unreachable, configuring container restart policy..
2021-07-10 09:39:43,880 INFO PID_713928 [ecs-offline] container name: ecs-alpine-crash-test-9adba798f5f189968701
2021-07-10 09:39:43,881 INFO PID_713928 [ecs-offline] ecs cluster: ecs-anywhere-cluster-1
2021-07-10 09:39:43,882 INFO PID_713928 [ecs-offline] set container restart policy: {'Name': 'on-failure', 'MaximumRetryCount': 0}
2021-07-10 09:39:43,958 INFO PID_713928 [ecs-offline] container name: ecs-nginx-1-nginx-eaa6e7a9b0cd88988201
2021-07-10 09:39:43,959 INFO PID_713928 [ecs-offline] ecs cluster: ecs-anywhere-cluster-1
2021-07-10 09:39:43,959 INFO PID_713928 [ecs-offline] set container restart policy: {'Name': 'on-failure', 'MaximumRetryCount': 0}
2021-07-10 09:39:44,022 INFO PID_713928 [ecs-offline] ecs agent paused..
2021-07-10 09:39:44,022 INFO PID_713928 [end] sleeping for 10 seconds..
......
2021-07-10 09:41:14,298 INFO PID_713928 [begin] connectivity test..
2021-07-10 09:41:14,299 INFO PID_713928 [connect] connecting to ecs at ap-southeast-2..
2021-07-10 09:41:14,299 INFO PID_713928 [connect] create network socket..
2021-07-10 09:41:23,133 INFO PID_713928 [connect] connecting to host..
2021-07-10 09:41:23,258 INFO PID_713928 [connect] send/receive data..
2021-07-10 09:41:30,563 INFO PID_713928 [connect] ecs at ap-southeast-2 is available..
2021-07-10 09:41:30,564 INFO PID_713928 [ecs-online] ecs is reachable..
2021-07-10 09:41:30,621 INFO PID_713928 [ecs-online] container name: ecs-alpine-crash-test-9adba798f5f189968701
2021-07-10 09:41:30,621 INFO PID_713928 [ecs-online] ecs cluster: ecs-anywhere-cluster-1
2021-07-10 09:41:30,622 INFO PID_713928 [ecs-online] container has been restarted by docker, stopping & removing..
2021-07-10 09:41:41,330 INFO PID_713928 [ecs-online] container name: ecs-nginx-1-nginx-eaa6e7a9b0cd88988201
2021-07-10 09:41:41,330 INFO PID_713928 [ecs-online] ecs cluster: ecs-anywhere-cluster-1
2021-07-10 09:41:41,331 INFO PID_713928 [ecs-online] set container restart policy: {'Name': 'no', 'MaximumRetryCount': 0}
2021-07-10 09:41:41,470 INFO PID_713928 [ecs-online] ecs agent unpaused..
2021-07-10 09:41:41,471 INFO PID_713928 [end] sleeping for 10 seconds..
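
Because the excerpt tags each line with [ecs-offline] or [ecs-online], the log can be scanned to measure how long each outage lasted. The following is a small sketch that assumes the line layout shown above (timestamp, level, PID, tag); outage_windows is a hypothetical helper and is not part of the eINS.

```python
import re
from datetime import datetime

# Matches the timestamp and the offline/online tag of an eINS log line.
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \w+ PID_\d+ \[(ecs-offline|ecs-online)\]"
)


def outage_windows(lines):
    """Yield (start, end) datetimes for each offline-to-online transition."""
    started = None
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if match.group(2) == "ecs-offline" and started is None:
            started = stamp  # first offline line opens the window
        elif match.group(2) == "ecs-online" and started is not None:
            yield started, stamp  # first online line closes it
            started = None
```

Passing an open log file to outage_windows yields one pair per outage, and subtracting the pair gives the outage duration as a timedelta.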

Log rotation

The log file rotates at 5 MB, and a history of the five most recent log files is maintained.
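
A comparable logging setup can be approximated with the Python standard library. This is a sketch, not the eINS code: build_logger is a hypothetical name, the format string is inferred from the log excerpt in this post, and backupCount=5 is an assumed mapping of "five most recent log files" onto RotatingFileHandler.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(logfile: str, loglevel: str = "INFO") -> logging.Logger:
    logger = logging.getLogger("ecs-external-instance-network-sentry")
    logger.setLevel(getattr(logging, loglevel.upper()))
    # Roll over at roughly 5 MB and keep five rotated files alongside the active one.
    handler = RotatingFileHandler(logfile, maxBytes=5 * 1024 * 1024, backupCount=5)
    # Inferred layout: timestamp, severity, process id, then the message.
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s PID_%(process)d %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Calling build_logger("/tmp/ecs-external-instance-network-sentry.log").info("[startup] starting..") writes a line in the same general shape as the excerpt above.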

Considerations

The eINS currently has the following limitation:

As described in the Disconnected Operation section, containers restarted during a period where the HAQM ECS control-plane is unavailable will be stopped and relaunched once the HAQM ECS control-plane becomes available.

Cleaning up

In order to avoid incurring future costs associated with this solution, follow this procedure to deregister your external instance from both HAQM ECS and AWS Systems Manager.

Following deregistration, the external instance is no longer able to accept new tasks. If you have tasks that are running on the external instance when you deregister it, the tasks remain running until they stop through some other means. However, these tasks are no longer monitored or accounted for by HAQM ECS.

Conclusion

In this post, we have provided a detailed overview of the open source HAQM ECS External Instance Network Sentry, and we’ve shown you how to implement the eINS as an operating system background service on your ECS Anywhere external instances.

If you are deploying your workloads to HAQM ECS Anywhere external instances and require enhanced workload resiliency during periods where the on-region control plane isn’t contactable, then the eINS is a great open source solution that provides enhanced availability. This might include external instances deployed in well-connected but mission-critical situations (e.g., data center, warehouse, manufacturing plant) or environments where internet connectivity may be more unreliable (e.g., maritime or rural use cases). To learn more, see ECS Anywhere in the HAQM ECS Developer Guide, and we encourage you to give it a try with the ECS Anywhere workshop as a next step.

A public version of our container services feature roadmap is available online. We know that our customers are making decisions and plans based on what we are developing, and we want to provide customers with the insights needed to appropriately plan for the future. If there are any features that you would like to be available, which are not currently on the feature roadmap, then please open an issue! Community submitted issues will be tagged “Proposed” and will be reviewed by the AWS team. You can read more information about how to contribute here.
