Announcing AWS Fault Injection Simulator new features for HAQM ECS workloads

Introduction

We are happy to announce new features in AWS Fault Injection Simulator (FIS) that allow you to inject a variety of faults into workloads running in HAQM Elastic Container Service (HAQM ECS) and HAQM Elastic Kubernetes Service (HAQM EKS). This post shows how to use the new AWS FIS actions with HAQM ECS.

AWS Fault Injection Simulator (FIS) is a fully managed service that helps you test your applications for resilience to failures. AWS FIS follows the principles of chaos engineering and lets you simulate failures in your AWS environment, such as network outages, infrastructure failures, and service disruptions. AWS FIS experiments help you identify and fix potential problems before they cause outages in production.

New HAQM ECS Task actions

AWS FIS has added six new fault injection actions that target HAQM ECS workloads. The new HAQM ECS task actions cover stressing a task’s CPU (Central Processing Unit) or I/O, killing a process, and network faults such as blackholing traffic, added latency, and packet loss. These actions make it easy for you to evaluate your application’s reliability and resilience across a wide range of failure scenarios. If you are using AWS Fargate, you can run the CPU and I/O stress actions.

Action Identifier	Description	Applicable Compute Engine
aws:ecs:task-cpu-stress	Simulates CPU stress. Configurable parameters are the duration of the CPU stress test, the target CPU load percentage, and the number of CPU stressors.	HAQM EC2 and AWS Fargate
aws:ecs:task-io-stress	Simulates I/O stress. Configurable parameters are the duration of the I/O stress test, the percentage of free space on the file system to use during the test, and the number of mixed I/O stressors to use.	HAQM EC2 and AWS Fargate
aws:ecs:task-kill-process	Kills a specified process. Configurable parameters are the name of the process to stop and the signal to send along with the command.	HAQM EC2 only
aws:ecs:task-network-blackhole-port	Drops network traffic. Configurable parameters are the duration of the network blackhole, the port number, and the protocol.	HAQM EC2 only
aws:ecs:task-network-latency	Adds network latency. Configurable parameters are the duration of the network latency, the network interface, the delay (ms), the jitter (ms), and the sources.	HAQM EC2 only
aws:ecs:task-network-packet-loss	Simulates network packet loss. Configurable parameters are the duration of the network packet loss, the network interface, the percentage of packet loss, and the sources.	HAQM EC2 only
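If you want to see the exact parameter names an action accepts, the AWS CLI can describe the action for you. The following is a minimal sketch that assumes the AWS CLI is installed and configured; the function name describe_fis_action_parameters is our own, and the Region falls back to us-west-2, the Region used later in this post.

```shell
# Print the parameters that an AWS FIS action accepts, as JSON.
# Assumes the AWS CLI is installed and has credentials for your account.
# The function name is illustrative and not part of any AWS tooling.
describe_fis_action_parameters() {
  aws fis get-action \
    --id "$1" \
    --query 'action.parameters' \
    --output json \
    --region "${AWS_REGION:-us-west-2}"
}

# Example: describe_fis_action_parameters aws:ecs:task-cpu-stress
```

Running the function for each identifier in the table is a quick way to check which parameters are required before you write an experiment template.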

HAQM ECS Task actions under the hood

The following diagram shows how AWS FIS injects faults in HAQM ECS tasks. AWS FIS uses the AWS Systems Manager Agent (SSM Agent) to execute AWS FIS actions in HAQM ECS tasks. The SSM Agent sidecar enables AWS FIS to create a managed instance associated with your HAQM ECS tasks, which AWS FIS requires to inject faults. It also helps you troubleshoot and get insights through Systems Manager Run Command. To conduct AWS FIS experiments in your workload, you add the SSM Agent sidecar container to your task definition.

Diagram shows how AWS FIS injects faults in HAQM ECS tasks

Walkthrough

In the following sections, we walk through the steps to run AWS FIS experiments:

Set up infrastructure and deploy the sample app
Unpack the generated HAQM ECS task definition
Grant AWS FIS permissions to run experiments
Create an experiment to increase CPU stress
Create an experiment to kill a process in an HAQM ECS task

Prerequisite

For this walkthrough, you’ll need the following:

Step 1: Deploy the base infrastructure

We’ll use AWS CDK to create the base infrastructure, which includes an HAQM VPC (HAQM Virtual Private Cloud), an HAQM ECS cluster, AWS IAM (AWS Identity and Access Management) roles, and two HAQM EC2 instances, as well as an HAQM ECS service for experimenting with AWS FIS. The code is available in the ecs-blueprints GitHub repository.

Clone the sample code repository:

git clone http://github.com/aws-ia/ecs-blueprints.git
cd ecs-blueprints/cdk/examples/fis_service/

Set up the AWS account and AWS Region environment variables to match your environment. We’ll then generate a .env file to be used by the ECS Blueprints CDK template. In this post, we’ll use the Oregon Region (us-west-2) for our examples.

export AWS_ACCOUNT=$(aws sts get-caller-identity --query 'Account' --output text)
export AWS_REGION=${AWS_REGION:=us-west-2}

sed -e "s/<ACCOUNT_NUMBER>/$AWS_ACCOUNT/g" \
-e "s/<REGION>/$AWS_REGION/g" sample.env > .env

Create and activate a Python virtual environment, then install the required dependencies:

# manually create a virtualenv: 
python3 -m venv .venv

# activate your virtualenv:
source .venv/bin/activate

# install the required dependencies: 
python -m pip install -r ../../requirements.txt

Bootstrap CDK if this is your first time using CDK to create infrastructure:

cdk bootstrap aws://${AWS_ACCOUNT}/${AWS_REGION}

We’ll use CDK to create an HAQM ECS cluster and run a sample application in it to test AWS FIS actions. We recommend deploying this stack in a non-production account.

Deploy the CDK stack using the following command:

cdk deploy --all --require-approval never
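Once the deployment finishes, it can help to confirm that the service has reached its desired task count before you continue. The following is a minimal sketch; the cluster and service names (ecs-blueprint-infra and fis-service) are the ones used by the commands later in this post, and the function names are our own.

```shell
# Print "desired<TAB>running" task counts for an ECS service.
# Assumes the AWS CLI is installed and configured; function names are illustrative.
service_task_counts() {
  aws ecs describe-services \
    --cluster "$1" \
    --services "$2" \
    --query 'services[0].[desiredCount,runningCount]' \
    --output text \
    --region "${AWS_REGION:-us-west-2}"
}

# Succeed only when the running count matches the desired count.
service_is_steady() {
  local counts
  counts=$(service_task_counts "$1" "$2") || return 1
  # Split the two whitespace-separated numbers into positional parameters.
  set -- $counts
  [ "$#" -eq 2 ] && [ "$1" = "$2" ]
}

# Example:
# service_is_steady ecs-blueprint-infra fis-service && echo "service is ready"
```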

Step 2: Inspect the generated HAQM ECS Task definition

The CDK code creates an HAQM ECS cluster, a task definition, two m5.large HAQM EC2 instances, and a load-balanced service. The task consists of a web application container and the SSM Agent running as a sidecar container. The sidecar is configured as essential in the task definition, so if the sidecar stops, HAQM ECS terminates the task and starts a replacement task. The sidecar runs an activation script to register HAQM ECS tasks as managed instances in AWS Systems Manager.

The sidecar assumes the MANAGED_INSTANCE_ROLE_NAME AWS IAM role to register the HAQM ECS task as a managed instance in AWS Systems Manager. AWS FIS uses AWS Systems Manager to inject faults in HAQM ECS tasks. The role has the HAQMSSMManagedInstanceCore policy and the following permissions attached:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "ssm:DeleteActivation",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": "ssm:DeregisterManagedInstance",
            "Resource": "arn:aws:ssm:${AWS_REGION}:${AWS_ACCOUNT}:managed-instance/*",
            "Effect": "Allow"
        }
    ]
}

The HAQM ECS Task IAM role has the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::${AWS_ACCOUNT}:role/{MANAGED_INSTANCE_ROLE_NAME}",
            "Effect": "Allow"
        },
        {
            "Action": [
                "ssm:CreateActivation",
                "ssm:AddTagsToResource"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
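To check that the sidecar has registered your tasks, you can list the managed instances known to AWS Systems Manager. This is a minimal sketch with illustrative function names; it assumes the AWS CLI is configured, and it assumes that your registered tasks show up as managed instances (IDs starting with mi-) alongside any other hybrid-activated nodes in the account.

```shell
# List Systems Manager managed instances (non-EC2 nodes) with their ping status.
# Assumption: ECS tasks registered by the SSM Agent sidecar appear in this list.
list_managed_instances() {
  aws ssm describe-instance-information \
    --filters "Key=ResourceType,Values=ManagedInstance" \
    --query 'InstanceInformationList[].[InstanceId,PingStatus]' \
    --output text \
    --region "${AWS_REGION:-us-west-2}"
}

# Count the managed instances whose agent currently reports "Online".
count_online_managed_instances() {
  list_managed_instances | awk '$2 == "Online" { n++ } END { print n + 0 }'
}

# Example: count_online_managed_instances
```

With the sample service running three tasks, you would expect the count to be at least three once every sidecar has completed its activation.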

Step 3: Grant AWS FIS permissions to run experiments in your account

The AWS FIS service uses AWS IAM roles to perform experiments in customer accounts. Create a trust policy for AWS FIS’s IAM role:

cat > fis-trust-policy.json << EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                  "fis.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
EOF

Create an AWS IAM role for AWS FIS:

aws iam create-role --role-name ecs-fis-role \
 --assume-role-policy-document file://fis-trust-policy.json

Define the AWS IAM permissions the experiment needs to inject faults into HAQM ECS tasks:

cat > fis-ecs-experiment-policy.json << EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ssm:SendCommand",
                "ssm:ListCommands",
                "ssm:CancelCommand"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}
EOF

Attach these permissions to the role as an inline policy:

aws iam put-role-policy --role-name ecs-fis-role \
  --policy-name ecs-fis-policy \
  --policy-document file://fis-ecs-experiment-policy.json

AWS FIS provides controls and guardrails for you to run experiments safely. We can implement a stop condition based on an HAQM CloudWatch alarm that stops the experiment when the alarm gets triggered.

We’ll set the stop condition to track the number of running tasks. The sample service runs three tasks by default. If the experiment crashes a task and the number of running tasks drops below three, AWS FIS ends the experiment immediately to prevent further service disruption.

Create an HAQM CloudWatch alarm:

aws cloudwatch put-metric-alarm \
  --alarm-name 'ECS Sample service task count alarm' \
  --actions-enabled \
  --metric-name 'RunningTaskCount' \
  --namespace 'ECS/ContainerInsights' \
  --statistic 'Average' \
  --dimensions '[{"Name":"ServiceName","Value":"fis-service"},{"Name":"ClusterName","Value":"ecs-blueprint-infra"}]' \
  --period 300 \
  --evaluation-periods 1 \
  --datapoints-to-alarm 1 \
  --threshold 3 \
  --comparison-operator 'LessThanThreshold' \
  --treat-missing-data 'missing' \
  --region $AWS_REGION
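Before using the alarm as a stop condition, it is worth checking its current state, because an alarm that is already in the ALARM state is likely to halt an experiment right away. The sketch below uses illustrative function names and assumes the AWS CLI is configured.

```shell
# Print the current state of a CloudWatch alarm (OK, ALARM, or INSUFFICIENT_DATA).
alarm_state() {
  aws cloudwatch describe-alarms \
    --alarm-names "$1" \
    --query 'MetricAlarms[0].StateValue' \
    --output text \
    --region "${AWS_REGION:-us-west-2}"
}

# Succeed unless the alarm is currently firing.
alarm_is_clear() {
  [ "$(alarm_state "$1")" != "ALARM" ]
}

# Example:
# alarm_is_clear 'ECS Sample service task count alarm' || echo "alarm is firing"
```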

Step 4: Create an experiment to stress task’s CPU

A CPU stress test analyzes your application’s performance under heavy CPU usage. When your application is starved for CPU resources, its response time increases. Stress testing CPU can help you determine the robustness of your workload and customer impact when the underlying system is under heavy load. Observing your application’s behavior during CPU stress can uncover systemic weaknesses that can cause disruption or problems with data integrity.

For example, if your application calls a remote procedure, the call may take longer to complete during CPU stress. As a result, you may have to tune your timeouts or implement retries. If your application takes longer to respond to client requests, you may want to scale out your service to compensate for heavy system load.
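As an illustration of the retry idea, here is a minimal sketch of a retry helper with exponential backoff. The function name, the attempt count, the delays, and the endpoint in the example are all hypothetical; the right values depend on your workload.

```shell
# Run a command up to max_attempts times, doubling the pause after each failure.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while true; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    # Give up once the attempt budget is spent.
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example (hypothetical endpoint): allow 2 seconds per request, try up to 4 times.
# retry_with_backoff 4 1 curl --silent --fail --max-time 2 http://example.internal/health
```

Combining a per-request timeout with a bounded number of retries keeps a slow dependency from stalling the caller indefinitely, which is the behavior a CPU stress experiment helps you validate.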

Let’s create an AWS FIS experiment template to stress the CPU in 66% (2/3) of the three tasks backing the HAQM ECS service. You can target tasks using ARNs (HAQM Resource Names), tags, filters, and parameters.

Here’s the reason we selected 66%: this cluster runs three tasks on two HAQM EC2 instances. It’s possible that high CPU utilization in one node affects two tasks at a time. You’d want to choose the disruption percent based on your cluster setup.

Create an experiment template:

cat > fis-ecs-experiment-template-cpu-stress.json << EOF
{
  "description": "ecs-task-cpu-stress",
  "targets": {
    "Tasks-Target-1": {
      "resourceType": "aws:ecs:task",
      "parameters": {
        "cluster": "ecs-blueprint-infra",
        "service": "fis-service"
      },
      "selectionMode": "PERCENT(66)"
    }
  },
  "actions": {
    "ecs-task-cpu": {
      "actionId": "aws:ecs:task-cpu-stress",
      "parameters": {
        "duration": "PT5M"
      },
      "targets": {
        "Tasks": "Tasks-Target-1"
      }
    }
  },
  "stopConditions": [
      {
         "source": "aws:cloudwatch:alarm",
         "value": "arn:aws:cloudwatch:${AWS_REGION}:${AWS_ACCOUNT}:alarm:ECS Sample service task count alarm"
       }
  ],
  "roleArn": "arn:aws:iam::${AWS_ACCOUNT}:role/ecs-fis-role",
  "tags": {}
}
EOF

FIS_EXPERIMENT_TEMPLATE_ID=$(aws fis create-experiment-template \
  --region $AWS_REGION \
  --tags Name=ecs-cpu-stress \
  --cli-input-json file://fis-ecs-experiment-template-cpu-stress.json \
  --query 'experimentTemplate.id' \
  --output text)

This experiment targets 66% of the tasks in the sample service deployed in the previous steps, selected at random. The experiment runs for 5 minutes.

Start the experiment:

aws fis start-experiment \
  --experiment-template-id $FIS_EXPERIMENT_TEMPLATE_ID \
  --region $AWS_REGION

Head over to the HAQM ECS console, choose Clusters, select the ecs-blueprint-infra cluster, then on the Services tab select fis-service and open the service’s Health and metrics tab.

ECS console with ECS FIS service

You should see CPU utilization rise within approximately 5 to 10 minutes. The graph itself can take up to 5 minutes to update.

ECS console with ECS FIS service CPU utilization metric

During the experiment, the 95th percentile response time increased from approximately 200 to 430 ms. To handle such an event, we can either increase the size of the HAQM EC2 instances or add more tasks to handle traffic while the CPU resources in a set of tasks are starved.

Metrics during experiment to stress task's CPU

Wait for the experiment to complete (the experiment runs for 5 minutes) or end the experiment before proceeding to the next step.
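If you prefer to wait from the command line, you can poll the experiment status. The following is a minimal sketch with illustrative function names. It takes an experiment ID, which the start-experiment command returns in its output (for example, by adding --query 'experiment.id' --output text to that command); it assumes the in-progress statuses are pending, initiating, running, and stopping, and treats any other status as final.

```shell
# Print the current status of an AWS FIS experiment.
experiment_status() {
  aws fis get-experiment \
    --id "$1" \
    --query 'experiment.state.status' \
    --output text \
    --region "${AWS_REGION:-us-west-2}"
}

# Poll until the experiment leaves its in-progress states, then print the final status.
# Usage: wait_for_experiment <experiment-id> [poll_interval_seconds]
wait_for_experiment() {
  local id="$1" interval="${2:-15}" state
  while true; do
    state=$(experiment_status "$id") || return 1
    case "$state" in
      pending|initiating|running|stopping) sleep "$interval" ;;
      *) echo "$state"; return 0 ;;
    esac
  done
}

# Example: wait_for_experiment "$EXPERIMENT_ID"
```

To end an experiment early instead of waiting, you can run aws fis stop-experiment --id with the same experiment ID.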

Step 5: Use AWS FIS to kill a process in an HAQM ECS task

In our next experiment, we’ll use AWS FIS to kill a process in the task. The sample workload we’ve deployed runs a web application built with Flask. The web application container is the essential container in the task. When the python process is killed, HAQM ECS stops the task and creates a new task. We’ll use AWS FIS to kill the python process in the web application container, which forces HAQM ECS to create a new task.

Killing a process helps you determine the business impact of a process terminating unexpectedly. If the process is essential to the workload, then you can also observe the downstream impact while HAQM ECS creates a new task to replace the failed task. If the recovery time is long enough to cause business interruption, then you may want to increase the replica count to account for unexpected failures. If the process is not essential, then you can implement a graceful termination procedure in your workload to minimize disruption.

The sample service runs three tasks. We’ll kill the essential process in one of the three tasks. Because the application then runs with one-third less capacity until the replacement task starts, the expected downstream impact is an increase in response latency and timeouts (5xx errors in Application Load Balancer).

Create a new experiment template:

cat > fis-ecs-experiment-template-kill-proc.json << EOF
{
  "description": "ecs-task-kill-process",
  "targets": {
    "Tasks-Target-1": {
      "resourceType": "aws:ecs:task",
      "parameters": {
        "cluster": "ecs-blueprint-infra",
        "service": "fis-service"
      },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "ecs-task-kill-proc": {
      "actionId": "aws:ecs:task-kill-process",
      "parameters": {
        "processName": "python"
      },
      "targets": {
        "Tasks": "Tasks-Target-1"
      }
    }
  },
  "stopConditions": [
    {
      "source": "none"
    }
  ],
  "roleArn": "arn:aws:iam::${AWS_ACCOUNT}:role/ecs-fis-role",
  "tags": {}
}
EOF

FIS_EXPERIMENT_ID=$(aws fis create-experiment-template \
  --region $AWS_REGION \
  --tags Name=ECS-kill-process \
  --cli-input-json file://fis-ecs-experiment-template-kill-proc.json \
  --query 'experimentTemplate.id' \
  --output text)

Head over to the HAQM ECS console again, select the HAQM ECS service named fis-service, and navigate to the Tasks tab. Change the task list filtering condition from Running tasks to All tasks as shown below. Keep this tab open, because one of the tasks will fail when we start the experiment.

ECS console with ECS FIS service's Tasks

Now, start the experiment:

aws fis start-experiment \
  --experiment-template-id $FIS_EXPERIMENT_ID \
  --region $AWS_REGION

After a few seconds, HAQM ECS stops the task because AWS FIS killed the python process in the essential container. Select the stopped task associated with fis-service. You’ll see the error message shown below in the HAQM ECS console.

ECS console with ECS FIS service's stopped Task
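You can also retrieve the stop reason from the command line. The sketch below uses an illustrative function name and assumes the AWS CLI is configured; note that HAQM ECS keeps stopped tasks visible only for a limited time, so run it soon after the experiment.

```shell
# Print the ARN and stop reason of stopped tasks that belonged to a service.
stopped_task_reasons() {
  local cluster="$1" service="$2" task_arns
  task_arns=$(aws ecs list-tasks \
    --cluster "$cluster" \
    --service-name "$service" \
    --desired-status STOPPED \
    --query 'taskArns' \
    --output text \
    --region "${AWS_REGION:-us-west-2}") || return 1
  # With text output, an empty result prints nothing (or "None").
  if [ -z "$task_arns" ] || [ "$task_arns" = "None" ]; then
    echo "no stopped tasks found"
    return 0
  fi
  # Word splitting is intentional here: each ARN becomes one argument.
  aws ecs describe-tasks \
    --cluster "$cluster" \
    --tasks $task_arns \
    --query 'tasks[].[taskArn,stoppedReason]' \
    --output text \
    --region "${AWS_REGION:-us-west-2}"
}

# Example: stopped_task_reasons ecs-blueprint-infra fis-service
```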

We used Locust to send traffic to the web application to visualize the impact of the experiment. As you can see, users would’ve experienced errors while accessing the site when AWS FIS killed the essential process. At the same time, P95 latency shot up because there weren’t enough tasks to handle requests.

Metrics during experiment to kill a process in an ECS task

ALB metrics show an increase in 5xx errors while HAQM ECS replaced the terminated task:

ALB metrics during experiment to kill a process in an ECS task

Please note that the aws:ecs:task-kill-process action requires the PID (process ID) mode to be set to task in the HAQM ECS task definition. By default, HAQM ECS runs each container in its own private process namespace. When PID mode is set to task, all containers within the task share the process namespace. Being in the same process namespace allows the sidecar container to terminate other processes running in the task.
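To confirm the setting on your own service, you can look up the task definition the service currently uses and print its pidMode. This is a minimal sketch with an illustrative function name, assuming the AWS CLI is configured.

```shell
# Print the pidMode of the task definition currently used by an ECS service.
service_pid_mode() {
  local cluster="$1" service="$2" task_def
  task_def=$(aws ecs describe-services \
    --cluster "$cluster" \
    --services "$service" \
    --query 'services[0].taskDefinition' \
    --output text \
    --region "${AWS_REGION:-us-west-2}") || return 1
  aws ecs describe-task-definition \
    --task-definition "$task_def" \
    --query 'taskDefinition.pidMode' \
    --output text \
    --region "${AWS_REGION:-us-west-2}"
}

# Example: service_pid_mode ecs-blueprint-infra fis-service
```

For the sample service, the expected value is task. If the field is not set in a task definition, the CLI text output typically prints None.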

Cleaning up

export AWS_PAGER=""
# Get experiment template IDs
fis_expmt1=$(aws fis list-experiment-templates --query 'experimentTemplates[?description == `ecs-task-kill-process`].id' --output text)
fis_expmt2=$(aws fis list-experiment-templates --query 'experimentTemplates[?description == `ecs-task-cpu-stress`].id' --output text)

# Delete the alarm 
aws cloudwatch delete-alarms --alarm-names 'ECS Sample service task count alarm' --region $AWS_REGION

# Delete experiment templates
aws fis delete-experiment-template --id $fis_expmt1
aws fis delete-experiment-template --id $fis_expmt2

# Delete ecs-fis IAM role
aws iam delete-role-policy --role-name ecs-fis-role --policy-name ecs-fis-policy
aws iam delete-role --role-name ecs-fis-role

# Destroy CDK stack
cdk destroy --force --all

Pricing

With AWS FIS, you pay only for what you use. There are no upfront costs or minimum fees. You are charged based on the duration that an action is active. Please see the AWS Fault Injection Simulator pricing page for details.

The aws:ecs:task-kill-process action is free because it doesn’t have any duration.

By default, AWS FIS has a quota on the number of tasks per action to prevent the accidental provisioning of more resources than you need. You can request an increase to this quota via the AWS console.

Conclusion

In this post, we showed you the new AWS FIS actions that make it easy for HAQM ECS customers to practice chaos engineering. We showed how you can increase CPU stress in HAQM ECS tasks as well as how to kill a process. AWS FIS simplifies fault injection by giving you full control over your experiments. AWS FIS provides the controls and guardrails that you need to run experiments in production, such as automatically rolling back or stopping the experiment if specific conditions are met. The new AWS FIS actions help create the real-world conditions needed to uncover application issues in HAQM ECS clusters that can be difficult to find otherwise.

HAQM ECS actions are now available in all AWS Regions where AWS FIS is available, including the AWS GovCloud (US) Regions.
