AWS Cloud Operations Blog

Scaling AWS Fault Injection Service Across Your Organization And Regions

In the first two parts of our series, we explored how to scale AWS Fault Injection Service (FIS) across AWS Organizations. Part one focused on implementing FIS in a single AWS account environment, introducing the concept of standardized IAM roles and Service Control Policies (SCPs) as guardrails for controlled chaos engineering experiments, particularly in centralized networking models. Part two expanded on this foundation by demonstrating how organizations can implement a multi-account strategy for FIS experiments, detailing the setup of orchestrator and target accounts, along with the necessary IAM roles and permissions required to execute controlled chaos experiments across multiple accounts while maintaining security and governance. In part three, we’ll demonstrate how to run chaos engineering experiments at scale across multiple accounts and AWS Regions using the Cross-Region Connectivity scenario.

The importance of multi-Region resilience

As critical applications move to AWS, understanding their resilience objectives and how they align with your business becomes crucial. Applications with stringent recovery time objectives (RTO) and recovery point objectives (RPO) often require a multi-Region strategy. To properly verify that these applications can recover from regional impairments, organizations need mechanisms to inject regional failures and validate their business continuity processes. The AWS FIS Cross-Region Connectivity scenario addresses this need.

Understanding AWS Fault Injection Service (FIS)

AWS FIS is a chaos engineering service that allows customers to inject real-world failures into their architectures. For example, Amazon.com ran 733 AWS FIS experiments to prepare for Prime Day 2024. This approach helps customers verify that the resilience measures built into an application kick in when there is an unexpected regional service event.

To learn more and get hands-on experience, visit our Chaos Engineering workshop.

AWS Fault Injection Service (FIS) now offers a Cross-Region Connectivity testing scenario. This feature allows you to inject real-world failures into multi-Region architectures, helping you uncover hidden dependencies and improve your understanding of multi-Region configurations. It helps you demonstrate that multi-Region applications operate as expected when the primary Region is inaccessible. The scenario includes fault actions to disrupt various types of cross-Region connectivity, including:

  1. Virtual Private Cloud (VPC) traffic and peering
  2. AWS Transit Gateway peering
  3. Access to AWS public endpoints
  4. Access to endpoints exposed via load balancers and API gateways
  5. S3 and DynamoDB cross-region data replication

By running this scenario, you can identify gaps in your multi-Region application’s design, recovery, and failover mechanisms.

Prerequisites for running the cross-Region connectivity scenario

As discussed in part one, large enterprises often implement a centralized networking model, where a dedicated networking account manages shared resources like Transit Gateway (TGW) for the entire organization. This approach allows for better control, security, and cost management across multiple accounts and regions. When implementing the AWS FIS Cross-Region Connectivity scenario in such an environment, you’ll need a multi-account strategy as discussed in Part two to effectively inject communication failures and test resilience across accounts.

Diagram A shows a multi-Region application using a TGW in a centralized networking account for connectivity between the two Regions:

Diagram A: Sample application for FIS in multi-Region

Given this setup, there are specific requirements for running the Cross-Region Connectivity scenario in a decentralized strategy:

  1. Replicate the permissions specified in our documentation for all roles used in the target configuration (an illustrative permissions policy sketch follows at the end of this section).
  2. Add roles for all targets under the target configuration. Because we are running a decentralized strategy, you will have two target roles: one in the networking account (represented below as XXXXXXXXXXXX) and one in the application account (represented below as YYYYYYYYYYYY), which also serves as the orchestrator account.
    • Role configuration examples: a target role for the networking account where the TGW resides. Here we are using the AWS-FIS-Experiment-TGW-Target role.
      {
          "Role": {
              "Path": "/",
              "RoleName": "AWS-FIS-Experiment-TGW-Target",
              "RoleId": "AROAXSRRIQZV7NIQPD4IF",
              "Arn": "arn:aws:iam::XXXXXXXXXXXX:role/AWS-FIS-Experiment-TGW-Target",
              "CreateDate": "2024-05-31T19:06:48+00:00",
              "AssumeRolePolicyDocument": {
                  "Version": "2012-10-17",
                  "Statement": [
                      {
                          "Effect": "Allow",
                          "Principal": {
                              "AWS": "arn:aws:iam::YYYYYYYYYYYY:root"
                          },
                          "Action": "sts:AssumeRole",
                          "Condition": {
                              "StringLike": {
                                  "sts:ExternalId": "arn:aws:fis:us-east-2:YYYYYYYYYYYY:experiment/*"
                              },
                              "ArnEquals": {
                                  "aws:PrincipalArn": "arn:aws:iam::YYYYYYYYYYYY:role/FISOrchestration_ExecutionRole"
                              }
                          }
                      }
                  ]
              }
          }
      }
      
    • A role in the orchestration account, called AWS-FIS-Experiment-App1-Target, that will be used to inject the actions. Note: in this scenario the role is within the same account as the workload, so ensure its permissions match those specified above.
      {
          "Role": {
              "Path": "/",
              "RoleName": "AWS-FIS-Experiment-App1-Target",
              "RoleId": "AROA5L7L4GFHC5IZTT5SW",
              "Arn": "arn:aws:iam::YYYYYYYYYYYY:role/AWS-FIS-Experiment-App1-Target",
              "CreateDate": "2024-10-09T15:20:17+00:00",
              "AssumeRolePolicyDocument": {
                  "Version": "2012-10-17",
                  "Statement": [
                      {
                          "Sid": "Statement1",
                          "Effect": "Allow",
                          "Principal": {
                              "AWS": "arn:aws:iam::YYYYYYYYYYYY:role/AWS-FIS-Experiment-App1-Target"
                          },
                          "Action": "sts:AssumeRole"
                      }
                  ]
              }
          }
      }
      
    • An orchestrator role called AWS-FIS-Experiment-App1-Orchestrator that AWS FIS can assume, with permissions to assume all roles specified in the target configuration. Note: in this scenario the orchestrator role resides in the same account as the workload and will assume the AWS-FIS-Experiment-App1-Target and AWS-FIS-Experiment-TGW-Target roles. You can easily change this to a centralized strategy, as discussed in part two.
      {
          "Role": {
              "Path": "/",
              "RoleName": "AWS-FIS-Experiment-App1-Orchestrator",
              "RoleId": "AROA5L7L4GFHIZL4PEG46",
              "Arn": "arn:aws:iam::YYYYYYYYYYYY:role/AWS-FIS-Experiment-App1-Orchestrator",
              "CreateDate": "2024-05-31T19:51:16+00:00",
              "AssumeRolePolicyDocument": {
                  "Version": "2012-10-17",
                  "Statement": [
                      {
                          "Effect": "Allow",
                          "Principal": {
                              "Service": "fis.amazonaws.com"
                          },
                          "Action": "sts:AssumeRole"
                      }
                  ]
              }
          }
      }
      

      Note: You will need to attach a policy to the Orchestration Execution role that allows it to assume the networking account role:

          {
              "Version": "2012-10-17",
              "Statement": [
                  {
                      "Effect": "Allow",
                      "Action": "sts:AssumeRole",
                      "Resource": [
                          "arn:aws:iam::XXXXXXXXXXXX:role/AWS-FIS-Experiment-TGW-Target"
                      ]
                  }
              ]
          }
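
For reference, the identity-based permissions policy attached to the AWS-FIS-Experiment-TGW-Target role in the networking account might look similar to the following sketch. This is an illustrative, non-exhaustive example that uses standard EC2 API actions; the authoritative list of permissions each fault action requires is in the FIS documentation referenced in step 1, so copy the policy from there rather than from this sketch.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "IllustrativeTransitGatewayDisruptPermissions",
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeTransitGateways",
                    "ec2:DescribeTransitGatewayPeeringAttachments",
                    "ec2:DescribeTransitGatewayRouteTables",
                    "ec2:DescribeRouteTables",
                    "ec2:CreateRoute",
                    "ec2:ReplaceRoute",
                    "ec2:DeleteRoute"
                ],
                "Resource": "*"
            }
        ]
    }

In practice, scope the Resource element and add condition keys so the policy only covers the route tables and TGW attachments involved in the experiment.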
      

Cross-Region Connectivity scenario actions

The scenario includes four main actions, each with specific considerations. Let’s have a look at each:

  • Disrupt Subnet Connectivity:
    • Action: Runs the aws:network:route-table-disrupt-cross-region-connectivity action to block AWS service communication from the source VPC to the failover Region. Considerations:
      • Ensure you’ve increased your routes per route table quota to at least 250 in the source region.
      • Monitor for any unexpected impact on VPC endpoints in the source region.
      • Be prepared for potential disruptions in cross-Region API calls and data transfers.
      • Verify that your application’s error handling and retry mechanisms work as expected during this disruption.
  • Disrupt Transit Gateway Connectivity:
    • Action: Runs the aws:network:transit-gateway-disrupt-cross-region-connectivity action to stop communication from the source VPC to the destination VPC. Considerations:
      • Ensure the TGW is properly tagged before running the experiment; the tag is crucial for FIS to identify the correct TGW (see the tag-based targeting sketch after this list).
      • Verify that CloudTrail is recording route table changes.
      • Be prepared for potential impacts on cross-region traffic flows, especially for applications that rely on multi-Region communication.
      • Verify that your network monitoring tools correctly detect and alert on this disruption.
  • Pause S3 Replication:
    • Action: Runs the aws:s3:bucket-pause-replication action to stop replication from the source buckets to their destination buckets, including cross-Region replication. Considerations:
      • Ensure that the S3 buckets you intend to target are correctly tagged with the targetTagForS3Buckets specified in the experiment parameters.
      • Verify that replication is bi-directional (set up from the destination region to the source region as well). If not, the S3 pause replication action may fail.
      • Monitor for any data consistency issues that may arise during the replication pause.
      • Be prepared to handle potential data sync delays once replication is resumed.
  • Pause DynamoDB Replication:
    • Action: Runs the aws:dynamodb:global-table-pause-replication action, stopping replication to and from the experiment Region. Considerations:
      • Ensure that you’re targeting global tables, not standard DynamoDB tables.
      • Verify that the DynamoDB tables are tagged with the targetTagForDdbTables specified in the experiment parameters.
      • Monitor for any potential data inconsistencies across regions during the replication pause.
      • Be prepared for a potential increase in write conflicts when replication resumes, especially if writes occurred in multiple regions during the pause.
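
To make the tag requirements above concrete, the following sketch shows how tag-based targets are typically expressed in an FIS experiment template. The target names and the ChaosAllowed tag key and value are placeholder assumptions, and the resource type identifiers should be confirmed against the managed Cross-Region Connectivity template; the point is that FIS resolves targets by matching the tags you supply in the scenario parameters.

    {
        "targets": {
            "TransitGateways-Disrupt": {
                "resourceType": "aws:ec2:transit-gateway",
                "resourceTags": { "ChaosAllowed": "true" },
                "selectionMode": "ALL"
            },
            "ReplicationBuckets-Pause": {
                "resourceType": "aws:s3:bucket",
                "resourceTags": { "ChaosAllowed": "true" },
                "selectionMode": "ALL"
            }
        }
    }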

Running the experiment

When creating and executing the AWS FIS Cross-Region Connectivity experiment, follow these steps and best practices:

  • Experiment Setup:
    • Navigate to the AWS FIS console and select “Scenario Library”. The Scenario Library displays AWS-managed templates you can use as starting points for your experiments.
    • In the list of AWS-managed templates, choose “Cross-Region: Connectivity”. This template contains predefined actions to test cross-Region connectivity resilience.
    • Choose “Create Template with Scenario”. AWS FIS creates a new template based on the selected scenario.
    • Under the “Account Targeting” prompt, choose “Multiple accounts”.
  • Configure Shared Parameters:
    • Set the disruptionDuration: This defaults to 3 hours, but adjust it based on your needs. Consider starting with shorter durations and gradually increasing as you become more comfortable with the process.
    • Specify the Region: This is the secondary Region.
    • Expand Advanced Parameters and define your tags to ensure you’re targeting the correct resources.
  • Target Configuration:
    • This step is crucial: you must add both the shared networking account role and the orchestration application role to the experiment’s target configuration (see the sketch after this list).
    • For the shared networking account role:
      • Add the ARN of the AWS-FIS-Experiment-TGW-Target role created in the networking account.
      • Ensure this role has the necessary permissions to modify TGW attachments and route tables.
    • For the orchestration application role:
      • Add the ARN of the role created in the application or linked account.
      • This role should have permissions for the S3 and DynamoDB replication actions.
    • Verify that the execution role has permission to assume both of these roles.
  • Validate Target Resource Selection:
    • Use the AWS FIS Target preview feature to validate which resources will be impacted. This is crucial to avoid unintended disruptions.
    • Verify that all intended resources are properly tagged and visible in the preview.
    • Double-check that resources in both the networking and application accounts are correctly targeted.
  • IAM Role Verification:
    • Ensure that all IAM roles (networking account role, orchestration account role, and execution role) are properly configured and have the necessary permissions as described here.
    • Verify that the execution role can assume both the networking and orchestration roles as needed.
  • Pre-Experiment Checklist:
    • Depending on environment requirements, notify all relevant teams about the upcoming experiment, including both networking and application teams.
    • Verify that your monitoring and alerting systems are properly configured to detect the expected disruptions across all impacted accounts.
    • Ensure you have a clear rollback plan in case of unexpected issues, including steps for both networking and application resources.
    • If possible, run the experiment during a maintenance window or low-traffic period.
  • Executing the Experiment:
    • Use the Target Preview feature to validate permissions and resource selection without actually causing disruptions.
    • When ready, initiate the experiment and closely monitor its progress.
    • Keep all relevant teams on standby for quick response if needed.
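
As a reference for the Target Configuration step above, each account you add to the experiment is represented by a target account configuration that pairs an account ID with the role AWS FIS assumes in that account. The sketch below shows the two entries used in this walkthrough; the descriptions are illustrative and the account IDs are the same placeholders used earlier. You can create these entries in the console while building the template or through the FIS CreateTargetAccountConfiguration API.

    [
        {
            "accountId": "XXXXXXXXXXXX",
            "roleArn": "arn:aws:iam::XXXXXXXXXXXX:role/AWS-FIS-Experiment-TGW-Target",
            "description": "Shared networking account that owns the Transit Gateway"
        },
        {
            "accountId": "YYYYYYYYYYYY",
            "roleArn": "arn:aws:iam::YYYYYYYYYYYY:role/AWS-FIS-Experiment-App1-Target",
            "description": "Application (orchestrator) account that owns the S3 buckets and DynamoDB tables"
        }
    ]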

Conclusion

By leveraging the AWS FIS Cross-Region Connectivity scenario, organizations can:

  1. Validate complex cross-account dependencies in multi-Region deployments
  2. Improve multi-Region reliability by uncovering hidden failure modes
  3. Gain valuable insights into operational efficiencies
  4. Enhance understanding of application behavior
  5. Refine and validate business continuity and disaster recovery plans

Remember, chaos engineering is an ongoing process of continuous improvement. Each experiment with AWS FIS is an opportunity to learn, adapt, and strengthen your systems. By adopting a chaos engineering mindset and leveraging tools like AWS FIS, you’re building a culture of resilience that permeates your entire organization. As you implement AWS FIS and the Cross-Region Connectivity scenario in your environment, keep in mind the strategies, best practices, and insights shared throughout this series. Embrace the power of controlled chaos, and watch as your systems become more robust, your teams more confident, and your services more reliable in the face of unexpected events.

Thank you for joining us on this exploration of AWS FIS and cross-region resilience testing. We hope this series has equipped you with the knowledge and tools you need to take your cloud infrastructure to the next level of reliability and performance. Here’s to building more resilient, fault-tolerant systems.

About the authors

Dylan Reed

Dylan Reed is a Solutions Architect at AWS. He has a passion for helping customers build resilient, secure and innovative solutions on AWS. In his current role he works across industries to help solve complex business challenges through AWS services. Outside of work he enjoys traveling and playing whatever sports he can.

Isael Pimentel

Isael Pimentel is an Enterprise Support Lead and Chaos Engineering SME at AWS with over 15 years of experience in developing and managing complex infrastructures, IT Transformation, Resilience, and Security. He also holds several certifications, including AWS Solutions Architect, AWS Network Specialty, AWS Security Specialty, MCSA, and CCNA.

Venkata Moparthi

Venkata Moparthi is a Senior Solutions Architect who specializes in cloud migrations, generative AI, and secure architecture for financial services and other industries. He combines technical expertise with customer-focused strategies to accelerate digital transformation and drive business outcomes through optimized cloud solutions.

Jason Brown

Jason Brown is a Senior Technical Account Manager at AWS, where he serves as a subject matter expert in Resilience, Disaster Recovery, and Chaos Engineering. With over 10 years of diverse technical experience, he has developed a passion for building resilient systems and helping customers define and scale their resilience practices through a comprehensive people, process, and technology approach.

Satish Kumar

Satish is a Sr. Technical Account Manager at AWS and a member of the Resilience TFC focusing on Chaos Engineering. Over the past 25 years, he has worked in different roles, from leading teams in software development to consulting and IT. His experience in various industry verticals like Media & Entertainment, High-Tech, Finance, and now Healthcare & Life Sciences has given him a deep understanding of the various facets of the software industry. In his current role, he helps Healthcare & Life Sciences customers design and operate their platforms resiliently, cost-efficiently, and at scale on AWS.