AWS Cloud Operations Blog

Scaling AWS Fault Injection Service Across Your Organization And Accounts

Welcome to part two of our series on scaling AWS Fault Injection Service (FIS) within your organization. In part one, we learned how customers can enable individual accounts within an organization to run network experiments, when operating with a centralized networking infrastructure, by introducing a Service Control Policy (SCP) rule. In this post, we dive deeper into how organizations can use SCPs and IAM to let application teams run chaos experiments while adhering to security policies, through a centralized strategy that enables controlled multi-account FIS experiments. This approach allows teams to systematically validate workload dependencies and resilience across different accounts and compliance domains.

Understanding AWS FIS Multi-Account Strategies

Multi-account support for AWS FIS experiments allows you to create and run experiments from an orchestrator account that injects faults into AWS resources in one or more target accounts. You can configure multi-account experiment templates and control their scope using IAM roles with fine-grained permissions and resource tags to specify each target. FIS provides multi-account visibility and safety, allowing you to review actions across all accounts from the FIS Console and audit API calls in each account with AWS CloudTrail. When you run a multi-account experiment, target accounts with affected resources will be notified via their AWS Health dashboards.

Multi-Account Experiment Strategies

Organizations can use two strategies for designing and conducting multi-account experiments:

  1. Centralized Management Strategy: In this strategy, an orchestrator account is created. This account is typically owned by a dedicated chaos engineering/SRE team, let’s call them the FIS Admins. The team is responsible for enabling configuration and management of experiments in the AWS FIS Console, as well as ensuring centralized logging of experiments. The orchestrator account owns the AWS FIS experiment templates and experiments. This approach allows the FIS Admins to collaborate with a decentralized developer organization distributed across multiple accounts.
  2. Decentralized Management Strategy: Each AWS account owner designs and runs their own experiments. This approach gives application owners the freedom to adopt chaos engineering within their own teams without the overhead of working with a centralized team. Organizations can implement additional guardrails with this model to prevent role modification or ensure FIS safety levers are utilized to prevent unwanted disruptions.

Note: If you plan to use the decentralized approach, a trust relationship between accounts is needed. See part three for more guidance.

In the following example, we guide you through the IAM policies and roles needed in the orchestrator and target accounts to enable your teams to run network experiments independently via a centralized console. The objective of our experiment is to disrupt network connectivity to and from a specific subnet. For this scenario, we create the AWS-FIS-Experiment-Executor role as described in part one. This role has the AWS managed policy AWSFaultInjectionSimulatorNetworkAccess attached, which allows it to perform the required network actions. See details on all permissions here.

Multi-Account Scenario Preparation

For this scenario, we use two types of accounts:

  1. AWS-FIS-Experiment-Orchestrator-Account: This is the centralized account used to create, delete, or update FIS experiment templates and run experiments across all associated application accounts.
  2. Workload Account (Target Account): The account where the actual workload resources reside and where faults are injected.

FIS Roles and Permissions

To ensure secure and controlled execution of fault injection experiments, AWS FIS uses a robust role-based access control system. We define two standardized roles: AWS-FIS-Experiment-Orchestrator and AWS-FIS-Experiment-Target. By implementing these roles, AWS FIS provides a framework for conducting controlled chaos engineering experiments while maintaining the necessary safeguards to prevent unintended disruptions throughout your development and production environment. Let’s take a deeper look:

  1. AWS-FIS-Experiment-Orchestrator: A role in the centralized orchestrator account that is allowed to create, update, and delete FIS experiment templates, with permissions to run experiments that inject faults into your target workloads. This role has a trust relationship with all target accounts.
  2. AWS-FIS-Experiment-Target: A role in the target account that contains the permissions required to take action on resources. For the aws:network:disrupt-connectivity action, for example, the role needs ec2:CreateNetworkAcl and nine other permissions, plus the mandatory tags. Read more on the actions here.

By leveraging roles, organizations can maintain robust security, ensure efficient governance, and adapt to changing needs as their cloud ecosystem evolves, all while upholding compliance and enabling seamless collaboration across multiple accounts. Let’s take a look at the sample diagram below:

Diagram A: AWS Account structure for FIS multi-account targets

AWS-FIS-Experiment-Orchestrator Permissions

Add the following permissions to allow this role to create, update, and delete experiments in the orchestrator account:

{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "VisualEditor0",
        "Effect": "Allow",
        "Action": [
          "fis:ListExperimentTemplates",
          "fis:ListActions",
          "fis:ListTargetResourceTypes",
          "fis:ListExperiments",
          "fis:GetTargetResourceType"
        ],
        "Resource": "*"
      },
      {
        "Sid": "VisualEditor1",
        "Effect": "Allow",
        "Action": "fis:*",
        "Resource": [
          "arn:aws:fis::<TARGET_ACCOUNT_ID>:action/*",
          "arn:aws:fis::<TARGET_ACCOUNT_ID>:experiment/*",
          "arn:aws:fis::<TARGET_ACCOUNT_ID>:experiment-template/*"
        ]
      }
    ]
}

To execute experiments in target accounts, you need to grant the orchestrator role permission to assume each target account role (known as role chaining). See the example below:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": [
                "arn:aws:iam::<<targetAccountID>>:role/AWS-FIS-Experiment-Target"
            ]
        }
    ]
}
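
When there are many target accounts, the Resource list in this policy grows by one role ARN per account. As a minimal sketch, assuming the target role is named AWS-FIS-Experiment-Target in every account, the document could be generated rather than hand-edited. The function name and the account IDs below are hypothetical and only for illustration.

```python
import json

TARGET_ROLE_NAME = "AWS-FIS-Experiment-Target"  # role name used in this post


def build_assume_role_policy(target_account_ids):
    """Return an IAM policy allowing sts:AssumeRole on each target account role."""
    resources = []
    # De-duplicate and sort so the generated document is stable between runs.
    for account_id in sorted(set(target_account_ids)):
        # AWS account IDs are 12-digit strings; reject anything else early.
        if not (len(account_id) == 12 and account_id.isascii() and account_id.isdigit()):
            raise ValueError(f"not a 12-digit account ID: {account_id!r}")
        resources.append(f"arn:aws:iam::{account_id}:role/{TARGET_ROLE_NAME}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "sts:AssumeRole", "Resource": resources}
        ],
    }


if __name__ == "__main__":
    # Placeholder account IDs, for illustration only.
    policy = build_assume_role_policy(["111111111111", "222222222222"])
    print(json.dumps(policy, indent=4))
```

Generating the policy from a list of accounts keeps the orchestrator role's permissions in step with the accounts that are actually onboarded.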

AWS-FIS-Experiment-Target

Add the appropriate AWS managed policy to the AWS-FIS-Experiment-Target role based on the experiment type. For example, use the AWSFaultInjectionSimulatorNetworkAccess policy for network disruption experiments. Note: In the role's trust policy, add the orchestrator account role created above to allow cross-account access.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<<AWS-FIS-Experiment-Orchestrator-Account-ID>>:root",
                "Service": "fis.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringLike": {
                    "sts:ExternalId": "arn:aws:fis:region:<<AWS-FIS-Experiment-Orchestrator-Account-ID>>:experiment/*"
                },
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<<AWS-FIS-Experiment-Orchestrator-Account-ID>>:role/AWS-FIS-Experiment-Orchestrator"
                }
            }
        }
    ]
}
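
The same trust policy has to be created in every target account, and a typo in an ARN only surfaces when the experiment fails to assume the role. The following is a minimal sketch, assuming the statement layout shown above, of rendering it from the three values that vary. The function name and the sample account ID and Region are hypothetical.

```python
import json


def build_target_trust_policy(orchestrator_account_id, orchestrator_role_name, region):
    """Trust policy for the target role, scoped to FIS experiments run by the orchestrator role."""
    account_root = f"arn:aws:iam::{orchestrator_account_id}:root"
    role_arn = f"arn:aws:iam::{orchestrator_account_id}:role/{orchestrator_role_name}"
    # The external ID condition limits use of the role to experiments
    # that belong to the orchestrator account in the given Region.
    experiment_arns = f"arn:aws:fis:{region}:{orchestrator_account_id}:experiment/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": account_root, "Service": "fis.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringLike": {"sts:ExternalId": experiment_arns},
                    "ArnEquals": {"aws:PrincipalArn": role_arn},
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    policy = build_target_trust_policy(
        "111111111111", "AWS-FIS-Experiment-Orchestrator", "us-east-1"
    )
    print(json.dumps(policy, indent=4))
```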

Creating a Multi-Account Experiment

To create a multi-account FIS experiment template via the console, navigate to AWS FIS and, from the left-side menu, choose Experiment templates under Resilience testing. Then create the experiment template using the following steps:

    1. Choose the “Multiple accounts” option when creating the template.
      Specify template details FIS page
    2. Specify actions and targets.
      FIS configuration to choose action and targets for experiment
    3. Configure service access and specify the target role(s) for cross-account access.
      FIS configure service access and target roles configuration

Note: The target account configuration is where you add the target account role that has permissions to inject the network disruption.

    4. Define logging, stop conditions, safety levers, and report configuration to ensure safe experiment execution.
      Configure FIS Optional reporting
      Configure FIS optional logging
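
The console steps above can also be expressed as API request payloads. The sketch below only builds the dictionaries and does not call AWS. The field names reflect our reading of the FIS CreateExperimentTemplate and CreateTargetAccountConfiguration APIs and should be checked against the current API reference. The helper names, the target and action labels, the FIS-Ready tag, and all IDs are hypothetical.

```python
import json


def build_template_request(experiment_role_arn, subnet_tags, alarm_arn, duration="PT5M"):
    """Build keyword arguments for a multi-account FIS experiment template."""
    return {
        "description": "Disrupt connectivity to and from tagged subnets",
        # Role in the orchestrator account that FIS uses to run the experiment.
        "roleArn": experiment_role_arn,
        # Corresponds to the "Multiple accounts" option in the console.
        "experimentOptions": {"accountTargeting": "multi-account"},
        "targets": {
            "target-subnets": {
                "resourceType": "aws:ec2:subnet",
                "resourceTags": subnet_tags,
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "disrupt-subnet": {
                "actionId": "aws:network:disrupt-connectivity",
                "parameters": {"duration": duration, "scope": "all"},
                "targets": {"Subnets": "target-subnets"},
            }
        },
        # Stop the experiment if this CloudWatch alarm goes into ALARM state.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
    }


def build_target_account_configuration(template_id, account_id,
                                       role_name="AWS-FIS-Experiment-Target"):
    """Build keyword arguments that register one target account on a template."""
    return {
        "experimentTemplateId": template_id,
        "accountId": account_id,
        "roleArn": f"arn:aws:iam::{account_id}:role/{role_name}",
    }


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    request = build_template_request(
        "arn:aws:iam::111111111111:role/AWS-FIS-Experiment-Orchestrator",
        {"FIS-Ready": "true"},
        "arn:aws:cloudwatch:us-east-1:111111111111:alarm:HighErrorRate",
    )
    print(json.dumps(request, indent=4))
```

With boto3 installed and credentials for the orchestrator account, these dictionaries could be passed as keyword arguments to the FIS client's create_experiment_template and create_target_account_configuration methods, one target account configuration per workload account.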

CloudWatch Cross-Account Configuration

To enable cross-account CloudWatch monitoring:

      1. Create the AWSServiceRoleForCloudWatchCrossAccount role in the orchestrator account.
      2. Create the CloudWatch-CrossAccountSharingRole in each target account.
      3. Ensure the target role trusts the orchestrator account.

Stop Conditions, Safety Lever, Experiment Reports

AWS Fault Injection Service (AWS FIS) provides controls and guardrails for you to run experiments in a controlled manner on AWS workloads. A stop condition is a mechanism to stop an experiment if it reaches a threshold that you define as an Amazon CloudWatch alarm. Safety levers are used to stop all running experiments and prevent new experiments from starting. You may want to use the safety lever to prevent FIS experiments during certain time periods or in response to application health alarms. Every AWS account has a safety lever per AWS Region. See Safety Levers for AWS FIS for details. Experiment reports are PDF summaries of the experiment actions executed. The reports can be downloaded from the FIS console or sent to an S3 bucket specified in the experiment template.
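
To illustrate stop conditions, the sketch below turns a list of CloudWatch alarm ARNs into the stopConditions entries of an experiment template. The helper name and the sample ARN are hypothetical. The entry shape (a source of aws:cloudwatch:alarm with the alarm ARN as the value, or a single source of none) follows our understanding of the FIS template format.

```python
def build_stop_conditions(alarm_arns):
    """Return FIS stopConditions entries, one per CloudWatch alarm ARN."""
    if not alarm_arns:
        # A template without alarms still declares a stop condition of "none".
        return [{"source": "none"}]
    conditions = []
    for arn in alarm_arns:
        # Expected shape: arn:<partition>:cloudwatch:<region>:<account>:alarm:<name>
        parts = arn.split(":", 6)
        if (len(parts) != 7 or parts[0] != "arn"
                or parts[2] != "cloudwatch" or parts[5] != "alarm"):
            raise ValueError(f"not a CloudWatch alarm ARN: {arn!r}")
        conditions.append({"source": "aws:cloudwatch:alarm", "value": arn})
    return conditions


if __name__ == "__main__":
    # Placeholder alarm ARN, for illustration only.
    print(build_stop_conditions(
        ["arn:aws:cloudwatch:us-east-1:111111111111:alarm:HighErrorRate"]
    ))
```

Validating the ARN shape up front catches a mistyped alarm reference before the template is created, rather than when an experiment fails to stop.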

Conclusion

Implementing a centralized multi-account strategy using the practices outlined in this blog offers:

      • Enhanced security through role-based access control
      • Improved governance with centralized experiment management
      • Increased scalability for growing organizations
      • Better compliance and audit capabilities as chaos engineering is adopted across the organization

By adopting these best practices, you can create a robust framework for chaos engineering across your AWS environment. This approach allows you to systematically improve the resilience of your distributed systems, ultimately leading to more reliable and fault-tolerant applications. As you implement these strategies, consider regularly reviewing and updating your roles and permissions to align with your evolving organizational needs and AWS’s latest security recommendations. Remember that effective chaos engineering is an ongoing process, and these multi-account practices provide a solid foundation for continuous improvement in your systems’ reliability. Join us in part three where we dive into using a multi-account strategy with our AWS FIS Cross-Region: Connectivity Scenario.

About the authors

Dylan Reed

Dylan Reed is a Solutions Architect at AWS. He has a passion for helping customers build resilient, secure and innovative solutions on AWS. In his current role he works across industries to help solve complex business challenges through AWS services. Outside of work he enjoys traveling and playing whatever sports he can.

Isael Pimentel

Isael Pimentel is an Enterprise Support Lead and Chaos Engineering SME at AWS with over 15 years of experience in developing and managing complex infrastructures, IT Transformation, Resilience, and Security. He also holds several certifications including AWS Solution Architect, AWS Network Specialty, AWS Security Specialty, MSCA, and CCNA.

Venkata Moparthi

Venkata Moparthi is a Senior Solutions Architect who specializes in cloud migrations, generative AI, and secure architecture for financial services and other industries. He combines technical expertise with customer-focused strategies to accelerate digital transformation and drive business outcomes through optimized cloud solutions.

Jason Brown

Jason Brown is a Senior Technical Account Manager at AWS, where he serves as a subject matter expert in Resilience, Disaster Recovery, and Chaos Engineering. With over 10 years of diverse technical experience, he has developed a passion for building resilient systems and helping customers define and scale their resilience practices through a comprehensive people, process, and technology approach.

Satish Kumar

Satish is a Sr. Technical Account Manager at AWS and member of Resilience TFC focusing on Chaos Engineering. Over the past 25 years, he worked in different roles from leading teams in software development, consulting, and IT. His experience in various industry verticals like Media & Entertainment, High-Tech, Finance, and now Healthcare & Life Sciences provided him with a deep understanding of the various facets of the software industry. Currently in his role he helps Healthcare & Life Sciences customers to design and operate their platform resiliently, cost efficiently, and at scale on AWS.