AWS Cloud Operations Blog

New AWS Fault Injection Service recovery action for zonal autoshift

We’re excited to announce that AWS Fault Injection Service (FIS) now supports a recovery action for Amazon Application Recovery Controller (ARC) zonal autoshift. With this integration, you can perform more comprehensive testing by creating disruptive events and triggering a zonal autoshift as part of the same experiment. That way, you can observe how your application will respond during an Availability Zone (AZ) impairment.

Deploying applications across multiple AZs is a key strategy for building resilient applications on AWS. Each AZ acts as a fault isolation boundary, which means any failures—from deployments, network issues, power outages, or human error—stay contained within that specific zone and don’t affect the entire system. This multi-AZ approach helps ensure your applications remain available and can continue running even when problems occur in one AZ, making them more reliable and fault-tolerant.

Zonal shift and zonal autoshift

In the event that an AZ is impaired, you can use zonal shift as a recovery mechanism to shift traffic away from that AZ in an AWS Region to healthy AZs in the same Region. This will isolate the impairment and allow your application to continue to serve your customers in a different AZ. To enhance this capability further, we’ve expanded the list of resources ARC supports to include Amazon EC2 Auto Scaling groups, Amazon Elastic Kubernetes Service, and Application and Network Load Balancers with cross-zone load balancing enabled or disabled.

Zonal autoshift in ARC helps minimize time to recovery. AWS will initiate an autoshift when our internal monitoring systems detect an AZ impairment that could potentially impact customers. An autoshift will temporarily move application traffic away from the affected zone for AWS resources configured for zonal autoshift. Once the issue is resolved, traffic will be distributed across all AZs again.

AWS FIS AZ Availability: Power Interruption

Customers have also been using AWS FIS to demonstrate how their application responds to AZ-level events by generating impairments to resources and AWS services in an AZ. The AZ Availability: Power Interruption scenario creates many of the expected symptoms of a complete power interruption in an AZ by combining multiple FIS actions. It temporarily “pulls the plug” on a targeted set of your resources in a single AZ by inducing a loss of zonal compute (Amazon EC2, EKS, and ECS), no re-scaling of compute in the AZ, subnet connectivity loss, Amazon Relational Database Service (RDS) failover, Amazon ElastiCache failover, and unresponsive Amazon Elastic Block Store volumes.

A common pitfall when testing AZ interruptions is focusing solely on blocking network connectivity. This method affects only resources within your managed network and doesn’t impact AWS managed services, which operate independently of your network configuration. The FIS AZ scenario provides a more comprehensive testing solution by simulating interruptions to both your resources and AWS managed services.

The AZ Availability: Power Interruption scenario offers significant benefits for testing and improving application resilience. It allows you to observe how your multi-AZ applications behave under realistic failure conditions. By running this controlled experiment, you can uncover potential weaknesses in your architecture, monitoring systems, and operational procedures. This helps improve your application’s fault tolerance and overall reliability, potentially reducing recovery time, and supporting compliance requirements for disaster recovery planning.

Bringing AWS FIS and zonal autoshift together 

An important aspect of your recovery strategy is the ability to test it. Resilience testing is particularly important in the cloud where systems evolve more quickly to meet business demands. Regularly testing resilience is necessary to ensure that critical systems continue to meet your recovery objectives. It also validates your procedures, ensures team readiness, verifies performance metrics, and helps identify potential issues.

That’s why we’re excited to share how the AWS FIS AZ Availability: Power Interruption scenario works together with zonal autoshift. By pairing these capabilities, you can now see how AWS will act on your behalf to shift traffic away from the impacted AZ, as you’d expect to happen for your zonal autoshift resources during a real event. This integrated testing approach gives you a more complete view of how your application performs during infrastructure disruptions in an AZ.

AWS FIS recovery action for zonal autoshift

With this launch, we introduced a new category for AWS FIS actions with our first recovery action. With the new AWS FIS recovery action, customers that have enabled zonal autoshift can run the FIS AZ Availability: Power Interruption scenario to induce the expected symptoms of a complete interruption of power in an AZ and demonstrate how AWS would trigger a zonal autoshift during a real-world impairment. You’ll also discover any dependencies that your application may have in the unavailable AZ, which aren’t discovered by simply shifting traffic away from the AZ under healthy conditions.

Create an AWS FIS experiment template with a recovery action

Let’s take a look at how you can test zonal autoshift with AWS FIS. We’ll focus on the new integration, so if you haven’t created an experiment using the AZ Availability: Power Interruption scenario before, the post Bootstrap your chaos engineering journey with AWS Fault Injection Service Scenarios Library provides a step-by-step guide on how to set one up.

To get started, select the AZ Availability: Power Interruption scenario from the AWS FIS scenario library to create an experiment template.

Figure 1: Create experiment template using AZ Availability: Power Interruption Scenario.

Under Specify actions and targets, you’ll now see the addition of zonal autoshift. At the bottom is the new recovery action Start-ARC-Zonal-Autoshift and its target ARC-Managed-Resources.

Figure 2: The new Start-ARC-Zonal-Autoshift FIS action and ARC-Managed-Resources target.

Let’s take a look at the configurations for the zonal autoshift action.

Action type: Here you’ll see the new ARC action type along with the aws:arc:start-zonal-autoshift action.

Start after: The action is configured to start 5 minutes after the experiment begins, using an FIS-Wait action, to simulate the time that passes between the start of an event and the start of a zonal autoshift. During a real event, you can expect autoshift to trigger several minutes after the symptoms begin.

Target: A target called ARC-Managed-Resources will automatically be created, which defines which resources will be targeted.

Availability Zone identifier: Select the AZ you want zonal autoshift to move traffic away from. You can also use Edit shared parameters, which sets the same value for all actions in the experiment for consistency.

Duration: The duration is configured so that the zonal autoshift expires, and traffic begins to shift back, at the same time as the other actions in the experiment end. It is set to [outage duration - 5 minutes] to account for the 5-minute wait mentioned above before the zonal autoshift is triggered. For example, if the outage duration for the experiment is 30 minutes, then the duration for zonal autoshift will be configured to be 25 minutes.

Figure 3: Zonal autoshift edit action configuration window.
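The timing rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the console's own logic: it computes the autoshift duration as the outage duration minus the wait, formats it as an ISO 8601 duration (the format FIS uses for durations), and places it in a dict shaped like one action entry of an experiment template. Only the action ID aws:arc:start-zonal-autoshift and the target name ARC-Managed-Resources come from this post. The parameter keys, the targets key, the wait action name "Wait", and the AZ value are assumptions for illustration and should be checked against the AWS Fault Injection Service User Guide.

```python
from datetime import timedelta


def iso8601_minutes(delta: timedelta) -> str:
    """Format a positive, whole-minute timedelta as an ISO 8601 duration (e.g. PT25M)."""
    total_seconds = int(delta.total_seconds())
    if total_seconds <= 0 or total_seconds % 60:
        raise ValueError("duration must be a positive whole number of minutes")
    return f"PT{total_seconds // 60}M"


def autoshift_action(az_id: str, outage: timedelta,
                     wait: timedelta = timedelta(minutes=5)) -> dict:
    """Build a dict shaped like one entry in an FIS template's actions map."""
    return {
        "actionId": "aws:arc:start-zonal-autoshift",  # action ID named in this post
        "startAfter": ["Wait"],  # hypothetical name of the wait action
        "parameters": {
            # Assumed parameter keys; confirm them in the FIS User Guide.
            "availabilityZoneIdentifier": az_id,
            # Autoshift runs for [outage duration - wait] so it ends with the outage.
            "duration": iso8601_minutes(outage - wait),
        },
        # Assumed targets key; the target name is the one shown in the console.
        "targets": {"ZonalShiftManagedResources": "ARC-Managed-Resources"},
    }


# Example: a 30-minute outage with the default 5-minute wait.
action = autoshift_action("us-east-1a", timedelta(minutes=30))
print(action["parameters"]["duration"])  # PT25M
```

If the outage duration is not longer than the wait, the helper raises an error rather than emitting a zero or negative duration.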

In the ARC-Managed-Resources target configuration, you can define which resources will be included in the zonal autoshift. By default, it will include the supported AWS resources in your account that have been enabled for zonal autoshift. These resources must already be opted in to zonal shift.

Target method: By default, the Resource tags, filters and parameters option will be selected.

Resource tags: By default, a resource tag with a key of AzImpairmentPower and a value of Recover-autoshift-resources is created. You can apply this tag to resources that support zonal autoshift to include them in the experiment.

So far, the configuration options available are similar to other FIS actions. What’s new with the aws:arc:start-zonal-autoshift action is that we’ve added more flexibility on which resources to target with the addition of Resource parameters.

This new set of parameters provides Managed resource types and Zonal autoshift status options for targeting resources, in addition to the Target method (for example, Resource IDs or Resource tags).

Managed resource types: Select one or more of the resource types that zonal autoshift supports (Auto Scaling groups, Application Load Balancer, Network Load Balancer, and/or EKS cluster) that you want to target.

Zonal autoshift status: The managed resource types you want to target can have zonal autoshift Enabled, Disabled, or both. By default, Enabled is pre-selected. Setting this parameter to Disabled lets you target resources that are not enabled for zonal autoshift but are opted in to zonal shift. This allows you to preview how your application will behave before enabling it for zonal autoshift.

Figure 4: Zonal autoshift managed resource type configuration window.
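To make the interaction between resource tags and the new resource parameters concrete, here is a small Python model of the selection. It is only a sketch of one plausible reading of the options above, in which a resource is targeted when it matches the tag and one of the chosen managed resource types and one of the chosen autoshift statuses. It does not call AWS and is not how FIS resolves targets internally. The ManagedResource class, the short type labels, and the sample resource names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ManagedResource:
    """Hypothetical stand-in for a resource opted in to zonal shift."""
    name: str
    resource_type: str        # assumed labels: "ASG", "ALB", "NLB", "EKS"
    autoshift_enabled: bool   # True if zonal autoshift is enabled
    tags: dict = field(default_factory=dict)


def select_targets(resources, tag_key, tag_value, managed_types, statuses):
    """Return names of resources matching the tag, a chosen type, and a chosen status."""
    selected = []
    for resource in resources:
        status = "Enabled" if resource.autoshift_enabled else "Disabled"
        if (resource.tags.get(tag_key) == tag_value
                and resource.resource_type in managed_types
                and status in statuses):
            selected.append(resource.name)
    return selected


# Default tag key and value described in this post.
TAG = ("AzImpairmentPower", "Recover-autoshift-resources")

resources = [
    ManagedResource("web-alb", "ALB", True, {TAG[0]: TAG[1]}),
    ManagedResource("api-nlb", "NLB", False, {TAG[0]: TAG[1]}),
    ManagedResource("batch-asg", "ASG", True),  # untagged, so never selected
]

all_types = {"ASG", "ALB", "NLB", "EKS"}
print(select_targets(resources, *TAG, all_types, {"Enabled"}))
print(select_targets(resources, *TAG, all_types, {"Enabled", "Disabled"}))
```

With only Enabled selected, the model returns the tagged load balancer that has autoshift enabled. Adding Disabled also returns the tagged resource that is opted in to zonal shift but not yet enabled for autoshift, which is the preview use case described above.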

For more information about the FIS recovery action, refer to the AWS Fault Injection Service User Guide.

Pricing and availability

With AWS FIS, you pay for what you use. There are no upfront costs or minimum fees. You are charged based on the duration that an action is active and the number of accounts included in an experiment. For pricing details, visit the FIS pricing page.

There is no additional charge for using zonal autoshift. However, consider the cost of pre-scaling resources in multiple Availability Zones so that your application can take on additional traffic when shifting away from an AZ, along with associated costs such as CloudWatch and data transfer.
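As a rough way to reason about the pre-scaling cost, the following Python sketch estimates how much capacity each AZ needs so that the remaining AZs can carry peak load after one AZ is shifted away. It assumes traffic is spread evenly across AZs and that capacity comes in interchangeable units (for example, instances or pods). Both are simplifications, and the function names are hypothetical.

```python
import math


def per_az_capacity(peak_units: int, az_count: int) -> int:
    """Units each AZ must run so the other AZs can absorb peak load if one AZ is lost."""
    if az_count < 2:
        raise ValueError("at least two AZs are needed to shift traffic away from one")
    # After a shift, peak load is shared by (az_count - 1) AZs.
    return math.ceil(peak_units / (az_count - 1))


def pre_scale_overhead(peak_units: int, az_count: int) -> int:
    """Extra units provisioned beyond peak demand to tolerate the loss of one AZ."""
    return per_az_capacity(peak_units, az_count) * az_count - peak_units


# Example: peak demand of 12 units across 3 AZs.
print(per_az_capacity(12, 3))     # 6 units per AZ
print(pre_scale_overhead(12, 3))  # 6 extra units in total
```

In this example, each AZ runs 6 units instead of 4, so total capacity is 18 units for a 12-unit peak. Spreading the same workload over more AZs lowers the overhead under these assumptions.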

The FIS recovery action is available in all AWS Regions where FIS and zonal autoshift are available. For a list of AWS Regions where FIS is available, see FIS Service endpoints. For a list of Regions where zonal autoshift is available, see AWS Region availability for zonal autoshift.

Conclusion

By combining AWS Fault Injection Service’s AZ Availability: Power Interruption scenario with ARC zonal autoshift, you can simultaneously test against an AZ failure while validating your recovery mechanism. This combined testing approach provides a more comprehensive assessment of your application’s behavior during infrastructure disruptions.

Get started today with testing your application’s resilience using AWS Fault Injection Service and ARC zonal autoshift. You can start small by testing with non-production workloads before expanding to production environments.

For hands-on experience, try our step-by-step guide Bootstrap your chaos engineering journey with AWS Fault Injection Service Scenarios Library to set up your first experiment.

Stay resilient and keep testing!

Daniel Cil

Daniel Cil is a Senior Resilience Specialist Solutions Architect based out of Southern California. He helps AWS Industries and Strategic customers design fault-tolerant architectures and implement resilience best practices for their workloads on the AWS Cloud.