AWS Cloud Operations Blog
Category: Resilience
Scaling AWS Fault Injection Service Across Your Organization And Regions
In the first two parts of our series, we explored how to scale AWS Fault Injection Service (FIS) across AWS Organizations. Part one focused on implementing FIS in a single AWS account environment, introducing the concept of standardized IAM roles and Service Control Policies (SCPs) as guardrails for controlled chaos engineering experiments, particularly in centralized […]
Scaling AWS Fault Injection Service Across Your Organization And Accounts
Welcome to part two of our series where we focus on scaling AWS Fault Injection Service (FIS) within your organization. In part one, we learned how customers can enable individual accounts within organizations by introducing a Service Control Policy (SCP) rule to run network experiments when operating with a centralized networking infrastructure. In this blog, […]
Scaling AWS Fault Injection Service Across Your Organization Using Account Controls
AWS Fault Injection Service (FIS) empowers you to adopt chaos engineering at scale within your AWS environment. Chaos engineering injects real-world, controlled failures into a system to verify resilience and reliability, ultimately improving the customer experience. This proactive, resilience-focused approach increases your confidence in a system’s ability to respond to adverse conditions in production. You […]
New AWS Fault Injection Service recovery action for zonal autoshift
We’re excited to announce that AWS Fault Injection Service (FIS) now supports a recovery action for Amazon Application Recovery Controller (ARC) zonal autoshift. With this integration, you can now perform more comprehensive testing by creating disruptive events and triggering a zonal autoshift as part of the same experiment. That way, you can observe how your application […]
Detecting gray failures with outlier detection in Amazon CloudWatch Contributor Insights
You may have encountered a situation in the past where a single user or small subset of users of your system are reporting an event that is impacting their experience, but your observability systems didn’t show any clear impact. The discrepancy between the customer’s experience and the system’s observation of its health is referred to […]