AWS Compute Blog
Maintaining spare capacity during host failures on AWS Outposts with dynamic monitoring
AWS Outposts Rack is a fully managed service that extends AWS infrastructure, services, and APIs to customer-managed locations. Although you may be used to the seemingly limitless capacity that AWS offers in-Region, workloads running on Outposts racks are limited to the capacity that you order. You need to closely manage and monitor usage of the available resources as part of capacity management, and make sure that sufficient capacity remains available in the event of an impactful hardware failure. Although spare capacity is often planned for in the initial Outposts rack configuration order, scaling events and deployments of new workloads can lead to capacity shortages that only become visible during a failure event.
In this post, we review best practices for capacity management and fault tolerance with Outposts rack followed by an example of how the Outposts API can be used to build an automated monitoring and alerting system to highlight potential resiliency issues.
Planning for failures
The AWS Outposts High Availability Design and Architecture whitepaper discusses the principles of capacity planning within Outposts rack, such as how instance families are mapped to hosts through capacity configurations.
When looking to determine resiliency levels, we refer to having N+M capacity, where N represents the number of deployed hosts of a particular instance family (such as C5 or M5), and M represents the number of hosts that can fail while still meeting workload capacity requirements.
The capacity configuration that is applied to each host will impact the necessary recovery process in the event of a failure, depending on the number of configured or running instances. With this in mind, there are three potential recovery scenarios that can apply in the event of a host hardware failure:
- Sufficient capacity exists within all instance pools to tolerate the failure of M hosts. This is the ideal operational position because, in the event of a failure, instances can be recovered onto free capacity quickly, either through automated features such as EC2 Auto Scaling groups and instance recovery, or through a manual stop/start of the instances.
- The required instance type is not available within the existing instance pools; however, there is sufficient vCPU capacity available to execute a capacity task that creates the required instance capacity to fulfill the shortfall. Because this requires changes to the existing capacity configuration, it results in a longer recovery time overall.
- Insufficient capacity within the Outpost at both the instance pool and vCPU level means that either workloads need to be stopped to fit within the available capacity, or more Outpost hardware needs to be added. This further extends the recovery time for workloads.
Consider the following example of an Outpost configured with four M5 hosts that have been designed with an N+1 resiliency model.
In this example, there are five configured instance pools with the following usages:
| Instance size | Total instance pool capacity | Total free instance pool capacity | Max configured instances per host |
| --- | --- | --- | --- |
| m5.large | 16 | 6 | 4 |
| m5.xlarge | 8 | 3 | 2 |
| m5.2xlarge | 8 | 3 | 2 |
| m5.4xlarge | 8 | 3 | 2 |
| m5.8xlarge | 4 | 2 | 1 |
For all instance pools, the number of available instances is greater than the maximum number of instances configured on a single host. Therefore, in the event of a failure of any host, instances can be moved to the existing available capacity without any reconfiguration.
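This per-pool check can be sketched as a small function. The values below come from the table in this example; the function itself is illustrative and is not part of the sample solution:

```python
# Per-pool check: an Outpost tolerates the loss of a single host if, for
# every instance pool, the free capacity is at least the maximum number
# of instances of that size configured on any one host.

# (free slots, max configured instances per host), from the table above.
POOLS = {
    "m5.large":   (6, 4),
    "m5.xlarge":  (3, 2),
    "m5.2xlarge": (3, 2),
    "m5.4xlarge": (3, 2),
    "m5.8xlarge": (2, 1),
}

def pools_at_risk(pools):
    """Return the instance pools that could NOT absorb a single host failure."""
    return [size for size, (free, max_per_host) in pools.items()
            if free < max_per_host]

print("at-risk pools:", pools_at_risk(POOLS) or "none")
```

Because every pool has at least as many free slots as the maximum configured on a single host, the check reports no at-risk pools for this configuration.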
We can consider another scenario of running instances on the same set of hosts:
With the usage shown, four of the five configured instance pools have sufficient available capacity. However, the m5.4xlarge instance pool has only one available instance placement, so it cannot tolerate a single host failure (up to two m5.4xlarge instances can be configured on a single host). A single m5 host has a total of 96 vCPUs, and in this example the available slots across all pools represent 156 vCPUs. This means that, by executing a capacity task to rebalance the available slots, instances could be restarted after a host failure.
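The vCPU-level check can be sketched similarly. The vCPU counts per instance size are standard for the M5 family; the free-slot counts below are hypothetical and do not reproduce the exact scenario above, they only illustrate the calculation:

```python
# vCPUs per instance size for the M5 family.
VCPUS = {"m5.large": 2, "m5.xlarge": 4, "m5.2xlarge": 8,
         "m5.4xlarge": 16, "m5.8xlarge": 32}

M5_HOST_VCPUS = 96  # total vCPUs on a single m5 host

def free_vcpus(free_slots):
    """Total vCPUs represented by the free slots across all instance pools."""
    return sum(VCPUS[size] * count for size, count in free_slots.items())

# Hypothetical free slots, assumed for illustration only.
free = {"m5.large": 6, "m5.xlarge": 3, "m5.2xlarge": 3,
        "m5.4xlarge": 1, "m5.8xlarge": 2}

headroom = free_vcpus(free)
# If the free vCPUs meet or exceed one host's worth, a capacity task
# could rebalance the slots to recover instances after a host failure.
print(f"free vCPUs: {headroom}, covers one host: {headroom >= M5_HOST_VCPUS}")
```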
Automating a capacity observability solution
With the release of the capacity task functionality for Outposts, details of instance placement and slot configuration per host are available in both the AWS Management Console and through the API. Using this data, an automated solution can query the current configuration and provide notifications when the N+M resiliency requirements for your workloads are at risk.
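As a sketch of how this data could be retrieved, the Outposts ListAssets API returns per-host compute attributes; the response fields assumed here (`ComputeAttributes.InstanceTypeCapacities`) should be verified against the current SDK documentation:

```python
def slots_per_host(list_assets_response):
    """Aggregate configured instance slots per host from a ListAssets response.

    Assumes each asset's ComputeAttributes carries an InstanceTypeCapacities
    list of {InstanceType, Count} entries, per the Outposts capacity-task API.
    """
    hosts = {}
    for asset in list_assets_response.get("Assets", []):
        attrs = asset.get("ComputeAttributes", {})
        caps = {c["InstanceType"]: c["Count"]
                for c in attrs.get("InstanceTypeCapacities", [])}
        hosts[asset["AssetId"]] = caps
    return hosts

def fetch_slots(outpost_id):
    """Query the Outposts API for the current per-host slot configuration."""
    import boto3  # available by default in the Lambda Python runtime
    client = boto3.client("outposts")
    return slots_per_host(client.list_assets(OutpostIdentifier=outpost_id))
```

With the per-host slot map in hand, the N+M checks shown earlier can be evaluated on every run.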
The following diagram shows an example solution to achieve this, with the sample code provided in the AWS Samples GitHub repository. The solution is deployed using an AWS Serverless Application Model (AWS SAM) template.
- HAQM EventBridge Scheduler initiates an AWS Lambda function on a user-defined schedule.
- The Lambda function evaluates the Outpost rack capacity, creates and updates HAQM CloudWatch alarms, and initiates regular reporting.
- An HAQM Simple Notification Service (HAQM SNS) topic sends the report to user-defined endpoints such as email or Slack.
- CloudWatch alarms continually monitor for changes to Outpost capacity.
- If an alarm threshold is breached, a Lambda function is invoked to send notifications through HAQM SNS to the user-defined endpoints.
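The alarms created by the solution might look like the following sketch; the metric namespace and name here are assumptions for illustration, not the sample code's actual values:

```python
def capacity_alarm_definition(outpost_id, instance_type,
                              free_slots_threshold, sns_topic_arn):
    """Build kwargs for cloudwatch.put_metric_alarm(**definition).

    The namespace and metric name are hypothetical: the solution publishes
    its own custom capacity metrics and wires both ALARM and OK transitions
    to SNS so that recovery is reported as well.
    """
    return {
        "AlarmName": f"outposts-capacity-{outpost_id}-{instance_type}",
        "Namespace": "OutpostsCapacity",    # assumed custom namespace
        "MetricName": "FreeInstanceSlots",  # assumed custom metric
        "Dimensions": [
            {"Name": "OutpostId", "Value": outpost_id},
            {"Name": "InstanceType", "Value": instance_type},
        ],
        "Statistic": "Minimum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": free_slots_threshold,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
        "OKActions": [sns_topic_arn],  # notify when capacity recovers too
    }
```

The threshold per instance pool would be derived from the M value supplied at deployment, for example the maximum number of instances of that type configured on any M hosts.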
At the core of the solution are two Lambda functions:
Monitoring stack manager: This Lambda function sets up the dynamic monitoring of the desired N+M resiliency level. It achieves this by creating and updating CloudWatch alarms based on the current capacity configuration of the Outposts being monitored, and the capacity usage for each instance family and type. The function generates detailed reports for each Outpost, identifying any potential resiliency issues for each instance family based on the M value that is specified at the time of deployment.
The detailed report, which is issued through the configured SNS topic, starts with an overall summary of the resiliency status of each instance family.
Following the overall summary section, a more detailed analysis is provided for each instance family, looking at resiliency from both instance type and vCPU capacity perspectives. As part of this detailed analysis, the level of risk for each capacity pool is provided alongside a review of available instance capacity and suggested mitigation options.
This summary report is generated on every execution of the Monitoring Stack Manager function; the default EventBridge Scheduler configuration triggers it daily.
Process alarm: When an alarm configured by the Monitoring Stack Manager Lambda triggers, the Process Alarm Lambda analyzes Outpost capacity, checking for available free vCPUs within the hosts running the affected instance family. A report is then sent through SNS to immediately draw attention to the capacity risk, with guidance on whether the resiliency risk can be mitigated by applying an alternate capacity configuration.
Similar to the report generated by the Monitoring Stack Manager function, a more detailed breakdown of the capacity issue is provided that allows for easy identification of any necessary follow-up actions. These actions are recommendations for manual resolution of the issue; the solution does not apply them automatically.
When the available capacity returns to a level that matches the N+M resiliency requirements you defined, a further notification report is sent to confirm this, and the alarm is reset.
You may also prefer to integrate notifications with platforms such as Slack or Microsoft Teams. One option is to use a Lambda function to transform the HAQM SNS notification and publish the message through a webhook. For more information, see How do I use webhooks to publish HAQM SNS messages to HAQM Chime, Slack, or Microsoft Teams?. Alternatively, for Slack you can use Slack’s email-to-channel integration, which accepts email messages and forwards them to a Slack channel. For more information, see Configure HAQM SNS to send messages for alerts to other destinations.
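A minimal sketch of such a forwarding Lambda, assuming an incoming SNS event and a Slack incoming-webhook URL supplied through a hypothetical SLACK_WEBHOOK_URL environment variable:

```python
import json
import os
import urllib.request

def build_slack_payload(sns_event):
    """Extract the SNS message and wrap it in Slack's webhook payload format."""
    record = sns_event["Records"][0]["Sns"]
    subject = record.get("Subject") or "Outposts capacity alert"
    return {"text": f"*{subject}*\n{record['Message']}"}

def handler(event, context):
    """Lambda handler: forward an SNS notification to a Slack webhook."""
    payload = build_slack_payload(event)
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # assumed configuration
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return {"status": resp.status}
```

Subscribing this function to the solution's SNS topic delivers the same capacity reports into a Slack channel without changing the rest of the deployment.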
Considerations for deploying this solution
The sample solution provided has been designed to work for users who are operating Outposts at any scale. However, there are some considerations for deploying:
- The solution is deployed within the AWS account that owns the Outpost, rather than workload/consumer accounts that might be using Outposts resources through AWS Resource Access Manager (AWS RAM).
- The deployment is AWS Region-specific. Therefore, it would need to be deployed in each AWS Region you’re using Outposts in.
- Each stack deployment supports dedicated N+M configuration monitoring, allowing you to create separate deployments to match the desired resilience requirements across multiple Outposts.
Cleaning up
Because this solution is implemented through AWS SAM, the only cleanup required is to run the AWS SAM deployment with the cleanup parameter, as documented in the code repository README file.
Conclusion
In this post, we reviewed how to calculate N+M resilience for Outposts rack deployments, and provided a sample solution that can dynamically monitor and report on capacity constraints. Making sure that there is sufficient available capacity within an Outpost rack to tolerate failures is critical to running resilient applications and minimizing any potential downtime. Combining good capacity management practices with service functionality, such as EC2 Auto Scaling, automatic instance recovery, and placement groups, gives you several options to make sure workloads can continue to run even during failure events. If you need any assistance calculating your Outposts rack resiliency, or further information on deploying and running fault tolerant workloads, reach out to your AWS Account team.