AWS HPC Blog

The frugal HPC architect – ensuring effective FinOps for HPC workloads at scale

The adoption of on-demand, elastic and scalable compute in the cloud has many benefits. It frees organizations from the recurring and protracted procurement processes associated with on-premises infrastructure, which often require educated guesswork to size and result in infrastructure that is only ever as good as the day it was installed. While the focus of this blog is on reducing costs, it’s also important to recognize the opportunities that experimenting at scale can unlock. The cloud allows customers to reach levels of scale which would be impossible to financially justify with on-premises infrastructure, because capacity can be provisioned for the duration of any individual workload and then released. However, with flexible provisioning in the cloud comes cost variability, which can be daunting to organizations with fixed budgets or tight margins.

In this post we’ll explore the best practices we’ve seen from our work with customers from various industries running HPC workloads at scale. Additionally, we’ll reference guidance from Werner Vogels, an industry leader who developed ‘The Frugal Architect’ to help organizations reason about these challenges and find the right balance between the competing demands for flexibility, efficiency, and effectiveness of systems.

HPC and the cloud

High performance computing (HPC) systems are increasingly common on AWS and were some of the first workloads to make use of elastic compute capacity at scale, whether augmenting on-premises capacity with ‘burst’ models or serving as the primary venue for large-scale computation. While the lessons of The Frugal Architect have broad applicability, there are some which are especially applicable to HPC workloads.

HPC encompasses both tightly-coupled systems running workloads like computational fluid dynamics (CFD) and massive, loosely-coupled systems running workloads such as financial models. Both are likely to generate meaningful costs for their consumers, so careful consideration of the design, measurement, and observability of these systems is all the more important. This post will focus on the opportunities for loosely-coupled workloads, but many of them will be applicable to any HPC workload.

Cost and the levers for savings

The first ‘law’ of The Frugal Architect is “Make cost a non-functional requirement”, emphasizing the importance of considering cost alongside other requirements such as security, accessibility, compliance and maintainability as measures of the success of a system.

In a ‘pay as you go’ provisioning model, costs are the product of the unit cost and the number of units consumed. Optimizations can come from reducing the cost of a given unit, from consuming fewer total units, or both. You can achieve either of these in different ways for HPC systems running in the cloud, and AWS has services and features to support each approach.
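
To make the two levers concrete, here is a minimal sketch; the instance-hours and hourly prices below are made-up placeholders rather than real AWS rates.

```python
# Illustration of cost = unit price x units consumed, and the two savings levers.
# All prices and quantities are hypothetical placeholders, not real AWS rates.

def compute_cost(instance_hours: float, price_per_hour: float) -> float:
    """Total cost is simply the units consumed multiplied by the unit price."""
    return instance_hours * price_per_hour

baseline = compute_cost(instance_hours=10_000, price_per_hour=1.00)

# Lever 1: reduce the unit cost (for example, a discounted purchasing model).
cheaper_units = compute_cost(instance_hours=10_000, price_per_hour=0.40)

# Lever 2: reduce the units consumed (for example, less idle or overhead time).
fewer_units = compute_cost(instance_hours=7_000, price_per_hour=1.00)

print(f"baseline: ${baseline:,.0f}  cheaper units: ${cheaper_units:,.0f}  fewer units: ${fewer_units:,.0f}")
```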

Reducing the unit cost

AWS offers a variety of opportunities to reduce the unit cost of consuming its services. Each of these strategies comes with trade-offs, which may or may not impact application performance, but once assessed and implemented they can have significant benefits.

One of the most impactful examples of an opportunity to reduce the unit cost of compute is HAQM EC2 Spot, which enables customers to achieve savings of up to 90% over on-demand provisioning for the exact same compute capacity. The trade-off with EC2 Spot is that these instances come from spare EC2 capacity, so they may be reclaimed by AWS at any time with a 2-minute warning. HPC workloads that can accommodate sporadic interruptions (either because individual tasks are short, or because they can checkpoint their progress) can realize significant savings with EC2 Spot with minimal disruption.
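
As an illustration of how a worker might tolerate interruptions, here is a minimal sketch that polls the EC2 instance metadata service (IMDSv2) for the two-minute Spot interruption notice and checkpoints before the instance is reclaimed. The task and checkpoint functions are hypothetical stand-ins for your application’s own logic.

```python
# A sketch of a checkpoint-aware Spot worker. It polls the instance metadata
# service (IMDSv2) for the two-minute interruption notice and saves progress
# before the instance is reclaimed. The task and checkpoint functions below
# are hypothetical placeholders for application-specific logic.
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    # IMDSv2 requires a session token before metadata can be read.
    req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"})
    return urllib.request.urlopen(req, timeout=2).read().decode()

def interruption_pending(token: str) -> bool:
    # The spot/instance-action document only exists once an interruption is
    # scheduled; until then the endpoint returns HTTP 404.
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token})
    try:
        urllib.request.urlopen(req, timeout=2)
        return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise

def run_one_task(task_id: int) -> None:      # hypothetical unit of work
    time.sleep(1)

def save_checkpoint(task_id: int) -> None:   # hypothetical: persist progress durably
    print(f"checkpointed at task {task_id}")

def main() -> None:
    token = imds_token()
    for task_id in range(1_000):             # hypothetical task queue
        if interruption_pending(token):
            save_checkpoint(task_id)
            return
        run_one_task(task_id)

if __name__ == "__main__":
    main()
```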

There are also commercial offerings for customers who have a degree of steady demand. In this scenario an EC2 Instance Savings Plan allows customers to achieve savings of up to 72% in exchange for a commitment to a consistent amount of usage over a 1- or 3-year term. Often customers will combine EC2 Spot, Savings Plans, and on-demand capacity to achieve an optimal mix.

Beyond commercial solutions, there are technical opportunities to reduce the unit cost of compute by ensuring you provision only what you need. AWS currently offers over 750 EC2 instance types, so customers are better able to find the instance type that’s right for their workload. This represents a departure from traditional on-premises provisioning, where homogeneous deployments were used to support the widest possible set of tasks. In the cloud, each workload can provision the right mix of CPU, GPU, memory, storage and network performance, avoiding the need to pay for unused capabilities and features.
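
One way to explore that breadth programmatically is to shortlist instance types that meet a workload’s minimum requirements, as in the sketch below; the vCPU and memory thresholds are illustrative, and the filtering is done client-side for simplicity.

```python
# A sketch of shortlisting EC2 instance types that satisfy a workload's minimum
# vCPU and memory requirements. The thresholds are illustrative placeholders;
# results are filtered client-side from the DescribeInstanceTypes API.
import boto3

MIN_VCPUS = 16        # hypothetical workload requirement
MIN_MEMORY_GIB = 64   # hypothetical workload requirement

ec2 = boto3.client("ec2")
candidates = []
for page in ec2.get_paginator("describe_instance_types").paginate():
    for it in page["InstanceTypes"]:
        vcpus = it["VCpuInfo"]["DefaultVCpus"]
        mem_gib = it["MemoryInfo"]["SizeInMiB"] / 1024
        if vcpus >= MIN_VCPUS and mem_gib >= MIN_MEMORY_GIB:
            candidates.append((it["InstanceType"], vcpus, round(mem_gib)))

# List the smallest instance types that still fit, to avoid over-provisioning.
for name, vcpus, mem in sorted(candidates, key=lambda c: (c[1], c[2]))[:10]:
    print(f"{name}: {vcpus} vCPUs, {mem} GiB memory")
```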

In addition to compute instances, AWS provides a wide variety of services for data storage which offer opportunities to reduce the unit cost of storing data. For example, depending on the characteristics of a given workload it may be optimal to make use of HAQM Simple Storage Service (HAQM S3) – perhaps with Mountpoint for HAQM S3 if you need file-based access – over a traditional network filesystem. From there you can explore offerings like HAQM S3 Intelligent-Tiering to further reduce the unit cost per GB of storage if there are files which are accessed infrequently.
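
For data that is accessed unpredictably, a lifecycle rule can move objects into S3 Intelligent-Tiering automatically. The sketch below assumes a hypothetical bucket name and prefix, and transitions objects 30 days after creation.

```python
# A sketch of a lifecycle rule that transitions objects under a given prefix to
# the S3 Intelligent-Tiering storage class 30 days after creation. The bucket
# name and prefix are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-hpc-results-bucket",                # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "results-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "results/"},  # placeholder prefix
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```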

For each application there will likely be a number of opportunities to reduce the unit cost of resources on AWS. By analyzing the potential benefits and trade-offs you can prioritize the options and explore those with the greatest potential for any individual workload, getting away from a ‘one size fits all’ approach.

Reducing unit consumption

Units of consumption will vary depending on the AWS service you’re using. Some even have multiple units which cover different types of utilization (for example, HAQM S3 has consumption units for GET operations as well as for GB of data stored).

There are some effective ways to reduce HAQM EC2 core-hour unit consumption. For example, you can increase efficiency by taking steps to ensure that, as much as possible, cores are working on compute tasks rather than sitting idle or working on non-compute activities. You can also track effectiveness by ensuring that the value of the work being done exceeds the cost of the compute resources provisioned, and if that’s not the case, consider not running that particular workload at all. This aligns with the second law of The Frugal Architect – “Systems that last align cost to the business”, ensuring that any increase in costs is understood within the context of revenue.

This focus on efficiency and effectiveness represents a significant departure from on-premises task scheduling where the capacity is static. In that scenario, low value, or inefficient workloads may be tolerated as they don’t introduce any incremental cost and can simply be run at a low priority. While this may also be true in the cloud when capacity is covered by a Savings Plan, any additional capacity provisioned will bring an incremental cost which should be assessed against the business value.

One category of approaches to reducing unit consumption of compute is to analyze the billable lifecycle of an EC2 instance from the point where the operating system (OS) starts to boot until termination. That time may encompass overhead activities like OS boot, downloading and installing binaries, starting an HPC client agent, connecting to a scheduler, idle time with no active tasks, or active tasks waiting on dependent data. By minimizing time spent on overhead and maximizing time spent on useful computation, you can improve the efficiency of the system, reducing the compute units you need for a given workload.
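
A simple way to start is to record timestamps for each lifecycle phase and report what fraction of the billable time went to useful computation. The sketch below uses illustrative timestamps; in practice these events would come from your own boot scripts, scheduler logs, or agent telemetry.

```python
# A sketch of quantifying overhead across an instance's billable lifecycle.
# The timestamps below are illustrative placeholders; real values would come
# from boot scripts, scheduler logs, or agent telemetry.
from datetime import datetime

events = {
    "boot_start":       datetime(2024, 6, 1, 9, 0, 0),
    "agent_ready":      datetime(2024, 6, 1, 9, 4, 30),   # OS boot, binaries installed, agent started
    "first_task_start": datetime(2024, 6, 1, 9, 6, 0),    # scheduler hands over the first task
    "last_task_end":    datetime(2024, 6, 1, 11, 50, 0),
    "terminated":       datetime(2024, 6, 1, 11, 55, 0),
}

billable_s = (events["terminated"] - events["boot_start"]).total_seconds()
useful_s = (events["last_task_end"] - events["first_task_start"]).total_seconds()

# Note: this treats everything between the first and last task as useful; idle
# gaps between tasks would need per-task events to be accounted for properly.
print(f"billable time:  {billable_s / 3600:.2f} hours")
print(f"useful compute: {useful_s / billable_s:.0%} of billable time")
```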

Aggregate OS boot time costs can be reduced by optimizing individual boot processes, by reducing the number of instance boot events through the use of long-running On-Demand Instances, or by diversifying instance selection to minimize EC2 Spot interruption events that require replacement instances to be started. CPU cores waiting on data can be kept occupied with a degree of oversubscription of tasks, for example by running 10 threads on a system with 8 vCPUs.
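
Here is a conceptual sketch of that kind of oversubscription: more worker threads than vCPUs, so cores stay busy while some tasks block on data. The sleep and sum calls are placeholders for a real task’s I/O wait and computation, which in practice would typically be native code or separate processes.

```python
# A conceptual sketch of oversubscribing tasks that spend part of their time
# waiting on data. The sleep and sum below are placeholders for real I/O waits
# and real computation (usually native code or separate processes in practice).
import os
import time
from concurrent.futures import ThreadPoolExecutor

def process(task_id: int) -> int:
    time.sleep(0.5)               # placeholder: waiting on dependent data
    return sum(range(100_000))    # placeholder: the actual compute work

vcpus = os.cpu_count() or 8
workers = max(vcpus + 2, int(vcpus * 1.25))   # e.g. 10 workers on an 8-vCPU instance

with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(process, range(100)))

print(f"{len(results)} tasks completed with {workers} workers on {vcpus} vCPUs")
```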

Arm-based AWS Graviton instances can offer significant price/performance benefits over x86-based instances running HPC workloads. Graviton-based instances provide up to 40% better price/performance while using up to 60% less energy than comparable EC2 instances. Because AWS Graviton processors implement the Arm64 instruction set you may need to undertake some work to port your application, but AWS offers the AWS Porting Advisor for Graviton to make this easier. You can explore and track the impact of adopting Graviton using the AWS Graviton Savings Dashboard. The price/performance benefits of Graviton can let you choose between completing a workload more quickly for the same cost, or reducing costs by provisioning fewer instances for the same duration.

A good metric for success in reducing unit consumption is CPU utilization. If your HAQM EC2 instances have significant periods of under-utilization, then it’s possible that you could be using a smaller number of instances running at higher levels of utilization.
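
One simple way to check this, sketched below, is to pull the average CPUUtilization metric from CloudWatch for a given instance; the instance ID is a placeholder, and the default five-minute basic monitoring is assumed.

```python
# A sketch of retrieving hourly average CPU utilization for one instance over
# the last 24 hours from CloudWatch. The instance ID is a placeholder.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=now - timedelta(days=1),
    EndTime=now,
    Period=3600,              # one datapoint per hour
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(f"{point['Timestamp']:%Y-%m-%d %H:%M} UTC  average CPU {point['Average']:.1f}%")
```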

HPC applications which make use of large datasets can also benefit from opportunities to reduce consumption of high-performance storage. For example, many HPC workloads in the cloud make use of HAQM FSx for Lustre filesystems, which offer hundreds of GB/s of throughput and millions of input/output operations per second (IOPS). But it may not make sense to host the entire dataset in Lustre. Often, customers opt to link the filesystem to HAQM S3. With this approach, FSx for Lustre transparently presents the HAQM S3 objects as files which can be imported into Lustre on demand. This greatly reduces the size of the Lustre filesystem while still allowing for high performance. Further reductions can be made by enabling Lustre data compression, which reduces the size of the filesystem required for a given dataset.
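
For reference, the sketch below shows what such a configuration could look like when creating a scratch filesystem with boto3: an S3 bucket linked via an import path and LZ4 data compression enabled. The subnet and bucket names are placeholders, and a scratch deployment type is assumed.

```python
# A sketch of creating a scratch FSx for Lustre filesystem that lazily imports
# objects from a linked S3 bucket and enables LZ4 data compression. The subnet
# and bucket names are hypothetical placeholders.
import boto3

fsx = boto3.client("fsx")
response = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,                       # GiB; smallest SCRATCH_2 size
    SubnetIds=["subnet-0123456789abcdef0"],     # placeholder subnet
    LustreConfiguration={
        "DeploymentType": "SCRATCH_2",
        "ImportPath": "s3://my-hpc-dataset",    # placeholder linked bucket
        "DataCompressionType": "LZ4",           # compress file data on disk
    },
)
print(response["FileSystem"]["FileSystemId"])
```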

Another approach is to make use of HAQM File Cache which can similarly provide high-performance access to HAQM S3 object storage but can also cache files stored in an on-premises NFSv3 filesystem, potentially reducing the cost of storing duplicate data in the cloud. Further savings can be realized with both HAQM File Cache and FSx for Lustre by shutting them down when not in use and recreating them when demand returns.

Observability

The fourth law of The Frugal Architect is “Unobserved Systems lead to unknown costs”. In the previous section we introduced the core concepts of efficiency (ensuring resources provisioned spend as little time as possible on overhead work) and effectiveness (ensuring that the value of the work being done exceeds the cost of the provisioned compute resources).

Both of these are hard to reason about without observability systems that can inform HPC managers about overall efficiency, through a measure such as computation as a percentage of capacity provisioned, and inform their customers about effectiveness, through measures of business value as a percentage of compute cost (hopefully over 100%).

By surfacing these metrics in a timely way and with sufficient detail, your stakeholders will be better placed to identify and quantify ongoing opportunities for improvement, measure the results of any changes, and catch unexpected outcomes quickly. AWS offers various tools to support observability across metrics, logs, and traces, such as HAQM CloudWatch and HAQM Managed Service for Prometheus.
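
As an example of what surfacing these measures could look like, the sketch below publishes efficiency and effectiveness as custom CloudWatch metrics. The namespace, dimension, and input figures are hypothetical; in practice they would be derived from your scheduler, billing, and business data.

```python
# A sketch of publishing efficiency and effectiveness as custom CloudWatch
# metrics. The namespace, dimension, and input figures are hypothetical
# placeholders that would normally come from scheduler and billing data.
import boto3

cloudwatch = boto3.client("cloudwatch")

core_hours_busy = 8_400.0          # placeholder: core-hours spent on compute tasks
core_hours_provisioned = 10_000.0  # placeholder: total billable core-hours
business_value = 12_500.0          # placeholder: value delivered (USD)
compute_cost = 9_800.0             # placeholder: cost of compute consumed (USD)

dimensions = [{"Name": "Cluster", "Value": "hpc-prod"}]  # placeholder dimension

cloudwatch.put_metric_data(
    Namespace="HPC/FinOps",        # hypothetical custom namespace
    MetricData=[
        {
            "MetricName": "Efficiency",
            "Dimensions": dimensions,
            "Value": 100.0 * core_hours_busy / core_hours_provisioned,
            "Unit": "Percent",
        },
        {
            "MetricName": "Effectiveness",
            "Dimensions": dimensions,
            "Value": 100.0 * business_value / compute_cost,
            "Unit": "Percent",
        },
    ],
)
```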

It’s also important to have clear visibility of (and if possible, attribution of) costs, and there are tools to help you access, analyze, and understand cost data. For example, AWS Data Exports allows you to export data in the standard FinOps Open Cost and Usage Specification (FOCUS) 1.0 schema that you can then process using existing reporting solutions – or visualize with HAQM QuickSight. Additionally, AWS Cost Explorer allows you to explore costs in detail, starting with high-level views and then diving deeper to identify trends, anomalies, and key cost drivers.
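
Cost Explorer also has an API, which makes it straightforward to fold cost data into your own reporting. The sketch below groups one month of unblended cost by service; the date range is a placeholder.

```python
# A sketch of retrieving one month of unblended cost grouped by service using
# the Cost Explorer API. The date range is a placeholder.
import boto3

ce = boto3.client("ce")
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-06-01", "End": "2024-07-01"},  # placeholder month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    if amount > 0:
        print(f"{service}: ${amount:,.2f}")
```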

Finally, AWS Compute Optimizer can help you identify under-utilized resources and will make recommendations for how to right-size them. For example, it may suggest changes like selecting instances with less memory if the memory currently provisioned is under-utilized. These recommendations are presented along with the potential savings to help you make better decisions.

Conclusion

When planning a migration or expansion of an HPC workload into the cloud, cost should be considered alongside all of the other applicable non-functional requirements. In this post, we explored the inherent change in the drivers of cost as you move these systems into the cloud and how the mechanisms that worked for fixed on-premises clusters may not apply in the same way.

HPC systems can benefit greatly from the flexibility and scale offered by the AWS Cloud and as a result the costs can be material and need to be well understood. The way that costs accrue differs depending on how you choose to leverage compute, storage, and other resources in the cloud, requiring a different cost management approach compared to traditional fixed-capacity, on-premises infrastructure.

AWS offers a number of effective mechanisms to enable you to make full use of the scale and elasticity that the cloud offers while ensuring that you can understand costs and manage them accordingly. With guiding principles like those outlined in The Frugal Architect, the tools and frameworks offered by AWS make it possible to build highly effective, efficient, and flexible architectures to support your HPC workloads.

Consider exploring the AWS Well-Architected Framework, which offers a pillar dedicated to cost optimization alongside other key non-functional requirements including security, reliability, and sustainability.

Alex Kimber

Alex Kimber is a Principal Solutions Architect in AWS Global Financial Services with over 20 years of experience in building and running high performance grid computing platforms in investment banks.