AWS HPC Blog
Characteristics of financial services HPC workloads in the cloud
This blog post introduces high performance computing (HPC) requirements in the Financial Services Industry (FSI). It defines the technical attributes of compute-heavy workloads and discusses their essential characteristics. Additionally, it provides a decision tree process to help customers choose the right HPC platform for the cloud, whether a commercial vendor, open-source, or cloud-native solution, based on the customer and workload requirements. Although covering all possible options is not feasible, this blog post offers guidance on achieving price-performance goals and solving HPC challenges using AWS services and solutions, based on our experience with Financial Services Industry customers.
In financial services, HPC (also known as grid computing) has long been used to run simulations and solve complex challenges. For example, in capital markets, HPC systems are used to value financial instruments (i.e., stocks, exchange-traded funds (ETFs), derivatives, bonds) using mathematical models such as Monte Carlo. The usage patterns range from latency-sensitive real-time/intraday calculations (to simulate securities trade risks during trading hours) to typically large overnight batch workloads that produce results for internal risk monitoring and regulatory reporting. Elsewhere in the capital markets sector, investment management firms utilize HPC for pricing and risk analysis, calculating risk exposure, and defining, optimizing, and back-testing hedging strategies. In the insurance sector, HPC use cases include actuarial modeling and modeling potential scenarios for natural disasters to help insurers understand losses, set premiums, and develop risk mitigation strategies.
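To make the Monte Carlo models mentioned above concrete, the following minimal sketch prices a European call option by simulating terminal prices under geometric Brownian motion. The parameters and the standalone structure are illustrative only; on a real grid, a pricing task like this would be one of thousands dispatched across compute nodes.

```python
import math
import random

def monte_carlo_call_price(spot, strike, rate, vol, maturity, n_paths, seed=42):
    """Price a European call by simulating terminal prices under GBM."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol ** 2) * maturity
    diffusion = vol * math.sqrt(maturity)
    payoff_sum = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)                    # standard normal draw
        terminal = spot * math.exp(drift + diffusion * z)
        payoff_sum += max(terminal - strike, 0.0)  # call payoff at maturity
    # Discount the average simulated payoff back to today.
    return math.exp(-rate * maturity) * payoff_sum / n_paths

price = monte_carlo_call_price(spot=100.0, strike=105.0, rate=0.03,
                               vol=0.2, maturity=1.0, n_paths=100_000)
```

Because each path is independent, the path count (and the number of instruments) can be split freely across workers, which is exactly what makes these workloads a good fit for HPC grids.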
Cloud adoption patterns in FSI
Financial institutions face growing demands on their HPC resources due to regulatory requirements, market volatility, and other reporting needs – requiring significant capacity during certain periods like overnight processing or quarter-end reporting, while using minimal resources at other times.
Many organizations are moving beyond traditional on-premises and data center infrastructure to embrace cloud solutions. This shift allows them to scale resources up and down as needed while optimizing costs. The term “cloud-native” is still evolving, but there is a convergence towards serverless technologies for workload scheduling and ultimately for computation.
FSI customers typically migrate their HPC platforms to the cloud in phases. Below are the different levels of cloud adoption for our customers:
- All On-premises: Schedulers, compute, and data sources are hosted on-premises.
- Hybrid Burst: Schedulers, compute, and data sources are on-premises but augmented by additional cloud compute capacity during peak or unexpected demand.
- Lift and Shift: Existing schedulers are moved as-is to the cloud and run compute nodes in the cloud with a similar configuration to on-premises.
- AWS Optimized: This is more of a process than a state, but the focus here is on the elasticity of compute and making use of managed services wherever possible. Combining that with picking the right purchasing model, such as HAQM EC2 Spot, On-Demand, or Savings Plans, along with operational changes allows our customers to make the most of the cloud, lowering their costs and easing the operational burden of such platforms.
- AWS Native: Represents a complete reimagining of traditional HPC platforms, emphasizing cloud-native and serverless architectures as core building blocks. This transformation encompasses modern scheduling solutions, architectural innovations, and optimized hardware selection – from latest-generation x86 processors to specialized AI accelerators. By freeing themselves from legacy licenses and dependencies, and modernizing algorithmic approaches, organizations can achieve exceptional price-performance benefits while future-proofing their HPC infrastructure on AWS.
Workload characteristics to determine architecture choice
Given the wide range of use cases mentioned earlier, it is easy to see how FSI HPC workloads can have varying software and hardware requirements that affect their migration to the cloud. However, these workloads share some common key characteristics:
- Varying Task Duration: ranging from seconds to hours or even days.
- High Throughput Requirements: often needing to process tens of thousands of tasks per second during peak periods.
- High Parallelization Requirements: most calculations tend to be loosely coupled, which allows for them to be run in parallel to achieve the scale/speed requirements.
- Resource Intensity: CPU, memory, and I/O patterns, as well as CPU-to-memory ratio requirements.
- Software Requirements: Specific operating system, platform and library support.
- Data & Application Dependencies: Local processing vs distributed requirements and integrating with existing frameworks and applications.
- Elasticity & Flexibility: having the ability to scale up and down based on demand and use different compute options.
- Cloud Service Maturity and Support: the level of organizational capabilities in the cloud, including the current level of cloud adoption across the firm, existing reusable patterns, and support for cloud-native and managed services.
- Cost Optimization: Finding the right balance between performance and resource optimization, managing costs arising from the sheer scale and frequency of use.
While this is not an exhaustive list, it highlights some of the important factors to consider when picking the right platform and HPC scheduler. As a starting point to drive our decision, we will consider Task Duration and Software Requirements. We will use the decision tree in Fig. 1 to guide you through the journey of picking the right platform for your FSI HPC workloads in the cloud.

Fig. 1. Decision tree for choosing an HPC Solution for the Cloud based on workload characteristics.
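The "loosely coupled" characteristic from the list above is worth making concrete: because each calculation depends only on its own inputs, tasks can be fanned out freely across workers. The sketch below illustrates this locally with Python's `concurrent.futures`; the task function is a hypothetical stand-in, and on a real grid the same pattern maps each task to a scheduler slot or a serverless invocation.

```python
from concurrent.futures import ThreadPoolExecutor

def value_position(position_id):
    """Hypothetical stand-in for an independent pricing task."""
    # Each task depends only on its own inputs, so tasks can run in any
    # order and on any worker -- the definition of loose coupling.
    return position_id, position_id * 1.5  # placeholder "valuation"

position_ids = range(1, 101)

# Fan the independent tasks out across workers; a grid scheduler does the
# same thing across thousands of cores instead of a local thread pool.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(value_position, position_ids))
```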
Navigating the intricate landscape of HPC platforms within financial services often begins with a critical decision for our customers: whether they currently have an on-premises HPC scheduler they would like to retain (at least in the short term), or if they would like to build a new cloud-native solution. This decision is crucial as it determines the level of cloud adoption to which a customer can take their HPC platform.
Migrating your existing scheduler to the cloud
Let’s look at the case where you choose to keep your existing grid scheduler, whether because of concerns about the effort required for migration, gaps in cloud knowledge, or project timelines. Considering existing commercial choices, we see many of our FSI customers use IBM Spectrum Symphony or TIBCO DataSynapse GridServer®. Similarly, in the open-source domain, FSI customers have gravitated towards products like Slurm Workload Manager or HTCondor, but there are others.
Bearing in mind our cloud adoption levels from above; the migration of existing schedulers usually takes one of two forms:
- Hybrid Burst: where customers use AWS resources to supplement their existing on-premises/third-party hosted grids, mostly driven by the need for additional capacity at peak times.
- Lift-and-Shift: where customers migrate their entire HPC system to AWS ‘as-is’ or with minimal changes.
Both of those approaches are possible with the commercial and open-source schedulers discussed above; however, they are at different stages of maturity and support when it comes to adopting current AWS best practices, features, and Application Programming Interfaces (APIs). Furthermore, while this approach offers a familiar look and feel to what customers are used to on-premises, it still comes with the same requirements and constraints, such as having to manage a large estate of complex clusters and platforms, and, in the case of commercial options, the same licensing requirements and constraints.
Building and using cloud-native HPC Schedulers and platforms
While the options discussed above may be a quicker way to start reaping the benefits of the cloud, the constraints are often something our customers are looking to resolve as well. Thus, an increasingly popular approach we have seen is customers deciding to uplift their HPC platform for the cloud. At one end of the spectrum, some customers choose cloud-native managed services, decreasing the operational burden. Others decide instead to build a custom solution using cloud-native technologies; while this option is quite popular, it comes with the development and maintenance requirements for such a platform. When picking the right cloud-native solution, in addition to whether we use a managed service or build a custom solution, we will look at the job and/or task duration as a workload characteristic that drives our decision.
If you are currently using the open-source Slurm Workload Manager, are used to the look and feel and would like to continue using it, then AWS offers AWS ParallelCluster, an open source cluster management tool that makes it easy for you to deploy and manage Slurm clusters on AWS. In addition to that, we also offer a fully managed Slurm option through our AWS Parallel Computing Service. Slurm is a good HPC Platform/Scheduler choice for running a mixture of short and long-running tasks ranging from seconds to minutes. Further into the managed services realm, we also offer AWS Batch, which is a fully managed batch computing service. The scheduler is available at no cost – and customers are only required to pay for AWS resources used by their workloads. Considering task duration, AWS Batch is better suited for tasks that take longer than one minute.
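For the AWS Batch path, a submission for an overnight risk run might look like the request sketched below. All resource names and parameters here are hypothetical, and the job queue and job definition are assumed to already exist in the account; the code only builds the request dict, which would be passed as keyword arguments to `boto3.client("batch").submit_job(...)` when boto3 is installed and credentials are configured.

```python
def build_batch_submission(job_name, queue, job_definition, scenario_id):
    """Build a hypothetical AWS Batch submit_job request (sketch only)."""
    return {
        "jobName": job_name,
        "jobQueue": queue,                # assumed to already exist
        "jobDefinition": job_definition,  # registered container job definition
        "parameters": {"scenarioId": str(scenario_id)},
        # Array jobs let Batch fan one submission out into many tasks,
        # a common pattern for large overnight batch runs.
        "arrayProperties": {"size": 1000},
    }

request = build_batch_submission("eod-risk-run", "fsi-hpc-queue",
                                 "risk-engine:4", scenario_id=42)
```

Array jobs are one reason Batch suits these workloads: a single submission can expand into thousands of tasks that Batch schedules onto elastic compute.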
We have helped multiple customers who have decided to build their own custom cloud-native schedulers and platforms, using managed cloud services, or adopt a cloud-native project. A popular option is HTC-Grid, which was started as an open-source blueprint by AWS, but has since been donated to the Fintech Open Source Foundation (FINOS) and is now a community project. Such systems are also suitable for varying task durations of seconds to minutes, but are built with high throughput in mind, having the ability to process tens of thousands of tasks per second.
Finally, if the task duration is always under 15 minutes, some customers have decided to leverage serverless architectures and build entire HPC systems with AWS Lambda. This approach has the benefit of not having to manage any always-on service and can be designed to have high throughput, but the maximum task duration is a hard constraint.
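One practical consequence of Lambda's hard duration limit is that task batching must be sized against it up front. The sketch below, with a hypothetical payload shape, chunks tasks into invocation payloads and rejects batches that would exceed the 15-minute cap; with boto3, each payload would go to `boto3.client("lambda").invoke(...)`.

```python
import json

LAMBDA_MAX_SECONDS = 15 * 60  # hard service limit on task duration

def build_invocations(tasks, est_seconds_per_task, batch_size):
    """Chunk tasks into Lambda-sized payloads, respecting the 15-minute cap.

    Payload shape is hypothetical; each payload would be sent via
    boto3.client("lambda").invoke(FunctionName=..., Payload=...).
    """
    if est_seconds_per_task * batch_size > LAMBDA_MAX_SECONDS:
        raise ValueError("batch would exceed the Lambda duration limit")
    return [json.dumps({"tasks": tasks[i:i + batch_size]})
            for i in range(0, len(tasks), batch_size)]

payloads = build_invocations(list(range(100)),
                             est_seconds_per_task=5, batch_size=50)
```

Sizing batches this way keeps each invocation comfortably under the limit while still letting many invocations run concurrently for throughput.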
As mentioned before, another crucial aspect to consider is the operating system requirements for the workload and whether containers are required or used, as most of these solutions require containerization of the workloads—a fundamental change from conventional hosting practices.
Conclusion
We understand that no one-size-fits-all solution exists. Thus, we work with our customers on diving deeper into their requirements and helping them pick the right solution and service(s) for their use-case. However, through this blog, we have provided you with a simplified decision tree for picking the right paths and services to consider, depending on your decision to keep your existing HPC scheduler at one end of the spectrum or move to a fully cloud-native, workload-optimized platform at the other. The number of tasks you need to run over a specific time will also determine which solution will be the best fit for your workload running on AWS.
Lastly, regardless of the path chosen for the workload, it’s essential to remember that this is the start of the journey, not its end. We see customers continue their journey ‘down the stack’ from ‘lift & shift’ to ‘AWS Optimized’ and/or ‘AWS Native’, and we are happy to guide you through that journey.
Key takeaways
Selecting an appropriate HPC platform for financial workloads requires careful consideration of your current infrastructure and future goals. As discussed above, once you choose between maintaining your existing scheduler investment and transitioning to cloud-native architectures, the options become clearer.
For organizations that decide to keep their existing schedulers, the options are:
- Commercial solutions including IBM Spectrum Symphony and TIBCO GridServer, which provide enterprise-grade reliability but may lack flexibility and elasticity features and support for the latest cloud APIs.
- Proven open-source HPC schedulers such as Slurm and HTCondor are also frequent choices. AWS ParallelCluster and AWS Parallel Computing Service are good options for Slurm, giving you that familiar look and feel, but in a managed service environment.
Organizations ready for cloud-native adoption can select based on workload duration requirements:
- AWS Batch for workloads with tasks that take longer than one minute to complete.
- AWS Lambda for workloads that run for less than 15 minutes.
- HTC-Grid or custom container-based solutions for workloads ranging from seconds to minutes.
Importantly, these choices represent starting points rather than final destinations. Our customers frequently evolve their platforms through multiple stages of their cloud adoption journey – from initial cloud bursting through to comprehensive re-platforming and re-engineering, each stage offering opportunities for optimization.