AWS HPC Blog

Optimizing HPC workflows with automatically scaling clusters in Ansys Gateway powered by AWS

This post was contributed by Dnyanesh Digraskar, Principal HPC Partner Solutions Architect at AWS, and Trent Andrus, Product Specialist at Ansys.

Ansys Gateway powered by AWS (referred to as “Ansys Gateway” for the rest of this post) is a cloud engineering solution hosted on AWS Marketplace that gives customers a seamless interface for running Ansys simulations in their HAQM Web Services (AWS) accounts. Customers can quickly deploy pre-validated, pre-tuned Ansys applications on recommended AWS high performance computing (HPC) infrastructure. Groupama FDJ and Turntide Technologies use Ansys Gateway to accelerate their engineering design and simulation workloads: compared to on-premises infrastructure, Groupama FDJ achieved 17x faster simulations for designing race bicycles, and Turntide Technologies accelerated their electric motor design simulations by 7x.

After the initial release of Ansys Gateway, user feedback pointed to a need for more control over the HPC cluster creation process. To address this, Ansys Gateway now includes automatically scaling clusters, starting with its 2024 R2 release. These clusters dynamically provision compute resources based on Slurm job queues. Along with the upgraded cluster creation process, Ansys Gateway now supports AWS storage services such as HAQM FSx for Lustre, HAQM FSx for OpenZFS, and HAQM Elastic File System (EFS), which are commonly used to support a variety of HPC workflows. With its auto-sizing capability, EFS is a great starting point for most customers. For large simulations with heavy file I/O, high performance shared file systems like FSx for Lustre and FSx for OpenZFS help avoid I/O bottlenecks. And in compute environments with a mixture of Windows- and Linux-based resources, OpenZFS is accessible from both platforms.

Scaling HPC resources for engineering simulations is a complex challenge. Scalability refers to the ability of HPC infrastructure to deliver computational resources in proportion to workload demands. Tightly coupled simulations such as Ansys Fluent fluid simulation software, Ansys LS-DYNA nonlinear dynamics structural simulation software, or Ansys Mechanical structural finite element analysis software can typically scale across a large number of cores and achieve nearly linear performance gains (depending on the benchmark). With auto-scaling clusters, users can manage resources efficiently based on workload demands and remove hardware bottlenecks during peak demand.

In this blog post, we describe the architecture, workflow, and HAQM EC2 recommendations for running Ansys applications in Ansys Gateway.

Architectural components

Ansys Gateway architecture components, including the Control Plane and the Application Plane, are described in detail in our blog post published in 2023. From this release onwards, Ansys Gateway integrates with AWS ParallelCluster, an open-source cluster management tool. The Slurm workload manager serves as the job scheduler, automatically provisioning compute nodes when jobs are queued and deallocating them when they are no longer needed. An architecture diagram of an AWS ParallelCluster deployment in an Ansys Gateway customer’s HAQM Virtual Private Cloud (VPC) is shown in Figure 1.

Figure 1: Architecture of Ansys Gateway’s HPC deployment based on AWS ParallelCluster. This represents the application plane that is deployed in customers’ AWS accounts.

The key components of the HPC cluster are:

  • Head Node – Manages job workflow and cluster state.
  • Slurm controller – Job scheduler based on the Slurm workload manager.
  • Dynamic Compute Nodes – Provisioned and terminated based on workload demands; for example, HAQM EC2 instance types such as C6i, Hpc6a, and Hpc7a for CPU-based workloads, and P5 and G6e for GPU acceleration.
  • Storage – HAQM EFS or HAQM Elastic Block Store (EBS) for persistent job data and application installations, which can be exported as a Network File System (NFS) mount, or HAQM FSx for Lustre and HAQM FSx for OpenZFS for high performance workloads.
  • Cluster queues – Job queues supporting multiple instance types for various workload types, such as: a compute-optimized instance queue for running CFD and crash simulations; a memory-optimized instance queue for running FEA or NVH simulations; and a GPU-optimized instance queue for running graphics-oriented simulations. We recommend running an application on homogeneous instance types, with all compute nodes identical in design, to achieve optimal performance.
  • Advanced networking – HAQM Elastic Fabric Adapter (EFA) is an advanced network interface for HAQM EC2 instances running applications that require high inter-node communication at scale.

AWS ParallelCluster gives Ansys Gateway considerable flexibility in defining compute resources and job queues. A queue can contain multiple compute resources, and each compute resource can contain multiple instance types. Consider a queue with a single compute resource that has multiple HAQM EC2 instance types defined (r6i.32xlarge, r6in.32xlarge, r6id.32xlarge). By default, the cluster tries to allocate the least expensive instance type (r6i.32xlarge) first. If there is insufficient capacity for that instance type, the cluster tries to allocate the next least expensive one. This is an effective way to automatically mitigate temporary capacity issues.
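For readers curious about what this looks like under the hood, the fragment below is a minimal sketch of an AWS ParallelCluster v3 queue definition with one compute resource listing several instance types. It is not a file generated by Ansys Gateway; the queue name, node counts, and subnet ID are placeholders.

```python
# Minimal sketch of an AWS ParallelCluster v3 queue with one compute resource that
# lists several instance types. Not generated by Ansys Gateway; names, counts, and
# the subnet ID are placeholders. Requires PyYAML (pip install pyyaml).
import yaml

scheduling_fragment = {
    "Scheduling": {
        "Scheduler": "slurm",
        "SlurmQueues": [
            {
                "Name": "mem-opt",
                "ComputeResources": [
                    {
                        "Name": "r6i-family",
                        # ParallelCluster tries the least expensive instance type first
                        # and falls back to the next one if capacity is unavailable.
                        "Instances": [
                            {"InstanceType": "r6i.32xlarge"},
                            {"InstanceType": "r6in.32xlarge"},
                            {"InstanceType": "r6id.32xlarge"},
                        ],
                        "MinCount": 0,   # no static nodes
                        "MaxCount": 10,  # ceiling for dynamically provisioned nodes
                    }
                ],
                "Networking": {"SubnetIds": ["subnet-0123456789abcdef0"]},
            }
        ],
    }
}

print(yaml.safe_dump(scheduling_fragment, sort_keys=False))
```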

With dynamically scaling clusters, Ansys Gateway users can adopt a “job”-based simulation workflow rather than the “cluster-per-simulation” approach required with static clusters. With a static cluster size, resources may sit underutilized during off-peak times, resulting in unnecessary charges. Conversely, a static cluster may not be suitable for every workload type, requiring you to either manually resize it or create a new cluster with different EC2 instance types, which leads to downtime and can drive up costs unexpectedly.

Job submission workflow

In this section, we describe a step-by-step workflow within the Ansys Gateway interface for creating dynamically scaling HPC clusters to run Ansys simulations. It is assumed that you are already logged in to Ansys Gateway and have access to a workspace to submit simulation jobs.

Step 1: Create an HPC Cluster

Dynamically scaling cluster creation follows these steps:

  1. Select the type of storage (HAQM EFS, HAQM FSx for OpenZFS, or HAQM FSx for Lustre)
  2. Select the Ansys application packages to install
  3. Define cluster compute queues and select resources

As with creating any resource, start by selecting a tenant (that is, a subscribed workspace) in Ansys Gateway, as in Figure 2a. Then create or select a project space that does not already have a dynamically scaling cluster. The project space must have an indicator in the Releases column that says “24R2+”, as shown in Figure 2b.

Figure 2a: Ansys Gateway landing page. The user can select a tenant to access their project spaces. Two tenant options are shown, each with a tenant name and ID.

Figure 2b: Ansys Gateway project spaces landing page. A project space titled “Gateway Autoscaling Cluster Demonstration” is shown.

In this project space, create a new resource and select “Autoscaling Cluster” from the dropdown list, shown in Figure 3.

Figure 3: An empty project space with an open dropdown list prompting the user to create either a new virtual desktop or clusters.

After selecting “Autoscaling Cluster” from the dropdown list, the creation wizard walks through several key features of the cluster: first storage, then the applications to install, and finally the compute resource queues. By default, an HAQM Elastic File System (EFS) drive is mounted to the HPC cluster and made available to resources in the project space. In Figure 4, an optional second storage location is defined for the product installation path.

Figure 4: The cluster creation wizard options for storage. By default, an EFS file system is created and mounted to the cluster. A secondary HAQM FSx for OpenZFS storage is also defined with 256 GiB of capacity and a throughput of 2048 MiB/s/TiB. The storage name and mount path are also provided by the user.

Once the storage options are defined, the desired application packages can be selected for installation. In Figure 5, Ansys Structures is selected.

Figure 5: The cluster creation wizard step for selecting which simulation applications to install.

During the cluster creation process, the wizard can automatically deploy a server running Ansys HPC Platform Services (HPS), shown in Figure 6. HPS significantly simplifies job submission: users can upload inputs, submit, monitor, and download results directly from their workstation without manually transferring files to a cloud resource or writing a job submission script. Note that HPS is delivered as containers and uses Docker for deployment. For more information on HPS setup and usage, see the Ansys HPC Platform Services documentation.

The HPS installation location must coincide with the location of the installed Ansys products. In this demonstration, both HPS and the products are installed on an HAQM FSx for OpenZFS storage mount.

Figure 6: Cluster creation wizard step for Ansys HPC Platform Services (HPS). By default, a Linux-based VM will be created for the user and the HPS services will automatically be deployed.

Next, the wizard helps the user create job submission queues. The user can specify both static and dynamic node counts, subject to their AWS service quota limits. Up to ten queues can be defined for a cluster.

Figure 7: After pressing the “Add queue” button, the user selects the application for this queue, a queue name, the number of static (always available) nodes, the maximum number of dynamic nodes, any advanced options (such as EFA, placement group, etc.), and finally the desired instance type(s). This process is repeated for each queue.

The application associated with a queue is selected from a dropdown list. It is often helpful to use a queue name that indicates which application and version the queue is associated with, such as “mech242”.
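As a rough illustration (an assumption about how the wizard’s settings could map onto the underlying AWS ParallelCluster queue definition, not a file taken from Ansys Gateway), a “mech242” queue with no static nodes and up to ten dynamic nodes might look like this:

```python
# Hedged sketch of a queue named after its application and version ("mech242").
# In ParallelCluster terms, MinCount corresponds to static (always available)
# nodes and MaxCount to the total node ceiling; the instance type and counts
# are examples only.
mech242_queue = {
    "Name": "mech242",
    "ComputeResources": [
        {
            "Name": "hpc6id",
            "Instances": [{"InstanceType": "hpc6id.32xlarge"}],
            "MinCount": 0,    # static nodes
            "MaxCount": 10,   # static + dynamic nodes
        }
    ],
}
print(mech242_queue)
```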

After the desired queues are defined, the cluster can be named and created. Once cluster creation finishes, you will see a “Running” badge indicating that all resources and services are deployed. The compute resources remain offline until a job is submitted. From the project space, clicking on the cluster brings up an overview page showing the head node, the HPS node, and details about queues and allocated compute nodes. See Figures 8a and 8b for the details provided after a cluster is created.

Figure 8a: Cluster creation overview page showing the head node and HPS node details.

Figure 8b: Cluster overview page showing: the available queues by name (one titled mech242), the number of allocated nodes (0 of 10), and the associated application (Ansys Structures 2024 R2); a list of applications and their installation location (Ansys Structures on HAQM FSx for OpenZFS); a list of mounted storage locations (Default EFS, OpenZFS) including their name, type, and mount path.

Step 2: Submitting a Job to the Cluster

With the cluster created, you can take advantage of the simplified job submission workflow provided by HPS. Using Ansys Mechanical as an example, connecting to the HPS server is as simple as entering its address in the format http://example.com:port/hps in the Solve Process Settings window shown in Figure 9.

Figure 9: Ansys Mechanical Solve Process Settings window.
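As a quick sanity check before configuring the solve process settings, you can verify that the HPS endpoint is reachable from your workstation. This is a minimal sketch; the address and port are placeholders for the values shown on your cluster page in Ansys Gateway.

```python
# Minimal reachability check for an HPS endpoint, assuming the URL format
# http://<hps-address>:<port>/hps described above. The address and port are
# placeholders; use the value shown for your cluster in Ansys Gateway.
import urllib.request

HPS_URL = "http://203.0.113.10:8443/hps"  # placeholder address and port

try:
    with urllib.request.urlopen(HPS_URL, timeout=10) as response:
        print(f"HPS endpoint responded with HTTP status {response.status}")
except Exception as exc:
    print(f"Could not reach HPS endpoint: {exc}")
```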

For users comfortable with the Linux command line, job submission is still possible with standard Slurm commands like srun, sbatch, and salloc. Refer to the Ansys Help documentation for more information on job submission using Slurm.
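As an illustration of command-line submission, the sketch below writes a small Slurm batch script and submits it with sbatch, driven from Python. The queue name (mech242), node counts, and solver launch line are placeholders; this is not the submission script that ships with the Ansys application packages.

```python
# Hedged sketch: submitting a batch job with sbatch, driven from Python for
# illustration. The queue name (mech242), node counts, and solver command are
# placeholders; they are not the submission script provided by Ansys Gateway.
import subprocess
import textwrap
from pathlib import Path

batch_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=mech-demo
    #SBATCH --partition=mech242
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=64
    # The partition is the queue defined in the cluster wizard; the nodes are
    # provisioned dynamically once the job leaves the pending state.
    srun /shared/apps/ansys/run_solver.sh   # placeholder solver launch command
""")

script_path = Path("submit_mech.sbatch")
script_path.write_text(batch_script)

# On success, sbatch prints a line like "Submitted batch job 42".
result = subprocess.run(["sbatch", str(script_path)], capture_output=True, text=True)
print(result.stdout or result.stderr)
```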

Step 3: Monitoring node scaling in action

Once a job is submitted, users can access the HPS job monitoring interface (as shown in Figure 10) from the Ansys Gateway cluster page.

Figure 10: HPS job monitoring UI. Users can access this UI from their web browser to see the job’s status (Pending, Running, Evaluated), watch log files, and download individual files.

Using Slurm commands for cluster operations

Users who prefer monitoring with Slurm commands can still do so. The job submitted via HPS is visible with the “squeue” command in Figure 11 (after connecting to a Linux VDI in the workspace). The requested node is in the configuring (CF) state, meaning that it is scaling up.

Figure 11: Querying Slurm with “squeue” on the Linux command line. The command returns a list of jobs and details like the job state and assigned resources.

Likewise, we can see the number of allocated nodes with the “sinfo -s” command in Figure 12. The A/I/O/T column lists the number of nodes that are Allocated/Idle/Other/Total.

Figure 12: Querying Slurm with the “sinfo” command to view a list of available queues. One queue is returned with details about the name and number of nodes in use.
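The same checks can be scripted. The sketch below runs squeue and sinfo with standard Slurm format specifiers and assumes the Slurm client tools are on PATH on the head node or a Linux VDI attached to the cluster.

```python
# Hedged sketch of monitoring the cluster from a Linux VDI or the head node,
# assuming the Slurm client tools (squeue, sinfo) are available on PATH.
import subprocess

def run(cmd):
    """Run a command and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Job ID, name, long-form state (e.g. CONFIGURING while nodes scale up), and node list.
print(run(["squeue", "--noheader", "--format", "%i %j %T %N"]))

# Per-queue node counts in allocated/idle/other/total (A/I/O/T) form.
for line in run(["sinfo", "-s", "--noheader", "--format", "%P %F"]).splitlines():
    partition, counts = line.split()
    allocated, idle, other, total = (int(n) for n in counts.split("/"))
    print(f"{partition}: {allocated} allocated, {idle} idle, {other} other, {total} total")
```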

Users can also submit jobs using Slurm commands like “srun,” “sbatch,” and “salloc,” as in Figure 13. The job submission script is included in the application installation package on Ansys Gateway.

Figure 13: Submitting a job to Slurm directly via the “salloc” command and monitoring it with “squeue”.

Step 4: Retrieving Results & Cluster Shutdown

If a job was submitted using HPS, result files can be downloaded directly to your workstation from the cluster’s shared storage. If you submit a job using Slurm commands and use an instance’s temporary storage as scratch space, be sure to copy the result files back to the shared storage when the job completes. By default, dynamic nodes stay online for 10 minutes after they finish running a job; after that period of inactivity, the compute nodes automatically shut down.
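If you do rely on node-local scratch, a minimal copy-back step (with placeholder paths; the shared mount path comes from your cluster’s storage settings) could look like this:

```python
# Hedged sketch: copy results from node-local scratch back to the shared file
# system at the end of a job. Paths are placeholders; dynamic nodes shut down
# after the idle timeout, and anything left on instance-local storage is lost.
import shutil
from pathlib import Path

scratch_dir = Path("/scratch/job_42")        # placeholder node-local scratch path
shared_dir = Path("/shared/results/job_42")  # placeholder path on the EFS/FSx mount

shared_dir.parent.mkdir(parents=True, exist_ok=True)
shutil.copytree(scratch_dir, shared_dir, dirs_exist_ok=True)
print(f"Copied results from {scratch_dir} to {shared_dir}")
```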

HAQM EC2 instance type recommendations for various Ansys applications

Ansys Gateway powered by AWS supports the advanced cluster creation process for the following Ansys applications:

Detailed workflows for setting up each of these applications are available in the Ansys Help documentation under Recommended Configurations by Application in the Recommended Usage Guide. Now that you have learned about the advanced HPC cluster creation process in Ansys Gateway, refer to Table 1 for general recommendations on commonly used HAQM EC2 instances for running various Ansys applications with Ansys Gateway. Refer to the Ansys Help page for the detailed list of recommended instance types.

 

| Specifications | HPC6id | HPC7a / HPC6a | C6i* | P5+ | P4d+ | G6e | G5 |
|---|---|---|---|---|---|---|---|
| Processor | Intel Ice Lake | AMD EPYC | Intel Ice Lake | NVIDIA H100 | NVIDIA A100 | NVIDIA L40S | NVIDIA A10G |
| Instance Size^ | 32xlarge | 96xlarge / 48xlarge | 32xlarge | 48xlarge | 24xlarge | 48xlarge | 48xlarge |
| Physical Cores | 64 | 192 / 96 | 64 | 96 | 48 | 96 | 96 |
| RAM per node (GiB) | 1024 | 768 / 384 | 256 | 2048 | 1152 | 1536 | 768 |
| Memory per core (GiB) | 16 | 4–32 (HPC7a) / 4 (HPC6a) | 4 | 24 | 24 | 16 | 8 |
| EFA Network bandwidth (Gbps) | 200 | 300 / 100 | 50 | 3200 | 400 | 400 | 100 |
| Number of GPUs | – | – | – | 8 | 8 | 8 | 8 |
| RAM per GPU (GB) | – | – | – | 640 HBM3 | 40 HBM2 | 48 | 24 |
| Target Ansys applications^ | Electronics Desktop, Fluids, LS-DYNA, Lumerical, Structures | Fluids, LS-DYNA, Structures | Electronics Desktop, Fluids, LS-DYNA, Lumerical, Pathfinder, Speos, Structures, Totem-SC | Fluids | Fluids | Fluids, Structures, HFSS | Fluids, Discovery |
| Physics description | Implicit, Explicit, CFD codes | Explicit, CFD codes | Implicit, Explicit, CFD, Optics codes | CFD codes | Implicit, Explicit, CFD codes | Implicit, Explicit, CFD codes | Interactive modeling and simulations |

* Enable Elastic Fabric Adapter (EFA) for high speed inter-node communication. Disable Simultaneous Multithreading (SMT) for consistent CPU performance.
^ HPC applications will usually benefit from using the complete instance at the maximum instance size, due to the availability of features such as EFA.
+ Does not run out-of-the-box on Ansys Gateway and requires users to configure additional NVIDIA packages.

Table 1: HAQM EC2 instance type recommendations for various Ansys applications.

Conclusion

Ansys Gateway powered by AWS now integrates with AWS ParallelCluster to enable users to deploy on-demand HPC clusters for running Ansys simulations on AWS. This allows engineers to run large-scale simulations efficiently while optimizing cloud costs. By dynamically adjusting resources based on simulation workload requirements, Ansys Gateway minimizes idle compute time and ensures optimal scalability for HPC workloads.

To get started with Ansys Gateway, visit the Ansys Gateway offering in AWS Marketplace to deploy a cloud-based HPC environment in just a few clicks. Try it today and experience the benefits of scalable, high performance engineering simulations on AWS.

Dnyanesh Digraskar

Dnyanesh Digraskar is a Principal HPC Partner Solutions Architect at AWS. He leads the HPC implementation strategy with AWS ISV partners to help them build scalable, well-architected solutions. He has more than fifteen years of expertise in the areas of CFD, CAE, numerical simulations, and HPC. Dnyanesh holds a Master’s degree in Mechanical Engineering from the University of Massachusetts, Amherst.

Trent Andrus

Trent is a Product Specialist at Ansys focused on HPC and Cloud topics. Passionate about staying current on advances in hardware and software performance, he aims to translate his knowledge into making HPC more accessible for all users. Away from the computer, Trent is an avid cyclist and is often training to make himself perform better, too.