Posted On: Aug 20, 2021
We are introducing HAQM SageMaker Asynchronous Inference, a new inference option in HAQM SageMaker that queues incoming requests and processes them asynchronously. This option is ideal for requests with large payloads (up to 1 GB), long processing times (up to 15 minutes), or both, that need to be processed as they arrive. Asynchronous inference lets you save on costs by autoscaling the instance count to zero when there are no requests to process, so you pay only when your endpoint is actively processing requests.
With the introduction of asynchronous inference, HAQM SageMaker provides three options to deploy trained machine learning models for generating inferences on new data. Real-time inference is suitable for workloads where payload sizes are up to 6 MB and need to be processed with low latency requirements on the order of milliseconds or seconds. Batch transform is ideal for offline predictions on large batches of data that are available upfront. The new asynchronous inference option is ideal for workloads where the request sizes are large (up to 1 GB) and inference processing times are on the order of minutes (up to 15 minutes). Example workloads for asynchronous inference include running predictions on high-resolution images generated from a mobile device at different intervals during the day and providing responses within minutes of receiving the request. For use cases that can tolerate a cold start penalty of a few minutes, you can optionally scale down the endpoint instance count to zero when there are no outstanding requests and scale back up as new requests arrive, so that you pay only for the duration that the endpoints are actively processing requests.
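The scale-to-zero behavior described above is configured through Application Auto Scaling by registering the endpoint variant with a minimum capacity of zero and attaching a target-tracking policy on the per-instance request backlog. The sketch below shows one plausible way to do this with boto3; the endpoint name, variant name, and target values are hypothetical placeholders, and the actual AWS calls are kept in a separate function that requires valid credentials.

```python
# Hypothetical names -- substitute your own endpoint and variant.
ENDPOINT_NAME = "my-async-endpoint"
VARIANT_NAME = "variant-1"


def backlog_scaling_policy(max_instances=5, backlog_per_instance=5.0):
    """Build a target-tracking policy on the per-instance request backlog.

    With MinCapacity=0 the endpoint can scale in to zero instances when the
    queue is empty, at the cost of a cold start when new requests arrive.
    """
    resource_id = f"endpoint/{ENDPOINT_NAME}/variant/{VARIANT_NAME}"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": 0,  # allows scale-to-zero between requests
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": "backlog-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Aim for roughly this many queued requests per instance.
            "TargetValue": backlog_per_instance,
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [{"Name": "EndpointName", "Value": ENDPOINT_NAME}],
                "Statistic": "Average",
            },
        },
    }
    return target, policy


def apply_scaling():
    """Register the scalable target and attach the policy (needs AWS creds)."""
    import boto3

    aas = boto3.client("application-autoscaling")
    target, policy = backlog_scaling_policy()
    aas.register_scalable_target(**target)
    aas.put_scaling_policy(**policy)
```

Because the policy tracks the backlog per instance rather than CPU or invocation rate, the endpoint scales out as the queue grows and scales in to zero once the queue drains.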
Creating an asynchronous inference endpoint is similar to creating a real-time endpoint. You can use your existing HAQM SageMaker Models and only need to specify additional configuration parameters specific to asynchronous inference while creating your endpoint configuration. To invoke the endpoint, you place the request payload in HAQM S3 and provide a pointer to the payload as part of the invocation request. Upon invocation, HAQM SageMaker enqueues the request for processing and returns an output location as a response. Once processing completes, HAQM SageMaker places the inference response in the previously returned HAQM S3 location. You can optionally choose to receive success or error notifications via HAQM Simple Notification Service (SNS).
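The flow above can be sketched with boto3. The model name, bucket paths, and endpoint names below are hypothetical placeholders; the async-specific piece is the `AsyncInferenceConfig` block in the endpoint configuration, and the invocation passes an S3 pointer rather than the payload itself. The pure config-building function is separated from the functions that actually call AWS.

```python
# Hypothetical names -- substitute your own model, bucket, and endpoint.
MODEL_NAME = "my-sagemaker-model"
ENDPOINT_CONFIG_NAME = "my-async-endpoint-config"
ENDPOINT_NAME = "my-async-endpoint"


def async_endpoint_config(model_name, s3_output_path, sns_success_topic=None):
    """Build create_endpoint_config arguments, including the async block."""
    config = {
        "EndpointConfigName": ENDPOINT_CONFIG_NAME,
        "ProductionVariants": [{
            "VariantName": "variant-1",
            "ModelName": model_name,
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
        }],
        # The async-specific part: where SageMaker writes responses, and
        # how many requests each instance processes concurrently.
        "AsyncInferenceConfig": {
            "OutputConfig": {"S3OutputPath": s3_output_path},
            "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 4},
        },
    }
    if sns_success_topic:
        # Optional SNS notification on successful completion.
        config["AsyncInferenceConfig"]["OutputConfig"]["NotificationConfig"] = {
            "SuccessTopic": sns_success_topic,
        }
    return config


def create_and_invoke():
    """Create the endpoint, then invoke it with a payload staged in S3."""
    import boto3

    sm = boto3.client("sagemaker")
    smr = boto3.client("sagemaker-runtime")

    sm.create_endpoint_config(**async_endpoint_config(
        MODEL_NAME, "s3://my-bucket/async-outputs/"))
    sm.create_endpoint(
        EndpointName=ENDPOINT_NAME,
        EndpointConfigName=ENDPOINT_CONFIG_NAME,
    )

    # The payload lives in S3; the invocation carries only a pointer to it.
    response = smr.invoke_endpoint_async(
        EndpointName=ENDPOINT_NAME,
        InputLocation="s3://my-bucket/async-inputs/image-001.jpg",
        ContentType="application/x-image",
    )
    # The call returns immediately with the S3 location where the
    # inference result will appear once processing completes.
    return response["OutputLocation"]
```

Note that `invoke_endpoint_async` does not wait for the inference to finish; clients either poll the returned `OutputLocation` or subscribe to the optional SNS topic.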
For a detailed description of how to create, invoke, and monitor asynchronous inference endpoints, please read our documentation, which also contains a sample notebook to help you get started. For pricing information, please visit the HAQM SageMaker pricing page. HAQM SageMaker Asynchronous Inference is generally available in all commercial AWS Regions where HAQM SageMaker is available except Asia Pacific (Osaka), Europe (Milan), and Africa (Cape Town).