AWS Storage Blog
Prime Video improved stream analytics performance with Amazon S3 Express One Zone
Amazon Prime Video provides a selection of original content and licensed movies and TV shows that you can stream or download as part of the Amazon Prime subscription. Prime Video’s telemetry platform serves as the backbone for monitoring playback performance, saving data snapshots for failure recovery, providing business analytics, and generating real-time insights across its global streaming service. Every minute, millions of devices stream content on Prime Video, which internally generates vast amounts of metrics that are processed in real time to improve customer experience. This telemetry platform continuously tracks streaming metrics including video quality, bandwidth usage, and session duration from devices worldwide, allowing for real-time monitoring and analysis of system health and performance. A critical mechanism within this platform is checkpointing, a process for saving snapshots of all active streaming sessions every 70 seconds, to maintain fault tolerance and enable quick recovery from potential failures. During high-traffic events, like NFL and NBA games, these checkpoints create an intense burst of write operations to the underlying storage services.
To maintain a consistent data flow across the telemetry platform, Prime Video optimized checkpoint operations by using Amazon S3 Express One Zone, a high-performance storage class designed to handle sudden request spikes that occur during checkpointing. S3 Express One Zone can support up to 2 million reads and up to 200,000 writes per second, managing checkpoints effectively while maintaining HTTP 503 slowdown error rates below 0.1%, even during high-traffic events.
In this post, we explore how Prime Video uses S3 Express One Zone to improve its telemetry platform’s performance during high-traffic events. We examine the telemetry platform’s initial architecture, the challenges faced during high-traffic events, and how S3 Express One Zone helps maintain continuous visibility into streaming performance by handling checkpoint operations.
Prime Video telemetry platform’s initial architecture
Figure 1: Initial telemetry platform architecture
Prime Video’s telemetry platform consists of three main data pipelines that work together to process streaming data. First, the telemetry event ingestion pipeline collects and validates playback data from customer devices globally, enriching it with metadata before forwarding it for processing. Second, the telemetry event processing pipeline transforms these raw playback events into meaningful sessionized data. The StreamProcessor application, built on Apache Flink, maintains session states in memory and saves checkpoints to Amazon S3 Standard every 70 seconds. Third, the telemetry event publishing pipeline delivers the processed data to various downstream services like Amazon S3, Amazon Kinesis, and Amazon SQS, enabling business intelligence and monitoring capabilities. The telemetry platform team wanted to optimize checkpointing performance of the event processing pipeline to maintain continuous monitoring capabilities, particularly during high-traffic events.
The challenge: Managing burst uploads during checkpointing
The StreamProcessor application, part of the event processing pipeline, saves session-state checkpoints of active streaming sessions every 70 seconds. Built on Apache Flink, the application runs on multiple clusters, each processing a portion of the streaming data. These checkpoints allow each cluster to recover and resume processing from its last saved state if failures occur. While checkpoints are crucial for fault tolerance, checkpoint failures can trigger cluster restarts, causing brief delays in delivering data to the event publishing pipeline. Although these delays don’t directly affect Prime Video customers’ streaming experience, they can impact operators’ ability to detect and diagnose customer-facing performance issues, so the telemetry platform team wanted to optimize checkpoint performance to maintain consistent real-time monitoring capabilities.
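To make the recover-and-resume idea concrete, here is a minimal sketch in plain Python. It is not Apache Flink and not Prime Video's implementation: the `SessionProcessor` class, its fields, and the sample events are all hypothetical, and the snapshot is kept in memory where a real system would write it to durable storage.

```python
import copy


class SessionProcessor:
    """Toy stand-in for one stream-processing cluster (hypothetical, not Flink)."""

    def __init__(self):
        self.sessions = {}       # session_id -> number of events seen so far
        self.offset = 0          # position reached in the input stream
        self._checkpoint = None  # last saved (offset, sessions) snapshot

    def process(self, event):
        session_id = event["session_id"]
        self.sessions[session_id] = self.sessions.get(session_id, 0) + 1
        self.offset += 1

    def checkpoint(self):
        # A real system would write this snapshot to durable storage.
        self._checkpoint = (self.offset, copy.deepcopy(self.sessions))

    def restore(self):
        # After a failure, roll back to the last snapshot and report
        # the stream position from which processing should resume.
        if self._checkpoint is None:
            self.offset, self.sessions = 0, {}
        else:
            saved_offset, saved_sessions = self._checkpoint
            self.offset = saved_offset
            self.sessions = copy.deepcopy(saved_sessions)
        return self.offset


events = [{"session_id": s} for s in ["a", "b", "a", "c", "a"]]

proc = SessionProcessor()
for event in events[:3]:
    proc.process(event)
proc.checkpoint()              # snapshot taken after three events
proc.process(events[3])        # processed, but not yet checkpointed

resume_from = proc.restore()   # simulated failure and restart
for event in events[resume_from:]:
    proc.process(event)        # replay everything after the snapshot
```

The event processed after the snapshot is lost on restart and has to be replayed, which is why a failed checkpoint followed by a restart shows up as a delay in downstream data rather than as data loss.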
During high-traffic events where viewership increases by millions of concurrent users, checkpointing places significant performance demands on the underlying storage service. With increased viewership, the volume of streaming data increases significantly, leading to larger checkpoint volumes. Each checkpoint writes petabytes of data, with transaction rates increasing by millions per minute, and all these writes must happen concurrently every 70 seconds.
Initially, the StreamProcessor application saved checkpoints to an Amazon S3 general purpose bucket with objects stored in the S3 Standard storage class. S3 general purpose buckets provide excellent performance during normal operations, offering at least 3,500 PUT requests per second per partitioned S3 prefix. While customers can parallelize requests across multiple prefixes to scale performance, this scaling happens gradually and is not instantaneous. While S3 is scaling to support the higher request rate, customers may receive HTTP 503 slowdown errors. For Prime Video’s StreamProcessor application, error rates are typically minimal (less than 0.1%) during day-to-day operations, but increase during peak events with much higher write rates. While these slowdown errors don’t always cause checkpoint failures, they increase their likelihood. When failures occur, affected clusters need to restart, with recovery taking a few minutes.
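As a rough illustration of the per-prefix arithmetic, the sketch below estimates how many fully partitioned prefixes a burst would need at 3,500 PUT requests per second each. The function name and the 100,000 writes-per-second burst are hypothetical values chosen for the example, not Prime Video's figures.

```python
PUT_PER_SECOND_PER_PREFIX = 3_500  # per-prefix figure quoted in the text


def prefixes_needed(target_writes_per_second,
                    per_prefix_limit=PUT_PER_SECOND_PER_PREFIX):
    """Smallest number of partitioned prefixes that covers the target rate."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-target_writes_per_second // per_prefix_limit)


hypothetical_burst = 100_000  # writes per second, assumed for illustration
needed = prefixes_needed(hypothetical_burst)
print(f"{hypothetical_burst} writes/s needs about {needed} partitioned prefixes")
```

The estimate only holds once S3 has finished partitioning those prefixes. Because that scaling is gradual, a burst that arrives before it completes can still see 503 slowdown responses.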
The telemetry platform utilizes a distributed architecture, spreading workloads across multiple clusters. This design limits the impact of individual cluster failures, as only the affected cluster’s data experiences delay during restarts. However, given the platform’s high standards for data availability and latency, particularly during high-traffic events, the team needed a storage service that could handle these burst write patterns at high transactions per second (TPS) with minimal slowdown errors.
Enter Amazon S3 Express One Zone
For this stream analytics workload, the Prime Video team identified Amazon S3 Express One Zone as a solution to handle checkpointing write operations during high-traffic events. S3 Express One Zone is a high-performance, single-Availability Zone storage class purpose-built to deliver consistent single-digit millisecond data access for your most frequently accessed data. With S3 Express One Zone, data is stored in a different bucket type: an S3 directory bucket. The directories that are created when objects are uploaded to directory buckets have no per-prefix TPS limits. Instead, each directory bucket can support up to 2 million reads and up to 200,000 writes per second. This flexibility allows applications to parallelize read and write requests within and across directories as needed, making it suitable for workloads with sudden spikes in request rates.
Figure 2: Revised telemetry platform architecture with S3 Express One Zone
The Results
The team migrated from S3 Standard to S3 Express One Zone, accessing the storage class through existing S3 APIs. The impact was immediate and significant: error rates remained consistently below 0.1%, and these minimal 503 errors were easily handled by the application’s retry mechanism. During high-traffic events like NFL games, S3 Express One Zone handles the intense bursts of write operations without triggering application restarts. This enables Prime Video operators to access monitoring data with low latency, allowing them to quickly detect and resolve customer-impacting performance issues.
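The post does not describe how the application's retry mechanism works. A common general pattern for absorbing occasional slowdown responses is retrying with capped exponential backoff and jitter, sketched below. `SlowDownError`, `put_with_retry`, and `flaky_put` are hypothetical names, and the fake writer stands in for a real S3 call so the example runs without AWS access.

```python
import random
import time


class SlowDownError(Exception):
    """Hypothetical stand-in for an HTTP 503 slowdown response."""


def put_with_retry(put, max_attempts=5, base_delay=0.05, max_delay=2.0,
                   sleep=time.sleep):
    """Call put(), retrying on SlowDownError with capped, jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return put()
        except SlowDownError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


# Fake writer that is throttled twice before succeeding.
calls = {"count": 0}


def flaky_put():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise SlowDownError("503 Slow Down")
    return "stored"


# Sleeping is stubbed out so the example finishes instantly.
result = put_with_retry(flaky_put, sleep=lambda seconds: None)
```

With an error rate below 0.1%, most writes succeed on the first attempt, so retries like these add little load and are unlikely to push a checkpoint past its deadline.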
Conclusion
In this post, we looked at how Prime Video successfully leveraged S3 Express One Zone to continuously monitor millions of streaming sessions without any application restarts or delays. They turned to S3 Express One Zone to optimize checkpointing by maintaining slowdown error rates below 0.1%, even during high-traffic events.
Prime Video’s experience demonstrates how Amazon S3 Express One Zone can improve performance for applications that require frequent checkpointing operations that burst into hundreds of thousands of TPS. While this example focuses on streaming analytics, S3 Express One Zone is also a good fit for similar patterns in other applications, such as machine learning training jobs that checkpoint to save model progress, AI inference engines preserving computational states, and distributed computing systems maintaining cluster-wide snapshots.
If you are interested in learning more about S3 Express One Zone, visit the S3 User Guide. If you have any questions or comments, leave them in the comments section.