AWS Big Data Blog

Category: Amazon EMR on EKS

Design patterns for implementing Hive Metastore for Amazon EMR on EKS

In this post, we explore design patterns for implementing the Hive Metastore (HMS) for EMR on EKS with Spark Operator, each offering distinct advantages depending on your requirements. Whether you deploy HMS as a sidecar container within the Apache Spark driver pod, as a Kubernetes deployment in the data processing EKS cluster, or as an external HMS service in a separate EKS cluster, the key considerations revolve around communication efficiency, scalability, resource isolation, high availability, and security.

Build a high-performance quant research platform with Apache Iceberg

In our previous post, Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg, we showed how to use Apache Iceberg in the context of strategy backtesting. In this post, we focus on data management implementation options such as accessing data directly in Amazon Simple Storage Service (Amazon S3), using popular data formats like Parquet, or using open table formats like Iceberg. Our experiments are based on real-world historical full order book data, provided by our partner CryptoStruct, and compare the trade-offs between these choices, focusing on performance, cost, and quant developer productivity.

Use Batch Processing Gateway to automate job management in multi-cluster Amazon EMR on EKS environments

AWS customers often process petabytes of data using Amazon EMR on EKS. In enterprise environments with diverse workloads or varying operational requirements, customers frequently choose a multi-cluster setup due to the following advantages: Better resiliency and no single point of failure – If one cluster fails, other clusters can continue processing critical workloads, maintaining business […]

Introducing Amazon EMR on EKS with Apache Flink: A scalable, reliable, and efficient data processing platform

AWS recently announced that Apache Flink is generally available for Amazon EMR on Amazon Elastic Kubernetes Service (EKS). Apache Flink is a scalable, reliable, and efficient data processing framework that handles real-time streaming and batch workloads (but is most commonly used for real-time streaming). Amazon EMR on EKS is a deployment option for Amazon EMR […]

Dive deep into security management: The Data on EKS Platform

Building big data applications on open source software has become increasingly straightforward since the advent of projects like Data on EKS, an open source project from AWS that provides blueprints for building data and machine learning (ML) applications on Amazon Elastic Kubernetes Service (Amazon EKS). In the realm of big data, securing […]

Run Apache Hive workloads using Spark SQL with Amazon EMR on EKS

Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. Using Spark SQL to run Hive workloads provides not only the simplicity of SQL-like queries but also taps into the exceptional speed and performance provided by Spark. Spark SQL is an Apache Spark module for structured data processing. One […]

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

Backtesting is a process used in quantitative finance to evaluate trading strategies using historical data. This helps traders determine the potential profitability of a strategy and identify any risks associated with it, enabling them to optimize it for better performance. Index rebalancing arbitrage takes advantage of short-term price discrepancies resulting from ETF managers’ efforts to […]


Cost monitoring for Amazon EMR on Amazon EKS

Amazon EMR is the industry-leading cloud big data solution, providing a collection of open-source frameworks such as Spark, Hive, Hudi, and Presto, fully managed and with per-second billing. Amazon EMR on Amazon EKS is a deployment option allowing you to deploy Amazon EMR on the same Amazon Elastic Kubernetes Service (Amazon EKS) clusters that is […]

Introducing Amazon EMR on EKS job submission with Spark Operator and spark-submit

Amazon EMR on EKS provides a deployment option for Amazon EMR that allows organizations to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, Spark applications run on the Amazon EMR runtime for Apache Spark. This performance-optimized runtime offered by Amazon EMR makes your Spark jobs run fast […]

Improve reliability and reduce costs of your Apache Spark workloads with vertical autoscaling on Amazon EMR on EKS

Amazon EMR on Amazon EKS is a deployment option offered by Amazon EMR that enables you to run Apache Spark applications on Amazon Elastic Kubernetes Service (Amazon EKS) in a cost-effective manner. It uses the EMR runtime for Apache Spark to increase performance so that your jobs run faster and cost less. Apache Spark allows […]