AWS Big Data Blog

Category: HAQM EMR

Read and write Apache Iceberg tables using AWS Lake Formation hybrid access mode

In this post, we demonstrate how to use Lake Formation for read access while continuing to use AWS Identity and Access Management (IAM) policy-based permissions for write workloads that update the schema and upsert (insert and update combined) data records into the Iceberg tables.
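
To make the write path concrete, here is a minimal PySpark sketch of the kind of upsert the post describes, assuming an Iceberg table registered in the AWS Glue Data Catalog and a Spark session whose S3 access is governed by IAM policies; the catalog name, bucket, table, and columns are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# Minimal sketch: a Spark session wired to an Iceberg catalog backed by the
# AWS Glue Data Catalog. Catalog name, warehouse path, and table names are
# placeholders, not values from the post.
spark = (
    SparkSession.builder.appName("iceberg-upsert")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.warehouse", "s3://example-bucket/warehouse/")
    .getOrCreate()
)

# Incoming change records (stand-in data for illustration).
updates = spark.createDataFrame([(1001, "shipped")], ["order_id", "status"])
updates.createOrReplaceTempView("staging_updates")

# Upsert: update matching rows, insert new ones. Assumes the target table
# glue.sales_db.orders already exists with compatible columns.
spark.sql("""
    MERGE INTO glue.sales_db.orders AS t
    USING staging_updates AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET t.status = s.status
    WHEN NOT MATCHED THEN INSERT (order_id, status) VALUES (s.order_id, s.status)
""")
```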

Build a data lakehouse in a hybrid environment using HAQM EMR Serverless, Apache DolphinScheduler, and TiDB

This post discusses a decoupled approach to building a serverless data lakehouse using AWS Cloud-centered services, including HAQM EMR Serverless, HAQM Athena, HAQM Simple Storage Service (HAQM S3), and Apache DolphinScheduler (an open source data job scheduler), together with PingCAP TiDB, a third-party data warehouse product that can be deployed on premises, in the cloud, or as software as a service (SaaS).
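
As an illustration of how a scheduler task in such a design can hand work to EMR Serverless, the following boto3 sketch submits a Spark job; the application ID, role ARN, and S3 paths are placeholders, and a DolphinScheduler task would typically wrap a call like this.

```python
import boto3

# Minimal sketch: submit a Spark job to an existing EMR Serverless application.
# Application ID, role ARN, region, and S3 paths are placeholders.
emr = boto3.client("emr-serverless", region_name="us-east-1")

response = emr.start_job_run(
    applicationId="00example123",
    executionRoleArn="arn:aws:iam::111122223333:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://example-bucket/jobs/etl_job.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
)
print(response["jobRunId"])  # track the run in the EMR Studio console or via get_job_run
```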

Implement HAQM EMR HBase graceful scaling

Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. You can use HAQM EMR with HBase on top of HAQM Simple Storage Service (HAQM S3) for random, strictly consistent, real-time access to tables with Apache Kylin. This post demonstrates how to gracefully decommission target region servers programmatically.
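
As a rough illustration of that programmatic decommission step (not the post's exact script), the sketch below shells out to HBase's RegionMover utility to drain a region server before it is stopped; the hostname is a placeholder, and the flags follow the HBase 2.x utility, so verify them against your HBase version.

```python
import subprocess

# Minimal sketch: drain all regions off a target region server so it can be
# removed without interrupting reads and writes. Hostname is a placeholder;
# RegionMover options may differ across HBase versions.
target = "ip-10-0-1-23.ec2.internal"

subprocess.run(
    ["hbase", "org.apache.hadoop.hbase.util.RegionMover",
     "-r", target, "-o", "unload"],
    check=True,
)
# Once the server holds no regions, the instance can be decommissioned and
# the EMR cluster resized.
```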

Architect fault-tolerant applications with instance fleets on HAQM EMR on EC2

In this post, we show how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. We walk through assessing the historical compute usage of a workload and use a combination of strategies to reduce the likelihood of InsufficientCapacityExceptions (ICE) when HAQM EMR launches specific EC2 instance types. We implement flexible instance fleet strategies to reduce dependency on specific instance types and use HAQM EC2 On-Demand Capacity Reservations (ODCRs) for predictable, steady-state workloads. Following this approach can help prevent job failures due to capacity limits while optimizing your cluster for cost and performance.
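
For illustration, a minimal sketch of such an instance fleet definition follows, diversifying across instance types and preferring ODCRs for On-Demand capacity; the types, counts, and strategy values are placeholders to adapt, and the dict would be passed to the EMR RunJobFlow API (for example via boto3) inside Instances.InstanceFleets.

```python
# Minimal sketch of a CORE instance fleet that spreads capacity across several
# instance types and applies capacity reservations first for On-Demand nodes.
# All values are illustrative placeholders.
core_fleet = {
    "Name": "core-fleet",
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 4,
    "InstanceTypeConfigs": [
        # Multiple interchangeable types reduce the chance of an ICE error.
        {"InstanceType": "r5.2xlarge", "WeightedCapacity": 2},
        {"InstanceType": "r5a.2xlarge", "WeightedCapacity": 2},
        {"InstanceType": "r6g.2xlarge", "WeightedCapacity": 2},
    ],
    "LaunchSpecifications": {
        "OnDemandSpecification": {
            "AllocationStrategy": "lowest-price",
            "CapacityReservationOptions": {
                # Consume matching ODCRs before launching regular On-Demand.
                "UsageStrategy": "use-capacity-reservations-first"
            },
        }
    },
}
```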

Design patterns for implementing Hive Metastore for HAQM EMR on EKS

In this post, we explore design patterns for implementing the Hive Metastore (HMS) with EMR on EKS and the Spark Operator, each offering distinct advantages depending on your requirements. Whether you deploy HMS as a sidecar container within the Apache Spark driver pod, as a Kubernetes deployment in the data processing EKS cluster, or as an external HMS service in a separate EKS cluster, the key considerations revolve around communication efficiency, scalability, resource isolation, high availability, and security.
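
For the external-HMS variants, the Spark side typically only needs a metastore endpoint. Here is a minimal sketch, assuming an HMS exposed as a Kubernetes Service; the thrift URI is a placeholder.

```python
from pyspark.sql import SparkSession

# Minimal sketch: point a Spark application at an external Hive Metastore.
# The thrift endpoint is a placeholder for an HMS reachable as a Kubernetes
# Service, whether in the same EKS cluster or a separate one.
spark = (
    SparkSession.builder.appName("hms-example")
    .config("spark.hadoop.hive.metastore.uris",
            "thrift://hive-metastore.hms.svc.cluster.local:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Confirm the session can reach the metastore.
spark.sql("SHOW DATABASES").show()
```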

HAQM Web Services named a Leader in the 2024 Gartner Magic Quadrant for Data Integration Tools

HAQM Web Services (AWS) has been recognized as a Leader in the 2024 Gartner Magic Quadrant for Data Integration Tools. We were positioned in the Challengers Quadrant in 2023. This recognition, we feel, reflects our ongoing commitment to innovation and excellence in data integration, demonstrating our continued progress in providing comprehensive data management solutions.

Building and operating data pipelines at scale using CI/CD, HAQM MWAA, and Apache Spark on HAQM EMR by Wipro

This blog post discusses how a programmatic data processing framework developed by Wipro can help data engineers overcome obstacles and streamline their organization's ETL processes. The framework uses the performance-optimized HAQM EMR runtime for Apache Spark and integrates with AWS managed services.
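
As a generic illustration of the MWAA-to-EMR hand-off such a framework automates (not Wipro's actual code), a minimal Airflow DAG might submit a Spark step like this; the cluster ID, script path, and schedule are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

# Minimal sketch: an MWAA DAG that submits one Spark step to a running EMR
# cluster. Cluster ID, S3 script path, and schedule are placeholders.
with DAG(
    dag_id="spark_etl_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_spark_step = EmrAddStepsOperator(
        task_id="run_spark_step",
        job_flow_id="j-EXAMPLE12345",
        steps=[
            {
                "Name": "etl-step",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://example-bucket/jobs/etl_job.py"],
                },
            }
        ],
    )
```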

Enhance your workload resilience with new HAQM EMR instance fleet features

HAQM EMR has introduced new features for instance fleets that address critical challenges in big data operations. This post explores how these innovations improve cluster resilience, scalability, and efficiency, enabling you to build more robust data processing architectures on AWS.

Build a high-performance quant research platform with Apache Iceberg

In our previous post, Backtesting index rebalancing arbitrage with HAQM EMR and Apache Iceberg, we showed how to use Apache Iceberg in the context of strategy backtesting. In this post, we focus on data management implementation options such as accessing data directly in HAQM Simple Storage Service (HAQM S3), using popular data formats like Parquet, or using open table formats like Iceberg. Our experiments are based on real-world historical full order book data, provided by our partner CryptoStruct, and compare the trade-offs between these choices, focusing on performance, cost, and quant developer productivity.
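
To make the first two options concrete, here is a minimal PySpark sketch contrasting direct Parquet access with reading through an Iceberg table; the paths, catalog, and table names are illustrative, not the dataset from the post.

```python
from pyspark.sql import SparkSession

# Minimal sketch contrasting two access patterns discussed in the post.
spark = SparkSession.builder.appName("quant-data-access").getOrCreate()

# Option 1: direct Parquet access -- the reader must know the S3 layout and
# prune partitions itself. Path is a placeholder.
parquet_df = spark.read.parquet("s3://example-bucket/orderbook/date=2024-01-02/")

# Option 2: Iceberg table access -- table metadata handles partition pruning,
# schema evolution, and snapshot isolation. Assumes an Iceberg catalog named
# 'glue' is already configured on the session; names are placeholders.
iceberg_df = (
    spark.table("glue.market_data.order_book")
    .where("trade_date = '2024-01-02'")
)
```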