Amazon EMR | AWS Big Data Blog

Centralize Apache Spark observability on Amazon EMR on EKS with external Spark History Server

This post demonstrates how to centralize Apache Spark observability using Spark History Server (SHS) on Amazon EMR on EKS. We also show how to enhance SHS with performance monitoring tools, using a pattern that applies to many monitoring solutions, such as SparkMeasure and DataFlint.

Amazon SageMaker Lakehouse now supports attribute-based access control

Amazon SageMaker Lakehouse now supports attribute-based access control (ABAC) with AWS Lake Formation, using AWS Identity and Access Management (IAM) principals and session tags to simplify data access, grant creation, and maintenance. In this post, we demonstrate how to get started with SageMaker Lakehouse with ABAC.

Read and write Apache Iceberg tables using AWS Lake Formation hybrid access mode

In this post, we demonstrate how to use Lake Formation for read access while continuing to use AWS Identity and Access Management (IAM) policy-based permissions for write workloads that update the schema and upsert (insert and update combined) data records into the Iceberg tables.

Implement Amazon EMR HBase Graceful Scaling

Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. We can use Amazon EMR with HBase on top of Amazon Simple Storage Service (Amazon S3) for random, strictly consistent real-time access for tables with Apache Kylin. This post demonstrates how to gracefully decommission target region servers programmatically.

Architect fault-tolerant applications with instance fleets on Amazon EMR on EC2

In this post, we show how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. We walk through assessing the historical compute usage of a workload and use a combination of strategies to reduce the likelihood of InsufficientCapacityExceptions (ICE) when Amazon EMR launches specific EC2 instance types. We implement flexible instance fleet strategies to reduce dependency on specific instance types and use Amazon EC2 On-Demand Capacity Reservations (ODCRs) for predictable, steady-state workloads. Following this approach can help prevent job failures due to capacity limits while optimizing your cluster for cost and performance.

Enhance your workload resilience with new Amazon EMR instance fleet features

Amazon EMR has introduced new features for instance fleets that address critical challenges in big data operations. This post explores how these innovations improve cluster resilience, scalability, and efficiency, enabling you to build more robust data processing architectures on AWS.

Use Batch Processing Gateway to automate job management in multi-cluster Amazon EMR on EKS environments

AWS customers often process petabytes of data using Amazon EMR on EKS. In enterprise environments with diverse workloads or varying operational requirements, customers frequently choose a multi-cluster setup due to the following advantages: Better resiliency and no single point of failure – If one cluster fails, other clusters can continue processing critical workloads, maintaining business […]

Migrate data from an on-premises Hadoop environment to Amazon S3 using S3DistCp with AWS Direct Connect

This post demonstrates how to migrate nearly any amount of data from an on-premises Apache Hadoop environment to Amazon Simple Storage Service (Amazon S3) by using S3DistCp on Amazon EMR with AWS Direct Connect. To transfer resources from a target EMR cluster, the traditional Hadoop DistCp must be run on the source cluster to move […]

How the GoDaddy data platform achieved over 60% cost reduction and 50% performance boost by adopting Amazon EMR Serverless

This is a guest post co-written with Brandon Abear, Dinesh Sharma, John Bush, and Ozcan IIikhan from GoDaddy. GoDaddy empowers everyday entrepreneurs by providing all the help and tools to succeed online. With more than 20 million customers worldwide, GoDaddy is the place people come to name their ideas, build a professional website, attract customers, […]

GoDaddy benchmarking results in up to 24% better price-performance for their Spark workloads with AWS Graviton2 on Amazon EMR Serverless

This is a guest post co-written with Mukul Sharma, Software Development Engineer, and Ozcan IIikhan, Director of Engineering from GoDaddy. GoDaddy empowers everyday entrepreneurs by providing all the help and tools to succeed online. With more than 20 million customers worldwide, GoDaddy is the place people come to name their ideas, build a professional website, […]

Tag: Amazon EMR