AWS Big Data Blog

Category: Amazon EMR

Amazon EMR 7.5 runtime for Apache Spark and Iceberg can run Spark workloads 3.6 times faster than Spark 3.5.3 and Iceberg 1.6.1

The Amazon EMR runtime for Apache Spark offers a high-performance runtime environment while maintaining 100% API compatibility with open source Apache Spark and the Apache Iceberg table format. In this post, we demonstrate the performance benefits of using the Amazon EMR 7.5 runtime for Spark and Iceberg compared to open source Spark 3.5.3 with Iceberg 1.6.1 tables on the TPC-DS 3TB benchmark v2.13.

Run high-availability long-running clusters with Amazon EMR instance fleets

In this post, we demonstrate how to launch a high availability instance fleet cluster using the newly redesigned Amazon EMR console, as well as using an AWS CloudFormation template. We also go over the basic concepts of Hadoop high availability, EMR instance fleets, the benefits and trade-offs of high availability, and best practices for running resilient EMR clusters.

Expand data access through Apache Iceberg using Delta Lake UniForm on AWS

Delta Lake UniForm is an open table format extension designed to provide a universal data representation that can be efficiently read by different processing engines. It aims to bridge the gap between various data formats and processing systems, offering a standardized approach to data storage and retrieval. With UniForm, you can read Delta Lake tables as Apache Iceberg tables. This post explores how to start using Delta Lake UniForm on Amazon Web Services (AWS).

Fine-grained access control in Amazon EMR Serverless with AWS Lake Formation

In this post, we discuss how to implement fine-grained access control in EMR Serverless using Lake Formation. With this integration, organizations can achieve better scalability, flexibility, and cost-efficiency in their data operations, ultimately driving more value from their data assets.

Analyze Amazon EMR on Amazon EC2 cluster usage with Amazon Athena and Amazon QuickSight

In this post, we guide you through deploying a comprehensive solution in your Amazon Web Services (AWS) environment to analyze Amazon EMR on EC2 cluster usage. By using this solution, you will gain a deep understanding of resource consumption and associated costs of individual applications running on your EMR cluster.

Apache HBase online migration to Amazon EMR

Apache HBase is an open source, non-relational distributed database developed as part of the Apache Software Foundation’s Hadoop project. HBase can run on Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3), and can host very large tables with billions of rows and millions of columns. The following are some typical use […]

Enhance Amazon EMR scaling capabilities with Application Master Placement

Starting with the Amazon EMR 7.2 release, Amazon EMR on EC2 introduced a new feature called Application Master (AM) label awareness, which allows users to enable YARN node labels to allocate the AM containers within On-Demand nodes only. In this post, we explore the key features and use cases where this new functionality can provide significant benefits, enabling cluster administrators to achieve optimal resource utilization, improved application reliability, and cost-efficiency in their EMR on EC2 clusters.

Amazon EMR on EC2 cost optimization: How a global financial services provider reduced costs by 30%

In this post, we highlight key lessons learned while helping a global financial services provider migrate their Apache Hadoop clusters to AWS, along with best practices that helped reduce their Amazon EMR, Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Simple Storage Service (Amazon S3) costs by over 30% per month.