AWS Big Data Blog
Category: Amazon EMR
Amazon EMR 7.5 runtime for Apache Spark and Iceberg can run Spark workloads 3.6 times faster than Spark 3.5.3 and Iceberg 1.6.1
The Amazon EMR runtime for Apache Spark offers a high-performance runtime environment while maintaining 100% API compatibility with open source Apache Spark and the Apache Iceberg table format. In this post, we demonstrate the performance benefits of using the Amazon EMR 7.5 runtime for Spark and Iceberg compared to open source Spark 3.5.3 with Iceberg 1.6.1 tables on the TPC-DS 3TB benchmark v2.13.
Run Apache Spark Structured Streaming jobs at scale on Amazon EMR Serverless
Amazon EMR Serverless emerges as a pivotal solution for running streaming workloads, enabling the use of the latest open source frameworks like Spark without the need for configuration, optimization, security, or cluster management. In this post, we highlight some of the key enhancements introduced for streaming jobs.
Amazon EMR streamlines big data processing with simplified Amazon S3 Glacier access
In this post, we demonstrate how to set up and use Amazon EMR on EC2 with S3 Glacier for cost-effective data processing.
Run high-availability long-running clusters with Amazon EMR instance fleets
In this post, we demonstrate how to launch a high-availability instance fleet cluster using the newly redesigned Amazon EMR console, as well as using an AWS CloudFormation template. We also go over the basic concepts of Hadoop high availability, EMR instance fleets, the benefits and trade-offs of high availability, and best practices for running resilient EMR clusters.
Expand data access through Apache Iceberg using Delta Lake UniForm on AWS
Delta Lake UniForm is an open table format extension designed to provide a universal data representation that can be efficiently read by different processing engines. It aims to bridge the gap between various data formats and processing systems, offering a standardized approach to data storage and retrieval. With UniForm, you can read Delta Lake tables as Apache Iceberg tables. This post explores how to start using Delta Lake UniForm on Amazon Web Services (AWS).
Fine-grained access control in Amazon EMR Serverless with AWS Lake Formation
In this post, we discuss how to implement fine-grained access control in EMR Serverless using Lake Formation. With this integration, organizations can achieve better scalability, flexibility, and cost-efficiency in their data operations, ultimately driving more value from their data assets.
Analyze Amazon EMR on Amazon EC2 cluster usage with Amazon Athena and Amazon QuickSight
In this post, we guide you through deploying a comprehensive solution in your Amazon Web Services (AWS) environment to analyze Amazon EMR on EC2 cluster usage. By using this solution, you will gain a deep understanding of resource consumption and associated costs of individual applications running on your EMR cluster.
Apache HBase online migration to Amazon EMR
Apache HBase is an open source, non-relational distributed database developed as part of the Apache Software Foundation’s Hadoop project. HBase can run on Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3), and can host very large tables with billions of rows and millions of columns. The following are some typical use […]
Enhance Amazon EMR scaling capabilities with Application Master Placement
Starting with the Amazon EMR 7.2 release, Amazon EMR on EC2 introduced a new feature called Application Master (AM) label awareness, which allows users to enable YARN node labels to allocate the AM containers within On-Demand nodes only. In this post, we explore the key features and use cases where this new functionality can provide significant benefits, enabling cluster administrators to achieve optimal resource utilization, improved application reliability, and cost-efficiency in their EMR on EC2 clusters.
Amazon EMR on EC2 cost optimization: How a global financial services provider reduced costs by 30%
In this post, we highlight key lessons learned while helping a global financial services provider migrate their Apache Hadoop clusters to AWS and best practices that helped reduce their Amazon EMR, Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Simple Storage Service (Amazon S3) costs by over 30% per month.