AWS Big Data Blog

Tag: HAQM EMR

Query big data with resilience using Trino in HAQM EMR with HAQM EC2 Spot Instances for less cost

New enhancements in Trino with HAQM EMR provide improved resiliency for running ETL and batch workloads on Spot Instances at lower cost. This post showcases that resilience by using a fault-tolerant configuration to run long-running queries on Spot Instances, and simulates Spot interruptions on Trino worker nodes with AWS Fault Injection Simulator (AWS FIS).
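
As a rough illustration of the setup this post walks through, the sketch below defines Trino's fault-tolerant execution settings (task-level retries plus an S3-backed exchange manager) and a Spot-only task fleet in Python. The Trino property names come from Trino's fault-tolerant execution documentation; the EMR classification names, bucket, release, and instance types are assumptions to verify against your EMR release.

```python
# Trino fault-tolerant execution settings, expressed as EMR configuration
# classifications (classification names are assumptions for this sketch).
trino_fault_tolerant_configurations = [
    {
        "Classification": "trino-config",
        "Properties": {"retry-policy": "TASK"},  # retry failed tasks instead of whole queries
    },
    {
        "Classification": "trino-exchange-manager",
        "Properties": {
            "exchange-manager.name": "filesystem",
            "exchange.base-directories": "s3://my-bucket/trino-exchange/",  # spooling location
        },
    },
]

# A Spot-only task fleet; diversifying instance types lowers the interruption risk.
spot_task_fleet = {
    "InstanceFleetType": "TASK",
    "TargetSpotCapacity": 4,
    "InstanceTypeConfigs": [
        {"InstanceType": "m5.xlarge"},
        {"InstanceType": "m5a.xlarge"},
    ],
}

# Pass trino_fault_tolerant_configurations as Configurations and include
# spot_task_fleet in Instances["InstanceFleets"] when calling
# boto3.client("emr").run_job_flow(...) to create the cluster.
```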

Apache Iceberg optimization: Solving the small files problem in HAQM EMR

Currently, Iceberg provides a compaction utility that compacts small files at a table or partition level. However, this approach requires you to implement the compaction job with your preferred job scheduler or trigger it manually. In this post, we discuss the new Iceberg feature that you can use to automatically compact small files while writing data into Iceberg tables using Spark on HAQM EMR or HAQM Athena.
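
For reference, a minimal sketch of the existing table-level compaction path with Iceberg's Spark procedure is shown below; the catalog name, warehouse location, table name, and target file size are placeholders.

```python
from pyspark.sql import SparkSession

# Minimal sketch: manual, table-level compaction with Iceberg's rewrite_data_files
# procedure on Spark. Catalog name, warehouse location, and table name are placeholders.
spark = (
    SparkSession.builder
    .appName("iceberg-compaction-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/warehouse/")
    .getOrCreate()
)

# Rewrite small data files into larger ones (target size here is 512 MiB).
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'db.sales',
        options => map('target-file-size-bytes', '536870912')
    )
""").show()
```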

Build a data lake with Apache Flink on HAQM EMR

To build a data-driven business, it is important to democratize enterprise data assets in a data catalog. With a unified data catalog, you can quickly search datasets and figure out data schema, data format, and location. The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep […]
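
As a small illustration of what a unified catalog gives you, the hedged sketch below pulls a dataset's schema, storage format, and location from the AWS Glue Data Catalog with boto3; the database and table names are placeholders.

```python
import boto3

# Look up one dataset's schema, storage format, and S3 location in the Glue Data Catalog.
glue = boto3.client("glue")

table = glue.get_table(DatabaseName="datalake_db", Name="orders")["Table"]
descriptor = table["StorageDescriptor"]

print("Location:", descriptor["Location"])
print("Input format:", descriptor.get("InputFormat"))
print("Columns:", [(col["Name"], col["Type"]) for col in descriptor["Columns"]])
```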

Accelerate your data exploration and experimentation with the AWS Analytics Reference Architecture library

Organizations use their data to solve complex problems by starting small, running iterative experiments, and refining the solution. Although the power of experiments can’t be ignored, organizations have to be cautious about the cost-effectiveness of such experiments. If time is spent creating the underlying infrastructure for enabling experiments, it further adds to the cost. Developers […]

Build your Apache Hudi data lake on AWS using HAQM EMR – Part 1

Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and data pipeline development. It does this by bringing core warehouse and database functionality directly to a data lake on HAQM Simple Storage Service (HAQM S3) or Apache HDFS. Hudi provides table management, instantaneous views, efficient upserts/deletes, advanced indexes, streaming […]
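
To make the upsert capability concrete, here is a minimal PySpark sketch of writing a Hudi table to HAQM S3; the table name, S3 path, and key/precombine fields are placeholders, and it assumes the Hudi Spark bundle is available (as it is when Hudi is selected on an EMR cluster).

```python
from pyspark.sql import SparkSession

# Minimal Hudi upsert sketch on HAQM S3; table name, path, and field names are placeholders.
spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.createDataFrame(
    [(1, "2024-01-01 10:00:00", 100.0), (2, "2024-01-01 11:00:00", 250.0)],
    ["order_id", "updated_at", "amount"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",     # record key for upserts
    "hoodie.datasource.write.precombine.field": "updated_at",  # newest record wins
    "hoodie.datasource.write.operation": "upsert",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://my-bucket/hudi/orders/"))
```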

Run a data processing job on HAQM EMR Serverless with AWS Step Functions

Update Feb 2023: AWS Step Functions added direct integration for 35 services, including HAQM EMR Serverless. In the current version of this post, we submit an EMR Serverless job by invoking the APIs directly from a Step Functions workflow and use a Lambda function only to poll the status of the job […]
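
For orientation, the hedged sketch below shows the underlying EMR Serverless calls the workflow relies on: submitting a Spark job and polling its status. The application ID, role ARN, and script path are placeholders; in the post itself the submission happens from Step Functions and the polling from Lambda.

```python
import time
import boto3

# Submit a Spark job to an existing EMR Serverless application and poll until it finishes.
emr_serverless = boto3.client("emr-serverless")

APPLICATION_ID = "00example123"  # placeholder application ID

job = emr_serverless.start_job_run(
    applicationId=APPLICATION_ID,
    executionRoleArn="arn:aws:iam::111122223333:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/scripts/process_data.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
)

# Poll the job status until it reaches a terminal state.
while True:
    state = emr_serverless.get_job_run(
        applicationId=APPLICATION_ID, jobRunId=job["jobRunId"]
    )["jobRun"]["state"]
    print("Job state:", state)
    if state in ("SUCCESS", "FAILED", "CANCELLED"):
        break
    time.sleep(30)
```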

Upgrade HAQM EMR Hive Metastore from 5.X to 6.X

If you are currently running HAQM EMR 5.x clusters, consider moving to HAQM EMR 6.x, as it includes new features that help you improve performance and optimize costs. For instance, Apache Hive is two times faster with LLAP on HAQM EMR 6.x, and Spark 3 reduces costs by 40%. Additionally, HAQM EMR 6.x releases […]

Design considerations for HAQM EMR on EKS in a multi-tenant HAQM EKS environment

Many AWS customers use HAQM Elastic Kubernetes Service (HAQM EKS) to take advantage of Kubernetes without the burden of managing the Kubernetes control plane. With Kubernetes, you can centrally manage your workloads and offer administrators a multi-tenant environment where they can create, update, scale, and secure workloads using a single API. Kubernetes also […]
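
One common building block in this design is mapping each tenant to its own Kubernetes namespace and registering a separate EMR on EKS virtual cluster against it. The hedged sketch below shows that registration with boto3; the EKS cluster name and namespace are placeholders, and the namespace must already exist and be enabled for EMR on EKS.

```python
import boto3

# Register a per-tenant EMR on EKS virtual cluster bound to a dedicated namespace.
emr_containers = boto3.client("emr-containers")

response = emr_containers.create_virtual_cluster(
    name="team-a-virtual-cluster",
    containerProvider={
        "id": "my-eks-cluster",                        # EKS cluster name (placeholder)
        "type": "EKS",
        "info": {"eksInfo": {"namespace": "team-a"}},  # tenant namespace (placeholder)
    },
)
print("Virtual cluster ID:", response["id"])
```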

Configure Hadoop YARN CapacityScheduler on HAQM EMR on HAQM EC2 for multi-tenant heterogeneous workloads

Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster resource manager responsible for assigning computational resources (CPU, memory, I/O) and for scheduling and monitoring jobs submitted to a Hadoop cluster. This generic framework allows for effective management of cluster resources for distributed data processing frameworks such as Apache Spark, Hadoop MapReduce, and Apache Hive. When […]
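
As a taste of the configuration involved, the sketch below defines two CapacityScheduler queues through the EMR capacity-scheduler configuration classification; the queue names, capacities, and property subset are illustrative only.

```python
# Two illustrative CapacityScheduler queues, expressed as an EMR configuration classification.
capacity_scheduler_configuration = [
    {
        "Classification": "capacity-scheduler",
        "Properties": {
            "yarn.scheduler.capacity.root.queues": "etl,adhoc",
            "yarn.scheduler.capacity.root.etl.capacity": "60",    # guaranteed share (%)
            "yarn.scheduler.capacity.root.adhoc.capacity": "40",
            "yarn.scheduler.capacity.root.etl.maximum-capacity": "80",
            "yarn.scheduler.capacity.root.adhoc.maximum-capacity": "100",
        },
    },
]

# Pass this list as the Configurations parameter of boto3's emr.run_job_flow(...)
# (or attach it in the EMR console) when creating the cluster.
```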

Disaster recovery considerations with HAQM EMR on HAQM EC2 for Spark workloads

HAQM EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. HAQM EMR launches all nodes for a given cluster in the same HAQM Elastic Compute Cloud (HAQM EC2) Availability Zone […]
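
To make the single-Availability-Zone point concrete, the hedged sketch below launches a cluster with instance fleets and offers several candidate subnets, from which EMR picks exactly one Availability Zone; subnet IDs, roles, and the release label are placeholders.

```python
import boto3

# Launch an EMR on EC2 cluster with instance fleets; EMR selects one AZ from the
# candidate subnets, so the whole cluster runs in a single Availability Zone.
emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="spark-dr-sketch",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "Ec2SubnetIds": ["subnet-aaa111", "subnet-bbb222"],  # one subnet per candidate AZ
        "InstanceFleets": [
            {"InstanceFleetType": "MASTER", "TargetOnDemandCapacity": 1,
             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
            {"InstanceFleetType": "CORE", "TargetOnDemandCapacity": 2,
             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
)
print("Cluster ID:", response["JobFlowId"])
```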