AWS Big Data Blog

Category: Amazon EMR

Configure Hadoop YARN CapacityScheduler on Amazon EMR on Amazon EC2 for multi-tenant heterogeneous workloads

Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster resource manager responsible for assigning computational resources (CPU, memory, I/O), and scheduling and monitoring jobs submitted to a Hadoop cluster. This generic framework allows for effective management of cluster resources for distributed data processing frameworks, such as Apache Spark, Apache MapReduce, and Apache Hive. When […]

Amazon EMR on EKS gets up to 19% performance boost running on AWS Graviton3 Processors vs. Graviton2

Amazon EMR on EKS is a deployment option that enables you to run Spark workloads on Amazon Elastic Kubernetes Service (Amazon EKS) easily. It allows you to innovate faster with the latest Apache Spark on Kubernetes architecture while benefiting from the performance-optimized Spark runtime powered by Amazon EMR. This deployment option elects Amazon EKS as […]

Design patterns to manage Amazon EMR on EKS workloads for Apache Spark

Amazon EMR on Amazon EKS enables you to submit Apache Spark jobs on demand on Amazon Elastic Kubernetes Service (Amazon EKS) without provisioning clusters. With EMR on EKS, you can consolidate analytical workloads with your other Kubernetes-based applications on the same Amazon EKS cluster to improve resource utilization and simplify infrastructure management. Kubernetes uses namespaces to provide isolation between […]

Stream Amazon EMR on EKS logs to third-party providers like Splunk, Amazon OpenSearch Service, or other log aggregators

Spark jobs running on Amazon EMR on EKS generate logs that are useful both for identifying issues with Spark processes and for viewing Spark outputs. You can access these logs from a variety of sources. On the Amazon EMR virtual cluster console, you can access logs from the Spark History UI. […]

Enable federated governance using Trino and Apache Ranger on Amazon EMR

Managing data through a central data platform simplifies staffing and training challenges and reduces costs. However, it can create scaling, ownership, and accountability challenges, because central teams may not understand the specific needs of a data domain, whether it’s because of data types and storage, security, data catalog requirements, or specific technologies needed for […]

Disaster recovery considerations with Amazon EMR on Amazon EC2 for Spark workloads

Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. Amazon EMR launches all nodes for a given cluster in the same Amazon Elastic Compute Cloud (Amazon EC2) Availability Zone […]

Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR

Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. Apache Iceberg is an open table format for huge analytic datasets. Table formats typically indicate the format and location of […]

Build an Apache Iceberg data lake using Amazon Athena, Amazon EMR, and AWS Glue

March 2024: This post was reviewed and updated for accuracy. Most businesses store their critical data in a data lake, where you can bring data from various sources into centralized storage. The data is processed by specialized big data compute engines, such as Amazon Athena for interactive queries, Amazon EMR for Apache Spark applications, […]

Deep dive into Amazon EMR Kerberos authentication integrated with Microsoft Active Directory

Many of our customers that use Amazon EMR as their big data platform need to integrate with their existing Microsoft Active Directory (AD) for user authentication. This integration requires the Kerberos daemon of Amazon EMR to establish a trusted connection with an AD domain, which involves a lot of moving pieces and can be difficult […]

How Paytm modernized their data pipeline using Amazon EMR

This post was co-written by Rajat Bhardwaj, Senior Technical Account Manager at AWS, and Kunal Upadhyay, General Manager at Paytm. Paytm is India’s leading payment platform, pioneering the digital payment era in India with 130 million active users. Paytm operates multiple lines of business, including banking, digital payments, bill recharges, e-wallet, stocks, insurance, lending and […]