AWS Big Data Blog

Category: Amazon EMR

Upgrade Amazon EMR Hive Metastore from 5.X to 6.X

If you are currently running Amazon EMR 5.X clusters, consider moving to Amazon EMR 6.X, because it includes new features that help you improve performance and optimize costs. For instance, Apache Hive is two times faster with LLAP on Amazon EMR 6.X, and Spark 3 reduces costs by 40%. Additionally, Amazon EMR 6.x releases […]

Design considerations for Amazon EMR on EKS in a multi-tenant Amazon EKS environment

Many AWS customers use Amazon Elastic Kubernetes Service (Amazon EKS) to take advantage of Kubernetes without the burden of managing the Kubernetes control plane. With Kubernetes, you can centrally manage your workloads and offer administrators a multi-tenant environment where they can create, update, scale, and secure workloads using a single API. Kubernetes also […]

How ZS created a multi-tenant self-service data orchestration platform using Amazon MWAA

This post is co-authored by Manish Mehra, Anirudh Vohra, Sidrah Sayyad, and Abhishek I S (from ZS), and Parnab Basak (from AWS). The team at ZS collaborated closely with AWS to build a modern, cloud-native data orchestration platform. ZS is a management consulting and technology firm focused on transforming global healthcare and beyond. We […]

Optimize Amazon EMR costs for legacy and Spark workloads

December 2023: This post was reviewed and updated for accuracy. Customers migrating from large on-premises Hadoop clusters to Amazon EMR want to reduce their operational costs while running resilient applications. On-premises customers typically use inelastic, large, fixed-size Hadoop clusters, which incur high capital expenditure. You can now migrate your mixed workloads to Amazon EMR, which […]

Run Apache Spark with Amazon EMR on EKS backed by Amazon FSx for Lustre storage

September 2023: This post was reviewed and updated for accuracy to reflect recent improvements and changes. Traditionally, Spark workloads have been run on a dedicated setup like a Hadoop stack with YARN or Mesos as a resource manager. Starting from Apache Spark 2.3, Spark added support for Kubernetes as a resource manager. The new Kubernetes […]

Implement a highly available key distribution center for Amazon EMR

High availability (HA) is the property of a system or service to operate continuously without failing for a designated period of time. Implementing HA properties over a system allows you to eliminate single points of failure that usually translate to service disruptions, which can then lead to a business loss or the inability to use […]

Store Amazon EMR in-transit data encryption certificates using AWS Secrets Manager

With Amazon EMR, you can use a security configuration to specify settings for encrypting data in transit. When in-transit encryption is configured, you can enable application-specific encryption features, for example: Hadoop HDFS NameNode or DataNode user interfaces use HTTPS; Hadoop MapReduce encrypted shuffle uses Transport Layer Security (TLS); Presto nodes internal communication uses SSL/TLS (Amazon […]

Convert Oracle XML BLOB data using Amazon EMR and load to Amazon Redshift

In legacy relational database management systems, data is stored in several complex data types, such as XML, JSON, BLOB, or CLOB. This data might contain valuable information that is often difficult to transform into insights, so you might be looking for ways to load and use this data in a modern cloud data warehouse such as […]

Removing complexity to improve business performance: How Bridgewater Associates built a scalable, secure, Spark-based research service on AWS

This is a guest post co-written by Sergei Dubinin, Oleksandr Ierenkov, Illia Popov and Joel Thompson, from Bridgewater. Bridgewater’s core mission is to understand how the world works by analyzing the drivers of markets and turning that understanding into high-quality portfolios and investment advice for our clients. Within Bridgewater Technology, we strive to make our […]

Set up federated access to Amazon Athena for Microsoft AD FS users using AWS Lake Formation and a JDBC client

Tens of thousands of AWS customers choose Amazon Simple Storage Service (Amazon S3) as their data lake to run big data analytics, interactive queries, high-performance computing, and artificial intelligence (AI) and machine learning (ML) applications to gain business insights from their data. On top of these data lakes, you can use AWS Lake Formation to […]