AWS Big Data Blog

Category: Amazon EMR

Using AWS Glue Data Catalog views with Apache Spark in EMR Serverless and Glue 5.0

In this post, we guide you through creating a Data Catalog view using EMR Serverless, adding the SQL dialect to the view for Athena, sharing it with another account using LF-Tags, and then querying the view in the recipient account using a separate EMR Serverless workspace, an AWS Glue 5.0 Spark job, and Athena. This demonstration showcases the versatility and cross-account capabilities of Data Catalog views and how they can be accessed through various AWS analytics services.

Build a secure serverless streaming pipeline with Amazon MSK Serverless, Amazon EMR Serverless and IAM

This post demonstrates an end-to-end solution for processing data from MSK Serverless using an EMR Serverless Spark Streaming job, secured with IAM authentication. It also shows how to query the processed data using Amazon Athena, giving you an integrated workflow for data processing and analysis. With this solution, you can query the latest data processed from MSK Serverless and EMR Serverless in near real time using Athena.

Scalable analytics and centralized governance for Apache Iceberg tables using Amazon S3 Tables and Amazon Redshift

In this post, we’ll build on the first post in this series to show you how to set up an Apache Iceberg data lake catalog using Amazon S3 Tables and provide different levels of access control to your data. Through this example, you’ll set up fine-grained access controls for multiple users and see how this works using Amazon Redshift. We’ll also review an example that simultaneously uses data residing in both Amazon Redshift and Amazon S3 Tables, enabling a unified analytics experience.

Access Amazon Redshift Managed Storage tables through Apache Spark on AWS Glue and Amazon EMR using Amazon SageMaker Lakehouse

With SageMaker Lakehouse, you can access tables stored in Amazon Redshift managed storage (RMS) through Iceberg APIs, using the Iceberg REST catalog backed by AWS Glue Data Catalog. This post describes how to integrate data on RMS tables through Apache Spark using SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0.

Petabyte-scale data migration made simple: AppsFlyer’s best practice journey with Amazon EMR Serverless

In this post, we share how AppsFlyer successfully migrated their massive data infrastructure from self-managed Hadoop clusters to Amazon EMR Serverless, detailing their best practices, the challenges they faced, and lessons learned that can help guide other organizations through similar transformations.

Build end-to-end Apache Spark pipelines with Amazon MWAA, Batch Processing Gateway, and Amazon EMR on EKS clusters

This post shows how to enhance the multi-cluster solution by integrating Amazon Managed Workflows for Apache Airflow (Amazon MWAA) with Batch Processing Gateway (BPG). By using Amazon MWAA, we add job scheduling and orchestration capabilities, enabling you to build a comprehensive end-to-end Spark-based data processing pipeline.

Read and write Apache Iceberg tables using AWS Lake Formation hybrid access mode

In this post, we demonstrate how to use Lake Formation for read access while continuing to use AWS Identity and Access Management (IAM) policy-based permissions for write workloads that update the schema and upsert (insert and update combined) data records into the Iceberg tables.

Build a data lakehouse in a hybrid environment using Amazon EMR Serverless, Apache DolphinScheduler, and TiDB

This post discusses a decoupled approach to building a serverless data lakehouse using AWS Cloud-centered services, including Amazon EMR Serverless, Amazon Athena, and Amazon Simple Storage Service (Amazon S3), together with Apache DolphinScheduler (an open source data job scheduler) and PingCAP TiDB, a third-party data warehouse product that can be deployed on premises, in the cloud, or as software as a service (SaaS).