AWS Big Data Blog

Category: Technical How-to

Build unified pipelines spanning multiple AWS accounts and Regions with Amazon MWAA

In this blog post, we demonstrate how to use Amazon MWAA for centralized orchestration, while distributing data processing and machine learning tasks across different AWS accounts and Regions for optimal performance and compliance.

Enhance governance with metadata enforcement rules in Amazon SageMaker

Amazon SageMaker Catalog now supports metadata rules, which let organizations enforce metadata standards across data publishing and subscription workflows. In this post, we guide you through two workflows, each of which starts by setting up metadata enforcement rules for a specific domain: publishing an asset or data product in a catalog, and subscribing to an asset or data product that is owned by a project within that domain.

Connect, share, and query where your data sits using Amazon SageMaker Unified Studio

In this blog post, we demonstrate how business units can use Amazon SageMaker Unified Studio to discover, subscribe to, and analyze distributed data assets. Through this unified query capability, you can create comprehensive insights into customer transaction patterns and purchase behavior for active products without the traditional barriers of data silos or the need to copy data between systems.

Architect fault-tolerant applications with instance fleets on Amazon EMR on EC2

In this post, we show how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. We walk through assessing the historical compute usage of a workload and combining strategies to reduce the likelihood of InsufficientCapacityExceptions (ICE) when Amazon EMR launches specific EC2 instance types. We implement flexible instance fleet strategies to reduce dependency on specific instance types and use Amazon EC2 On-Demand Capacity Reservations (ODCRs) for predictable, steady-state workloads. Following this approach can help prevent job failures due to capacity limits while optimizing your cluster for cost and performance.

Design patterns for implementing Hive Metastore for Amazon EMR on EKS

In this post, we explore the design patterns for implementing the Hive Metastore (HMS) with EMR on EKS with Spark Operator, each offering distinct advantages depending on your requirements. Whether you deploy HMS as a sidecar container within the Apache Spark Driver pod, as a Kubernetes deployment in the data processing EKS cluster, or as an external HMS service in a separate EKS cluster, the key considerations revolve around communication efficiency, scalability, resource isolation, high availability, and security.

Governing streaming data in Amazon DataZone with the Data Solutions Framework on AWS

In this post, we explore how AWS customers can extend Amazon DataZone to support streaming data such as Amazon Managed Streaming for Apache Kafka (Amazon MSK) topics. Developers and DevOps managers can use Amazon MSK, a popular streaming data service, to run Kafka applications and Kafka Connect connectors on AWS without becoming experts in operating it.