AWS Big Data Blog
Category: Technical How-to
Build unified pipelines spanning multiple AWS accounts and Regions with HAQM MWAA
In this blog post, we demonstrate how to use HAQM MWAA for centralized orchestration, while distributing data processing and machine learning tasks across different AWS accounts and Regions for optimal performance and compliance.
Enhance governance with metadata enforcement rules in HAQM SageMaker
HAQM SageMaker Catalog now supports metadata rules allowing organizations to enforce metadata standards across data publishing and subscription workflows. In this post, we guide you through two workflows: setting up metadata enforcement rules for a specific domain and publishing an asset or data product in a catalog, and setting up metadata enforcement rules for a specific domain and subscribing to an asset or data product that is owned by a project within that domain.
Build multi-Region resilient Apache Kafka applications with identical topic names using HAQM MSK and HAQM MSK Replicator
This post explains how to use MSK Replicator for cross-cluster data replication and details the failover and failback processes while keeping the same topic name across Regions.
Connect, share, and query where your data sits using HAQM SageMaker Unified Studio
In this blog post, we will demonstrate how business units can use HAQM SageMaker Unified Studio to discover, subscribe to, and analyze these distributed data assets. Through this unified query capability, you can create comprehensive insights into customer transaction patterns and purchase behavior for active products without the traditional barriers of data silos or the need to copy data between systems.
Architect fault-tolerant applications with instance fleets on HAQM EMR on EC2
In this post, we show how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. We walk through assessing the historical compute usage of a workload and use a combination of strategies to reduce the likelihood of InsufficientCapacityExceptions (ICE) when HAQM EMR launches specific EC2 instance types. We implement flexible instance fleet strategies to reduce dependency on specific instance types and use HAQM EC2 On-Demand Capacity Reservation (ODCRs) for predictable, steady-state workloads. Following this approach can help prevent job failures due to capacity limits while optimizing your cluster for cost and performance.
Develop and test AWS Glue 5.0 jobs locally using a Docker container
In this post, we show how to develop and test AWS Glue 5.0 jobs locally using a Docker container. This post is an updated version of the post Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container, and uses AWS Glue 5.0.
Cross-account data collaboration with HAQM DataZone and AWS analytical tools
In this post, we will cover how you can use HAQM DataZone to facilitate data collaboration between AWS accounts.
Design patterns for implementing Hive Metastore for HAQM EMR on EKS
In this post, we explore the design patterns for implementing the Hive Metastore (HMS) with EMR on EKS with Spark Operator, each offering distinct advantages depending on your requirements. Whether you choose to deploy HMS as a sidecar container within the Apache Spark Driver pod, or as a Kubernetes deployment in the data processing EKS cluster, or as an external HMS service in a separate EKS cluster, the key considerations revolve around communication efficiency, scalability, resource isolation, high availability, and security.
Governing streaming data in HAQM DataZone with the Data Solutions Framework on AWS
In this post, we explore how AWS customers can extend HAQM DataZone to support streaming data such as HAQM Managed Streaming for Apache Kafka (HAQM MSK) topics. Developers and DevOps managers can use HAQM MSK, a popular streaming data service, to run Kafka applications and Kafka Connect connectors on AWS without becoming experts in operating it.
HAQM Prime Video advances search for sports using HAQM OpenSearch Service
In this post, we will walk you through how Prime Video used HAQM OpenSearch Service and its AI and machine learning (AI/ML) capabilities to build a more intuitive and enhanced sports search experience.