AWS Big Data Blog
Category: Best Practices
Enhance Agentforce data security with Private Connect for Salesforce Data Cloud and HAQM Redshift – Part 3
In this post, we discuss how to create AWS endpoint services to improve data security with Private Connect for Salesforce Data Cloud and HAQM Redshift.
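For a flavor of the setup, here is a minimal boto3 sketch of creating an endpoint service fronted by a Network Load Balancer and allowlisting a consumer account; the NLB ARN and principal ARN are placeholders, not values from the post.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create an endpoint service backed by an existing Network Load Balancer
# (the ARN below is a placeholder).
response = ec2.create_vpc_endpoint_service_configuration(
    AcceptanceRequired=True,  # require manual approval of connection requests
    NetworkLoadBalancerArns=[
        "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/net/my-nlb/abc123"
    ],
)
service_id = response["ServiceConfiguration"]["ServiceId"]

# Allow the consuming account (e.g., the one used by Salesforce Data Cloud)
# to create interface endpoints against this service; the principal ARN is
# a placeholder.
ec2.modify_vpc_endpoint_service_permissions(
    ServiceId=service_id,
    AddAllowedPrincipals=["arn:aws:iam::444455556666:root"],
)
print(service_id)
```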
Architect fault-tolerant applications with instance fleets on HAQM EMR on EC2
In this post, we show how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. We walk through assessing the historical compute usage of a workload and using a combination of strategies to reduce the likelihood of InsufficientCapacityExceptions (ICE) when HAQM EMR launches specific EC2 instance types. We implement flexible instance fleet strategies to reduce dependency on specific instance types and use HAQM EC2 On-Demand Capacity Reservations (ODCRs) for predictable, steady-state workloads. Following this approach can help prevent job failures due to capacity limits while optimizing your cluster for cost and performance.
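As a minimal sketch of the fleet-flexibility idea, the boto3 snippet below defines a core instance fleet that any of three instance types can satisfy and prefers matching capacity reservations; the instance types, weights, and target capacity are illustrative, not recommendations from the post.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# A core fleet that any of three instance types can satisfy; WeightedCapacity
# expresses how many capacity units each instance contributes to the target,
# so EMR can mix types to reach TargetOnDemandCapacity.
core_fleet = {
    "Name": "core-fleet",
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 8,
    "InstanceTypeConfigs": [
        {"InstanceType": "r5.2xlarge", "WeightedCapacity": 2},
        {"InstanceType": "r5a.2xlarge", "WeightedCapacity": 2},
        {"InstanceType": "r6g.2xlarge", "WeightedCapacity": 2},
    ],
    "LaunchSpecifications": {
        "OnDemandSpecification": {
            "AllocationStrategy": "lowest-price",
            # Draw from open On-Demand Capacity Reservations before
            # launching regular On-Demand instances.
            "CapacityReservationOptions": {
                "UsageStrategy": "use-capacity-reservations-first"
            },
        }
    },
}

# This dict is passed alongside a MASTER fleet via
# emr.run_job_flow(..., Instances={"InstanceFleets": [master_fleet, core_fleet], ...})
```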
Design patterns for implementing Hive Metastore for HAQM EMR on EKS
In this post, we explore design patterns for implementing the Hive Metastore (HMS) on HAQM EMR on EKS with the Spark Operator, each offering distinct advantages depending on your requirements. Whether you choose to deploy HMS as a sidecar container within the Apache Spark Driver pod, as a Kubernetes deployment in the data processing EKS cluster, or as an external HMS service in a separate EKS cluster, the key considerations revolve around communication efficiency, scalability, resource isolation, high availability, and security.
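Whichever pattern you pick, the Spark side looks much the same: the session is pointed at an HMS thrift endpoint. Here is a minimal PySpark sketch assuming an external HMS exposed as a Kubernetes service; the service DNS name is a placeholder.

```python
from pyspark.sql import SparkSession

# Point Spark at an external Hive Metastore; the thrift endpoint below is a
# placeholder for your HMS Kubernetes service DNS name and port.
spark = (
    SparkSession.builder.appName("hms-example")
    .config(
        "spark.hadoop.hive.metastore.uris",
        "thrift://hive-metastore.hms-namespace.svc.cluster.local:9083",
    )
    .enableHiveSupport()  # route catalog calls through the HMS
    .getOrCreate()
)

# Catalog operations now resolve against the shared metastore.
spark.sql("SHOW DATABASES").show()
```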
Governing streaming data in HAQM DataZone with the Data Solutions Framework on AWS
In this post, we explore how AWS customers can extend HAQM DataZone to support streaming data such as HAQM Managed Streaming for Apache Kafka (HAQM MSK) topics. Developers and DevOps managers can use HAQM MSK, a popular streaming data service, to run Kafka applications and Kafka Connect connectors on AWS without becoming experts in operating Apache Kafka.
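To give a rough sense of what registering a streaming asset could look like, here is a minimal boto3 sketch that creates a custom DataZone asset for an MSK topic; the domain, project, asset type, and topic ARN are all hypothetical placeholders (the custom asset type would be defined beforehand, e.g. with create_asset_type).

```python
import boto3

datazone = boto3.client("datazone", region_name="us-east-1")

# Register an MSK topic as a custom asset in HAQM DataZone. Every
# identifier below is a placeholder for your own domain, project, and
# custom asset type.
response = datazone.create_asset(
    domainIdentifier="dzd_example123",
    owningProjectIdentifier="prj_example456",
    name="orders-topic",
    typeIdentifier="MskTopicAssetType",  # hypothetical custom asset type
    externalIdentifier=(
        "arn:aws:kafka:us-east-1:111122223333:topic/my-cluster/orders"
    ),
)
print(response["id"])
```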
Migrate from Standard brokers to Express brokers in HAQM MSK using HAQM MSK Replicator
Creating a new cluster with Express brokers is straightforward, as described in HAQM MSK Express brokers. However, if you have an existing MSK cluster, you need to migrate to a new Express-based cluster. In this post, we discuss how you should plan and perform the migration to Express brokers for your existing MSK workloads on Standard brokers. Express brokers offer a different user experience and a different shared responsibility boundary, so they can't be enabled on an existing cluster. However, you can use HAQM MSK Replicator to copy all data and metadata from your existing MSK cluster to a new cluster composed of Express brokers.
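As a rough sketch of the replication step, the boto3 snippet below creates an MSK Replicator that copies all topics and consumer groups from the Standard-broker cluster to the Express-broker cluster; the ARNs, subnets, security groups, and IAM role are placeholders.

```python
import boto3

kafka = boto3.client("kafka", region_name="us-east-1")

# All ARNs, subnets, security groups, and the IAM role are placeholders.
SOURCE_ARN = "arn:aws:kafka:us-east-1:111122223333:cluster/standard-cluster/uuid-1"
TARGET_ARN = "arn:aws:kafka:us-east-1:111122223333:cluster/express-cluster/uuid-2"

response = kafka.create_replicator(
    ReplicatorName="standard-to-express-migration",
    ServiceExecutionRoleArn="arn:aws:iam::111122223333:role/MskReplicatorRole",
    KafkaClusters=[
        {
            "HAQMMskCluster": {"MskClusterArn": SOURCE_ARN},
            "VpcConfig": {"SubnetIds": ["subnet-1a"], "SecurityGroupIds": ["sg-1"]},
        },
        {
            "HAQMMskCluster": {"MskClusterArn": TARGET_ARN},
            "VpcConfig": {"SubnetIds": ["subnet-2a"], "SecurityGroupIds": ["sg-2"]},
        },
    ],
    ReplicationInfoList=[
        {
            "SourceKafkaClusterArn": SOURCE_ARN,
            "TargetKafkaClusterArn": TARGET_ARN,
            "TargetCompressionType": "NONE",
            # Replicate all topics and consumer groups from the source.
            "TopicReplication": {"TopicsToReplicate": [".*"]},
            "ConsumerGroupReplication": {"ConsumerGroupsToReplicate": [".*"]},
        }
    ],
)
print(response["ReplicatorArn"])
```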
Handle errors in Apache Flink applications on AWS
This post discusses strategies for handling errors in Apache Flink applications. However, the general principles discussed here apply to stream processing applications at large.
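As a minimal illustration of one such strategy, the PyFlink sketch below routes records that fail to parse to a side output (a dead-letter pattern) instead of failing the job; the input schema and tag name are hypothetical.

```python
import json

from pyflink.common import Types
from pyflink.datastream import OutputTag, StreamExecutionEnvironment
from pyflink.datastream.functions import ProcessFunction

# Records that fail to parse are routed to a side output instead of
# crashing the job -- a common dead-letter-queue pattern.
MALFORMED = OutputTag("malformed", Types.STRING())


class ParseJson(ProcessFunction):
    def process_element(self, value, ctx):
        try:
            event = json.loads(value)
            yield json.dumps({"user": event["user"]})
        except (ValueError, KeyError):
            # Emit the raw record to the side output for later inspection.
            yield MALFORMED, value


env = StreamExecutionEnvironment.get_execution_environment()
stream = env.from_collection(['{"user": "a"}', "not-json"], Types.STRING())

parsed = stream.process(ParseJson(), Types.STRING())
parsed.print()
parsed.get_side_output(MALFORMED).print()

env.execute("error-handling-example")
```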
Use CI/CD best practices to automate HAQM OpenSearch Service cluster management operations
This post explores how to automate HAQM OpenSearch Service cluster management using CI/CD best practices. It presents two options: the Terraform OpenSearch provider and the Evolution library. The solution demonstrates how to use AWS CDK, Lambda, and CodeBuild to implement automated index template creation and management. By applying these techniques, organizations can improve the consistency, reliability, and efficiency of their OpenSearch operations.
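For a flavor of what such a pipeline step runs, here is a minimal opensearch-py sketch that idempotently creates or updates an index template, the kind of operation a CodeBuild job could execute on every deploy; the domain endpoint, credentials, and template contents are placeholders (production pipelines would use IAM/SigV4 auth).

```python
from opensearchpy import OpenSearch

# Connection details below are placeholders for your OpenSearch Service domain.
client = OpenSearch(
    hosts=[{"host": "search-mydomain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("admin", "password"),  # placeholder; prefer IAM/SigV4 auth
    use_ssl=True,
)

# put_index_template is idempotent: re-running the pipeline converges the
# cluster to the desired state instead of failing on "already exists".
client.indices.put_index_template(
    name="logs-template",
    body={
        "index_patterns": ["logs-*"],
        "template": {
            "settings": {"number_of_shards": 1},
            "mappings": {"properties": {"timestamp": {"type": "date"}}},
        },
    },
)
```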
How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes
ANZ Institutional Division has transformed its data management approach by implementing a federated data platform based on data mesh principles. This shift aims to unlock untapped data potential, improve operational efficiency, and increase agility. The new strategy empowers domain teams to create and manage their own data products, treating data as a valuable asset rather than a byproduct. This post explores how the shift to a data product mindset is being implemented, the challenges faced, and the early wins that are shaping the future of data management in the Institutional Division.
Unlocking near real-time analytics with petabytes of transaction data using HAQM Aurora Zero-ETL integration with HAQM Redshift and dbt Cloud
In this post, we explore how to use Aurora MySQL-Compatible Edition Zero-ETL integration with HAQM Redshift and dbt Cloud to enable near real-time analytics. By using dbt Cloud for data transformation, data teams can focus on writing business rules to drive insights from their transaction data to respond effectively to critical, time-sensitive events.
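For a sense of the consumption side, here is a minimal boto3 sketch that queries the replicated data through the Redshift Data API; the workgroup, database, table, and columns are hypothetical.

```python
import boto3

# Query the zero-ETL target database via the Redshift Data API; the
# workgroup, database, table, and column names are placeholders.
redshift_data = boto3.client("redshift-data", region_name="us-east-1")

response = redshift_data.execute_statement(
    WorkgroupName="analytics-workgroup",
    Database="zeroetl_db",
    Sql="""
        SELECT order_status, COUNT(*) AS orders
        FROM orders
        WHERE order_ts > DATEADD(minute, -5, GETDATE())
        GROUP BY order_status;
    """,
)
# Poll describe_statement / get_statement_result with this ID for results.
print(response["Id"])
```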
Accelerate HAQM Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics
Over the last year, HAQM Redshift added several performance optimizations for data lake queries across multiple areas of the query engine, such as query rewrite, planning, scan execution, and consumption of AWS Glue Data Catalog column statistics. In this post, we highlight the performance improvements we observed using the industry-standard TPC-DS benchmark. Overall execution time of the TPC-DS 3 TB benchmark improved by 3x, and some of the queries in our benchmark experienced up to a 12x speedup.
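Column statistics must exist in the Data Catalog before the planner can use them. As a minimal sketch, the boto3 snippet below triggers an AWS Glue column statistics task for a data lake table; the database, table, IAM role, and sample size are placeholders, not values from the benchmark.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Generate column statistics for a Data Catalog table so HAQM Redshift's
# planner has fresh statistics; all names and the role are placeholders.
response = glue.start_column_statistics_task_run(
    DatabaseName="tpcds_3tb",
    TableName="store_sales",
    Role="arn:aws:iam::111122223333:role/glue-stats-role",
    SampleSize=10.0,  # percentage of rows to sample
)
print(response["ColumnStatisticsTaskRunId"])
```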