AWS Big Data Blog

Category: Analytics

Unleash deeper insights with HAQM Redshift data sharing for data lake tables

HAQM Redshift now enables the secure sharing of data lake tables—also known as external tables or HAQM Redshift Spectrum tables—that are managed in the AWS Glue Data Catalog, as well as Redshift views referencing those data lake tables. By using granular access controls, data sharing in HAQM Redshift helps data owners maintain tight governance over who can access the shared information. In this post, we explore powerful use cases that demonstrate how you can enhance cross-team and cross-organizational collaboration, reduce overhead, and unlock new insights by using this innovative data sharing functionality.

HAQM EMR on EC2 cost optimization: How a global financial services provider reduced costs by 30%

In this post, we highlight key lessons learned while helping a global financial services provider migrate their Apache Hadoop clusters to AWS and best practices that helped reduce their HAQM EMR, HAQM Elastic Compute Cloud (HAQM EC2), and HAQM Simple Storage Service (HAQM S3) costs by over 30% per month.

Perform data parity at scale for data modernization programs using AWS Glue Data Quality

In this post, we show you how to use AWS Glue Data Quality, a feature of AWS Glue, to establish data parity during data modernization and migration programs with minimal configuration and infrastructure setup. AWS Glue Data Quality enables you to automatically measure and monitor the quality of your data in data repositories and AWS Glue ETL pipelines.

Extract insights in a 30TB time series workload with HAQM OpenSearch Serverless

We recently announced a new capacity level of 30TB for time series data per account per AWS Region. The OpenSearch Serverless compute capacity for data ingestion and search/query is measured in OpenSearch Compute Units (OCUs), which are shared among various collections with the same AWS Key Management Service (AWS KMS) key. This post discusses how you can analyze 30TB time series datasets with OpenSearch Serverless.

Build a dynamic rules engine with HAQM Managed Service for Apache Flink

This post demonstrates how to implement a dynamic rules engine using HAQM Managed Service for Apache Flink. Our implementation provides the ability to create dynamic rules that can be created and updated without the need to change or redeploy the underlying code or implementation of the rules engine itself. We discuss the architecture, the key services of the implementation, some implementation details that you can use to build your own rules engine, and an AWS Cloud Development Kit (AWS CDK) project to deploy this in your own account.

Deprecation of Lake Formation’s Governed Tables Feature

After careful consideration, we have made the decision to end support for Governed Tables, effective December 31, 2024, to focus on open source transactional table formats such as Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake. This decision stems from customer preference for these open source solutions, which offer ACID-compliant transactions, compaction, time travel, and other features previously provided by Governed Tables.

Accelerate HAQM Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

Over the last year, HAQM Redshift added several performance optimizations for data lake queries across multiple areas of query engine such as rewrite, planning, scan execution and consuming AWS Glue Data Catalog column statistics. In this post, we highlight the performance improvements we observed using industry standard TPC-DS benchmarks. Overall execution time of TPC-DS 3 TB benchmark improved by 3x. Some of the queries in our benchmark experienced up to 12x speed up.

HAQM EMR Serverless observability, Part 1: Monitor HAQM EMR Serverless workers in near real time using HAQM CloudWatch

We have launched job worker metrics in HAQM CloudWatch for EMR Serverless. This feature allows you to monitor vCPUs, memory, ephemeral storage, and disk I/O allocation and usage metrics at an aggregate worker level for your Spark and Hive jobs. This post is part of a series about EMR Serverless observability. In this post, we discuss how to use these CloudWatch metrics to monitor EMR Serverless workers in near real time.

Apply enterprise data governance and management using AWS Lake Formation and AWS IAM Identity Center

In this post, we explore a solution using AWS Lake Formation and AWS IAM Identity Center to address the complex challenges of managing and governing legacy data during digital transformation. We demonstrate how enterprises can effectively preserve historical data while enforcing compliance and maintaining user entitlements. This solution enables your organization to maintain robust audit trails, enforce governance controls, and provide secure, role-based access to data.

Achieve cross-Region resilience with HAQM OpenSearch Ingestion

In this post, we outline two solutions that provide cross-Region resiliency without needing to reestablish relationships during a failback, using an active-active replication model with HAQM OpenSearch Ingestion (OSI) and HAQM Simple Storage Service (HAQM S3). These solutions apply to both OpenSearch Service managed clusters and OpenSearch Serverless collections. We use OpenSearch Serverless as an example for the configurations in this post.