AWS Big Data Blog

Category: Analytics

Accelerate query performance with Apache Iceberg statistics on the AWS Glue Data Catalog

August 2024: This post was updated with Amazon Athena support. Today, we are pleased to announce a new capability for the AWS Glue Data Catalog: generating column-level aggregation statistics for Apache Iceberg tables to accelerate queries. These statistics are used by the cost-based optimizer (CBO) in Amazon Redshift Spectrum and Amazon Athena, resulting in improved query performance […]

Run Apache XTable on Amazon MWAA to translate open table formats

In this post, we show you how to get started with Apache XTable on AWS and how you can use it in a batch pipeline orchestrated with Amazon Managed Workflows for Apache Airflow (Amazon MWAA). To understand how XTable and similar solutions work, we start with a high-level background on metadata management in an open table format (OTF) and then dive deeper into XTable and its usage.


How EchoStar ingests terabytes of data daily across its 5G Open RAN network in near real-time using Amazon Redshift Serverless Streaming Ingestion

EchoStar, a connectivity company providing television entertainment, wireless communications, and award-winning technology to residential and business customers throughout the US, deployed the first standalone, cloud-native Open RAN 5G network on AWS public cloud. This post provides an overview of real-time data analysis with Amazon Redshift and how EchoStar uses it to ingest hundreds of megabytes per second. As data sources and volumes grew across its network, EchoStar migrated from a single Redshift Serverless workgroup to a multi-warehouse architecture with live data sharing.

Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview

We are excited to announce the preview of API-driven, OpenLineage-compatible data lineage in Amazon DataZone to help you capture, store, and visualize the lineage of data movement and transformations of data assets on Amazon DataZone. With the Amazon DataZone OpenLineage-compatible API, domain administrators and data producers can capture and store lineage events beyond what is available […]

Amazon Managed Service for Apache Flink now supports Apache Flink version 1.19

Apache Flink is an open source distributed processing engine, offering powerful programming interfaces for both stream and batch processing, with first-class support for stateful processing and event time semantics. Apache Flink supports multiple programming languages (Java, Python, Scala, SQL) and multiple APIs with different levels of abstraction, which can be used interchangeably in the same […]

Enhance data security with fine-grained access controls in Amazon DataZone

Fine-grained access control is a crucial aspect of data security for modern data lakes and data warehouses. As organizations handle vast amounts of data across multiple data sources, the need to manage sensitive information has become increasingly important. Making sure the right people have access to the right data, without exposing sensitive information to unauthorized […]

Automate data loading from your database into Amazon Redshift using AWS Database Migration Service (DMS), AWS Step Functions, and the Redshift Data API

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per […]

Introducing self-managed data sources for Amazon OpenSearch Ingestion

Enterprise customers increasingly adopt Amazon OpenSearch Ingestion (OSI) to bring data into Amazon OpenSearch Service for various use cases. These include petabyte-scale log analytics, real-time streaming, security analytics, and searching semi-structured key-value or document data. With straightforward integrations, OSI makes it simple to ingest data from many AWS services, including Amazon DynamoDB, Amazon Simple Storage […]

Amazon DataZone enhances data discovery with advanced search filtering

Amazon DataZone, a fully managed data management service, helps organizations catalog, discover, analyze, share, and govern data between data producers and consumers. We are excited to announce advanced search filtering capabilities in the Amazon DataZone business data catalog. With the improved rendering of glossary terms, you can now navigate large sets of […]

Implement disaster recovery with Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. The objective of a disaster recovery plan is […]