AWS Big Data Blog

Improve Apache Kafka scalability and resiliency using Amazon MSK tiered storage

Since the launch of tiered storage for Amazon Managed Streaming for Apache Kafka (Amazon MSK), customers have embraced this feature for its ability to optimize storage costs and improve performance. In previous posts, we explored the inner workings of Kafka, maximized the potential of Amazon MSK, and delved into the intricacies of Amazon MSK tiered […]

Create a customizable cross-company log lake for compliance, Part I: Business Background

As builders, sometimes you want to dissect a customer experience, find problems, and figure out ways to make it better. That means going a layer down to mix and match primitives together to get more comprehensive features and more customization, flexibility, and freedom. In this post, we introduce Log Lake, a do-it-yourself data lake based on logs from CloudWatch and AWS CloudTrail.

Unlock scalability, cost-efficiency, and faster insights with large-scale data migration to Amazon Redshift

Large-scale data warehouse migration to the cloud is a complex and challenging endeavor that many organizations undertake to modernize their data infrastructure, enhance data management capabilities, and unlock new business opportunities. As data volumes continue to grow exponentially, traditional data warehousing solutions may struggle to keep up with the increasing demands for scalability, performance, and […]

Deliver Amazon CloudWatch logs to Amazon OpenSearch Serverless

In this blog post, we show how to use Amazon OpenSearch Ingestion to deliver CloudWatch logs to OpenSearch Serverless in near real time. We outline a mechanism to connect a Lambda subscription filter with OpenSearch Ingestion and deliver logs to OpenSearch Serverless without explicitly needing a separate subscription filter for it.

Synchronize data lakes with CDC-based UPSERT using open table format, AWS Glue, and Amazon MSK

This post illustrates the construction of a comprehensive change data capture (CDC) system that processes CDC data sourced from Amazon Relational Database Service (Amazon RDS) for MySQL. First, we create a raw data lake of all modified records in the database in near real time, using Amazon MSK and writing to Amazon S3 as raw data. Later, we use an AWS Glue extract, transform, and load (ETL) job for batch processing of CDC data from the S3 raw data lake.

Integrate Amazon MWAA with Microsoft Entra ID using SAML authentication

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) provides a fully managed solution for orchestrating and automating complex workflows in the cloud. Amazon MWAA offers two network access modes for accessing the Apache Airflow web UI in your environments: public and private. Customers often deploy Amazon MWAA in private mode and want to use existing […]

Federating access to Amazon DataZone with AWS IAM Identity Center and Okta

Many customers today rely on Okta or other identity providers (IdPs) to federate access to their technology stack and tools. With federation, security teams can centralize user management in a single place, which simplifies their day-to-day operations and makes them more agile while maintaining the highest security standards. To help develop a data-driven culture, everyone inside […]

Get started with the new Amazon DataZone enhancements for Amazon Redshift

In today’s data-driven landscape, organizations are seeking ways to streamline their data management processes and unlock the full potential of their data assets, while controlling access and enforcing governance. That’s why we introduced Amazon DataZone. Amazon DataZone is a powerful data management service that empowers data engineers, data scientists, product managers, analysts, and business users […]

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources. Data lakes provide a unified repository for organizations to store and use […]

How ATPCO enables governed self-service data access to accelerate innovation with Amazon DataZone

ATPCO is the backbone of modern airline retailing, enabling airlines and third-party channels to deliver the right offers to customers at the right time. ATPCO’s reach is impressive, with its fare data covering over 89% of global flight schedules. In this post, using one of ATPCO’s use cases, we show you how ATPCO uses AWS services, including Amazon DataZone, to make data discoverable by data consumers across different business units so that they can innovate faster. We encourage you to read Amazon DataZone concepts and terminologies first to become familiar with the terms used in this post.