AWS Big Data Blog

Category: Analytics

Synchronize data lakes with CDC-based UPSERT using open table format, AWS Glue, and Amazon MSK

This post shows how to build a comprehensive change data capture (CDC) system that processes CDC data sourced from Amazon Relational Database Service (Amazon RDS) for MySQL. First, we create a raw data lake of all modified records in the database in near real time using Amazon MSK, writing to Amazon S3 as raw data. Then, we use an AWS Glue extract, transform, and load (ETL) job for batch processing of CDC data from the S3 raw data lake.

Federating access to Amazon DataZone with AWS IAM Identity Center and Okta

Many customers today rely on Okta or other identity providers (IdPs) to federate access to their technology stack and tools. With federation, security teams can centralize user management in a single place, which simplifies their day-to-day operations and makes them more agile while maintaining the highest security standards. To help develop a data-driven culture, everyone inside […]

Get started with the new Amazon DataZone enhancements for Amazon Redshift

In today’s data-driven landscape, organizations are seeking ways to streamline their data management processes and unlock the full potential of their data assets, while controlling access and enforcing governance. That’s why we introduced Amazon DataZone. Amazon DataZone is a powerful data management service that empowers data engineers, data scientists, product managers, analysts, and business users […]

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources. Data lakes provide a unified repository for organizations to store and use […]

How ATPCO enables governed self-service data access to accelerate innovation with Amazon DataZone

ATPCO is the backbone of modern airline retailing, enabling airlines and third-party channels to deliver the right offers to customers at the right time. ATPCO’s reach is impressive, with its fare data covering over 89% of global flight schedules. In this post, using one of ATPCO’s use cases, we show you how ATPCO uses AWS services, including Amazon DataZone, to make data discoverable by data consumers across different business units so that they can innovate faster. We encourage you to read Amazon DataZone concepts and terminologies first to become familiar with the terms used in this post.

Manage Amazon Redshift provisioned clusters with Terraform

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it straightforward and cost-effective to analyze all your data using standard SQL and your existing extract, transform, and load (ETL); business intelligence (BI); and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per […]

Migrate workloads from AWS Data Pipeline

After careful consideration, we have made the decision to close new customer access to AWS Data Pipeline, effective July 25, 2024. AWS Data Pipeline existing customers can continue to use the service as normal. AWS continues to invest in security, availability, and performance improvements for AWS Data Pipeline, but we do not plan to introduce […]

Transition from Amazon CloudSearch to Amazon OpenSearch Service

After careful consideration, we have made the decision to close new customer access to Amazon CloudSearch, effective July 25, 2024. Amazon CloudSearch existing customers can continue to use the service as normal. AWS continues to invest in security, availability, and performance improvements for Amazon CloudSearch, but we do not plan to introduce new features. At […]

Configure SAML federation with Amazon OpenSearch Serverless and Keycloak

Amazon OpenSearch Serverless is a serverless version of Amazon OpenSearch Service, a fully managed open search and analytics platform. On Amazon OpenSearch Service, you can run petabyte-scale search and analytics workloads without the heavy lifting of managing the underlying OpenSearch Service clusters, and Amazon OpenSearch Serverless supports workloads up to 30TB of data for time-series […]

How ActionIQ built a truly composable customer data platform using Amazon Redshift

This post is written in collaboration with Mackenzie Johnson and Phil Catterall from ActionIQ. ActionIQ is a leading composable customer data platform (CDP) designed for enterprise brands to grow faster and deliver meaningful experiences for their customers. ActionIQ taps directly into a brand’s data warehouse to build smart audiences, resolve customer identities, and design personalized […]