AWS Big Data Blog

Category: Compute

Use Batch Processing Gateway to automate job management in multi-cluster HAQM EMR on EKS environments

AWS customers often process petabytes of data using HAQM EMR on EKS. In enterprise environments with diverse workloads or varying operational requirements, customers frequently choose a multi-cluster setup due to the following advantages: Better resiliency and no single point of failure – If one cluster fails, other clusters can continue processing critical workloads, maintaining business […]

How ZS built a clinical knowledge repository for semantic search using HAQM OpenSearch Service and HAQM Neptune

In this blog post, we will highlight how ZS Associates used multiple AWS services to build a highly scalable, highly performant, clinical document search platform. This platform is an advanced information retrieval system engineered to assist healthcare professionals and researchers in navigating vast repositories of medical documents, medical literature, research articles, clinical guidelines, protocol documents, […]

Build a serverless data quality pipeline using Deequ on AWS Lambda

Poor data quality can lead to a variety of problems, including pipeline failures, incorrect reporting, and poor business decisions. For example, if data ingested from one of the systems contains a high number of duplicates, it can result in skewed data in the reporting system. To prevent such issues, data quality checks are integrated into […]

Stream data to HAQM S3 for real-time analytics using the Oracle GoldenGate S3 handler

Modern business applications rely on timely and accurate data with increasing demand for real-time analytics. There is a growing need for efficient and scalable data storage solutions. Data at times is stored in different datasets and needs to be consolidated before meaningful and complete insights can be drawn from the datasets. This is where replication […]

Apache Iceberg metadata layer architecture diagram

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources. Data lakes provide a unified repository for organizations to store and use […]

Disaster recovery strategies for HAQM MWAA – Part 2

HAQM Managed Workflows for Apache Airflow (HAQM MWAA) is a fully managed orchestration service that makes it straightforward to run data processing workflows at scale. HAQM MWAA takes care of operating and scaling Apache Airflow so you can focus on developing workflows. However, although HAQM MWAA provides high availability within an AWS Region through features […]

Implement a full stack serverless search application using AWS Amplify, HAQM Cognito, HAQM API Gateway, AWS Lambda, and HAQM OpenSearch Serverless

Designing a full stack search application requires addressing numerous challenges to provide a smooth and effective user experience. This encompasses tasks such as integrating diverse data from various sources with distinct formats and structures, optimizing the user experience for performance and security, providing multilingual support, and optimizing for cost, operations, and reliability. HAQM OpenSearch Serverless […]

Introducing HAQM EMR on EKS with Apache Flink: A scalable, reliable, and efficient data processing platform

AWS recently announced that Apache Flink is generally available for HAQM EMR on HAQM Elastic Kubernetes Service (EKS). Apache Flink is a scalable, reliable, and efficient data processing framework that handles real-time streaming and batch workloads (but is most commonly used for real-time streaming). HAQM EMR on EKS is a deployment option for HAQM EMR […]

Enable advanced search capabilities for HAQM Keyspaces data by integrating with HAQM OpenSearch Service

In this post, we explore the process of integrating HAQM Keyspaces and HAQM OpenSearch Service using AWS Lambda and HAQM OpenSearch Ingestion to enable advanced search capabilities. The content includes a reference architecture, a step-by-step guide on infrastructure setup, sample code for implementing the solution within a use case, and an AWS Cloud Development Kit (AWS CDK) application for deployment.

Backup and Restore - Pre

Disaster recovery strategies for HAQM MWAA – Part 1

In the dynamic world of cloud computing, ensuring the resilience and availability of critical applications is paramount. Disaster recovery (DR) is the process by which an organization anticipates and addresses technology-related disasters. For organizations implementing critical workload orchestration using HAQM Managed Workflows for Apache Airflow (HAQM MWAA), it is crucial to have a DR plan […]