AWS Big Data Blog

Migrate data from an on-premises Hadoop environment to Amazon S3 using S3DistCp with AWS Direct Connect

This post demonstrates how to migrate nearly any amount of data from an on-premises Apache Hadoop environment to Amazon Simple Storage Service (Amazon S3) by using S3DistCp on Amazon EMR with AWS Direct Connect. To transfer resources from a target EMR cluster, the traditional Hadoop DistCp must be run on the source cluster to move […]

Flow of logs from source to destination. All logs are sent to Cribl, which routes portions of logs to the SIEM, portions to Amazon OpenSearch, and copies of logs to Amazon S3.

How Zurich Insurance Group built a log management solution on AWS

This post is written in collaboration with Clarisa Tavolieri, Austin Rappeport, and Samantha Gignac from Zurich Insurance Group. The volume and number of logging sources have grown exponentially over the last few years and will continue to grow in the coming years. As a result, customers across all industries are facing multiple […]

How PostNL processes billions of IoT events with Amazon Managed Service for Apache Flink

This post is co-written with Çağrı Çakır and Özge Kavalcı from PostNL. PostNL is the designated universal postal service provider for the Netherlands and has three main business units offering postal delivery, parcel delivery, and logistics solutions for ecommerce and cross-border solutions. With 5,800 retail points, 11,000 mailboxes, and over 900 automated parcel lockers, the […]

Protein similarity search using ProtT5-XL-UniRef50 and Amazon OpenSearch Service

A protein is a sequence of amino acids that, when chained together, creates a 3D structure. This 3D structure allows the protein to bind to other structures within the body and initiate changes. This binding is core to the working of many drugs. A common workflow within drug discovery is searching for similar proteins, because […]

Improve your Amazon OpenSearch Service performance with OpenSearch Optimized Instances

Amazon OpenSearch Service introduced the OpenSearch Optimized Instances (OR1), which deliver price-performance improvements over existing instances. The newly introduced OR1 instances are tailored for heavy indexing use cases like log analytics and observability workloads. OR1 instances use a local and a remote store. The local storage utilizes either Amazon Elastic Block Store (Amazon EBS) of […]

Author data integration jobs with an interactive data preparation experience with AWS Glue visual ETL

We are excited to announce a new capability of the AWS Glue Studio visual editor that offers a new visual user experience. Now you can author data preparation transformations and edit them with the AWS Glue Studio visual editor. The AWS Glue Studio visual editor is a graphical interface that enables you to create, run, […]

Accelerate query performance with Apache Iceberg statistics on the AWS Glue Data Catalog

August 2024: This post was updated with Amazon Athena support. Today, we are pleased to announce a new capability for the AWS Glue Data Catalog: generating column-level aggregation statistics for Apache Iceberg tables to accelerate queries. These statistics are used by the cost-based optimizer (CBO) in Amazon Redshift Spectrum and Amazon Athena, resulting in improved query performance […]

Introducing Amazon MWAA support for Apache Airflow version 2.9.2

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that significantly improves security and availability, and reduces infrastructure management overhead when setting up and operating end-to-end data pipelines in the cloud. Today, we are announcing the availability of Apache Airflow version 2.9.2 environments on Amazon MWAA. Apache Airflow […]

Run Apache XTable on Amazon MWAA to translate open table formats

In this post, we show you how to get started with Apache XTable on AWS and how you can use it in a batch pipeline orchestrated with Amazon Managed Workflows for Apache Airflow (Amazon MWAA). To understand how XTable and similar solutions work, we start with a high-level background on metadata management in an open table format (OTF) and then dive deeper into XTable and its usage.

How EchoStar ingests terabytes of data daily across its 5G Open RAN network in near real-time using Amazon Redshift Serverless Streaming Ingestion

EchoStar, a connectivity company providing television entertainment, wireless communications, and award-winning technology to residential and business customers throughout the US, deployed the first standalone, cloud-native Open RAN 5G network on AWS public cloud. This post provides an overview of real-time data analysis with Amazon Redshift and how EchoStar uses it to ingest hundreds of megabytes per second. As data sources and volumes grew across its network, EchoStar migrated from a single Redshift Serverless workgroup to a multi-warehouse architecture with live data sharing.