AWS Big Data Blog

Category: Advanced (300)

Optimize cost and performance for HAQM MWAA

HAQM Managed Workflows for Apache Airflow (HAQM MWAA) is a managed service for Apache Airflow that allows you to orchestrate data pipelines and workflows at scale. With HAQM MWAA, you can design Directed Acyclic Graphs (DAGs) that describe your workflows without the operational burden of managing and scaling the infrastructure. In this post, we provide guidance […]
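
The post centers on authoring DAGs, so as a point of reference, here is a minimal sketch of the kind of DAG you could deploy to an HAQM MWAA environment. The dag_id, schedule, and task logic are illustrative placeholders, not taken from the post.

```python
# Minimal Airflow DAG sketch for HAQM MWAA; names and schedule are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting source data")


def load():
    print("loading into the warehouse")


with DAG(
    dag_id="example_pipeline",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task           # run extract before load
```

Uploading a file like this to the environment's DAGs folder in HAQM S3 is all it takes for HAQM MWAA to pick it up.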

Embed HAQM OpenSearch Service dashboards in your application

Customers across diverse industries rely on HAQM OpenSearch Service for interactive log analytics, real-time application monitoring, website search, and as a vector database, deriving meaningful insights from data and visualizing those insights using OpenSearch Dashboards. Additionally, customers often seek out capabilities that enable effortless sharing of visual dashboards and seamless embedding of these dashboards within their applications, further […]
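
For orientation, embedding usually comes down to serving an iframe that points at a Dashboards embed link. Below is a minimal sketch assuming a Flask application and a hypothetical embed URL; copy the real URL from your dashboard's Share menu, since the exact path and query parameters depend on your domain setup.

```python
# Minimal sketch: embed an OpenSearch Dashboards panel in a web page via an iframe.
# The embed URL below is a hypothetical placeholder; use the link generated by
# your own dashboard's Share > Embed code option.
from flask import Flask

app = Flask(__name__)

DASHBOARD_EMBED_URL = (
    "http://my-domain.us-east-1.es.amazonaws.com/_dashboards/app/"
    "dashboards#/view/my-dashboard-id?embed=true"
)


@app.route("/")
def index():
    return (
        "<html><body>"
        f'<iframe src="{DASHBOARD_EMBED_URL}" height="600" width="800"></iframe>'
        "</body></html>"
    )


if __name__ == "__main__":
    app.run(port=8080)
```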

Implement data quality checks on HAQM Redshift data assets and integrate with HAQM DataZone

In this post, we show how to capture the data quality metrics for data assets produced in HAQM Redshift. With HAQM DataZone, the data owner can directly import the technical metadata of Redshift database tables and views into the HAQM DataZone project’s inventory. As these data assets get imported into HAQM DataZone, they bypass the AWS Glue Data Catalog, creating a gap in data quality integration. This post proposes a solution to enrich the HAQM Redshift data asset with data quality scores and KPI metrics.
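
As a rough illustration of the enrichment step, the sketch below attaches a data quality result to a DataZone asset with the PostTimeSeriesDataPoints API via boto3. The domain and asset identifiers, form name, and content shape are placeholders; consult the HAQM DataZone documentation for the exact data quality form schema.

```python
# Hedged sketch: push a data quality score onto a DataZone asset as a
# time-series data point. All identifiers and the form content are placeholders.
import json
from datetime import datetime, timezone

import boto3

datazone = boto3.client("datazone")

datazone.post_time_series_data_points(
    domainIdentifier="dzd_example123",       # hypothetical domain ID
    entityIdentifier="asset_example456",     # hypothetical Redshift asset ID
    entityType="ASSET",
    forms=[
        {
            "formName": "dataQualityResult",  # placeholder form name
            "typeIdentifier": "amazon.datazone.DataQualityResultFormType",  # verify in docs
            "timestamp": datetime.now(timezone.utc),
            "content": json.dumps({"passingPercentage": 95.0, "evaluationsCount": 10}),
        }
    ],
)
```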

Build a serverless data quality pipeline using Deequ on AWS Lambda

Poor data quality can lead to a variety of problems, including pipeline failures, incorrect reporting, and poor business decisions. For example, if data ingested from one of the systems contains a high number of duplicates, it can result in skewed data in the reporting system. To prevent such issues, data quality checks are integrated into […]
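
To make the duplicate-check idea concrete, here is a minimal PyDeequ sketch of such a check, assuming a Spark session is available to the Lambda function (for example, packaged in a container image); the input path and column name are illustrative.

```python
# Minimal PyDeequ sketch: completeness and uniqueness checks on an ingested dataset.
# Assumes Spark and PyDeequ are available in the runtime; paths/columns are placeholders.
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/ingested/")  # hypothetical input

result = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(
        Check(spark, CheckLevel.Error, "ingest checks")
        .isComplete("order_id")  # no missing IDs
        .isUnique("order_id")    # no duplicates
    )
    .run()
)

VerificationResult.checkResultsAsDataFrame(spark, result).show()
```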

Improve the resilience of HAQM Managed Service for Apache Flink applications with the system-rollback feature

This post explores how to use the system-rollback feature in Managed Service for Apache Flink. We discuss how this functionality improves your application’s resilience by keeping your Flink application highly available. Through an example, you will also learn how to use the APIs to gain more visibility into the application’s operations.
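
As a rough sketch of what those API calls can look like with boto3, the snippet below lists recent operations on an application and triggers a manual rollback; the application name is hypothetical, and the operations-listing call assumes a boto3 release recent enough to include it.

```python
# Hedged sketch: inspect operations on a Managed Service for Apache Flink
# application and roll back to the previous running version. Names are placeholders.
import boto3

kda = boto3.client("kinesisanalyticsv2")

# Visibility: list recent operations (updates, rollbacks) and their status.
ops = kda.list_application_operations(ApplicationName="my-flink-app")
for op in ops.get("ApplicationOperationInfoList", []):
    print(op.get("Operation"), op.get("OperationStatus"))

# Manual rollback if the latest change left the application unhealthy.
desc = kda.describe_application(ApplicationName="my-flink-app")
kda.rollback_application(
    ApplicationName="my-flink-app",
    CurrentApplicationVersionId=desc["ApplicationDetail"]["ApplicationVersionId"],
)
```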

Use AWS Glue to streamline SFTP data processing

In this blog post, we explore how to use the SFTP Connector for AWS Glue from the AWS Marketplace to efficiently process data from Secure File Transfer Protocol (SFTP) servers into HAQM Simple Storage Service (HAQM S3), further empowering your data analytics and insights.
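
For a sense of what the job script can look like, here is a hedged sketch of a Glue PySpark job that reads from an SFTP server through a Marketplace connector and lands the data in HAQM S3. The connection name, remote path, and option keys are placeholders; the exact connection_options depend on the connector version you subscribe to.

```python
# Hedged sketch of a Glue job: SFTP (via Marketplace connector) to HAQM S3.
# Connection names, paths, and option keys are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the SFTP server through the connector attached to a Glue connection.
sftp_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "connectionName": "my-sftp-connection",  # hypothetical Glue connection
        "path": "/inbound/orders.csv",           # hypothetical remote path
    },
)

# Land the data in S3 for downstream analytics.
glue_context.write_dynamic_frame.from_options(
    frame=sftp_dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/sftp-landing/"},
    format="parquet",
)

job.commit()
```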

How AppsFlyer modernized their interactive workload by moving to HAQM Athena and saved 80% of costs

AppsFlyer develops a leading privacy-focused measurement solution that enables marketers to gauge the effectiveness of their marketing activities and integrate them with the broader marketing world, managing a vast volume of 100 billion events every day. This post explores how AppsFlyer modernized their Audiences Segmentation product by using HAQM Athena.
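
For context, an interactive Athena workload like this reduces to submitting SQL and polling for completion. A minimal boto3 sketch, with a placeholder database, query, and output location:

```python
# Minimal sketch: run an Athena query from Python. Database, SQL, and output
# location are placeholders.
import time

import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT audience_id, COUNT(*) FROM events GROUP BY audience_id",
    QueryExecutionContext={"Database": "analytics"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```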

Introducing AWS Glue Data Quality anomaly detection

We are excited to announce the general availability of anomaly detection capabilities in AWS Glue Data Quality. In this post, we demonstrate how this feature works with an example. We provide an AWS CloudFormation template to deploy this setup and experiment with the feature.
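
To give a flavor of the feature, here is a hedged sketch of using it inside a Glue job: a DQDL ruleset declares an analyzer to gather a statistic and a DetectAnomalies rule over it. The table and context names are placeholders, and the DQDL grammar is paraphrased; verify it against the DQDL documentation for your Glue version.

```python
# Hedged sketch: anomaly detection in AWS Glue Data Quality from a Glue job.
# Table, database, and context names are placeholders.
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="analytics", table_name="orders"  # hypothetical catalog table
)

# Gather RowCount as an analyzer and flag anomalous values over time.
ruleset = """
Rules = [
    DetectAnomalies "RowCount"
]
Analyzers = [
    RowCount
]
"""

results = EvaluateDataQuality().process_rows(
    frame=dyf,
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "anomaly_check"},
)
```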

Build a real-time analytics solution with Apache Pinot on AWS

In this post, we provide a step-by-step guide showing how to build a real-time OLAP datastore on HAQM Web Services (AWS) using Apache Pinot on HAQM Elastic Compute Cloud (HAQM EC2) and visualize it in near real time using Tableau. You can use Apache Pinot for batch processing use cases as well, but in this post we focus on a near real-time analytics use case.
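
For a taste of the query side, here is a minimal sketch of querying a Pinot broker from Python with the pinotdb client; the broker host and table name are placeholders for your own EC2 deployment.

```python
# Minimal sketch: query Apache Pinot from Python via the pinotdb client.
# Broker host and table are placeholders; 8099 is Pinot's default broker port.
from pinotdb import connect

conn = connect(host="pinot-broker.example.internal", port=8099,
               path="/query/sql", scheme="http")
cursor = conn.cursor()
cursor.execute("SELECT eventType, COUNT(*) FROM events GROUP BY eventType LIMIT 10")
for row in cursor:
    print(row)
```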

Create a customizable cross-company log lake for compliance, Part I: Business Background

As builders, sometimes we want to dissect a customer experience, find problems, and figure out ways to make it better. That means going a layer down to mix and match primitives together to get more comprehensive features and more customization, flexibility, and freedom. In this post, we introduce Log Lake, a do-it-yourself data lake based on logs from HAQM CloudWatch and AWS CloudTrail.