AWS Big Data Blog
How to track HAQM OpenSearch Service domain-level cost
HAQM OpenSearch Service pricing is based on three dimensions: instances, storage, and data transfer. Storage pricing depends on both the storage type and the storage tier you choose. Visibility into domain-level charges enables accurate budgeting, efficient resource allocation, fair cost attribution across projects, and overall cost transparency. In this post, we show you how to view OpenSearch Service domain-level cost using AWS Cost Explorer.
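For a quick sense of what this looks like in practice, here is a minimal sketch using the Cost Explorer API from Python. It assumes each domain carries a cost allocation tag (the key opensearch-domain below is a placeholder) that has been activated in the Billing console; the post itself walks through the console-based setup.

```python
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-06-01", "End": "2024-07-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    # Restrict the results to OpenSearch Service charges
    Filter={"Dimensions": {"Key": "SERVICE",
                           "Values": ["HAQM OpenSearch Service"]}},
    # Break the cost down per domain via a cost allocation tag
    GroupBy=[{"Type": "TAG", "Key": "opensearch-domain"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])
```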
Migrate Delta tables from Azure Data Lake Storage to HAQM S3 using AWS Glue
Organizations are increasingly adopting a multi-cloud strategy to run their production workloads. We often see requests from customers who began their data journey by building data lakes on Microsoft Azure and now want to extend access to that data to AWS services. Customers want to use a variety of AWS analytics, data, AI, and machine learning (ML) […]
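As a rough illustration of the core copy step such a migration involves, here is a hedged PySpark sketch for an AWS Glue job. It assumes the job is configured for Delta Lake (for example, via the --datalake-formats delta job parameter) and has credentials for the Azure storage account; the paths and bucket names are placeholders.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

spark = GlueContext(SparkContext.getOrCreate()).spark_session

# Placeholder locations: a Delta table on ADLS and a target S3 prefix
source_path = "abfss://container@account.dfs.core.windows.net/delta/sales"
target_path = "s3://my-datalake-bucket/delta/sales"

# Read the Delta table from Azure Data Lake Storage ...
df = spark.read.format("delta").load(source_path)

# ... and write it back out as a Delta table on HAQM S3
df.write.format("delta").mode("overwrite").save(target_path)
```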
Evaluating sample HAQM Redshift data sharing architecture using Redshift Test Drive and advanced SQL analysis
In this post, we walk you through the process of testing a workload isolation architecture using HAQM Redshift data sharing and the Redshift Test Drive utility. We demonstrate how you can use SQL for advanced price performance analysis and compare different workloads across different target Redshift cluster configurations.
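As a hedged illustration of this kind of analysis, the sketch below submits an example query through the Redshift Data API against one target configuration. The cluster name is a placeholder, and the query against SYS_QUERY_HISTORY is illustrative rather than the exact SQL used in the post.

```python
import boto3

rsd = boto3.client("redshift-data")

# Submit an analysis query to one hypothetical replay target cluster
result = rsd.execute_statement(
    ClusterIdentifier="target-cluster-a",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        SELECT query_type,
               COUNT(*)                AS queries,
               AVG(elapsed_time) / 1e6 AS avg_elapsed_seconds
        FROM sys_query_history
        GROUP BY query_type
        ORDER BY avg_elapsed_seconds DESC;
    """,
)
print("Statement submitted:", result["Id"])
```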
Accelerate data integration with Salesforce and AWS using AWS Glue
To meet the demands of diverse data integration use cases, AWS Glue now supports SaaS connectivity for Salesforce. This enables users to quickly preview and transfer their customer relationship management (CRM) data, fetch the schema dynamically on request, and query the data. This post explores the new Salesforce connector for AWS Glue and demonstrates how to build a modern extract, transform, and load (ETL) pipeline with AWS Glue ETL scripts.
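As a minimal sketch of what reading a Salesforce object from a Glue ETL script might look like, assuming a Salesforce connection has already been created in AWS Glue (the connection name is hypothetical, and the option keys may differ slightly by Glue version):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the Salesforce Account object through the native connector
accounts = glue_context.create_dynamic_frame.from_options(
    connection_type="salesforce",
    connection_options={
        "connectionName": "my-salesforce-connection",  # placeholder
        "entityName": "Account",
        "apiVersion": "v60.0",
    },
)
print("Fetched", accounts.count(), "Account records")
```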
How Kaplan, Inc. implemented modern data pipelines using HAQM MWAA and HAQM AppFlow with HAQM Redshift as a data warehouse
Kaplan, Inc. provides individuals, educational institutions, and businesses with a broad array of services, supporting its students and partners to meet their diverse and evolving needs throughout their educational and professional journeys. In this post, we discuss how the Kaplan data engineering team implemented data integration from the Salesforce application to HAQM Redshift. The solution uses HAQM Simple Storage Service (HAQM S3) as a data lake, HAQM Redshift as a data warehouse, HAQM Managed Workflows for Apache Airflow (HAQM MWAA) as an orchestrator, and Tableau as the presentation layer.
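As a hedged sketch of this orchestration pattern (not Kaplan's actual configuration), an HAQM MWAA DAG might chain an AppFlow run to a Redshift load using the HAQM provider operators; every name below is a placeholder, and a recent HAQM provider package is assumed.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.appflow import AppflowRunOperator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import (
    S3ToRedshiftOperator,
)

with DAG(
    dag_id="salesforce_to_redshift",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Trigger a pre-built AppFlow flow that lands Salesforce data in S3
    run_flow = AppflowRunOperator(
        task_id="run_salesforce_flow",
        flow_name="salesforce-to-s3",
    )

    # COPY the landed files into a Redshift staging table
    load_redshift = S3ToRedshiftOperator(
        task_id="load_to_redshift",
        schema="staging",
        table="salesforce_accounts",
        s3_bucket="my-data-lake",
        s3_key="salesforce/accounts/",
        copy_options=["FORMAT AS PARQUET"],
        redshift_conn_id="redshift_default",
    )

    run_flow >> load_redshift
```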
Optimize cost and performance for HAQM MWAA
HAQM Managed Workflows for Apache Airflow (HAQM MWAA) is a managed service for Apache Airflow that allows you to orchestrate data pipelines and workflows at scale. With HAQM MWAA, you can design Directed Acyclic Graphs (DAGs) that describe your workflows without taking on the operational burden of managing and scaling the infrastructure. In this post, we provide guidance […]
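One lever such guidance typically covers is right-sizing the environment. As a hedged sketch (the environment name and sizing values below are placeholders, not recommendations), the HAQM MWAA API lets you adjust the environment class and the worker autoscaling range:

```python
import boto3

mwaa = boto3.client("mwaa")

# Right-size an existing environment: a smaller environment class and a
# tighter worker autoscaling range both reduce cost if utilization allows
mwaa.update_environment(
    Name="my-mwaa-environment",   # hypothetical environment
    EnvironmentClass="mw1.small",
    MinWorkers=1,
    MaxWorkers=5,
    Schedulers=2,
)
```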
Embed HAQM OpenSearch Service dashboards in your application
Customers across diverse industries rely on HAQM OpenSearch Service for interactive log analytics, real-time application monitoring, website search, and vector database workloads, deriving meaningful insights from their data and visualizing those insights using OpenSearch Dashboards. Additionally, customers often seek capabilities that enable effortless sharing of visual dashboards and seamless embedding of these dashboards within their applications, further […]
Implement data quality checks on HAQM Redshift data assets and integrate with HAQM DataZone
In this post, we show how to capture data quality metrics for data assets produced in HAQM Redshift. With HAQM DataZone, the data owner can directly import the technical metadata of Redshift database tables and views into the HAQM DataZone project's inventory. Because these data assets are imported directly, they bypass the AWS Glue Data Catalog, creating a gap in data quality integration. This post proposes a solution to enrich the HAQM Redshift data asset with data quality scores and KPI metrics.
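As a rough sketch of the enrichment step, assuming externally computed quality results, the DataZone PostTimeSeriesDataPoints API can attach time series data quality forms to an asset. All identifiers and the form content below are placeholders, not the exact schema from the post.

```python
import json
from datetime import datetime, timezone

import boto3

datazone = boto3.client("datazone")

# Push an externally computed quality result onto the imported asset;
# domain and asset IDs are hypothetical placeholders
datazone.post_time_series_data_points(
    domainIdentifier="dzd_exampledomain",
    entityIdentifier="asset-id-placeholder",
    entityType="ASSET",
    forms=[{
        "formName": "RedshiftDataQuality",
        "typeIdentifier": "amazon.datazone.DataQualityResultFormType",
        "timestamp": datetime.now(timezone.utc),
        "content": json.dumps({
            "evaluations": [{
                "types": ["Completeness"],
                "description": "order_id is complete",
                "status": "PASS",
            }],
        }),
    }],
)
```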
Build a serverless data quality pipeline using Deequ on AWS Lambda
Poor data quality can lead to a variety of problems, including pipeline failures, incorrect reporting, and poor business decisions. For example, if data ingested from one of the systems contains a high number of duplicates, it can result in skewed data in the reporting system. To prevent such issues, data quality checks are integrated into […]
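To give a flavor of what such checks look like, here is a minimal PyDeequ sketch; in a Lambda-based setup, Spark would run inside the function's container image. Column names and paths are placeholders.

```python
import os
os.environ["SPARK_VERSION"] = "3.3"  # PyDeequ uses this to pick the Deequ jar

import pydeequ
from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# Hypothetical input: freshly ingested order records
df = spark.read.parquet("s3://my-bucket/ingest/orders/")

# Fail the pipeline if order_id has missing values or duplicates
check = (Check(spark, CheckLevel.Error, "orders checks")
         .isComplete("order_id")
         .isUnique("order_id"))

result = VerificationSuite(spark).onData(df).addCheck(check).run()
VerificationResult.checkResultsAsDataFrame(spark, result).show()
```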
Improve the resilience of HAQM Managed Service for Apache Flink applications with the system-rollback feature
This post explores how to use the system-rollback feature in HAQM Managed Service for Apache Flink. We discuss how this functionality improves your application's resilience by keeping your Flink application highly available. Through an example, you will also learn how to use the APIs to gain more visibility into the application's operations.
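As a brief, hedged illustration of that visibility, the sketch below lists recent operations on an application with the ListApplicationOperations API; the application name is a placeholder.

```python
import boto3

kda = boto3.client("kinesisanalyticsv2")

# List recent operations (updates, rollbacks, and so on) on the
# application; "my-flink-app" is a placeholder name
resp = kda.list_application_operations(ApplicationName="my-flink-app")

for op in resp["ApplicationOperationInfoList"]:
    print(op["Operation"], op["OperationStatus"], op["OperationId"])
```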