AWS Big Data Blog

Harness Zero Copy data sharing from Salesforce Data Cloud to HAQM Redshift for Unified Analytics – Part 1

In a previous post, we showed how Zero Copy data federation empowers businesses to access HAQM Redshift data within the Salesforce Data Cloud to enrich customer 360 data with operational data. This two-part series explores how analytics teams can access customer 360 data from Salesforce Data Cloud within HAQM Redshift to generate insights on unified data without the overhead of extract, transform, and load (ETL) pipelines. In this post, we cover data sharing between Salesforce Data Cloud and customers’ AWS accounts in the same AWS Region. Part 2 covers cross-Region data sharing between Salesforce Data Cloud and customers’ AWS accounts.

Attribute HAQM EMR on EC2 costs to your end-users

In this post, we share a chargeback model that you can use to track and allocate the costs of Spark workloads running on HAQM EMR on EC2 clusters. We describe an approach that assigns HAQM EMR costs to different jobs, teams, or lines of business. You can use this feature to distribute costs across various business units. This can assist you in monitoring the return on investment for your Spark-based workloads.

High-level architecture overview

Copy and mask PII between HAQM RDS databases using visual ETL jobs in AWS Glue Studio

In this post, I’ll walk you through how to copy data from one HAQM Relational Database Service (HAQM RDS) for PostgreSQL database to another, while scrubbing PII along the way using AWS Glue. You will learn how to prepare a multi-account environment to access the databases from AWS Glue, and how to model an ETL data flow that automatically masks PII as part of the transfer process, so that no sensitive information will be copied to the target database in its original form.

HAQM EMR 7.1 runtime for Apache Spark and Iceberg can run Spark workloads 2.7 times faster than Apache Spark 3.5.1 and Iceberg 1.5.2

In this post, we explore the performance benefits of using the HAQM EMR runtime for Apache Spark and Apache Iceberg compared to running the same workloads with open source Spark 3.5.1 on Iceberg tables. Iceberg is a popular open source high-performance format for large analytic tables. Our benchmarks demonstrate that HAQM EMR can run TPC-DS […]

How Kaplan, Inc. implemented modern data pipelines using HAQM MWAA and HAQM AppFlow with HAQM Redshift as a data warehouse

Kaplan, Inc. provides individuals, educational institutions, and businesses with a broad array of services, supporting our students and partners to meet their diverse and evolving needs throughout their educational and professional journeys. In this post, we discuss how the Kaplan data engineering team implemented data integration from the Salesforce application to HAQM Redshift. The solution uses HAQM Simple Storage Service as a data lake, HAQM Redshift as a data warehouse, HAQM Managed Workflows for Apache Airflow (HAQM MWAA) as an orchestrator, and Tableau as the presentation layer.

Optimize cost and performance for HAQM MWAA

HAQM Managed Workflows for Apache Airflow (HAQM MWAA) is a managed service for Apache Airflow that allows you to orchestrate data pipelines and workflows at scale. With HAQM MWAA, you can design Directed Acyclic Graphs (DAGs) that describe your workflows without managing the operational burden of scaling the infrastructure. In this post, we provide guidance […]

Optimize your workloads with HAQM Redshift Serverless AI-driven scaling and optimization

The current scaling approach of HAQM Redshift Serverless increases your compute capacity based on the query queue time and scales down when the queuing reduces on the data warehouse. However, you might need to automatically scale compute resources based on factors like query complexity and data volume to meet price-performance targets, irrespective of query queuing. […]

Reducing long-term logging expenses by 4,800% with HAQM OpenSearch Service

When you use HAQM OpenSearch Service for time-bound data like server logs, service logs, application logs, clickstreams, or event streams, storage cost is one of the primary drivers for the overall cost of your solution. Over the last year, OpenSearch Service has released features that have opened up new possibilities for storing your log data […]

BDB-4354-architecture

Unlock scalable analytics with a secure connectivity pattern in AWS Glue to read from or write to Snowflake

In today’s data-driven world, the ability to seamlessly integrate and utilize diverse data sources is critical for gaining actionable insights and driving innovation. As organizations increasingly rely on data stored across various platforms, such as Snowflake, HAQM Simple Storage Service (HAQM S3), and various software as a service (SaaS) applications, the challenge of bringing these […]

Embed HAQM OpenSearch Service dashboards in your application

Customers across diverse industries rely on HAQM OpenSearch Service for interactive log analytics, real-time application monitoring, website search, vector database, deriving meaningful insights from data, and visualizing these insights using OpenSearch Dashboards. Additionally, customers often seek out capabilities that enable effortless sharing of visual dashboards and seamless embedding of these dashboards within their applications, further […]