AWS Big Data Blog
Introducing generative AI upgrades for Apache Spark in AWS Glue (preview)
Today, we are excited to announce the preview of generative AI upgrades for Spark, a new capability that enables data practitioners to quickly upgrade and modernize their Spark applications running on AWS. Starting with Spark jobs in AWS Glue, this feature allows you to upgrade from an older AWS Glue version to AWS Glue version 4.0. This new capability reduces the time data engineers spend on modernizing their Spark applications, allowing them to focus on building new data pipelines and getting valuable analytics faster.
Accelerate your data workflows with HAQM Redshift Data API persistent sessions
In this post, we’ll walk through an example ETL process that uses session reuse to efficiently create, populate, and query temporary staging tables across the full data transformation workflow—all within the same persistent HAQM Redshift database session. You’ll learn best practices for optimizing ETL orchestration code, reducing job runtimes by eliminating connection overhead, and reducing pipeline complexity. A minimal sketch of the session-reuse pattern follows.
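The sketch below illustrates the idea under stated assumptions: it relies on the Data API’s session-reuse parameters (SessionKeepAliveSeconds and SessionId on ExecuteStatement), and the workgroup, table, and SQL names are hypothetical placeholders—verify the parameters against the current boto3 documentation.

```python
import boto3

# Sketch of Redshift Data API session reuse: a temp staging table created in the
# first call remains visible to later calls that pass the same SessionId.
client = boto3.client("redshift-data")

# First statement opens a session and keeps it alive for subsequent calls.
first = client.execute_statement(
    WorkgroupName="my-workgroup",          # hypothetical Redshift Serverless workgroup
    Database="dev",
    Sql="CREATE TEMP TABLE stage_orders (LIKE orders);",
    SessionKeepAliveSeconds=300,           # keep the session open for 5 minutes
)
session_id = first["SessionId"]

# Later statements reuse the same session, so no new connection is established
# and the temporary staging table is still available.
client.execute_statement(
    SessionId=session_id,
    Sql="INSERT INTO stage_orders SELECT * FROM landing_orders WHERE load_date = CURRENT_DATE;",
)
client.execute_statement(
    SessionId=session_id,
    Sql="INSERT INTO orders SELECT * FROM stage_orders;",
)
```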
Accelerate your migration to HAQM OpenSearch Service with Reindexing-from-Snapshot
In this post, we introduce a new mechanism called Reindexing-from-Snapshot (RFS), and explain how it can address your concerns and simplify migrating to OpenSearch.
From data lakes to insights: dbt adapter for HAQM Athena now supported in dbt Cloud
We are excited to announce that the dbt adapter for HAQM Athena is now officially supported in dbt Cloud. This integration enables data teams to efficiently transform and manage data using Athena with dbt Cloud’s robust features, enhancing the overall data workflow experience. In this post, we discuss the advantages of dbt Cloud over dbt Core, common use cases, and how to get started with HAQM Athena using the dbt adapter.
AWS Glue Data Catalog supports automatic optimization of Apache Iceberg tables through your HAQM VPC
The AWS Glue Data Catalog supports automatic table optimization of Apache Iceberg tables, including compaction, snapshot retention, and orphan data management. The data compaction optimizer continuously monitors table partitions and starts the compaction process when the number of files or their sizes exceed the configured thresholds. This post demonstrates how it works with step-by-step instructions, and a brief API sketch follows.
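As a rough illustration, the following sketch enables automatic compaction on an Iceberg table through the Glue CreateTableOptimizer API; the account ID, database, table, and IAM role are placeholders, and the exact shape of the configuration structure should be confirmed against the boto3 Glue documentation.

```python
import boto3

# Enable the automatic compaction optimizer for an Iceberg table in the
# Glue Data Catalog (sketch with placeholder identifiers).
glue = boto3.client("glue")

glue.create_table_optimizer(
    CatalogId="123456789012",                  # hypothetical account ID
    DatabaseName="iceberg_db",                 # hypothetical database
    TableName="events",                        # hypothetical Iceberg table
    Type="compaction",
    TableOptimizerConfiguration={
        "roleArn": "arn:aws:iam::123456789012:role/GlueOptimizerRole",  # hypothetical role
        "enabled": True,
    },
)
```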
Run high-availability long-running clusters with HAQM EMR instance fleets
In this post, we demonstrate how to launch a high availability instance fleet cluster using the newly redesigned HAQM EMR console as well as an AWS CloudFormation template. We also go over the basic concepts of Hadoop high availability and EMR instance fleets, the benefits and trade-offs of high availability, and best practices for running resilient EMR clusters. A short API sketch of the launch call appears below.
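For orientation, here is a minimal sketch of launching a high availability, long-running cluster with instance fleets via the EMR RunJobFlow API. The release label, subnet, roles, and instance types are assumptions for illustration; check the EMR documentation for the releases and capacity settings that support multiple primary nodes.

```python
import boto3

# Launch a long-running EMR cluster with three primary nodes (high availability)
# using instance fleets. Identifiers below are placeholders.
emr = boto3.client("emr")

emr.run_job_flow(
    Name="ha-instance-fleet-cluster",
    ReleaseLabel="emr-7.3.0",                           # assumed HA-capable release
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "Ec2SubnetIds": ["subnet-0123456789abcdef0"],   # hypothetical subnet
        "KeepJobFlowAliveWhenNoSteps": True,            # long-running cluster
        "InstanceFleets": [
            {
                "Name": "Primary",
                "InstanceFleetType": "MASTER",
                "TargetOnDemandCapacity": 3,            # three primary nodes for HA
                "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
            },
            {
                "Name": "Core",
                "InstanceFleetType": "CORE",
                "TargetOnDemandCapacity": 4,
                "InstanceTypeConfigs": [
                    {"InstanceType": "m5.xlarge"},
                    {"InstanceType": "m6g.xlarge"},     # fleets allow multiple instance types
                ],
            },
        ],
    },
)
```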
Enhance data governance with enforced metadata rules in HAQM DataZone
We’re excited to announce a new feature in HAQM DataZone that offers enhanced metadata governance for your subscription approval process. With this update, domain owners can define metadata requirements and enforce them on data consumers when they request subscriptions to data assets. By making it mandatory for data consumers to provide specific metadata, domain owners can achieve compliance, meet organizational standards, and support audit and reporting needs.
Introducing Point in Time queries and SQL/PPL support in HAQM OpenSearch Serverless
Today, we announced support for three new features for HAQM OpenSearch Serverless: Point in Time (PIT) search, which enables you to maintain stable sorting for deep pagination in the presence of updates, and Piped Processing Language (PPL) and SQL, which give you new ways to query your data. In this post, we discuss the benefits of these new features and how to get started; a brief PIT pagination sketch follows.
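The sketch below shows the PIT pagination pattern with opensearch-py against a Serverless collection. The collection endpoint, Region, index name, and sort field are placeholders, and the create_pit/delete_pit method names should be confirmed for your client version.

```python
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# Connect to an OpenSearch Serverless collection (placeholder endpoint/Region).
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, "us-east-1", "aoss")
client = OpenSearch(
    hosts=[{"host": "my-collection-id.us-east-1.aoss.amazonaws.com", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

# Create a PIT so results stay consistent while paginating through updates.
pit_id = client.create_pit(index="logs", keep_alive="10m")["pit_id"]

# First page, sorted against the stable PIT view of the data.
page = client.search(
    body={
        "size": 100,
        "query": {"match_all": {}},
        "pit": {"id": pit_id, "keep_alive": "10m"},
        "sort": [{"timestamp": "asc"}],       # hypothetical sort field
    }
)
last_sort = page["hits"]["hits"][-1]["sort"]

# Next page via search_after, still within the same PIT.
client.search(
    body={
        "size": 100,
        "query": {"match_all": {}},
        "pit": {"id": pit_id, "keep_alive": "10m"},
        "sort": [{"timestamp": "asc"}],
        "search_after": last_sort,
    }
)

# Clean up the PIT when pagination is done.
client.delete_pit(body={"pit_id": [pit_id]})
```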
Introducing HAQM MWAA micro environments for Apache Airflow
Today, we’re excited to announce mw1.micro, the latest addition to HAQM MWAA environment classes. This offering is designed to provide an even more cost-effective solution for running Airflow environments in the cloud. With mw1.micro, we’re bringing the power of HAQM MWAA to teams who require a lightweight environment without compromising on essential features. In this post, we’ll explore mw1.micro characteristics, key benefits, ideal use cases, and how you can set up an HAQM MWAA environment based on this new environment class.
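As a quick illustration of where the new class plugs in, here is a minimal sketch of creating an environment on mw1.micro with the MWAA CreateEnvironment API. The bucket, role, network identifiers, and Airflow version are assumptions; required parameters should be checked against the boto3 MWAA documentation.

```python
import boto3

# Create an HAQM MWAA environment on the mw1.micro environment class
# (placeholder identifiers throughout).
mwaa = boto3.client("mwaa")

mwaa.create_environment(
    Name="dev-airflow",
    EnvironmentClass="mw1.micro",                 # the new micro environment class
    AirflowVersion="2.10.1",                      # assumed supported version
    SourceBucketArn="arn:aws:s3:::my-airflow-bucket",                     # hypothetical bucket
    DagS3Path="dags",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/MwaaExecutionRole",  # hypothetical role
    NetworkConfiguration={
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "SubnetIds": ["subnet-0aaa1111bbbb22222", "subnet-0ccc3333dddd44444"],
    },
)
```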
Integrate custom applications with AWS Lake Formation – Part 1
In this two-part series, we show how to integrate custom applications or data processing engines with Lake Formation using the third-party services integration feature. In this post, we dive deep into the required Lake Formation and AWS Glue APIs. We walk through the steps to enforce Lake Formation policies within custom data applications. As an example, we present a sample Lake Formation integrated application implemented using AWS Lambda.
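To give a feel for the credential-vending flow such an integration uses, the sketch below fetches Lake Formation–filtered table metadata and then requests temporary, policy-scoped credentials. The account ID, database, table, and Region are placeholders, and the parameter names should be verified against the boto3 Glue and Lake Formation documentation.

```python
import boto3

# Sketch of the flow a custom application can use to enforce Lake Formation
# policies: filtered metadata first, then scoped temporary credentials.
glue = boto3.client("glue")
lakeformation = boto3.client("lakeformation")

# 1. Ask Glue for table metadata filtered by Lake Formation permissions.
metadata = glue.get_unfiltered_table_metadata(
    CatalogId="123456789012",                 # hypothetical account ID
    DatabaseName="sales_db",                  # hypothetical database
    Name="orders",                            # hypothetical table
    SupportedPermissionTypes=["COLUMN_PERMISSION"],
)
authorized_columns = metadata.get("AuthorizedColumns", [])
table_location = metadata["Table"]["StorageDescriptor"]["Location"]

# 2. Request temporary credentials limited to the data the caller may read.
creds = lakeformation.get_temporary_glue_table_credentials(
    TableArn="arn:aws:glue:us-east-1:123456789012:table/sales_db/orders",
    Permissions=["SELECT"],
    SupportedPermissionTypes=["COLUMN_PERMISSION"],
)

# 3. Read the table's HAQM S3 data with the vended, policy-scoped credentials.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(authorized_columns, table_location)
```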