Intermediate (200) | AWS Big Data Blog

Develop and test AWS Glue 5.0 jobs locally using a Docker container

In this post, we show how to develop and test AWS Glue 5.0 jobs locally using a Docker container. This post is an updated version of the post Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container, and uses AWS Glue 5.0.

Unlock the power of optimization in HAQM Redshift Serverless

In this post, we demonstrate how HAQM Redshift Serverless AI-driven scaling and optimization impacts performance and cost across different optimization profiles.

Automate topic provisioning and configuration using Terraform with HAQM MSK

In this post, we address common challenges associated with manual MSK topic configuration management and present a robust Terraform-based solution. This solution supports both provisioned and serverless MSK clusters.

HAQM EMR 7.5 runtime for Apache Spark and Iceberg can run Spark workloads 3.6 times faster than Spark 3.5.3 and Iceberg 1.6.1

The HAQM EMR runtime for Apache Spark offers a high-performance runtime environment while maintaining 100% API compatibility with open source Apache Spark and Apache Iceberg table format. In this post, we demonstrate the performance benefits of using the HAQM EMR 7.5 runtime for Spark and Iceberg compared to open source Spark 3.5.3 with Iceberg 1.6.1 tables on the TPC-DS 3TB benchmark v2.13.

Run Apache XTable in AWS Lambda for background conversion of open table formats

In this post, we explore how Apache XTable, combined with the AWS Glue Data Catalog, enables background conversions between open table formats residing on HAQM S3-based data lakes, with minimal to no changes to existing pipelines, in a scalable and cost-effective way.

Run high-availability long-running clusters with HAQM EMR instance fleets

In this post, we demonstrate how to launch a high availability instance fleet cluster using the newly redesigned HAQM EMR console, as well as using an AWS CloudFormation template. We also go over the basic concepts of Hadoop high availability, EMR instance fleets, the benefits and trade-offs of high availability, and best practices for running resilient EMR clusters.

Enrich your AWS Glue Data Catalog with generative AI metadata using HAQM Bedrock

By harnessing the capabilities of generative AI, you can automate the generation of comprehensive metadata descriptions for your data assets based on their documentation, enhancing discoverability, understanding, and the overall data governance within your AWS Cloud environment. This post shows you how to enrich your AWS Glue Data Catalog with dynamic metadata using foundation models (FMs) on HAQM Bedrock and your data documentation.

How Volkswagen Autoeuropa built a data solution with a robust governance framework, simplifying access to quality data using HAQM DataZone

This second post of a two-part series that details how Volkswagen Autoeuropa, a Volkswagen Group plant, together with AWS, built a data solution with a robust governance framework using HAQM DataZone to become a data-driven factory. Part 1 of this series focused on the customer challenges, overall solution architecture and solution features, and how they helped Volkswagen Autoeuropa overcome their challenges. This post dives into the technical details, highlighting the robust data governance framework that enables ease of access to quality data using HAQM DataZone.

Use HAQM Kinesis Data Streams to deliver real-time data to HAQM OpenSearch Service domains with HAQM OpenSearch Ingestion

In this post, we show how to use HAQM Kinesis Data Streams to buffer and aggregate real-time streaming data for delivery into HAQM OpenSearch Service domains and collections using HAQM OpenSearch Ingestion. You can use this approach for a variety of use cases, from real-time log analytics to integrating application messaging data for real-time search. In this post, we focus on the use case for centralizing log aggregation for an organization that has a compliance need to archive and retain its log data.

Achieve data resilience using HAQM OpenSearch Service disaster recovery with snapshot and restore

This post focuses on introducing an active-passive approach using a snapshot and restore strategy. The snapshot and restore strategy in OpenSearch Service involves creating point-in-time backups, known as snapshots, of your OpenSearch domain. These snapshots capture the entire state of the domain, including indexes, mappings, and settings. In the event of data loss or system failure, these snapshots will be used to restore the domain to a specific point in time. The post walks through the steps to set up this disaster recovery solution, including launching OpenSearch Service domains in primary and secondary regions, configuring snapshot repositories, restoring snapshots, and failing over/failing back between the regions.

Category: Intermediate (200)