AWS Big Data Blog

Category: Analytics

Amazon EMR 7.5 runtime for Apache Spark and Iceberg can run Spark workloads 3.6 times faster than Spark 3.5.3 and Iceberg 1.6.1

The Amazon EMR runtime for Apache Spark offers a high-performance runtime environment while maintaining 100% API compatibility with open source Apache Spark and the Apache Iceberg table format. In this post, we demonstrate the performance benefits of using the Amazon EMR 7.5 runtime for Spark and Iceberg compared to open source Spark 3.5.3 with Iceberg 1.6.1 tables on the TPC-DS 3TB benchmark v2.13.

Fitch Group achieves multi-Region resiliency for mission-critical Kafka infrastructure with Amazon MSK Replicator

In this post, we explore how Fitch Group, one of the top credit rating companies, used Amazon MSK and Amazon MSK Replicator to achieve multi-Region resiliency for their mission-critical Kafka infrastructure.

Amazon Q data integration adds DataFrame support and in-prompt context-aware job creation

Amazon Q data integration, introduced in January 2024, allows you to use natural language to author extract, transform, load (ETL) jobs and operations in DynamicFrame, the AWS Glue-specific data abstraction. This post introduces new capabilities for Amazon Q data integration that work together to make ETL development more efficient and intuitive. We’ve added support for DataFrame-based code generation that works across any Spark environment. We’ve also introduced in-prompt context-aware development that applies details from your conversations, working seamlessly with a new iterative development experience.

HEMA accelerates their data governance journey with Amazon DataZone

HEMA has been a household Dutch retail brand since 1926, providing daily convenience products with a unique design. This post describes how HEMA used Amazon DataZone to build their data mesh and enable streamlined data access across multiple business areas. It explains HEMA’s journey of deploying Amazon DataZone, the key challenges they overcame, and the benefits they have realized since deployment in May 2024. From establishing an enterprise-wide data inventory and improving data discoverability, to enabling decentralized data sharing and governance, Amazon DataZone has been a game changer for HEMA.

Accelerate queries on Apache Iceberg tables through AWS Glue auto compaction

In this post, we explore new features of the AWS Glue Data Catalog, which now supports improved automatic compaction of Iceberg tables for streaming data, making it straightforward for you to keep your transactional data lakes consistently performant. Enabling automatic compaction reduces metadata overhead on your Iceberg tables and improves query performance.

Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

In this post, we demonstrate how to implement a custom subscription workflow using Amazon DataZone, Amazon EventBridge, and AWS Lambda to automate the fulfillment process for unmanaged data assets, such as unstructured data stored in Amazon S3. This solution enhances governance and simplifies access to unstructured data assets across the organization.

Enhancing Search Relevancy with Cohere Rerank 3.5 and Amazon OpenSearch Service

In this blog post, we’ll dive into the various scenarios for how Cohere Rerank 3.5 improves search results for best matching 25 (BM25), a keyword-based algorithm that performs lexical search, in addition to semantic search. We will also cover how businesses can significantly improve user experience, increase engagement, and ultimately drive better search outcomes by implementing a reranking pipeline.

Introducing the new Amazon Kinesis source connector for Apache Flink

On November 11, 2024, the Apache Flink community released a new version of AWS services connectors, an AWS open source contribution. This new release, version 5.0.0, introduces a new source connector to read data from Amazon Kinesis Data Streams. In this post, we explain how the new features of this connector can improve performance and reliability of your Apache Flink application.

Recap of Amazon Redshift key product announcements in 2024

Amazon Redshift made significant strides in 2024 that enhanced price-performance, enabled data lakehouse architectures by blurring the boundaries between data lakes and data warehouses, simplified ingestion and accelerated near real-time analytics, and incorporated generative AI capabilities to build natural language-based applications and boost user productivity. This blog post provides a comprehensive overview of the major product innovations and enhancements made to Amazon Redshift in 2024.

How DeNA Co., Ltd. accelerated anonymized data quality tests up to 100 times faster using Amazon Redshift Serverless and dbt

DeNA Co., Ltd. (DeNA) engages in a variety of businesses, from games and live communities to sports & the community and healthcare & medical, under its mission to delight people beyond their wildest dreams. This post introduces a case study where DeNA combined Amazon Redshift Serverless and dbt (dbt Core) to accelerate data quality tests in their business.