AWS Big Data Blog

Category: Technical How-to

Simplify real-time analytics with zero-ETL from HAQM DynamoDB to HAQM SageMaker Lakehouse

At AWS re:Invent 2024, we introduced a no-code, zero-ETL integration between HAQM DynamoDB and HAQM SageMaker Lakehouse, simplifying how organizations handle data analytics and AI workflows. In this post, we share how to set up this zero-ETL integration from DynamoDB to your SageMaker Lakehouse environment.

Using AWS Glue Data Catalog views with Apache Spark in EMR Serverless and Glue 5.0

In this post, we guide you through creating a Data Catalog view using EMR Serverless, adding a SQL dialect to the view for Athena, sharing it with another account using LF-Tags, and then querying the view in the recipient account from a separate EMR Serverless workspace, an AWS Glue 5.0 Spark job, and Athena. This demonstration showcases the versatility of Data Catalog views and their cross-account accessibility through various AWS analytics services.

Architecture patterns to optimize HAQM Redshift performance at scale

In this post, we show you five architecture patterns you can use to optimize your HAQM Redshift data warehouse performance at scale, using features such as HAQM Redshift Serverless, HAQM Redshift data sharing, HAQM Redshift Spectrum, zero-ETL integrations, and HAQM Redshift streaming ingestion.
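As a minimal sketch of one of these patterns, data sharing, the producer side comes down to a few SQL statements that can be submitted through the Redshift Data API. The share, schema, workgroup, and database names below are illustrative placeholders, not taken from the post:

```python
# Hedged sketch: builds the producer-side SQL for an HAQM Redshift datashare.
# All object names here (sales_share, public, producer-wg, dev) are hypothetical.
def build_datashare_sql(share_name: str, schema: str) -> list:
    """Return the SQL statements a producer runs to share a schema's tables."""
    return [
        f"CREATE DATASHARE {share_name};",
        f"ALTER DATASHARE {share_name} ADD SCHEMA {schema};",
        f"ALTER DATASHARE {share_name} ADD ALL TABLES IN SCHEMA {schema};",
    ]

# Submitting the statements via the Redshift Data API (requires AWS credentials):
# import boto3
# client = boto3.client("redshift-data")
# for stmt in build_datashare_sql("sales_share", "public"):
#     client.execute_statement(WorkgroupName="producer-wg", Database="dev", Sql=stmt)
```

A consumer cluster or workgroup would then create a database from the share, which is how the data sharing pattern avoids copying data between warehouses.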

PackScan: Building real-time sort center analytics with AWS services

In this post, we explore how PackScan uses AWS services to drive real-time visibility, improve logistics efficiency, and support the seamless movement of packages across HAQM’s Middle Mile network.

Unlock self-serve streaming SQL with HAQM Managed Service for Apache Flink

In this post, we present Riskified’s journey toward enabling self-service streaming SQL pipelines. We walk through the motivations behind the shift from Confluent ksqlDB to Apache Flink, the architecture Riskified built using HAQM Managed Service for Apache Flink, the technical challenges they faced, and the solutions that helped them make streaming accessible, scalable, and production-ready.

Unify streaming and analytical data with HAQM Data Firehose and HAQM SageMaker Lakehouse

In this post, we show you how to create Iceberg tables in HAQM SageMaker Unified Studio and stream data to these tables using Firehose. With this integration, data engineers, analysts, and data scientists can seamlessly collaborate and build end-to-end analytics and ML workflows using SageMaker Unified Studio, removing traditional silos and accelerating the journey from data ingestion to production ML models.
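As a rough sketch of the ingestion side, Firehose accepts batched records and delivers each one as a row in the destination Iceberg table. The stream name and record shape below are illustrative assumptions, not details from the post:

```python
import json

# Hedged sketch: serialize events as newline-delimited JSON records, the shape
# Firehose's PutRecordBatch API expects in each record's Data field.
def to_firehose_records(events):
    """Wrap each event as a Firehose record payload (one JSON object per line)."""
    return [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]

# Sending to a Firehose stream configured with an Iceberg table destination
# (requires AWS credentials; "orders-to-iceberg" is a hypothetical stream name):
# import boto3
# firehose = boto3.client("firehose")
# firehose.put_record_batch(
#     DeliveryStreamName="orders-to-iceberg",
#     Records=to_firehose_records([{"order_id": 1, "amount": 42.5}]),
# )
```

Once delivered, the rows land in the Iceberg table created in SageMaker Unified Studio and are immediately queryable by the rest of the team.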

Access HAQM Redshift Managed Storage tables through Apache Spark on AWS Glue and HAQM EMR using HAQM SageMaker Lakehouse

With SageMaker Lakehouse, you can access tables stored in HAQM Redshift managed storage (RMS) through Iceberg APIs, using the Iceberg REST catalog backed by AWS Glue Data Catalog. This post describes how to access data in RMS tables through Apache Spark using SageMaker Unified Studio, HAQM EMR 7.5.0 and higher, and AWS Glue 5.0.
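In Spark terms, this boils down to registering an Iceberg catalog of type `rest` pointed at the Glue Iceberg REST endpoint with SigV4 signing enabled. A minimal configuration sketch follows; the catalog name `rms`, the region, and the `account-id:catalog-name` warehouse value are placeholders you would replace with your own:

```python
# Hedged sketch: Spark properties for an Iceberg REST catalog backed by the
# AWS Glue Data Catalog. Names and IDs are illustrative, not from the post.
def rms_catalog_conf(region: str, warehouse: str, name: str = "rms") -> dict:
    """Build the spark.sql.catalog.* properties for the Glue Iceberg REST catalog."""
    p = f"spark.sql.catalog.{name}"
    return {
        p: "org.apache.iceberg.spark.SparkCatalog",
        f"{p}.type": "rest",
        f"{p}.uri": f"https://glue.{region}.amazonaws.com/iceberg",
        f"{p}.warehouse": warehouse,        # e.g. "<account-id>:<catalog-name>"
        f"{p}.rest.sigv4-enabled": "true",  # sign REST calls with SigV4
        f"{p}.rest.signing-name": "glue",
    }

# Applying the properties to a SparkSession builder (requires pyspark + Iceberg jars):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder
# for k, v in rms_catalog_conf("us-east-1", "123456789012:rmscatalog").items():
#     builder = builder.config(k, v)
# spark = builder.getOrCreate()
```

With the catalog registered, RMS-backed tables can be read with ordinary Spark SQL, such as `spark.sql("SELECT * FROM rms.sales.orders")` against your own namespace and table names.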

Petabyte-scale data migration made simple: AppsFlyer’s best practice journey with HAQM EMR Serverless

In this post, we share how AppsFlyer successfully migrated their massive data infrastructure from self-managed Hadoop clusters to HAQM EMR Serverless, detailing their best practices, the challenges they overcame, and the lessons learned that can help guide other organizations through similar transformations.