AWS Big Data Blog
Category: Intermediate (200)
Read and write Apache Iceberg tables using AWS Lake Formation hybrid access mode
In this post, we demonstrate how to use Lake Formation for read access while continuing to use AWS Identity and Access Management (IAM) policy-based permissions for write workloads that update the schema and upsert (insert and update combined) data records into the Iceberg tables.
Integrate ThoughtSpot with Amazon Redshift using AWS IAM Identity Center
In this post, we walk you through the process of setting up ThoughtSpot integration with Amazon Redshift using IAM Identity Center authentication. The solution provides a secure, streamlined analytics environment that empowers your team to focus on what matters most: discovering and sharing valuable business insights.
Correlate telemetry data with Amazon OpenSearch Service and Amazon Managed Grafana
In this post, we show you how to use Amazon OpenSearch Service and Amazon Managed Grafana to correlate observability signals, which improves root cause analysis and reduces Mean Time to Resolution (MTTR). We also provide a reference solution that can be used at scale for proactive monitoring of enterprise applications, so that problems can be addressed before they occur.
Develop and test AWS Glue 5.0 jobs locally using a Docker container
In this post, we show how to develop and test AWS Glue 5.0 jobs locally using a Docker container. This post updates the earlier post "Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container" for AWS Glue 5.0.
Unlock the power of optimization in Amazon Redshift Serverless
In this post, we demonstrate how Amazon Redshift Serverless AI-driven scaling and optimization impacts performance and cost across different optimization profiles.
Automate topic provisioning and configuration using Terraform with Amazon MSK
In this post, we address common challenges associated with manual MSK topic configuration management and present a robust Terraform-based solution. This solution supports both provisioned and serverless MSK clusters.
Amazon EMR 7.5 runtime for Apache Spark and Iceberg can run Spark workloads 3.6 times faster than Spark 3.5.3 and Iceberg 1.6.1
The Amazon EMR runtime for Apache Spark offers a high-performance runtime environment while maintaining 100% API compatibility with open source Apache Spark and Apache Iceberg table format. In this post, we demonstrate the performance benefits of using the Amazon EMR 7.5 runtime for Spark and Iceberg compared to open source Spark 3.5.3 with Iceberg 1.6.1 tables on the TPC-DS 3TB benchmark v2.13.
Run Apache XTable in AWS Lambda for background conversion of open table formats
In this post, we explore how Apache XTable, combined with the AWS Glue Data Catalog, enables background conversions between open table formats residing on Amazon S3-based data lakes, with minimal to no changes to existing pipelines, in a scalable and cost-effective way.
Run high-availability long-running clusters with Amazon EMR instance fleets
In this post, we demonstrate how to launch a high availability instance fleet cluster using the newly redesigned Amazon EMR console, as well as using an AWS CloudFormation template. We also go over the basic concepts of Hadoop high availability, EMR instance fleets, the benefits and trade-offs of high availability, and best practices for running resilient EMR clusters.
Enrich your AWS Glue Data Catalog with generative AI metadata using Amazon Bedrock
By harnessing the capabilities of generative AI, you can automate the generation of comprehensive metadata descriptions for your data assets based on their documentation, enhancing discoverability, understanding, and the overall data governance within your AWS Cloud environment. This post shows you how to enrich your AWS Glue Data Catalog with dynamic metadata using foundation models (FMs) on Amazon Bedrock and your data documentation.