AWS Big Data Blog

Category: How-To

How BMW streamlined data access using AWS Lake Formation fine-grained access control

This post explores how BMW implemented AWS Lake Formation’s fine-grained access control (FGAC) in the Cloud Data Hub and how this saves them up to 25% on compute and storage costs. With FGAC, BMW has transparently implemented finer-grained data access management within the Cloud Data Hub. The Lake Formation integration lets data stewards scope and grant granular access to specific subsets of data, reducing costly data duplication.

Amazon MWAA best practices for managing Python dependencies

Customers with data engineers and data scientists are using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) as a central orchestration platform for running data pipelines and machine learning (ML) workloads. To support these pipelines, they often require additional Python packages, such as Apache Airflow Providers. For example, a pipeline may require the Snowflake provider […]

Run interactive workloads on Amazon EMR Serverless from Amazon EMR Studio

Starting from release 6.14, Amazon EMR Studio supports interactive analytics on Amazon EMR Serverless. You can now use EMR Serverless applications as the compute, in addition to Amazon EMR on EC2 clusters and Amazon EMR on EKS virtual clusters, to run JupyterLab notebooks from EMR Studio Workspaces. EMR Studio is an integrated development environment (IDE) […]

Simplify data streaming ingestion for analytics using Amazon MSK and Amazon Redshift

Towards the end of 2022, AWS announced the general availability of real-time streaming ingestion to Amazon Redshift for Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK), eliminating the need to stage streaming data in Amazon Simple Storage Service (Amazon S3) before ingesting it into Amazon Redshift. Streaming ingestion from Amazon […]

Automate secure access to Amazon MWAA environments using existing OpenID Connect single-sign-on authentication and authorization

Customers use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to run Apache Airflow at scale in the cloud. They want to use their existing login solutions developed using OpenID Connect (OIDC) providers with Amazon MWAA; this allows them to provide a uniform authentication and single sign-on (SSO) experience using their adopted identity providers (IdP) […]

Use MSK Connect for managed MirrorMaker 2 deployment with IAM authentication

March 2025: This post was reviewed and updated for accuracy. MSK Replicator now makes it easier to set up cross-Region and same-Region replication without running MirrorMaker 2. Read the AWS News Blog to learn more. In this post, we show how to use MSK Connect for MirrorMaker 2 deployment with AWS Identity and Access Management (IAM) authentication. We create […]

Copy large datasets from Google Cloud Storage to Amazon S3 using Amazon EMR

You can migrate data between Google Cloud Storage (GCS) and Amazon S3 by using Hadoop’s native support for S3 object storage together with a Google-provided Hadoop connector for GCS. This post demonstrates how to configure an EMR cluster for DistCp and S3DistCp, goes over the settings and parameters for both tools, copies a 9.4 TB test dataset, and compares the copy performance.