AWS Big Data Blog
Tag: HAQM S3
How to export an HAQM DynamoDB table to HAQM S3 using AWS Step Functions and AWS Glue
In this post, I show you how to use AWS Glue’s DynamoDB integration and AWS Step Functions to create a workflow that exports your DynamoDB tables to S3 in Parquet format. I also show how to create an Athena view for each table’s latest snapshot, giving you a consistent view of your DynamoDB table exports.
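As a rough illustration of the kind of export step the post describes, the following AWS Glue (PySpark) sketch reads a DynamoDB table through Glue’s DynamoDB connector and writes it to S3 as Parquet. The table name, S3 path, and read-throughput percentage are placeholders, not values from the post.

```python
# Minimal Glue job sketch: snapshot a DynamoDB table to S3 as Parquet.
# Table name, bucket path, and throughput setting are illustrative placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table through Glue's DynamoDB connector, capping read-capacity usage.
table_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "my-dynamodb-table",   # placeholder
        "dynamodb.throughput.read.percent": "0.5",
    },
)

# Write the snapshot to S3 in Parquet format.
glue_context.write_dynamic_frame.from_options(
    frame=table_dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-export-bucket/snapshots/my-dynamodb-table/"},  # placeholder
    format="parquet",
)

job.commit()
```

In the workflow the post builds, a Step Functions state machine would invoke a job like this on a schedule, producing one dated snapshot per run.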
Trigger cross-region replication of pre-existing objects using HAQM S3 inventory, HAQM EMR, and HAQM Athena
In HAQM Simple Storage Service (HAQM S3), you can use cross-region replication (CRR) to copy objects automatically and asynchronously across buckets in different AWS Regions. CRR is a bucket-level configuration, and it can help you meet compliance requirements and minimize latency by keeping copies of your data in different Regions. CRR replicates all objects in […]
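For context on the bucket-level configuration the excerpt mentions, here is a hedged boto3 sketch that applies a CRR rule to a source bucket. Bucket names and the role ARN are placeholders, versioning must already be enabled on both buckets, and note that this only covers new objects; replicating pre-existing objects is the problem the post itself solves with S3 inventory, EMR, and Athena.

```python
# Rough sketch: apply a cross-region replication rule to a source bucket.
# Bucket names and role ARN are placeholders; versioning must be enabled on both buckets.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="source-bucket",  # placeholder
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/crr-replication-role",  # placeholder
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Prefix": "",  # empty prefix = replicate all new objects
                "Destination": {"Bucket": "arn:aws:s3:::destination-bucket"},  # placeholder
            }
        ],
    },
)
```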
Improve Apache Spark write performance on Apache Parquet formats with the EMRFS S3-optimized committer
November 2024: This post was reviewed and updated for accuracy. The EMRFS S3-optimized committer is a new output committer available for use with Apache Spark jobs as of HAQM EMR 5.19.0. This committer improves performance when writing Apache Parquet files to HAQM S3 using the EMR File System (EMRFS). In this post, we run a performance […]
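As a quick illustration of using the committer, the PySpark snippet below explicitly enables the EMRFS S3-optimized committer and writes a Parquet dataset to S3. The property name follows the EMR documentation as I understand it (the committer is enabled by default on later EMR releases), and the S3 path is a placeholder; verify the setting against your EMR release.

```python
# Illustrative PySpark snippet for an EMR cluster (5.19.0+): enable the EMRFS
# S3-optimized committer and write Parquet to S3. Path is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("emrfs-committer-demo")
    .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
    .getOrCreate()
)

df = spark.range(1_000_000).withColumnRenamed("id", "event_id")

# Parquet writes through EMRFS benefit from the optimized committer,
# which avoids slow S3 rename operations during job commit.
df.write.mode("overwrite").parquet("s3://my-output-bucket/events/")  # placeholder path
```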
Our data lake story: How Woot.com built a serverless data lake on AWS
February 9, 2024: HAQM Kinesis Data Firehose has been renamed to HAQM Data Firehose. Read the AWS What’s New post to learn more. In this post, we talk about designing a cloud-native data warehouse as a replacement for our legacy data warehouse built on a relational database. At the beginning of the design process, the […]
How to migrate a Hue database from an existing HAQM EMR cluster
This post describes the step-by-step process for migrating the Hue database from an existing EMR cluster.
Best Practices for Running Apache Kafka on AWS
The best practices described in this post are based on our experience running and operating large-scale Kafka clusters on AWS for more than two years. This post is intended to help AWS customers who currently run Kafka on AWS, as well as customers who are considering migrating on-premises Kafka deployments to AWS.
Best Practices for Running Apache Cassandra on HAQM EC2
In this post, we outline three Cassandra deployment options and provide guidance on determining the best practices for your use case.
Build a Multi-Tenant HAQM EMR Cluster with Kerberos, Microsoft Active Directory Integration and IAM Roles for EMRFS
In this post, we discuss what EMRFS authorization is (HAQM S3 storage-level access control) and show how to configure the role mappings with detailed examples.
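The sketch below gives a rough idea of what creating such a role mapping can look like with boto3: an EMR security configuration that maps an IAM role to EMRFS requests from a specific user. The JSON structure reflects my reading of the EMR security-configuration schema, and all names, ARNs, and identifiers are placeholders; the post’s detailed examples are authoritative.

```python
# Hedged sketch: create an EMR security configuration with an EMRFS role mapping.
# Structure follows my reading of the EMR security-configuration schema;
# names, ARNs, and identifiers are placeholders.
import json

import boto3

emr = boto3.client("emr")

security_configuration = {
    "AuthorizationConfiguration": {
        "EmrFsConfiguration": {
            "RoleMappings": [
                {
                    "Role": "arn:aws:iam::123456789012:role/analyst-emrfs-role",  # placeholder
                    "IdentifierType": "User",
                    "Identifiers": ["analyst1"],  # cluster user name (placeholder)
                }
            ]
        }
    }
}

emr.create_security_configuration(
    Name="emrfs-role-mapping-demo",
    SecurityConfiguration=json.dumps(security_configuration),
)
```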
Dynamically Create Friendly URLs for Your HAQM EMR Web Interfaces
This solution provides a serverless approach to automatically assigning a friendly URL to your EMR cluster, giving you easy access to popular notebooks and other web interfaces.
Build a Data Lake Foundation with AWS Glue and HAQM S3
A data lake is an increasingly popular way to store and analyze data that addresses the challenges of dealing with massive volumes of heterogeneous data. A data lake allows organizations to store all their data—structured and unstructured—in one centralized repository. Because data can be stored as-is, there is no need to convert it to a predefined schema. This post walks you through the process of using AWS Glue to crawl your data on HAQM S3 and build a metadata store that can be used with other AWS offerings.
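To illustrate the crawling step the post walks through, here is a minimal boto3 sketch that defines and starts an AWS Glue crawler over an S3 prefix so the resulting tables land in the Data Catalog. The crawler name, IAM role, database, and S3 path are placeholders, not values from the post.

```python
# Minimal sketch: create and start a Glue crawler over an S3 prefix.
# Crawler name, role, database, and path are illustrative placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="s3-data-lake-crawler",                               # placeholder
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",   # placeholder
    DatabaseName="data_lake_raw",                              # catalog database (placeholder)
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/raw/"}]},  # placeholder
)

glue.start_crawler(Name="s3-data-lake-crawler")
```

Once the crawler finishes, the tables it creates in the Data Catalog can be queried directly from services such as HAQM Athena.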