AWS Big Data Blog
Category: Amazon Simple Storage Service (S3)
Migrate to Apache HBase on Amazon S3 on Amazon EMR: Guidelines and Best Practices
This whitepaper walks you through the stages of a migration. It also helps you determine when to choose Apache HBase on Amazon S3 on Amazon EMR, plan for platform security, tune Apache HBase and EMRFS to support your application SLA, identify options to migrate and restore your data, and manage your cluster in production.
Connect to Amazon Athena with federated identities using temporary credentials
This post walks through three scenarios that enable trusted users to access Athena using temporary security credentials. First, we use SAML federation, where user credentials are stored in Active Directory. Second, we use a custom credentials provider library to enable cross-account access. Third, we use an EC2 instance profile role to provide temporary credentials for users in our organization to access Athena.
How to build a front-line concussion monitoring system using AWS IoT and serverless data lakes – Part 2
August 2024: This post was reviewed and updated for accuracy. In part 1 of this series, we demonstrated how to build a data pipeline in support of a data lake. We used key AWS services such as Amazon Kinesis Data Streams, Kinesis Data Analytics, Kinesis Data Firehose, and AWS Lambda. In part 2, we discuss […]
How to build a front-line concussion monitoring system using AWS IoT and serverless data lakes – Part 1
In this two-part series, we show you how to build a data pipeline in support of a data lake. We use key AWS services such as Amazon Kinesis Data Streams, Kinesis Data Analytics, Kinesis Data Firehose, and AWS Lambda. In part 2, we focus on generating simple inferences from that data that can support RTP parameters.
Build a Concurrent Data Orchestration Pipeline Using Amazon EMR and Apache Livy
In this post, we explore orchestrating a Spark data pipeline on Amazon EMR using Apache Livy and Apache Airflow. We create a simple Airflow DAG to demonstrate how to run Spark jobs concurrently, and we see how Livy hides the complexity of submitting Spark jobs via REST while making optimal use of EMR resources.
How Goodreads offloads Amazon DynamoDB tables to Amazon S3 and queries them using Amazon Athena
In this post, we show you how to export data from a DynamoDB table, convert it into a more efficient format with AWS Glue, and query the data with Athena. This approach gives you a way to pull insights from your data stored in DynamoDB.
Analyze Apache Parquet optimized data using Amazon Kinesis Data Firehose, Amazon Athena, and Amazon Redshift
Kinesis Data Firehose can now save data to Amazon S3 in Apache Parquet or Apache ORC format. These are optimized columnar formats that are highly recommended for best performance and cost savings when querying data in S3. This feature directly benefits you if you use Amazon Athena, Amazon Redshift, AWS Glue, Amazon EMR, or any other big data tools that are available from the AWS Partner Network and through the open-source community.
Analyze data in Amazon DynamoDB using Amazon SageMaker for real-time prediction
I’ll describe how to read the DynamoDB backup file format in Data Pipeline, how to convert the objects in S3 to a CSV format that Amazon ML can read, and I’ll show you how to schedule regular exports and transformations using Data Pipeline.
How to migrate a Hue database from an existing Amazon EMR cluster
This post describes the step-by-step process for migrating the Hue database from an existing EMR cluster.
Power from wind: Open data on AWS
Data that describe processes in a spatial context are everywhere in our day-to-day lives and they dominate big data problems. Map data, for instance, whether describing networks of roads or remote sensing data from satellites, get us where we need to go. Atmospheric data from simulations and sensors underlie our weather forecasts and climate models. […]