AWS Big Data Blog
Tag: Amazon EMR
Implementing Authorization and Auditing using Apache Ranger on Amazon EMR
Updated 3/30/2022: Amazon EMR has announced official support for Apache Ranger. Open-source plugin support will not be maintained moving forward, and compatibility with the latest versions will not be tested. We recommend that customers move to the Amazon EMR support for Apache Ranger. Ranger Presto plugin support on EMR has been deprecated. Updated 12/03/2020: Support for […]
Low-Latency Access on Trillions of Records: FINRA’s Architecture Using Apache HBase on Amazon EMR with Amazon S3
John Hitchingham is Director of Performance Engineering at FINRA. The Financial Industry Regulatory Authority (FINRA) is a private sector regulator responsible for analyzing 99% of the equities and 65% of the option activity in the US. In order to look for fraud, market manipulation, insider trading, and abuse, FINRA’s technology group has developed a robust […]
Dynamically Scale Applications on Amazon EMR with Auto Scaling
Jonathan Fritz is a Senior Product Manager for Amazon EMR. Customers running Apache Spark, Presto, and the Apache Hadoop ecosystem take advantage of Amazon EMR’s elasticity to save costs by terminating clusters after workflows are complete and resizing clusters with low-cost Amazon EC2 Spot Instances. For instance, customers can create clusters for daily ETL or machine learning […]
Use Apache Flink on Amazon EMR
Today we are making it even easier to run Flink on AWS as it is now natively supported in Amazon EMR 5.1.0. EMR supports running Flink-on-YARN so you can create either a long-running cluster that accepts multiple jobs or a short-running Flink session in a transient cluster that helps reduce your costs by only charging you for the time that you use.
Running sparklyr – RStudio’s R Interface to Spark on Amazon EMR
This post was last updated July 7th, 2021 (original version by Tom Zeng). The Sparklyr package by RStudio has made processing big data in R a lot easier. Sparklyr is an R interface to Spark; it allows using Spark as the backend for dplyr – one of the most popular data manipulation packages. Sparklyr also […]
How Eliza Corporation Moved Healthcare Data to the Cloud
In this post, I discuss some of the practical challenges faced during the implementation of the data lake for Eliza and the corresponding details of the ways we solved these issues with AWS. The challenges we faced involved the variety of data and a need for a common view of the data.
Building Event-Driven Batch Analytics on AWS
In this post, I walk you through an architectural approach as well as a sample implementation on how to collect, process, and analyze data for event-driven applications in AWS.
Real-time Stream Processing Using Apache Spark Streaming and Apache Kafka on AWS
This post demonstrates how to set up Apache Kafka on EC2, use Spark Streaming on EMR to process data coming in to Apache Kafka topics, and query streaming data using Spark SQL on EMR.
Amazon EMR-DynamoDB Connector Repository on AWSLabs GitHub
Amazon Web Services is excited to announce that the Amazon EMR-DynamoDB Connector is now open-source. The code you see in the GitHub repository is exactly what is available on your EMR cluster, making it easier to build applications with this component.
Encrypt Data At-Rest and In-Flight on Amazon EMR with Security Configurations
Customers running analytics, stream processing, machine learning, and ETL workloads on personally identifiable information, health information, and financial data have strict requirements for encryption of data at-rest and in-transit. The Apache Spark and Hadoop ecosystems lend themselves to these big data use cases, and customers have asked us to provide a quick and easy way to encrypt data at-rest and data in-transit between nodes in each execution framework.