AWS Big Data Blog

Category: Amazon EMR

How NortonLifelock built a serverless architecture for real-time analysis of their VPN usage metrics

August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. This post presents a reference architecture and optimization strategies for building serverless data analytics solutions on AWS using Amazon Kinesis Data Analytics. In addition, this post shows […]

Configure Amazon EMR Studio and Amazon EKS to run notebooks with Amazon EMR on EKS

Amazon EMR on Amazon EKS provides a deployment option for Amazon EMR that allows you to run analytics workloads on Amazon Elastic Kubernetes Service (Amazon EKS). This is an attractive option because it allows you to run applications on a common pool of resources without having to provision infrastructure. In addition, you can use Amazon […]

Reduce costs and increase resource utilization of Apache Spark jobs on Kubernetes with Amazon EMR on Amazon EKS

Amazon EMR on Amazon EKS is a deployment option for Amazon EMR that allows you to run Apache Spark on Amazon Elastic Kubernetes Service (Amazon EKS). If you run open-source Apache Spark on Amazon EKS, you can now use Amazon EMR to automate provisioning and management, and run Apache Spark up to three times faster. […]

Run and debug Apache Spark applications on AWS with Amazon EMR on Amazon EKS

Customers today want to focus more on their core business model and less on the underlying infrastructure and operational burden. As customers migrate to the AWS Cloud, they’re realizing the benefits of being able to innovate faster on their own applications by relying on AWS to handle big data platforms, operations, and automation. Many of […]

Run a Spark SQL-based ETL pipeline with Amazon EMR on Amazon EKS

Increasingly, a business’s success depends on its agility in transforming data into actionable insights, which requires efficient and automated data processes. In the previous post, Build a SQL-based ETL pipeline with Apache Spark on Amazon EKS, we described a common productivity issue in a modern data architecture. To address the challenge, we demonstrated how a declarative approach improves efficiency and so shortens time to value for businesses. Managing applications declaratively is a widely adopted best practice in Kubernetes. You can use the same approach to build and deploy Spark applications with open-source or in-house frameworks to achieve the same productivity goal.

Visualize data using Apache Spark running on Amazon EMR with Amazon QuickSight

Organizations often need to process large volumes of data before serving it to business stakeholders. In this blog, we will learn how to use Amazon EMR to process data with Apache Spark, the go-to platform for in-memory analytics of large data volumes, and connect the business intelligence (BI) tool Amazon QuickSight to serve data to end users. QuickSight […]

Improve query performance using AWS Glue partition indexes

When you create a data lake in the cloud, the data catalog is crucial for centralizing metadata and making the data visible, searchable, and queryable for users. With the recent exponential growth of data volume, it becomes much more important to optimize data layout and maintain the metadata on cloud storage to keep the value of data […]

Manage and process your big data workflows with Amazon MWAA and Amazon EMR on Amazon EKS

Many customers are gathering large amounts of data generated from different sources, such as IoT devices, clickstream events from websites, and more. To efficiently extract insights from the data, you have to perform various transformations and apply different business logic to your data. These processes require complex workflow management to schedule jobs and manage dependencies […]

The following graph shows performance improvements measured as total runtime for TPC-DS queries. Amazon EMR 5.31 with EMR runtime has the better (lower) runtime.

Amazon EMR introduces EMR runtime for Presto, providing a 2.6 times speedup

Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics, and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook. Running Presto […]

Amazon EMR announces general availability of EMR Studio

At AWS re:Invent 2020, we announced the preview of Amazon EMR Studio, an integrated development environment (IDE) that makes it easy for data scientists and data engineers to develop, visualize, and debug applications written in R, Python, Scala, and PySpark. Today, we’re excited to announce the general availability of EMR Studio and new features we’ve […]