AWS Big Data Blog
Category: Amazon EMR on EKS
Amazon EMR on EKS widens the performance gap: Run Apache Spark workloads 5.37 times faster and at 4.3 times lower cost
Amazon EMR on EKS provides a deployment option for Amazon EMR that allows organizations to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, Spark applications run on the Amazon EMR runtime for Apache Spark. This performance-optimized runtime offered by Amazon EMR makes your Spark jobs run fast […]
Build event-driven data pipelines using AWS Controllers for Kubernetes and Amazon EMR on EKS
An event-driven architecture is a software design pattern in which decoupled applications can asynchronously publish and subscribe to events via an event broker. By promoting loose coupling between components of a system, an event-driven architecture leads to greater agility and can enable components in the system to scale independently and fail without impacting other services. […]
How SafeGraph built a reliable, efficient, and user-friendly Apache Spark platform with Amazon EMR on Amazon EKS
This is a guest post by Nan Zhu, Tech Lead Manager, SafeGraph, and Dave Thibault, Sr. Solutions Architect – AWS. SafeGraph is a geospatial data company that curates over 41 million global points of interest (POIs) with detailed attributes, such as brand affiliation, advanced category tagging, and open hours, as well as how people interact […]
Accelerate your data exploration and experimentation with the AWS Analytics Reference Architecture library
Organizations use their data to solve complex problems by starting small, running iterative experiments, and refining the solution. Although the power of experiments can’t be ignored, organizations have to be cautious about the cost-effectiveness of such experiments. If time is spent creating the underlying infrastructure for enabling experiments, it further adds to the cost. Developers […]
Run fault tolerant and cost-optimized Spark clusters using Amazon EMR on EKS and Amazon EC2 Spot Instances
Amazon EMR on EKS is a deployment option in Amazon EMR that allows you to run Spark jobs on Amazon Elastic Kubernetes Service (Amazon EKS). Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances save you up to 90% over On-Demand Instances, and are a great way to cost-optimize the Spark workloads running on Amazon […]
Introducing ACK controller for Amazon EMR on EKS
AWS Controllers for Kubernetes (ACK) was announced in August 2020, and now supports 14 AWS service controllers as generally available with an additional 12 in preview. The vision behind this initiative was simple: allow Kubernetes users to use the Kubernetes API to manage the lifecycle of AWS resources such as Amazon Simple Storage Service (Amazon […]
Use Karpenter to speed up Amazon EMR on EKS autoscaling
Amazon EMR on Amazon EKS is a deployment option for Amazon EMR that allows organizations to run Apache Spark on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, the Spark jobs run on the Amazon EMR runtime for Apache Spark. This increases the performance of your Spark jobs so that they run faster […]
Get a quick start with Apache Hudi, Apache Iceberg, and Delta Lake with Amazon EMR on EKS
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can keep your data as is in your object store or file-based storage without having to first structure the data. Additionally, you can run different types of analytics against your loosely formatted data […]
Design considerations for Amazon EMR on EKS in a multi-tenant Amazon EKS environment
Many AWS customers use Amazon Elastic Kubernetes Service (Amazon EKS) in order to take advantage of Kubernetes without the burden of managing the Kubernetes control plane. With Kubernetes, you can centrally manage your workloads and offer administrators a multi-tenant environment where they can create, update, scale, and secure workloads using a single API. Kubernetes also […]
Run Apache Spark with Amazon EMR on EKS backed by Amazon FSx for Lustre storage
September 2023: This post was reviewed and updated for accuracy to reflect recent improvements and changes. Traditionally, Spark workloads have been run on a dedicated setup like a Hadoop stack with YARN or Mesos as a resource manager. Starting from Apache Spark 2.3, Spark added support for Kubernetes as a resource manager. The new Kubernetes […]