AWS Big Data Blog

Category: Amazon EMR

Introducing Amazon EMR on EKS job submission with Spark Operator and spark-submit

Amazon EMR on EKS provides a deployment option for Amazon EMR that allows organizations to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, Spark applications run on the Amazon EMR runtime for Apache Spark. This performance-optimized runtime offered by Amazon EMR makes your Spark jobs run fast […]

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

Apache Iceberg is an open table format for large datasets in Amazon Simple Storage Service (Amazon S3) and provides fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution. When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational […]

How Zoom implemented streaming log ingestion and efficient GDPR deletes using Apache Hudi on Amazon EMR

In today’s digital age, logging is a critical aspect of application development and management, but efficiently managing logs while complying with data protection regulations can be a significant challenge. Zoom, in collaboration with the AWS Data Lab team, developed an innovative architecture to overcome these challenges and streamline their logging and record deletion processes. In […]

Improve reliability and reduce costs of your Apache Spark workloads with vertical autoscaling on Amazon EMR on EKS

Amazon EMR on Amazon EKS is a deployment option offered by Amazon EMR that enables you to run Apache Spark applications on Amazon Elastic Kubernetes Service (Amazon EKS) in a cost-effective manner. It uses the EMR runtime for Apache Spark to increase performance so that your jobs run faster and cost less. Apache Spark allows […]

Build, deploy, and run Spark jobs on Amazon EMR with the open-source EMR CLI tool

Today, we’re pleased to introduce the Amazon EMR CLI, a new command line tool to package and deploy PySpark projects across different Amazon EMR environments. With the introduction of the EMR CLI, you now have a simple way to not only deploy a wide range of PySpark projects to remote EMR environments, but also integrate […]

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

Many customers run big data workloads such as extract, transform, and load (ETL) on Apache Hive to create a data warehouse on Hadoop. Apache Hive has performed pretty well for a long time. But with advancements in infrastructure such as cloud computing and multicore machines with large RAM, Apache Spark started to gain visibility by […]

How CyberSolutions built a scalable data pipeline using Amazon EMR Serverless and the AWS Data Lab

This post is co-written by Constantin Scoarță and Horațiu Măiereanu from CyberSolutions Tech. CyberSolutions is one of the leading ecommerce enablers in Germany. We design, implement, maintain, and optimize award-winning ecommerce platforms end to end. Our solutions are based on best-in-class software like SAP Hybris and Adobe Experience Manager, and complemented by unique services that […]

Amazon EMR on EKS widens the performance gap: Run Apache Spark workloads 5.37 times faster and at 4.3 times lower cost

Amazon EMR on EKS provides a deployment option for Amazon EMR that allows organizations to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, Spark applications run on the Amazon EMR runtime for Apache Spark. This performance-optimized runtime offered by Amazon EMR makes your Spark jobs run fast […]

Push Amazon EMR step logs from Amazon EC2 instances to Amazon CloudWatch logs

Amazon EMR is a big data service offered by AWS to run Apache Spark and other open-source applications on AWS to build scalable data pipelines in a cost-effective manner. Monitoring the logs generated from the jobs deployed on EMR clusters is essential to help detect critical issues in real time and identify root causes quickly. […]

Build event-driven data pipelines using AWS Controllers for Kubernetes and Amazon EMR on EKS

An event-driven architecture is a software design pattern in which decoupled applications can asynchronously publish and subscribe to events via an event broker. By promoting loose coupling between components of a system, an event-driven architecture leads to greater agility and can enable components in the system to scale independently and fail without impacting other services. […]