AWS Big Data Blog
Category: Amazon EMR
Amazon EMR launches support for Amazon EC2 C6i, M6i, I4i, R6i and R6id instances to improve cost performance for Spark workloads by 6–33%
Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that deliver more than twice the performance of open-source Apache Spark and Presto, so that your applications run faster and at […]
Build your Apache Hudi data lake on AWS using Amazon EMR – Part 1
Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and data pipeline development. It does this by bringing core warehouse and database functionality directly to a data lake on Amazon Simple Storage Service (Amazon S3) or Apache HDFS. Hudi provides table management, instantaneous views, efficient upserts/deletes, advanced indexes, streaming […]
Introducing ACK controller for Amazon EMR on EKS
AWS Controllers for Kubernetes (ACK) was announced in August 2020, and now supports 14 AWS service controllers as generally available with an additional 12 in preview. The vision behind this initiative was simple: allow Kubernetes users to use the Kubernetes API to manage the lifecycle of AWS resources such as Amazon Simple Storage Service (Amazon […]
Use Karpenter to speed up Amazon EMR on EKS autoscaling
Amazon EMR on Amazon EKS is a deployment option for Amazon EMR that allows organizations to run Apache Spark on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, the Spark jobs run on the Amazon EMR runtime for Apache Spark. This increases the performance of your Spark jobs so that they run faster […]
Build an optimized self-service interactive analytics platform with Amazon EMR Studio
Data engineers and data scientists depend on distributed data processing infrastructure like Amazon EMR to perform data processing and advanced analytics jobs on large volumes of data. In most mid-size and enterprise organizations, cloud operations teams own procuring, provisioning, and maintaining the IT infrastructure, and their objectives and best practices differ from the data […]
How Kyligence Cloud uses Amazon EMR Serverless to simplify OLAP
This post was co-written with Daniel Gu and Yolanda Wang, from Kyligence. Today, more than ever, organizations realize that modern business runs on data—almost all our interactions with business are based on data, and organizations must use analytics to understand, plan, and improve their operations. That is where Online Analytical Processing (OLAP) comes in. OLAP […]
Introducing runtime roles for Amazon EMR steps: Use IAM roles and AWS Lake Formation for access control with Amazon EMR
You can use the Amazon EMR Steps API to submit Apache Hive, Apache Spark, and other types of applications to an EMR cluster. You can invoke the Steps API using Apache Airflow, AWS Step Functions, the AWS Command Line Interface (AWS CLI), all the AWS SDKs, and the AWS Management Console. Jobs submitted with the […]
Build a high-performance, transactional data lake using open-source Delta Lake on Amazon EMR
Data lakes on Amazon Simple Storage Service (Amazon S3) have become the default repository for all enterprise data and serve as a common choice for a large number of users querying from a variety of analytics and machine learning (ML) tools. Oftentimes you want to ingest data continuously into the data lake from multiple sources […]
Get a quick start with Apache Hudi, Apache Iceberg, and Delta Lake with Amazon EMR on EKS
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can keep your data as is in your object store or file-based storage without having to first structure the data. Additionally, you can run different types of analytics against your loosely formatted data […]
Run a data processing job on Amazon EMR Serverless with AWS Step Functions
Update Feb 2023: AWS Step Functions adds direct integration for 35 services including Amazon EMR Serverless. In the current version of this blog, we are able to submit an EMR Serverless job by invoking the APIs directly from a Step Functions workflow. We use a Lambda function only for polling the status of the job […]