AWS Big Data Blog
Tag: Amazon EMR
Query big data with resilience using Trino in Amazon EMR with Amazon EC2 Spot Instances for less cost
New enhancements in Trino with Amazon EMR provide improved resiliency for running ETL and batch workloads on Spot Instances with reduced costs. This post showcases the resilience of Amazon EMR with Trino using fault-tolerant configuration to run long-running queries on Spot Instances to save costs. We simulate Spot interruptions on Trino worker nodes by using AWS Fault Injection Simulator (AWS FIS).
Apache Iceberg optimization: Solving the small files problem in Amazon EMR
Currently, Iceberg provides a compaction utility that compacts small files at a table or partition level. But this approach requires you to implement the compaction job using your preferred job scheduler or to trigger it manually. In this post, we discuss the new Iceberg feature that you can use to automatically compact small files while writing data into Iceberg tables using Spark on Amazon EMR or Amazon Athena.
Build a data lake with Apache Flink on Amazon EMR
To build a data-driven business, it is important to democratize enterprise data assets in a data catalog. With a unified data catalog, you can quickly search datasets and figure out data schema, data format, and location. The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep […]
Accelerate your data exploration and experimentation with the AWS Analytics Reference Architecture library
Organizations use their data to solve complex problems by starting small, running iterative experiments, and refining the solution. Although the power of experiments can’t be ignored, organizations have to be cautious about the cost-effectiveness of such experiments. If time is spent creating the underlying infrastructure for enabling experiments, it further adds to the cost. Developers […]
Build your Apache Hudi data lake on AWS using Amazon EMR – Part 1
Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and data pipeline development. It does this by bringing core warehouse and database functionality directly to a data lake on Amazon Simple Storage Service (Amazon S3) or Apache HDFS. Hudi provides table management, instantaneous views, efficient upserts/deletes, advanced indexes, streaming […]
Run a data processing job on Amazon EMR Serverless with AWS Step Functions
Update Feb 2023: AWS Step Functions adds direct integration for 35 services including Amazon EMR Serverless. In the current version of this blog, we are able to submit an EMR Serverless job by invoking the APIs directly from a Step Functions workflow. We use AWS Lambda only for polling the status of the job […]
Upgrade Amazon EMR Hive Metastore from 5.X to 6.X
If you are currently running Amazon EMR 5.X clusters, consider moving to Amazon EMR 6.X, as it includes new features that help you improve performance and optimize costs. For instance, Apache Hive is two times faster with LLAP on Amazon EMR 6.X, and Spark 3 reduces costs by 40%. Additionally, Amazon EMR 6.x releases […]
Design considerations for Amazon EMR on EKS in a multi-tenant Amazon EKS environment
Many AWS customers use Amazon Elastic Kubernetes Service (Amazon EKS) in order to take advantage of Kubernetes without the burden of managing the Kubernetes control plane. With Kubernetes, you can centrally manage your workloads and offer administrators a multi-tenant environment where they can create, update, scale, and secure workloads using a single API. Kubernetes also […]
Configure Hadoop YARN CapacityScheduler on Amazon EMR on Amazon EC2 for multi-tenant heterogeneous workloads
Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster resource manager responsible for assigning computational resources (CPU, memory, I/O), and scheduling and monitoring jobs submitted to a Hadoop cluster. This generic framework allows for effective management of cluster resources for distributed data processing frameworks, such as Apache Spark, Apache MapReduce, and Apache Hive. When […]
Disaster recovery considerations with Amazon EMR on Amazon EC2 for Spark workloads
Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. Amazon EMR launches all nodes for a given cluster in the same Amazon Elastic Compute Cloud (Amazon EC2) Availability Zone […]