AWS Big Data Blog

Category: Amazon EMR

Improve observability across Amazon MWAA tasks

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that makes it simple to set up and operate end-to-end data pipelines in the cloud at scale. A data pipeline is a set of tasks and processes used to automate the movement and transformation of data between different systems. […]

Amazon EMR launches support for Amazon EC2 C7g (Graviton3) instances to improve cost performance for Spark workloads by 7–13%

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that deliver more than twice the performance of open-source Apache Spark and Presto. With Amazon EMR release 6.7, you can […]

Run Apache Spark workloads 3.5 times faster with Amazon EMR 6.9

In this post, we analyze the results from our benchmark tests running a TPC-DS application on open-source Apache Spark and then on Amazon EMR 6.9, which comes with an optimized Spark runtime that is compatible with open-source Spark. We walk through a detailed cost analysis and finally provide step-by-step instructions to run the benchmark. With Amazon EMR 6.9.0, you can now run your Apache Spark 3.x applications faster and at lower cost without requiring any changes to your applications. In our performance benchmark tests, derived from TPC-DS performance tests at 3 TB scale, we found that the EMR runtime for Apache Spark 3.3.0 is on average 3.5 times faster, measured by total runtime, than open-source Apache Spark 3.3.0.

Build a data lake with Apache Flink on Amazon EMR

To build a data-driven business, it is important to democratize enterprise data assets in a data catalog. With a unified data catalog, you can quickly search datasets and figure out data schema, data format, and location. The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep […]

How BookMyShow saved 80% in costs by migrating to an AWS modern data architecture

This is a guest post co-authored by Mahesh Vandi Chalil, Chief Technology Officer of BookMyShow. BookMyShow (BMS), a leading entertainment company in India, provides an online ticketing platform for movies, plays, concerts, and sporting events. Selling up to 200 million tickets on an annual run rate basis (pre-COVID) to customers in India, Sri Lanka, Singapore, […]

Add your own libraries and application dependencies to Spark and Hive on Amazon EMR Serverless with custom images

Amazon EMR Serverless allows you to run open-source big data frameworks such as Apache Spark and Apache Hive without managing clusters and servers. Many customers who run Spark and Hive applications want to add their own libraries and dependencies to the application runtime. For example, you may want to add popular open-source extensions to Spark, […]

Accelerate your data exploration and experimentation with the AWS Analytics Reference Architecture library

Organizations use their data to solve complex problems by starting small, running iterative experiments, and refining the solution. Although the power of experiments can’t be ignored, organizations have to be cautious about the cost-effectiveness of such experiments. If time is spent creating the underlying infrastructure for enabling experiments, it further adds to the cost. Developers […]

Run fault tolerant and cost-optimized Spark clusters using Amazon EMR on EKS and Amazon EC2 Spot Instances

Amazon EMR on EKS is a deployment option in Amazon EMR that allows you to run Spark jobs on Amazon Elastic Kubernetes Service (Amazon EKS). Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances save you up to 90% over On-Demand Instances and are a great way to cost optimize the Spark workloads running on Amazon […]

Amazon EMR Serverless cost estimator

Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run applications using open-source big data analytics frameworks such as Apache Spark and Hive without configuring, managing, and scaling clusters or servers. You get all the features of the latest open-source frameworks with the performance-optimized […]

Amazon EMR launches support for Amazon EC2 M6A, R6A instances to improve cost performance for Spark workloads by 15–50%

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over 2x performance improvements over open-source Apache Spark and Presto. With Amazon EMR release 6.8, you can now use […]