AWS Big Data Blog

Category: HAQM EMR

Build Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and Apache Hudi on HAQM EMR

April 2024: This post was reviewed for accuracy. Organizations across the globe are striving to improve the scalability and cost efficiency of the data warehouse. Offloading data and data processing from a data warehouse to a data lake empowers companies to introduce new use cases like ad hoc data analysis and AI and machine learning […]

Orchestrate an HAQM EMR on HAQM EKS Spark job with AWS Step Functions

At re:Invent 2020, we announced the general availability of HAQM EMR on HAQM EKS, a new deployment option for HAQM EMR that allows you to automate the provisioning and management of open-source big data frameworks on HAQM Elastic Kubernetes Service (HAQM EKS). With HAQM EMR on EKS, you can now run Spark applications alongside other […]

HAQM EMR 6.2.0 adds persistent HFile tracking to improve performance with HBase on HAQM S3

Apache HBase is an open-source, NoSQL database that you can use to achieve low latency random access to billions of rows. Starting with HAQM EMR 5.2.0, you can enable HBase on HAQM Simple Storage Service (HAQM S3). With HBase on HAQM S3, the HBase data files (HFiles) are written to HAQM S3, enabling data lake […]

Top 9 performance tuning tips for PrestoDB on HAQM EMR

Presto is a popular distributed SQL query engine for interactive data analytics. With its massively parallel processing (MPP) architecture, it’s capable of directly querying large datasets without the need for time-consuming and costly ETL processes. With a properly tuned Presto cluster, you can run fast queries against big data with response times ranging from subsecond […]

HAQM EMR 2020 year in review

Tens of thousands of customers use HAQM EMR to run big data analytics applications on Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto at scale. HAQM EMR automates the provisioning and scaling of these frameworks, and delivers high performance at low cost with optimized runtimes and support for a wide range […]

Orchestrating analytics jobs on HAQM EMR Notebooks using HAQM MWAA

May 2024: This post was reviewed and updated with a new dataset. In a previous post, we introduced the HAQM EMR notebook APIs, which allow you to programmatically run a notebook on HAQM EMR Studio (preview) without accessing the AWS web console. With the APIs, you can schedule running EMR notebooks with cron scripts, chain multiple notebooks, […]

Run Apache Spark 3.0 workloads 1.7 times faster with HAQM EMR runtime for Apache Spark

With HAQM EMR release 6.1.0, HAQM EMR runtime for Apache Spark is now available for Spark 3.0.0. EMR runtime for Apache Spark is a performance-optimized runtime for Apache Spark that is 100% API compatible with open-source Apache Spark. In our benchmark performance tests using TPC-DS benchmark queries at 3 TB scale, we found EMR runtime […]

Building complex workflows with HAQM MWAA, AWS Step Functions, AWS Glue, and HAQM EMR

HAQM Managed Workflows for Apache Airflow (HAQM MWAA) is a fully managed service that makes it easy to run open-source versions of Apache Airflow on AWS and build workflows to run your extract, transform, and load (ETL) jobs and data pipelines. You can use AWS Step Functions as a serverless function orchestrator to build scalable […]

Introducing HAQM EMR integration with Apache Ranger

This post was last updated July 2022. Data security is an important pillar in data governance. It includes authentication, authorization, encryption, and audit. HAQM EMR enables you to set up and run clusters of HAQM Elastic Compute Cloud (HAQM EC2) instances with open-source big data applications like Apache Spark, Apache Hive, Apache Flink, and Presto. You may […]

Testing data quality at scale with PyDeequ

June 2024: This post was reviewed and updated with instructions for using PyDeequ with HAQM SageMaker Notebook, SageMaker Studio, and EMR, and with examples run against a new dataset. March 2023: You can now use AWS Glue Data Quality to measure and manage the quality of your data. AWS Glue Data Quality is built on Deequ […]