AWS Big Data Blog

Tag: Apache Spark

Use Batch Processing Gateway to automate job management in multi-cluster HAQM EMR on EKS environments

AWS customers often process petabytes of data using HAQM EMR on EKS. In enterprise environments with diverse workloads or varying operational requirements, customers frequently choose a multi-cluster setup due to the following advantages: Better resiliency and no single point of failure – If one cluster fails, other clusters can continue processing critical workloads, maintaining business […]

Detect and handle data skew on AWS Glue

October 2024: This post was reviewed and updated for accuracy. AWS Glue is a fully managed, serverless data integration service provided by HAQM Web Services (AWS) that uses Apache Spark as one of its backend processing engines (as of this writing, you can use Python Shell or Spark). Data skew occurs when the data being […]
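
To make the idea concrete, here is a minimal PySpark sketch of how skew is typically detected and then mitigated with key salting. This is an illustration, not code from the post: the dataset path and the customer_id key are hypothetical stand-ins.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()

# Hypothetical input path; substitute your own dataset.
df = spark.read.parquet("s3://my-bucket/events/")

# Count rows per join/group key: a lopsided distribution, with one
# key holding most of the rows, is the classic signature of data skew.
key_counts = df.groupBy("customer_id").count().orderBy(F.desc("count"))
key_counts.show(10)

# One common mitigation is salting: append a random suffix to the key
# so rows for a hot key spread across multiple partitions before a join.
salt = (F.rand() * 10).cast("int").cast("string")
salted = df.withColumn(
    "salted_key", F.concat_ws("_", F.col("customer_id"), salt)
)
```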

Introducing HAQM EMR on EKS job submission with Spark Operator and spark-submit

HAQM EMR on EKS provides a deployment option for HAQM EMR that allows organizations to run open-source big data frameworks on HAQM Elastic Kubernetes Service (HAQM EKS). With EMR on EKS, Spark applications run on the HAQM EMR runtime for Apache Spark. This performance-optimized runtime offered by HAQM EMR makes your Spark jobs run fast […]
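
For context, a spark-submit invocation against an EKS cluster generally looks like the sketch below, here assembled in Python. The cluster endpoint, namespace, and EMR runtime image URI are placeholders, and the exact configuration for your environment may differ; treat this as a shape, not a recipe.

```python
import subprocess

# Illustrative values only: the EKS API endpoint, namespace, and EMR
# runtime image URI below are placeholders, not real identifiers.
cmd = [
    "spark-submit",
    "--master", "k8s://https://EXAMPLE.gr7.us-east-1.eks.amazonaws.com",
    "--deploy-mode", "cluster",
    "--name", "pi-on-emr-eks",
    "--conf", "spark.kubernetes.namespace=spark-jobs",
    # Point Spark at an EMR-provided runtime image (placeholder URI).
    "--conf", "spark.kubernetes.container.image="
              "<account>.dkr.ecr.us-east-1.amazonaws.com/spark/emr-6.10.0:latest",
    "local:///usr/lib/spark/examples/src/main/python/pi.py",
]
subprocess.run(cmd, check=True)
```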

Run Apache Spark workloads 3.5 times faster with HAQM EMR 6.9

In this post, we analyze the results from our benchmark tests running a TPC-DS application on open-source Apache Spark and then on HAQM EMR 6.9, which comes with an optimized Spark runtime that is compatible with open-source Spark. We walk through a detailed cost analysis and finally provide step-by-step instructions to run the benchmark. With HAQM EMR 6.9.0, you can now run your Apache Spark 3.x applications faster and at lower cost without any changes to your applications. In our performance benchmark tests, derived from TPC-DS performance tests at 3 TB scale, we found that the EMR runtime for Apache Spark 3.3.0 provides a 3.5 times performance improvement on average (measured by total runtime) over open-source Apache Spark 3.3.0.

Design considerations for HAQM EMR on EKS in a multi-tenant HAQM EKS environment

Many AWS customers use HAQM Elastic Kubernetes Service (HAQM EKS) to take advantage of Kubernetes without the burden of managing the Kubernetes control plane. With Kubernetes, you can centrally manage your workloads and offer administrators a multi-tenant environment where they can create, update, scale, and secure workloads using a single API. Kubernetes also […]

Simplify data pipelines with AWS Glue automatic code generation and Workflows

In this post, we discuss how to use the automatic code generation process in AWS Glue ETL to simplify common data manipulation tasks, such as data type conversion and flattening complex structures. We also explore using AWS Glue Workflows to build and orchestrate data pipelines of varying complexity. Lastly, we look at how you can use SQL, together with AWS Glue ETL and the AWS Glue Data Catalog, to query and transform your data.
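
As a rough illustration of those data manipulation tasks, the following AWS Glue (PySpark) sketch casts an ambiguous column type and flattens a nested structure with the DynamicFrame API. The database, table, column, and S3 staging path names are hypothetical.

```python
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog database/table names, for illustration only.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Resolve an ambiguous column type by casting it explicitly.
dyf = dyf.resolveChoice(specs=[("order_total", "cast:double")])

# Flatten nested structures; Relationalize returns a collection with
# one frame per nested array plus a root frame, selectable by name.
flattened = Relationalize.apply(
    frame=dyf,
    staging_path="s3://my-bucket/tmp/",
    name="root",
    transformation_ctx="relationalize",
)
root = flattened.select("root")
```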

Best practices to scale Apache Spark jobs and partition data with AWS Glue

The first post of this series discusses two key AWS Glue capabilities for managing the scaling of data processing jobs. The first lets you horizontally scale out Apache Spark applications for large splittable datasets. The second lets you vertically scale up memory-intensive Apache Spark applications with the help of new AWS Glue worker types. The post also shows how to use AWS Glue to scale Apache Spark applications that process a large number of small files, such as those commonly ingested from streaming applications via HAQM Kinesis Data Firehose. Finally, it shows how AWS Glue jobs can use the partitioning structure of large datasets in HAQM S3 to provide faster execution times for Apache Spark applications.
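
The small-file and partitioning techniques might look roughly like this AWS Glue (PySpark) sketch; all bucket paths, database and table names, and partition values are illustrative placeholders.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Reading many small files: groupFiles coalesces them into larger
# input groups so Spark schedules far fewer tasks.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/firehose-output/"],
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # target roughly 128 MB per group
    },
    format="json",
)

# Partition pruning: read only the slices of a catalog table you need.
pruned = glue_context.create_dynamic_frame.from_catalog(
    database="logs_db",
    table_name="events",
    push_down_predicate="year == '2024' and month == '06'",
)

# Write back partitioned by the same keys for faster downstream reads.
glue_context.write_dynamic_frame.from_options(
    frame=pruned,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/curated/",
        "partitionKeys": ["year", "month"],
    },
    format="parquet",
)
```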