Posted On: Sep 19, 2022

We are excited to announce that Amazon EMR on EKS release 6.7.0 and later can run Apache Spark SQL scripts through the StartJobRun API. Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the Spark SQL interfaces give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations. With this release, you can run Spark SQL queries and Spark SQL-based ETL pipelines directly through the Amazon EMR on EKS StartJobRun API.

Amazon EMR on EKS users rely on the StartJobRun API to kick off Spark jobs. Previously, to run Spark SQL scripts, users had to embed them in an interface such as PySpark, which meant modifying existing Spark SQL scripts. As part of this release, a new Spark SQL job driver has been added to the Amazon EMR on EKS base image that users run their Spark jobs on. Users can now supply SQL entry-point files to the StartJobRun API and run Spark SQL queries on Amazon EMR on EKS directly, without any modifications to existing Spark SQL scripts. This feature is available in all AWS Regions where Amazon EMR on EKS is available.
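
As an illustration, the following is a minimal sketch of how a SQL entry-point file might be submitted with the AWS SDK for Python (boto3). It assumes the `sparkSqlJobDriver` job driver type of the `emr-containers` StartJobRun API. The virtual cluster ID, role ARN, bucket name, script path, Region, and Spark settings are hypothetical placeholders, and the helper function names are ours rather than part of any AWS library.

```python
def build_sql_job_request(virtual_cluster_id, execution_role_arn, sql_script_uri,
                          release_label="emr-6.7.0-latest", name="spark-sql-job"):
    """Build the keyword arguments for a StartJobRun call that runs a SQL script."""
    return {
        "name": name,
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        # Spark SQL scripts require release 6.7.0 or later.
        "releaseLabel": release_label,
        "jobDriver": {
            # The SQL file is the entry point; no PySpark wrapper is needed.
            "sparkSqlJobDriver": {
                "entryPoint": sql_script_uri,
                # Illustrative Spark settings; adjust for your workload.
                "sparkSqlParameters": "--conf spark.executor.instances=2",
            }
        },
    }


def submit_sql_job(request, region_name="us-east-1"):
    """Send the request to Amazon EMR on EKS and return the job run ID."""
    import boto3  # imported here so the request can be built without boto3 installed

    client = boto3.client("emr-containers", region_name=region_name)
    response = client.start_job_run(**request)
    return response["id"]


if __name__ == "__main__":
    # All identifiers below are placeholders, not real resources.
    request = build_sql_job_request(
        virtual_cluster_id="<virtual-cluster-id>",
        execution_role_arn="arn:aws:iam::111122223333:role/example-job-role",
        sql_script_uri="s3://example-bucket/scripts/query.sql",
    )
    print(request["jobDriver"])
```

Building the request separately from the call makes it easy to inspect the payload before submitting it. Calling `submit_sql_job(request)` requires valid AWS credentials and an existing virtual cluster.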

To learn more about how to run Spark SQL scripts on Amazon EMR on EKS, please visit the documentation page.