Posted On: Oct 1, 2021
You can now use open source frameworks such as Apache Spark, Apache Hive, and Presto running on Amazon EMR clusters directly from Amazon SageMaker Studio notebooks to run petabyte-scale data analytics and machine learning. Amazon EMR automatically installs and configures open source frameworks and provides a performance-optimized runtime that is compatible with, and faster than, the standard open source versions. For example, Spark 3.0 on Amazon EMR is 1.7x faster than its open source equivalent. Amazon SageMaker Studio provides a single, web-based visual interface where you can perform all the ML development steps required to prepare data and to build, train, and deploy models. Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow. This release lets you use popular frameworks such as Apache Spark, Hive, and Presto running on EMR clusters directly from SageMaker Studio, which simplifies data science and ML workflows.
With this release, you can visually browse a list of EMR clusters directly from SageMaker Studio and connect to them in a few clicks. Once connected to an EMR cluster, you can use Spark SQL, Scala, Python, and HiveQL to interactively query, explore, and visualize data, and run Apache Spark, Hive, and Presto jobs to process data. Jobs run fast because they use EMR's performance-optimized versions of Spark, Hive, and Presto. Further, clusters can automatically scale up or down based on the workload and integrate with Spot instances and Graviton2-based processors to lower costs. Finally, SageMaker Studio users can authenticate with LDAP-based credentials or Kerberos when they connect to Amazon EMR clusters.
These features are supported on EMR 5.9.0 and above, and are generally available in all AWS Regions where SageMaker Studio is available. To learn more, watch the demo "Interactive data processing on Amazon EMR from Amazon SageMaker", read the blog "Perform interactive data engineering and data science workflows from Amazon SageMaker Studio notebooks", or see the SageMaker Studio documentation.
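As a quick illustration of the version requirement above, the following minimal sketch checks whether a cluster's release label meets the EMR 5.9.0 minimum. It assumes release labels follow the usual "emr-x.y.z" form; the function names parse_release_label and is_supported are hypothetical helpers written for this example and are not part of any AWS SDK.

```python
import re
from typing import Tuple

# Minimum release stated in the announcement: EMR 5.9.0 and above.
MIN_SUPPORTED: Tuple[int, int, int] = (5, 9, 0)


def parse_release_label(label: str) -> Tuple[int, int, int]:
    """Parse a label such as 'emr-5.33.0' into (5, 33, 0).

    Hypothetical helper; assumes the 'emr-x.y.z' label format.
    """
    match = re.fullmatch(r"emr-(\d+)\.(\d+)\.(\d+)", label.strip())
    if match is None:
        raise ValueError(f"unrecognized EMR release label: {label!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def is_supported(label: str) -> bool:
    """Return True if the release label is at or above the stated minimum."""
    # Tuples compare element by element, so (5, 10, 0) > (5, 9, 0).
    return parse_release_label(label) >= MIN_SUPPORTED
```

Comparing parsed integer tuples rather than raw strings avoids the pitfall where "5.10.0" sorts before "5.9.0" alphabetically.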