AWS Big Data Blog
Tag: EMR
EMR Notebooks: A managed analytics environment based on Jupyter notebooks
Notebooks are increasingly becoming the standard tool for interactively developing big data applications. It’s easy to see why. Their flexible architecture allows you to experiment with data in multiple languages, test code interactively, and visualize large datasets. To help scientists and developers easily access notebook tools, we launched Amazon EMR Notebooks, a managed notebook environment […]
Test data quality at scale with Deequ
In this blog post, we introduce Deequ, an open source tool developed and used at Amazon. Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look.
Optimize Amazon EMR costs with idle checks and automatic resource termination using advanced Amazon CloudWatch metrics and AWS Lambda
Many customers use Amazon EMR to run big data workloads, such as Apache Spark and Apache Hive queries, in their development environment. Data analysts and data scientists frequently use these types of clusters, known as analytics EMR clusters. Users often forget to terminate the clusters after their work is done. This leads to idle running […]
Spark enhancements for elasticity and resiliency on Amazon EMR
This blog post provides an overview of the issues with how open-source Spark handles node loss and the improvements in Amazon EMR that address them.
Dynamically scale up storage on Amazon EMR clusters
February 2025: The bootstrap action script in this blog post uses IMDS v1 for accessing EC2 instance metadata. The script does not support IMDS v2 and cannot be used in an AWS account which has IMDS v2 enforced across the account. Using the script in an IMDS v2 enabled account will cause issues and unexpected […]
Sharpen your Skill Set with Apache Spark on the AWS Big Data Blog
The AWS Big Data Blog has a large community of authors who are passionate about Apache Spark and who regularly publish content that helps customers use Spark to build real-world solutions. You’ll see content on a variety of topics, including deep-dives on Spark’s internals, building Spark Streaming applications, creating machine learning pipelines using MLlib, and ways […]