AWS Big Data Blog

Tag: Amazon EMR

Build a self-service environment for each line of business using Amazon EMR and AWS Service Catalog

Enterprises often want to centralize governance and compliance requirements, and provide a common set of policies on how Amazon EMR instances should be set up. You can use AWS Service Catalog to centrally manage commonly deployed Amazon EMR cluster configurations, and this helps you achieve consistent governance and meet your compliance requirements, while at the […]

Enhancing customer safety by leveraging the scalable, secure, and cost-optimized Toyota Connected Data Lake

February 9, 2024: Amazon Kinesis Data Firehose has been renamed to Amazon Data Firehose. Read the AWS What’s New post to learn more. Toyota Motor Corporation (TMC), a global automotive manufacturer, has made “connected cars” a core priority as part of its broader transformation from an auto company to a mobility company. In recent years, […]

Monitor and Optimize Analytic Workloads on Amazon EMR with Prometheus and Grafana

This post discusses installing and configuring Prometheus and Grafana on an Amazon Elastic Compute Cloud (Amazon EC2) instance, configuring an EMR cluster to emit metrics that Prometheus can scrape from the cluster, and using the Grafana dashboards to analyze the metrics for a workload on the EMR cluster and optimize it. We also cover how Prometheus can push alerts to the Alertmanager, and how to configure Amazon SNS to send email notifications.

Build a distributed big data reconciliation engine using Amazon EMR and Amazon Athena

This is a guest post by Sara Miller, Head of Data Management and Data Lake, Direct Energy; and Zhouyi Liu, Senior AWS Developer, Direct Energy. Enterprise companies like Direct Energy migrate on-premises data warehouses and services to AWS to achieve fully manageable digital transformation of their organization. Freedom from traditional data warehouse constraints frees up […]

Enable fine-grained data access in Zeppelin Notebook with AWS Lake Formation

This post explores how you can use AWS Lake Formation integration with Amazon EMR (still in beta) to implement fine-grained column-level access controls while using Spark in a Zeppelin Notebook. My previous post Extract Salesforce.com data using AWS Glue and analyzing with Amazon Athena showed you a simple use case for extracting any Salesforce object data using AWS Glue and Apache Spark, saving it to Amazon Simple Storage Service (Amazon S3), cataloging the data using the Data Catalog in Glue, and querying it using Amazon Athena.

Improving RAPIDS XGBoost performance and reducing costs with Amazon EMR running Amazon EC2 G4 instances

This is a guest post by Kong Zhao, Solution Architect at NVIDIA Corporation. This post shares how NVIDIA made RAPIDS XGBoost up to 4.5 times faster and reduced costs by up to 5.4 times by using Amazon EMR running Amazon Elastic Compute Cloud (Amazon EC2) G4 instances. Gradient boosting is a powerful machine […]

Control data access and permissions with AWS Lake Formation and Amazon EMR

What if you could control the access to your data lake centrally? Would it be more convenient to share specific data securely with internal and external customers? With AWS Lake Formation and its integration with Amazon EMR, you can easily perform these administrative tasks. This post goes through a use case and reviews the steps to control the data access and permissions of your existing data lake.

Introducing Amazon EMR Managed Scaling – Automatically Resize Clusters to Lower Cost

AWS is happy to announce the release of Amazon EMR Managed Scaling, a new feature that automatically resizes your cluster for best performance at the lowest possible cost. With EMR Managed Scaling you specify the minimum and maximum compute limits for your clusters and Amazon EMR automatically resizes them for best performance and resource utilization. EMR Managed Scaling continuously samples key metrics associated with the workloads running on clusters. EMR Managed Scaling is supported for Apache Spark, Apache Hive and YARN-based workloads on Amazon EMR versions 5.30.1 and above.
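
As a rough illustration of the minimum and maximum compute limits described above, here is a minimal sketch in Python. The helper name `build_managed_scaling_policy` and the example values are assumptions made for illustration and do not come from the post. The dictionary follows the `ComputeLimits` shape accepted by the boto3 EMR client's `put_managed_scaling_policy` call, and the sketch only builds and validates the policy without contacting AWS.

```python
from typing import Any, Dict

# Unit types accepted for managed scaling compute limits.
ALLOWED_UNIT_TYPES = {"Instances", "InstanceFleetUnits", "VCPU"}


def build_managed_scaling_policy(
    min_units: int, max_units: int, unit_type: str = "Instances"
) -> Dict[str, Any]:
    """Build a managed scaling policy dict (hypothetical helper).

    min_units and max_units are the lower and upper compute limits
    between which the cluster may be resized.
    """
    if unit_type not in ALLOWED_UNIT_TYPES:
        raise ValueError(f"unsupported unit type: {unit_type}")
    if min_units < 1 or max_units < min_units:
        raise ValueError("limits must satisfy 1 <= min_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": unit_type,
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


# Example values (assumed): allow the cluster to scale between 2 and 10 instances.
policy = build_managed_scaling_policy(2, 10)
```

With boto3 installed and credentials configured, a policy like this could then be passed as `ManagedScalingPolicy=policy` to `boto3.client("emr").put_managed_scaling_policy(...)` together with the `ClusterId` of your own cluster.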

Access web interfaces securely on Amazon EMR launched in a private subnet using an Application Load Balancer

Amazon EMR web interfaces are hosted on the master node of an EMR cluster. When you launch an EMR cluster in a private subnet, the EMR master node doesn’t have a public DNS record. The web interfaces hosted in a private subnet aren’t easily accessible outside the subnet. You can use an Application Load Balancer (ALB), launched in a public subnet, as an HTTPS proxy to access EMR web interfaces over the internet without requiring SSH tunneling through a bastion host. This approach greatly simplifies accessing EMR web interfaces. This post outlines how to use an ALB to securely access EMR web interfaces over the internet for an EMR cluster launched in a private subnet.