Apache Spark on Amazon EMR
Why Apache Spark on EMR?
Amazon EMR is the best place to run Apache Spark. You can quickly and easily create managed Spark clusters from the AWS Management Console, AWS CLI, or the Amazon EMR API. You can also take advantage of other Amazon EMR features, including fast Amazon S3 connectivity using the Amazon EMR File System (EMRFS), integration with the Amazon EC2 Spot market and the AWS Glue Data Catalog, and EMR Managed Scaling to automatically add or remove instances from your cluster. AWS Lake Formation brings fine-grained access control, while integration with AWS Step Functions helps with orchestrating your data pipelines.
EMR Studio (preview) is an integrated development environment (IDE) that makes it easy for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. EMR Studio provides fully managed Jupyter Notebooks, along with tools like Spark UI and YARN Timeline Service to simplify debugging. EMR Notebooks make it easy for you to experiment and build applications with Spark. If you prefer, you can use Apache Zeppelin to create interactive and collaborative notebooks for data exploration using Spark.
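As a rough illustration of creating a Spark cluster through the EMR API, the sketch below assembles a request for the `run_job_flow` call in boto3 (the AWS SDK for Python). It combines three of the features mentioned above: Spot capacity for the core nodes, the AWS Glue Data Catalog as Spark's metastore, and a Managed Scaling policy. The helper names, the release label, the instance type, the capacity limits, and the default IAM role names are assumptions chosen for illustration, not recommendations; adjust them to your account and Region.

```python
"""Sketch: build an Amazon EMR RunJobFlow request for a managed Spark cluster."""

# Hive metastore client factory that points Spark SQL at the AWS Glue Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)


def build_spark_cluster_request(
    name,
    log_uri,
    release_label="emr-6.15.0",  # assumption: use a release available to you
    instance_type="m5.xlarge",   # assumption: illustrative instance type
    core_count=2,
    max_units=10,
):
    """Return keyword arguments for the boto3 EMR client's run_job_flow()."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        # Use the AWS Glue Data Catalog as the metastore for Spark SQL.
        "Configurations": [
            {
                "Classification": "spark-hive-site",
                "Properties": {
                    "hive.metastore.client.factory.class": GLUE_FACTORY
                },
            }
        ],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "Market": "SPOT",  # Spot capacity can be interrupted
                    "InstanceType": instance_type,
                    "InstanceCount": core_count,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Managed Scaling resizes the cluster between these limits, which
        # apply to core and task nodes only.
        "ManagedScalingPolicy": {
            "ComputeLimits": {
                "UnitType": "Instances",
                "MinimumCapacityUnits": core_count,
                "MaximumCapacityUnits": max_units,
            }
        },
        # Assumption: the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region_name="us-east-1"):
    """Submit the request; needs boto3 and AWS credentials (not run here)."""
    import boto3  # imported lazily so the builder works without the SDK

    emr = boto3.client("emr", region_name=region_name)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Calling `build_spark_cluster_request("spark-demo", "s3://example-bucket/logs/")` returns a plain dictionary that you can inspect before passing it to `launch_cluster`; the bucket name here is a placeholder. Keeping the request builder separate from the API call makes it possible to review or test the configuration without creating any AWS resources.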
Customer success
Yelp
Yelp’s advertising targeting team builds prediction models to determine the likelihood of a user interacting with an advertisement. By using Apache Spark on Amazon EMR to process large amounts of data to train machine learning models, Yelp increased revenue and advertising click-through rate.
The Washington Post
The Washington Post uses Apache Spark on Amazon EMR to build models powering its website’s recommendation engine to boost reader engagement and satisfaction. It relies on Amazon EMR's performant connectivity with Amazon S3 to update models in near real time.
Krux
As part of its Data Management Platform for customer insights, Krux runs many machine learning and general processing workloads using Apache Spark. Krux uses ephemeral Amazon EMR clusters with Amazon EC2 Spot capacity to save costs, and uses Amazon S3 with EMRFS as a data layer for Apache Spark.
GumGum
GumGum, an in-image and in-screen advertising platform, uses Spark on Amazon EMR for inventory forecasting, processing of clickstream logs, and ad hoc analysis of unstructured data in Amazon S3. Spark’s performance enhancements saved GumGum time and money for these workflows.
Hearst Corporation
Hearst Corporation, a large diversified media and information company, has customers viewing content on over 200 web properties. Using Apache Spark Streaming on Amazon EMR, Hearst’s editorial staff can keep a real-time pulse on which articles are performing well and which themes are trending.
CrowdStrike
CrowdStrike provides endpoint protection to stop breaches. It uses Amazon EMR with Spark to process hundreds of terabytes of event data and roll it up into higher-level behavioral descriptions on the hosts. From that data, CrowdStrike can pull event data together and identify the presence of malicious activity.