Apache Hadoop on HAQM EMR

Why Apache Hadoop on EMR?

Apache™ Hadoop® is an open source software project that can be used to efficiently process large datasets. Instead of using one large computer to process and store the data, Hadoop allows clustering commodity hardware together to analyze massive data sets in parallel.

There are many applications and execution engines in the Hadoop ecosystem, providing a variety of tools to match the needs of your analytics workloads. HAQM EMR makes it easy to create and manage fully configured, elastic clusters of HAQM EC2 instances running Hadoop and other applications in the Hadoop ecosystem.

Applications and frameworks in the Hadoop ecosystem


Hadoop commonly refers to the actual Apache Hadoop project, which includes MapReduce (execution framework), YARN (resource manager), and HDFS (distributed storage). You can also install Apache Tez, a next-generation framework which can be used instead of Hadoop MapReduce as an execution engine. HAQM EMR also includes EMRFS, a connector allowing Hadoop to use HAQM S3 as a storage layer.

However, there are also other applications and frameworks in the Hadoop ecosystem, including tools that enable low-latency queries, GUIs for interactive querying, a variety of interfaces like SQL, and distributed NoSQL databases. The Hadoop ecosystem includes many open source tools designed to build additional functionality on Hadoop core components, and you can use HAQM EMR to easily install and configure tools such as Hive, Pig, Hue, Ganglia, Oozie, and HBase on your cluster. You can also run other frameworks, like Apache Spark for in-memory processing, or Presto for interactive SQL, in addition to Hadoop on HAQM EMR.
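As a rough illustration of how these applications are selected at cluster-creation time, the boto3 sketch below launches an EMR cluster with several ecosystem applications pre-installed. The cluster name, release label, instance types, roles, and log bucket are placeholder values, not settings taken from this page.

    # Minimal sketch: create an EMR cluster with Hadoop-ecosystem applications.
    # All names, sizes, and the S3 log bucket are hypothetical examples.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="hadoop-ecosystem-cluster",
        ReleaseLabel="emr-6.10.0",                       # example EMR 6.x release
        Applications=[
            {"Name": "Hadoop"},                          # MapReduce, YARN, HDFS
            {"Name": "Tez"},
            {"Name": "Hive"},
            {"Name": "Pig"},
            {"Name": "Hue"},
            {"Name": "HBase"},
            {"Name": "Spark"},
        ],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,         # long-running cluster
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        LogUri="s3://my-example-logs-bucket/emr/",       # hypothetical bucket
    )
    print("Cluster ID:", response["JobFlowId"])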

Hadoop: the basic components


HAQM EMR programmatically installs and configures applications in the Hadoop project, including Hadoop MapReduce, YARN, HDFS, and Apache Tez across the nodes in your cluster.

Hadoop MapReduce and Tez, execution engines in the Hadoop ecosystem, process workloads using frameworks that break down jobs into smaller pieces of work that can be distributed across nodes in your HAQM EMR cluster. They are built with the expectation that any given machine in your cluster could fail at any time and are designed for fault tolerance. If a server running a task fails, Hadoop reruns that task on another machine until completion.

You can write MapReduce and Tez programs in Java, use Hadoop Streaming to run custom scripts in parallel, use Hive and Pig for higher-level abstractions over MapReduce and Tez, or use other tools to interact with Hadoop.
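To make the Hadoop Streaming option concrete, here is a minimal word-count sketch: the mapper and reducer are plain scripts that read standard input and write standard output, so any language can be used. The file names mapper.py and reducer.py are illustrative.

    # mapper.py -- emit "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- sum the counts for each word; Hadoop Streaming sorts the
    # mapper output by key, so identical words arrive on consecutive lines
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")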

Starting with Hadoop 2, resource management is handled by Yet Another Resource Negotiator (YARN). YARN keeps track of all the resources across your cluster and ensures that they are dynamically allocated to accomplish the tasks in your processing job. YARN can manage Hadoop MapReduce and Tez workloads as well as other distributed frameworks such as Apache Spark.
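For a quick look at what YARN is tracking, the sketch below reads cluster-wide resource metrics from the YARN ResourceManager REST API. It assumes you can reach the primary node on port 8088 (for example through an SSH tunnel); the hostname is a placeholder.

    # Minimal sketch: query the YARN ResourceManager for cluster metrics.
    import json
    import urllib.request

    RM_HOST = "ec2-xx-xx-xx-xx.compute-1.amazonaws.com"   # hypothetical primary node

    with urllib.request.urlopen(f"http://{RM_HOST}:8088/ws/v1/cluster/metrics") as resp:
        metrics = json.load(resp)["clusterMetrics"]

    print("Running applications:", metrics["appsRunning"])
    print("Allocated / total memory (MB):", metrics["allocatedMB"], "/", metrics["totalMB"])
    print("Allocated / total vCores:", metrics["allocatedVirtualCores"], "/", metrics["totalVirtualCores"])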

By using the EMR File System (EMRFS) on your HAQM EMR cluster, you can leverage HAQM S3 as your data layer for Hadoop. HAQM S3 is highly scalable, low cost, and designed for durability, making it a great data store for big data processing. By storing your data in HAQM S3, you can decouple your compute layer from your storage layer, allowing you to size your HAQM EMR cluster for the amount of CPU and memory required for your workloads instead of having extra nodes in your cluster to maximize on-cluster storage. Additionally, you can terminate your HAQM EMR cluster when it is idle to save costs, while your data remains in HAQM S3.

EMRFS is optimized for Hadoop to read and write directly to HAQM S3 in parallel with high performance, and it can process objects encrypted with HAQM S3 server-side and client-side encryption. EMRFS allows you to use HAQM S3 as your data lake, and Hadoop in HAQM EMR can be used as an elastic query layer.
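As an example of EMRFS in practice, the sketch below submits a Hadoop Streaming step to an existing cluster whose input and output both live in S3 and are addressed with the s3:// scheme. The cluster ID, bucket, and script locations are placeholders.

    # Minimal sketch: run a streaming job that reads from and writes to S3 via EMRFS.
    import boto3

    emr = boto3.client("emr")

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",                     # hypothetical cluster ID
        Steps=[{
            "Name": "wordcount-over-emrfs",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-files", "s3://my-example-bucket/scripts/mapper.py,"
                              "s3://my-example-bucket/scripts/reducer.py",
                    "-mapper", "python3 mapper.py",
                    "-reducer", "python3 reducer.py",
                    "-input", "s3://my-example-bucket/input/",
                    "-output", "s3://my-example-bucket/output/wordcount/",
                ],
            },
        }],
    )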

Hadoop also includes a distributed storage system, the Hadoop Distributed File System (HDFS), which stores data across local disks of your cluster in large blocks. HDFS has a configurable replication factor (with a default of 3x), giving increased availability and durability. HDFS monitors replication and balances your data across your nodes as nodes fail and new nodes are added.

HDFS is automatically installed with Hadoop on your HAQM EMR cluster, and you can use HDFS along with HAQM S3 to store your input and output data. You can easily encrypt HDFS using an HAQM EMR security configuration. Also, HAQM EMR configures Hadoop to use HDFS and local disk for intermediate data created during your Hadoop MapReduce jobs, even if your input data is located in HAQM S3.
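If the default replication factor does not fit your cluster, one way to override it (sketched below with an example value) is an EMR configuration classification passed as the Configurations parameter of run_job_flow or the equivalent console/CLI setting.

    # Minimal sketch: override dfs.replication through the hdfs-site classification.
    hdfs_configuration = [
        {
            "Classification": "hdfs-site",
            "Properties": {
                "dfs.replication": "2",   # example value; Hadoop's default is 3
            },
        }
    ]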

Advantages of Hadoop on HAQM EMR


You can initialize a new Hadoop cluster dynamically and quickly, or add servers to your existing HAQM EMR cluster, significantly reducing the time it takes to make resources available to your users and data scientists. Using Hadoop on the AWS platform can dramatically increase your organizational agility by lowering the cost and time it takes to allocate resources for experimentation and development.

Hadoop configuration, networking, server installation, security configuration, and ongoing administrative maintenance can be complicated and challenging. As a managed service, HAQM EMR addresses your Hadoop infrastructure requirements so you can focus on your core business.

You can easily integrate your Hadoop environment with other services such as HAQM S3, HAQM Kinesis, HAQM Redshift, and HAQM DynamoDB to enable data movement, workflows, and analytics across the many diverse services on the AWS platform. Additionally, you can use the AWS Glue Data Catalog as a managed metadata repository for Apache Hive and Apache Spark.
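To use the AWS Glue Data Catalog as the Hive metastore, EMR's documented approach is a hive-site configuration classification like the sketch below, again supplied as the cluster's Configurations; treat it as a starting point rather than a complete setup.

    # Minimal sketch: point Hive at the AWS Glue Data Catalog as its metastore.
    glue_catalog_configuration = [
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
            },
        }
    ]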

Many Hadoop jobs are spiky in nature. For instance, an ETL job can run hourly, daily, or monthly, while modeling jobs for financial firms or genetic sequencing may occur only a few times a year. Using Hadoop on HAQM EMR allows you to spin up these workload clusters easily, save the results, and shut down your Hadoop resources when they’re no longer needed, to avoid unnecessary infrastructure costs. EMR 6.x supports Hadoop 3, which allows the YARN NodeManager to launch containers either directly on the EMR cluster host or inside a Docker container. Please see our documentation to learn more.
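The transient-cluster pattern can be expressed directly in the API: create the cluster with the work defined as steps and let it terminate itself when the steps finish, leaving the results in S3. In the sketch below, the Hive script location, instance sizes, and roles are placeholders.

    # Minimal sketch: a transient cluster that runs one Hive step and then terminates.
    import boto3

    emr = boto3.client("emr")

    emr.run_job_flow(
        Name="nightly-etl",
        ReleaseLabel="emr-6.10.0",
        Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
        Steps=[{
            "Name": "hourly-aggregation",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script", "--args",
                         "-f", "s3://my-example-bucket/scripts/aggregate.hql"],
            },
        }],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 4},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,        # shut down after the steps complete
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )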

By using Hadoop on HAQM EMR, you have the flexibility to launch your clusters in any number of Availability Zones in any AWS region. A potential problem or threat in one region or zone can be easily circumvented by launching a cluster in another zone in minutes.

Capacity planning prior to deploying a Hadoop environment can often result in expensive idle resources or resource limitations. With HAQM EMR, you can create clusters with the required capacity within minutes and use EMR Managed Scaling to dynamically scale out and scale in nodes.
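A managed scaling policy can be attached to a running cluster with a single API call; in the sketch below the cluster ID and capacity limits are placeholders to adjust for your workload.

    # Minimal sketch: attach an EMR Managed Scaling policy to an existing cluster.
    import boto3

    emr = boto3.client("emr")

    emr.put_managed_scaling_policy(
        ClusterId="j-XXXXXXXXXXXXX",                     # hypothetical cluster ID
        ManagedScalingPolicy={
            "ComputeLimits": {
                "UnitType": "Instances",
                "MinimumCapacityUnits": 2,               # never scale below 2 instances
                "MaximumCapacityUnits": 20,              # never scale above 20 instances
            }
        },
    )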

Use cases


Hadoop can be used to analyze clickstream data in order to segment users and understand user preferences. Advertisers can also analyze clickstreams and advertising impression logs to deliver more effective ads.

Learn how Razorfish uses Hadoop on HAQM EMR for clickstream analysis

Hadoop can be used to process logs generated by web and mobile applications. Hadoop helps you turn petabytes of unstructured or semi-structured data into useful insights about your applications or users.

Learn how Yelp uses Hadoop on HAQM EMR to drive key website features

 

Hadoop ecosystem applications like Hive allow users to leverage Hadoop MapReduce through a SQL interface, enabling massively scalable, distributed, and fault-tolerant data warehousing. Use Hadoop to store your data and let your users run queries against data of any size.

Watch how Netflix uses Hadoop on HAQM EMR to run a petabyte scale data warehouse

Hadoop can be used to process vast amounts of genomic data and other large scientific data sets quickly and efficiently. AWS has made the 1000 Genomes Project data publicly available to the community free of charge.

Read more about Genomics on AWS

 

Given its massive scalability and lower costs, Hadoop is ideally suited for common ETL workloads such as collecting, sorting, joining, and aggregating big datasets for easier consumption by downstream systems.

Read how Euclid uses Hadoop on HAQM EMR for ETL and data aggregation