AWS Big Data Blog
Category: Amazon EMR
Access Apache Livy using a Network Load Balancer on a Kerberos-enabled Amazon EMR cluster
Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. Amazon EMR supports Kerberos for authentication; you can enable Kerberos on Amazon EMR and put the cluster in a private […]
Amazon EMR on Amazon EKS provides up to 61% lower costs and up to 68% performance improvement for Spark workloads
Amazon EMR on Amazon EKS is a deployment option offered by Amazon EMR that enables you to run Apache Spark applications on Amazon Elastic Kubernetes Service (Amazon EKS) in a cost-effective manner. It uses the EMR runtime for Apache Spark to increase performance so that your jobs run faster and cost less. In our benchmark […]
How SailPoint solved scaling issues by migrating legacy big data applications to Amazon EMR on Amazon EKS
This post is co-written with Richard Li from SailPoint. SailPoint Technologies is an identity security company based in Austin, TX. Its software as a service (SaaS) solutions support identity governance operations in regulated industries such as healthcare, government, and higher education. SailPoint distinguishes multiple aspects of identity as individual identity security services, including cloud governance, […]
Best practices to optimize data access performance from Amazon EMR and AWS Glue to Amazon S3
June 2024: This post was reviewed for accuracy and updated to cover Apache Iceberg. June 2023: This post was reviewed and updated for accuracy. Customers are increasingly building data lakes to store data at massive scale in the cloud. It’s common to use distributed computing engines, cloud-native databases, and data warehouses when you want to […]
New features from Apache Hudi 0.9.0 on Amazon EMR
Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and data pipeline development. It does this by providing transaction support and record-level insert, update, and delete capabilities on data lakes on Amazon Simple Storage Service (Amazon S3) or Apache HDFS. Apache Hudi is integrated with open-source big data analytics […]
Up to 15 times improvement in Hive write performance with the Amazon EMR Hive zero-rename feature
Our customers use Apache Hive on Amazon EMR for large-scale data analytics and extract, transform, and load (ETL) jobs. Amazon EMR Hive uses Apache Tez as the default job execution engine, which creates Directed Acyclic Graphs (DAGs) to process data. Each DAG can contain multiple vertices from which tasks are created to run the application […]
Create a low-latency source-to-data lake pipeline using Amazon MSK Connect, Apache Flink, and Apache Hudi
August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. In recent years, there has been a shift from monolithic to microservices architecture. The microservices architecture makes applications easier to scale and quicker to develop, […]
How Cynamics built a high-scale, near-real-time, streaming AI inference system using AWS
This post is co-authored by Dr. Yehezkel Aviv, Co-Founder and CTO of Cynamics, and Sapir Kraus, Head of Engineering at Cynamics. Cynamics provides a new paradigm of cybersecurity: predicting attacks long before they hit by collecting small network samples (less than 1%), inferring from them how the full network (100%) behaves, and predicting threats […]
Doing more with less: Moving from transactional to stateful batch processing
Amazon processes hundreds of millions of financial transactions each day, including accounts receivable, accounts payable, royalties, amortizations, and remittances, from over a hundred different business entities. All of this data is sent to the eCommerce Financial Integration (eCFI) systems, where it is recorded in the subledger. Ensuring complete financial reconciliation at this scale is critical […]
How Belcorp decreased cost and improved reliability in its big data processing framework using Amazon EMR managed scaling
This is a guest post by Diego Benavides and Luis Bendezú, Senior Data Architects, Data Architecture Direction at Belcorp. Belcorp is one of the main consumer packaged goods (CPG) companies providing cosmetics products in the region for more than 50 years, with a presence in around 13 countries in North, Central, and South America (AMER). Born in Peru […]