AWS Big Data Blog

Category: Amazon EMR

New features from Apache Hudi 0.7.0 and 0.8.0 available on Amazon EMR

Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and data pipeline development by providing record-level insert, update, and delete capabilities. This record-level capability is helpful if you’re building your data lakes on Amazon Simple Storage Service (Amazon S3) or Hadoop Distributed File System (HDFS). You can use it […]

How Goldman Sachs built persona tagging using Apache Flink on Amazon EMR

The Global Investment Research (GIR) division at Goldman Sachs is responsible for providing research and insights to the firm’s clients in the equity, fixed income, currency, and commodities markets. One of the long-standing goals of the GIR team is to deliver a personalized experience and relevant research content to their research users. Previously, in order to customize […]

Announcing Amazon EMR Serverless (Preview): Run big data applications without managing servers

Today we’re happy to announce Amazon EMR Serverless, a new option in Amazon EMR that makes it easy and cost-effective for data engineers and analysts to run petabyte-scale data analytics in the cloud. With EMR Serverless, you can run applications built using open-source frameworks such as Apache Spark and Hive without having to configure, manage, […]

Provide data reliability in Amazon Redshift at scale using the Great Expectations library

Ensuring data reliability is one of the key objectives of maintaining data integrity and is crucial for building data trust across an organization. Data reliability means that the data is complete and accurate. It’s the catalyst for delivering trusted data analytics and insights. Incomplete or inaccurate data leads business leaders and data analysts to make […]

Improve Amazon Athena query performance using AWS Glue Data Catalog partition indexes

The AWS Glue Data Catalog provides partition indexes to accelerate queries on highly partitioned tables. In the post Improve query performance using AWS Glue partition indexes, we demonstrated how partition indexes reduce the time it takes to fetch partition information during the planning phase of queries run on Amazon EMR, Amazon Redshift Spectrum, and AWS […]

Now Available: Updated guidance on the Data Analytics Lens for AWS Well-Architected Framework

Nearly all businesses today require some form of data analytics processing, from auditing user access to generating sales reports. For all your analytics needs, the Data Analytics Lens for AWS Well-Architected Framework provides prescriptive guidance to help you assess your workloads and identify best practices aligned to the AWS Well-Architected Pillars: Operational Excellence, Security, Reliability, […]

Copy large datasets from Google Cloud Storage to Amazon S3 using Amazon EMR

Data migration between Google Cloud Storage (GCS) and Amazon S3 is possible using Hadoop’s native support for S3 object storage together with a Google-provided Hadoop connector for GCS. This post demonstrates how to configure an EMR cluster for DistCp and S3DistCp, goes over the settings and parameters for both tools, performs a copy of a 9.4 TB test dataset, and compares the performance of the two copies.

Accelerate large-scale data migration validation using PyDeequ

March 2023: You can now use AWS Glue Data Quality to measure and manage the quality of your data. AWS Glue Data Quality is built on Deequ and offers a simplified user experience for customers who want to use this open-source package. Refer to the blog and documentation for additional details. Many enterprises are migrating their […]

How Viasat scaled their big data applications by migrating to Amazon EMR

This post is co-written with Manoj Gundawar from Viasat. Viasat is a satellite internet service provider based in Carlsbad, CA, with operations across the United States and worldwide. Viasat’s ambition is to be the first truly global, scalable, broadband service provider with a mission to deliver connections that can change the world. Viasat operates across […]

Authorize SparkSQL data manipulation on Amazon EMR using Apache Ranger

This post was last updated July 2022. With Amazon EMR 6.7, all Apache Spark DDL statements are now supported, except for CREATE VIEW. For details, see the section under “limitations”. NOTE: You will need to redeploy the Spark service definition (link) on your Apache Ranger server. Instructions on how to redeploy can be found here. With Amazon […]