AWS Big Data Blog

Category: HAQM Simple Storage Service (S3)

Prepare and load HAQM S3 data into Teradata using AWS Glue through its native connector for Teradata Vantage

In this post, we explore how to use the AWS Glue native connector for Teradata Vantage to streamline data integrations and unlock the full potential of your data. Businesses often rely on HAQM Simple Storage Service (HAQM S3) for storing large amounts of data from various data sources in a cost-effective and secure manner. For […]

Use HAQM EMR with S3 Access Grants to scale Spark access to HAQM S3

HAQM EMR is pleased to announce an integration with HAQM Simple Storage Service (HAQM S3) Access Grants that simplifies HAQM S3 permission management and allows you to enforce granular access at scale. With this integration, you can scale job-based HAQM S3 access for Apache Spark jobs across all HAQM EMR deployment options and enforce granular HAQM […]


Unstructured data management and governance using AWS AI/ML and analytics services

In this post, we discuss how AWS can help you successfully address the challenges of extracting insights from unstructured data. We discuss various design patterns and architectures for extracting and cataloging valuable insights from unstructured data using AWS. Additionally, we show how to use AWS AI/ML services for analyzing unstructured data.

How healthcare organizations can analyze and create insights using price transparency data

In recent years, there has been a growing emphasis on price transparency in the healthcare industry. Under the Transparency in Coverage (TCR) rule, hospitals and payors are required to publish their pricing data in a machine-readable format. With this move, patients can compare prices between different hospitals and make informed healthcare decisions. For more information, refer to […]

Process and analyze highly nested and large XML files using AWS Glue and HAQM Athena

In today’s digital age, data is at the heart of every organization’s success. One of the most commonly used formats for exchanging data is XML. Analyzing XML files is crucial for several reasons. Firstly, XML files are used in many industries, including finance, healthcare, and government. Analyzing XML files can help organizations gain insights into […]
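To make the idea concrete, here is a minimal sketch (not taken from the post) of reading nested XML into an AWS Glue DynamicFrame and flattening it so HAQM Athena can query it; the S3 paths and the "record" rowTag are placeholders you would replace with your own.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read nested XML from S3; rowTag marks the element that represents one row.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/raw/xml/"]},
    format="xml",
    format_options={"rowTag": "record"},
)

# Flatten deeply nested fields into relational tables.
flat = dyf.relationalize("root", "s3://my-bucket/tmp/")

# Write the flattened root table as Parquet for Athena to query.
glue_context.write_dynamic_frame.from_options(
    frame=flat.select("root"),
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/xml_parquet/"},
    format="parquet",
)
```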


Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS has invested in native service integration with Apache Hudi and published technical content to enable you to use Apache Hudi with AWS Glue (for example, refer to Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 1: Getting Started). In AWS ProServe-led customer engagements, the use cases we work on usually come with technical complexity and scalability requirements. In this post, we discuss a common use case in relation to operational data processing and the solution we built using Apache Hudi and AWS Glue.
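As a rough illustration of the kind of Hudi upsert an AWS Glue for Apache Spark job performs (the table, key, and path names below are placeholders, not the ones from the post):

```python
from pyspark.sql import SparkSession

# Hudi requires the Kryo serializer; in Glue, also enable --datalake-formats hudi.
spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Hypothetical staging data to merge into the lake table.
df = spark.read.parquet("s3://my-bucket/staging/orders/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert the batch into a Hudi table on HAQM S3.
df.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://my-bucket/lake/orders/"
)
```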

Extracting key insights from HAQM S3 access logs with AWS Glue for Ray

This post presents an architecture that allows customers to extract key insights from HAQM S3 access logs at scale. We partition and format the server access logs with AWS Glue, a serverless data integration service, to generate a catalog for the access logs and create dashboards for insights.
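As a standalone illustration of the parsing and partitioning step (independent of the Glue for Ray job described in the post), the following sketch parses an S3 server access log record with a regular expression and derives a date partition key; the pattern covers only the leading fields of the documented log format.

```python
import re
from datetime import datetime

# Leading fields of an S3 server access log record: bucket owner, bucket, time,
# remote IP, requester, request ID, operation, key; the rest is kept as one string.
LOG_PATTERN = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<remote_ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) (?P<rest>.*)$'
)

def parse_access_log_line(line: str) -> dict:
    """Parse one access log record and add a date partition key."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("unrecognized access log record")
    record = match.groupdict()
    ts = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["partition_date"] = ts.strftime("%Y-%m-%d")  # e.g. 2019-02-06
    return record

# Example usage with a truncated sample record:
sample = ('79a59df900b949e55d96a1e6 awsexamplebucket1 [06/Feb/2019:00:00:38 +0000] '
          '192.0.2.3 79a59df900b949e55d96a1e6 3E57427F3EXAMPLE REST.GET.VERSIONING '
          '- "GET /awsexamplebucket1?versioning HTTP/1.1" 200 - 113 - 7 - "-" "S3Console/0.4" -')
print(parse_access_log_line(sample)["partition_date"])
```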

Query your Iceberg tables in data lake using HAQM Redshift

HAQM Redshift supports querying a wide variety of data formats, such as CSV, JSON, Parquet, and ORC, and table formats like Apache Hudi and Delta Lake. HAQM Redshift also supports querying nested data with complex data types such as struct, array, and map. With this capability, HAQM Redshift extends your petabyte-scale data warehouse to an exabyte-scale data lake on HAQM S3 in a cost-effective manner. Apache Iceberg is the latest table format supported by HAQM Redshift. In this post, we show you how to query Iceberg tables using HAQM Redshift, and explore Iceberg support and options.
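As a quick illustration (with placeholder workgroup, database, role, and table names), the following sketch uses the HAQM Redshift Data API to register an AWS Glue Data Catalog database as an external schema and then query an Iceberg table through it:

```python
import boto3

client = boto3.client("redshift-data")

# One-time setup: expose the data lake database that holds the Iceberg table
# as an external schema in Redshift.
client.execute_statement(
    WorkgroupName="my-workgroup",
    Database="dev",
    Sql=(
        "CREATE EXTERNAL SCHEMA IF NOT EXISTS lake "
        "FROM DATA CATALOG DATABASE 'iceberg_db' "
        "IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftSpectrumRole';"
    ),
)

# Query the Iceberg table through the external schema.
resp = client.execute_statement(
    WorkgroupName="my-workgroup",
    Database="dev",
    Sql="SELECT order_id, status FROM lake.orders WHERE order_date = '2023-11-01' LIMIT 10;",
)
print(resp["Id"])  # poll describe_statement / get_statement_result with this ID
```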

Build an ETL process for HAQM Redshift using HAQM S3 Event Notifications and AWS Step Functions

In this post, we discuss how to build and orchestrate an ETL process for HAQM Redshift in a few steps, using HAQM S3 Event Notifications to automatically verify source data upon arrival and send notifications in specific cases, and AWS Step Functions to orchestrate the data pipeline. The solution can serve as a starting point for teams that want to build an event-driven data pipeline from data source to data warehouse, track each phase, and respond to failures quickly. Alternatively, you can use HAQM Redshift auto-copy from HAQM S3 to simplify data loading from HAQM S3 into HAQM Redshift.
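One common way to wire the trigger (a minimal sketch, not the full pipeline from the post; the state machine ARN is a placeholder) is an AWS Lambda function that receives the S3 Event Notification and starts a Step Functions execution for each new object:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical state machine that validates the file and runs the Redshift load.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:redshift-etl"

def handler(event, context):
    # S3 Event Notifications invoke this function with one or more records;
    # forward each object's bucket and key to a Step Functions execution.
    for record in event.get("Records", []):
        detail = {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
        }
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps(detail),
        )
    return {"status": "started"}
```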

Automate the archive and purge data process for HAQM RDS for PostgreSQL using pg_partman, HAQM S3, and AWS Glue

The post Archive and Purge Data for HAQM RDS for PostgreSQL and HAQM Aurora with PostgreSQL Compatibility using pg_partman and HAQM S3 proposes data archival as a critical part of data management and shows how to efficiently use PostgreSQL’s native range partitioning to partition current (hot) data with pg_partman and archive historical (cold) data in […]
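As a rough sketch of the two building blocks (placeholder connection details, table names, and bucket; the create_parent call uses pg_partman 4.x syntax), you can create the partition set and export a cold partition to HAQM S3 like this:

```python
import psycopg2

# Hypothetical RDS for PostgreSQL connection; replace with your own endpoint and credentials.
conn = psycopg2.connect(host="my-rds-endpoint", dbname="appdb", user="admin", password="***")
conn.autocommit = True

with conn.cursor() as cur:
    # Turn the parent table into a monthly range-partitioned set managed by pg_partman.
    cur.execute(
        "SELECT partman.create_parent("
        "p_parent_table := 'public.events', "
        "p_control := 'created_at', "
        "p_type := 'native', "
        "p_interval := 'monthly', "
        "p_premake := 3);"
    )

    # Export one cold partition (hypothetical child table name) to HAQM S3
    # with the aws_s3 extension before detaching or dropping it.
    cur.execute(
        "SELECT * FROM aws_s3.query_export_to_s3("
        "'SELECT * FROM public.events_p2023_01', "
        "aws_commons.create_s3_uri('my-archive-bucket', 'events/2023/01/part', 'us-east-1'), "
        "options := 'format csv');"
    )
```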