AWS Big Data Blog

Category: Amazon Simple Storage Service (S3)

Stream real-time data into Apache Iceberg tables in Amazon S3 using Amazon Data Firehose

In this post, we discuss how you can send real-time data streams into Iceberg tables on Amazon S3 by using Amazon Data Firehose. Amazon Data Firehose simplifies the process of streaming data by allowing users to configure a delivery stream, select a data source, and set Iceberg tables as the destination. Once set up, the Firehose stream is ready to deliver data.

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake (Apache Iceberg) using AWS Glue. We show how to build data pipelines using AWS Glue jobs, optimize them for both cost and performance, and implement schema evolution to automate manual tasks. To review the first part of the series, where we load SQL Server data into Amazon Simple Storage Service (Amazon S3) using AWS Database Migration Service (AWS DMS), see Modernize your legacy databases with AWS data lakes, Part 1: Migrate SQL Server using AWS DMS.

Simplify data ingestion from Amazon S3 to Amazon Redshift using auto-copy

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze your data using standard SQL and your existing business intelligence (BI) tools. Tens of thousands of customers today rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, making it […]

Apache HBase online migration to Amazon EMR

Apache HBase is an open source, non-relational distributed database developed as part of the Apache Software Foundation’s Hadoop project. HBase can run on Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3), and can host very large tables with billions of rows and millions of columns. The following are some typical use […]

Take manual snapshots and restore in a different domain spanning across various Regions and accounts in Amazon OpenSearch Service

This post provides a detailed walkthrough about how to efficiently capture and manage manual snapshots in OpenSearch Service. It covers the essential steps for taking snapshots of your data, implementing safe transfer across different AWS Regions and accounts, and restoring them in a new domain. This guide is designed to help you maintain data integrity and continuity while navigating complex multi-Region and multi-account environments in OpenSearch Service.

Unleash deeper insights with Amazon Redshift data sharing for data lake tables

Amazon Redshift now enables the secure sharing of data lake tables (also known as external tables or Amazon Redshift Spectrum tables) that are managed in the AWS Glue Data Catalog, as well as Redshift views referencing those data lake tables. By using granular access controls, data sharing in Amazon Redshift helps data owners maintain tight governance over who can access the shared information. In this post, we explore powerful use cases that demonstrate how you can enhance cross-team and cross-organizational collaboration, reduce overhead, and unlock new insights by using this innovative data sharing functionality.

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

Over the last year, Amazon Redshift added several performance optimizations for data lake queries across multiple areas of the query engine, such as rewrite, planning, scan execution, and consumption of AWS Glue Data Catalog column statistics. In this post, we highlight the performance improvements we observed using industry-standard TPC-DS benchmarks. Overall execution time of the TPC-DS 3 TB benchmark improved by 3x. Some of the queries in our benchmark experienced up to a 12x speedup.

The AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables

The AWS Glue Data Catalog now enhances managed table optimization of Apache Iceberg tables by automatically removing data files that are no longer needed. Along with the Glue Data Catalog’s automated compaction feature, these storage optimizations can help you reduce metadata overhead, control storage costs, and improve query performance. Iceberg creates a new version called […]

Use AWS Glue to streamline SFTP data processing

In this blog post, we explore how to use the SFTP Connector for AWS Glue from the AWS Marketplace to efficiently process data from Secure File Transfer Protocol (SFTP) servers into Amazon Simple Storage Service (Amazon S3), further empowering your data analytics and insights.

Stream data to Amazon S3 for real-time analytics using the Oracle GoldenGate S3 handler

Modern business applications rely on timely and accurate data with increasing demand for real-time analytics. There is a growing need for efficient and scalable data storage solutions. Data at times is stored in different datasets and needs to be consolidated before meaningful and complete insights can be drawn from the datasets. This is where replication […]