AWS Big Data Blog

Category: HAQM Simple Storage Service (S3)

Migrate data from an on-premises Hadoop environment to HAQM S3 using S3DistCp with AWS Direct Connect

This post demonstrates how to migrate nearly any amount of data from an on-premises Apache Hadoop environment to HAQM Simple Storage Service (HAQM S3) by using S3DistCp on HAQM EMR with AWS Direct Connect. To transfer resources to a target EMR cluster, the traditional Hadoop DistCp must be run on the source cluster to move […]
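An S3DistCp copy like the one described above is typically submitted to EMR as a step running `command-runner.jar`. A minimal sketch in Python (the helper name, paths, and bucket are hypothetical, not taken from the post):

```python
# Sketch: build an EMR step definition that runs S3DistCp via command-runner.jar.
# Paths, bucket names, and the helper name are illustrative assumptions.

def build_s3distcp_step(src, dest, group_by=None):
    """Return an EMR step definition that copies data with s3-dist-cp."""
    args = ["s3-dist-cp", f"--src={src}", f"--dest={dest}"]
    if group_by:
        # Optionally concatenate small files matching a regex into larger objects.
        args.append(f"--groupBy={group_by}")
    return {
        "Name": "S3DistCp copy to HAQM S3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }

# Typical submission (requires a running EMR cluster and AWS credentials):
#   import boto3
#   emr = boto3.client("emr")
#   emr.add_job_flow_steps(
#       JobFlowId="j-XXXXXXXXXXXXX",
#       Steps=[build_s3distcp_step("hdfs:///user/hadoop/data", "s3://my-bucket/data/")],
#   )
```

Over Direct Connect, the step runs on the EMR cluster and streams data from the on-premises HDFS source into S3 in parallel.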

Disaster recovery strategies for HAQM MWAA – Part 2

HAQM Managed Workflows for Apache Airflow (HAQM MWAA) is a fully managed orchestration service that makes it straightforward to run data processing workflows at scale. HAQM MWAA takes care of operating and scaling Apache Airflow so you can focus on developing workflows. However, although HAQM MWAA provides high availability within an AWS Region through features […]

Modernize your data observability with HAQM OpenSearch Service zero-ETL integration with HAQM S3

We are excited to announce the general availability of HAQM OpenSearch Service zero-ETL integration with HAQM Simple Storage Service (HAQM S3) for domains running version 2.13 and above. The integration is a new way for customers to query operational logs in HAQM S3 and HAQM S3-based data lakes without needing to switch between tools to analyze operational data. By querying across OpenSearch Service and S3 datasets, you can evaluate multiple data sources to perform forensic analysis of operational and security events. The new integration supports AWS's zero-ETL vision: by enabling you to query your operational data directly, it reduces the operational complexity of duplicating data or managing multiple analytics tools, as well as costs and time to action.

Implement a full stack serverless search application using AWS Amplify, HAQM Cognito, HAQM API Gateway, AWS Lambda, and HAQM OpenSearch Serverless

Designing a full stack search application requires addressing numerous challenges to provide a smooth and effective user experience. This encompasses tasks such as integrating diverse data from various sources with distinct formats and structures, optimizing the user experience for performance and security, providing multilingual support, and optimizing for cost, operations, and reliability. HAQM OpenSearch Serverless […]

Simplify data lake access control for your enterprise users with trusted identity propagation in AWS IAM Identity Center, AWS Lake Formation, and HAQM S3 Access Grants

Many organizations use external identity providers (IdPs) such as Okta or Microsoft Azure Active Directory to manage their enterprise user identities. These users interact with and run analytical queries across AWS analytics services. To enable them to use the AWS services, their identities from the external IdP are mapped to AWS Identity and Access Management […]

Use Apache Iceberg in your data lake with HAQM S3, AWS Glue, and Snowflake

Customers are using AWS and Snowflake to develop purpose-built data architectures that provide the performance required for modern analytics and artificial intelligence (AI) use cases. Implementing these solutions requires data sharing between purpose-built data stores. This is why Snowflake and AWS are delivering enhanced support for Apache Iceberg to facilitate data interoperability between data services. Apache Iceberg is an open-source table format that provides reliability, simplicity, and high performance for large datasets, with transactional integrity between various processing engines.

Petabyte-scale log analytics with HAQM S3, HAQM OpenSearch Service, and HAQM OpenSearch Ingestion

Organizations often need to manage a high volume of data that is growing at an extraordinary rate. At the same time, they need to optimize operational costs to unlock the value of this data for timely insights, and to do so with consistent performance. With this massive data growth, data proliferation across your data stores, […]

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. However, altering schemas and table partitions in traditional data lakes can be a disruptive and time-consuming task, requiring renaming or recreating entire tables and reprocessing large datasets. […]
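Iceberg avoids that disruption by letting a job evolve partitions and schema in place through SQL. A sketch of the three operations named in the title, as a Glue ETL job might issue them through Spark SQL (table, column, and catalog names are hypothetical):

```python
# Sketch of the Iceberg operations an AWS Glue ETL job might run via Spark SQL.
# Catalog, table, and column names are illustrative assumptions.

# Merge: upsert incoming rows into the target Iceberg table.
MERGE_SQL = """
MERGE INTO glue_catalog.db.sales AS target
USING updates AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""

# Partition evolution: add a partition field without rewriting existing files.
EVOLVE_PARTITION_SQL = (
    "ALTER TABLE glue_catalog.db.sales ADD PARTITION FIELD days(order_ts)"
)

# Schema evolution: add a column in place, no table recreation needed.
EVOLVE_SCHEMA_SQL = (
    "ALTER TABLE glue_catalog.db.sales ADD COLUMN discount DECIMAL(10, 2)"
)

def run_iceberg_job(spark):
    """Run merge, partition evolution, and schema evolution in one Glue job."""
    for statement in (MERGE_SQL, EVOLVE_PARTITION_SQL, EVOLVE_SCHEMA_SQL):
        spark.sql(statement)
```

Because Iceberg tracks partition specs and schemas in metadata, both `ALTER TABLE` statements apply only to newly written data files; existing files remain valid and readable.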


Disaster recovery strategies for HAQM MWAA – Part 1

In the dynamic world of cloud computing, ensuring the resilience and availability of critical applications is paramount. Disaster recovery (DR) is the process by which an organization anticipates and addresses technology-related disasters. For organizations implementing critical workload orchestration using HAQM Managed Workflows for Apache Airflow (HAQM MWAA), it is crucial to have a DR plan […]

How HR&A uses HAQM Redshift spatial analytics on HAQM Redshift Serverless to measure digital equity in states across the US

In our increasingly digital world, affordable access to high-speed broadband is a necessity to fully participate in our society, yet there are still millions of American households without internet access. HR&A Advisors, a multi-disciplinary consultancy with extensive work in the broadband and digital equity space, is helping its state, county, and municipal clients deliver affordable internet […]