AWS Big Data Blog

Category: HAQM Simple Storage Service (S3)

Creating a source to Lakehouse data replication pipe using Apache Hudi, AWS Glue, AWS DMS, and HAQM Redshift

February 2021 update – Please refer to the post Writing to Apache Hudi tables using AWS Glue Custom Connector to learn about an easier mechanism to write to Hudi tables using AWS Glue Custom Connector. In this post, we include the modified Apache Hudi JARs as an external dependency. The AWS Glue Custom Connector feature […]

Handling data erasure requests in your data lake with HAQM S3 Find and Forget

February 2024: This post was reviewed and updated for accuracy. Data lakes are a popular choice for organizations to store data around their business activities. Best practice design of data lakes impose that data is immutable once stored, but new regulations such as the European General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), […]

Apply record level changes from relational databases to HAQM S3 data lake using Apache Hudi on HAQM EMR and AWS Database Migration Service

Data lakes give organizations the ability to harness data from multiple sources in less time. Users across different roles are now empowered to collaborate and analyze data in different ways, leading to better, faster decision-making. HAQM Simple Storage Service (HAQM S3) is the highly performant object storage service for structured and unstructured data and the […]

How to delete user data in an AWS data lake

General Data Protection Regulation (GDPR) is an important aspect of today’s technology world, and processing data in compliance with GDPR is a necessity for those who implement solutions within the AWS public cloud. One article of GDPR is the “right to erasure” or “right to be forgotten” which may require you to implement a solution […]

Streaming data from HAQM S3 to HAQM Kinesis Data Streams using AWS DMS

August 30, 2023: HAQM Kinesis Data Analytics has been renamed to HAQM Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. December 2022: This post was reviewed for accuracy. Stream processing is very useful in use cases where we need to detect a problem quickly and improve the […]

Analyzing HAQM S3 server access logs using HAQM OpenSearch Service

This blog post was last reviewed and updated April, 2022. When you use HAQM Simple Storage Service (HAQM S3) to store corporate data and host websites, you need additional logging to monitor access to your data and the performance of your application. An effective logging solution enhances security and improves the detection of security incidents. […]

Anonymize and manage data in your data lake with HAQM Athena and AWS Lake Formation

Most organizations have to comply with regulations when dealing with their customer data. For that reason, datasets that contain personally identifiable information (PII) is often anonymized. A common example of PII can be tables and columns that contain personal information about an individual (such as first name and last name) or tables with columns that, if joined with another table, can trace back to an individual. You can use AWS Analytics services to anonymize your datasets. In this post, I describe how to use HAQM Athena to anonymize a dataset.  You can then use AWS Lake Formation to provide the right access to the right personas.

How Wind Mobility built a serverless data architecture

We parse through millions of scooter and user events generated daily (over 300 events per second) to extract actionable insight. We selected AWS Glue to perform this task. Our primary ETL job reads the newly added raw event data from HAQM S3, processes it using Apache Spark, and writes the results to our HAQM Redshift data warehouse. AWS Glue plays a critical role in our ability to scale on demand. After careful evaluation and testing, we concluded that AWS Glue ETL jobs meet all our needs and free us from procuring and managing infrastructure.

Extend your HAQM Redshift Data Warehouse to your Data Lake

HAQM Redshift is a fast, fully managed, cloud-native data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence tools. Many companies today are using HAQM Redshift to analyze data and perform various transformations on the data. However, as data continues to grow and become […]

Running a high-performance SAS Grid Manager cluster on AWS with HAQM FSx for Lustre

SAS® is a software provider of data science and analytics used by enterprises and government organizations. SAS Grid is a highly available, fast processing analytics platform that offers centralized management that balances workloads across different compute nodes. This application suite is capable of data management, visual analytics, governance and security, forecasting and text mining, statistical […]