AWS Big Data Blog
Category: Analytics
Implement data quality checks on Amazon Redshift data assets and integrate with Amazon DataZone
In this post, we show how to capture the data quality metrics for data assets produced in Amazon Redshift. With Amazon DataZone, the data owner can directly import the technical metadata of Redshift database tables and views to the Amazon DataZone project’s inventory. Because these data assets are imported directly into Amazon DataZone, they bypass the AWS Glue Data Catalog, which creates a gap in data quality integration. This post proposes a solution to enrich the Amazon Redshift data asset with data quality scores and KPI metrics.
Improve the resilience of Amazon Managed Service for Apache Flink application with system-rollback feature
This post explores how to use the system-rollback feature in Managed Service for Apache Flink. We discuss how this functionality improves your application’s resilience by providing a highly available Flink application. Through an example, you will also learn how to use the APIs to gain more visibility into the application’s operations.
Organize content across business units with enterprise-wide data governance using Amazon DataZone domain units and authorization policies
Amazon DataZone has announced a set of new data governance capabilities (domain units and authorization policies) that enable you to create business unit-level or team-level organization and manage policies according to your business needs. In this post, we discuss common approaches to structuring domain units, use cases that customers in the healthcare and life sciences (HCLS) industry encounter, and how to get started with the new domain units and authorization policies features from Amazon DataZone.
Use AWS Glue to streamline SFTP data processing
In this blog post, we explore how to use the SFTP Connector for AWS Glue from the AWS Marketplace to efficiently process data from Secure File Transfer Protocol (SFTP) servers into Amazon Simple Storage Service (Amazon S3), further empowering your data analytics and insights.
Automate Amazon Redshift Advisor recommendations with email alerts using an API
Amazon Redshift Advisor offers recommendations about optimizing your Redshift cluster performance and helps you save on operating costs. In this post, we show you how to use the ListRecommendations API to set up email notifications for Advisor recommendations on your Redshift cluster. These recommendations, such as identifying tables that should be vacuumed to sort the data or finding table columns that are candidates for compression, can help improve performance and save costs.
Migrate Amazon Redshift from DC2 to RA3 to accommodate increasing data volumes and analytics demands
As businesses strive to make informed decisions, the amount of data being generated and required for analysis is growing exponentially. Dafiti, an ecommerce company that recognizes the importance of using data to drive strategic decision-making, is no exception to this trend. With the ever-increasing volume of data available, Dafiti faces the challenge of effectively managing this information and extracting valuable insights from it to gain a competitive edge and make data-driven decisions that align with company business objectives. The growing need for storage space to maintain data from over 90 sources and the functionality available on the new Amazon Redshift node types, including managed storage, data sharing, and zero-ETL integrations, led us to migrate from DC2 to RA3 nodes. In this post, we share how we handled the migration process and provide further impressions of our experience.
How AppsFlyer modernized their interactive workload by moving to Amazon Athena and saved 80% of costs
AppsFlyer develops a leading privacy-focused measurement solution that enables marketers to gauge the effectiveness of their marketing activities and integrates them with the broader marketing world, handling 100 billion events every day. This post explores how AppsFlyer modernized their Audiences Segmentation product by using Amazon Athena.
Query AWS Glue Data Catalog views using Amazon Athena and Amazon Redshift
Glue Data Catalog views is a new feature of the AWS Glue Data Catalog that customers can use to create a common view schema and a single metadata container. The container can hold view definitions in different dialects, so the same view can be used across engines such as Amazon Redshift and Amazon Athena. In this blog post, we show how you can define and query a Data Catalog view on top of open source table formats such as Iceberg across Athena and Amazon Redshift. We also show you the configurations needed to restrict access to the underlying database and tables. To follow along, we have provided an AWS CloudFormation template.
Introducing AWS Glue Data Quality anomaly detection
We are excited to announce the general availability of anomaly detection capabilities in AWS Glue Data Quality. In this post, we demonstrate how this feature works with an example. We provide an AWS CloudFormation template to deploy this setup and experiment with this feature.
OpenSearch optimized instance (OR1) is game changing for indexing performance and cost
Amazon OpenSearch Service securely unlocks real-time search, monitoring, and analysis of business and operational data for use cases like application monitoring, log analytics, observability, and website search. In this post, we examine the OR1 instance type, an OpenSearch optimized instance introduced on November 29, 2023. OR1 is an instance type for Amazon OpenSearch Service that […]