AWS Big Data Blog
Integrate Tableau and Okta with HAQM Redshift using AWS IAM Identity Center
This blog post is co-written with Sid Wray and Jake Koskela from Salesforce, and Adiascar Cisneros from Tableau. HAQM Redshift is a fast cloud data warehouse built to serve workloads at any scale. With HAQM Redshift as your data warehouse, you can run complex queries using sophisticated query optimization to quickly deliver results to […]
Introducing HAQM EMR on EKS with Apache Flink: A scalable, reliable, and efficient data processing platform
AWS recently announced that Apache Flink is generally available for HAQM EMR on HAQM Elastic Kubernetes Service (EKS). Apache Flink is a scalable, reliable, and efficient data processing framework that handles real-time streaming and batch workloads (but is most commonly used for real-time streaming). HAQM EMR on EKS is a deployment option for HAQM EMR […]
Architectural patterns for real-time analytics using HAQM Kinesis Data Streams, Part 2: AI applications
Welcome back to our exploration of architectural patterns for real-time analytics with HAQM Kinesis Data Streams! Kinesis Data Streams stands out as a versatile and robust solution for a wide range of real-time data use cases, from dashboarding to powering artificial intelligence (AI) applications. In this series, we […]
Get started with AWS Glue Data Quality dynamic rules for ETL pipelines
In this post, we show how to create an AWS Glue job that measures and monitors the data quality of a data pipeline using dynamic rules. We also show how to take action based on the data quality results.
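As a rough illustration of what such a job can look like (the database, table, evaluation context, and thresholds below are hypothetical, not taken from the post), a DQDL ruleset with dynamic rules compares the current run's metrics against the history of previous runs:

```python
# A minimal sketch of evaluating DQDL dynamic rules in a Glue ETL job.
# Database, table, context name, and thresholds are illustrative.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog table; substitute your own database and table.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="daily_orders"
)

# Dynamic rules reference earlier runs: last(3) is the trailing three
# evaluations, so the job flags drops relative to recent history.
ruleset = """
Rules = [
    RowCount > avg(last(3)) * 0.8,
    Completeness "order_id" >= last(1),
    ColumnValues "amount" > 0
]
"""

results = EvaluateDataQuality.apply(
    frame=orders,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "daily_orders_check",
        "enableDataQualityResultsPublishing": True,
    },
)
results.toDF().show(truncate=False)
```

The results DynamicFrame can then drive the "take action" step, for example by failing the job or publishing a notification when any rule outcome is not Passed.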
Entity resolution and fuzzy matches in AWS Glue using the Zingg open source library
In this post, we explore how to use Zingg’s entity resolution capabilities within an AWS Glue notebook, which you can later run as an extract, transform, and load (ETL) job. By integrating Zingg in your notebooks or ETL jobs, you can effectively address data governance challenges and provide consistent and accurate data across your organization.
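To give a feel for what that integration involves, here is a minimal sketch modeled on Zingg's own Python example scripts, runnable from a Spark-backed Glue notebook; the S3 paths, model ID, schema, and field definitions are illustrative assumptions, not values from the post:

```python
# A minimal sketch of driving Zingg's match phase from a Glue notebook.
# The wildcard imports mirror Zingg's published example scripts.
from zingg.client import *
from zingg.pipes import *

args = Arguments()
args.setModelId("customer360")
args.setZinggDir("s3://my-bucket/zingg-models/")  # where trained models live
args.setNumPartitions(4)
args.setLabelDataSampleSize(0.5)

# Which columns to compare, and how strictly each one should match.
args.setFieldDefinition([
    FieldDefinition("name", "string", MatchType.FUZZY),
    FieldDefinition("email", "string", MatchType.EXACT),
    FieldDefinition("city", "string", MatchType.FUZZY),
])

schema = "id string, name string, email string, city string"
args.setData(CsvPipe("customers", "s3://my-bucket/customers/", schema))
args.setOutput(CsvPipe("matched", "s3://my-bucket/matched/"))

# Run the match phase; the findTrainingData, label, and train phases
# follow the same pattern with a different phase name.
options = ClientOptions([ClientOptions.PHASE, "match"])
ZinggWithSpark(args, options).initAndExecute()
```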
Revolutionizing data querying: HAQM Redshift and Visual Studio Code integration
In today’s data-driven landscape, the efficiency and accessibility of querying tools play a crucial role in driving businesses forward. HAQM Redshift recently announced integration with Visual Studio Code, a development that transforms the way data practitioners engage with HAQM Redshift and reshapes their data management practices. This innovation not only unlocks […]
Analyze larger and more demanding time series workloads with HAQM OpenSearch Serverless
In today’s data-driven landscape, managing and analyzing vast amounts of data, especially logs, is crucial for organizations to derive insights and make informed decisions. However, handling this data efficiently presents a significant challenge, prompting organizations to seek scalable solutions without the complexity of infrastructure management. HAQM OpenSearch Serverless lets you run OpenSearch in the AWS […]
Optimize data layout by bucketing with HAQM Athena and AWS Glue to accelerate downstream queries
In this post, we discuss how to implement bucketing on AWS data lakes, including using the Athena CREATE TABLE AS SELECT (CTAS) statement and AWS Glue for Apache Spark. We also cover bucketing for Apache Iceberg tables.
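As a rough sketch of the CTAS approach (the table names and S3 locations below are hypothetical), a bucketed copy of an unbucketed table can be created with a single query submitted via boto3:

```python
# A minimal sketch: create a bucketed table from a hypothetical
# "raw_events" source table using an Athena CTAS statement.
import boto3

athena = boto3.client("athena")

# bucketed_by/bucket_count co-locate rows sharing a customer_id in the
# same files, so downstream joins and filters on that key scan less data.
ctas = """
CREATE TABLE analytics.events_bucketed
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/events_bucketed/',
    bucketed_by = ARRAY['customer_id'],
    bucket_count = 16
) AS
SELECT * FROM analytics.raw_events
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```

The bucket key and count are workload-dependent choices: pick the column your heaviest joins and filters use, and size bucket_count so individual files stay comfortably large.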
Run interactive workloads on HAQM EMR Serverless from HAQM EMR Studio
Starting with release 6.14, HAQM EMR Studio supports interactive analytics on HAQM EMR Serverless. You can now use EMR Serverless applications as the compute, in addition to HAQM EMR on EC2 clusters and HAQM EMR on EKS virtual clusters, to run JupyterLab notebooks from EMR Studio Workspaces. EMR Studio is an integrated development environment (IDE) […]
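As one hedged sketch of the setup (the application name is a placeholder, and the exact parameters should be checked against the current EMR Serverless API), an application can be created with interactive access enabled so that EMR Studio Workspaces can attach to it:

```python
# A minimal sketch, assuming boto3 credentials are configured: create an
# EMR Serverless Spark application with EMR Studio interactive access.
import boto3

emr_serverless = boto3.client("emr-serverless")

response = emr_serverless.create_application(
    name="studio-interactive-app",   # hypothetical name
    releaseLabel="emr-6.14.0",       # interactive support starts at 6.14
    type="SPARK",
    interactiveConfiguration={
        # Allow EMR Studio Workspaces to use this application as compute.
        "studioEnabled": True,
    },
)
print(response["applicationId"])
```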
Automate large-scale data validation using HAQM EMR and Apache Griffin
Many enterprises are migrating their on-premises data stores to the AWS Cloud. During data migration, a key requirement is to validate all the data that has been moved from source to target. This data validation is a critical step; if not done correctly, it may result in the failure of the entire project. However, developing […]