AWS Big Data Blog
Category: Compute
Enhance your workload resilience with new Amazon EMR instance fleet features
Amazon EMR has introduced new features for instance fleets that address critical challenges in big data operations. This post explores how these innovations improve cluster resilience, scalability, and efficiency, enabling you to build more robust data processing architectures on AWS.
How Open Universities Australia modernized their data platform and significantly reduced their ETL costs with AWS Cloud Development Kit and AWS Step Functions
At Open Universities Australia (OUA), we empower students to explore a vast array of degrees from renowned Australian universities, all delivered through online learning. In this post, we show you how we used AWS services to replace our existing third-party ETL tool, improving the team’s productivity and producing a significant reduction in our ETL operational costs.
Hybrid big data analytics with Amazon EMR on AWS Outposts
In this post, we dive into the transformative features of EMR on Outposts, showcasing its flexibility as a native hybrid data analytics service that allows seamless data access and processing both on premises and in the cloud.
Building end-to-end data lineage for one-time and complex queries using Amazon Athena, Amazon Redshift, Amazon Neptune and dbt
In this post, we use dbt for data modeling on both Amazon Athena and Amazon Redshift. dbt on Athena supports real-time queries, while dbt on Amazon Redshift handles complex queries, unifying the development language and significantly reducing the technical learning curve. Using a single dbt modeling language not only simplifies the development process but also automatically generates consistent data lineage information. This approach offers robust adaptability, easily accommodating changes in data structures.
Run Apache XTable in AWS Lambda for background conversion of open table formats
In this post, we explore how Apache XTable, combined with the AWS Glue Data Catalog, enables background conversions between open table formats residing on Amazon S3-based data lakes, with minimal to no changes to existing pipelines, in a scalable and cost-effective way.
Run high-availability long-running clusters with Amazon EMR instance fleets
In this post, we demonstrate how to launch a high availability instance fleet cluster using the newly redesigned Amazon EMR console, as well as using an AWS CloudFormation template. We also go over the basic concepts of Hadoop high availability, EMR instance fleets, the benefits and trade-offs of high availability, and best practices for running resilient EMR clusters.
Integrate custom applications with AWS Lake Formation – Part 1
In this two-part series, we show how to integrate custom applications or data processing engines with Lake Formation using the third-party services integration feature. In this post, we dive deep into the required Lake Formation and AWS Glue APIs. We walk through the steps to enforce Lake Formation policies within custom data applications. As an example, we present a sample Lake Formation integrated application implemented using AWS Lambda.
Integrate custom applications with AWS Lake Formation – Part 2
In this two-part series, we show how to integrate custom applications or data processing engines with Lake Formation using the third-party services integration feature. In this post, we explore how to deploy a fully functional web client application, built with JavaScript/React through AWS Amplify (Gen 1), that uses the same Lambda function as the backend. The provisioned web application provides a user-friendly and intuitive way to view the Lake Formation policies that have been enforced.
Analyze Amazon EMR on Amazon EC2 cluster usage with Amazon Athena and Amazon QuickSight
In this post, we guide you through deploying a comprehensive solution in your Amazon Web Services (AWS) environment to analyze Amazon EMR on EC2 cluster usage. By using this solution, you will gain a deep understanding of resource consumption and associated costs of individual applications running on your EMR cluster.
Enrich your serverless data lake with Amazon Bedrock
Organizations are collecting and storing vast amounts of structured and unstructured data like reports, whitepapers, and research documents. By consolidating this information, analysts can discover and integrate data from across the organization, creating valuable data products based on a unified dataset. This post shows how to integrate Amazon Bedrock with the AWS Serverless Data Analytics Pipeline architecture using Amazon EventBridge, AWS Step Functions, and AWS Lambda to automate a wide range of data enrichment tasks in a cost-effective and scalable manner.