AWS Big Data Blog
Category: Advanced (300)
HAQM MWAA best practices for managing Python dependencies
Data engineers and data scientists at many organizations use HAQM Managed Workflows for Apache Airflow (HAQM MWAA) as a central orchestration platform for running data pipelines and machine learning (ML) workloads. To support these pipelines, they often require additional Python packages, such as Apache Airflow Providers. For example, a pipeline may require the Snowflake provider […]
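As an illustration of the pattern the post describes, here is a minimal, hypothetical Airflow DAG that depends on the Snowflake provider; it assumes apache-airflow-providers-snowflake is pinned in the environment's requirements.txt and that a connection named snowflake_default is configured in MWAA:

```python
# Hypothetical DAG illustrating why a pipeline may need an extra provider package;
# assumes apache-airflow-providers-snowflake is pinned in the MWAA requirements.txt
# and a Snowflake connection named "snowflake_default" exists (Airflow 2.4+ syntax).
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="snowflake_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Runs a query against Snowflake using the provider's operator
    load_orders = SnowflakeOperator(
        task_id="load_orders",
        snowflake_conn_id="snowflake_default",
        sql="SELECT COUNT(*) FROM orders;",
    )
```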
Build a real-time streaming generative AI application using HAQM Bedrock, HAQM Managed Service for Apache Flink, and HAQM Kinesis Data Streams
Data streaming enables generative AI to take advantage of real-time data and provide businesses with rapid insights. This post shows how to integrate generative AI capabilities into a streaming architecture on AWS, using managed services such as HAQM Managed Service for Apache Flink and HAQM Kinesis Data Streams to process streaming data and HAQM Bedrock to provide the generative AI capabilities. We include a reference architecture and a step-by-step guide on infrastructure setup, along with sample code for implementing the solution with the AWS Cloud Development Kit (AWS CDK). You can find the code to try it out yourself in the GitHub repo.
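To give a feel for the core interaction (not the post's full Flink/CDK solution), the following sketch polls one batch of records from a hypothetical Kinesis stream and summarizes each with HAQM Bedrock; the stream name is a placeholder and model access is assumed:

```python
# A minimal sketch: read a batch from a hypothetical Kinesis stream and
# summarize each record with HAQM Bedrock. Assumes the stream "reviews-stream"
# exists and Claude model access is enabled in the account.
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

shard = kinesis.list_shards(StreamName="reviews-stream")["Shards"][0]
iterator = kinesis.get_shard_iterator(
    StreamName="reviews-stream",
    ShardId=shard["ShardId"],
    ShardIteratorType="LATEST",
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",  # assumption: model access is granted
        body=json.dumps({
            "prompt": f"\n\nHuman: Summarize: {record['Data'].decode()}\n\nAssistant:",
            "max_tokens_to_sample": 200,
        }),
    )
    print(json.loads(response["body"].read())["completion"])
```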
Build multimodal search with HAQM OpenSearch Service
Multimodal search enables both text and image search capabilities, transforming how users access data through search applications. Consider building an online fashion retail store: you can enhance the users’ search experience with a visually appealing application that customers can use not only to search using text but also to upload an image depicting a […]
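A minimal sketch of one possible flow, assuming an OpenSearch k-NN index named products with a knn_vector field vector_embedding, plus access to the HAQM Titan Multimodal Embeddings model in Bedrock (authentication details omitted):

```python
# Embed a query image with HAQM Titan Multimodal Embeddings, then run a k-NN
# query against OpenSearch. Index name, field name, and endpoint are assumptions.
import base64
import json

import boto3
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Embed the query image (text can be embedded the same way via "inputText")
with open("red-dress.jpg", "rb") as f:
    body = json.dumps({"inputImage": base64.b64encode(f.read()).decode()})
response = bedrock.invoke_model(modelId="amazon.titan-embed-image-v1", body=body)
embedding = json.loads(response["body"].read())["embedding"]

# k-NN search against the multimodal index (auth config omitted for brevity)
client = OpenSearch(hosts=[{"host": "my-domain-endpoint", "port": 443}], use_ssl=True)
results = client.search(
    index="products",
    body={"size": 5, "query": {"knn": {"vector_embedding": {"vector": embedding, "k": 5}}}},
)
```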
How Swisscom automated HAQM Redshift as part of their One Data Platform solution using AWS CDK – Part 2
In this series, we talk about Swisscom’s journey of automating HAQM Redshift provisioning as part of the Swisscom One Data Platform (ODP) solution using the AWS Cloud Development Kit (AWS CDK), and we provide code snippets and other useful references. In Part 1, we did a deep dive on provisioning a secure and compliant […]
How Swisscom automated HAQM Redshift as part of their One Data Platform solution using AWS CDK – Part 1
In this post, we deep dive into provisioning a secure and compliant Redshift cluster using the AWS CDK and discuss best practices for secret rotation. We also explain how Swisscom used AWS CDK custom resources to automate the creation of dynamic user groups that are relevant for the AWS Identity and Access Management (IAM) roles matching different job functions.
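As a simplified sketch (not Swisscom's actual stack), the following CDK code provisions a Redshift cluster whose admin credentials are generated in AWS Secrets Manager, the precondition for the secret rotation practices the post discusses; all names are illustrative:

```python
# Simplified CDK v2 sketch: a Redshift cluster with admin credentials generated
# in Secrets Manager instead of hardcoded. Names and sizing are illustrative.
import aws_cdk as cdk
from aws_cdk import aws_redshift as redshift
from aws_cdk import aws_secretsmanager as secretsmanager
from constructs import Construct


class RedshiftStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Generate the admin credentials so they can later be rotated
        admin_secret = secretsmanager.Secret(
            self, "AdminSecret",
            generate_secret_string=secretsmanager.SecretStringGenerator(
                secret_string_template='{"username": "admin"}',
                generate_string_key="password",
                exclude_punctuation=True,
            ),
        )

        redshift.CfnCluster(
            self, "Cluster",
            cluster_type="multi-node",
            number_of_nodes=2,
            node_type="ra3.xlplus",
            db_name="odp",
            master_username=admin_secret.secret_value_from_json("username").unsafe_unwrap(),
            master_user_password=admin_secret.secret_value_from_json("password").unsafe_unwrap(),
        )
```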
Design a data mesh pattern for HAQM EMR-based data lakes using AWS Lake Formation with Hive metastore federation
In this post, we delve into the key aspects of using HAQM EMR for modern data management, covering topics such as data governance, data mesh deployment, and streamlined data discovery. One of the key challenges in modern big data management is facilitating efficient data sharing and access control across multiple EMR clusters. Organizations have multiple […]
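A minimal sketch of the kind of Lake Formation grant that governs such cross-cluster access; the role ARN, database, and table names are hypothetical:

```python
# Grant SELECT on one table to an IAM role assumed by EMR jobs, so access is
# governed centrally by Lake Formation. All identifiers below are placeholders.
import boto3

lakeformation = boto3.client("lakeformation", region_name="us-east-1")

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/emr-analyst"},
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    Permissions=["SELECT"],
)
```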
Simplify data lake access control for your enterprise users with trusted identity propagation in AWS IAM Identity Center, AWS Lake Formation, and HAQM S3 Access Grants
Many organizations use external identity providers (IdPs) such as Okta or Microsoft Azure Active Directory to manage their enterprise user identities. These users interact with and run analytical queries across AWS analytics services. To enable them to use AWS services, their identities from the external IdP are mapped to AWS Identity and Access Management […]
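As a sketch of the S3 Access Grants piece, the following grants an IAM Identity Center user read access to a prefix; the account ID, location ID, and user identifier are placeholders:

```python
# Grant an Identity Center (directory) user READ access to an S3 prefix via
# S3 Access Grants. Account ID, location ID, and user ID are placeholders.
import boto3

s3control = boto3.client("s3control", region_name="us-east-1")

s3control.create_access_grant(
    AccountId="111122223333",
    AccessGrantsLocationId="default",
    AccessGrantsLocationConfiguration={"S3SubPrefix": "analytics-data/*"},
    Grantee={
        "GranteeType": "DIRECTORY_USER",  # an IAM Identity Center user
        "GranteeIdentifier": "a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
    },
    Permission="READ",
)
```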
Build a decentralized semantic search engine on heterogeneous data stores using autonomous agents
In this post, we show how to build a Q&A bot with RAG (Retrieval Augmented Generation). RAG uses data sources like HAQM Redshift and HAQM OpenSearch Service to retrieve documents that augment the LLM prompt. To get data from HAQM Redshift, we use Anthropic Claude 2.0 on HAQM Bedrock, summarizing the final response based on predefined prompt templates from LangChain. To get data from HAQM OpenSearch Service, we chunk the source data and convert the chunks to vectors using the HAQM Titan Text Embeddings model.
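A condensed sketch of the retrieve-then-generate step under those model choices; the question and context are placeholders, and the retrieval itself (OpenSearch k-NN or Redshift queries) is omitted:

```python
# Embed the question with HAQM Titan Text Embeddings, then answer it with
# Claude 2 using a prompt template filled with retrieved context (stubbed here).
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# 1. Embed the user question; this vector would drive the OpenSearch k-NN
#    retrieval, which is omitted from this sketch.
question = "What was last quarter's revenue?"
resp = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v1",
    body=json.dumps({"inputText": question}),
)
query_vector = json.loads(resp["body"].read())["embedding"]

# 2. Augment a prompt template with the retrieved context and ask Claude 2.
context = "..."  # placeholder for documents returned by the retrieval step
prompt = f"\n\nHuman: Answer using only this context:\n{context}\n\nQuestion: {question}\n\nAssistant:"
resp = bedrock.invoke_model(
    modelId="anthropic.claude-v2",
    body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 300}),
)
print(json.loads(resp["body"].read())["completion"])
```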
Build Spark Structured Streaming applications with the open source connector for HAQM Kinesis Data Streams
Apache Spark is a powerful big data engine used for large-scale data analytics. Its in-memory computing makes it great for iterative algorithms and interactive queries. You can use Apache Spark to process streaming data from a variety of streaming sources, including HAQM Kinesis Data Streams for use cases like clickstream analysis, fraud detection, and more. Kinesis Data Streams is a serverless streaming data service that makes it straightforward to capture, process, and store data streams at any scale.
With the new open source HAQM Kinesis Data Streams Connector for Spark Structured Streaming, you can use the newer Spark Data Sources API. The connector also supports enhanced fan-out for dedicated read throughput and faster stream processing. In this post, we deep dive into the internal details of the connector and show you how to use it to consume records from and produce records to Kinesis Data Streams using HAQM EMR.
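A minimal PySpark sketch of consuming a stream with the connector; the option names follow the connector's README, the stream name and region are placeholders, and the connector JAR must be available on the cluster (for example, via --packages on HAQM EMR):

```python
# Read from Kinesis Data Streams with the open source connector and echo the
# payloads to the console. Stream name, region, and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kinesis-consumer").getOrCreate()

events = (
    spark.readStream.format("aws-kinesis")
    .option("kinesis.region", "us-east-1")
    .option("kinesis.streamName", "clickstream")
    # "SubscribeToShard" enables enhanced fan-out; "GetRecords" uses shared throughput
    .option("kinesis.consumerType", "GetRecords")
    .option("kinesis.endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
    .option("kinesis.startingposition", "LATEST")
    .load()
)

query = (
    events.selectExpr("CAST(data AS STRING) AS payload")
    .writeStream.format("console")
    .option("checkpointLocation", "/tmp/kinesis-checkpoint")
    .start()
)
query.awaitTermination()
```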
Achieve peak performance and boost scalability using multiple HAQM Redshift serverless workgroups and Network Load Balancer
As data analytics use cases grow, scalability and concurrency become crucial for businesses. Your analytics solution architecture should handle large data volumes at high concurrency without compromising speed, delivering a scalable, high-performance analytics environment. HAQM Redshift Serverless provides a fully managed, petabyte-scale, auto scaling cloud data warehouse to […]
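As a sketch of the multi-workgroup building block behind that pattern, the following creates an additional Redshift Serverless workgroup against a shared namespace; the names and capacity are illustrative:

```python
# Create a second workgroup on a shared namespace to add query concurrency;
# workgroup name, namespace name, and capacity are illustrative.
import boto3

redshift_serverless = boto3.client("redshift-serverless", region_name="us-east-1")

redshift_serverless.create_workgroup(
    workgroupName="analytics-wg-2",       # additional workgroup for concurrency
    namespaceName="analytics-namespace",  # shares data with other workgroups
    baseCapacity=64,                      # base RPUs; size per workload
)
```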