AWS Big Data Blog

Category: Advanced (300)

Optimize multimodal search using the TwelveLabs Embed API and Amazon OpenSearch Service

In this blog post, we show you how to integrate the TwelveLabs Embed API with OpenSearch Service to create a multimodal search solution. You’ll learn how to generate rich, contextual embeddings from video content and use OpenSearch Service’s vector database capabilities to search across that content. By the end of this post, you’ll be equipped with the knowledge to implement a system that can transform the way your organization handles and extracts value from video content.

Build a data lakehouse in a hybrid environment using Amazon EMR Serverless, Apache DolphinScheduler, and TiDB

This post discusses a decoupled approach to building a serverless data lakehouse using AWS Cloud-centered services, including Amazon EMR Serverless, Amazon Athena, and Amazon Simple Storage Service (Amazon S3), together with Apache DolphinScheduler (an open source data job scheduler) and PingCAP TiDB, a third-party data warehouse product that can be deployed on premises, in the cloud, or as software as a service (SaaS).

Migrate from Standard brokers to Express brokers in Amazon MSK using Amazon MSK Replicator

Creating a new cluster with Express brokers is straightforward, as described in Amazon MSK Express brokers. However, if you have an existing MSK cluster, you need to migrate to a new Express-based cluster. In this post, we discuss how to plan and perform the migration to Express brokers for your existing MSK workloads on Standard brokers. Express brokers offer a different user experience and a different shared responsibility boundary, so using them on an existing cluster is not possible. However, you can use Amazon MSK Replicator to copy all data and metadata from your existing MSK cluster to a new cluster composed of Express brokers.

Generate vector embeddings for your data using AWS Lambda as a processor for Amazon OpenSearch Ingestion

In this post, we demonstrate how to use the OpenSearch Ingestion Lambda processor to generate embeddings for your source data and ingest them into an OpenSearch Serverless vector collection. This solution uses the flexibility of OpenSearch Ingestion pipelines with a Lambda processor to dynamically generate embeddings.

Accelerate queries on Apache Iceberg tables through AWS Glue auto compaction

In this post, we explore new features of the AWS Glue Data Catalog, which now supports improved automatic compaction of Iceberg tables for streaming data, making it straightforward for you to keep your transactional data lakes consistently performant. Enabling automatic compaction on Iceberg tables reduces their metadata overhead and improves query performance.

Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

In this post, we demonstrate how to implement a custom subscription workflow using Amazon DataZone, Amazon EventBridge, and AWS Lambda to automate the fulfillment process for unmanaged data assets, such as unstructured data stored in Amazon S3. This solution enhances governance and simplifies access to unstructured data assets across the organization.

Federate to Amazon Redshift Query Editor v2 with Microsoft Entra ID

In this post, we explore the process of federating into AWS using Microsoft Entra ID and AWS Identity and Access Management (IAM), and how to restrict access to datasets based on permissions linked to AD groups. We guide you through the setup process, and demonstrate how to seamlessly connect to the Redshift Query Editor while making sure data access permissions are accurately enforced based on your Microsoft Entra ID groups.

Introducing the HubSpot connector for AWS Glue

This post introduces the new HubSpot managed connector for AWS Glue, and demonstrates how you can integrate HubSpot data into your existing data lake on AWS. By consolidating HubSpot data with data from your AWS accounts and from other SaaS services, you can enhance, analyze, and optionally write the data back to HubSpot, creating a seamless and integrated data experience.

Develop a business chargeback model within your organization using Amazon Redshift multi-warehouse writes

Now, we are announcing general availability (GA) of Amazon Redshift multi-data warehouse writes through data sharing. This new capability allows you to scale your write workloads and achieve better performance for extract, transform, and load (ETL) workloads by using different warehouses of different types and sizes based on your workload needs.