AWS Big Data Blog

Access Amazon Redshift data from Salesforce Data Cloud with Zero Copy Data Federation

This post is co-authored by Vijay Gopalakrishnan, Director of Product, Salesforce Data Cloud. In today’s data-driven business landscape, organizations collect a wealth of data across various touch points and unify it in a central data warehouse or a data lake to deliver business insights. This data is primarily used for analytical and machine learning purposes, […]

Perform reindexing in Amazon OpenSearch Serverless using Amazon OpenSearch Ingestion

In this post, we outline the steps to copy data between two indexes in the same OpenSearch Serverless collection using the new OpenSearch source feature of OpenSearch Ingestion. This is particularly useful for reindexing operations where you want to change your data schema. OpenSearch Serverless and OpenSearch Ingestion are both serverless services that enable you to seamlessly handle your data workflows, providing optimal performance and scalability.

Uncover social media insights in real time using Amazon Managed Service for Apache Flink and Amazon Bedrock

This post takes a step-by-step approach to showcase how you can use Retrieval Augmented Generation (RAG) to reference real-time tweets as a context for large language models (LLMs). RAG is the process of optimizing the output of an LLM so it references an authoritative knowledge base outside of its training data sources before generating a response. LLMs are trained on vast volumes of data and use billions of parameters to generate original output for tasks such as answering questions, translating languages, and completing sentences.

Configure a custom domain name for your Amazon MSK cluster

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that enables you to build and run applications that use Apache Kafka to process streaming data. It runs open-source versions of Apache Kafka. This means existing applications, tooling, and plugins from partners and the Apache Kafka community are supported without requiring changes to […]

Run Apache Spark 3.5.1 workloads 4.5 times faster with Amazon EMR runtime for Apache Spark

The Amazon EMR runtime for Apache Spark is a performance-optimized runtime that is 100% API compatible with open source Apache Spark. It offers faster out-of-the-box performance than Apache Spark through improved query plans, faster queries, and tuned defaults. Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, and Amazon EMR on AWS […]

Stream multi-tenant data with Amazon MSK

AWS helps SaaS vendors by providing the building blocks needed to implement a streaming application with Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK), and real-time processing applications with Amazon Managed Service for Apache Flink. In this post, we look at implementation patterns a SaaS vendor can adopt when using a streaming platform as a means of integration between internal components, where streaming data is not directly exposed to third parties. In particular, we focus on Amazon MSK.

Apply fine-grained access and transformation on the SUPER data type in Amazon Redshift

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per […]

Build multimodal search with Amazon OpenSearch Service

Multimodal search enables both text and image search capabilities, transforming how users access data through search applications. Consider building an online fashion retail store: you can enhance the users’ search experience with a visually appealing application that customers can use not only to search using text but also to upload an image depicting a […]

Introducing AWS Glue usage profiles for flexible cost control

AWS Glue is a serverless data integration service that enables you to run extract, transform, and load (ETL) workloads on your data in a scalable and serverless manner. One of the main advantages of using a cloud platform is its flexibility; you can provision compute resources when you actually need them. However, with this ease […]

Disaster recovery strategies for Amazon MWAA – Part 2

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a fully managed orchestration service that makes it straightforward to run data processing workflows at scale. Amazon MWAA takes care of operating and scaling Apache Airflow so you can focus on developing workflows. However, although Amazon MWAA provides high availability within an AWS Region through features […]