AWS Big Data Blog
Category: Advanced (300)
Hybrid Search with Amazon OpenSearch Service
This post explains the internals of hybrid search and how to build a hybrid search solution using OpenSearch Service. We experiment with sample queries to explore and compare lexical, semantic, and hybrid search. All the code used in this post is publicly available in the GitHub repository.
Scale AWS Glue jobs by optimizing IP address consumption and expanding network capacity using a private NAT gateway
As businesses expand, the demand for IP addresses within the corporate network often exceeds the supply. An organization’s network is often designed with some anticipation of future requirements, but as enterprises evolve, their information technology (IT) needs surpass the previously designed network. Companies may find themselves challenged to manage the limited pool of IP addresses. […]
Enrich your customer data with geospatial insights using Amazon Redshift, AWS Data Exchange, and Amazon QuickSight
It always pays to know more about your customers, and AWS Data Exchange makes it straightforward to use publicly available census data to enrich your customer dataset. The United States Census Bureau conducts the US census every 10 years and gathers household survey data. This data is anonymized, aggregated, and made available for public use. […]
Multicloud data lake analytics with Amazon Athena
Many organizations operate data lakes spanning multiple cloud data stores. This could be for various reasons, such as business expansions, mergers, or specific cloud provider preferences for different business units. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics […]
Build a RAG data ingestion pipeline for large-scale ML workloads
For building any generative AI application, enriching the large language models (LLMs) with new data is imperative. This is where the Retrieval Augmented Generation (RAG) technique comes in. RAG is a machine learning (ML) architecture that uses external documents (like Wikipedia) to augment its knowledge and achieve state-of-the-art results on knowledge-intensive tasks. For ingesting these […]
How the GoDaddy data platform achieved over 60% cost reduction and 50% performance boost by adopting Amazon EMR Serverless
This is a guest post co-written with Brandon Abear, Dinesh Sharma, John Bush, and Ozcan IIikhan from GoDaddy. GoDaddy empowers everyday entrepreneurs by providing all the help and tools to succeed online. With more than 20 million customers worldwide, GoDaddy is the place people come to name their ideas, build a professional website, attract customers, […]
Real-time cost savings for Amazon Managed Service for Apache Flink
When running Apache Flink applications on Amazon Managed Service for Apache Flink, you have the unique benefit of taking advantage of its serverless nature. This means that cost-optimization exercises can happen at any time; they no longer need to happen in the planning phase. With Managed Service for Apache Flink, you can add and remove compute […]
In-stream anomaly detection with Amazon OpenSearch Ingestion and Amazon OpenSearch Serverless
Unsupervised machine learning analytics has emerged as a powerful tool for anomaly detection in today’s data-rich landscape, especially with the growing volume of machine-generated data. In-stream anomaly detection offers real-time insights into data anomalies, enabling proactive response. Amazon OpenSearch Serverless focuses on delivering seamless scalability and management of search workloads; Amazon OpenSearch Ingestion complements this […]
Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion
Organizations often need to manage a high volume of data that is growing at an extraordinary rate. At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, […]
Bring your workforce identity to Amazon EMR Studio and Athena
Customers today may struggle to implement proper access controls and auditing at the user level when multiple applications are involved in data access workflows. The key challenge is to implement proper least-privilege access controls based on user identity when one application accesses data on behalf of the user in another application. It forces you to […]