AWS Big Data Blog

Category: Expert (400)

Build a real-time analytics solution with Apache Pinot on AWS

In this, we will provide a step-by-step guide showing you how you can build a real-time OLAP datastore on HAQM Web Services (AWS) using Apache Pinot on HAQM Elastic Compute Cloud (HAQM EC2) and do near real-time visualization using Tableau. You can use Apache Pinot for batch processing use cases as well but, in this post, we will focus on a near real-time analytics use case.

Improve Apache Kafka scalability and resiliency using HAQM MSK tiered storage

Since the launch of tiered storage for HAQM Managed Streaming for Apache Kafka (HAQM MSK), customers have embraced this feature for its ability to optimize storage costs and improve performance. In previous posts, we explored the inner workings of Kafka, maximized the potential of HAQM MSK, and delved into the intricacies of HAQM MSK tiered […]

Architecture Diagram

How EchoStar ingests terabytes of data daily across its 5G Open RAN network in near real-time using HAQM Redshift Serverless Streaming Ingestion

EchoStar, a connectivity company providing television entertainment, wireless communications, and award-winning technology to residential and business customers throughout the US, deployed the first standalone, cloud-native Open RAN 5G network on AWS public cloud. This post provides an overview of real-time data analysis with HAQM Redshift and how EchoStar uses it to ingest hundreds of megabytes per second. As data sources and volumes grew across its network, EchoStar migrated from a single Redshift Serverless workgroup to a multi-warehouse architecture with live data sharing.

Build multimodal search with HAQM OpenSearch Service

Multimodal search enables both text and image search capabilities, transforming how users access data through search applications. Consider building an online fashion retail store: you can enhance the users’ search experience with a visually appealing application that customers can use to not only search using text but they can also upload an image depicting a […]

Detect and handle data skew on AWS Glue

October 2024: This post was reviewed and updated for accuracy. AWS Glue is a fully managed, serverless data integration service provided by HAQM Web Services (AWS) that uses Apache Spark as one of its backend processing engines (as of this writing, you can use Python Shell or Spark). Data skew occurs when the data being […]

Best practices to implement near-real-time analytics using HAQM Redshift Streaming Ingestion with HAQM MSK

HAQM Redshift is a fully managed, scalable cloud data warehouse that accelerates your time to insights with fast, straightforward, and secure analytics at scale. Tens of thousands of customers rely on HAQM Redshift to analyze exabytes of data and run complex analytical queries, making it the most widely used cloud data warehouse. You can run […]

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. However, altering schema and table partitions in traditional data lakes can be a disruptive and time-consuming task, requiring renaming or recreating entire tables and reprocessing large datasets. […]

Implement fine-grained access control in HAQM SageMaker Studio and HAQM EMR using Apache Ranger and Microsoft Active Directory

In this post, we show how you can authenticate into SageMaker Studio using an existing Active Directory (AD), with authorized access to both HAQM S3 and Hive cataloged data using AD entitlements via Apache Ranger integration and AWS IAM Identity Center (successor to AWS Single Sign-On). With this solution, you can manage access to multiple SageMaker environments and SageMaker Studio notebooks using a single set of credentials. Subsequently, Apache Spark jobs created from SageMaker Studio notebooks will access only the data and resources permitted by Apache Ranger policies attached to the AD credentials, inclusive of table and column-level access.

Resolve private DNS hostnames for HAQM MSK Connect

HAQM MSK Connect is a feature of HAQM Managed Streaming for Apache Kafka (HAQM MSK) that offers a fully managed Apache Kafka Connect environment on AWS. With MSK Connect, you can deploy fully managed connectors built for Kafka Connect that move data into or pull data from popular data stores like HAQM S3 and HAQM […]