AWS Big Data Blog
Category: Amazon Managed Streaming for Apache Kafka (Amazon MSK)
Securely process near-real-time data from Amazon MSK Serverless using an AWS Glue streaming ETL job with IAM authentication
Streaming data has become an indispensable resource for organizations worldwide because it offers real-time insights that are crucial for data analytics. The escalating velocity and magnitude of collected data have created a demand for real-time analytics. This data originates from diverse sources, including social media, sensors, logs, and clickstreams, among others. With streaming data, organizations […]
Build streaming data pipelines with Amazon MSK Serverless and IAM authentication
Amazon’s serverless Apache Kafka offering, Amazon Managed Streaming for Apache Kafka (Amazon MSK) Serverless, is attracting a lot of interest. It’s appreciated for its user-friendly approach, ability to scale automatically, and cost-saving benefits over other Kafka solutions. However, a hurdle encountered by many users is the requirement of MSK Serverless to use AWS Identity and Access Management (IAM) access control. At the time of writing, the Amazon MSK library for IAM is exclusive to Kafka libraries in Java, creating a challenge for users of other programming languages. In this post, we aim to address this issue and present how you can use Amazon API Gateway and AWS Lambda to navigate around this obstacle.
Introducing Amazon MSK as a source for Amazon OpenSearch Ingestion
Ingesting a high volume of streaming data has been a defining characteristic of operational analytics workloads with Amazon OpenSearch Service. Many of these workloads involve either self-managed Apache Kafka or Amazon Managed Streaming for Apache Kafka (Amazon MSK) to satisfy their data streaming needs. Consuming data from Amazon MSK and writing to OpenSearch Service has been a challenge for customers. AWS Lambda, custom code, Kafka Connect, and Logstash have been used for ingesting this data. These methods involve tools that must be built and maintained. In this post, we introduce Amazon MSK as a source to Amazon OpenSearch Ingestion, a serverless, fully managed, real-time data collector for OpenSearch Service that makes this ingestion even easier.
Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics
This post is co-written with Eliad Gat and Oded Lifshiz from Orca Security. With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. One key component that plays a central role in modern data architectures is the data lake, which allows organizations to […]
Multi-tenancy Apache Kafka clusters in Amazon MSK with IAM access control and Kafka Quotas – Part 1
With Amazon Managed Streaming for Apache Kafka (Amazon MSK), you can build and run applications that use Apache Kafka to process streaming data. To process streaming data, organizations either use multiple Kafka clusters based on their application groupings, usage scenarios, compliance requirements, and other factors, or a dedicated Kafka cluster for the entire organization. It […]
Multi-tenancy Apache Kafka clusters in Amazon MSK with IAM access control and Kafka quotas – Part 2
Kafka quotas are integral to multi-tenant Kafka clusters. They prevent Kafka cluster performance from being negatively affected by poorly behaved applications overconsuming cluster resources. Furthermore, they enable the central streaming data platform to be operated as a multi-tenant platform and used by downstream and upstream applications across multiple business lines. Kafka supports two types of quotas: […]
Best practices for running production workloads using Amazon MSK tiered storage
In the second post of the series, we discussed some core concepts of the Amazon Managed Streaming for Apache Kafka (Amazon MSK) tiered storage feature and explained how read and write operations work in a tiered storage enabled cluster. This post focuses on how to properly size your MSK tiered storage cluster, which metrics to […]
AWS Glue streaming application to process Amazon MSK data using AWS Glue Schema Registry
Organizations across the world are increasingly relying on streaming data, and there is a growing need for real-time data analytics, considering the growing velocity and volume of data being collected. This data can come from a diverse range of sources, including Internet of Things (IoT) devices, user applications, and logging and telemetry information from applications, […]
Deep dive on Amazon MSK tiered storage
In the first post of the series, we described some core concepts of Apache Kafka cluster sizing, the best practices for optimizing the performance, and the cost of your Kafka workload. This post explains how the underlying infrastructure affects Kafka performance when you use Amazon Managed Streaming for Apache Kafka (Amazon MSK) tiered storage. We […]
Stream data with Amazon MSK Connect using an open-source JDBC connector
Customers are adopting Amazon Managed Streaming for Apache Kafka (Amazon MSK) as a fast and reliable streaming platform to build their enterprise data hub. In addition to streaming capabilities, setting up Amazon MSK enables organizations to use a pub/sub model for data distribution with loosely coupled and independent components. To publish and distribute the data […]