AWS Big Data Blog

Category: Intermediate (200)

Explore visualizations with AWS Glue interactive sessions

AWS Glue interactive sessions offer a powerful way to iteratively explore datasets and fine-tune transformations using Jupyter-compatible notebooks. Interactive sessions enable you to work with a choice of popular integrated development environments (IDEs) in your local environment or with AWS Glue or HAQM SageMaker Studio notebooks on the AWS Management Console, all while seamlessly harnessing […]

Introducing enhanced support for tagging, cross-account access, and network security in AWS Glue interactive sessions

AWS Glue interactive sessions allow you to run interactive AWS Glue workloads on demand, which enables rapid development by issuing blocks of code on a cluster and getting prompt results. This technology is enabled by the use of notebook IDEs, such as the AWS Glue Studio notebook, HAQM SageMaker Studio, or your own Jupyter notebooks. […]

Securely process near-real-time data from HAQM MSK Serverless using an AWS Glue streaming ETL job with IAM authentication

Streaming data has become an indispensable resource for organizations worldwide because it offers real-time insights that are crucial for data analytics. The escalating velocity and magnitude of collected data have created a demand for real-time analytics. This data originates from diverse sources, including social media, sensors, logs, and clickstreams, among others. With streaming data, organizations […]

Build streaming data pipelines with HAQM MSK Serverless and IAM authentication

HAQM’s serverless Apache Kafka offering, HAQM Managed Streaming for Apache Kafka (HAQM MSK) Serverless, is attracting a lot of interest. It’s appreciated for its user-friendly approach, automatic scaling, and cost savings over other Kafka solutions. However, a hurdle many users encounter is that MSK Serverless requires AWS Identity and Access Management (IAM) access control. At the time of writing, the HAQM MSK library for IAM is available only for Kafka clients in Java, creating a challenge for users of other programming languages. In this post, we address this issue and show how you can use HAQM API Gateway and AWS Lambda to work around this obstacle.
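In the pattern described above, producers in any language call a REST endpoint instead of speaking the Kafka protocol directly; API Gateway invokes a Lambda function (which can be written in Java, where the MSK IAM library is available) that forwards the record to the MSK Serverless cluster. A minimal client-side sketch, assuming a hypothetical endpoint URL and payload schema (`topic`/`key`/`value`) that would be defined by your own deployment:

```python
import json
from urllib import request

# Hypothetical API Gateway endpoint fronting the Lambda Kafka producer;
# substitute the invoke URL from your own API Gateway stage.
API_ENDPOINT = "https://example.execute-api.us-east-1.amazonaws.com/prod/produce"


def build_record(topic: str, key: str, value: dict) -> bytes:
    """Serialize a message for the REST proxy. The payload shape is an
    assumption; the Lambda behind API Gateway would unpack it and
    produce to the MSK Serverless topic using IAM authentication."""
    payload = {"topic": topic, "key": key, "value": value}
    return json.dumps(payload).encode("utf-8")


def send_record(topic: str, key: str, value: dict) -> None:
    """POST one record to the proxy endpoint."""
    req = request.Request(
        API_ENDPOINT,
        data=build_record(topic, key, value),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # request.urlopen(req)  # uncomment when pointing at a live endpoint


if __name__ == "__main__":
    record = build_record("clickstream", "user-42", {"page": "/home"})
    print(json.loads(record)["topic"])
```

Because the Kafka-specific work happens server-side, the client needs no Kafka library at all, only HTTPS, which is what makes the approach language-agnostic.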

Build an ETL process for HAQM Redshift using HAQM S3 Event Notifications and AWS Step Functions

In this post, we discuss how to build and orchestrate an ETL process for HAQM Redshift in a few steps, using HAQM S3 Event Notifications to automatically verify source data upon arrival and send notifications in specific cases, and AWS Step Functions to orchestrate the data pipeline. This can serve as a starting point for teams that want to build an event-driven data pipeline from data source to data warehouse, helping them track each phase and respond to failures quickly. Alternatively, you can use HAQM Redshift auto-copy from HAQM S3 to simplify data loading from HAQM S3 into HAQM Redshift.
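In an event-driven pipeline like the one described above, the S3 Event Notification carries the bucket and object key of the newly arrived file, and a downstream task turns that into a Redshift COPY. A minimal sketch of that translation step, with an illustrative table name and IAM role ARN (both placeholders, not values from the post):

```python
def copy_sql_from_s3_event(event: dict, table: str = "staging.sales") -> str:
    """Derive a Redshift COPY statement from an S3 Event Notification.

    The event structure follows the standard S3 notification format;
    the target table and IAM role below are hypothetical placeholders.
    """
    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    key = s3_info["object"]["key"]
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{key}' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )


# Within the Step Functions workflow, a Lambda task could execute the
# statement through the Redshift Data API, for example:
#   boto3.client("redshift-data").execute_statement(
#       WorkgroupName="etl-wg", Database="dev",
#       Sql=copy_sql_from_s3_event(event))

if __name__ == "__main__":
    sample = {"Records": [{"s3": {"bucket": {"name": "my-source-bucket"},
                                  "object": {"key": "incoming/2024/01/data.csv"}}}]}
    print(copy_sql_from_s3_event(sample))
```

Keeping the SQL construction separate from execution makes it easy to unit test the pipeline logic, and to add the verification step before any load runs.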

Automate the archive and purge data process for HAQM RDS for PostgreSQL using pg_partman, HAQM S3, and AWS Glue

The post Archive and Purge Data for HAQM RDS for PostgreSQL and HAQM Aurora with PostgreSQL Compatibility using pg_partman and HAQM S3 proposes data archival as a critical part of data management and shows how to efficiently use PostgreSQL’s native range partition to partition current (hot) data with pg_partman and archive historical (cold) data in […]

Introducing AWS Glue crawler and create table support for Apache Iceberg format

Apache Iceberg is an open table format for large datasets in HAQM Simple Storage Service (HAQM S3) and provides fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time […]

Introducing Apache Airflow version 2.6.3 support on HAQM MWAA

HAQM Managed Workflows for Apache Airflow (HAQM MWAA) is a managed orchestration service for Apache Airflow that makes it simple to set up and operate end-to-end data pipelines in the cloud. Trusted across various industries, HAQM MWAA helps organizations like Siemens, ENGIE, and Choice Hotels International enhance and scale their business workflows, while significantly improving security […]

Perform HAQM Kinesis load testing with Locust

February 9, 2024: HAQM Kinesis Data Firehose has been renamed to HAQM Data Firehose. Read the AWS What’s New post to learn more. Building a streaming data solution requires thorough testing at the scale it will operate in a production environment. Streaming applications operating at scale often handle large volumes of up to GBs per […]