AWS Big Data Blog

Category: Intermediate (200)

Build efficient, cross-Regional, I/O-intensive workloads with Dask on AWS

Welcome to the era of data. The sheer volume of data captured daily continues to grow, calling for platforms and solutions to evolve. Services such as Amazon Simple Storage Service (Amazon S3) offer a scalable solution that adapts yet remains cost-effective for growing datasets. The Amazon Sustainability Data Initiative (ASDI) uses the capabilities of Amazon […]

Build, deploy, and run Spark jobs on Amazon EMR with the open-source EMR CLI tool

Today, we’re pleased to introduce the Amazon EMR CLI, a new command line tool to package and deploy PySpark projects across different Amazon EMR environments. With the introduction of the EMR CLI, you now have a simple way to not only deploy a wide range of PySpark projects to remote EMR environments, but also integrate […]

Compose your ETL jobs for MongoDB Atlas with AWS Glue

In today’s data-driven business environment, organizations face the challenge of efficiently preparing and transforming large amounts of data for analytics and data science purposes. Businesses need to build data warehouses and data lakes based on operational data. This is driven by the need to centralize and integrate data coming from disparate sources. At the same […]

Data load made easy and secure in Amazon Redshift using Query Editor V2

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data efficiently and securely. Users such as data analysts, database developers, and data scientists use SQL to analyze their data in Amazon Redshift data warehouses. Amazon Redshift provides a web-based Query Editor V2 in […]

What’s new with Amazon MWAA support for Apache Airflow version 2.4.3

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that makes it simple to set up and operate end-to-end data pipelines in the cloud at scale. Amazon MWAA supports multiple versions of Apache Airflow (v1.10.12, v2.0.2, and v2.2.2). Earlier in 2023, we added support for Apache Airflow v2.4.3 […]

Real-time anomaly detection via Random Cut Forest in Amazon Managed Service for Apache Flink

August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. Real-time anomaly detection describes a use case to detect and flag unexpected behavior in streaming data as it occurs. Online machine learning (ML) algorithms are popular for […]

Monitor and optimize cost on AWS Glue for Apache Spark

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning (ML), and application development. You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores. One of […]

How Dafiti made Amazon QuickSight its primary data visualization tool

This is a guest post by Valdiney Gomes, Hélio Leal, and Flávia Lima from Dafiti. Data and its various uses are increasingly evident in companies, and each professional has their preferences about which technologies to use to visualize data, which aren’t necessarily in line with the technological needs and infrastructure of a company. At Dafiti, […]

Cross-account integration between SaaS platforms using Amazon AppFlow

Implementing an effective data sharing strategy that satisfies compliance and regulatory requirements is complex. Customers often need to share data between disparate software as a service (SaaS) platforms within their organization or across organizations. On many occasions, they need to apply business logic to the data received from the source SaaS platform before pushing it […]

Exploring new ETL and ELT capabilities for Amazon Redshift from the AWS Glue Studio visual editor

In a modern data architecture, unified analytics enable you to access the data you need, whether it’s stored in a data lake or a data warehouse. In particular, we have observed an increasing number of customers who combine and integrate their data into an Amazon Redshift data warehouse to analyze huge data at scale and […]