AWS Big Data Blog

WeatherBug cut ETL latency 30-fold using HAQM Redshift Spectrum

This post is co-written with Anton Morozov and James Phillips, data engineers at WeatherBug.

WeatherBug is a brand owned by GroundTruth, a New York City-based company that provides location-based advertising solutions to businesses. WeatherBug offers a mobile app that reports live and forecast hyperlocal weather data to consumers. The WeatherBug Data Engineering team has built a modern analytics platform, entirely on AWS, that serves multiple use cases, including weather forecasting and location-based advertising. They use an HAQM Simple Storage Service (HAQM S3) data lake to store clickstream data and HAQM Redshift as their cloud data warehouse platform.

In this post, we share how WeatherBug reduced their extract, transform, and load (ETL) latency using HAQM Redshift Spectrum.

HAQM Redshift Spectrum overview

HAQM Redshift is a fast, petabyte-scale cloud data warehouse delivering the best price-performance. It allows you to run complex analytic queries against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution.

Redshift Spectrum allows you to query open format data directly in the S3 data lake without having to load the data or duplicate your infrastructure. With Redshift Spectrum, you can query open file formats such as Apache Parquet, ORC, JSON, Avro, and CSV. For more information, see HAQM Redshift Spectrum overview and HAQM Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required.

Redshift Spectrum runs on a massive compute fleet independent of your HAQM Redshift cluster and pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer. As a result, Redshift Spectrum queries use much less of your cluster’s processing capacity than other queries. You can efficiently query and retrieve structured and semi-structured data from files in HAQM S3 without loading the data into HAQM Redshift tables; most of the data remains in HAQM S3, and queries employ massive parallelism to run very fast against large datasets. Multiple clusters can concurrently query the same dataset in HAQM S3 without making copies of the data for each cluster.
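
As an illustration, a minimal Redshift Spectrum setup looks like the following sketch. The IAM role, database, schema, table, and column names are hypothetical placeholders, not objects from WeatherBug’s environment:

    -- Register an external schema backed by the AWS Glue Data Catalog
    -- (role ARN and database name are illustrative assumptions)
    CREATE EXTERNAL SCHEMA spectrum_schema
    FROM DATA CATALOG
    DATABASE 'clickstream_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Map an external table onto Parquet files already in the S3 data lake
    CREATE EXTERNAL TABLE spectrum_schema.clickstream_events_ext (
        event_id   BIGINT,
        user_id    VARCHAR(64),
        event_time TIMESTAMP,
        event_type VARCHAR(32)
    )
    STORED AS PARQUET
    LOCATION 's3://example-data-lake-bucket/clickstream/';

Once the external table exists, you can query it with standard SQL as if it were a local table, while the data itself stays in HAQM S3.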

Previous solution

To summarize performance metrics for internal BI needs, WeatherBug had to move a lot of data between their S3 data lake and their HAQM Redshift cluster through complex ETL processes. They used Apache Airflow to orchestrate the pipeline and Apache Hive for the large-scale ETL jobs, in order to offload that work from the HAQM Redshift cluster. The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage, and a command line tool and JDBC driver are provided to connect users to Hive.

WeatherBug chose Hive because it integrated easily with their Python/Airflow pipeline and let them launch multiple jobs in parallel. The solution worked well, but building the ETL processes took considerable engineering effort, and the mix of technologies involved added operational overhead.

The following diagram illustrates this previous architecture.

New HAQM Redshift Spectrum-based solution

To optimize the Hadoop-based process for moving data between their HAQM Redshift cluster and S3 buckets, the WeatherBug team evaluated Redshift Spectrum as an alternative. The following diagram shows their updated architecture.

WeatherBug created Redshift Spectrum external tables that pointed to the data stored in their S3 buckets. This let them move and transform data from HAQM Redshift to HAQM S3 with INSERT INTO <external table> SELECT ... FROM <Redshift table>, and from HAQM S3 to HAQM Redshift with INSERT INTO <Redshift table> SELECT ... FROM <external table>, applying transformations inline in the SQL queries.
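
A minimal sketch of this pattern follows. The schema, table, and column names are hypothetical placeholders rather than WeatherBug’s actual objects, and the example assumes an external table created with a writable format such as Parquet, as in the earlier sketch:

    -- HAQM Redshift to HAQM S3: write query results
    -- to the external table's S3 location
    INSERT INTO spectrum_schema.clickstream_events_ext
    SELECT event_id, user_id, event_time, event_type
    FROM analytics.clickstream_events
    WHERE event_time >= DATEADD(day, -1, GETDATE());

    -- HAQM S3 to HAQM Redshift: load from the external table
    -- with an inline transformation (a daily aggregation)
    INSERT INTO analytics.daily_event_counts
    SELECT TRUNC(event_time) AS event_date,
           event_type,
           COUNT(*) AS event_count
    FROM spectrum_schema.clickstream_events_ext
    GROUP BY TRUNC(event_time), event_type;

Because both directions are plain SQL, steps like these can slot directly into an existing Airflow orchestration without a separate Hadoop cluster.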

During a pilot implementation on a few sample tables, WeatherBug found the HAQM Redshift Spectrum features easy to learn, and before long they had a proof of concept far superior to their existing Hadoop-based solution. They reduced the targeted pipeline’s runtime from 17 minutes to 30 seconds, a 3,300% improvement, with the additional benefit of eliminating the cost and management of the Hadoop cluster. They were excited to apply this approach to additional pipelines that support their HAQM Redshift clusters. It was a welcome win to improve processing times and reduce cost and overhead with low engineering effort.

In addition to these improvements, they replaced some of their Apache Hive table-query jobs with HAQM Redshift Spectrum.

In initial testing on a selected sample job, WeatherBug is seeing HAQM Redshift Spectrum costs of $0.14 per day instead of $16.12 per day on Hive, a roughly 115-fold cost reduction.

Conclusion

When you have a data lake in HAQM S3 and a cloud data warehouse in HAQM Redshift, you may need frequent ETL jobs between the two systems for different use cases. HAQM Redshift Spectrum provides an easy-to-implement, cost-effective, and high-performance way to query HAQM S3 data from HAQM Redshift, join HAQM Redshift tables with S3 objects, and transform data using simple SQL queries. Many data engineering use cases similar to the WeatherBug example discussed here can be optimized with HAQM Redshift Spectrum.
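
As a simple illustration of that last point, joining a local HAQM Redshift table with an S3-backed external table is a single SQL statement; the table and column names here are hypothetical:

    -- Join clickstream data in S3 (external table) with a
    -- local HAQM Redshift dimension table, in place
    SELECT u.account_tier,
           COUNT(*) AS events
    FROM spectrum_schema.clickstream_events_ext e
    JOIN analytics.users u
      ON u.user_id = e.user_id
    GROUP BY u.account_tier;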


About the Authors

Anton Morozov is a WeatherBug data engineer working on analytics pipelines. He considers HAQM Redshift an essential technology for understanding data to meet business intelligence needs.

James Phillips is a WeatherBug data engineer who uses many AWS technologies on a daily basis. Some of these include HAQM Redshift, HAQM EMR, and HAQM SageMaker.

Avijit Goswami is a Principal Solutions Architect at AWS, helping his digital native and startup customers become tomorrow’s enterprises using AWS services. He is part of the analytics specialist field community at AWS and works closely with the WeatherBug Data Engineering team.