AWS Big Data Blog

Extend your HAQM Redshift Data Warehouse to your Data Lake

HAQM Redshift is a fast, fully managed, cloud-native data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence tools.

Many companies today use HAQM Redshift to analyze and transform their data. However, as data continues to grow and become even more important, companies are looking for additional ways to extract valuable insights from it, such as big data analytics, machine learning (ML) applications, and a range of tools to drive new use cases and business processes. Companies want all users to be able to access all of their data, all the time, and get fast answers. The best way to meet those requirements is to build a data lake: a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale.

With a data lake built on HAQM Simple Storage Service (HAQM S3), you can easily run big data analytics using services such as HAQM EMR and AWS Glue. You can also query structured data (such as CSV, Avro, and Parquet) and semi-structured data (such as JSON and XML) by using HAQM Athena and HAQM Redshift Spectrum, and you can use the data lake with ML services such as HAQM SageMaker to gain further insights.

A large startup company in Europe uses an HAQM Redshift cluster to allow different company teams to analyze vast amounts of data. They wanted a way to extend the collected data into the data lake and allow additional analytical teams to access more data to explore new ideas and business cases.

Additionally, the company was looking to reduce its storage utilization, which had already exceeded 80% of the HAQM Redshift cluster’s storage capacity. The high storage utilization required ongoing cleanup of growing tables to avoid purchasing additional nodes and the associated increase in cost. These cleanup operations, however, added to the operational footprint. The proposed solution implemented a hot/cold storage pattern using HAQM Redshift Spectrum and reduced the local disk utilization on the HAQM Redshift cluster to keep costs under control.

In this post, we demonstrate how the company, with the support of AWS, implemented a lake house architecture by employing the following best practices:

  • Unloading data into HAQM S3
  • Instituting a hot/cold pattern using HAQM Redshift Spectrum
  • Using AWS Glue to crawl and catalog the data
  • Querying data using Athena

Solution overview

The following diagram illustrates the solution architecture.

The solution includes the following steps:

  1. Unload data from HAQM Redshift to HAQM S3
  2. Create an AWS Glue Data Catalog using an AWS Glue crawler
  3. Query the data lake in HAQM Athena
  4. Query HAQM Redshift and the data lake with HAQM Redshift Spectrum

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

  1. An AWS account.
  2. An HAQM Redshift cluster.
  3. The following AWS services and access: HAQM Redshift, HAQM S3, AWS Glue, and Athena.
  4. The appropriate AWS Identity and Access Management (IAM) permissions for HAQM Redshift Spectrum and AWS Glue to access HAQM S3 buckets. For more information, see IAM policies for HAQM Redshift Spectrum and Setting up IAM Permissions for AWS Glue.

Walkthrough

To demonstrate the process performed by the company, we use the industry-standard TPC-H dataset provided publicly by the TPC organization.

The Orders table has the following columns:

Column           Type
O_ORDERKEY       int4
O_CUSTKEY        int4
O_ORDERSTATUS    varchar
O_TOTALPRICE     numeric
O_ORDERDATE      date
O_ORDERPRIORITY  varchar
O_CLERK          varchar
O_SHIPPRIORITY   int4
O_COMMENT        varchar
SKIP             varchar

Unloading data from HAQM Redshift to HAQM S3

HAQM Redshift allows you to unload your data using a data lake export to an Apache Parquet file format. Parquet is an efficient open columnar storage format for analytics. Parquet format is up to twice as fast to unload and consumes up to six times less storage in HAQM S3, compared with text formats.

To unload cold or historical data from HAQM Redshift to HAQM S3, you need to run an UNLOAD statement similar to the following code (substitute your IAM role ARN):

UNLOAD ('SELECT o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment, skip
FROM tpc.orders
ORDER BY o_orderkey, o_orderdate')
TO 's3://tpc-bucket/orders/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::<account_number>:role/<Role>'
FORMAT AS PARQUET ALLOWOVERWRITE PARTITION BY (o_orderdate);

It is important to define a partition key or column that minimizes HAQM S3 scans as much as possible, based on the intended query patterns. Queries in this use case are most often filtered by date range, so the o_orderdate field is used as the partition key.

Another important recommendation when unloading is to keep file sizes between 128 MB and 512 MB. By default, the UNLOAD command splits the results into one or more files per node slice (a virtual worker in the HAQM Redshift cluster), which takes advantage of the HAQM Redshift MPP architecture. However, this can result in every slice producing small files. In the company’s use case, the default UNLOAD with PARALLEL ON yielded dozens of small (megabyte-sized) files. For the company, PARALLEL OFF yielded the best results because it aggregates the work of all the slices on the leader node and writes the output as a single stream, controlling the file size with the MAXFILESIZE option.
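A minimal sketch of that variation follows, assuming the same table and bucket as before. The 256 MB target size and the role placeholder are illustrative rather than the company’s exact values, and PARTITION BY is omitted here for simplicity:

UNLOAD ('SELECT o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment, skip
FROM tpc.orders
ORDER BY o_orderkey, o_orderdate')
TO 's3://tpc-bucket/orders/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::<account_number>:role/<Role>'
FORMAT AS PARQUET
ALLOWOVERWRITE
PARALLEL OFF          -- write a single stream from the leader node
MAXFILESIZE 256 MB;   -- split that stream into files of roughly this size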

Another performance enhancement applied in this use case was the use of Parquet’s min and max statistics. Parquet files store min_value and max_value column statistics for each row group, which allow HAQM Redshift Spectrum to prune (skip) row groups that are out of scope for a query (a range-restricted scan). To benefit from row group pruning, sort the data by frequently used filter columns when unloading, as the ORDER BY clause in the preceding UNLOAD does. Min/max pruning scans less data from HAQM S3, which results in improved performance and reduced cost.
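As an illustration, once the unloaded data is cataloged and exposed as an external table (the spectrum.orders table created later in this post), a range-restricted query such as the following sketch reads only the row groups whose min/max statistics overlap the filter; the key range shown is purely illustrative:

-- Row groups whose o_orderkey statistics fall entirely outside this range are skipped
SELECT o_orderkey, o_custkey, o_totalprice
FROM spectrum.orders
WHERE o_orderkey BETWEEN 1000000 AND 2000000;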

After unloading the data to your data lake, you can view your Parquet file’s content in HAQM S3 (assuming it’s under 128 MB). From the Actions drop-down menu, choose Select from.

You’re now ready to populate your Data Catalog using an AWS Glue crawler.

Creating a Data Catalog with an AWS Glue crawler

To query your data lake using Athena, you must catalog the data. The Data Catalog is an index of the location, schema, and runtime metrics of the data.

An AWS Glue crawler accesses your data store, extracts metadata (such as field types), and creates a table schema in the Data Catalog. For instructions, see Working with Crawlers on the AWS Glue Console.
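For reference, the catalog entry the crawler produces is roughly equivalent to an external table defined manually with DDL such as the following sketch. The datalake database name and the S3 location carry over from the earlier examples, the numeric precision is assumed, and the crawler may infer slightly different types (for example, a string partition key). A manually created partitioned table would also need its partitions loaded, for example with MSCK REPAIR TABLE in Athena:

-- Approximate Athena/AWS Glue DDL for the table discovered by the crawler
CREATE EXTERNAL TABLE datalake.orders (
  o_orderkey      int,
  o_custkey       int,
  o_orderstatus   string,
  o_totalprice    decimal(12,2),  -- precision assumed
  o_orderpriority string,
  o_clerk         string,
  o_shippriority  int,
  o_comment       string,
  skip            string
)
PARTITIONED BY (o_orderdate date)  -- partition key from the UNLOAD; may be inferred as string
STORED AS PARQUET
LOCATION 's3://tpc-bucket/orders/';

-- Register the existing date partitions (the crawler does this for you)
MSCK REPAIR TABLE orders;  -- run with datalake selected as the database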

Querying the data lake in Athena

After you create the crawler, you can view the schema and tables in AWS Glue and Athena, and can immediately start querying the data in Athena. The following screenshot shows the table in the Athena Query Editor.
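For example, a query along the following lines (the database and table names match the crawler output assumed above, and the date range is illustrative) aggregates orders directly from the Parquet files in HAQM S3, scanning only the partitions selected by the date filter:

-- Athena reads only the o_orderdate partitions that match the filter
SELECT o_orderdate,
       count(*)          AS order_count,
       sum(o_totalprice) AS daily_revenue
FROM datalake.orders
WHERE o_orderdate BETWEEN DATE '1998-01-01' AND DATE '1998-03-31'
GROUP BY o_orderdate
ORDER BY o_orderdate;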

Querying HAQM Redshift and the data lake using a unified view with HAQM Redshift Spectrum

HAQM Redshift Spectrum is a feature of HAQM Redshift that allows multiple Redshift clusters to query the same data in the data lake. It enables the lake house architecture and allows data warehouse queries to reference data in the data lake as they would any other table. HAQM Redshift clusters transparently use the HAQM Redshift Spectrum feature when a SQL query references an external table stored in HAQM S3. Large queries can run in parallel because HAQM Redshift Spectrum scans, filters, and aggregates data from external tables in HAQM S3 and returns only the resulting rows to the HAQM Redshift cluster.

Following best practices, the company decided to persist all their data in their HAQM S3 data lake and only store hot data in HAQM Redshift. They could query both hot and cold datasets in a single query with HAQM Redshift Spectrum.

The first step is to create an external schema in HAQM Redshift that maps to a database in the Data Catalog. See the following code:

CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'datalake'
IAM_ROLE 'arn:aws:iam::<account_number>:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
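To verify that the cataloged tables are now visible to HAQM Redshift through the new schema, you can query the SVV_EXTERNAL_TABLES system view; a quick check might look like the following:

-- List the external tables mapped through the spectrum schema
SELECT schemaname, tablename, location
FROM svv_external_tables
WHERE schemaname = 'spectrum';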

After the crawler creates the external table, you can start querying in HAQM Redshift using the mapped schema that you created earlier. See the following code:

SELECT * FROM spectrum.orders;

Lastly, create a late binding view that unions the hot and cold data:

CREATE OR REPLACE VIEW lake_house_joint_view AS
(SELECT * FROM public.orders
 WHERE o_orderdate >= dateadd('day', -90, date_trunc('day', getdate()))
 UNION ALL
 SELECT * FROM spectrum.orders
 WHERE o_orderdate < dateadd('day', -90, date_trunc('day', getdate())))
WITH NO SCHEMA BINDING;
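Queries against the view now span both storage tiers transparently: hot rows come from local HAQM Redshift storage and cold rows from HAQM S3 via HAQM Redshift Spectrum. A simple illustrative example:

-- Monthly order counts across both hot (HAQM Redshift) and cold (HAQM S3) data
SELECT date_trunc('month', o_orderdate) AS order_month,
       count(*)                         AS order_count
FROM lake_house_joint_view
GROUP BY 1
ORDER BY 1;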

Summary

In this post, we showed how a large startup company unloaded data from HAQM Redshift to a data lake. By doing so, they exposed the data to many additional groups within the organization and democratized it. The benefits of data democratization are substantial: various teams across the company can access the data, analyze it with a variety of tools, and come up with new insights.

As an additional benefit, the company reduced the storage used in HAQM Redshift, which allowed them to maintain the cluster size and avoid additional spend by keeping all historical data in the data lake and only hot data in the HAQM Redshift cluster. Keeping only hot data in the HAQM Redshift cluster also means the company no longer has to delete data frequently, which saves IT resources, time, and effort.

If you are looking to extend your data warehouse to a data lake and leverage various tools for big data analytics and machine learning (ML) applications, we invite you to try out this walkthrough.

About the Authors

Yonatan Dolan is a Business Development Manager at HAQM Web Services. He is located in Israel and helps customers harness AWS analytical services to leverage data, gain insights, and derive value.

Alon Gendler is a Startup Solutions Architect at HAQM Web Services. He works with AWS customers to help them architect secure, resilient, scalable, and high-performance applications in the cloud.

Vincent Gromakowski is a Specialist Solutions Architect for HAQM Web Services.