AWS Big Data Blog
Building an event-driven application with AWS Lambda and the HAQM Redshift Data API
Event-driven applications, where applications run in response to events, are becoming popular with many customers. A primary benefit of this architecture is the decoupling of producer and consumer processes, which allows greater flexibility in application design.
An example of an event-driven application is an automated workflow triggered by an event, which runs a series of transformations in the data warehouse. At the end of this workflow, another event notifies end-users that the transformations are complete and that they can start analyzing the transformed dataset.
In this post, we explain how you can easily design a similar event-driven application with HAQM Redshift, AWS Lambda, and HAQM EventBridge. In response to a scheduled event defined in EventBridge, this application automatically triggers a Lambda function to run a stored procedure performing extract, load, and transform (ELT) operations in an HAQM Redshift data warehouse, using its out-of-the-box integration with the HAQM Redshift Data API. This stored procedure copies the source data from HAQM Simple Storage Service (HAQM S3) to HAQM Redshift and aggregates the results. When complete, it sends an event to EventBridge, which triggers a Lambda function to send a notification to end-users through HAQM Simple Notification Service (HAQM SNS) to inform them about the availability of updated data in HAQM Redshift.
This event-driven serverless architecture offers greater extensibility and simplicity, making it easier to maintain, faster to release new features, and less susceptible to the impact of changes. It also simplifies adding other components or third-party products to the application without many changes.
Prerequisites
As a prerequisite for creating the application in this post, you need to set up an HAQM Redshift cluster and associate it with an AWS Identity and Access Management (IAM) role. For more information, see Getting Started with HAQM Redshift.
Solution overview
The following architecture diagram highlights the end-to-end solution, which you can provision automatically with an AWS CloudFormation template.
The workflow includes the following steps:
- The EventBridge rule EventBridgeScheduledEventRule is initiated based on a cron schedule.
- The rule triggers the Lambda function LambdaRedshiftDataApiETL, with the action run_sql as an input parameter. The Python code for the Lambda function is available in the GitHub repo.
- The function performs an asynchronous call to the stored procedure run_elt_process in HAQM Redshift, performing ELT operations using the HAQM Redshift Data API.
- The stored procedure uses the HAQM S3 location event-driven-app-with-lambda-redshift/nyc_yellow_taxi_raw/ as the data source for the ELT process. We have pre-populated this with the NYC Yellow Taxi public dataset for the year 2015 to test this solution.
- When the stored procedure is complete, the EventBridge rule EventBridgeRedshiftEventRule is triggered automatically to capture the event, based on the source parameter redshift-data from the HAQM Redshift Data API.
- The rule triggers the Lambda function LambdaRedshiftDataApiETL, with the action notify as an input parameter.
- The function uses the SNS topic RedshiftNotificationTopicSNS to send an automated email notification to end-users that the ELT process is complete.
The HAQM Redshift database objects required for this solution are provisioned automatically during CloudFormation stack creation by the Lambda function LambdaSetupRedshiftObjects, which invokes the function LambdaRedshiftDataApiETL to create the following objects in HAQM Redshift:
- Table nyc_yellow_taxi, which we use to copy the New York taxi dataset from HAQM S3
- Materialized view nyc_yellow_taxi_volume_analysis, providing an aggregated view of the table
- Stored procedure run_elt_process to take care of data transformations
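These three objects could be created through the Data API along the following lines. The column names, the aggregation, and the COPY options below are illustrative assumptions; the actual definitions live in the GitHub repo.

```python
# Hypothetical DDL matching the three objects listed above (assumed schema).
SETUP_SQLS = [
    # Target table for the NYC Yellow Taxi data copied from HAQM S3
    """CREATE TABLE IF NOT EXISTS nyc_yellow_taxi (
           pickup_datetime  TIMESTAMP,
           dropoff_datetime TIMESTAMP,
           trip_distance    DECIMAL(8, 2),
           total_amount     DECIMAL(10, 2));""",
    # Aggregated view of the table
    """CREATE MATERIALIZED VIEW nyc_yellow_taxi_volume_analysis AS
       SELECT DATE_TRUNC('day', pickup_datetime) AS pickup_date,
              COUNT(*)          AS trips,
              SUM(total_amount) AS total_fare
       FROM nyc_yellow_taxi
       GROUP BY 1;""",
    # Stored procedure that loads the raw data and refreshes the view;
    # the IAM role ARN is a placeholder
    """CREATE OR REPLACE PROCEDURE run_elt_process() AS $$
       BEGIN
           COPY nyc_yellow_taxi
           FROM 's3://event-driven-app-with-lambda-redshift/nyc_yellow_taxi_raw/'
           IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
           CSV;
           REFRESH MATERIALIZED VIEW nyc_yellow_taxi_volume_analysis;
       END;
       $$ LANGUAGE plpgsql;""",
]


def setup_objects(client, cluster_id, database, db_user):
    """Submit each DDL statement through the Data API. execute_statement
    is asynchronous, so a real setup would wait for each statement to
    finish (e.g. via describe_statement) before submitting the next."""
    ids = []
    for sql in SETUP_SQLS:
        resp = client.execute_statement(
            ClusterIdentifier=cluster_id,
            Database=database,
            DbUser=db_user,
            Sql=sql,
        )
        ids.append(resp["Id"])
    return ids
```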
The Python code for this function is available in the GitHub repo.
We also use the IAM role LambdaRedshiftDataApiETLRole for the Lambda function LambdaRedshiftDataApiETL to allow the following permissions:
- Federate to the HAQM Redshift cluster through the getClusterCredentials permission, avoiding password credentials
- Initiate queries in the HAQM Redshift cluster through redshift-data API calls
- Log to HAQM CloudWatch for troubleshooting purposes
- Send notifications through HAQM SNS
A sample IAM role for this function is available in the GitHub repo.
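The four permissions above could be expressed roughly as the following policy document. The service actions named here are real IAM actions, but the wide-open resources are placeholders; the sample role in the GitHub repo scopes them properly.

```python
# A sketch of the Lambda role's permissions as an IAM policy document.
# Resource ARNs are placeholders ("*"); restrict them in real deployments.
LAMBDA_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {  # Federate to the cluster without password credentials
            "Effect": "Allow",
            "Action": ["redshift:GetClusterCredentials"],
            "Resource": "*",
        },
        {  # Initiate and inspect queries through the Data API
            "Effect": "Allow",
            "Action": [
                "redshift-data:ExecuteStatement",
                "redshift-data:DescribeStatement",
                "redshift-data:GetStatementResult",
            ],
            "Resource": "*",
        },
        {  # Log to HAQM CloudWatch
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
            ],
            "Resource": "*",
        },
        {  # Send notifications through HAQM SNS
            "Effect": "Allow",
            "Action": ["sns:Publish"],
            "Resource": "*",
        },
    ],
}
```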
Lambda is a key service in this solution because it initiates queries in HAQM Redshift using the redshift-data client. Based on the input parameter action, this function can asynchronously initiate Structured Query Language (SQL) statements in HAQM Redshift, thereby avoiding the risk of timeouts for long-running SQL statements. It can also publish custom notifications through HAQM SNS. The function also uses the HAQM Redshift Data API temporary credentials feature, which allows it to communicate with HAQM Redshift using IAM permissions, without any password-based authentication. With the Data API, you also don't need to configure drivers and connections for your HAQM Redshift cluster, because it's handled automatically.
Deploying the CloudFormation template
When your HAQM Redshift cluster is set up, use the provided CloudFormation template to automatically create all required resources for this solution in your AWS account. For more information, see Getting started with AWS CloudFormation.
The template requires you to provide the following parameters:
- RedshiftClusterIdentifier – Cluster identifier for your HAQM Redshift cluster.
- DbUsername – HAQM Redshift database user name that has access to run the SQL script.
- DatabaseName – Name of the HAQM Redshift database where the SQL script runs.
- RedshiftIAMRoleARN – ARN of the IAM role associated with the HAQM Redshift cluster.
- NotificationEmailId – Email to send event notifications through HAQM SNS.
- ExecutionSchedule – Cron expression to schedule the ELT process through an EventBridge rule.
- SqlText – SQL text to run as part of the ELT process. Don't change the default value (call run_elt_process();) if you want to test this solution with the test dataset provided for this post.
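As an alternative to the console, the stack can be launched programmatically with these parameters. The template URL, stack name, and parameter values below are placeholders; substitute the template and values for your own account.

```python
# Hedged sketch: launching the stack with boto3 instead of the console.
def to_cfn_parameters(params):
    """Convert a plain dict to the CloudFormation Parameters structure."""
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in params.items()]


def deploy_stack(stack_name, template_url, params):
    import boto3

    cfn = boto3.client("cloudformation")
    return cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,  # placeholder; use the template from this post
        Parameters=to_cfn_parameters(params),
        Capabilities=["CAPABILITY_NAMED_IAM"],  # the template creates IAM roles
    )


# Example invocation (placeholder values):
# deploy_stack(
#     "event-driven-redshift",
#     "http://example-bucket.s3.amazonaws.com/template.yaml",
#     {
#         "RedshiftClusterIdentifier": "my-cluster",
#         "DbUsername": "awsuser",
#         "DatabaseName": "dev",
#         "RedshiftIAMRoleARN": "arn:aws:iam::123456789012:role/RedshiftRole",
#         "NotificationEmailId": "user@example.com",
#         "ExecutionSchedule": "cron(0 11 * * ? *)",
#         "SqlText": "call run_elt_process();",
#     },
# )
```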
The following screenshot shows the stack details on the AWS CloudFormation console.
Testing the pipeline
After setting up the architecture, you have an automated pipeline that triggers based on the schedule you defined in the EventBridge rule's cron expression. You can view the CloudWatch logs and troubleshoot issues in the Lambda function. The following screenshot shows the logs for our setup.
You can also view the query status on the HAQM Redshift console, which allows you to view detailed execution plans for the queries you ran. Although the stored procedure may take around 6 minutes to complete, the Lambda function finishes in seconds. This is because the function runs the statement in HAQM Redshift asynchronously: it completes as soon as it has initiated the process, without waiting for the query to finish.
When this process is complete, you receive the email notification that the ELT process is complete.
You may then view the updated data in your business intelligence tool, like HAQM QuickSight, or query data directly in HAQM Redshift Query Editor (see the following screenshot) to view the most recent data processed by this event-driven architecture.
Conclusion
The HAQM Redshift Data API enables you to interact with HAQM Redshift painlessly and to build event-driven and cloud-native applications. We demonstrated how to build an event-driven application with HAQM Redshift, Lambda, and EventBridge. For more information about the Data API, see Using the HAQM Redshift Data API to interact with HAQM Redshift clusters and Using the HAQM Redshift Data API.
About the Authors
Manash Deb is a Senior Analytics Specialist Solutions Architect. He has worked in different database and data warehousing technologies for more than 15 years.
Debu Panda, a senior product manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry and has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).
Fei Peng is a Software Dev Engineer working in the HAQM Redshift team.