Transform data and create dashboards using AWS Glue DataBrew and Tableau

Before you can create visuals and dashboards that convey useful information, you need to transform and prepare the underlying data. With AWS Glue DataBrew, you can now easily transform and prepare datasets from HAQM Simple Storage Service (HAQM S3), an HAQM Redshift data warehouse, HAQM Aurora, and other HAQM Relational Database Service (HAQM RDS) databases and upload them into HAQM S3 to visualize the transformed data in a dashboard using HAQM QuickSight or other business intelligence (BI) tools like Tableau.

DataBrew now also supports writing prepared data into Tableau Hyper format, allowing you to easily take prepared datasets from HAQM S3 and upload them into Tableau for further visualization and analysis. Hyper is Tableau’s in-memory data engine technology optimized for fast data ingest and analytical query processing on large or complex datasets.

In this post, we use DataBrew to extract data from HAQM Redshift, cleanse and transform data using DataBrew to Tableau Hyper format without any coding, and store it in HAQM S3.

Overview of solution

The following diagram illustrates the architecture of the solution.

The solution workflow includes the following steps:

You create a JDBC connection for HAQM Redshift and a DataBrew project on the DataBrew console.
DataBrew queries data from HAQM Redshift by creating a recipe and performing transformations.
The DataBrew job writes the final output to an S3 bucket in Tableau Hyper format.
You can now upload the file into Tableau for further visualization and analysis.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
An S3 bucket
AWS Identity and Access Management (IAM) permissions for DataBrew (for more information, see Setting up IAM policies for DataBrew)
An HAQM Redshift cluster

The following screenshots show the configuration for creating an HAQM Redshift cluster using the HAQM Redshift console with demo sales data. For more information about network security for the cluster, see Setting Up a VPC to Connect to JDBC Data Stores.

For this post, we use the sample data that comes with the HAQM Redshift cluster.

In this post, we only demonstrate how to transform your HAQM Redshift data to Hyper format; uploading the file for further analysis is out of scope.

Create an HAQM Redshift connection

In this step, you use the DataBrew console to create an HAQM Redshift connection.

On the DataBrew console, choose Datasets.
On the Connections tab, choose Create connection.
For Connection name, enter a name (for example, ticket-db-connection).
For Connection type, select HAQM Redshift.
In the Connection access section, provide details like cluster name, database name, user name, and password.
Choose Create connection.

Create your dataset

To create a new dataset, complete the following steps:

On the DataBrew console, choose Datasets.
On the Datasets tab, choose Connect new dataset.
For Dataset name, enter sales.
For Connect to new dataset, select HAQM Redshift.
Choose the connection you created (AwsGlueDataBrew-tickit-sales-db-connection).
Select the public schema and sales table
In the Additional configurations section, for Enter S3 destination, enter the S3 bucket you created as a prerequisite.

DataBrew uses this bucket to store the intermediate results.

Choose Create dataset.
If your query is taking too much time, then add LIMIT clause in your Select statement.

Create a project using the dataset

To create a new project, complete the following steps:

On the DataBrew console, choose Projects and choose Create project.
For Project name, enter sales-project.
For Attached recipe, choose Create new recipe.
For Recipe name, enter sales-project-recipe.
For Select a dataset, select My datasets.
Select the sales dataset.
Under Permissions, for Role name, choose an existing IAM role created during the prerequisites or create a new role.
Choose Create project.

When the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

When we connect a dataset to an HAQM Redshift cluster in your VPC, DataBrew provisions an elastic network interface in your VPC without a public IPV4 address. Because of this, you need to provision a NAT gateway in your VPC as well as an appropriate subnet route table configured for the subnets associated with the AWS Glue network interfaces. To use DataBrew with a VPC subnet without a NAT, you must have a gateway VPC endpoint to HAQM S3 and a VPC endpoint for the AWS Glue interface in your VPC. For more information, see Create a gateway endpoint and Interface VPC endpoints (AWS PrivateLink).

Build a transformation recipe

In this step, we perform some feature engineering techniques (transformations) to prepare our dataset and drop the unnecessary columns from our dataset that aren’t required for this exercise.

On the DataBrew console, choose Column.
Choose Delete.
For Source columns, select the columns pricepaid and commissions.
Choose Apply.

Add a logical condition

With DataBrew, you can now use IF, AND, OR, and CASE logical conditions to create transformations based on functions. With this feature, you have the flexibility to use custom values or reference other columns within the expressions, and can create adaptable transformations for their specific use cases.

To add a logical condition to your transformation recipe, complete the following steps:

On the DataBrew console, choose Conditions.
Choose IF.
For Matching conditions, select Match all conditions.
For Source, choose the value qtysold.
For Enter a value, select Enter a custom value and enter 2.
For Destination column, enter opportunity.
Choose Apply.

The following screenshot shows the full recipe that we applied to our dataset.

Create the DataBrew job

Now that we have built the recipe, we can create and run the DataBrew recipe job.

On the project details page, choose Create job.
For Job name, enter sales-opportunities.
We choose TABLEAU HYPER as the output format.
For S3 location, enter the previously created S3 bucket.
For Role name, choose an existing role created during the prerequisites or create a new role.
Choose Create and run job.
Navigate to the Jobs page and wait for the sales-opportunity job to complete.
Choose the output link to navigate to the HAQM S3 console to access the job output.

Clean up

To avoid incurring future charges, delete the resources you created:

HAQM Redshift cluster
Recipe job
Job output stored in the S3 bucket
IAM roles created as part of this exercise
DataBrew project sales-project and its associated recipe sales-project-recipe
DataBrew datasets

Conclusion

In this post, we showed you how to connect to an HAQM Redshift cluster and create a DataBrew dataset.

We saw how easy it is to get data from HAQM Redshift into DataBrew and apply transformations without any coding. We then ran a recipe job to convert this dataset to Tableau Hyper format file and store it in HAQM S3 for visualization using Tableau. Learn more about all the products and service integrations that AWS Glue DataBrew supports.

About the Authors

Nipun Chagari is a Senior Solutions Architect at AWS, where he helps customers build highly available, scalable, and resilient applications on the AWS Cloud. He is currently focused on helping customers leverage serverless technology to meet their business objectives.

Mohit Malik is a Senior Solutions Architect at HAQM Web Services who specializes in compute, networking, and serverless technologies. He enjoys helping customers learn how to operate efficiently and effectively in the cloud. In his spare time, Mohit enjoys spending time with his family, reading books, and watching movies.

AWS Big Data Blog