AWS Big Data Blog
Share and publish your Snowflake data to AWS Data Exchange using HAQM Redshift data sharing
HAQM Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. Today, tens of thousands of AWS customers—from Fortune 500 companies, startups, and everything in between—use HAQM Redshift to run mission-critical business intelligence (BI) dashboards, analyze real-time streaming data, and run predictive analytics. With the constant increase in generated data, HAQM Redshift customers continue to achieve success in delivering better service to their end-users, improving their products, and running an efficient and effective business.
In this post, we discuss a customer who is currently using Snowflake to store analytics data. The customer needs to offer this data to clients who are using HAQM Redshift via AWS Data Exchange, the world’s most comprehensive service for third-party datasets. We explain in detail how to implement a fully integrated process that will automatically ingest data from Snowflake into HAQM Redshift and offer it to clients via AWS Data Exchange.
Overview of the solution
The solution consists of four high-level steps:
- Configure Snowflake to push the changed data for identified tables into an HAQM Simple Storage Service (HAQM S3) bucket.
- Use a custom-built Redshift Auto Loader to load this HAQM S3 landed data to HAQM Redshift.
- Merge the data from the change data capture (CDC) S3 staging tables to HAQM Redshift tables.
- Use HAQM Redshift data sharing to license the data to customers via AWS Data Exchange as a public or private offering.
The following diagram illustrates this workflow.
Prerequisites
To get started, you need the following prerequisites:
- A Snowflake account in the same Region as your HAQM Redshift cluster.
- An S3 bucket. Refer to Create your first S3 bucket for more details.
- An HAQM Redshift cluster with encryption enabled and an AWS Identity and Access Management (IAM) role with permission to the S3 bucket. See Create a sample HAQM Redshift cluster and Create an IAM role for HAQM Redshift for more details.
- A database schema migrated from Snowflake to HAQM Redshift using the AWS Schema Conversion Tool (AWS SCT). For more information, refer to Accelerate Snowflake to HAQM Redshift migration using AWS Schema Conversion Tool.
- An IAM role and external HAQM S3 stage that give Snowflake access to the S3 bucket you created earlier. For instructions, refer to Configuring Secure Access to HAQM S3. Name this external stage unload_to_s3 and point it to the s3-redshift-loader-source folder of the target S3 bucket; it is referenced in the COPY commands later in this post for offloading the data to HAQM S3. Once created, the external stage appears as shown in the following screenshot (a sample stage definition follows this list).
- You must be a registered provider on AWS Data Exchange. For more information, see Providing data products on AWS Data Exchange.
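The external stage described in the prerequisites could be defined as in the following minimal sketch. The bucket name is a placeholder, and the storage integration name (s3_storage_integration) is an assumption based on the linked instructions.

```sql
-- Hypothetical external stage; replace the bucket placeholder and the storage
-- integration name with the ones created in the linked instructions.
CREATE OR REPLACE STAGE unload_to_s3
  URL = 's3://<your-bucket>/s3-redshift-loader-source/'
  STORAGE_INTEGRATION = s3_storage_integration;
```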
Configure Snowflake to track the changed data and unload it to HAQM S3
In Snowflake, identify the tables that you need to replicate to HAQM Redshift. For the purpose of this demo, we use the data in the TPCH_SF1 schema’s Customer, LineItem, and Orders tables of the SNOWFLAKE_SAMPLE_DATA database, which comes out of the box with your Snowflake account.
- Make sure that the Snowflake external stage name unload_to_s3 created in the prerequisites is pointing to the S3 prefix s3-redshift-loader-source created in the previous step.
- Create a new schema BLOG_DEMO in the DEMO_DB database: CREATE SCHEMA demo_db.blog_demo;
- Duplicate the Customer, LineItem, and Orders tables in the TPCH_SF1 schema to the BLOG_DEMO schema (see the first sketch after this list):
- Verify that the tables have been duplicated successfully:
- Create table streams to track data manipulation language (DML) changes made to the tables, including inserts, updates, and deletes:
- Perform DML changes to the tables (for this post, we run UPDATE on all tables and MERGE on the customer table):
- Validate that the stream tables have recorded all changes:
- Run the COPY command to offload the CDC from the stream tables to the S3 bucket using the external stage name unload_to_s3. The data is copied to S3 folders ending with _stg so that when the Redshift Auto Loader automatically creates these tables in HAQM Redshift, they get created and marked as staging tables (see the second sketch after this list).
- Verify the data in the S3 bucket. There will be three subfolders created in the s3-redshift-loader-source folder of the S3 bucket, and each will have .parquet data files.
You can also automate the preceding COPY commands using tasks, which can be scheduled to run at a set frequency for automatic copy of CDC data from Snowflake to HAQM S3.
- Use the ACCOUNTADMIN role to assign the EXECUTE TASK privilege. In this scenario, we’re assigning the privileges to the SYSADMIN role (a sketch of these task commands follows this list):
- Use the SYSADMIN role to create three separate tasks to run the three COPY commands every 5 minutes: USE ROLE sysadmin;
When the tasks are first created, they’re in a SUSPENDED state.
- Alter the three tasks and set them to RESUME state:
- Validate that all three tasks have been resumed successfully: SHOW TASKS;
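The grant, task definition, and resume statements described in these steps could look like the following sketch; the task names and the warehouse name are illustrative.

```sql
-- Grant the ability to run tasks to the SYSADMIN role (run as ACCOUNTADMIN)
USE ROLE accountadmin;
GRANT EXECUTE TASK ON ACCOUNT TO ROLE sysadmin;

-- Create one task per table to run its COPY command every 5 minutes
USE ROLE sysadmin;
CREATE OR REPLACE TASK copy_customer_task
  WAREHOUSE = compute_wh   -- illustrative warehouse name
  SCHEDULE  = '5 MINUTE'
AS
  COPY INTO @unload_to_s3/customer_stg/
    FROM (
      SELECT c_custkey, c_name, c_address, c_nationkey, c_phone,
             c_acctbal, c_mktsegment, c_comment,
             metadata$action AS metadata_action, metadata$isupdate AS metadata_isupdate
      FROM customer_check
    )
    FILE_FORMAT = (TYPE = PARQUET)
    HEADER = TRUE;

-- Define copy_lineitem_task and copy_orders_task the same way, then resume all three
ALTER TASK copy_customer_task RESUME;
ALTER TASK copy_lineitem_task RESUME;
ALTER TASK copy_orders_task   RESUME;

-- Confirm that the tasks have been resumed
SHOW TASKS;
```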
Now the tasks will run every 5 minutes and look for new data in the stream tables to offload to HAQM S3. As soon as data is migrated from Snowflake to HAQM S3, Redshift Auto Loader automatically infers the schema and instantly creates corresponding tables in HAQM Redshift. Then, by default, it starts loading data from HAQM S3 to HAQM Redshift every 5 minutes. You can also change the default setting of 5 minutes.
- On the HAQM Redshift console, launch the query editor v2 and connect to your HAQM Redshift cluster.
- Browse to the dev database, public schema, and expand Tables. You can see three staging tables created with the same name as the corresponding folders in HAQM S3.
- Validate the data in one of the tables by running the following query: SELECT * FROM "dev"."public"."customer_stg";
Configure the Redshift Auto Loader utility
The Redshift Auto Loader makes data ingestion to HAQM Redshift significantly easier because it automatically loads data files from HAQM S3 to HAQM Redshift. Files are mapped to their respective tables simply by dropping them into preconfigured locations on HAQM S3. For more details about the architecture and internal workflow, refer to the GitHub repo.
We use an AWS CloudFormation template to set up Redshift Auto Loader. Complete the following steps:
- Launch the CloudFormation template.
- Choose Next.
- For Stack name, enter a name.
- Provide the parameters listed in the following table.
| CloudFormation Template Parameter | Allowed Values | Description |
| --- | --- | --- |
| RedshiftClusterIdentifier | HAQM Redshift cluster identifier | Enter the HAQM Redshift cluster identifier. |
| DatabaseUserName | Database user name in the HAQM Redshift cluster | The HAQM Redshift database user name that has access to run the SQL script. |
| DatabaseName | Database name in the HAQM Redshift cluster | The name of the HAQM Redshift primary database where the SQL script is run. |
| DatabaseSchemaName | Schema name in the HAQM Redshift database | The HAQM Redshift schema name where the tables are created. |
| RedshiftIAMRoleARN | Default or a valid IAM role ARN attached to the HAQM Redshift cluster | The IAM role ARN associated with the HAQM Redshift cluster. If your default IAM role is set for the cluster and has access to your S3 bucket, you can leave it at the default. |
| CopyCommandOptions | Copy option; default is delimiter ‘\|’ gzip | The additional COPY command data format parameters. If InitiateSchemaDetection = Yes, the process attempts to detect the schema and automatically set suitable COPY command options. If schema detection fails, or if InitiateSchemaDetection = No, this value is used as the default COPY command options to load data. |
| SourceS3Bucket | S3 bucket name | The S3 bucket where the data is stored. Make sure the IAM role associated with the HAQM Redshift cluster has access to this bucket. |
| InitiateSchemaDetection | Yes/No | Set to Yes to dynamically detect the schema prior to file load and create a table in HAQM Redshift if it doesn’t already exist. If a table already exists, it won’t be dropped or recreated. If schema detection fails, the process uses the default COPY options as specified in CopyCommandOptions. |

The Redshift Auto Loader uses the COPY command to load data into HAQM Redshift. For this post, set CopyCommandOptions to any supported COPY command data format parameters (an illustrative value appears after these setup steps).
- Choose Next.
- Accept the default values on the next page and choose Next.
- Select the acknowledgement check box and choose Create stack.
- Monitor the progress of the stack creation and wait until it is complete.
- To verify the Redshift Auto Loader configuration, sign in to the HAQM S3 console and navigate to the S3 bucket you provided. You should see that a new directory, s3-redshift-loader-source, has been created.
Copy all the data files exported from Snowflake under s3-redshift-loader-source.
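For the Parquet files unloaded from Snowflake in this walkthrough, a plausible CopyCommandOptions value is shown below. Treat it as an illustrative fallback only, because with InitiateSchemaDetection = Yes the Auto Loader attempts to set suitable options automatically.

```sql
-- Hypothetical CopyCommandOptions value for Parquet files unloaded from Snowflake
FORMAT AS PARQUET
```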
Merge the data from the CDC S3 staging tables to HAQM Redshift tables
To merge your data from HAQM S3 to HAQM Redshift, complete the following steps:
- Create a temporary staging table merge_stg and insert all the rows from the S3 staging table that have metadata_action as INSERT (the SQL for these steps is sketched after this list). This includes all the new inserts as well as the updates.
- Use the S3 staging table customer_stg to delete the records from the base table customer, which are marked as deletes or updates:
- Use the temporary staging table merge_stg to insert the records marked for updates or inserts:
- Truncate the staging table, because we have already updated the target table: truncate customer_stg;
- You can also run the preceding steps as a stored procedure:
- Now, to update the target table, we can run the stored procedure as follows: CALL merge_customer()
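The merge steps described above could look like the following sketch, assuming the staging table exposes the stream action flag as a column named metadata_action (as referenced earlier) and the migrated base table is public.customer.

```sql
-- 1. Stage the rows marked as INSERT (new rows plus the new versions of updated rows)
CREATE TEMP TABLE merge_stg AS
SELECT * FROM public.customer_stg
WHERE metadata_action = 'INSERT';

-- 2. Delete base-table rows that the stream marked as deleted or updated
DELETE FROM public.customer
USING public.customer_stg
WHERE customer.c_custkey = customer_stg.c_custkey
  AND customer_stg.metadata_action = 'DELETE';

-- 3. Insert the staged rows into the base table
INSERT INTO public.customer
SELECT c_custkey, c_name, c_address, c_nationkey, c_phone,
       c_acctbal, c_mktsegment, c_comment
FROM merge_stg;

-- 4. Clear the staging table for the next CDC batch
TRUNCATE public.customer_stg;
```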
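Wrapped as a stored procedure, the same logic might look like the following sketch (column and table names as assumed above).

```sql
CREATE OR REPLACE PROCEDURE merge_customer()
AS $$
BEGIN
  -- Stage the INSERT rows from the CDC staging table
  CREATE TEMP TABLE merge_stg AS
  SELECT * FROM public.customer_stg
  WHERE metadata_action = 'INSERT';

  -- Remove deleted and updated rows from the base table
  DELETE FROM public.customer
  USING public.customer_stg
  WHERE customer.c_custkey = customer_stg.c_custkey
    AND customer_stg.metadata_action = 'DELETE';

  -- Apply the new and updated rows
  INSERT INTO public.customer
  SELECT c_custkey, c_name, c_address, c_nationkey, c_phone,
         c_acctbal, c_mktsegment, c_comment
  FROM merge_stg;

  -- Reset the staging tables for the next CDC batch
  TRUNCATE public.customer_stg;
  DROP TABLE merge_stg;
END;
$$ LANGUAGE plpgsql;
```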
The following screenshot shows the final state of the target table after the stored procedure is complete.
Run the stored procedure on a schedule
You can also run the stored procedure on a schedule via HAQM EventBridge. The scheduling steps are as follows:
- On the EventBridge console, choose Create rule.
- For Name, enter a meaningful name, for example, Trigger-Snowflake-Redshift-CDC-Merge.
- For Event bus, choose default.
- For Rule Type, select Schedule.
- Choose Next.
- For Schedule pattern, select A schedule that runs at a regular rate, such as every 10 minutes.
- For Rate expression, enter Value as 5 and choose Unit as Minutes.
- Choose Next.
- For Target types, choose AWS service.
- For Select a Target, choose Redshift cluster.
- For Cluster, choose the HAQM Redshift cluster identifier.
- For Database name, choose dev.
- For Database user, enter a user name with access to run the stored procedure. It uses temporary credentials to authenticate.
- Optionally, you can also use AWS Secrets Manager for authentication.
- For SQL statement, enter CALL merge_customer().
- For Execution role, select Create a new role for this specific resource.
- Choose Next.
- Review the rule parameters and choose Create rule.
After the rule has been created, it automatically triggers the stored procedure in HAQM Redshift every 5 minutes to merge the CDC data into the target table.
Configure HAQM Redshift to share the identified data with AWS Data Exchange
Now that you have the data stored inside HAQM Redshift, you can publish it to customers using AWS Data Exchange.
- In HAQM Redshift, using any query editor, create the data share and add the tables to be shared (see the sketch after these steps):
- On the AWS Data Exchange console, create your dataset.
- Select HAQM Redshift datashare.
- Create a revision in the dataset.
- Add assets to the revision (in this case, the HAQM Redshift data share).
- Finalize the revision.
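The data share creation in the first step could look like the following minimal sketch; the data share name is illustrative, and MANAGEDBY ADX marks the share as managed by AWS Data Exchange.

```sql
-- Create an AWS Data Exchange managed data share and add the tables to license
CREATE DATASHARE blog_datashare MANAGEDBY ADX;
ALTER DATASHARE blog_datashare ADD SCHEMA public;
ALTER DATASHARE blog_datashare ADD TABLE public.customer;
-- Add any other tables you want to share, for example lineitem and orders
```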
After you create the dataset, you can publish it to the public catalog or directly to customers as a private product. For instructions on how to create and publish products, refer to NEW – AWS Data Exchange for HAQM Redshift.
Clean up
To avoid incurring future charges, complete the following steps:
- Delete the CloudFormation stack used to create the Redshift Auto Loader.
- Delete the HAQM Redshift cluster created for this demonstration.
- If you were using an existing cluster, drop the created external table and external schema.
- Delete the S3 bucket you created.
- Delete the Snowflake objects you created.
Conclusion
In this post, we demonstrated how you can set up a fully integrated process that continuously replicates data from Snowflake to HAQM Redshift and then uses HAQM Redshift to offer data to downstream clients over AWS Data Exchange. You can use the same architecture for other purposes, such as sharing data with other HAQM Redshift clusters within the same account, across accounts, or even across Regions if needed.
About the Authors
Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.
Ekta Ahuja is a Senior Analytics Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys baking, traveling, and board games.
Tahir Aziz is an Analytics Solution Architect at AWS. He has been building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.
Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based in Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytics solutions.