AWS Big Data Blog

Running queries securely from the same VPC where an HAQM Redshift cluster is running

Customers who don’t need to set up a VPN or a private connection to AWS often use public endpoints to access AWS. Although this is acceptable for testing out the services, most production workloads need a secure connection to their VPC on AWS. If you’re running your production data warehouse on HAQM Redshift, you can run your queries in HAQM Redshift query editor or use HAQM WorkSpaces from HAQM Virtual Private Cloud (HAQM VPC) to connect to HAQM Redshift securely and analyze and graph a data summary in your favorite business intelligence (BI) or data visualization desktop tool.

With HAQM Redshift, you can query petabytes of structured and semi-structured data across your data warehouse, operational database, and your data lake using standard SQL. HAQM WorkSpaces is a managed, secure Desktop-as-a-Service (DaaS) solution deployed within an HAQM VPC. In this post, we show how you can run SQL queries on HAQM Redshift securely without VPN using HAQM Redshift query editor and HAQM WorkSpaces. First, we discuss how to run queries that return large datasets from the HAQM Redshift query editor using the UNLOAD command. Next, we discuss how to set up HAQM WorkSpaces and use it to securely run queries on HAQM Redshift. We cover the detailed steps for setting up HAQM WorkSpaces and show different scenarios to test HAQM Redshift queries with HAQM WorkSpaces.

The following diagram illustrates these architectures.

Using the HAQM Redshift query editor with UNLOAD command

HAQM Redshift has a query editor on its console that is typically used to run short queries and explore the data in the HAQM Redshift database. You may have a production scenario with queries that return large result sets. For instance, you may want to unload CSV data for use by a downstream process. In this case, you can run your query in the query editor and use the UNLOAD command to send the output directly to HAQM Simple Storage Service (HAQM S3) and get notified when the data is uploaded.

  1. Create separate S3 buckets for each user. You can use the default configuration for the bucket.
  2. On the HAQM Redshift console, choose Editor.
  3. In the Connect to database section, enter the database connection details.
  4. Choose Connect to database.
  5. Use the following format for the queries run from the query editor, using the IAM role for HAQM Redshift, so the results get uploaded to HAQM S3:
    UNLOAD
    ('select id, name
    from <your_schema>.<your_table>')
    TO 's3://<username>/<yy-mm-dd-hh24-mi-ss>/'
    FORMAT AS CSV
    iam_role 'arn:aws:iam::<myaccount>:role/MyHAQM_RedshiftUnloadRole';

This query creates multiple files in the designated user’s S3 bucket under the date/time prefix. The user can preview or download the individual files on the HAQM S3 console.
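As a sketch, the date/time prefix in the TO clause can be generated and spliced into the UNLOAD statement like this; the bucket, schema, table, and role names in the example are placeholders, not values from this post:

```python
# Sketch: build an UNLOAD statement whose S3 target uses a yy-mm-dd-hh24-mi-ss
# prefix, matching the path format shown above. All names are placeholders.
from datetime import datetime


def unload_statement(bucket: str, schema: str, table: str, iam_role_arn: str) -> str:
    # strftime pattern corresponding to <yy-mm-dd-hh24-mi-ss>
    prefix = datetime.now().strftime("%y-%m-%d-%H-%M-%S")
    return (
        "UNLOAD ('select id, name from {schema}.{table}') "
        "TO 's3://{bucket}/{prefix}/' "
        "FORMAT AS CSV "
        "iam_role '{arn}';"
    ).format(schema=schema, table=table, bucket=bucket, prefix=prefix, arn=iam_role_arn)


print(unload_statement("user1-bucket", "myschema", "mytable",
                       "arn:aws:iam::123456789012:role/MyHAQM_RedshiftUnloadRole"))
```

Running the generated statement from the query editor writes the result files under a fresh timestamp prefix each time.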

A large unload may take some time. You can configure HAQM Simple Notification Service (HAQM SNS) to send a notification when the results are uploaded to HAQM S3.

  6. On the HAQM SNS console, choose Topics.
  7. Choose Create topic.
  8. Create an SNS topic with a meaningful description text, like Your query results are uploaded to S3.

In the next steps, you edit the access policy of the SNS topic to give permission for HAQM S3 to publish to it.

  9. Change the Principal from "AWS": "*" to "Service": "s3.amazonaws.com".
  10. Scroll down to “Action” and delete everything except “SNS:Publish”. Make sure to delete the extra comma.
  11. Scroll down to “Condition” and modify the text "StringEquals": { "AWS:SourceOwner": <Your account id>} to "ArnLike": { "aws:SourceArn": "arn:aws:s3:*:*:<user-bucket-name>" }.
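After these edits, the policy statement should look roughly like the following sketch; the Resource ARN placeholders stand in for your own topic ARN:

```json
{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "Service": "s3.amazonaws.com" },
            "Action": "SNS:Publish",
            "Resource": "arn:aws:sns:<region>:<account-id>:<topic-name>",
            "Condition": {
                "ArnLike": { "aws:SourceArn": "arn:aws:s3:*:*:<user-bucket-name>" }
            }
        }
    ]
}
```

The ArnLike condition restricts publishing to events coming from the user’s bucket, rather than from any bucket in the account.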

  12. In the navigation pane, choose Subscriptions.
  13. Choose Create subscription.

  14. Subscribe to the SNS notification with the user’s email address as the endpoint.

  15. Make sure the user confirms the subscription from their email.
  16. On the HAQM S3 console, choose the Properties tab of the user’s S3 bucket.
  17. Under Event Notifications, choose Create event notification.

  18. Select All object create events.

  19. For Destination, select SNS topic.
  20. For Specify SNS topic, select Choose from your SNS topics.
  21. For SNS topic, choose the topic you created.
  22. Save your settings.
  23. To test the notification, on the HAQM Redshift console, open the query editor.
  24. Edit the UNLOAD query and change the date/time prefix in the S3 path to the current date and time.
  25. Run the query and check if the user gets the email notification.
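If you prefer to script the event notification instead of clicking through the console, the steps above correspond roughly to the following boto3 notification configuration; the topic ARN and bucket name are hypothetical placeholders, and the actual API call is left commented out:

```python
# Sketch: the bucket notification configuration created by the console steps
# above, expressed as the payload for boto3's
# put_bucket_notification_configuration. ARN and bucket name are placeholders.

notification_config = {
    "TopicConfigurations": [
        {
            "TopicArn": "arn:aws:sns:us-east-1:123456789012:query-results-topic",
            # Fires on every object-created event, i.e. each time UNLOAD
            # writes a result file under the date/time prefix.
            "Events": ["s3:ObjectCreated:*"],
        }
    ]
}

# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_notification_configuration(
#     Bucket="<user-bucket-name>", NotificationConfiguration=notification_config
# )
```

The SNS topic’s access policy must already allow s3.amazonaws.com to publish, or the call is rejected.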

Using HAQM WorkSpaces to run HAQM Redshift queries

In this section, we cover setting up HAQM WorkSpaces, including HAQM VPC prerequisites, creating an HAQM VPC endpoint for HAQM S3, launching HAQM WorkSpaces in the same VPC where an HAQM Redshift cluster is running, setting up an HAQM WorkSpaces client, installing PSQL or a SQL client, and connecting to the client.

When setup is complete, we show different scenarios to test with HAQM WorkSpaces, such as testing a SQL command from the HAQM WorkSpaces client, testing the SCREEN program to run SQL in the background, and testing PSQL with HAQM S3 and getting a notification through HAQM SNS.

Prerequisites

By default, AWS Identity and Access Management (IAM) users and roles can’t perform tasks using the AWS Management Console, and they don’t have permission to create or modify HAQM VPC resources. Make sure you have administrator privileges, or have an administrator create IAM policies that grant sufficient permissions to edit the route table, edit the VPC security group, and enable a DNS hostname for the VPC.

When you have the correct permissions, complete the following prerequisite steps:

  1. On the HAQM Redshift console, check the cluster subnet groups to make sure the HAQM Redshift cluster is created in an HAQM VPC with at least two subnets that are in separate Availability Zones.
  2. On the HAQM VPC console, edit the route table and make sure to associate these two subnets.
  3. Make sure the HAQM VPC security group has a self-referencing inbound rule for the security group for all traffic (not all TCP). The self-referencing rule restricts the source to the same security group in the VPC, and it’s not open to all networks. Consider limiting the rule to just the protocol and port HAQM Redshift needs to talk to HAQM WorkSpaces.
  4. Edit the DNS hostname setting of the HAQM VPC and enable it.
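As a sketch of that narrower rule, the following shows the request parameters you could pass to boto3’s authorize_security_group_ingress; the security group ID is a hypothetical placeholder, and the call itself is left commented out:

```python
# Sketch: a self-referencing ingress rule limited to the default HAQM Redshift
# port (5439) instead of all traffic. The security group ID is a placeholder.

SECURITY_GROUP_ID = "sg-0123456789abcdef0"  # hypothetical ID for illustration

ingress_params = {
    "GroupId": SECURITY_GROUP_ID,
    "IpPermissions": [
        {
            "IpProtocol": "tcp",
            "FromPort": 5439,  # default HAQM Redshift port
            "ToPort": 5439,
            # Self-referencing rule: the source is the same security group, so
            # only resources in this group (e.g., WorkSpaces) can connect.
            "UserIdGroupPairs": [{"GroupId": SECURITY_GROUP_ID}],
        }
    ],
}

# import boto3
# ec2 = boto3.client("ec2")
# ec2.authorize_security_group_ingress(**ingress_params)
```

If your cluster uses a non-default port, substitute it for 5439 in both FromPort and ToPort.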

Creating an HAQM VPC endpoint for HAQM S3 for software downloads

In this step, you create your HAQM VPC endpoint for HAQM S3. This gives you HAQM S3 access to download PSQL from the HAQM repository. Alternatively, you could set up a NAT Gateway and download PSQL or other SQL clients from the internet.

  1. On the HAQM VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. Search for Service Name: S3.
  4. Select the S3 service gateway.
  5. Select the HAQM VPC where the HAQM Redshift cluster is running.
  6. Select the route table.
  7. Enter the following custom policy for the endpoint to access the HAQM Linux AMI repositories:
    {
        "Version": "2008-10-17",
        "Statement": [
            {
                "Sid": "HAQM Linux AMI Repository Access",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": [
                    "arn:aws:s3:::*.amazonaws.com",
                    "arn:aws:s3:::*.amazonaws.com/*"
                ]
            }
        ]
    }

  8. Create the endpoint.

Launching HAQM WorkSpaces in the VPC where the HAQM Redshift cluster runs

You’re now ready to launch HAQM WorkSpaces.

  1. On the HAQM WorkSpaces console, choose Launch WorkSpaces.
  2. For Directory types, select Simple AD.

Directory Service Solutions helps you store information and manage access to resources. For this post, we choose Simple AD.

  3. For Directory size, select Small.
  4. Enter your directory details.

  5. For VPC, choose the VPC where the HAQM Redshift cluster is running.
  6. For Subnets, choose the two subnets you created.

It may take a few minutes to provision the directory. You see the status show as Active when it’s ready.

  7. When the directory is provisioned, choose the directory and subnets you created.
  8. Choose Next Step.
  9. Create and identify your users.
  10. Use the default settings for the compute bundle.
  11. For Running Mode, select AlwaysOn.

Alternatively, select AutoStop and adjust the time in order to run long-running queries.

  12. Review and launch WorkSpaces.

It may take up to 20 minutes to become available.

Setting up the HAQM WorkSpaces client

In this section, you configure your HAQM WorkSpaces client.

  1. Use the link from your email to download the HAQM WorkSpaces client.
  2. Register it using the registration code from the email.
  3. Log in with your username from the email and your newly created password.
  4. In the HAQM WorkSpaces client, open the terminal.
  5. Run the following command to capture the IP address:
    hostname -I | awk '{print $2}'

The following screenshot shows your output.

  6. On the HAQM Redshift console, choose Clusters.
  7. Choose your cluster.
  8. Save the endpoint information to use later.
  9. Choose the Properties tab.
  10. In the Network and security section, note the VPC security group.
  11. On the HAQM VPC console, under Security, choose Security Groups.
  12. Select the security group that the HAQM Redshift cluster uses.

  13. Add an inbound rule with the type Redshift and, as the source, the IP address you captured with a /32 suffix.

  14. On the HAQM WorkSpaces client, use the HAQM Redshift hostname from the endpoint you saved earlier and verify the VPC setup with the following code:
     nslookup <HAQM Redshift hostname>

If you see an IP address within your subnet range, the private endpoint setup for HAQM Redshift was successful.
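As a quick sanity check on the nslookup result, you can test whether an address falls inside a subnet’s CIDR range; the IP addresses and CIDR below are hypothetical examples:

```python
# Sketch: verify that a resolved cluster IP is inside a subnet's CIDR range.
# The addresses below are made-up examples, not values from this post.
import ipaddress


def in_subnet(ip: str, cidr: str) -> bool:
    """Return True if the IP address belongs to the given CIDR block."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)


print(in_subnet("10.0.1.25", "10.0.1.0/24"))     # private-subnet IP -> True
print(in_subnet("52.23.194.11", "10.0.1.0/24"))  # public IP -> False
```

A result inside the subnet range means the hostname resolved to the private endpoint rather than a public one.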

Testing a SQL command from PSQL or a SQL client in the HAQM WorkSpaces client

To test a SQL command, complete the following steps:

  1. From the terminal in the HAQM WorkSpaces client, run the following command to install PostgreSQL:
    sudo yum install postgresql-server

Alternatively, set up a NAT Gateway and download a SQL client such as SQL Workbench on the HAQM WorkSpaces client:

sudo wget http://www.sql-workbench.eu/Workbench-Build125-with-optional-libs.zip

Then unzip the content of the downloaded file and save it to a directory:

unzip Workbench-Build125-with-optional-libs.zip -d ~/Workbench

  2. Use the HAQM Redshift hostname, port, and database names of the HAQM Redshift cluster endpoint you copied earlier and try connecting to the database:
    psql -h <HAQM Redshift hostname> -p <port> -d <database> -U <username> -W
  3. Enter your password when prompted.

  4. Run a SQL command and check the results.

Testing the SCREEN program to run SQL in the background

You can use the SCREEN program to run the SQL command in the background and resume to see the results.

  1. From the terminal in the HAQM WorkSpaces client, install the SCREEN program:
    sudo yum install screen
  2. Run the program:
    screen
  3. Connect to PSQL:
    psql -h <HAQM Redshift hostname> -p <port> -d <database> -U <username> -W
  4. Enter the password when prompted.
  5. Run the SQL command.
  6. Enter Ctrl+A followed by D to detach from the screen.

The SQL command is now running in the background. You can check by running ps -ef | grep psql.

  7. To go back to that screen, run the following command:
    screen -r
  8. To quit SCREEN, enter Ctrl+A followed by \.

Testing PSQL with HAQM S3 and HAQM SNS

Similar to the UNLOAD command we used from the HAQM Redshift query editor in the beginning of this post, you can run PSQL from the HAQM WorkSpaces client, send the output to an S3 bucket, and get an HAQM SNS notification for an object create event.

  1. From the terminal in the HAQM WorkSpaces client, run aws configure to set up AWS credentials with write access to the S3 bucket.
  2. Run the following command to write a single output file to HAQM S3 and send an email notification:
    psql -h <HAQM Redshift hostname> -p <port> -d <database> -U <username> -W -c 'select column1, column2 from myschema.mytable' | aws s3 cp - s3://<username>/<yy-mm-dd-hh24-mi-ss>/output.txt

Conclusion

This post explained how to securely query HAQM Redshift databases running in an HAQM VPC using the UNLOAD command with the HAQM Redshift query editor and using HAQM WorkSpaces running in the same VPC. Try out this solution to securely access HAQM Redshift databases without a VPN and run long-running queries. If you have any questions or comments, please share your thoughts in the comments section.


About the Authors

Seetha Sarma is a Senior Database Specialist Solutions Architect with HAQM Web Services. Seetha provides guidance to customers on using AWS services for distributed data processing. In her spare time she likes to go on long walks and enjoy nature.


Moiz Mian is a Solutions Architect for AWS Strategic Accounts. He focuses on enabling customers to build innovative, scalable, and secure solutions for the cloud. In his free time, he enjoys building out Smart Home systems and driving at race tracks.