AWS Big Data Blog

Access your existing data and resources through HAQM SageMaker Unified Studio, Part 2: HAQM S3, HAQM RDS, HAQM DynamoDB, and HAQM EMR

Organizations often face the challenge of managing and analyzing data spread across multiple storage systems and databases while providing secure, efficient access for their data science teams. HAQM SageMaker Unified Studio addresses this challenge by providing a unified analytics and AI development environment where data scientists can access, analyze, and use data from various sources within a single, governed workspace. This allows teams to use their existing data infrastructure while taking advantage of advanced analytics and AI capabilities. SageMaker Unified Studio is part of the next generation of HAQM SageMaker, the center for all your data, analytics, and AI.

In Part 1 of this series, we explored how to access AWS Glue Data Catalog tables and HAQM Redshift resources through SageMaker Unified Studio. Continuing our journey, this post discusses integrating additional vital data sources such as HAQM Simple Storage Service (HAQM S3) buckets, HAQM Relational Database Service (HAQM RDS), HAQM DynamoDB, and HAQM EMR clusters. We demonstrate how to configure the necessary permissions, establish connections, and effectively use these resources within SageMaker Unified Studio. Whether you’re working with object storage, relational databases, NoSQL databases, or big data processing, this post can help you seamlessly incorporate your existing data infrastructure into your SageMaker Unified Studio workflows.


Solution overview

SageMaker Unified Studio works with your existing data and resources when you configure the relevant permissions and network settings.

Let's walk through how to access existing datasets and resources across HAQM S3, HAQM RDS, DynamoDB, and HAQM EMR through SageMaker Unified Studio.

Prerequisites

To follow the instructions in this post, you must have the following prerequisites:

  • An AWS account
  • A SageMaker Unified Studio domain
  • A SageMaker Unified Studio project with the All capabilities project profile

In SageMaker Unified Studio, select the project and navigate to the Project overview page. Copy the Project role ARN as highlighted in the screenshot. This project role is used later in this post to grant permissions on existing datasets and resources.

Use existing S3 buckets

This section has the following prerequisites:

  • An S3 bucket

To use an existing S3 bucket in SageMaker Unified Studio, configure an S3 bucket policy that allows the appropriate actions for the project AWS Identity and Access Management (IAM) role.

The following is an example bucket policy. Replace <aws_accountid> with the AWS account ID where the domain resides, <s3_bucket> with the name of the S3 bucket that you intend to query in SageMaker Unified Studio, and <datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy> with the project role in SageMaker Unified Studio:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::<s3_bucket>",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        },
        {
            "Sid": "Statement2",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::<s3_bucket>/*",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        }
    ]
}
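
If you prefer to apply the policy programmatically instead of through the HAQM S3 console, the following is a minimal sketch using boto3; the bucket name and policy file name are placeholders for your own values:

import json

import boto3

# Placeholders - replace with your bucket name and the policy document shown above
BUCKET = "<s3_bucket>"
POLICY_FILE = "bucket-policy.json"

s3 = boto3.client("s3")

with open(POLICY_FILE) as f:
    policy = json.load(f)

# Attach (or overwrite) the bucket policy
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))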

After you configure the policy, log in to SageMaker Unified Studio and open the project.

Query the data using the JupyterLab IDE to perform analysis, as shown in the following screenshot.

Although the project role has been granted the appropriate permissions to access the S3 bucket, you will not be able to list the contents of the bucket or see the S3 path in the data explorer section within SageMaker Unified Studio.
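
For example, the following is a minimal notebook cell that reads a CSV object from the bucket into a pandas DataFrame; the bucket name and object key are placeholders:

import boto3
import pandas as pd

# Placeholders - replace with your bucket name and object key
BUCKET = "<s3_bucket>"
KEY = "data/venue.csv"

# The notebook runs with the project role, which the bucket policy trusts
obj = boto3.client("s3").get_object(Bucket=BUCKET, Key=KEY)
df = pd.read_csv(obj["Body"])
print(df.head())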

Use existing RDS DB instances

This section has the following prerequisites:

  • A VPC and a private subnet
  • An RDS DB instance on the private subnet in the VPC

SageMaker Unified Studio uses the virtual private cloud (VPC) and subnets that are specified during domain creation. If a data source such as an RDS DB instance resides in a separate VPC, you can configure network reachability between the domain VPC and the data source VPC using VPC peering, AWS Transit Gateway, or a resource VPC endpoint. Alternatively, you can create a new domain using the data source VPC.

Add a PostgreSQL connection

Complete the following steps to configure that reachability using HAQM Virtual Private Cloud (HAQM VPC) peering (a scripted sketch follows the steps):

  1. On the HAQM VPC console, choose Your VPCs, and make a note of the VPC ID of your VPC named SageMakerUnifiedStudioVPC.
  2. Choose Peering connections, and choose Create peering connection.
  3. Under Select another VPC to peer with, for VPC ID (Requester), choose the VPC ID noted earlier.
  4. Under Select another VPC to peer with, for VPC ID (Accepter), choose the VPC where the target RDS DB instance is located.
  5. Review your settings and choose Create peering connection.
  6. On the Peering connections page, select your peering connection.
  7. Under Actions, choose Accept request.
  8. Review the settings and choose Accept request.
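
If you prefer to script the peering setup, the following is a minimal sketch using boto3; the VPC IDs are hypothetical and both VPCs are assumed to be in the same account:

import boto3

ec2 = boto3.client("ec2")

# Hypothetical IDs - replace with your SageMakerUnifiedStudioVPC and RDS VPC IDs
REQUESTER_VPC_ID = "vpc-0123456789abcdef0"
ACCEPTER_VPC_ID = "vpc-0fedcba9876543210"

# Request the peering connection from the domain VPC to the RDS VPC
peering = ec2.create_vpc_peering_connection(
    VpcId=REQUESTER_VPC_ID,
    PeerVpcId=ACCEPTER_VPC_ID,
)
peering_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# Accept the request (possible in one script because both VPCs share an account)
ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=peering_id)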

Now you have configured the VPC peering connection. The next step is to configure the network route from the SageMaker Unified Studio VPC to the HAQM RDS VPC.

  1. On the HAQM VPC console, choose Route tables in the navigation pane.
  2. Choose the route table that is used in the private subnets of SageMakerUnifiedStudioVPC.
  3. Choose Edit routes.
  4. Choose Add route.
  5. For Destination, choose the VPC CIDR of the VPC where the RDS DB instance is located.
  6. For Target, choose Peering Connection, and choose the peering connection you created earlier.
  7. Choose Save changes.

Now you have configured the route table from the SageMaker Unified Studio VPC to the HAQM RDS VPC. The next step is to configure the opposite route (a scripted sketch covering both routes follows these steps).

  1. On the HAQM VPC console, choose Route tables in the navigation pane.
  2. Choose the route table that is used in the private subnets of the RDS DB instance.
  3. Choose Edit routes.
  4. Choose Add route.
  5. For Destination, choose the VPC CIDR of SageMakerUnifiedStudioVPC.
  6. For Target, choose Peering Connection, and choose the peering connection you created earlier.
  7. Choose Save changes.
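
The following is a minimal boto3 sketch covering both routes; the route table IDs, VPC CIDRs, and peering connection ID are hypothetical:

import boto3

ec2 = boto3.client("ec2")

# Hypothetical values - replace with your route table IDs, VPC CIDRs, and peering ID
STUDIO_ROUTE_TABLE_ID = "rtb-0123456789abcdef0"  # private subnets of SageMakerUnifiedStudioVPC
RDS_ROUTE_TABLE_ID = "rtb-0fedcba9876543210"     # private subnets of the RDS VPC
STUDIO_VPC_CIDR = "10.0.0.0/16"
RDS_VPC_CIDR = "10.1.0.0/16"
PEERING_ID = "pcx-0123456789abcdef0"

# Route from the SageMaker Unified Studio VPC to the HAQM RDS VPC
ec2.create_route(
    RouteTableId=STUDIO_ROUTE_TABLE_ID,
    DestinationCidrBlock=RDS_VPC_CIDR,
    VpcPeeringConnectionId=PEERING_ID,
)

# The opposite route, from the HAQM RDS VPC back to the Studio VPC
ec2.create_route(
    RouteTableId=RDS_ROUTE_TABLE_ID,
    DestinationCidrBlock=STUDIO_VPC_CIDR,
    VpcPeeringConnectionId=PEERING_ID,
)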

Now you configure your RDS security group to accept traffic coming from SageMaker Unified Studio (a scripted sketch follows these steps).

  1. On the HAQM RDS console, navigate to your RDS DB instance, and choose VPC security groups.
  2. Select your security group, and choose Inbound rules.
  3. Choose Edit inbound rules.
  4. Choose Add rule.
  5. For Type, choose Custom TCP.
  6. For Port range, enter your RDS port number.
  7. For Source, enter the VPC CIDR of SageMakerUnifiedStudioVPC.
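
The equivalent inbound rule can be added with boto3 as follows; the security group ID, port, and CIDR are hypothetical:

import boto3

ec2 = boto3.client("ec2")

# Hypothetical values - replace with your RDS security group, port, and CIDR
RDS_SECURITY_GROUP_ID = "sg-0123456789abcdef0"
RDS_PORT = 5432                  # PostgreSQL default
STUDIO_VPC_CIDR = "10.0.0.0/16"  # CIDR of SageMakerUnifiedStudioVPC

# Allow inbound traffic from the Studio VPC to the database port
ec2.authorize_security_group_ingress(
    GroupId=RDS_SECURITY_GROUP_ID,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": RDS_PORT,
            "ToPort": RDS_PORT,
            "IpRanges": [{"CidrIp": STUDIO_VPC_CIDR}],
        }
    ],
)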

Now you have network reachability required to use the existing RDS DB instance. The next step is to create a connection pointing to that RDS DB instance in SageMaker Unified Studio.

  1. Sign in to SageMaker Unified Studio and open your project.
  2. In your project, in the navigation pane, choose Data.
  3. Choose the plus sign, and for Add data source, choose Add connection.
  4. Select PostgreSQL.
  5. For Data source name, enter postgresql_source.
  6. For Host, enter the host name of your RDS for PostgreSQL DB instance.
  7. For Port, enter the port number of your RDS for PostgreSQL DB instance (by default, it's 5432).
  8. For Database, enter your database name.
  9. For Authentication, select Username and password, and enter your user name and password.
  10. Choose Add data source.

This step can take several minutes to complete.
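
To verify network reachability from the project's JupyterLab environment, you can run a quick connectivity check like the following sketch; it assumes the psycopg2 library is available in the notebook environment, and the connection details are placeholders:

import psycopg2  # assumes psycopg2 is installed in the notebook environment

# Placeholders - use the same values you entered for the connection
conn = psycopg2.connect(
    host="<rds_endpoint>",
    port=5432,
    dbname="<database>",
    user="<username>",
    password="<password>",
)

with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()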

Use a visual ETL flow to ingest data to HAQM RDS

In a visual extract, transform, and load (ETL) flow, you can use PostgreSQL as both a source and a target. In this example, you create a PostgreSQL target and, for Name, choose postgresql_source to ingest data into HAQM RDS.

  1. Choose the plus sign, and under Data sources, choose HAQM S3.
  2. Choose HAQM S3 for the source node, and enter the following values:
    1. S3 URI: s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/venue.csv
    2. Format: CSV
    3. Sep: ,
    4. Multiline: Enabled
    5. Header: Disabled
    6. Leave the rest as default.
  3. Wait for the data preview to be available.
  4. Choose the plus sign to the right of HAQM S3. Under Transforms, choose Rename Columns.
  5. Choose the Rename Columns node, and choose Add new rename pair.
  6. For Current name and New name, enter the following pairs:
    1. _c0: venueid
    2. _c1: venuename
    3. _c2: venuecity
    4. _c3: venuestate
    5. _c4: venueseats
  7. Choose the plus sign to the right of Rename Columns.
  8. Under Targets, choose PostgreSQL, and enter the following values:
    1. Name: postgresql_source
    2. Schema: public
    3. Table: venue

  9. Choose Save to project. You can optionally change the name and add a description.
  10. Choose Run. Optionally, you can change the compute parameters.

Wait for the run to complete; the data is then successfully ingested into HAQM RDS.

Run an Athena query to explore the table on HAQM RDS

After you create a table on HAQM RDS, you can explore the table through the data explorer in SageMaker Unified Studio:

  1. In SageMaker Unified Studio, choose Data in the navigation pane.
  2. Under Lakehouse, choose postgresql_source, public, and venue.
  3. On the options menu (three dots), choose Query with Athena.

You get records from the RDS table venue.
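
To run the same query programmatically, the following is a hedged sketch using the Athena API through boto3; the catalog and schema names follow the connection created earlier, and the workgroup name is a placeholder that depends on your project configuration:

import time

import boto3

athena = boto3.client("athena")

# Placeholder - the workgroup depends on your project configuration
QUERY = 'SELECT * FROM "postgresql_source"."public"."venue" LIMIT 10'
WORKGROUP = "<project_workgroup>"

execution = athena.start_query_execution(QueryString=QUERY, WorkGroup=WORKGROUP)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the first page of results
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])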

Use existing DynamoDB tables

This section has the following prerequisites:

  • A DynamoDB table

To access existing DynamoDB tables, configure a resource-based policy that allows the appropriate actions for the project role:

  1. On the DynamoDB console, choose Tables in the navigation pane.
  2. Select your table.
  3. Choose the Permissions tab and choose Create table policy.

The following example policy allows connecting to DynamoDB tables as a federated source. Replace <aws_region> with your AWS Region, <aws_accountid> with the AWS account ID where DynamoDB is deployed, <dynamodb_table> with the DynamoDB table that you intend to query from SageMaker Unified Studio, and <datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy> with the project role in SageMaker Unified Studio:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "dynamodb:Query",
                "dynamodb:Scan",
                "dynamodb:DescribeTable",
                "dynamodb:PartiQLSelect",
                "dynamodb:BatchWriteItem"
            ],
            "Resource": "arn:aws:dynamodb:<aws_region>:<aws_accountid>:table/<dynamodb_table>",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        }
    ]
}
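
You can also attach the policy programmatically with the DynamoDB PutResourcePolicy API. The following is a minimal boto3 sketch; the table ARN placeholders match the policy above, and the policy file name is hypothetical:

import json

import boto3

dynamodb = boto3.client("dynamodb")

# Placeholders - replace with your Region, account ID, and table name
TABLE_ARN = "arn:aws:dynamodb:<aws_region>:<aws_accountid>:table/<dynamodb_table>"
POLICY_FILE = "table-policy.json"  # the policy document shown above

with open(POLICY_FILE) as f:
    policy = json.load(f)

# Attach the resource-based policy to the table
dynamodb.put_resource_policy(ResourceArn=TABLE_ARN, Policy=json.dumps(policy))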

After the policy is attached to the DynamoDB table, create an HAQM SageMaker Lakehouse connection within SageMaker Unified Studio:

  1. Choose Data in the navigation pane.
  2. In the data explorer, choose the plus sign to add a data source.
  3. Select Add connection and choose Next.
  4. Select HAQM DynamoDB and choose Next.
  5. For Name, enter a name, then choose Add data.

The following screenshot shows the detailed steps to create a federated DynamoDB connection in SageMaker Unified Studio. After the connection is established, you can query the data from the DynamoDB table using the Athena query editor.

You can also use existing DynamoDB tables as part of the ETL process. In the following screenshot, we demonstrate this using a visual ETL flow.

Use existing EMR clusters

This section has the following prerequisites:

  • An EMR on EC2 cluster

SageMaker Unified Studio enables you to create new compute or add existing compute resources to a project for submitting jobs. You can add existing HAQM EMR on EC2 clusters or add existing HAQM EMR Serverless applications to submit data analytics jobs. To add a new EMR Serverless application, an administrator must enable a blueprint for the project.

To add an existing EMR on EC2 cluster, complete the following steps:

  1. In SageMaker Unified Studio, navigate to the project for which you plan to add compute, then choose Compute in the navigation pane.
  2. Choose the Data processing tab.
  3. To add an existing EMR on EC2 cluster, choose Add compute.
  4. Choose Connect to existing compute resources and choose Next.
  5. To specify the compute resources to choose from, choose EMR on EC2 cluster.
  6. The Add Compute dialog box requires you to have the correct permissions to access the EMR on EC2 cluster. You can choose Copy project information to copy the details that your administrator needs to grant the data worker access, then send the information to your admin.
  7. After the account administrator has granted the data worker access, specify the HAQM Resource Names (ARNs) associated with the cluster. You must fill in the Access role ARN, EMR on EC2 cluster ARN, Instance profile role ARN, and Name.
  8. After you configure these settings, choose Add compute.

Your EMR on EC2 cluster will be added to your project.

After you have added a cluster to a project, you will be able to see the cluster on the Data processing tab of the Compute page. You can then view the cluster details by choosing the specific cluster.
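
With the cluster attached as project compute, you can run Spark code from a notebook in the project. The following is a minimal PySpark sketch; it assumes the notebook's compute is set to the attached EMR cluster and that the kernel provides a spark session, and it reads the same public sample file used in the visual ETL flow earlier:

# Assumes the notebook's compute is the attached EMR on EC2 cluster and
# the kernel provides a `spark` session
df = spark.read.csv(
    "s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/venue.csv",
    sep=",",
)
df.show(5)
print(f"Row count: {df.count()}")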

In addition to adding existing compute resources, you have the option to create new compute resources, which allows you to create both EMR on EC2 clusters and EMR Serverless applications.

Conclusion

SageMaker Unified Studio enables you to integrate with multiple data sources, providing data scientists and analysts with a powerful, unified environment for their AI and analytics workflows. As demonstrated throughout this two-part series, you can seamlessly connect to and use data from the Data Catalog, HAQM Redshift, HAQM S3, HAQM RDS, DynamoDB, and HAQM EMR—while maintaining proper security controls and permissions. This flexibility alleviates the need for complex data movement operations and allows teams to focus on extracting insights from their data rather than managing infrastructure. By following the approaches outlined in these posts, organizations can maximize their existing data investments while taking advantage of the advanced capabilities of SageMaker Unified Studio for their data science and analytics needs.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is also the author of the book Serverless ETL and Analytics with AWS Glue. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance. She can be reached via LinkedIn.

Sakti Mishra is a Principal Data and AI Solutions Architect at AWS, where he helps customers modernize their data architecture and define end-to-end data strategies, including data security, accessibility, governance, and more. He is also the author of Simplify Big Data Analytics with HAQM EMR and AWS Certified Data Engineer Study Guide. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family. He can be reached via LinkedIn.

Daiyan Alamgir is a Principal Frontend Engineer on the HAQM SageMaker Unified Studio team based in New York.

Vipin Mohan is a Principal Product Manager at AWS, leading the launch of generative AI capabilities in HAQM SageMaker Unified Studio. He is committed to shaping impactful products by working backward from customer insights, championing user-focused solutions, and delivering scalable results.

Chanu Damarla is a Principal Product Manager on the HAQM SageMaker Unified Studio team. He works with customers around the globe to translate business and technical requirements into products that delight customers and enable them to be more productive with their data, analytics, and AI.