AWS Database Blog

HAQM DynamoDB zero-ETL integration with HAQM SageMaker Lakehouse – Part 1

HAQM DynamoDB zero-ETL integration with HAQM SageMaker Lakehouse allows you to run analytics workloads on your DynamoDB data without having to set up and manage extract, transform, and load (ETL) pipelines. You can now easily consolidate your data from different tables and databases into SageMaker Lakehouse, giving you the ability to run holistic analytics across all your data. The primary benefits of using the zero-ETL integration with SageMaker Lakehouse are:

  • Isolating analytics workloads – With zero-ETL integrations, you can run analytics workloads on your DynamoDB data without consuming any DynamoDB table capacity. You can isolate analytics workloads from operational workloads, making sure there is no impact on the critical applications running on DynamoDB.
  • Enhanced query flexibility – Analysts can perform complex, multi-step aggregations and queries using the full capabilities of analytics tools, such as SageMaker Lakehouse.
  • Partitioning and schema evolution – With its flexible schema and partitioning features, SageMaker Lakehouse stores and queries data efficiently and supports schema changes without downtime or data migration.
  • Time travel for historical analytics – Because your data is stored in Apache Iceberg-compatible table formats, the SageMaker Lakehouse time travel feature simplifies querying historical snapshots, making it possible to analyze past data states without complex archiving or rollback systems.

In this two-part series, we first walk through the prerequisites and initial setup for the zero-ETL integration. In Part 2, we cover setting up HAQM SageMaker Unified Studio, followed by running data analysis to showcase its capabilities. We illustrate our solution walkthrough with an example of a credit card company that wants to analyze its customer behavior and spending trends.

Solution overview

This zero-ETL integration creates and maintains Iceberg-compatible tables in an AWS Glue Data Catalog that sits on top of your HAQM Simple Storage Service (HAQM S3) bucket. You can either query the Data Catalog directly using tools like HAQM Athena, or publish it to SageMaker Lakehouse to combine with other datasets.

The zero-ETL integration uses DynamoDB exports to continuously replicate data changes from DynamoDB to your S3 bucket every 15–30 minutes. The initial load can take longer, depending on the table size. This replication happens with no performance or availability impact to your DynamoDB tables, and without consuming DynamoDB read capacity units (RCUs). Your applications can continue to read from and write to your DynamoDB tables, and data from those tables will be available for analytics through SageMaker Lakehouse or other analytics tools.

Note: Using this feature requires enabling point-in-time recovery (PITR) and uses DynamoDB exports, both of which incur additional costs. See the Pricing section for details.
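For example, after the integration you create later in this post is active, you can query the replicated table directly with the Athena CLI. This sketch assumes the Glue database (customerdb) created later in this walkthrough, a table named customeraccounts (the actual name is generated by the integration and may differ), and an existing S3 bucket of your own for Athena query results:

aws athena start-query-execution \
    --query-string "SELECT * FROM customeraccounts LIMIT 10" \
    --query-execution-context Database=customerdb \
    --result-configuration OutputLocation=s3://<YOUR-ATHENA-RESULTS-BUCKET>/ \
    --region us-east-1

# Fetch the results using the QueryExecutionId returned by the previous command
aws athena get-query-results \
    --query-execution-id <QUERY-EXECUTION-ID> \
    --region us-east-1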

The following diagram illustrates the solution architecture.

architecture

For our use case, a leading credit card company relies on DynamoDB as its operational data store for managing customer profiles. To drive smarter, faster decision-making, the company is now looking to gain deeper insights into customer behavior and spending trends. Using the zero-ETL integration between DynamoDB and SageMaker Lakehouse, they plan to build an analytics pipeline that makes operational data available for advanced query and analysis. This integration is fully managed and doesn't require the customer to set up and maintain complex data pipelines. While there are several options to enable analytics, the zero-ETL integration is the quickest to set up and requires minimal oversight. This streamlined data flow empowers teams with timely, actionable insights to stay ahead of evolving business dynamics. The solution outlines a step-by-step approach to:

  • Integrate DynamoDB with SageMaker Lakehouse, enabling easy data flow.
  • Use SageMaker Unified Studio with its generative SQL assistant powered by HAQM Q, making it effortless to explore, analyze, and query operational and analytical data.

Prerequisites

Setting up an integration between the source (DynamoDB table) and target (SageMaker Lakehouse) requires some prerequisites, such as configuring the AWS Identity and Access Management (IAM) roles that AWS Glue uses to read data from the source and write to the target.

Complete the following steps to set up the prerequisite resources:

  1. Open an AWS CloudShell terminal window.
  2. Clone the repo:

git clone http://github.com/aws-samples/sample-zetl-ddb-to-lakehouse.git

  3. Replace <ACCOUNT-ID> in the files with your own AWS account ID using the following commands:
cd sample-zetl-ddb-to-lakehouse
find . -type f -exec sed -i 's/<ACCOUNT-ID>/YOUR-OWN-AWS-ACCOUNT-ID/g' {} \;
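Optionally, confirm that no placeholders remain in the cloned files:

grep -rl '<ACCOUNT-ID>' . || echo "All placeholders replaced"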

The following commands create the AWS resources in the us-east-1 AWS Region. If you want to change your Region, DynamoDB table name, S3 URI location, AWS Glue database, or AWS Glue table name, update the files accordingly.

  4. Create the DynamoDB table CustomerAccounts and verify the table creation status:
aws dynamodb create-table \
--cli-input-json file://CustomerAccountsTable.json \
--region us-east-1

aws dynamodb describe-table --table-name CustomerAccounts --region us-east-1
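Optionally, instead of re-running describe-table, you can wait until the table reaches ACTIVE status:

aws dynamodb wait table-exists --table-name CustomerAccounts --region us-east-1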
  5. Enable point-in-time recovery (PITR) for the DynamoDB table CustomerAccounts:
aws dynamodb update-continuous-backups \
    --table-name CustomerAccounts \
    --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true \
    --region us-east-1
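You can confirm that PITR is enabled; the response should report PointInTimeRecoveryStatus as ENABLED:

aws dynamodb describe-continuous-backups \
    --table-name CustomerAccounts \
    --region us-east-1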
  6. Load sample data into the DynamoDB table CustomerAccounts:
aws dynamodb batch-write-item \
--request-items file://sample-data-load-1.json \
--region us-east-1
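To confirm the load, you can count the items in the table (fine for this small sample dataset, although a full scan consumes read capacity on larger tables):

aws dynamodb scan \
    --table-name CustomerAccounts \
    --select COUNT \
    --region us-east-1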

To access data from your source DynamoDB table, AWS Glue requires permission to describe the table and export data from it. DynamoDB supports resource-based access control policies, which you can use to grant this access directly on the table. Add and verify the following resource policy for the DynamoDB table CustomerAccounts, enabling the zero-ETL integration to access the table data.

  7. Before running the following commands, replace <ACCOUNT-ID> with your own AWS account ID:
aws dynamodb put-resource-policy \
    --resource-arn arn:aws:dynamodb:us-east-1:<ACCOUNT-ID>:table/CustomerAccounts \
    --policy file://CustomerAccounts_ResourcePolicy.json \
    --region us-east-1
aws dynamodb get-resource-policy \
--resource-arn arn:aws:dynamodb:us-east-1:<ACCOUNT-ID>:table/CustomerAccounts \
--region us-east-1
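The repository provides CustomerAccounts_ResourcePolicy.json. For reference, a minimal resource policy for this purpose grants the AWS Glue service principal permission to describe and export the table; the following sketch illustrates the idea (the repo file is authoritative and may scope access further, for example with source account conditions):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "Service": "glue.amazonaws.com" },
            "Action": [
                "dynamodb:DescribeTable",
                "dynamodb:ExportTableToPointInTime",
                "dynamodb:DescribeExport"
            ],
            "Resource": [
                "arn:aws:dynamodb:us-east-1:<ACCOUNT-ID>:table/CustomerAccounts",
                "arn:aws:dynamodb:us-east-1:<ACCOUNT-ID>:table/CustomerAccounts/export/*"
            ]
        }
    ]
}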
  8. Create an S3 bucket, a folder, and an AWS Glue database by providing the S3 URI location:

Note: The S3 bucket name must be globally unique. Follow the documentation for naming rules, best practices, and an example of creating a globally unique S3 bucket.

aws s3 mb s3://glue-zetl-target-customerdb-us-east-1 --region us-east-1

aws s3api put-object \
--bucket glue-zetl-target-customerdb-us-east-1 \
--key customerdb/

aws glue create-database --database-input '{
    "Name": "customerdb",
    "Description": "Glue database for storing metadata with S3 location",
    "LocationUri": "s3://glue-zetl-target-customerdb-us-east-1/customerdb/",
    "Parameters": {
        "CreatedBy": "AWS CLI"
    }
}' --region us-east-1
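You can verify that the database was created and points at the expected S3 location:

aws glue get-database --name customerdb --region us-east-1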
  9. For integrations that use an AWS Glue database, add the permissions to the catalog resource-based access policy to allow integrations between the source and target. You can access the catalog settings under Data Catalog on the AWS Glue console, or run the following command:

aws glue put-resource-policy --policy-in-json file://glue_catalog_rbac_policy.json
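You can confirm that the catalog policy was applied:

aws glue get-resource-policy --region us-east-1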

  10. Create an IAM policy for source resources. This enables the zero-ETL integration to access your source DynamoDB table:
aws iam create-policy \
--policy-name zETLSPolicy \
--policy-document file://source_policy.json
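For reference, a source policy for a DynamoDB source mirrors the actions in the resource policy above, but as an identity policy without a Principal element. The repo's source_policy.json is authoritative; a minimal sketch looks like:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:DescribeTable",
                "dynamodb:ExportTableToPointInTime",
                "dynamodb:DescribeExport"
            ],
            "Resource": [
                "arn:aws:dynamodb:us-east-1:<ACCOUNT-ID>:table/CustomerAccounts",
                "arn:aws:dynamodb:us-east-1:<ACCOUNT-ID>:table/CustomerAccounts/export/*"
            ]
        }
    ]
}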
  11. If your S3 bucket name is different from glue-zetl-target-customerdb-us-east-1, update the bucket name in the target_policy.json file before creating an IAM policy for target resources:
aws iam create-policy \
    --policy-name zETLTPolicy \
    --policy-document file://target_policy.json
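For reference, a target policy for this integration typically needs write access to the S3 location and permission to create and update tables in the Data Catalog. The following is only a representative sketch; the repo's target_policy.json is authoritative:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::glue-zetl-target-customerdb-us-east-1",
                "arn:aws:s3:::glue-zetl-target-customerdb-us-east-1/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": ["glue:GetDatabase", "glue:GetTable", "glue:CreateTable", "glue:UpdateTable"],
            "Resource": [
                "arn:aws:glue:us-east-1:<ACCOUNT-ID>:catalog",
                "arn:aws:glue:us-east-1:<ACCOUNT-ID>:database/customerdb",
                "arn:aws:glue:us-east-1:<ACCOUNT-ID>:table/customerdb/*"
            ]
        }
    ]
}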
  12. Create an IAM policy to enable HAQM CloudWatch logging:
aws iam create-policy \
    --policy-name zETLCWPolicy \
    --policy-document file://cloud_watch_policy.json
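For reference, the CloudWatch policy only needs to let the integration create log groups and streams and write log events; a minimal sketch (the repo's cloud_watch_policy.json may scope the resource differently) is:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:us-east-1:<ACCOUNT-ID>:log-group:*"
        }
    ]
}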
  13. Create an IAM role (zETLTRole) with the following trust policy, and attach the IAM policies you created to the role. Replace <ACCOUNT-ID> in the policy ARNs with your own AWS account ID:
aws iam create-role --role-name zETLTRole --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "glue.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }'

aws iam attach-role-policy \
--role-name zETLTRole \
--policy-arn arn:aws:iam::<ACCOUNT-ID>:policy/zETLSPolicy

aws iam attach-role-policy \
--role-name zETLTRole \
--policy-arn arn:aws:iam::<ACCOUNT-ID>:policy/zETLTPolicy

aws iam attach-role-policy \
--role-name zETLTRole \
--policy-arn arn:aws:iam::<ACCOUNT-ID>:policy/zETLCWPolicy
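You can verify that all three policies are attached to the role:

aws iam list-attached-role-policies --role-name zETLTRole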

Create the zero-ETL integration

With the prerequisites in place, complete the following steps to create the zero-ETL integration:

  1. On the DynamoDB console, choose Integrations in the navigation pane.
  2. Choose Create integration, then choose HAQM SageMaker Lakehouse. You will be redirected to the AWS Glue console.

zETL Integration creation step-1

  3. For your data source, select HAQM DynamoDB, then choose Next.

zETL Integration creation step-2

Next, you need to configure the source and target details.

  4. In the Source details section, for DynamoDB table, choose the table CustomerAccounts.
  5. In the Target details section, specify the Data Catalog name, target database name (customerdb), and the target IAM role you created previously (zETLTRole).

Now you have options to configure the output.

  6. For Data partitioning, you can either use DynamoDB table keys for partitioning, or specify custom partition keys.
  7. Choose Next.

zETL Integration creation step-4

  8. Under Configure integration, you can configure your data encryption. You can use AWS Key Management Service (AWS KMS), or a custom encryption key.
  9. Enter a name for the integration, and choose Next.

zETL Integration creation step-5

  10. Review the configurations, and choose Create and launch integration.

zETL Integration creation step-6

After the initial data ingestion is complete, the zero-ETL integration will be ready for use. The completion time varies depending on the size of your source DynamoDB table.

zETL Integration creation step-7
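Depending on your AWS CLI version, you can also monitor the integration from the command line; the Glue zero-ETL integration APIs expose a describe operation (run aws glue help to confirm the commands available in your version):

aws glue describe-integrations --region us-east-1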

On the AWS Glue console, choose Tables under Data Catalog in the navigation pane, and open your table to see more details, including the schema. The zero-ETL integration uses Iceberg to map the data formats and structure of your DynamoDB data into appropriate formats in HAQM S3.

Lastly, you can confirm that your data is available in your S3 bucket.

zETL Integration creation step-9
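You can also verify from the AWS CLI; note that the table name shown by get-tables is generated by the integration:

aws glue get-tables --database-name customerdb --region us-east-1

aws s3 ls s3://glue-zetl-target-customerdb-us-east-1/customerdb/ --recursive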

Pricing

AWS doesn’t charge an additional fee for the zero-ETL integration. You pay for existing DynamoDB and AWS Glue resources used to create and process the change data created as part of a zero-ETL integration. These include DynamoDB point-in-time recovery (PITR), DynamoDB exports for the initial and ongoing data changes to your DynamoDB data, additional AWS Glue storage for storing replicated data, and SageMaker compute on the target. For pricing on DynamoDB PITR and DynamoDB exports, see HAQM DynamoDB pricing.

Clean up

Note: If you plan to follow Part 2 of this series, you can skip this section and use the cleanup instructions in Part 2 instead.

Complete the following steps to clean up your resources. When you delete a zero-ETL integration, your data isn't deleted from the DynamoDB table or AWS Glue, but data changes that happen after that point in time aren't sent to SageMaker Lakehouse.

  1. Delete the zero-ETL integration:
    1. On the AWS Glue console, choose Zero-ETL integrations in the navigation pane.
    2. Select the zero-ETL integration that you want to delete, and on the Actions menu, choose Delete.
    3. To confirm the deletion, enter confirm and choose Delete.
  2. Delete the AWS Glue database:
    1. On the AWS Glue console, choose Data Catalog in the navigation pane.
    2. Select your database and choose Delete to delete the database and its tables.
  3. On the HAQM S3 console, delete the S3 folder and bucket you created.
  4. On the IAM console, delete the IAM policies and role you created.
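If you prefer the AWS CLI for cleanup, the following sketch removes the same resources, assuming the resource names used in this post. Delete the zero-ETL integration first, as described in step 1:

# Delete the Glue database (and its tables)
aws glue delete-database --name customerdb --region us-east-1

# Empty and remove the S3 bucket
aws s3 rm s3://glue-zetl-target-customerdb-us-east-1 --recursive
aws s3 rb s3://glue-zetl-target-customerdb-us-east-1

# Repeat the detach and delete commands for zETLTPolicy and zETLCWPolicy as well
aws iam detach-role-policy \
    --role-name zETLTRole \
    --policy-arn arn:aws:iam::<ACCOUNT-ID>:policy/zETLSPolicy
aws iam delete-policy --policy-arn arn:aws:iam::<ACCOUNT-ID>:policy/zETLSPolicy

# Delete the role after all policies are detached
aws iam delete-role --role-name zETLTRole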

Conclusion

In this post, we walked through the prerequisites and steps required to set up a zero-ETL integration from DynamoDB to SageMaker Lakehouse. In Part 2, we show how to set up SageMaker Unified Studio, and demonstrate how you can run analytics using the capabilities of SageMaker Lakehouse.

To learn more about zero-ETL integration, refer to DynamoDB zero-ETL integration with HAQM SageMaker Lakehouse.


About the Authors

Mukesh Agrawal

Mukesh is a New York-based Senior Database Specialist Solutions Architect at AWS, helping customers design scalable, optimized, and innovative database solutions to maximize the value of AWS services. He has been working with database technologies since 1997.

Shahzeb Farrukh

Shahzeb is a Seattle-based Senior Product Manager at AWS DynamoDB. He works on DynamoDB’s data protection features like backups and restores, and data movement capabilities that help customers integrate their data with other services. He has been working with databases and analytics since 2010.