AWS Database Blog
HAQM DynamoDB zero-ETL integration with HAQM SageMaker Lakehouse – Part 2
In Part 1 of this series, we walked through the prerequisites and steps to create a zero-ETL integration between HAQM DynamoDB and HAQM SageMaker Lakehouse. In this post, we continue setting up SageMaker Lakehouse using HAQM SageMaker Unified Studio, and show sample analytics that a financial services team can run to understand their customers’ behavior and spend.
Set up SageMaker Unified Studio
SageMaker Unified Studio is an integrated development environment (IDE) for data, analytics, and AI. You can discover your data and put it to work using familiar AWS tools to complete end-to-end development workflows in a single governed environment. These workflows include data analysis, data processing, model training, generative AI application building, and more. You can create or join projects to collaborate with your teams, share AI and analytics artifacts securely, and discover and use your data stored in HAQM Simple Storage Service (HAQM S3), HAQM Redshift, and other data sources through SageMaker Lakehouse. As AI and analytics use cases converge, you can transform how data teams work together with SageMaker Unified Studio.
Follow the steps in An integrated experience for all your data and AI with HAQM SageMaker Unified Studio to create a SageMaker Unified Studio domain and project. For this example, we name the project CustomerDB_Project. On the project overview page, copy the project role HAQM Resource Name (ARN) for future use.
Import AWS Glue Data Catalog assets to the SageMaker Unified Studio project
Follow the instructions and script provided in the GitHub repo to bring in existing resources to the SageMaker Unified Studio project CustomerDB_Project:
- Retrieve details about the AWS Identity and Access Management (IAM) user or role whose credentials will be used to run the script. In the following command, the assumed role Admin is used:
aws sts get-caller-identity
- Add the executor (IAM user or role, in our case the Admin role) as the AWS Lake Formation data lake administrator in the AWS Region where you will run the script:
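You can do this on the Lake Formation console, or with a CLI call along the following lines. This is a minimal sketch; the Region, account ID, and role name are placeholders, and put-data-lake-settings replaces the existing settings, so first check which administrators are already configured and include them in the call.

# Review the current settings so existing administrators aren't dropped (placeholder Region)
aws lakeformation get-data-lake-settings --region us-east-1

# Add the Admin role as a data lake administrator (placeholder account ID)
aws lakeformation put-data-lake-settings \
    --region us-east-1 \
    --data-lake-settings '{"DataLakeAdmins":[{"DataLakePrincipalIdentifier":"arn:aws:iam::111122223333:role/Admin"}]}'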
- Download the bring_your_own_gdc_assets.py script from GitHub, and run it with the command described in the repo instructions.
At this point, the AWS Glue database customerdb and the table CustomerAccounts should be visible in the project CustomerDB_Project under AwsDataCatalog.
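If you want to double-check outside the SageMaker Unified Studio UI, a quick verification with the AWS Glue CLI could look like the following sketch (run it in the same Region as the integration):

# List the tables registered in the customerdb database
aws glue get-tables --database-name customerdb --query 'TableList[].Name'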
Explore your data through SageMaker Lakehouse
In this section, we use the SageMaker Lakehouse data explorer to explore the imported table with HAQM Athena. Complete the following steps:
- On the project page, choose Data.
- Under Lakehouse, expand AwsDataCatalog.
- Expand your database customerdb.
- Choose the customeraccounts table, then choose Query with Athena.
- Choose Run all.
The following screenshot shows an example of the query result.
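The starter query that the data explorer generates for the table is typically a simple preview, similar to the following sketch (the exact quoting and row limit may differ):

SELECT * FROM "awsdatacatalog"."customerdb"."customeraccounts" LIMIT 10;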
You can also open a generative SQL assistant powered by HAQM Q to help with query authoring. For example, you can ask “calculate the total balance as a percentage of the credit limit outstanding per state and zipcode” in the HAQM Q assistant, and a query is automatically suggested.
You can choose Add to querybook to copy the suggested query to your querybook, and run it.
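The exact SQL that HAQM Q suggests depends on your table, but for the preceding prompt it would look roughly like the following sketch. The balance and credit_limit column names are assumptions about the CustomerAccounts schema; only balance_percentage and state are referenced later in this post.

-- Total balance as a percentage of the credit limit, per state and zipcode
SELECT state,
       zipcode,
       SUM(balance) * 100.0 / SUM(credit_limit) AS balance_percentage
FROM "awsdatacatalog"."customerdb"."customeraccounts"
GROUP BY state, zipcode
ORDER BY balance_percentage DESC;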
Next, using the query generated by HAQM Q, let’s create a quick visualization to analyze the data distribution.
- Choose the chart view icon.
- Under Structure, choose Traces.
- For Type, choose Pie.
- For Values, choose balance_percentage.
- For Labels, choose state.
The query result will display as a pie chart like the following example. You can customize the graph title, axis title, subplot styles, and more in the UI. The generated images can also be downloaded as PNG or JPEG files. After you’ve finished querying the data, you can choose to view the queries in your query history, and save them to share with other project members.
For more information about reviewing query history, see Review query history. For more information about other operations you can do with the query editor, such as using generative AI to create SQL queries, see SQL analytics.
Let’s ask HAQM Q to generate a query by providing the prompt “calculate the count of account status Past Due by state, zipcode.” The following screenshot shows the results.
Next, we enter the prompt “pivot table where rows represent states, and columns represent account statuses with the corresponding counts.” The following screenshot shows the results.
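Again, the generated SQL will vary, but these two prompts map naturally to a grouped count and a conditional-aggregation pivot, roughly like the following sketch. The account_status column name and its values (Active, Past Due, Closed) are assumptions for illustration.

-- Count of Past Due accounts by state and zipcode
SELECT state, zipcode, COUNT(*) AS past_due_count
FROM "awsdatacatalog"."customerdb"."customeraccounts"
WHERE account_status = 'Past Due'
GROUP BY state, zipcode;

-- Pivot: one row per state, one column per account status
SELECT state,
       COUNT_IF(account_status = 'Active')   AS active,
       COUNT_IF(account_status = 'Past Due') AS past_due,
       COUNT_IF(account_status = 'Closed')   AS closed
FROM "awsdatacatalog"."customerdb"."customeraccounts"
GROUP BY state;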
Schema evolution
The zero-ETL integration uses Apache Iceberg to transform the formats and structure of your DynamoDB data into appropriate formats in HAQM S3. Iceberg provides a robust solution for managing table schemas in data lakes, enabling schema evolution (adding or removing columns) without disrupting data or requiring costly rewrites. It uses metadata files to track schema changes, allowing you to query historical data even after schema modifications. For more information about the general limitations of zero-ETL integration, refer to Limitations.
To see how the zero-ETL integration handles schema evolution, let’s make some changes in the DynamoDB table CustomerAccounts. We will do the following:
- Insert new items that have the same attributes as an existing item, plus a few additional new attributes.
- Update an item by adding new attributes.
- Delete an item.
Open an AWS CloudShell terminal window and run the following commands:
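The exact items depend on the CustomerAccounts schema you loaded in Part 1; the following is a hedged sketch that assumes account_id is the partition key and uses placeholder attribute names and values:

# 1. Insert a new item that mirrors an existing item and adds new attributes
aws dynamodb put-item \
    --table-name CustomerAccounts \
    --item '{"account_id": {"S": "ACC-9001"}, "state": {"S": "WA"}, "zipcode": {"S": "98101"}, "balance": {"N": "2500"}, "credit_limit": {"N": "10000"}, "account_status": {"S": "Active"}, "rewards_tier": {"S": "Gold"}, "preferred_channel": {"S": "mobile"}}'

# 2. Update an existing item by adding a new attribute
aws dynamodb update-item \
    --table-name CustomerAccounts \
    --key '{"account_id": {"S": "ACC-0001"}}' \
    --update-expression "SET rewards_tier = :tier" \
    --expression-attribute-values '{":tier": {"S": "Silver"}}'

# 3. Delete an item
aws dynamodb delete-item \
    --table-name CustomerAccounts \
    --key '{"account_id": {"S": "ACC-0002"}}'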
The zero-ETL integration can take 15–30 minutes to propagate the changes to the AWS Glue Data Catalog database. Run the following query to validate the changed data:
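As a sketch, reusing the hypothetical account IDs and attribute names from the CloudShell commands above:

-- New attributes added in DynamoDB appear as new columns,
-- with NULL values for rows that never had them
SELECT account_id, account_status, rewards_tier, preferred_channel
FROM "awsdatacatalog"."customerdb"."customeraccounts"
WHERE account_id IN ('ACC-9001', 'ACC-0001', 'ACC-0002');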
Time travel and version travel queries for historical analysis
Each Iceberg table maintains a versioned manifest of the HAQM S3 objects that it contains. Previous versions of the manifest can be used for time travel and version travel queries. Time travel queries in Athena query HAQM S3 for historical data from a consistent snapshot as of a specified date and time. Version travel queries in Athena query HAQM S3 for historical data as of a specified snapshot ID. For more information about time travel and version travel queries, see Perform time travel and version travel queries.
Retrieve the snapshot ID using the following query:
select * from "awsdatacatalog"."customerdb"."customeraccounts$snapshots";
Use the following version travel query for historical analysis:
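Here is a minimal sketch; replace the placeholder with a snapshot_id value returned by the snapshots query above:

-- Replace <snapshot_id> with a value from the $snapshots query
SELECT * FROM "awsdatacatalog"."customerdb"."customeraccounts"
FOR VERSION AS OF <snapshot_id>;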
Use the following time travel query for historical analysis:
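Similarly, a sketch for time travel, where the timestamp is a placeholder for any point after your initial load:

-- Query the table as it existed at a specific point in time (UTC)
SELECT * FROM "awsdatacatalog"."customerdb"."customeraccounts"
FOR TIMESTAMP AS OF TIMESTAMP '2025-01-15 10:00:00 UTC';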
Monitor the zero-ETL integration
There are several options to obtain metrics on the performance and status of the DynamoDB zero-ETL integration with SageMaker Lakehouse.
On the AWS Glue console, choose Zero-ETL integrations in the navigation pane. You can choose your zero-ETL integration, and drill down to view HAQM CloudWatch logs and HAQM CloudWatch metrics for the integration.
Viewing HAQM CloudWatch logs for an integration
Zero-ETL integrations generate HAQM CloudWatch logs for visibility into your data movement. Log events are emitted to a default log group created in your account. Log events may include successful ingestion events, failures caused by problematic data records at the source, and data write errors due to schema changes or insufficient permissions.
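If you prefer the CLI to the console, you can list recent events once you know the log group name. The group name below is a placeholder; copy the actual name from the integration’s details page.

# Show recent log events for the integration (placeholder log group name)
aws logs filter-log-events \
    --log-group-name "<your-zero-etl-integration-log-group>" \
    --limit 20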
Viewing HAQM CloudWatch metrics for an integration
Zero-ETL integration provides real-time operational insights through CloudWatch metrics, enabling proactive monitoring of data integration processes without directly querying the target Iceberg tables. When enabled by adding the appropriate permissions to the source and target processing roles, CloudWatch metrics are automatically emitted to the AWS/Glue/ZeroETL namespace after each table ingestion operation completes. You can set up alarms on these CloudWatch metrics to get notified when a particular ingestion job fails.
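As an illustration, an alarm on a failure-related metric in the AWS/Glue/ZeroETL namespace could be created like the following sketch. The metric name, dimension name, and SNS topic are placeholders; check the metrics actually emitted to the namespace for the exact names.

# Alarm whenever the (placeholder) failure metric is non-zero for the integration
aws cloudwatch put-metric-alarm \
    --alarm-name zero-etl-ingestion-failures \
    --namespace AWS/Glue/ZeroETL \
    --metric-name <FailureMetricName> \
    --dimensions Name=IntegrationName,Value=<your-integration-name> \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:111122223333:zero-etl-alerts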
To learn more about zero-ETL monitoring, refer to Zero-ETL monitoring.
Pricing
AWS doesn’t charge an additional fee for the zero-ETL integration. You pay for existing DynamoDB and AWS Glue resources used to create and process the change data created as part of a zero-ETL integration. These include DynamoDB point-in-time recovery (PITR), DynamoDB exports for the initial and ongoing data changes to your DynamoDB data, additional AWS Glue storage for storing replicated data, and SageMaker compute on the target. For pricing on DynamoDB PITR and DynamoDB exports, see HAQM DynamoDB pricing.
Clean up
Complete the following steps to clean up the resources created in Part 1 and Part 2 of this series. When you delete a zero-ETL integration, your data isn’t deleted from the DynamoDB table or AWS Glue, but data changes that happen after that point in time aren’t sent to SageMaker Lakehouse.
- Delete the zero-ETL integration:
- On the AWS Glue console, choose Zero-ETL integrations in the navigation pane.
- Select the zero-ETL integration that you want to delete, and on the Actions menu, choose Delete.
- To confirm the deletion, enter confirm, and choose Delete.
- Delete the SageMaker Unified Studio project, domain, and VPC:
- On the SageMaker Unified Studio console, select your project, and on the Action menu, choose Delete project.
- On the SageMaker console, select the domain you want to delete, and choose Delete.
- On the AWS CloudFormation console, select the SageMakerUnifiedStudio-VPC stack, and choose Delete.
- Delete the AWS Glue database:
- On the AWS Glue console, choose Data Catalog in the navigation pane.
- Select your database and choose Delete to delete the database and its tables.
- On the HAQM S3 console, delete the S3 folder and bucket you created.
- On the IAM console, delete the IAM policies and role you created.
Conclusion
In this post, we explained how you can set up a zero-ETL integration from DynamoDB to SageMaker Lakehouse to derive holistic insights across many applications, break data silos in your organization, and gain significant cost savings and operational efficiencies.
To learn more about zero-ETL integration, refer to DynamoDB zero-ETL integration with HAQM SageMaker Lakehouse.