AWS Big Data Blog
Unified scheduling for visual ETL flows and query books in Amazon SageMaker Unified Studio
Data engineers and analysts often need to automate their data processing workflows and queries to maintain up-to-date data pipelines and reports. Amazon SageMaker Unified Studio provides a unified environment for data, analytics, machine learning (ML), and AI workloads, including powerful tools for visual extract, transform, and load (ETL) flows and query books. Until today, scheduling these workflows has required additional setup and infrastructure.
Today, we’re excited to introduce a new unified scheduling feature that simplifies this process. SageMaker Unified Studio allows you to create ETL flows using a visual interface and write SQL analytics queries using query books. With unified scheduling, you can schedule your visual ETL flows and query books directly from SageMaker Unified Studio within the same interface, eliminating the need to visit other consoles or maintain complex configurations. Built on Amazon EventBridge Scheduler, this feature provides a seamless and easy-to-use scheduling experience.
In this post, we walk through how to schedule your visual ETL flows and query books with just a few clicks, explore the underlying architecture, and demonstrate how this feature can streamline your data workflow automation.
Feature overview
SageMaker Unified Studio unified scheduling is built on top of EventBridge Scheduler and Amazon SageMaker Training. When you configure a new schedule from SageMaker Unified Studio, a new EventBridge schedule is automatically created in your AWS account. The EventBridge schedule is configured with the SageMaker CreateTrainingJob API. The SageMaker Training job runs visual ETL flows or query books.
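To make the relationship between the schedule and the training job more concrete, the following Python sketch builds a request for the EventBridge Scheduler CreateSchedule API that targets the SageMaker CreateTrainingJob API. It is an illustration only, not the request that SageMaker Unified Studio actually sends: the function name, the role ARN, the default schedule group, and the training job input are all assumptions made for this example.

```python
import json

# Universal target ARN format that EventBridge Scheduler uses for AWS SDK calls.
CREATE_TRAINING_JOB_TARGET = "arn:aws:scheduler:::aws-sdk:sagemaker:createTrainingJob"

# Units listed as valid inputs for EventBridge Scheduler rate expressions.
VALID_UNITS = ("minutes", "hours", "days")


def build_schedule_request(name, value, unit, timezone, role_arn,
                           training_job_input, group_name="default"):
    """Build keyword arguments for the EventBridge Scheduler CreateSchedule API.

    This helper is hypothetical; it only mirrors the fields entered in the
    scheduling dialog (name, value, unit, time zone).
    """
    if unit not in VALID_UNITS:
        raise ValueError(f"unit must be one of {VALID_UNITS}")
    if value < 1:
        raise ValueError("value must be a positive integer")
    return {
        "Name": name,
        "GroupName": group_name,  # assumption: the real schedule group may differ
        "ScheduleExpression": f"rate({value} {unit})",
        "ScheduleExpressionTimezone": timezone,
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            "Arn": CREATE_TRAINING_JOB_TARGET,
            "RoleArn": role_arn,
            # The target input is passed to CreateTrainingJob as a JSON string.
            "Input": json.dumps(training_job_input),
        },
    }


request = build_schedule_request(
    name="everyday",
    value=1,
    unit="days",
    timezone="UTC",
    role_arn="arn:aws:iam::111122223333:role/example-scheduler-role",  # hypothetical
    training_job_input={"TrainingJobName": "everyday-example"},  # placeholder payload
)
print(request["ScheduleExpression"])
```

With boto3 installed and credentials configured, a request like this could be submitted with `boto3.client("scheduler").create_schedule(**request)`. In practice you don't need to do this yourself, because SageMaker Unified Studio creates the schedule for you.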
The following diagram illustrates how it works.
Prerequisites
To follow the instructions in this post, you must have the following prerequisites:
- An AWS account
- A SageMaker Unified Studio domain
- A SageMaker Unified Studio project with an All capabilities profile. This profile includes the Tooling blueprint, in which scheduling is enabled by default. If scheduling is disabled, you may need to update your project’s profile.
- A SageMaker Unified Studio project role without project boundaries or with an explicit allow for GetScheduleGroup.
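As a rough illustration of the last prerequisite, the following Python sketch generates an IAM policy document with an explicit allow for the EventBridge Scheduler GetScheduleGroup action. The function name, the statement ID, and the wildcard resource are assumptions made for this example; scope the resource to your own schedule group and attach the statement in whatever way fits your project role setup.

```python
import json


def schedule_group_allow_policy(resource="*"):
    """Return an IAM policy document that allows scheduler:GetScheduleGroup.

    The wildcard resource is an assumption for illustration; a schedule
    group ARN is a narrower choice.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowGetScheduleGroup",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": "scheduler:GetScheduleGroup",
                "Resource": resource,
            }
        ],
    }


print(json.dumps(schedule_group_allow_policy(), indent=2))
```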
Schedule a visual ETL flow
Complete the following steps to configure a schedule on a visual ETL flow:
- On the SageMaker Unified Studio console, on the top menu, choose Build.
- Under DATA ANALYSIS & INTEGRATION, choose Visual ETL flows.
- For Select or create project to continue, select your project, and choose Continue.
- Choose your visual ETL flow. If you don’t have any visual ETL flows, refer to Author visual ETL flows on Amazon SageMaker Unified Studio to create a new visual ETL flow.
- Choose the Schedule icon.
- For Schedule name, enter a unique name (for example, everyday).
- For Schedule Type, select Recurring.
- For Value, enter 1.
- For Unit, choose days.
- For Timezone, choose your time zone.
- Choose Create schedule.
You have successfully configured the schedule. Because Start date and time is not given, the visual ETL flow is triggered immediately and then it is triggered once a day after that.
Edit the schedule
You can view and edit the configured schedules with the following steps:
- On the SageMaker Unified Studio console, navigate to Visual ETL flows for your project.
- Choose the Schedules tab.
- Choose Edit schedule under Actions.
- Edit with your preferences, then choose Save.
Pause or resume the schedule
If you want to pause the schedule, complete the following steps:
- Choose Pause schedule under Actions.
On the same Schedule tab, Status of the schedule will be updated to Paused.
- To resume the schedule, choose Activate schedule.
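For readers who want to see how pausing and resuming could look at the EventBridge Scheduler level, here is a minimal Python sketch. It assumes that a paused schedule corresponds to the `DISABLED` state and an active one to `ENABLED`, which this post does not confirm. Because the UpdateSchedule API replaces the whole schedule definition, the hypothetical helper `with_state` copies the updatable fields from a GetSchedule response and changes only `State`.

```python
# Fields that UpdateSchedule accepts; a GetSchedule response also contains
# read-only fields (for example Arn and CreationDate) that must be dropped.
UPDATABLE_FIELDS = (
    "Name", "GroupName", "Description", "ScheduleExpression",
    "ScheduleExpressionTimezone", "StartDate", "EndDate",
    "FlexibleTimeWindow", "Target", "KmsKeyArn", "ActionAfterCompletion",
)


def with_state(schedule, state):
    """Build UpdateSchedule keyword arguments from a GetSchedule response.

    `schedule` is the dictionary returned by GetSchedule; `state` is
    "ENABLED" or "DISABLED".
    """
    if state not in ("ENABLED", "DISABLED"):
        raise ValueError("state must be ENABLED or DISABLED")
    request = {key: schedule[key] for key in UPDATABLE_FIELDS if key in schedule}
    request["State"] = state
    return request


# Example with a hand-written response; real values come from GetSchedule.
current = {
    "Arn": "arn:aws:scheduler:us-east-1:111122223333:schedule/default/everyday",
    "Name": "everyday",
    "GroupName": "default",
    "ScheduleExpression": "rate(1 days)",
    "FlexibleTimeWindow": {"Mode": "OFF"},
    "Target": {"Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:createTrainingJob",
               "RoleArn": "arn:aws:iam::111122223333:role/example-scheduler-role"},
    "State": "ENABLED",
}
paused = with_state(current, "DISABLED")
print(paused["State"])
```

With boto3, the result could be passed to `boto3.client("scheduler").update_schedule(**paused)`. The Pause schedule and Activate schedule actions in SageMaker Unified Studio remain the simpler route.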
Delete the schedule
To delete the schedule, complete the following steps:
- Choose Delete schedule under Actions.
- Choose Delete schedule in the dialog.
On the same Schedule tab, you can verify that the deleted schedule disappears.
Schedule a query book
Complete the following steps to configure a schedule on a query book:
- On the SageMaker Unified Studio console, on the top menu, choose Build.
- Under DATA ANALYSIS & INTEGRATION, choose Query Editor.
- On the data explorer, under Lakehouse, choose AwsDataCatalog.
- Navigate to the table venue_event_agg. This table is created in the previous section.
- On the options menu (three dots), choose Query with Athena.
- On the Actions menu, choose Save to project.
- Choose Save changes.
- On the Actions menu, choose Create schedule.
- For Schedule Type, choose Recurring.
- For Value, enter 1.
- For Unit, choose days.
- For Timezone, choose your time zone.
- Choose Create schedule.
You have successfully configured the schedule. Because Start date and time was not set, the query book is triggered immediately and then it is triggered once a day after that. You can optionally configure start and end times if you want to limit your schedule to run in a specific date range.
To view the configured schedules, in the navigation pane, choose Scheduled queries.
You can view the list of scheduled queries and edit, pause, resume, or delete them, as shown in the previous section.
Clean up
To avoid incurring future charges, clean up the resources you created during this walkthrough:
- On the Schedule tab of Visual ETL flows, select the everyday schedule, and choose Delete schedule under Actions. The related EventBridge schedule is automatically deleted as well.
- On the SageMaker AI console, choose Training jobs under Training, and delete all the SageMaker training jobs that start with everyday-.
- (Optional) To delete the visual ETL flow, on the Flows tab of Visual ETL flows, select your visual ETL flow, and choose Delete flow under Actions.
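If the schedule has run many times, it can help to list the matching training jobs before cleaning them up on the console. The following Python sketch shows one way to do that with the SageMaker ListTrainingJobs API. The function name is hypothetical, and the client is expected to be a boto3 SageMaker client (`boto3.client("sagemaker")`), which requires boto3 and configured credentials.

```python
def training_jobs_with_prefix(sagemaker_client, prefix):
    """Return the names of training jobs whose names start with `prefix`.

    ListTrainingJobs filters with NameContains, which matches anywhere in
    the name, so the results are narrowed to a true prefix match here.
    """
    paginator = sagemaker_client.get_paginator("list_training_jobs")
    names = []
    for page in paginator.paginate(NameContains=prefix):
        for summary in page["TrainingJobSummaries"]:
            name = summary["TrainingJobName"]
            if name.startswith(prefix):
                names.append(name)
    return names
```

For example, `training_jobs_with_prefix(boto3.client("sagemaker"), "everyday-")` would return the jobs created by the everyday schedule used in this walkthrough.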
Conclusion
The new unified scheduling experience in SageMaker Unified Studio simplifies workflow automation. With unified scheduling, you can seamlessly orchestrate your visual ETL flows and query books in one centralized location.
Whether you’re running daily data transformations, weekly analytical queries, or monthly reporting workflows, the unified scheduling experience provides a straightforward path to automation. This capability enables data teams to focus more on deriving insights from their data and less on managing infrastructure and scheduling configurations.
We encourage you to try out this new experience and share your feedback with us. For more information about SageMaker Unified Studio and its capabilities, visit our documentation or explore our other blog posts about visual ETL flows and query books.
About the Authors
Noritaka Sekiyama is a Principal Big Data Architect for AWS Analytics services with a strong focus on data engineering. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.
Daniel Obi is a Frontend Engineer on the Amazon SageMaker Unified Studio team. He is dedicated to building intuitive and effective solutions that enhance user experience and technical functionality. Outside of his professional work, he enjoys watching and playing basketball.
Vasudevan Venkataramanan is a Senior Software Engineer on the Amazon SageMaker Unified Studio team. He is responsible for technical direction of scheduling and orchestration within SageMaker Unified Studio. Outside of his professional work, he enjoys spending time with his kid, and playing pickleball and cricket.
Yuhang Huang is a Software Development Manager on the Amazon SageMaker Unified Studio team. He leads the engineering team to design, build, and operate scheduling and orchestration capabilities in SageMaker Unified Studio. In his free time, he enjoys playing tennis.
Gal Heyne is a Senior Technical Product Manager for AWS Analytics services with a strong focus on AI/ML and data engineering. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design simple-to-use data products.