AWS News Blog

Announcing the general availability of data lineage in the next generation of HAQM SageMaker and HAQM DataZone

Today, I’m happy to announce the general availability of data lineage in HAQM DataZone, following its preview release in June 2024. This feature is also available as part of the catalog capabilities in the next generation of HAQM SageMaker, a unified platform for data, analytics, and AI.

Traditionally, business analysts have relied on manual documentation or personal connections to validate data origins, leading to inconsistent and time-consuming processes. Data engineers have struggled to evaluate the impact of changes to data assets, especially as self-service analytics adoption increases. Additionally, data governance teams have faced difficulties in enforcing practices and responding to auditor queries about data movement.

Data lineage in HAQM DataZone addresses the challenges faced by organizations striving to remain competitive by using their data for strategic analysis. It enhances data trust and validation by providing a visual, traceable history of data assets, enabling business analysts to quickly understand data origins without manual research. For data engineers, it facilitates impact analysis and troubleshooting by clearly showing relationships between assets and allowing easy tracing of data flows.

The feature supports data governance and compliance efforts by offering a comprehensive view of data movement, helping governance teams to quickly respond to compliance queries and enforce data policies. It improves data discovery and understanding, helping consumers grasp the context and relevance of data assets more efficiently. Additionally, data lineage contributes to better change management, increased data literacy, reduced data duplication, and enhanced cross-team collaboration. By tackling these challenges, data lineage in HAQM DataZone helps organizations build a more trustworthy, efficient, and compliant data ecosystem, ultimately enabling more effective data-driven decision-making.

Automated lineage capture is a key feature of data lineage in HAQM DataZone: it automatically collects and maps lineage information from AWS Glue and HAQM Redshift. This automation significantly reduces the manual effort required to maintain accurate and up-to-date lineage information.

Get started with data lineage in HAQM DataZone
Data producers and domain administrators get started by setting up data source run jobs for the AWS Glue Data Catalog and HAQM Redshift sources in HAQM DataZone to periodically collect metadata from the source catalog. Additionally, data producers can hydrate lineage information programmatically by creating custom lineage nodes using APIs that accept OpenLineage-compatible events from existing pipeline components—such as schedulers, warehouses, analysis tools, and SQL engines—to send data about datasets, jobs, and runs directly to the HAQM DataZone API endpoint. As this information is sent, HAQM DataZone populates the lineage model and maps the events to assets already cataloged. As new lineage events are captured, HAQM DataZone maintains versions of previously captured events, so users can navigate to earlier versions if needed.
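For example, here’s a minimal sketch of sending one such event using the AWS SDK for Python (Boto3). The domain ID, job namespace, and dataset names are placeholders, and the event body is a bare-bones OpenLineage RunEvent rather than a production payload:

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

datazone = boto3.client("datazone")

# Minimal OpenLineage RunEvent: a job run that read raw.orders and
# wrote analytics.orders_daily. All names below are placeholders.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my-pipeline", "name": "daily_orders_load"},
    "inputs": [{"namespace": "my-warehouse", "name": "raw.orders"}],
    "outputs": [{"namespace": "my-warehouse", "name": "analytics.orders_daily"}],
    "producer": "http://github.com/example/my-pipeline",
    "schemaURL": "http://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
}

# PostLineageEvent accepts the OpenLineage event as a raw payload.
datazone.post_lineage_event(
    domainIdentifier="dzd_1234567890",  # placeholder domain ID
    event=json.dumps(event).encode("utf-8"),
)
```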

From the consumer’s perspective, lineage can help with three scenarios. First, a business analyst browsing an asset can go to the HAQM DataZone portal, search for an asset by name, and select the asset that interests them to dive into the details. Initially, they’re presented with details in the Business Metadata tab and can move through the neighboring tabs. To view lineage, the analyst goes to the Lineage tab, which presents the asset’s lineage one level upstream and downstream. To find the source, the analyst can keep expanding upstream nodes until they reach it. Once the analyst is sure this is the correct asset, they can subscribe to it and continue with their work.

Second, if a data issue is reported—for instance, when a dashboard unexpectedly shows a significant increase in customer count—a data engineer can use the HAQM DataZone portal to locate and examine the relevant asset details. On the asset details page, the data engineer navigates to the Lineage tab to view the upstream nodes of the asset in question. The engineer can dive into the details of each node, its snapshots, the column mapping between table nodes, and the jobs that ran in between, including the query executed in each job run. Using this information, the data engineer can spot that a new input table, one that wasn’t part of previous snapshots of the job runs, was added to the pipeline and introduced the uptick in customer count. This confirms that a new source was added and that the data shown in the dashboard is accurate.

Lastly, a steward looking to respond to questions from an auditor can go to the asset in question and navigate to its Lineage tab. The steward traverses the graph upstream and sees that the data comes from two different teams—for instance, from two different on-premises databases—each with its own pipeline, until the pipelines merge at a single point. While navigating the lineage graph, the steward can expand the columns to verify that sensitive columns are dropped during the transformation processes, and respond to the auditor with details in a timely manner.
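All three walk-throughs use the HAQM DataZone portal, but the same traversal is available programmatically. Here’s a minimal sketch, assuming a placeholder domain ID and node identifier, that walks upstream from a lineage node with the GetLineageNode API:

```python
import boto3

datazone = boto3.client("datazone")
DOMAIN_ID = "dzd_1234567890"  # placeholder domain ID


def print_upstream(node_id, depth=0, max_depth=3):
    """Recursively print a lineage node and its upstream nodes."""
    if depth > max_depth:
        return
    node = datazone.get_lineage_node(
        domainIdentifier=DOMAIN_ID, identifier=node_id
    )
    print("  " * depth + f"{node.get('name', node['id'])} ({node['typeName']})")
    # Upstream references carry the node ID and event timestamp;
    # fetch each referenced node to continue the traversal.
    for ref in node.get("upstreamNodes", []):
        print_upstream(ref["id"], depth + 1, max_depth)


print_upstream("placeholder-lineage-node-id")
```

The ListLineageNodeHistory API can similarly page through earlier snapshots of a given node, which is how previous versions of captured events remain navigable.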

How HAQM DataZone automates lineage collection
HAQM DataZone now enables automatic capture of lineage events, helping data producers and administrators streamline the tracking of data relationships and transformations across their AWS Glue and HAQM Redshift resources. Automatic capture of lineage events from AWS Glue and HAQM Redshift is opt in, because some of your jobs or connections might be for testing and might not need lineage captured. With the integrated experience, each service provides an option in its configuration settings to opt in to collecting and emitting lineage events directly to HAQM DataZone.

These events capture the various data transformation operations you perform on tables and other objects, such as table creation with column definitions, schema changes, and transformation queries, including aggregations and filtering. By obtaining these lineage events directly from your processing engines, HAQM DataZone can build a foundation of accurate and consistent data lineage information. This then helps you, as a data producer, to further curate the lineage data as part of the broader business data catalog capabilities.
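As a hedged illustration, the snippet below shows the kinds of OpenLineage facets such an event can carry: a SQL job facet holding the transformation query and a column lineage facet mapping an output column to the input columns it was derived from. The payload is invented for this example, not an actual event emitted by AWS Glue or HAQM Redshift:

```python
# Invented, illustrative payload following OpenLineage facet conventions.
facets_example = {
    "job": {
        "namespace": "my-glue-jobs",
        "name": "orders_aggregation",
        "facets": {
            # The executed transformation query, including filtering
            # and aggregation.
            "sql": {
                "query": (
                    "INSERT INTO analytics.orders_daily "
                    "SELECT order_date, COUNT(*) AS order_count "
                    "FROM raw.orders WHERE status = 'COMPLETE' "
                    "GROUP BY order_date"
                )
            }
        },
    },
    "outputs": [
        {
            "namespace": "my-warehouse",
            "name": "analytics.orders_daily",
            "facets": {
                # Column-level mapping: which input columns each
                # output column is derived from.
                "columnLineage": {
                    "fields": {
                        "order_date": {
                            "inputFields": [
                                {
                                    "namespace": "my-warehouse",
                                    "name": "raw.orders",
                                    "field": "order_date",
                                }
                            ]
                        }
                    }
                }
            },
        }
    ],
}
```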

Administrators can enable lineage when setting up the built-in DefaultDataLake or the DefaultDataWarehouse blueprints.

Data producers can view the status of automated lineage while setting up the data source runs.

With the recent launch of the next generation of HAQM SageMaker, data lineage is available as one of the catalog capabilities in HAQM SageMaker Unified Studio (preview). Data users can set up lineage using connections; that configuration automates lineage capture across the platform so that all users can browse and understand the data. Here’s how data lineage looks in the next generation of HAQM SageMaker.

Now available
You can begin using this capability to gain deeper insights into your data ecosystem and drive more informed, data-driven decision-making.

Data lineage is generally available in all AWS Regions where HAQM DataZone is available. For a list of Regions where HAQM DataZone domains can be provisioned, visit AWS Services by Region.

Data lineage costs depend on storage usage and API requests, both of which are already included in the HAQM DataZone pricing model. For more details, visit HAQM DataZone pricing.

To get started with data lineage in HAQM DataZone, visit the HAQM DataZone User Guide.

— Esra
Esra Kayabali

Esra Kayabali is a Senior Solutions Architect at AWS, specializing in analytics, including data warehousing, data lakes, big data analytics, batch and real-time data streaming, and data integration. She has more than ten years of software development and solution architecture experience. She is passionate about collaborative learning, knowledge sharing, and guiding the community on their cloud technology journey.