AWS Big Data Blog
HAQM DataZone announces integration with AWS Lake Formation hybrid access mode for the AWS Glue Data Catalog
Last week, we announced the general availability of the integration between HAQM DataZone and AWS Lake Formation hybrid access mode. In this post, we share how this new feature helps you simplify the way you use HAQM DataZone to enable secure and governed sharing of your data in the AWS Glue Data Catalog. We also delve into how data producers can share their AWS Glue tables through HAQM DataZone without needing to register them in Lake Formation first.
Overview of the HAQM DataZone integration with Lake Formation hybrid access mode
HAQM DataZone is a fully managed data management service to catalog, discover, analyze, share, and govern data between data producers and consumers in your organization. With HAQM DataZone, data producers populate the business data catalog with data assets from data sources such as the AWS Glue Data Catalog and HAQM Redshift. They also enrich their assets with business context to make it straightforward for data consumers to understand. After the data is available in the catalog, data consumers such as analysts and data scientists can search and access this data by requesting subscriptions. When the request is approved, HAQM DataZone can automatically provision access to the data by managing permissions in Lake Formation or HAQM Redshift so that the data consumer can start querying the data using tools such as HAQM Athena or HAQM Redshift.
To manage the access to data in the AWS Glue Data Catalog, HAQM DataZone uses Lake Formation. Previously, if you wanted to use HAQM DataZone for managing access to your data in the AWS Glue Data Catalog, you had to onboard your data to Lake Formation first. Now, the integration of HAQM DataZone and Lake Formation hybrid access mode simplifies how you can get started with your HAQM DataZone journey by removing the need to onboard your data to Lake Formation first.
Lake Formation hybrid access mode allows you to start managing permissions on your AWS Glue databases and tables through Lake Formation, while continuing to maintain any existing AWS Identity and Access Management (IAM) permissions on these tables and databases. Lake Formation hybrid access mode supports two permission pathways to the same Data Catalog databases and tables:
- In the first pathway, Lake Formation allows you to select specific principals (opt-in principals) and grant them Lake Formation permissions to access databases and tables by opting in
- The second pathway allows all other principals (that are not added as opt-in principals) to access these resources through the IAM principal policies for HAQM Simple Storage Service (HAQM S3) and AWS Glue actions
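To make the two pathways concrete, here is a deliberately simplified toy model in Python. It is not how Lake Formation evaluates permissions internally; the class, the function names, and the principal names are hypothetical and exist only to show that opt-in principals are checked against Lake Formation grants while everyone else falls back to IAM.

```python
from dataclasses import dataclass, field


@dataclass
class HybridResource:
    """Toy stand-in for a Data Catalog table registered in hybrid access mode."""
    name: str
    # Principals that were opted in to Lake Formation permissions.
    opt_in_principals: set = field(default_factory=set)
    # Lake Formation grants: principal -> set of permissions, e.g. {"SELECT"}.
    lf_grants: dict = field(default_factory=dict)
    # IAM policy actions: principal -> set of allowed actions.
    iam_allowed: dict = field(default_factory=dict)


def access_pathway(resource: HybridResource, principal: str) -> str:
    """Return which permission pathway governs this principal."""
    return "lake_formation" if principal in resource.opt_in_principals else "iam"


def can_read(resource: HybridResource, principal: str) -> bool:
    """Check read access through whichever pathway applies to the principal."""
    if access_pathway(resource, principal) == "lake_formation":
        return "SELECT" in resource.lf_grants.get(principal, set())
    # Simplification: treat these two actions as "enough" for an IAM-based read.
    needed = {"glue:GetTable", "s3:GetObject"}
    return needed <= resource.iam_allowed.get(principal, set())


if __name__ == "__main__":
    sales = HybridResource(
        name="sales",
        opt_in_principals={"finance-role"},
        lf_grants={"finance-role": {"SELECT"}},
        iam_allowed={"etl-role": {"glue:GetTable", "s3:GetObject"}},
    )
    for who in ("finance-role", "etl-role", "unknown-role"):
        print(who, access_pathway(sales, who), can_read(sales, who))
```

The point of the model is that adding `finance-role` as an opt-in principal changes nothing for `etl-role`, which keeps reading through its existing IAM policy.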
With the integration between HAQM DataZone and Lake Formation hybrid access mode, if you have tables in the AWS Glue Data Catalog that are managed through IAM-based policies, you can publish these tables directly to HAQM DataZone, without registering them in Lake Formation. HAQM DataZone registers the location of these tables in Lake Formation using hybrid access mode, which allows managing permissions on AWS Glue tables through Lake Formation, while continuing to maintain any existing IAM permissions.
HAQM DataZone enables you to publish any type of asset in the business data catalog. For some of these assets, HAQM DataZone can automatically manage access grants. These assets are called managed assets, and include Lake Formation-managed Data Catalog tables and HAQM Redshift tables and views. Prior to this integration, you had to complete the following steps before HAQM DataZone could treat the published Data Catalog table as a managed asset:
- Identify the HAQM S3 location associated with the Data Catalog table.
- Register the HAQM S3 location with Lake Formation in hybrid access mode using a role with appropriate permissions.
- Publish the table metadata to the HAQM DataZone business data catalog.
The following diagram illustrates this workflow.
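For readers who prefer code to console steps, the first two manual steps listed above could be sketched with boto3 roughly as follows. This is an illustration under assumptions: the `glue` and `lakeformation` arguments are boto3 clients you create yourself, the database, table, and role values are placeholders, and the ARN conversion assumes the standard `aws` partition.

```python
def s3_uri_to_arn(s3_uri: str) -> str:
    """Convert s3://bucket/prefix into arn:aws:s3:::bucket/prefix."""
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {s3_uri}")
    return "arn:aws:s3:::" + s3_uri[len("s3://"):].rstrip("/")


def register_table_location(glue, lakeformation, database: str, table: str,
                            role_arn: str) -> str:
    """Look up a table's S3 location and register it in hybrid access mode."""
    # Step 1: identify the S3 location associated with the Data Catalog table.
    table_info = glue.get_table(DatabaseName=database, Name=table)
    location = table_info["Table"]["StorageDescriptor"]["Location"]
    resource_arn = s3_uri_to_arn(location)

    # Step 2: register the location with Lake Formation in hybrid access mode,
    # using a role that has permissions on the S3 location.
    lakeformation.register_resource(
        ResourceArn=resource_arn,
        RoleArn=role_arn,
        HybridAccessEnabled=True,
    )
    # Step 3 (publishing the metadata) happens in the data portal or through
    # a data source run, so it is not shown here.
    return resource_arn
```

A call might look like `register_table_location(boto3.client("glue"), boto3.client("lakeformation"), "my_db", "sales", role_arn)`, where every argument value is hypothetical.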
With HAQM DataZone’s integration with Lake Formation hybrid access mode, you can publish your AWS Glue tables to HAQM DataZone without registering the HAQM S3 location or adding an opt-in principal in Lake Formation yourself, because these steps are delegated to HAQM DataZone. The administrator of an AWS account can enable the data location registration setting under the DefaultDataLake blueprint on the HAQM DataZone console. Now, a data owner or publisher can publish their AWS Glue table (managed through IAM permissions) to HAQM DataZone without the extra setup steps. When a data consumer subscribes to this table, HAQM DataZone registers the HAQM S3 locations of the table in hybrid access mode, adds the data consumer’s IAM role as an opt-in principal, and grants access to the same IAM role by managing permissions on the table through Lake Formation. This makes sure that IAM permissions on the table can coexist with newly granted Lake Formation permissions, without disrupting any existing workflows. The following diagram illustrates this workflow.
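The opt-in and grant steps described above correspond to two Lake Formation API operations. The sketch below builds the request parameters for them; it is not a description of how HAQM DataZone is implemented internally. The helper names are hypothetical, the `SELECT` and `DESCRIBE` permission set is an assumption for a read-only consumer, and `lakeformation` is a boto3 client supplied by the caller.

```python
def table_resource(database: str, table: str) -> dict:
    """Lake Formation resource descriptor for a single Data Catalog table."""
    return {"Table": {"DatabaseName": database, "Name": table}}


def build_subscription_grant_calls(consumer_role_arn: str, database: str,
                                   table: str) -> list:
    """Return (method name, kwargs) pairs for the opt-in and the grant."""
    principal = {"DataLakePrincipalIdentifier": consumer_role_arn}
    resource = table_resource(database, table)
    return [
        # Add the consumer's IAM role as an opt-in principal for the table.
        ("create_lake_formation_opt_in",
         {"Principal": principal, "Resource": resource}),
        # Grant the same role read access through Lake Formation.
        ("grant_permissions",
         {"Principal": principal, "Resource": resource,
          "Permissions": ["SELECT", "DESCRIBE"]}),
    ]


def apply_calls(lakeformation, calls: list) -> None:
    """Invoke each prepared call on a boto3 Lake Formation client."""
    for method_name, kwargs in calls:
        getattr(lakeformation, method_name)(**kwargs)
```

Separating the request building from the client calls makes it easy to review exactly what would be sent before anything changes in the account.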
Solution overview
To demonstrate this new capability, we use a sample customer scenario where the finance team wants to access data owned by the sales team for financial analysis and reporting. The sales team has a pipeline that creates a dataset containing valuable information about ticket sales, popular events, venues, and seasons. We call it the tickit dataset. The sales team stores this dataset in HAQM S3 and registers it in a database in the Data Catalog. The access to this table is currently managed through IAM-based permissions. However, the sales team wants to publish this table to HAQM DataZone to facilitate secure and governed data sharing with the finance team.
The steps to configure this solution are as follows:
- The HAQM DataZone administrator enables the data lake location registration setting in HAQM DataZone to automatically register the HAQM S3 location of the AWS Glue tables in Lake Formation hybrid access mode.
- After the hybrid access mode integration is enabled in HAQM DataZone, the finance team requests a subscription to the sales data asset. The asset shows up as a managed asset, which means HAQM DataZone can manage access to this asset even if the HAQM S3 location of this asset isn’t registered in Lake Formation.
- The sales team is notified of a subscription request raised by the finance team. They review and approve the access request. After the request is approved, HAQM DataZone fulfills the subscription request by managing permissions in Lake Formation. It registers the HAQM S3 location of the subscribed table in Lake Formation hybrid access mode.
- The finance team gains access to the sales dataset required for their financial reports. They can go to their DataZone environment and start running queries using Athena against their subscribed dataset.
Prerequisites
To follow the steps in this post, you need an AWS account. If you don’t have an account, you can create one. In addition, you must have the following resources configured in your account:
- An S3 bucket
- An AWS Glue database and crawler
- IAM roles for different personas and services
- An HAQM DataZone domain and project
- An HAQM DataZone environment profile and environment
- An HAQM DataZone data source
If you don’t have these resources already configured, you can create them by deploying the following AWS CloudFormation stack:
- Choose Launch Stack to deploy a CloudFormation template.
- Complete the steps to deploy the template and leave all settings as default.
- Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.
After the CloudFormation deployment is complete, you can log in to the HAQM DataZone portal and manually trigger a data source run. This pulls any new or modified metadata from the source and updates the associated assets in the inventory. This data source has been configured to automatically publish the data assets to the catalog.
- On the HAQM DataZone console, choose View domains.
Make sure that you are logged in with the same role that you used to deploy the CloudFormation stack, and verify that you are in the same AWS Region.
- Find the domain blog_dz_domain, then choose Open data portal.
- Choose Browse all projects and choose Sales producer project.
- On the Data tab, choose Data sources in the navigation pane.
- Locate and choose the data source that you want to run.
This opens the data source details page.
- Choose the options menu (three vertical dots) next to tickit_datasource and choose Run.
The data source status changes to Running as HAQM DataZone updates the asset metadata.
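If you would rather trigger the run programmatically, the following sketch starts a data source run and polls until it reaches a terminal status. It assumes `datazone` is a boto3 DataZone client and that you already know the domain and data source identifiers; the function name, polling interval, and retry limit are our own choices for illustration.

```python
import time

# Statuses after which the run will not change any further.
TERMINAL_STATUSES = {"SUCCESS", "PARTIALLY_SUCCEEDED", "FAILED"}


def run_data_source(datazone, domain_id: str, data_source_id: str,
                    poll_seconds: float = 10.0, max_polls: int = 60) -> str:
    """Start a data source run and wait for it to finish; return the status."""
    run = datazone.start_data_source_run(
        domainIdentifier=domain_id,
        dataSourceIdentifier=data_source_id,
    )
    run_id = run["id"]
    status = run["status"]

    for _ in range(max_polls):
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
        status = datazone.get_data_source_run(
            domainIdentifier=domain_id,
            identifier=run_id,
        )["status"]

    if status in TERMINAL_STATUSES:
        return status
    raise TimeoutError(f"data source run {run_id} is still {status}")
```

The console shows the same progression: the run moves to Running and then to a final status once the asset metadata has been updated.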
Enable hybrid mode integration in HAQM DataZone
In this step, the HAQM DataZone administrator goes through the process of enabling the HAQM DataZone integration with Lake Formation hybrid access mode. Complete the following steps:
- On a separate browser tab, open the HAQM DataZone console.
Verify that you are in the same Region where you deployed the CloudFormation template.
- Choose View domains.
- Choose the domain created by AWS CloudFormation, blog_dz_domain.
- Scroll down on the domain details page and choose the Blueprints tab.
A blueprint defines what AWS tools and services can be used with the data assets published in HAQM DataZone. The DefaultDataLake blueprint is enabled as part of the CloudFormation stack deployment. This blueprint enables you to create and query AWS Glue tables using Athena. For the steps to enable this in your own deployments, refer to Enable built-in blueprints in the AWS account that owns the HAQM DataZone domain.
- Choose the DefaultDataLake blueprint.
- On the Provisioning tab, choose Edit.
- Select Enable HAQM DataZone to register S3 locations using AWS Lake Formation hybrid access mode.
You have the option of excluding specific HAQM S3 locations if you don’t want HAQM DataZone to automatically register them to Lake Formation hybrid access mode.
- Choose Save changes.
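The same setting can, to our understanding, be applied through the `PutEnvironmentBlueprintConfiguration` API. The sketch below only builds the request parameters and assumes that your SDK version supports the `provisioningConfigurations` parameter with a `lakeFormationConfiguration` entry, so verify the parameter names against the API reference before relying on it. The function name and all identifier values are placeholders.

```python
def build_blueprint_configuration(domain_id: str, blueprint_id: str,
                                  enabled_regions, registration_role_arn: str,
                                  excluded_s3_locations=()) -> dict:
    """Build kwargs that enable hybrid-mode location registration."""
    return {
        "domainIdentifier": domain_id,
        "environmentBlueprintIdentifier": blueprint_id,
        "enabledRegions": list(enabled_regions),
        "provisioningConfigurations": [
            {
                "lakeFormationConfiguration": {
                    # Role used to register S3 locations with Lake Formation.
                    "locationRegistrationRole": registration_role_arn,
                    # S3 locations that should NOT be registered automatically.
                    "locationRegistrationExcludeS3Locations": list(excluded_s3_locations),
                }
            }
        ],
    }


# Hypothetical usage with a boto3 DataZone client named `datazone`:
# datazone.put_environment_blueprint_configuration(
#     **build_blueprint_configuration(
#         "dzd_example", "blueprint_example", ["us-east-1"],
#         "arn:aws:iam::111122223333:role/ExampleRegistrationRole",
#         excluded_s3_locations=["s3://example-bucket/do-not-register/"],
#     )
# )
```

Because this is a put operation, any existing settings on the blueprint (such as the manage access role and provisioning role) likely need to be included in the same call so they are not dropped.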
Request access
In this step, you log in to HAQM DataZone as the finance team, search for the sales data asset, and subscribe to it. Complete the following steps:
- Return to your HAQM DataZone data portal browser tab.
- Switch to the finance consumer project by choosing the dropdown menu next to the project name and choosing Finance consumer project.
From this step onwards, you take on the persona of a finance user looking to subscribe to a data asset published in the previous step.
- In the search bar, search for and choose the sales data asset.
- Choose Subscribe.
The asset shows up as a managed asset. This means that HAQM DataZone can grant access to this data asset to the finance team’s project by managing the permissions in Lake Formation.
- Enter a reason for the access request and choose Subscribe.
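The subscription request made in the portal can also be expressed as a `CreateSubscriptionRequest` API call. The following sketch builds the parameters; it assumes you have already looked up the listing identifier of the sales asset (for example, with a catalog search) and the identifier of the finance consumer project. The function name and the example values are hypothetical.

```python
def build_subscription_request(domain_id: str, listing_id: str,
                               project_id: str, reason: str) -> dict:
    """Build kwargs for a DataZone create_subscription_request call."""
    if not reason.strip():
        # The portal asks for a reason, so the sketch requires one too.
        raise ValueError("a reason for the access request is required")
    return {
        "domainIdentifier": domain_id,
        "requestReason": reason,
        # The published asset (listing) being subscribed to.
        "subscribedListings": [{"identifier": listing_id}],
        # The project that will receive access.
        "subscribedPrincipals": [{"project": {"identifier": project_id}}],
    }


# Hypothetical usage with a boto3 DataZone client named `datazone`:
# datazone.create_subscription_request(
#     **build_subscription_request(
#         "dzd_example", "listing_example", "project_example",
#         "Sales data needed for quarterly financial reporting",
#     )
# )
```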
Approve access request
The sales team gets a notification that an access request from the finance team is submitted. To approve the request, complete the following steps:
- Choose the dropdown menu next to the project name and choose Sales producer project.
You now assume the persona of the sales team, who are the owners and stewards of the sales data assets.
- Choose the notification icon at the top-right corner of the DataZone portal.
- Choose the Subscription Request Created task.
- Grant access to the sales data asset to the finance team and choose Approve.
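As a rough programmatic counterpart to the approval steps, the sketch below lists pending subscription requests for the producer project and accepts each one. It assumes `datazone` is a boto3 DataZone client, ignores pagination for brevity, and approves everything it finds, which is only appropriate for a demo; in practice you would review each request first. The function name and comment text are our own.

```python
def approve_pending_requests(datazone, domain_id: str,
                             approver_project_id: str, comment: str) -> list:
    """Accept every pending subscription request; return the approved IDs."""
    pending = datazone.list_subscription_requests(
        domainIdentifier=domain_id,
        status="PENDING",
        approverProjectId=approver_project_id,
    )
    approved = []
    for request in pending.get("items", []):
        datazone.accept_subscription_request(
            domainIdentifier=domain_id,
            identifier=request["id"],
            decisionComment=comment,
        )
        approved.append(request["id"])
    return approved
```

After a request is accepted, HAQM DataZone fulfills the subscription as described earlier, so no separate Lake Formation call is needed from the approver.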
Analyze the data
The finance team has now been granted access to the sales data, and this dataset has been added to their HAQM DataZone environment. They can access the environment and query the sales dataset with Athena, along with any other datasets they currently own. Complete the following steps:
- On the dropdown menu, choose Finance consumer project.
On the right pane of the project overview screen, you can find a list of active environments available for use.
- Choose the HAQM DataZone environment finance_dz_environment.
- In the navigation pane, under Data assets, choose Subscribed.
- Verify that your environment now has access to the sales data.
It may take a few minutes for the data asset to be automatically added to your environment.
- Choose the new tab icon for Query data.
A new tab opens with the Athena query editor.
- For Database, choose finance_consumer_db_tickitdb-<suffix>.
This database will contain your subscribed data assets.
- Generate a preview of the sales table by choosing the options menu (three vertical dots) and choosing Preview table.
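The table preview is an ordinary Athena query, so it can be submitted through the API as well. This sketch builds a `SELECT ... LIMIT` statement and starts it with a boto3 Athena client passed in by the caller. The workgroup name is a placeholder (we assume you use the workgroup associated with your environment), the database name keeps the `<suffix>` placeholder from the step above, and the row limit of 10 is our own choice.

```python
def preview_query(database: str, table: str, limit: int = 10) -> str:
    """Build a preview statement with double-quoted identifiers."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Escape embedded double quotes so the identifiers stay well-formed.
    quoted_db = database.replace('"', '""')
    quoted_table = table.replace('"', '""')
    return f'SELECT * FROM "{quoted_db}"."{quoted_table}" LIMIT {int(limit)}'


def start_preview(athena, database: str, table: str, workgroup: str,
                  limit: int = 10) -> str:
    """Start the preview query and return the Athena query execution ID."""
    response = athena.start_query_execution(
        QueryString=preview_query(database, table, limit),
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,
    )
    return response["QueryExecutionId"]


# Hypothetical usage with a boto3 Athena client named `athena`:
# start_preview(athena, "finance_consumer_db_tickitdb-<suffix>", "sales",
#               workgroup="example-workgroup")
```

If the query fails with an access error shortly after approval, the subscription may still be propagating to the environment.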
Clean up
To clean up your resources, complete the following steps:
- Switch back to the administrator role you used to deploy the CloudFormation stack.
- On the HAQM DataZone console, delete the projects used in this post. This will delete most project-related objects like data assets and environments.
- On the AWS CloudFormation console, delete the stack you deployed at the beginning of this post.
- On the HAQM S3 console, delete the S3 buckets containing the tickit dataset.
- On the Lake Formation console, delete the Lake Formation admins registered by HAQM DataZone.
- On the Lake Formation console, delete tables and databases created by HAQM DataZone.
Conclusion
In this post, we discussed how the integration between HAQM DataZone and Lake Formation hybrid access mode simplifies the process to start using HAQM DataZone for end-to-end governance of your data in the AWS Glue Data Catalog. This integration helps you bypass the manual steps of onboarding to Lake Formation before you can start using HAQM DataZone.
For more information on how to get started with HAQM DataZone, refer to the Getting started guide. Check out the YouTube playlist for some of the latest demos of HAQM DataZone and short descriptions of the capabilities available. For more information about HAQM DataZone, see How HAQM DataZone helps customers find value in oceans of data.
About the Authors
Utkarsh Mittal is a Senior Technical Product Manager for HAQM DataZone at AWS. He is passionate about building innovative products that simplify customers’ end-to-end analytics journeys. Outside of the tech world, Utkarsh loves to play music, with drums being his latest endeavor.
Praveen Kumar is a Principal Analytics Solution Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-centered services. His areas of interest are serverless technology, modern cloud data warehouses, streaming, and generative AI applications.
Paul Villena is a Senior Analytics Solutions Architect at AWS with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interest are infrastructure as code, serverless technologies, and coding in Python.