AWS Big Data Blog
HAQM DataZone announces custom blueprints for AWS services
Last week, we announced the general availability of custom AWS service blueprints, a new feature in HAQM DataZone that lets you customize your HAQM DataZone project environments to use existing AWS Identity and Access Management (IAM) roles and AWS services, so you can embed the service into your existing processes. In this post, we share how this new feature can help you federate to your existing AWS resources using your own IAM role. We also delve into details on how to configure data sources and subscription targets for a project using a custom AWS service blueprint.
New feature: Custom AWS service blueprints
Previously, HAQM DataZone provided default blueprints that created AWS resources required for data lake, data warehouse, and machine learning use cases. However, you may have existing AWS resources such as HAQM Redshift databases, HAQM Simple Storage Service (HAQM S3) buckets, AWS Glue Data Catalog tables, AWS Glue ETL jobs, HAQM EMR clusters, and many more for your data lake, data warehouse, and other use cases. With HAQM DataZone default blueprints, you were limited to only using preconfigured AWS resources that HAQM DataZone created. Customers needed a way to integrate these existing AWS service resources with HAQM DataZone, using a customized IAM role so that HAQM DataZone users can get federated access to those AWS service resources and use the publication and subscription features of HAQM DataZone to share and govern them.
Now, with custom AWS service blueprints, you can use your existing resources using your preconfigured IAM role. Administrators can customize HAQM DataZone to use existing AWS resources, enabling HAQM DataZone portal users to have federated access to those AWS services to catalog, share, and subscribe to data, thereby establishing data governance across the platform.
Benefits of custom AWS service blueprints
Custom AWS service blueprints don’t provision any resources for you, unlike other blueprints. Instead, you can configure your IAM role (bring your own role) to integrate your existing AWS resources with HAQM DataZone. Additionally, you can configure action links, which provide federated access to any AWS resources like S3 buckets, AWS Glue ETL jobs, and so on, using your IAM role.
You can also configure custom AWS service blueprints to bring your own resources, namely AWS databases, as data sources and subscription targets to enhance governance across those assets. With this release, administrators can configure data sources and subscription targets on the HAQM DataZone console and not be restricted to do those actions in the data portal.
Custom blueprints and environments can only be set up by administrators to manage access to configured AWS resources. As custom environments are created in specific projects, the right to grant access to custom resources is delegated to the project owners who can manage project membership by adding or removing members. This restricts the ability of portal users to create custom environments without the right permissions in AWS Console for HAQM DataZone or access custom AWS resources configured in a project that they are not a member of.
Solution overview
To get started, administrators need to enable the custom AWS service blueprints feature on the HAQM DataZone console. Then administrators can customize configurations by defining which project and IAM role to use when federating to the AWS services that are set up as action links for end-users. After the customized setup is complete, data producers and consumers who log in to the HAQM DataZone portal and are members of those customized projects can federate to any of the configured AWS services, such as HAQM S3 to upload or download files, or go directly to existing AWS Glue ETL jobs using their own IAM roles and continue working with data in their tool of choice. With this feature, you can now include HAQM DataZone in your existing data pipeline processes to catalog, share, and govern data.
The following diagram shows an administrator’s workflow to set up a custom blueprint.
In the following sections, we discuss common use cases for custom blueprints, and walk through the setup step by step. If you’re new to HAQM DataZone, refer to Getting started.
Use case 1: Bring your own role and resources
Customers manage data platforms that consist of AWS managed services such as AWS Lake Formation, HAQM S3 for data lakes, AWS Glue for ETL, and so on. With those processes already set up, you may want to bring your own roles and resources to HAQM DataZone to continue with an existing process without any disruption. In such cases, you may not want HAQM DataZone to create new resources, both because doing so disrupts existing processes in data pipelines and because you want to curtail AWS resource usage and costs.
In the current setup, you can create an HAQM DataZone domain associated with different accounts. There could be a dedicated account that acts like a producer to share data, and a few other consumer accounts to subscribe to published assets in the catalog. The consumer account has IAM permissions set up for the AWS Glue ETL job to use for the subscription environment of a project. By doing so, the role has access to the newly subscribed data as well as permissions from previous setups to access data from other AWS resources. After you configure the AWS Glue job IAM role in the environment using the custom AWS service blueprint, the authorized users of that role can use the subscribed assets in the AWS Glue ETL job and extend that data for downstream activities to store them in HAQM S3 and other databases to be queried and analyzed using the HAQM Athena SQL editor or HAQM QuickSight.
Use case 2: HAQM S3 multi-file downloads
Customers and users of the HAQM DataZone portal often need the ability to download files after searching and filtering through the catalog in an HAQM DataZone project. This requirement arises because the data and analytics associated with a particular use case can sometimes involve hundreds of files. Downloading these files individually would be a tedious and time-consuming process for HAQM DataZone users. To address this need, the HAQM DataZone portal can take advantage of the capabilities provided by custom AWS service blueprints. These custom blueprints allow you to configure action links to S3 bucket folders associated with specified HAQM DataZone projects.
You can build projects and subscribe to both unstructured and structured data assets within the HAQM DataZone portal. For structured datasets, you can use HAQM DataZone blueprint-based environments like data lakes (Athena) and data warehouses (HAQM Redshift). For unstructured data assets, you can use the custom blueprint-based HAQM S3 environment, which provides a familiar HAQM S3 browser interface with access to specific buckets and folders, using an IAM role owned and provided by the customer. This functionality streamlines the process of finding and accessing unstructured data and allows you to download multiple files at once, enabling you to build and enhance your analytics more efficiently.
Use case 3: HAQM S3 file uploads
In addition to the download functionality, users often need to retain and attach metadata to new versions of files. For example, when you download a file, you can perform data changes, enrichment, or analysis on the file, and then upload the updated version back to the HAQM DataZone portal. For uploading files, HAQM DataZone users can use the same custom blueprint-based HAQM S3 environment action links to upload files.
Use case 4: Extend existing environments to custom blueprint environments
You may have existing HAQM DataZone project environments created using default data lake and data warehouse blueprints. With other AWS services set up in the data platform, you may want to extend the configured project environments to include those additional services to provide a seamless experience for your data producers or consumers while switching between tools.
Now that you understand the capabilities of the new feature, let’s look at how administrators can set up a custom role and resources on the HAQM DataZone console.
Create a domain
First, you need an HAQM DataZone domain. If you already have one, you can skip to enabling your custom blueprints. Otherwise, refer to Create domains for instructions to set up a domain. Optionally, you can associate accounts if you want to set up HAQM DataZone across multiple accounts.
Associate accounts for cross-account scenarios
You can optionally associate accounts. For instructions, refer to Request association with other AWS accounts. Make sure to use the latest AWS Resource Access Manager (AWS RAM) DataZonePortalReadWrite policy when requesting account association. If your account is already associated, request access again with the new policy.
Accept the account association request
To accept the account association request, refer to Accept an account association request from an HAQM DataZone domain and enable an environment blueprint. After you accept the account association, you should see the following screenshot.
Add associated account users in the HAQM DataZone domain account
With this launch, you can set up associated account owners to access the HAQM DataZone data portal from their account. To enable this, they need to be registered as users in the domain account. As a domain admin, you can create HAQM DataZone user profiles to allow HAQM DataZone access to users and roles from the associated account. Complete the following steps:
- On the HAQM DataZone console, navigate to your domain.
- On the User management tab, choose Add IAM Users from the Add dropdown menu.
- Enter the ARNs of your associated account IAM users or roles. For this post, we add arn:aws:iam::123456789101:role/serviceBlueprintRole and arn:aws:iam::123456789101:user/Jacob.
- Choose Add users(s).
Back on the User management tab, you should see the new user state with Assigned status. This means that the domain owner has assigned associated account users to access HAQM DataZone. This status will change to Active when the identity starts using HAQM DataZone from the associated account.
As of writing this post, there is a maximum limit of adding six identities (users or roles) per associated account.
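Before submitting identities on the console, it can help to sanity-check the ARNs locally. The following is a minimal sketch and not part of HAQM DataZone: the function name group_identities_by_account and the constant MAX_IDENTITIES_PER_ACCOUNT are hypothetical, the pattern only covers the standard aws partition, and the limit of six is the figure stated above at the time of writing.

```python
import re
from collections import defaultdict

# Limit stated in this post at the time of writing; treat it as an assumption.
MAX_IDENTITIES_PER_ACCOUNT = 6

# Matches IAM user and role ARNs in the standard "aws" partition only.
_IAM_ARN = re.compile(r"^arn:aws:iam::(\d{12}):(user|role)/(.+)$")


def group_identities_by_account(arns):
    """Group IAM user/role ARNs by account ID and enforce the per-account limit."""
    grouped = defaultdict(list)
    for arn in arns:
        match = _IAM_ARN.match(arn)
        if match is None:
            raise ValueError(f"Not an IAM user or role ARN: {arn!r}")
        account_id, kind, name = match.groups()
        grouped[account_id].append((kind, name))
    for account_id, identities in grouped.items():
        if len(identities) > MAX_IDENTITIES_PER_ACCOUNT:
            raise ValueError(
                f"Account {account_id} has {len(identities)} identities; "
                f"the limit is {MAX_IDENTITIES_PER_ACCOUNT}"
            )
    return dict(grouped)


identities = group_identities_by_account([
    "arn:aws:iam::123456789101:role/serviceBlueprintRole",
    "arn:aws:iam::123456789101:user/Jacob",
])
print(identities)
```

The check runs entirely offline, so it only catches malformed ARNs and over-long lists. It does not confirm that the identities exist in the associated account.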
Enable the custom AWS service blueprint feature
You can enable custom AWS service blueprints in the domain account or the associated account, according to your requirements. Complete the following steps:
- On the Account associations tab, choose the associated domain.
- Choose the AWS service blueprint.
- Choose Enable.
Create an environment using the custom blueprint
If an associated account is being used to create this environment, use the same associated account IAM identity assigned by the domain owner in the previous step. Your identity needs to be explicitly assigned a user profile in order for you to create this environment. Complete the following steps:
- Choose the custom blueprint.
- In the Created environments section, choose Create environment.
- Select Create and use a new project or use an existing project if you already have one.
- For Environment role, choose a role. For this post, we curated a cross-account role called HAQMDataZoneAdmin and gave it AdministratorAccess. This is the bring your own role feature. You should curate your role according to your requirements. Because we used a more permissive policy for this post, here are some guidelines on how to set up a custom role:
  - You can use AWS Policy Generator to build a policy that fits your requirements and attach it to the custom IAM role you want to use.
  - Make sure the role begins with HAQMDataZone* to follow conventions. This is not mandatory, but recommended. If the IAM admin is using an HAQMDataZoneFullAccess policy, you need to follow this convention because there is a pass role check validation.
  - When you create the CustomRole (AWSDataZone*), make sure it trusts amazonaws.com in its trust policy.
- For Region, choose an AWS Region.
- Choose Create environment.
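The trust policy mentioned in the role guidelines above can be sketched as a small Python helper. This is an illustration under stated assumptions rather than the exact policy used for this post: the service principal datazone.amazonaws.com and the sts:TagSession action are assumptions to verify against the HAQM DataZone documentation, and the helper names are hypothetical.

```python
import json

# Assumption: the HAQM DataZone service principal. The post only says the role
# must trust "amazonaws.com"; verify the exact principal in the documentation.
ASSUMED_SERVICE_PRINCIPAL = "datazone.amazonaws.com"


def build_trust_policy(service_principal=ASSUMED_SERVICE_PRINCIPAL):
    """Return a trust policy document that lets the service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                # sts:TagSession is included on the assumption that the service
                # tags the sessions it creates; drop it if it is not needed.
                "Action": ["sts:AssumeRole", "sts:TagSession"],
            }
        ],
    }


def follows_naming_convention(role_name, prefix="HAQMDataZone"):
    """Check the recommended (not mandatory) role-name prefix."""
    return role_name.startswith(prefix)


policy = build_trust_policy()
print(json.dumps(policy, indent=2))
print(follows_naming_convention("HAQMDataZoneAdmin"))
```

The printed JSON can be pasted into the role's trust relationship on the IAM console once the principal has been confirmed.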
Although you could use the same IAM role for multiple environments in a project, we recommend that you don’t use the same IAM role for environments across projects. Subscription grants are fulfilled at the project level, and therefore we don’t allow the same environment role to be used across different projects.
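If you plan many environments, a quick offline check can flag roles that would end up shared across projects. The sketch below is hypothetical tooling, not an HAQM DataZone feature. The project names, environment names, and role ARN are made up for illustration.

```python
from collections import defaultdict


def roles_shared_across_projects(environments):
    """Return {role_arn: sorted project names} for roles used in more than one project.

    `environments` is an iterable of (project, environment, role_arn) tuples.
    """
    projects_by_role = defaultdict(set)
    for project, _environment, role_arn in environments:
        projects_by_role[role_arn].add(project)
    return {
        role: sorted(projects)
        for role, projects in projects_by_role.items()
        if len(projects) > 1
    }


# Hypothetical plan: one role reused within "sales" (allowed) and in "marketing" (flagged).
planned = [
    ("sales", "s3-env", "arn:aws:iam::123456789101:role/HAQMDataZoneSales"),
    ("sales", "glue-env", "arn:aws:iam::123456789101:role/HAQMDataZoneSales"),
    ("marketing", "s3-env", "arn:aws:iam::123456789101:role/HAQMDataZoneSales"),
]
conflicts = roles_shared_across_projects(planned)
print(conflicts)
```

Reuse within a single project is not reported, which matches the guidance that only cross-project reuse is a problem.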
Configure custom action links
After you create the AWS service environment, you can configure any AWS Management Console links to your environment. HAQM DataZone will assume the custom role to help federate environment users to the configured action links. Complete the following steps:
- In your environment, choose Customize AWS links.
- Configure any S3 buckets, Athena workgroups, AWS Glue jobs, or other custom resources.
- Select Custom AWS links and enter any AWS service console custom resources. For this post, we link to the HAQM Relational Database Service (HAQM RDS) console.
You should now see the console links set up for your environment.
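Action links are ordinary console URLs, so you can prepare them ahead of time. The following sketch builds two such links. The URL patterns are assumptions based on how the AWS Management Console commonly addresses S3 folders and the RDS home page, so open each generated link once to confirm it before adding it to an environment. The bucket name and function names are hypothetical.

```python
from urllib.parse import quote, urlencode


def s3_folder_link(bucket, prefix, region):
    """Build a console URL for a bucket folder (the URL pattern is an assumption)."""
    if prefix and not prefix.endswith("/"):
        prefix += "/"  # folder prefixes conventionally end with a slash
    query = urlencode({"region": region, "prefix": prefix})
    return f"https://s3.console.aws.amazon.com/s3/buckets/{quote(bucket, safe='')}?{query}"


def rds_console_link(region):
    """Build a console URL for the RDS home page (the URL pattern is an assumption)."""
    safe_region = quote(region, safe="")
    return f"https://{safe_region}.console.aws.amazon.com/rds/home?region={safe_region}"


links = {
    "Project files": s3_folder_link("example-project-bucket", "raw/sales", "us-east-1"),
    "RDS console": rds_console_link("us-east-1"),
}
for label, url in links.items():
    print(f"{label}: {url}")
```

Percent-encoding the prefix keeps folder names with spaces or special characters from breaking the link.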
Access resources using a custom role through the HAQM DataZone portal from an associated account
Associated account users who have been added to HAQM DataZone can access the data portal from their associated account directly. Complete the following steps:
- In your environment, in the Summary section, choose the My Environment link.
You should see all your configured resources (role and action links) for your environment.
- Choose any action link to navigate to the appropriate console resources.
- Choose any action link for a custom resource (for this post, HAQM RDS).
You’re directed to the appropriate service console.
With this setup, you have now configured a custom AWS service blueprint that uses your own role for the environment, including for data access. You have also set up action links for configured AWS resources to be shown to data producers and consumers in the HAQM DataZone data portal. With these links, you can federate to those services in a single click and take the project context along while working with the data.
Configure data sources and subscription targets
Additionally, administrators can now configure data sources and subscription targets on the HAQM DataZone console using custom AWS service blueprint environments. This configuration is needed to assign the database role ManagedAccessRole to the data source and subscription target, which you can’t do through the HAQM DataZone portal.
Configure data sources in the custom AWS service blueprint environment for publishing
Complete the following steps to configure your data source:
- On the HAQM DataZone console, navigate to the custom AWS service blueprint environment you just created.
- On the Data sources tab, choose Add.
- Select AWS Glue or HAQM Redshift.
- For AWS Glue, complete the following steps:
- Enter your AWS Glue database. If you don’t already have an existing AWS Glue database setup, refer to Create a database.
- Enter the manageAccessRole role that is added as a Lake Formation admin. Make sure the role provided has aws.internal in its trust policy. The role starts with HAQMDataZone*.
- Choose Add.
- For HAQM Redshift, complete the following steps:
- Select Cluster or Serverless. If you don’t already have a Redshift cluster, refer to Create a sample HAQM Redshift cluster. If you don’t already have an HAQM Redshift Serverless workgroup, refer to HAQM Redshift Serverless to create a sample database.
- Choose Create new AWS Secret or use a preexisting one.
- If you’re creating a new secret, enter a secret name, user name, and password.
- Choose the cluster or workgroup you want to connect to.
- Enter the database and schema names.
- Enter the role ARN for manageAccessRole.
- Choose Add.
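The console steps above ask for a fixed set of values per data source type. As a preflight aid, the following sketch collects those values and reports anything missing before you open the console. The class and function names are hypothetical and do not correspond to an HAQM DataZone API. The role ARN, database, and secret names are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GlueSourceInputs:
    database: str
    manage_access_role_arn: str


@dataclass
class RedshiftSourceInputs:
    database: str
    schema: str
    manage_access_role_arn: str
    secret_name: str
    cluster: Optional[str] = None    # set for a provisioned cluster
    workgroup: Optional[str] = None  # set for Redshift Serverless


def validate(inputs):
    """Return a list of problems found in the inputs; an empty list means ready."""
    problems = []
    for name, value in vars(inputs).items():
        if isinstance(value, str) and not value.strip():
            problems.append(f"{name} is empty")
    if isinstance(inputs, RedshiftSourceInputs):
        # Mirrors the Cluster or Serverless choice: exactly one must be set.
        if (inputs.cluster is None) == (inputs.workgroup is None):
            problems.append("set exactly one of cluster or workgroup")
    if ":role/" not in inputs.manage_access_role_arn:
        problems.append("manage_access_role_arn is not a role ARN")
    return problems


role_arn = "arn:aws:iam::123456789101:role/HAQMDataZoneManageAccess"  # placeholder
glue_problems = validate(GlueSourceInputs("sales_db", role_arn))
redshift_problems = validate(
    RedshiftSourceInputs(database="dev", schema="public",
                         manage_access_role_arn=role_arn, secret_name="example-secret")
)
print(glue_problems)
print(redshift_problems)
```

In this example the Redshift inputs are flagged because neither a cluster nor a workgroup was chosen yet.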
Configure a subscription target in the AWS service environment for subscribing
Complete the following steps to add your subscription target:
- On the HAQM DataZone console, navigate to the custom AWS service blueprint environment you just created.
- On the Subscription targets tab, choose Add.
- Follow the same steps as you did to set up a data source.
- For Redshift subscription targets, you also need to add a database role that will be granted access to the given schema. You can enter a specific Redshift user role or, if you’re a Redshift admin, enter sys:superuser.
- Create a new tag on the environment role (BYOR) with RedshiftDbRoles as key and the database name used for configuring the Redshift subscription target as value.
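The tag described in the last step can also be prepared programmatically. The sketch below only builds the arguments. The helper name is hypothetical, and the role and database names are placeholders. The RoleName and Tags keys follow the shape that the IAM TagRole operation accepts.

```python
def redshift_db_roles_tag(role_name, database_name):
    """Build keyword arguments for tagging the environment role (BYOR)."""
    return {
        "RoleName": role_name,
        # Key and value as described in this post: RedshiftDbRoles -> database name.
        "Tags": [{"Key": "RedshiftDbRoles", "Value": database_name}],
    }


tag_kwargs = redshift_db_roles_tag("HAQMDataZoneAdmin", "dev")
print(tag_kwargs)

# With boto3 installed and credentials configured, the tag could be applied with:
#   import boto3
#   boto3.client("iam").tag_role(**tag_kwargs)
```

Keeping the call itself commented out makes the snippet safe to run without AWS credentials.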
Extend existing data lake and data warehouse blueprints
Finally, if you want to extend existing data lake or data warehouse project environments to use existing AWS services in the platform, complete the following steps:
- Create a copy of the environment role of an existing HAQM DataZone project environment.
- Extend this role by adding additional required policies to allow this custom role to access additional resources.
- Create a custom AWS service environment in the same HAQM DataZone project using this new custom role.
- Configure the subscription target and data source using the database name of the existing HAQM DataZone environment (<env_name>_pub_db, <env_name>_sub_db).
- Use the same managedAccessRole role from the existing HAQM DataZone environment.
- Request subscription to the required data assets or add subscribed assets from the project to this new AWS service environment.
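The database names in the steps above follow the <env_name>_pub_db and <env_name>_sub_db pattern. A small helper, sketched here with a hypothetical function name and a made-up environment name, derives both names so they can be entered consistently for the data source and the subscription target.

```python
def default_environment_databases(env_name):
    """Derive the publish and subscribe database names for an existing environment."""
    if not env_name:
        raise ValueError("env_name must not be empty")
    return {
        "publish": f"{env_name}_pub_db",    # used when configuring the data source
        "subscribe": f"{env_name}_sub_db",  # used when configuring the subscription target
    }


databases = default_environment_databases("sales_lake")
print(databases)
```

Confirm the derived names against the databases that actually exist for your environment before using them.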
Clean up
To clean up your resources, complete the following steps:
- If you used sample code for AWS Glue and Redshift databases, make sure to clean up all those resources to avoid incurring additional charges. Delete any S3 buckets you created as well.
- On the HAQM DataZone console, delete the projects used in this post. This will delete most project-related objects like data assets and environments.
- On the Lake Formation console, delete the Lake Formation admins registered by HAQM DataZone.
- On the Lake Formation console, delete any tables and databases created by HAQM DataZone.
Conclusion
In this post, we discussed how the custom AWS service blueprint simplifies the process to start using existing IAM roles and AWS services in HAQM DataZone for end-to-end governance of your data in AWS. This integration helps you circumvent the prescriptive default data lake and data warehouse blueprints.
To learn more about HAQM DataZone and how to get started, refer to the Getting started guide. Check out the YouTube playlist for some of the latest demos of HAQM DataZone and more information about the capabilities available.
About the Authors
Anish Anturkar is a Software Engineer and Designer and part of HAQM DataZone with an expertise in distributed software solutions. He is passionate about building robust, scalable, and sustainable software solutions for his customers.
Navneet Srivastava is a Principal Specialist and Analytics Strategy Leader, and develops strategic plans for building an end-to-end analytical strategy for large biopharma, healthcare, and life sciences organizations. Navneet is responsible for helping life sciences organizations and healthcare companies deploy data governance and analytical applications, electronic medical records, devices, and AI/ML-based applications, while educating customers about how to build secure, scalable, and cost-effective AWS solutions. His expertise spans across data analytics, data governance, AI, ML, big data, and healthcare-related technologies.
Priya Tiruthani is a Senior Technical Product Manager with HAQM DataZone at AWS. She focuses on improving data discovery and curation required for data analytics. She is passionate about building innovative products to simplify customers’ end-to-end data journey, especially around data governance and analytics. Outside of work, she enjoys being outdoors to hike, capture nature’s beauty, and recently play pickleball.
Subrat Das is a Senior Solutions Architect and part of the Global Healthcare and Life Sciences industry division at AWS. He is passionate about modernizing and architecting complex customer workloads. When he’s not working on technology solutions, he enjoys long hikes and traveling around the world.