
HAQM SageMaker Lakehouse FAQs

General


HAQM SageMaker Lakehouse unifies your data across HAQM S3 data lakes and HAQM Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data with all Apache Iceberg–compatible tools and engines. You can also connect to federated data sources such as HAQM DynamoDB, Google BigQuery, and Snowflake and query your data in-place. Bring data from operational databases and applications into your lakehouse in near real time through zero-ETL integrations. Secure your data with integrated fine-grained access controls, which are enforced across all analytics and ML tools and engines. With HAQM SageMaker Lakehouse, you can build an open lakehouse on your existing data investments, without changing your data architecture. 

SageMaker Lakehouse provides three primary benefits:

a) Unified data access: SageMaker Lakehouse reduces data silos by providing unified access to your data across HAQM S3 data lakes and HAQM Redshift data warehouses. You can also connect to federated data sources such as HAQM DynamoDB, Google BigQuery, and Snowflake. In addition, data from operational databases and applications can be ingested into your lakehouse in near real time via zero-ETL integrations.

b) Open source compatibility: SageMaker Lakehouse gives you the flexibility to access and query all your data in-place, from a wide range of AWS services and open source and third-party tools and engines compatible with Apache Iceberg. You can use analytic tools and engines of your choice such as SQL, Apache Spark, business intelligence (BI), and AI/ML tools, and collaborate with a single copy of data stored across HAQM S3 or HAQM Redshift.

c) Secure data access: SageMaker Lakehouse provides integrated fine-grained access control to your data. This means that you can define permissions and consistently apply them across all analytics and ML tools and engines, regardless of the underlying storage formats or query engines used.

Directly accessible from HAQM SageMaker Unified Studio, SageMaker Lakehouse is an open lakehouse architecture that unifies data across your data estate. Data from different sources is organized in logical containers called catalogs in SageMaker Lakehouse. Each catalog represents sources such as HAQM Redshift data warehouses, S3 data lakes, or databases. You can also create new catalogs to store data in HAQM S3 or Redshift Managed Storage (RMS). Data in SageMaker Lakehouse can be accessed from Apache Iceberg–compatible engines such as Apache Spark, Athena, or HAQM EMR, and you can also connect to and analyze data in your lakehouse using SQL tools. Data is secured by defining fine-grained access controls, which are enforced across the tools and engines that access the data.

Capabilities


SageMaker Lakehouse unifies access control to your data with two capabilities: 1) SageMaker Lakehouse allows you to define fine-grained permissions. These permissions get enforced by query engines such as HAQM EMR, Athena, and HAQM Redshift. 2) SageMaker Lakehouse allows you to get in-place access to your data, removing the need for making data copies. You can maintain a single copy of data and a single set of access control policies to benefit from unified fine-grained access control in SageMaker Lakehouse.

SageMaker Lakehouse is built on multiple technical catalogs across AWS Glue Data Catalog, Lake Formation, and HAQM Redshift to provide unified data access across data lakes and data warehouses. SageMaker Lakehouse uses AWS Glue Data Catalog and Lake Formation to store table definitions and permissions. Lake Formation fine-grained permissions are available to tables defined in SageMaker Lakehouse. You can manage your table definitions in AWS Glue Data Catalog and define fine-grained permissions, such as table-level, column-level, and cell-level permissions, to secure your data. In addition, using the cross-account data-sharing capabilities, you can enable zero-copy data sharing to make data available for secure collaboration.
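To make the column-level permission model concrete, here is a minimal sketch of a Lake Formation GrantPermissions request. All names (role ARN, database, table, columns) are placeholders, and the live API call is shown only in a comment:

```python
def column_level_grant(principal_arn, database, table, columns):
    """Build an AWS Lake Formation GrantPermissions request that allows
    SELECT on specific columns only (column-level access control).
    All identifiers passed in are illustrative placeholders."""
    return {
        "Principal": {"DataLakePrincipalArn": principal_arn},
        "Resource": {
            # TableWithColumns restricts the grant to the listed columns.
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }

grant = column_level_grant(
    principal_arn="arn:aws:iam::111122223333:role/AnalystRole",
    database="sales_db",
    table="orders",
    columns=["order_id", "order_date", "total"],
)

# In a live account, the request would be sent with the AWS SDK for Python:
# import boto3
# boto3.client("lakeformation").grant_permissions(**grant)
```

Because the grant lives in Lake Formation rather than in any single engine, the same policy is enforced whether the table is queried from Athena, HAQM EMR, or HAQM Redshift.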

Yes. The open source Apache Iceberg client library is required to access SageMaker Lakehouse. Customers using third-party or self-managed open source engines such as Apache Spark or Trino need to include the Apache Iceberg client library in their query engines to access SageMaker Lakehouse.

Yes, using an Apache Iceberg client library, you can read and write data to your existing HAQM Redshift data warehouse from Apache Spark engines on AWS services such as HAQM EMR, AWS Glue, Athena, and HAQM SageMaker, or from third-party Apache Spark deployments. However, you must have the appropriate write permissions on the tables to write data to them.
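A hedged sketch of the Spark configuration this involves, assuming the Apache Iceberg runtime and AWS bundle jars are on the classpath; the catalog name, warehouse path, and table names below are illustrative, and the session requires live AWS resources to actually run:

```python
from pyspark.sql import SparkSession

# Catalog name "lakehouse", the S3 warehouse path, and all table names
# are placeholders -- substitute your own.
spark = (
    SparkSession.builder
    .appName("lakehouse-example")
    # Enable Iceberg SQL extensions (Iceberg DDL, MERGE INTO, etc.).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register an Iceberg catalog backed by AWS Glue Data Catalog.
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://example-bucket/warehouse/")
    .getOrCreate()
)

# Read from a table exposed through the lakehouse catalog, then write back
# to it (the writing principal needs write permissions on the target table).
df = spark.table("lakehouse.sales_db.orders")
df.filter("total > 100").writeTo("lakehouse.sales_db.large_orders").createOrReplace()
```

This is a configuration sketch rather than a definitive setup; the exact jars and properties depend on your Spark and Iceberg versions.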

Yes, you can join your data lake tables on HAQM S3 with the tables in your HAQM Redshift data warehouse across multiple databases using an engine of your choice, such as Apache Spark.
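As an illustration, once both sources are registered as Iceberg-compatible catalogs, a cross-source join is plain Spark SQL. All catalog, schema, and table names below are hypothetical, and a configured SparkSession with live AWS access is assumed:

```python
# Assumes a SparkSession `spark` already configured with two catalogs:
# `s3lake` (an S3 data lake) and `rms` (Redshift Managed Storage).
# Every identifier below is a placeholder.
joined = spark.sql("""
    SELECT c.customer_id, c.segment, SUM(o.total) AS lifetime_value
    FROM   s3lake.web.customers AS c
    JOIN   rms.sales.orders     AS o
           ON o.customer_id = c.customer_id
    GROUP  BY c.customer_id, c.segment
""")
joined.show()
```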

HAQM S3 Tables now seamlessly integrate with SageMaker Lakehouse, making it easy to query and join S3 Tables with data in S3 data lakes, HAQM Redshift data warehouses, and third-party data sources. SageMaker Lakehouse provides the flexibility to access and query data in-place across S3 Tables, S3 buckets, and Redshift warehouses using the Apache Iceberg open standard. You can secure and centrally manage your data in the lakehouse by defining fine-grained permissions that are consistently applied across all analytics and ML tools and engines.

Zero-ETL integrations


SageMaker Lakehouse supports zero-ETL integrations with HAQM DynamoDB, HAQM Aurora, and HAQM RDS for MySQL, as well as eight applications: Zoho CRM, Salesforce, Salesforce Pardot, ServiceNow, Facebook Ads, Instagram Ads, Zendesk, and SAP.

You can configure and monitor your zero-ETL integrations through the AWS Glue console within HAQM SageMaker Data Processing with AWS Glue. Once the data is ingested, you can access and query the data from Apache Iceberg–compatible query engines. For more details, visit Zero-ETL integrations.

To learn more about pricing, visit the SageMaker Lakehouse and AWS Glue pricing pages.

Pricing


Visit SageMaker Lakehouse pricing for details.

Availability


SageMaker Lakehouse is available in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Hong Kong), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Stockholm), and South America (São Paulo).

Yes. SageMaker Lakehouse stores metadata in AWS Glue Data Catalog and offers the same SLA as AWS Glue.

Getting started


SageMaker Lakehouse is accessible from HAQM SageMaker Unified Studio. From SageMaker Unified Studio, you can create a new project or select an existing one. From your project, click Data in the left navigation to view the Data explorer panel, which gives you a view of the data you have access to in SageMaker Lakehouse. To help you get started, a default S3 managed catalog is automatically created with your project, where you can add new data files to your lakehouse. In addition, from the Data explorer panel, you can click (+) Add Data to continue building out your lakehouse by creating additional managed catalogs in Redshift Managed Storage, connecting to federated data sources, or uploading data to your managed catalogs.

If you have existing databases and catalogs, you can add them to the lakehouse by granting permissions to your project role using AWS Lake Formation. For example, you can bring your HAQM Redshift data warehouse to SageMaker Lakehouse by registering the Redshift cluster or serverless namespace with Glue Data Catalog. You can then accept the cluster or namespace invitation and grant the appropriate permissions in Lake Formation to make it available for access.

No, you don't have to migrate your data to use SageMaker Lakehouse. SageMaker Lakehouse allows you to access and query your data in-place, with the open standard of Apache Iceberg. You can directly access your data in HAQM S3 data lakes, S3 Tables, and HAQM Redshift data warehouses. You can also connect to federated data sources such as Snowflake and Google BigQuery data warehouses, as well as operational databases like PostgreSQL and SQL Server. Data from operational databases and third-party applications can be brought into managed catalogs in the lakehouse in near real-time through zero-ETL integrations, without having to maintain infrastructure or complex pipelines. In addition to these, you can use hundreds of AWS Glue connectors to integrate with your existing data sources. 

To bring your HAQM Redshift data warehouse to SageMaker Lakehouse, go to the Redshift management console and register the Redshift cluster or serverless namespace with Glue Data Catalog via the Action drop-down menu. You can then go to Lake Formation, accept the cluster or namespace invitation to create a federated catalog, and grant the appropriate permissions to make it available for access in SageMaker Lakehouse. Instructions are available in the documentation. These tasks can also be performed using the AWS Command Line Interface (AWS CLI) or APIs/SDKs.

To bring your S3 data lake to SageMaker Lakehouse, you must first catalog it in AWS Glue Data Catalog by following the instructions in the documentation. Once you have cataloged your HAQM S3 data lake using AWS Glue Data Catalog, your data is available for access in SageMaker Lakehouse. In AWS Lake Formation, you can grant permissions to a Unified Studio project role to make the S3 data lake available for use in SageMaker Unified Studio.
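To make the cataloging step concrete, one common approach is an AWS Glue crawler that scans an S3 path and registers the resulting tables in a Data Catalog database. The sketch below only builds the crawler definition; all names, the IAM role ARN, and the S3 path are placeholders, and the live call is shown in a comment:

```python
def s3_crawler_request(name, role_arn, database, s3_path):
    """Build an AWS Glue crawler definition that catalogs an S3 path
    into a Glue Data Catalog database. All arguments are placeholders."""
    return {
        "Name": name,
        "Role": role_arn,                 # IAM role the crawler assumes
        "DatabaseName": database,         # target Data Catalog database
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

request = s3_crawler_request(
    name="sales-lake-crawler",
    role_arn="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)

# In a live account, create and run the crawler with boto3:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**request)
# glue.start_crawler(Name=request["Name"])
```

Once the crawler has populated the database, the tables appear in SageMaker Lakehouse and can be secured through Lake Formation grants.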

HAQM SageMaker Lakehouse unifies access to all your data across HAQM S3 data lakes, HAQM Redshift data warehouses, and third-party data sources. HAQM S3 Tables delivers the first cloud object store with built-in Apache Iceberg support. HAQM SageMaker Lakehouse integrates with HAQM S3 Tables so you can access S3 Tables from AWS analytics services such as HAQM Redshift, HAQM Athena, HAQM EMR, and AWS Glue, or from Apache Iceberg–compatible engines such as Apache Spark or PyIceberg. SageMaker Lakehouse also enables centralized management of fine-grained data access permissions for S3 Tables and other data, and consistently applies them across all engines.


To get started, navigate to the HAQM S3 console and enable the integration of your S3 table bucket with AWS analytics services. Once the integration is enabled, navigate to AWS Lake Formation and grant permissions on your S3 table bucket to your SageMaker Unified Studio project role. You can then use the integrated analytics services in SageMaker Unified Studio to query and analyze the data in S3 Tables. You can even join data from HAQM S3 Tables with other sources, such as HAQM Redshift data warehouses and third-party and federated data sources (HAQM DynamoDB, Snowflake, or PostgreSQL).
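One way to attach an S3 table bucket as an Iceberg catalog in Spark is sketched below. This assumes the Apache Iceberg runtime and the S3 Tables catalog library are on the classpath; the catalog name, region, account ID, table-bucket ARN, and table names are all placeholders, and running it requires a live AWS account:

```python
from pyspark.sql import SparkSession

# All identifiers below (catalog name "s3tables", the table-bucket ARN,
# namespace, and table) are illustrative placeholders.
spark = (
    SparkSession.builder
    .appName("s3-tables-example")
    .config("spark.sql.catalog.s3tables", "org.apache.iceberg.spark.SparkCatalog")
    # Iceberg catalog implementation provided by the S3 Tables catalog library.
    .config("spark.sql.catalog.s3tables.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    # Point the catalog at a specific S3 table bucket.
    .config("spark.sql.catalog.s3tables.warehouse",
            "arn:aws:s3tables:us-east-1:111122223333:bucket/example-table-bucket")
    .getOrCreate()
)

# Query a table stored in the S3 table bucket.
spark.sql("SELECT COUNT(*) FROM s3tables.analytics.page_views").show()
```

This is a configuration sketch, not a definitive setup; in SageMaker Unified Studio the integrated engines handle this wiring for you once Lake Formation permissions are in place.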

SageMaker Lakehouse is directly accessible from HAQM SageMaker Unified Studio. SageMaker Unified Studio provides an integrated experience to access all your data from SageMaker Lakehouse and put it to work using familiar AWS tools for model development, generative AI, data processing, and SQL analytics. To get started, you can log into your SageMaker domain using your corporate credentials on SageMaker Unified Studio. In a few short steps in SageMaker Unified Studio, administrators can create projects by choosing a specific project profile. You can then choose a project to work with data in SageMaker Lakehouse. Once a project is selected, you get a unified view of the data in your lakehouse in the Data explorer panel, and access your query engines and developer tools in one place.

SageMaker Lakehouse also gives you the flexibility to access and query your data with all Apache Iceberg–compatible tools and engines. You can use analytics tools and engines of your choice, such as SQL, Apache Spark, business intelligence (BI), and AI/ML tools, and collaborate with data stored across SageMaker Lakehouse.

Yes. SageMaker Lakehouse gives you the flexibility to access and query your data with all Apache Iceberg–compatible tools and engines. You can use analytics tools and engines of your choice, such as SQL, Apache Spark, business intelligence (BI), and AI/ML tools, and collaborate with data stored in SageMaker Lakehouse.