AWS Storage Blog
Build a managed Apache Iceberg data lake using Starburst and Amazon S3 Tables
Managing large-scale data analytics across diverse data sources has long been a challenge for enterprises. Data teams often struggle with complex data lake configurations, performance bottlenecks, and the need to maintain consistent data governance while enabling broad access to analytics capabilities.
Today, Starburst announces a powerful solution to these challenges by extending its Apache Iceberg connector to support Amazon S3 Tables. Starburst's Iceberg connector now integrates seamlessly with S3 Tables, giving customers more options for building the data lake architecture of their choice.
In this post, we explore how this integration solves key enterprise data challenges and what it means for your existing Iceberg implementations, and we provide a step-by-step guide to using these new capabilities in Starburst Enterprise.
About Starburst and Iceberg
Apache Iceberg is the foundation of the Starburst Icehouse architecture. In production, Starburst clusters use the Iceberg connector to access data in Iceberg tables. This includes storing Iceberg metadata and data files on Amazon S3 and integrating with a variety of metastores, including Iceberg REST catalogs.
About Starburst’s integration with S3 Tables
Starburst integrates with S3 Tables using the Iceberg REST endpoint offered by SageMaker Lakehouse. This integration allows you to query and modify these Iceberg tables and federate their content with any other data source using connectors. Additionally, S3 Tables offers out-of-the-box table maintenance operations such as compaction, snapshot expiration, and unreferenced file removal.
Now, let’s walk through a simple setup to get you started on S3 Tables using Starburst Enterprise.
Getting started with S3 Tables on Starburst Enterprise
In this post, we connect our Starburst Enterprise cluster to a table bucket using the AWS Glue Iceberg REST endpoint. Once the cluster is connected, we use the Starburst Enterprise query editor to create a schema, which maps to a namespace in S3 Tables. We then use the editor to create a table, load sample TPC Benchmark H (TPC-H) region data, and query it.
Prerequisites
To follow along with this post, you need the following setup:
- A Starburst Enterprise cluster.
- An S3 table bucket with the AWS analytics services integration enabled.
Step 1: Set up the table bucket and related permissions on AWS
From the S3 console, note the table bucket ARN for the bucket you previously created. The ARN looks like:
arn:aws:s3tables:{REGION}:{ACCOUNT_ID}:bucket/{S3_BUCKET_NAME}
Now, you create an IAM role for Starburst to access S3 Tables and grant this role the necessary permissions in AWS Lake Formation to access the tables. For a detailed step-by-step walkthrough of this process, please refer to the documentation.
With this step, we’re now ready to connect our table bucket to Starburst.
Step 2: Create a new Iceberg REST Catalog connection in Starburst Enterprise
Next, open the Starburst Enterprise query editor and prepare to create a new catalog.
This catalog uses the Iceberg REST catalog to access tables in the table bucket. To do this, we set the catalog type to REST with iceberg.catalog.type=rest. This setting configures the catalog to use the Iceberg REST catalog, which provides a standard protocol for metadata management and allows new query engines to support any catalog with a single implementation.
The Iceberg connector retrieves the metadata location from the SageMaker Lakehouse Iceberg REST endpoint, and then accesses table storage to read or write files.
The example below uses static AWS credentials with an access key and secret key. Additionally, Starburst is working on adding support for credential vending and IAM role-based authentication for improved security.
The ${asm:path} syntax retrieves secrets from AWS Secrets Manager. You can find a list of all supported secrets managers in the Starburst Enterprise documentation.
- In Starburst Enterprise, open the query editor.
- Insert the following code block.
CREATE CATALOG s3_tables_demo USING iceberg
WITH (
"iceberg.catalog.type" = 'rest',
"iceberg.rest-catalog.uri" = 'http://glue.${asm:region}.amazonaws.com/iceberg',
"iceberg.rest-catalog.warehouse" = '${asm:account_id}:s3tablescatalog/${asm:s3_bucket_name}',
"iceberg.rest-catalog.view-endpoints-enabled" = 'false',
"iceberg.rest-catalog.sigv4-enabled" = 'true',
"iceberg.rest-catalog.signing-name" = 'glue',
"fs.hadoop.enabled" = 'false',
"fs.native-s3.enabled" = 'true',
"s3.region" = '${asm:region}',
"s3.aws-access-key" = '${asm:access_key}',
"s3.aws-secret-key" = '${asm:secret_key}'
);
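Once the catalog is created, you can verify the connection by listing its schemas from the query editor. The following is a minimal check, assuming the catalog name s3_tables_demo from the example above:
SHOW SCHEMAS FROM s3_tables_demo;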
Step 3: Define a schema
In Starburst, a schema is equivalent to a namespace in table buckets. For this post, we create a schema called ‘example’ from the query editor.
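As a minimal example, the following statement creates the schema, assuming the s3_tables_demo catalog from Step 2. Creating the schema also creates the corresponding namespace in the table bucket:
CREATE SCHEMA s3_tables_demo.example;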
Step 4: Create a new table using TPC-H region data
Now, we create a table using sample TPC-H region data.
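The following sketch creates and populates a region table from the TPC-H sample data. It assumes a TPC-H catalog named tpch is available on the cluster and reuses the catalog and schema from the previous steps:
CREATE TABLE s3_tables_demo.example.region AS
SELECT * FROM tpch.tiny.region;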
Step 5: Read data from S3 Tables
Once the data is loaded, you can query it from within the editor.
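For example, the following query reads back the rows loaded in the previous step, assuming the table created above:
SELECT regionkey, name, comment
FROM s3_tables_demo.example.region
ORDER BY regionkey;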
Note: Starburst’s integration with S3 Tables supports Iceberg connector features like time travel, schema evolution, and more. However, the maintenance operations offered by S3 Tables out of the box are not supported through this integration at the time of publishing this post. You can configure these operations directly on table buckets.
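As a quick illustration of these Iceberg connector features, the following sketch inspects the table's snapshots through the Iceberg metadata table and then runs a time travel query. The timestamp is hypothetical; substitute a point in time after your data was loaded:
SELECT snapshot_id, committed_at
FROM s3_tables_demo.example."region$snapshots";

SELECT *
FROM s3_tables_demo.example.region
FOR TIMESTAMP AS OF TIMESTAMP '2025-01-01 00:00:00 UTC';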
Conclusion
In this post, we showcased how to build a data lake using Starburst and Amazon S3. The integration builds on Starburst's long-standing focus on Apache Iceberg and on the automated table maintenance provided by Amazon S3 Tables. Starburst is an AWS Data and Analytics and Financial Services Competency Partner and is available via AWS Marketplace.