AWS Storage Blog
Streamlining access to tabular datasets stored in HAQM S3 Tables with DuckDB
As businesses continue to rely on data-driven decision-making, there’s an increasing demand for tools that streamline and accelerate the process of data analysis. Efficiency and simplicity in application architecture can serve as a competitive edge when driving high-stakes decisions. Developers are seeking lightweight, flexible tools that seamlessly integrate with their existing application stack, specifically solutions for quickly and efficiently querying datasets stored in data lakes.
In this evolving landscape, DuckDB stands out as a practical solution for developers who want to analyze data quickly without complex installations. DuckDB needs no separate server or software to install, update, and maintain, because it runs fully embedded within the host process rather than as a separate service. Its lightweight nature allows it to be easily integrated into Python, Java, and other environments, making it ideal for interactive data analysis. DuckDB recently added preview support for reading tables stored in HAQM S3 Tables using Apache Iceberg REST APIs. S3 Tables are purpose-built for tabular storage, offering a direct, performant, and cost-effective way to store and query your data in Apache Iceberg format. S3 Tables continuously optimize storage to maximize query performance and minimize costs, making them an excellent choice for businesses looking to streamline their data lake operations without additional infrastructure setup. This DuckDB integration with S3 Tables gives developers the ability to query data stored in HAQM S3, allowing businesses to centralize their data lake and access tables without complex data movement or additional infrastructure.
In this post, we demonstrate how to use DuckDB to read Iceberg tables stored in S3 table buckets using the HAQM SageMaker Lakehouse Iceberg REST endpoint and AWS Lake Formation permissions. The same data consumption workflow applies whether you access DuckDB from a small, resource-constrained edge device or from a large server with multiple terabytes of memory and more than 100 CPU cores. DuckDB, in conjunction with S3 Tables, provides a strong foundation for users to start building scalable data lakes on AWS.
Solution overview
This post demonstrates how to use the DuckDB command line interface (CLI) in AWS CloudShell to read tables from an S3 table bucket; the same steps apply if you run DuckDB on your local machine. Although we use the command-line interface here, DuckDB supports multiple programming languages and client APIs. For a complete list of supported languages and integration options, visit the DuckDB official documentation. Moreover, this post demonstrates a common pattern where developers and data scientists use local environments to test and iterate on their queries before deploying them at scale. We begin by installing DuckDB on the client and configuring Lake Formation permissions to access tables stored in S3 table buckets, as shown in the following figure. Then, we attach the catalog to the DuckDB instance on the client and test queries on the tables.
Prerequisites
To follow along, you need an AWS account with access to the following AWS services:
- HAQM S3 Tables
- AWS Identity and Access Management (IAM)
- HAQM SageMaker Lakehouse
- HAQM Athena
- AWS Lake Formation
Walkthrough
To implement this solution, we need to create tables in an S3 table bucket, configure permissions to read those tables, and set up a DuckDB instance on the client. We cover each of these steps in the following sections:
- Part A: S3 Tables setup
- Part B: Permissions setup (optional)
- Part C: DuckDB setup
Part A: S3 Tables setup
DuckDB currently supports S3 Tables in read-only mode. This section provides instructions on how to create a table bucket, set up a table, and add data, preparing the data for analysis with DuckDB. If you already have a table with data and have set up the integration with AWS analytics services, then skip this section and move to the next one.
1. Create S3 table bucket and set up integration with AWS analytics services
1.1. Navigate to S3 in the AWS Management Console. From the left navigation menu, choose Table buckets. Choose Create table bucket. Specify a name for your table bucket and choose Enable integration if the integration with AWS analytics services isn’t already enabled, as shown in the following figure. This integration allows users to discover all the tables created in this AWS Region and this account in SageMaker Lakehouse and Lake Formation, and access them through AWS analytics services, such as HAQM Data Firehose, Athena, HAQM Redshift, and HAQM EMR. To learn more about this integration, refer to Using HAQM S3 with analytics services.
1.2. The table bucket, mytraveldata, is now created. Copy the HAQM Resource Name (ARN) for the table bucket by choosing the copy icon; you need it later during DuckDB setup.
2. Create a new table in the table bucket
2.1. In the S3 console, go to the bucket and choose Create table with Athena.
2.2. Create namespace and table within the table bucket
Add a Namespace name in the Create table workflow.
Proceed to Create table with Athena.
A database with the name nyc is created in SageMaker Lakehouse.
2.3. In Athena query editor, create a table using the following SQL command:
CREATE TABLE nyc.nyc_trips (
    trip_id INT,
    pickup_datetime TIMESTAMP,
    dropoff_datetime TIMESTAMP,
    passenger_count INT,
    trip_distance DOUBLE,
    fare_amount DOUBLE,
    payment_type STRING,
    pickup_zone STRING
) TBLPROPERTIES ( 'table_type' = 'ICEBERG' );
3. Insert data into S3 Tables
In this step, we insert sample data on New York City taxi trips into the newly created table in the table bucket using the following SQL command:
INSERT INTO nyc.nyc_trips (trip_id, pickup_datetime, dropoff_datetime, passenger_count, trip_distance, fare_amount, payment_type, pickup_zone)
VALUES
(1, TIMESTAMP '2023-03-01 08:30:00', TIMESTAMP '2023-03-01 09:15:00', 2, 5.8, 22.50, 'Credit Card', 'Midtown'),
(2, TIMESTAMP '2023-03-01 12:15:00', TIMESTAMP '2023-03-01 12:45:00', 1, 3.2, 15.00, 'Cash', 'Upper East Side'),
(3, TIMESTAMP '2023-03-01 17:00:00', TIMESTAMP '2023-03-01 17:45:00', 3, 7.1, 28.75, 'Credit Card', 'Financial District'),
(4, TIMESTAMP '2023-03-02 09:45:00', TIMESTAMP '2023-03-02 10:30:00', 1, 4.5, 18.25, 'Credit Card', 'Chelsea'),
(5, TIMESTAMP '2023-03-02 14:30:00', TIMESTAMP '2023-03-02 15:15:00', 2, 6.3, 24.50, 'Cash', 'Greenwich Village'),
(6, TIMESTAMP '2023-03-02 20:00:00', TIMESTAMP '2023-03-02 20:30:00', 4, 2.8, 14.75, 'Credit Card', 'Times Square'),
(7, TIMESTAMP '2023-03-03 07:15:00', TIMESTAMP '2023-03-03 08:00:00', 1, 8.2, 32.00, 'Credit Card', 'Brooklyn Heights'),
(8, TIMESTAMP '2023-03-03 11:30:00', TIMESTAMP '2023-03-03 12:15:00', 2, 5.5, 21.75, 'Cash', 'East Village'),
(9, TIMESTAMP '2023-03-03 16:45:00', TIMESTAMP '2023-03-03 17:30:00', 3, 6.7, 26.25, 'Credit Card', 'SoHo'),
(10, TIMESTAMP '2023-03-03 22:00:00', TIMESTAMP '2023-03-03 22:45:00', 1, 4.9, 19.50, 'Cash', 'Upper West Side');
After you run the SQL command, the data is ingested into the nyc_trips table.
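To confirm the ingestion, you can run a quick validation query in the Athena query editor; any aggregate over the table works, for example:
SELECT COUNT(*) AS trip_count
FROM nyc.nyc_trips;
This should return a count of 10, matching the rows inserted above.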
Part B: Permissions setup (optional)
If you aren’t the user who performed the table buckets integration with AWS analytics services for your account, then you must be granted the necessary Lake Formation permissions on the table. For more information, see Grant permission on a table or database. For the purpose of this post, we assume that the user accessing the table through the DuckDB instance is the same user who performed the integration with AWS analytics services on the console.
Part C: DuckDB setup
1. Install DuckDB
Visit the DuckDB Installation page to get the latest DuckDB command line client (v1.2.1 at the time of writing). Copy the curl command mentioned in the Installation section.
Navigate to the CloudShell environment on the AWS Management Console.
Run the following curl command.
curl http://install.duckdb.org | sh
The following image shows an example of the execution of the curl command.
2. Run DuckDB in your CloudShell environment
To launch the DuckDB instance, run the following command in the CloudShell terminal:
/home/cloudshell-user/.duckdb/cli/latest/duckdb
3. Load Iceberg extension
DuckDB support for S3 Tables is currently part of the nightly extension builds. To install the latest version of the Iceberg extension, run the following command:
FORCE INSTALL iceberg FROM core_nightly;
At this stage, the DuckDB instance setup is complete.
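To confirm that the Iceberg extension is available, you can optionally query DuckDB's extension metadata:
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'iceberg';
The installed and loaded columns indicate whether the extension has been installed and whether it is loaded into the current session.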
4. Register AWS credentials
Create a secret containing your AWS credentials. When using CloudShell, these are your AWS console credentials. DuckDB uses this secret to access your table data. See the DuckDB documentation for more configuration options.
CREATE SECRET ( TYPE S3, PROVIDER credential_chain );
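The credential_chain provider picks up credentials automatically from the environment, which is what you want in CloudShell. If you run DuckDB elsewhere without a configured credential chain, a minimal sketch of a secret with explicit keys looks like the following; the secret name, key ID, secret access key, and Region are placeholders to replace with your own values:
CREATE SECRET my_aws_secret (
    TYPE S3,
    KEY_ID 'AKIA...',      -- placeholder access key ID
    SECRET '...',          -- placeholder secret access key
    REGION 'us-east-1'     -- Region of your table bucket
);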
5. Attach the catalog as a database
In this step, we attach the table bucket with DuckDB as a database instance.
ATTACH '<account-id>:s3tablescatalog/mytraveldata' AS duck_db (TYPE ICEBERG, ENDPOINT_TYPE 'GLUE');
6. Access data stored in the table bucket
Validate that DuckDB can find tables in the table bucket using the following command.
SHOW ALL TABLES;
This shows that our mytraveldata table bucket contains a single namespace, nyc, with a single table called nyc_trips. The following are a few sample queries to analyze the data stored in the table.
6.1. View all the data stored in the table.
SELECT * FROM duck_db.nyc.nyc_trips;
6.2. Get the total fare amount by payment type.
SELECT payment_type, SUM(fare_amount) AS total_fare
FROM duck_db.nyc.nyc_trips
GROUP BY payment_type;
6.3. Find the total fare amount for trips made on March 1, 2023.
SELECT SUM(fare_amount) AS total_fare
FROM duck_db.nyc.nyc_trips
WHERE pickup_datetime BETWEEN '2023-03-01 00:00:00' AND '2023-03-01 23:59:59';
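6.4. As an additional example beyond the queries above, find the average trip distance and fare amount by pickup zone.
SELECT pickup_zone,
       AVG(trip_distance) AS avg_distance,
       AVG(fare_amount) AS avg_fare,
       COUNT(*) AS trips
FROM duck_db.nyc.nyc_trips
GROUP BY pickup_zone
ORDER BY avg_fare DESC;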
In this post, we used the HAQM SageMaker Lakehouse Iceberg REST endpoint to connect DuckDB to S3 Tables. Alternatively, you can connect DuckDB directly to the S3 Tables Iceberg REST Catalog APIs. For this, use the following ATTACH command.
ATTACH '<table-bucket-arn>' AS duck_db (TYPE ICEBERG, ENDPOINT_TYPE 'S3_Tables');
If you need basic read access to tabular data in a single S3 table bucket, use the S3 Tables Iceberg REST Catalog APIs. If you want unified data management across all of your tabular data, data governance, and fine-grained access controls, use S3 Tables with SageMaker Lakehouse.
Cleaning up
To avoid ongoing charges, clean up the resources created in this post:
- Delete the nyc_trips table and the nyc namespace, for example by running DROP TABLE nyc.nyc_trips in the Athena query editor and then removing the namespace.
- Delete the mytraveldata table bucket. You can do this from the S3 console or the AWS CLI.
Conclusion
In this post, we demonstrated how to use DuckDB to read Iceberg tables stored in HAQM S3 Tables with the HAQM SageMaker Lakehouse Iceberg REST endpoint and AWS Lake Formation permissions. We walked through a solution that involved creating a table in an S3 table bucket, populating it with data, and then accessing the table using a DuckDB client. Whether in a small or large-scale environment, combining DuckDB with HAQM S3 Tables allows businesses to interact with the tables in their data lake without any additional infrastructure.