AWS Storage Blog

Adapting to change with data patterns on AWS: The “aggregate” cloud data pattern

As part of my re:Invent 2024 Innovation talk, I shared three data patterns that many of our largest AWS customers have adopted. This article focuses on the “Aggregate” cloud data pattern, which is the most commonly adopted across AWS customers. You can also watch this six-minute video clip on the Aggregate data pattern for a quick summary.

We started to see the first data lakes, which typically use the Aggregate data pattern, emerge on HAQM S3 about five years after HAQM S3 launched in March 2006. As Don MacAskill, CEO and Co-founder of SmugMug, shared, HAQM S3 was immediately adopted by customers like SmugMug to store rapidly growing unstructured data such as images.

But developers wanted to take advantage of the security, availability, durability, scalability, and low cost of S3 for other business uses, and that led to the integration of HAQM S3 into the Hadoop ecosystem for business analytics. Developers who wanted to use HAQM S3 instead of HDFS as their data store depended on the open-source S3A connector, which is part of the Apache Hadoop ecosystem, or on solutions like HAQM EMR (formerly HAQM Elastic MapReduce) with its built-in S3 integration, so that Hadoop could directly read and write HAQM S3 objects. If you are interested in how customers with large-scale data lakes thought of “Aggregate” over ten years ago, you can read this Netflix blog from January 2013 that talks about how they built a Hadoop-based system on HAQM S3 – a super interesting glimpse into the evolution of data lakes at scale.
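To make that integration concrete, here is a minimal sketch (not from the Netflix post) of how a Spark job might read and write HAQM S3 objects through the S3A connector. The bucket name, paths, and column are hypothetical, and the job assumes the hadoop-aws module and AWS SDK are on the classpath; on HAQM EMR the S3 integration is built in.

```python
from pyspark.sql import SparkSession

# Minimal sketch: Spark reading and writing S3 objects via the s3a:// scheme.
spark = (
    SparkSession.builder
    .appName("aggregate-pattern-example")
    # Use the standard AWS credentials chain (environment, instance profile, etc.).
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .getOrCreate()
)

# Read raw log data directly from S3, aggregate it, and write the results back to S3.
logs = spark.read.json("s3a://my-data-lake/raw/logs/")
daily_counts = logs.groupBy("date").count()
daily_counts.write.mode("overwrite").parquet("s3a://my-data-lake/curated/daily_counts/")
```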

If you fast-forward to today, what Netflix said in 2013 (“store all of our data on HAQM’s Storage Service (S3), which is the core principle on which our architecture is based”) is still the core of the Aggregate data pattern, which has since expanded beyond data lakes. Companies that use Aggregate send data from many different sources (like sensor feeds, data processing pipelines, applications that track consumer patterns, data streams, log feeds, databases, and data warehouses) into HAQM S3 to store and use with any application, across compute types, application architectures, and use cases. Because so many customers have adopted Aggregate as a data pattern, HAQM S3 is increasingly used as a generalized application storage layer, either as-is or with additional infrastructure that optimizes S3 for integration, much as S3A did for open-source Hadoop years ago.

More than a million data lakes run on AWS these days, but S3 holds much more than data lake storage. Today HAQM S3 holds more than 400 trillion objects and exabytes of data, and it averages over 150 million requests per second. Ten years ago, just under 100 HAQM S3 customers were storing 1 petabyte or more of data; now thousands of customers store over a petabyte, and some manage more than an exabyte. For context, a petabyte is about a thousand terabytes, and an exabyte is about a million terabytes.

The Aggregate data pattern is super common among AWS customers for a few reasons. First, it lets application developers across organizations take advantage of the volume and diversity of data in a company. This is very different from the old on-premises world, where data sets tended to be locked away in vertically integrated applications. By aggregating data in HAQM S3, application owners and other members of the team (such as data scientists or AI researchers) have access to a wide variety of raw and processed data sets to use for experiments and application innovation. The simple act of bringing data together in one place can significantly change the speed of the business.

Second, the Aggregate data pattern uses a federated ownership model, which many companies like because it decentralizes data ownership and fits the culture of their organization. Different organizations own the delivery of data into HAQM S3, and each organization also owns its own usage of the data sets. With HAQM S3 as the foundation of Aggregate, this data pattern gives different organizations the most flexibility in how they use data.

And third, the Aggregate data pattern offers the broadest choice of tools, because no matter which ISV product or native AWS service you choose, you can generally expect an integration with HAQM S3 for data storage.

The key to success with the Aggregate data pattern is standardization on the building blocks of your data infrastructure. The underlying storage of HAQM S3 is one form of standardization, but many customers apply other standards as well. That is because aggregated data sets often grow very quickly, and you want some consistency across your organization in how the data is handled. While federated ownership optimizes for flexibility, AWS customers also want to make sure that teams use the data in the right way and do not duplicate work in data processing, governance, and other data workflows across the organization. For example, Roche, a pioneer in healthcare, uses HAQM S3 to store various data types. Roche standardizes data in HAQM S3 by running it through a single ETL pipeline that enforces consistent and accurate results across diverse document types, which helps users such as analysts and business users get to the right data for the task at hand faster.
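Roche’s actual pipeline is not described in detail in this post, but as a rough illustration of the idea, here is a minimal sketch of a standardization step: raw JSON documents with varying field names are read from one S3 location, mapped onto a single schema, and written back out for downstream users. The bucket names, prefixes, and field names are hypothetical.

```python
import json
import boto3

s3 = boto3.client("s3")

# Hypothetical buckets for raw and standardized data.
RAW_BUCKET = "my-raw-documents"
CURATED_BUCKET = "my-standardized-documents"

def normalize_record(doc: dict) -> dict:
    """Map fields from diverse source documents onto one standard schema."""
    return {
        "document_id": doc.get("id") or doc.get("doc_id"),
        "created_at": doc.get("created_at") or doc.get("timestamp"),
        "source": doc.get("source", "unknown"),
        "body": doc.get("text") or doc.get("content", ""),
    }

def run_pipeline(prefix: str) -> None:
    """Read raw JSON documents from S3, standardize them, and write them back to S3."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=RAW_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            raw = s3.get_object(Bucket=RAW_BUCKET, Key=obj["Key"])["Body"].read()
            record = normalize_record(json.loads(raw))
            s3.put_object(
                Bucket=CURATED_BUCKET,
                Key=f"standardized/{obj['Key']}",
                Body=json.dumps(record).encode("utf-8"),
            )

run_pipeline("documents/")
```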

There are many other ways that customers apply standards across an Aggregate data pattern, but one of the most common is to standardize on file formats. For example, many of our largest data lake customers, including Netflix, Nubank, Lyft, and Pinterest, commonly use a file format called Apache Parquet to store business data. Any text or numerical data that can be represented as a table, such as credit scoring, transaction records, or inventory reporting, can be stored in a Parquet file. In fact, Parquet is one of the most common and fastest-growing data types in HAQM S3, with exabytes of Parquet data (which is highly compressed) stored today. Customers make over 15 million requests per second to this data, and we serve hundreds of petabytes of Parquet every day. As one example of standardization, Pinterest standardizes their storage on S3, their tabular data on Parquet, and their open table format (OTF) on Apache Iceberg. They have thousands of business-critical Iceberg tables, and last year they adopted large language models (LLMs) to automate generating queries against the right Iceberg table.
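As a simple illustration of why Parquet works well as a standard, here is a minimal sketch of writing tabular business data to S3 as Parquet with pandas and pyarrow, then reading it back; any engine that understands Parquet (Athena, EMR, Spark, and so on) can read the same objects. The bucket and column names are hypothetical, and the s3:// paths assume the s3fs package is installed.

```python
import pandas as pd

# Example tabular business data (e.g., transaction records).
df = pd.DataFrame({
    "transaction_id": [1001, 1002, 1003],
    "amount": [25.50, 120.00, 8.75],
    "currency": ["USD", "USD", "EUR"],
})

# Write the table to S3 as compressed Parquet (pandas delegates to pyarrow and s3fs).
path = "s3://my-data-lake/transactions/2024/transactions.parquet"
df.to_parquet(path, engine="pyarrow", compression="snappy")

# Any other team or engine can now read the same columnar file.
df_back = pd.read_parquet(path)
print(df_back.head())
```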

If you are using the Aggregate data pattern, you are in good company. Many of our AWS customers moved to the cloud using Aggregate and scaled with it. However, particularly in the last 12-18 months, as customers look to put their analytics data to work for AI, more of them are moving to the Curate data pattern, either as their primary data pattern across the organization or for specific teams in the business.


This post is part of a series on adapting to change with data patterns on AWS.

Mai-Lan Tomsen Bukovec

Mai-Lan Tomsen Bukovec, Technology Vice President at AWS, leads the HAQM cloud data services that millions of AWS customers rely on for digital transformations, business analytics, machine learning, generative AI, and next generation customer experiences. With over 25 years of experience in the technology industry, Mai-Lan is a pioneer in helping customers take advantage of cloud-based technologies to transform their businesses.