AWS Big Data Blog
Architecture patterns to optimize HAQM Redshift performance at scale
Tens of thousands of customers use HAQM Redshift as a fully managed, petabyte-scale data warehouse service in the cloud. As an organization's business data grows in volume, its data analytics needs also grow. HAQM Redshift performance needs to be optimized at scale to achieve faster, near real-time business intelligence (BI). You might also consider optimizing HAQM Redshift performance when your data analytics workloads or user base grows, or to meet a data analytics performance service level agreement (SLA). You can also look for ways to optimize HAQM Redshift data warehouse performance after you complete an online analytical processing (OLAP) migration from another system to HAQM Redshift.
In this post, we will show you five HAQM Redshift architecture patterns that you can consider to optimize your HAQM Redshift data warehouse performance at scale using features such as HAQM Redshift Serverless, HAQM Redshift data sharing, HAQM Redshift Spectrum, zero-ETL integrations, and HAQM Redshift streaming ingestion.
Use HAQM Redshift Serverless to automatically provision and scale your data warehouse capacity
To start, let’s review using HAQM Redshift Serverless to automatically provision and scale your data warehouse capacity. The architecture is shown in the following diagram and includes different components within HAQM Redshift Serverless like ML-based workload monitoring and automatic workload management.

HAQM Redshift Serverless architecture diagram
HAQM Redshift Serverless is a deployment model that you can use to run and scale your Redshift data warehouse without managing infrastructure. HAQM Redshift Serverless will automatically provision and scale your data warehouse capacity to deliver fast performance for even the most demanding, unpredictable, or massive workloads.
HAQM Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis. You can optionally configure the base capacity, maximum RPU-hours usage limit, and maximum capacity (MaxRPU) parameters to balance warehouse performance and cost. This post dives deep into understanding the cost mechanisms to consider when managing HAQM Redshift Serverless.
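To see what you're consuming, you can query the SYS_SERVERLESS_USAGE system view from your Serverless workgroup. The following is a minimal sketch that approximates daily RPU-hours; the per-RPU-hour price varies by Region, so the rate used here is only a placeholder assumption.

```sql
-- A minimal sketch: approximate daily RPU-hours and cost from the
-- SYS_SERVERLESS_USAGE system view. The 0.375 rate below is a
-- placeholder assumption; actual pricing varies by Region.
SELECT TRUNC(start_time)                     AS usage_day,
       SUM(charged_seconds) / 3600.0         AS rpu_hours,
       SUM(charged_seconds) / 3600.0 * 0.375 AS approx_cost_usd
FROM sys_serverless_usage
GROUP BY TRUNC(start_time)
ORDER BY usage_day;
```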
HAQM Redshift Serverless scaling is automatic and based on your RPU capacity. To further optimize scaling for large datasets, HAQM Redshift Serverless offers AI-driven scaling and optimization. It uses AI to scale automatically as your workload changes across key dimensions such as data volume, concurrent users, and query complexity, to accurately meet your price-performance targets.
There is no maintenance window in HAQM Redshift Serverless because software version updates are applied automatically, without interrupting existing connections or running queries. Make sure to consult the considerations guide to better understand how HAQM Redshift Serverless operates.
You can migrate from an existing provisioned HAQM Redshift data warehouse to HAQM Redshift Serverless by creating a snapshot of your provisioned data warehouse and then restoring that snapshot in HAQM Redshift Serverless. HAQM Redshift automatically converts interleaved sort keys to compound sort keys when you restore a provisioned data warehouse snapshot into a Serverless namespace. You can also get started with a new HAQM Redshift Serverless data warehouse.
HAQM Redshift Serverless use cases
You can use HAQM Redshift Serverless for:
- Self-service analytics
- Auto scaling for unpredictable or variable workloads
- New applications
- Multi-tenant applications
With HAQM Redshift, you can access and query data stored in HAQM S3 Tables, fully managed Apache Iceberg tables optimized for analytics workloads. HAQM Redshift also supports querying data stored in Apache Iceberg tables and other open table formats, such as Apache Hudi and Linux Foundation Delta Lake. For more information, see External tables for Redshift Spectrum and Expand data access through Apache Iceberg using Delta Lake UniForm on AWS.
You can also use HAQM Redshift Serverless with HAQM Redshift data sharing, which lets you scale a large dataset across independent datashares while maintaining workload isolation controls.
HAQM Redshift data sharing to share live data between separate HAQM Redshift data warehouses
Next, we look at an HAQM Redshift data sharing architecture pattern, shown in the following diagram, that shares data between a hub HAQM Redshift data warehouse and spoke HAQM Redshift data warehouses, and across multiple HAQM Redshift data warehouses.

HAQM Redshift data sharing architecture patterns diagram
With HAQM Redshift data sharing, you can securely share access to live data between separate HAQM Redshift data warehouses without manually moving or copying the data. Because the data is live, all users see the most up-to-date and consistent information in HAQM Redshift as soon as it's updated, while each workload runs on its own dedicated resources. Because the compute accessing the data is isolated, you can size each data warehouse configuration to an individual workload's price-performance requirements rather than the aggregate of all workloads. This also provides additional flexibility to scale with new workloads without affecting the workloads already running on HAQM Redshift.
A datashare is the unit of sharing data in HAQM Redshift. A producer data warehouse administrator can create datashares and add datashare objects to share data with other data warehouses, referred to as outbound shares. A consumer data warehouse administrator can receive datashares from other data warehouses, referred to as inbound shares.
To get started, a producer data warehouse administrator adds the objects (and any required permissions) that another data warehouse needs to access to a datashare, and shares that datashare with a consumer. After the consumer creates a database from the datashare, the shared objects can be accessed on the consumer, using the consumer's compute, with three-part notation: consumer_database_name.schema_name.table_name.
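As an illustration, the following minimal sketch creates a datashare on the producer and consumes it on a consumer. The schema, table, and namespace values are hypothetical placeholders for your own objects and namespace GUIDs.

```sql
-- On the producer: create a datashare and add objects to it
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA sales;
ALTER DATASHARE sales_share ADD TABLE sales.orders;

-- Grant access to the consumer namespace (placeholder GUID)
GRANT USAGE ON DATASHARE sales_share
  TO NAMESPACE 'consumer-namespace-guid';

-- On the consumer: create a database from the datashare (placeholder GUID)
CREATE DATABASE sales_db
  FROM DATASHARE sales_share OF NAMESPACE 'producer-namespace-guid';

-- Query the shared objects with three-part notation, using the consumer's compute
SELECT COUNT(*) FROM sales_db.sales.orders;
```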
HAQM Redshift data sharing use cases
HAQM Redshift data sharing, including multi-warehouse writes in HAQM Redshift, can be used to:
- Support different kinds of business-critical workloads, including workload isolation and chargeback for individual workloads.
- Enable cross-group collaboration across teams for broader analytics, data science, and cross-product impact analysis.
- Deliver data as a service.
- Share data between environments to improve team agility by sharing data at different granularity levels such as development, test, and production.
- License access to data in HAQM Redshift by listing HAQM Redshift data sets in the AWS Data Exchange catalog so that customers can find, subscribe to, and query the data in minutes.
- Update business source data on the producer. You can share data as a service across your organization, and consumers can also perform actions on the source data.
- Insert additional records on the producer. Consumers can add records to the original source data.
The following articles provide examples of how you can use HAQM Redshift data sharing to scale performance:
- Achieve peak performance and boost scalability using multiple HAQM Redshift serverless workgroups and Network Load Balancer
- Sun King uses HAQM Redshift data sharing to accelerate data analytics and improve user experience
- How Fannie Mae built a data mesh architecture to enable self-service using HAQM Redshift data sharing
HAQM Redshift Spectrum to query data in HAQM S3
You can use HAQM Redshift Spectrum to query data in HAQM S3, as shown in the following diagram, using the AWS Glue Data Catalog.

HAQM Redshift Spectrum architecture diagram
You can use HAQM Redshift Spectrum to efficiently query and retrieve structured and semi-structured data from files in HAQM S3 without having to load the data into HAQM Redshift tables. Using the massively parallel HAQM Redshift Spectrum layer, you can run fast queries against large datasets while most of the data remains in HAQM S3. This can significantly improve the performance and cost-effectiveness of massive analytics workloads, because you can use the scalable storage of HAQM S3 to handle large volumes of data while still benefiting from the powerful query processing capabilities of HAQM Redshift.
HAQM Redshift Spectrum uses separate infrastructure, independent of your HAQM Redshift data warehouse, and offloads many compute-intensive tasks, such as predicate filtering and aggregation. This means Redshift Spectrum queries consume significantly less of your data warehouse's processing capacity than other queries. HAQM Redshift Spectrum can also scale automatically to potentially thousands of instances, based on the demands of your queries.
When implementing HAQM Redshift Spectrum, make sure to consult the considerations guide, which details the networking configuration, external table creation, and permissions requirements.
Review this best practices guide and this blog post, which outline recommendations for optimizing performance, including the impact of different file types, how to design around the scaling behavior, and how to partition files efficiently. You can check out an example architecture in Accelerate self-service analytics with HAQM Redshift Query Editor V2.
To get started with HAQM Redshift Spectrum, you define the structure for your files and register them as an external table in an external data catalog (AWS Glue, HAQM Athena, and Apache Hive metastore are supported). After creating your external table, you can query your data in HAQM S3 directly from HAQM Redshift.
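For example, the following sketch registers an external schema backed by the AWS Glue Data Catalog and defines a partitioned external table over Parquet files. The bucket, IAM role, database, and column names are hypothetical placeholders.

```sql
-- Register an external schema backed by the AWS Glue Data Catalog
-- (the IAM role and catalog database name are placeholders)
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::111122223333:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define a partitioned external table over Parquet files in HAQM S3
CREATE EXTERNAL TABLE spectrum_schema.clickstream (
    user_id    BIGINT,
    event_type VARCHAR(64),
    event_ts   TIMESTAMP
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://amzn-s3-demo-bucket/clickstream/';

-- Register each partition so its files become queryable
ALTER TABLE spectrum_schema.clickstream
ADD PARTITION (event_date = '2024-06-01')
LOCATION 's3://amzn-s3-demo-bucket/clickstream/event_date=2024-06-01/';
```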
HAQM Redshift Spectrum use cases
You can use HAQM Redshift Spectrum in the following use cases:
- Huge-volume but less frequently accessed data; for example, building a lake house architecture to query exabytes of data in an S3 data lake
- Heavy scan- and aggregation-intensive queries
- Selective queries that can use partition pruning and predicate pushdown, so the output is fairly small (see the example query after this list)
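As a sketch of the selective-query pattern, the following query filters on the partition column of the hypothetical external table defined earlier, so Redshift Spectrum can prune partitions and push the predicate down to the S3 scan.

```sql
-- Filtering on the partition column (event_date) lets Redshift Spectrum
-- skip non-matching partitions and push the predicate down to the scan,
-- so only a small result set returns to the data warehouse
-- (assumes the partition was registered as shown earlier)
SELECT event_type, COUNT(*) AS events
FROM spectrum_schema.clickstream
WHERE event_date = '2024-06-01'
GROUP BY event_type;
```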
Zero-ETL to unify all data and achieve near real-time analytics
You can use zero-ETL integration with HAQM Redshift to integrate with your transactional databases, such as HAQM Aurora MySQL-Compatible Edition, so you can run near real-time analytics in HAQM Redshift, BI in HAQM QuickSight, or machine learning workloads in HAQM SageMaker AI, as shown in the following diagram.

Zero-ETL integration with HAQM Redshift architecture diagram
Zero-ETL integration with HAQM Redshift removes the undifferentiated heavy lifting to build and manage complex extract, transform, and load (ETL) data pipelines; unifies data across databases, data lakes, and data warehouses; and makes data available in HAQM Redshift in near real time for analytics, artificial intelligence (AI) and machine learning (ML) workloads.
HAQM Redshift currently supports the following zero-ETL integrations:
- HAQM Aurora MySQL-Compatible Edition
- HAQM Aurora PostgreSQL-Compatible Edition
- HAQM Relational Database Service (HAQM RDS) for MySQL
- HAQM DynamoDB
- Applications such as Salesforce, SAP, ServiceNow, and Zendesk
To create a zero-ETL integration, you specify an integration source, such as an HAQM Aurora DB cluster, and an HAQM Redshift data warehouse as the target, either an HAQM Redshift Serverless workgroup or a provisioned data warehouse (including Multi-AZ deployments on RA3 clusters, which automatically recover from infrastructure or Availability Zone failures and help ensure that your workloads remain uninterrupted). The integration replicates data from the source to the target and makes it available in the target data warehouse within seconds. The integration also monitors the health of the pipeline and recovers from issues when possible.
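After the integration is active, you create a destination database in HAQM Redshift from it. The following is a minimal sketch; the integration ID and table names are hypothetical, and depending on the source engine you might also need to specify the source database.

```sql
-- A minimal sketch: create the destination database in HAQM Redshift
-- from a zero-ETL integration (the integration ID below is hypothetical;
-- some source engines also require a DATABASE clause naming the source)
CREATE DATABASE zeroetl_analytics
FROM INTEGRATION 'a1b2c3d4-5678-90ab-cdef-EXAMPLE11111';

-- Replicated tables can then be queried with standard SQL
-- (schema and table names here are placeholders)
SELECT COUNT(*) FROM zeroetl_analytics.public.orders;
```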
Make sure to review considerations, limitations, and quotas on both the data source and target when using zero-ETL integrations with HAQM Redshift.
Zero-ETL integration use cases
You can use zero-ETL integration with HAQM Redshift as an architecture pattern to boost analytical query performance at scale; it offers a straightforward and secure way to run near real-time analytics on petabytes of transactional data with continuous change data capture (CDC). Plus, you can use other HAQM Redshift capabilities such as built-in machine learning, materialized views, data sharing, and federated access to multiple data stores and data lakes. You can see more zero-ETL integration use cases at What is ETL.
Ingest streaming data into HAQM Redshift data warehouse for near real-time analytics
You can ingest streaming data from HAQM Kinesis Data Streams or HAQM Managed Streaming for Apache Kafka (HAQM MSK) into HAQM Redshift and run near real-time analytics in HAQM Redshift, as shown in the following diagram.

HAQM Redshift data streaming architecture diagram
HAQM Redshift streaming ingestion provides low-latency, high-speed data ingestion directly from HAQM Kinesis Data Streams or HAQM MSK to an HAQM Redshift provisioned or HAQM Redshift Serverless data warehouse, without staging data in HAQM S3. You can connect to and access the data from the stream using standard SQL and simplify data pipelines by creating materialized views in HAQM Redshift on top of the data stream. For best practices, you can review these blog posts:
- Best practices to implement near-real-time analytics using HAQM Redshift Streaming Ingestion with HAQM MSK
- Near-real-time analytics using HAQM Redshift streaming ingestion with HAQM Kinesis Data Streams and HAQM DynamoDB
To get started with HAQM Redshift streaming ingestion, you create an external schema that maps to the streaming data source and a materialized view that references the external schema. For details on how to set up streaming ingestion from HAQM Kinesis Data Streams, see Getting started with streaming ingestion from HAQM Kinesis Data Streams. For details on how to set up streaming ingestion from HAQM MSK, see Getting started with streaming ingestion from Apache Kafka sources.
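For example, the following sketch sets up ingestion from a hypothetical Kinesis data stream; the IAM role and stream name are placeholders.

```sql
-- Map an external schema to HAQM Kinesis Data Streams
-- (the IAM role is a hypothetical placeholder)
CREATE EXTERNAL SCHEMA kds_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::111122223333:role/MyStreamingRole';

-- Create an auto-refreshing materialized view over a hypothetical stream;
-- kinesis_data holds the record payload, parsed here as JSON
CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       partition_key,
       shard_id,
       sequence_number,
       JSON_PARSE(kinesis_data) AS payload
FROM kds_schema."my-click-stream";
```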
HAQM Redshift streaming ingestion use cases
You can use HAQM Redshift streaming ingestion to:
- Improve gaming experience by analyzing real-time data from gamers
- Analyze real-time IoT data and use machine learning (ML) within HAQM Redshift to improve operations, predict customer churn, and grow your business
- Analyze clickstream user data
- Conduct real-time troubleshooting by analyzing streaming data from log files
- Perform near real-time retail analytics on streaming point of sale (POS) data
Other HAQM Redshift features to optimize performance
There are other HAQM Redshift features that you can use to optimize performance.
- You can resize HAQM Redshift provisioned clusters to optimize data warehouse compute and storage use.
- You can use concurrency scaling, where HAQM Redshift automatically adds capacity to process increases in read operations, such as dashboard queries, and write operations, such as data ingestion and processing.
- You can also consider materialized views in HAQM Redshift, applicable to both provisioned and serverless data warehouses, which contain a precomputed result set based on a SQL query over one or more base tables. They are especially useful for speeding up queries that are predictable and repeated.
- You can use auto-copy for HAQM Redshift to set up continuous file ingestion from an HAQM S3 prefix and automatically load new files into tables in your HAQM Redshift data warehouse, without additional tools or custom solutions (see the example after this list).
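As an illustration of auto-copy, the following sketch creates a copy job that watches a hypothetical S3 prefix and loads new files as they arrive; the table, bucket, and IAM role names are placeholders.

```sql
-- A minimal auto-copy sketch: create a copy job that continuously loads
-- new files from a hypothetical S3 prefix (table, bucket, and role names
-- are placeholders)
COPY sales.orders
FROM 's3://amzn-s3-demo-bucket/orders/'
IAM_ROLE 'arn:aws:iam::111122223333:role/MyCopyRole'
FORMAT CSV
JOB CREATE orders_ingest_job AUTO ON;
```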
Cloud security at AWS is the highest priority. HAQM Redshift offers broad security-related configurations and controls to help ensure information is appropriately protected. See HAQM Redshift Security Best Practices for a comprehensive guide to HAQM Redshift security best practices.
Conclusion
In this post, we reviewed HAQM Redshift architecture patterns and features that you can use to help scale your data warehouse to dynamically accommodate different workload combinations, volumes, and data sources to achieve optimal price performance. You can use them alone or together, choosing the best infrastructure setup for your use case requirements, and scale to accommodate future growth.
Get started with these HAQM Redshift architecture patterns and features today by following the instructions provided in each section. If you have questions or suggestions, leave a comment below.
About the authors
Eddie Yao is a Principal Technical Account Manager (TAM) at AWS. He helps enterprise customers build scalable, high-performance cloud applications and optimize cloud operations. With over a decade of experience in web application engineering, digital solutions, and cloud architecture, Eddie currently focuses on Media & Entertainment (M&E) and Sports industries and AI/ML and generative AI.
Julia Beck is an Analytics Specialist Solutions Architect at AWS. She supports customers in validating analytics solutions by architecting proof of concept workloads designed to meet their specific needs.
Scott St. Martin is a Solutions Architect at AWS who is passionate about helping customers build modern applications. Scott uses his decade of experience in the cloud to guide organizations in adopting best practices around operational excellence and reliability, with a focus on the manufacturing and financial services spaces. Outside of work, Scott enjoys traveling, spending time with family, and playing piano.