AWS Big Data Blog
Get up to 3x better price performance with HAQM Redshift than other cloud data warehouses
Since we announced HAQM Redshift in 2012, tens of thousands of customers have trusted us to deliver the performance and scale they need to gain business insights from their data. HAQM Redshift customers span all industries and sizes, from startups to Fortune 500 companies, and we work to deliver the best price performance for any use case. Earlier in 2020, we published a blog post about improved speed and scalability in HAQM Redshift. This includes optimizations such as dynamically adding cluster capacity when you need it with concurrency scaling, making sure you use cluster resources efficiently with automatic workload management (WLM), and automatically adjusting data layout, distribution keys, and query plans to provide optimal performance for a given workload. We also described how customers, including Codecademy, OpenVault, Yelp, and Nielsen, have taken advantage of HAQM Redshift RA3 nodes with managed storage to scale their cloud data warehouses and reduce costs.
In addition to improving performance and scale, we are constantly looking at how to improve the price performance that HAQM Redshift provides. One of the ways we ensure that we provide the best value for customers is to measure performance regularly using a benchmark derived from the industry-standard TPC-DS benchmark. You can read the details of the benchmark at the end of this post, and can reproduce the results using the scripts, queries, and data in this GitHub repository.
We completed our most recent benchmark derived from the TPC-DS benchmark in November using the latest version of the products available across the vendors tested at that time. For HAQM Redshift, this includes more than 15 new capabilities released this year prior to November, but not new capabilities announced during AWS re:Invent 2020.
Best out-of-the-box and tuned price performance
Our November test of HAQM Redshift and three other leading cloud data warehouses showed that HAQM Redshift delivers up to three times better price performance out-of-the-box. The following chart illustrates these findings.
For this test, we ran all 99 queries derived from the TPC-DS benchmark against a 3 TB data set. We calculated price performance by multiplying the time required to run all queries in hours by the price per hour for each cloud data warehouse. We used clusters with comparable hardware characteristics for each data warehouse. We also used default settings for each cloud data warehouse, except we enabled encryption for all four services because it is enabled on two by default, and we disabled result caching where applicable. The default settings allowed us to determine the price performance delivered with no manual tuning effort. We selected the best result out of three runs for each query in order to take advantage of optimizations provided by each service. Finally, to ensure an apples-to-apples comparison, we used public pricing, and compared price performance rather than performance alone. For HAQM Redshift specifically, we used on-demand pricing; HAQM Redshift Reserved Instance pricing provides up to a 60% discount vs. on-demand pricing.
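The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark harness itself: the function names, the three query timings, and the hourly rate are all hypothetical, and the sketch assumes the best-of-three selection is applied per query, as this paragraph describes.

```python
from typing import Dict, List


def best_of_runs(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """For each query, keep the fastest runtime (in seconds) seen across all runs."""
    queries = runs[0].keys()
    return {q: min(run[q] for run in runs) for q in queries}


def run_cost_usd(total_runtime_sec: float, price_per_hour_usd: float) -> float:
    """Cost of one full run: total runtime in hours multiplied by the hourly price."""
    return total_runtime_sec / 3600.0 * price_per_hour_usd


# Hypothetical timings for three queries over three runs (not benchmark data).
runs = [
    {"q1": 12.0, "q2": 30.0, "q3": 7.5},
    {"q1": 10.5, "q2": 31.0, "q3": 7.0},
    {"q1": 11.0, "q2": 29.0, "q3": 8.0},
]

best = best_of_runs(runs)
total_sec = sum(best.values())

# Hypothetical hourly price for the whole cluster, in USD.
cost = run_cost_usd(total_sec, price_per_hour_usd=36.0)
print(f"total runtime: {total_sec:.1f} s, cost of run: ${cost:.3f}")
```

A lower cost per run means better price performance, so a slower but cheaper cluster can still come out ahead of a faster, more expensive one.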
These results show that HAQM Redshift provides the best price performance out-of-the-box, even for a comparatively small 3 TB dataset. This means that you can benefit from HAQM Redshift’s leading price performance from the start without manual tuning.
You can also take advantage of performance tuning techniques for HAQM Redshift to achieve even better results for your workloads. We repeated the benchmark test using tuning best practices provided by each cloud data warehouse vendor. After all cloud data warehouses are tuned, HAQM Redshift has 1.5 times better price performance than the nearest cloud data warehouse competitor, as shown in the following chart.
As with all benchmarks, transparency and reproducibility are crucial. For this reason, we have made the data and queries available on GitHub for anyone to use. See the README in GitHub for detailed instructions on re-running these benchmarks.
Tuned price performance improves as your data warehouse grows
One critical aspect of a data warehouse is how it scales as your data grows. Will you be paying more per TB as you add more data, or will your costs remain consistent and predictable? We work to make sure that HAQM Redshift delivers not only strong performance as your data grows, but also consistent price performance. We tested HAQM Redshift price performance using the queries derived from TPC-DS with 3 TB, 30 TB, and 100 TB datasets on three different cluster sizes. As shown in the following graph, HAQM Redshift tuned price performance improved (from $2.80 to $2.41 per TB per run) as the datasets grew. Tuning reduces the amount of network and disk I/O required for a given workload, and has varying impact depending on the combination of workload and cluster size.
In addition, as shown in the following table, HAQM Redshift out-of-the-box price performance is nearly the same ($4.80 to $5.01 per TB per run) for all three dataset sizes. This linear scaling of price performance across data size and cluster size, both out-of-the-box and tuned, makes sure that HAQM Redshift will scale predictably as your data and workloads grow.
HAQM Redshift results on test derived from TPC-DS benchmark

| Data set (TB) | Cluster | Out-of-box runtime (sec) | Out-of-box price per TB per run | Tuned runtime (sec) | Tuned price per TB per run |
| --- | --- | --- | --- | --- | --- |
| 3 | 10 node ra3.4xlarge | 1591 | $4.80 | 926 | $2.80 |
| 30 | 5 node ra3.16xlarge | 8291 | $5.01 | 4198 | $2.53 |
| 100 | 10 node ra3.16xlarge | 13343 | $4.83 | 6644 | $2.41 |
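The price per TB per run figures can be approximately reproduced from the runtimes in the table. The sketch below takes the runtimes and cluster sizes from the table; the per-node hourly prices ($3.26 for ra3.4xlarge and $13.04 for ra3.16xlarge) are an assumption, since this post does not state them, and the function name is hypothetical.

```python
def price_per_tb_per_run(runtime_sec: float,
                         cluster_price_per_hour_usd: float,
                         dataset_tb: float) -> float:
    """Cost of one full run (hours x hourly price), normalized by dataset size in TB."""
    return runtime_sec / 3600.0 * cluster_price_per_hour_usd / dataset_tb


# Runtimes and cluster sizes come from the table above.
# ASSUMPTION: the per-node on-demand prices are not given in the post.
ROWS = [
    # (dataset TB, nodes, assumed USD per node-hour, out-of-box sec, tuned sec)
    (3, 10, 3.26, 1591, 926),
    (30, 5, 13.04, 8291, 4198),
    (100, 10, 13.04, 13343, 6644),
]

results = []
for tb, nodes, node_price, out_of_box_sec, tuned_sec in ROWS:
    hourly = nodes * node_price  # hourly price of the whole cluster
    results.append((
        tb,
        round(price_per_tb_per_run(out_of_box_sec, hourly, tb), 2),
        round(price_per_tb_per_run(tuned_sec, hourly, tb), 2),
    ))

for tb, out_of_box, tuned in results:
    print(f"{tb:>3} TB  out-of-box ${out_of_box:.2f}  tuned ${tuned:.2f}")
```

Under these assumed prices the computed values match the table to the cent, which suggests the per-TB figures are simply the cost of one run divided by the dataset size.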
You can learn more about HAQM Redshift’s performance on large datasets in How HAQM Redshift powers large-scale analytics for HAQM.com. This AWS re:Invent 2020 session shows how HAQM.com is using HAQM Redshift to keep up with exploding data growth, and how you can upgrade your existing data warehouse workloads to RA3 nodes to get scale and performance at great value.
Up to 10x better query performance with AQUA
We’re investing to make sure HAQM Redshift continues to improve as your data warehouse needs grow. As noted earlier, these benchmark results reflect the latest version of HAQM Redshift as of November 2020. This version includes more than 15 new features released earlier this year, such as distributed Bloom filters, vectorized queries, and automatic WLM, but doesn’t include the benefits from new capabilities announced during AWS re:Invent 2020. You can join What’s new with HAQM Redshift at AWS re:Invent 2020 to learn more about the new capabilities.
These new capabilities include AQUA (Advanced Query Accelerator) for HAQM Redshift. AQUA is a new distributed and hardware-accelerated cache for HAQM Redshift that delivers up to 10x better query performance than other cloud data warehouses for certain types of queries. AQUA takes a new approach to cloud data warehousing. AQUA brings the compute to storage by doing a substantial share of data processing in-place on the innovative cache. In addition, it uses AWS-designed processors and a scale-out architecture to accelerate data processing beyond anything traditional CPUs can do today. AQUA’s preview is now open to all customers, and AQUA will be generally available in January 2021. You can learn more about AQUA and other new HAQM Redshift capabilities by joining What’s new with HAQM Redshift at AWS re:Invent 2020.
Find the best price performance for your workloads
The benchmark used in this post is derived from the industry-standard TPC-DS benchmark, and has the following characteristics:
- The schema and data are used unmodified from TPC-DS.
- The queries are used unmodified from TPC-DS. TPC-approved query variants are used for a warehouse if the warehouse does not support the SQL dialect of the default TPC-DS query.
- The test includes only the 99 TPC-DS SELECT queries. It does not include maintenance and throughput steps.
- Three power runs (i.e., single-stream runs) were performed with query parameters generated using the default random seed of the TPC-DS kit.
- The primary metric of total query runtime is used. The runtime is taken as the best of the three runs.
- Price performance is calculated as cost per hour (USD) divided by queries per hour, which is equivalent to average cost per query. Published on-demand pricing is used for all data warehouses.
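The equivalence stated in the last item above can be checked numerically. In this minimal sketch the function names are hypothetical and the hourly cluster price is an assumed value used only for illustration; the query count (99) and the runtime (1591 seconds, the 3 TB out-of-the-box result) come from this post.

```python
import math

NUM_QUERIES = 99  # the 99 SELECT queries in the test


def cost_per_query(cost_per_hour_usd: float, total_runtime_sec: float,
                   num_queries: int = NUM_QUERIES) -> float:
    """Cost per hour divided by queries per hour."""
    queries_per_hour = num_queries / (total_runtime_sec / 3600.0)
    return cost_per_hour_usd / queries_per_hour


def run_cost(cost_per_hour_usd: float, total_runtime_sec: float) -> float:
    """Total cost of one power run: hourly price times runtime in hours."""
    return cost_per_hour_usd * total_runtime_sec / 3600.0


# ASSUMPTION: $32.60 per hour is an illustrative cluster price, not stated in the post.
hourly_usd = 32.60
runtime_sec = 1591

per_query = cost_per_query(hourly_usd, runtime_sec)
average = run_cost(hourly_usd, runtime_sec) / NUM_QUERIES

# Both routes give the same number: the average cost of a single query.
assert math.isclose(per_query, average)
print(f"average cost per query: ${per_query:.4f}")
```

Because the number of queries is fixed at 99, ranking warehouses by cost per query gives the same order as ranking them by the total cost of a run.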
We call this benchmark the Cloud Data Warehouse Benchmark, and you can reproduce the benchmark results above using the scripts, queries, and data available on GitHub. Because it is derived from the TPC-DS benchmark and our tests do not comply with the TPC-DS specification, its results are not comparable to published TPC-DS results.
Each workload has unique characteristics, so if you’re just getting started, a proof of concept is the best way to understand how HAQM Redshift performs for your requirements. When running your own proof of concept, it’s important that you focus on proper cluster sizing and the right metrics—query throughput (number of queries per hour) and price performance. You can make a data-driven decision by requesting assistance with a proof of concept or working with a system integration and consulting partner.
If you’re an existing HAQM Redshift customer, connect with us for a free optimization session and briefing on the new features announced at AWS re:Invent 2020.
To stay up-to-date with the latest developments in HAQM Redshift, subscribe to the What’s New in HAQM Redshift RSS feed.
About the Authors
Eugene Kawamoto is a director of product management for HAQM Redshift. Eugene leads the product management and database engineering teams at AWS. He has been with AWS for ~8 years supporting analytics and database services both in Seattle and in Tokyo. In his spare time, he likes running trails in Seattle, loves finding new temples and shrines in Kyoto, and enjoys exploring his travel bucket list.
Stefan Gromoll is a Senior Performance Engineer with HAQM Redshift where he is responsible for measuring and improving Redshift performance. In his spare time, he enjoys cooking, playing with his three boys, and chopping firewood.