AWS Cloud Financial Management

Optimizing cost for using foundational models with HAQM Bedrock

This post was made better through reviews from Andrew Shamiya and Zhibin Cao.

As we continue our five-part series on optimizing costs for generative AI workloads on AWS, our third blog shifts focus to HAQM Bedrock. In our previous posts, we explored general Cloud Financial Management principles for generative AI adoption and strategies for custom model development using HAQM EC2 and HAQM SageMaker AI. Today, we’ll guide you through cost optimization techniques for HAQM Bedrock. We’ll explore making informed decisions about pricing options, model selection, knowledge base optimization, prompt caching, and automated reasoning. Whether you’re just starting with foundation models or looking to optimize your existing HAQM Bedrock implementation, these techniques will help you balance capability and cost while leveraging the convenience of managed AI models.

What is HAQM Bedrock?

HAQM Bedrock is a fully managed service that provides access to leading foundation models (FMs) from multiple AI companies through a unified API. This enables developers to build and scale generative AI applications without managing complex infrastructure. Key benefits include seamless model switching, enterprise-grade security and privacy controls, customization capabilities through model fine-tuning, and direct integration with AWS services. HAQM Bedrock offers several powerful levers to help you balance cost and performance.

Inference, the New Building Block in Modern Applications

At re:Invent 2024, our CEO Matt Garman introduced a paradigm shift in how we think about application architecture: positioning inference as a fundamental building block of modern applications, alongside traditional components like compute, storage, and databases (listen to Matt’s keynote presentation, AWS re:Invent 2024 – CEO Keynote with Matt Garman). As you increasingly embed generative AI capabilities into your operational workflows, managing and optimizing inference costs will become as crucial as traditional cloud cost management. To support this evolution, AWS introduced inference-level Cost Allocation Tags, providing granular visibility into inference spend. This enhanced monitoring capability enables you to visualize and analyze cost at the inference level, set and manage budgets specifically for your AI workloads, and make data-driven decisions on model selection and usage. In the following sections, we’ll explore practical cost optimization techniques that can help you lower your inference cost.
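
For example, you can attach cost allocation tags to an application inference profile and route requests through it so that inference spend rolls up to the right team or project. Below is a minimal sketch using boto3; the profile name, tag keys, model ARN, and Region are illustrative placeholders, and model availability varies by Region.

```python
import boto3

REGION = "us-east-1"  # illustrative Region
bedrock = boto3.client("bedrock", region_name=REGION)
runtime = boto3.client("bedrock-runtime", region_name=REGION)

# Create an application inference profile that wraps a foundation model and
# carries cost allocation tags (names, ARNs, and tag values are placeholders).
profile = bedrock.create_inference_profile(
    inferenceProfileName="claims-summarization-prod",
    modelSource={
        "copyFrom": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-haiku-20241022-v1:0"
    },
    tags=[
        {"key": "CostCenter", "value": "claims-dept"},
        {"key": "Environment", "value": "production"},
    ],
)

# Invoke the model through the tagged profile; once the tags are activated as
# cost allocation tags, this usage is attributed to them in Cost Explorer.
response = runtime.converse(
    modelId=profile["inferenceProfileArn"],
    messages=[{"role": "user", "content": [{"text": "Summarize this claim: ..."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```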

Flexible Pricing Models for Every Use Case

HAQM Bedrock’s flexible pricing model encompasses three key options: 1) On-Demand for pay-as-you-go flexibility, 2) Provisioned Throughput, offering 40-60% savings through one- or six-month commitments, and 3) Batch processing, which can offer up to 50% lower prices compared to On-Demand. Selecting the optimal pricing option is crucial because it directly impacts financial outcomes and operational efficiency. You can optimize spend while maintaining service quality by choosing the most appropriate option: On-Demand for variable workloads, Provisioned Throughput for consistent usage patterns, or Batch processing for non-time-sensitive operations. This flexibility supports various stages of AI implementation, enables proper resource allocation, prevents over-provisioning, and ensures better budget predictability. Making an informed pricing decision is essential, as the wrong choice could lead to unnecessary expenses that impact both operational efficiency and bottom-line results.
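
As an illustration of the Batch option, you can submit records for asynchronous processing with the CreateModelInvocationJob API. The sketch below assumes boto3, an existing IAM role that Bedrock can assume, and JSONL input already staged in S3; the names, ARNs, and bucket paths are placeholders.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Submit a batch inference job: records are read from and written to S3 rather
# than processed through real-time On-Demand calls. Bucket paths, the role ARN,
# and the model ID below are placeholders.
job = bedrock.create_model_invocation_job(
    jobName="nightly-product-descriptions",
    modelId="anthropic.claude-3-5-haiku-20241022-v1:0",
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-bedrock-batch/input/"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bedrock-batch/output/"}},
)
print("Batch job ARN:", job["jobArn"])
```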

Strategic Model Selection

Figure 1. HAQM Bedrock offers the broadest selection of fully managed models from leading AI companies.

Model selection in HAQM Bedrock is a strategic decision that can impact cost, efficiency, and performance outcomes. HAQM Bedrock provides access to diverse foundation models from industry leaders like Anthropic, Meta, Mistral AI, and HAQM. In addition to the models available from those providers, you can leverage 100+ other models in the HAQM Bedrock Marketplace. Rather than committing to a single model or provider, you can leverage HAQM Bedrock’s flexibility to seamlessly switch between models with minimal code modifications. As newer, more efficient models are released, you can easily switch for cost savings and increased performance. The platform’s batch processing capabilities further enhance this advantage by enabling continuous evaluation of new models as they become available, ensuring solutions remain optimized over time and maintain competitive advantages in rapidly evolving AI landscapes. This strategic approach to model diversity and evaluation helps you maximize your AI investments while maintaining operational agility.
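
Because every model is reachable through the same Converse API, switching providers for an evaluation or a cost comparison is typically a one-line change. Here is a simple sketch; the model IDs are illustrative, availability differs by Region, and some models are invoked through a cross-region inference profile ID.

```python
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask(model_id: str, question: str) -> str:
    """Send the same request to any Bedrock model through the unified Converse API."""
    response = runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": question}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

question = "Summarize our refund policy in two sentences."

# Switching providers is a one-line change to the model identifier.
print(ask("anthropic.claude-3-5-sonnet-20240620-v1:0", question))
print(ask("us.amazon.nova-lite-v1:0", question))        # cross-region inference profile ID
print(ask("meta.llama3-1-70b-instruct-v1:0", question))
```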

Another consideration when selecting a model is response time. Some HAQM Bedrock models support latency-optimized configurations (see Optimize model inference for latency), which deliver faster response times compared to standard performance. These configurations improve efficiency and make your generative AI applications more responsive. You can currently use latency-optimized configurations for HAQM Nova Pro, Anthropic’s Claude 3.5 Haiku, and Meta’s Llama 3.1 405B and 70B, running them faster on AWS than anywhere else.
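
Where a model supports it, you can request the latency-optimized variant on a per-call basis through the Converse API’s performanceConfig field, as in this brief sketch (the model ID is a placeholder):

```python
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Request the latency-optimized variant of a supported model for this call.
response = runtime.converse(
    modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Classify this support ticket: ..."}]}],
    performanceConfig={"latency": "optimized"},
)
print(response["output"]["message"]["content"][0]["text"])
```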

Leveraging Knowledge Bases

HAQM Bedrock supports Knowledge Bases, which enable you to create highly accurate, low-latency, secure, and custom generative AI applications by incorporating contextual information from your own data sources. Knowledge Bases implement retrieval augmented generation (RAG), which can lead to more accurate, relevant, and up-to-date responses. Utilizing Knowledge Bases helps you get higher quality answers, which drives cost savings through a reduction in the number of prompts and responses needed to get a relevant answer. The key to optimizing Knowledge Bases is managing the data and the indexing frequency. Indexing fees are the primary cost driver; they are charged per object or per OpenSearch Compute Unit (OCU) hour, depending on the vector database. The three things you can do to minimize these costs are:

  1. Include only relevant data in your data source(s) to avoid indexing data that will not contribute to your solution.
  2. Avoid updating or modifying files that are already indexed. A modified file will be re-indexed, which incurs additional charges.
  3. Remove data that is no longer needed to simplify your index. This reduces total indexing costs and speeds up requests against the indexed data.

By following these practices, you can achieve cost reductions and faster indexing for your knowledge base deployments.
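
At query time, a knowledge base is typically used through the RetrieveAndGenerate API, which fetches relevant chunks from your index and augments the prompt in a single call. A minimal sketch, assuming an existing knowledge base (the knowledge base ID and model ARN below are placeholders):

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Retrieve relevant chunks from the knowledge base and generate a grounded
# answer in one call. The knowledge base ID and model ARN are placeholders.
response = agent_runtime.retrieve_and_generate(
    input={"text": "What is our parental leave policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB123EXAMPLE",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-haiku-20241022-v1:0",
        },
    },
)
print(response["output"]["text"])  # grounded answer
# response["citations"] links the answer back to the indexed source documents.
```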

Figure 2. HAQM Bedrock has native support for Retrieval Augmented Generation (RAG)

Customization for Enhanced Performance

Recent advances in fine-tuning capabilities have made it easier than ever to optimize model performance. You can now customize and fine-tune models using your own data without needing to write code. This reduces the need for continuous model retraining and, because of the higher quality output, results in a solution that is more efficient and less costly to operate.
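
Customization jobs can be launched from the HAQM Bedrock console without writing code; if you prefer automation, the same fine-tuning job can be started through the API. The sketch below is illustrative only: the job name, base model, role ARN, S3 paths, and hyperparameter values are placeholders, and supported base models and hyperparameters vary.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Start a fine-tuning job from prepared JSONL training data in S3. The job name,
# base model, role ARN, S3 paths, and hyperparameter values are placeholders.
job = bedrock.create_model_customization_job(
    jobName="support-tone-finetune-001",
    customModelName="support-tone-custom",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.titan-text-lite-v1",
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://my-bedrock-training/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bedrock-training/output/"},
    hyperParameters={"epochCount": "2", "learningRate": "0.00001", "batchSize": "1"},
)
print("Customization job ARN:", job["jobArn"])
```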

Distillation for More Cost-Efficient Performance

HAQM Bedrock’s Model Distillation feature represents an opportunity to balance performance with efficiency. Through a knowledge transfer process from larger “teacher” models to smaller “student” models, this capability enables you to optimize without significantly compromising accuracy. The process produces distilled models that can run up to 500% faster and cost up to 75% less than their original counterparts, with less than 2% accuracy loss for use cases like RAG. This feature addresses the traditional trade-off between model capability and operational efficiency, making advanced AI applications more accessible and economically viable for budgets of all sizes.

Figure 3. Match the performance of advanced models with cost-efficient models for your use case with Model Distillation

Prompt Caching for Cost and Latency Reduction

HAQM Bedrock’s prompt caching capability delivers significant cost and performance benefits. By caching frequently used prompt prefixes across multiple API calls, this feature eliminates the need to re-process identical content, resulting in up to a 90% reduction in cost and an 85% decrease in latency for supported models. Prompt caching works by reusing cached prompt prefixes, bypassing reprocessing of the matching tokens and significantly reducing the computational resources required to generate outputs while maintaining response quality. This optimization makes enterprise-scale AI implementations more economically viable and responsive.
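
With the Converse API, you mark where the reusable portion of your prompt ends with a cache checkpoint; requests that repeat the same prefix can then read it from the cache instead of reprocessing it. A minimal sketch, assuming a model that supports prompt caching (the model ID and system prompt are placeholders):

```python
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# The long, reusable part of the prompt (shared instructions, a policy document,
# etc.) is followed by a cache checkpoint; later calls that repeat the same
# prefix can be served from the cache instead of reprocessing it.
system = [
    {"text": "You are a support assistant. Reference policy document: ..."},
    {"cachePoint": {"type": "default"}},
]

response = runtime.converse(
    modelId="anthropic.claude-3-5-haiku-20241022-v1:0",  # placeholder; must support prompt caching
    system=system,
    messages=[{"role": "user", "content": [{"text": "Can I return an opened item?"}]}],
)
print(response["output"]["message"]["content"][0]["text"])
print(response["usage"])  # includes cache read/write token counts when caching applies
```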

Automated Reasoning to Improve Accuracy

Through the integration of Automated Reasoning, HAQM Bedrock presents an opportunity to improve generative AI accuracy while optimizing cost. Automated Reasoning is available through HAQM Bedrock Guardrails and employs mathematical methods to guarantee accuracy in strategic areas like human resources, finance, and compliance. This mathematical proof process not only increases response reliability but also improves operational efficiency by reducing the number of prompts needed to arrive at an accurate response. Moreover, by providing a proof of accuracy and a logical explanation for every response, it enables you to accelerate AI adoption in accuracy-critical use cases without spending the resources usually devoted to manual verification and error correction. When improved accuracy, lower interaction costs, and built-in verification are factored in, the cost savings can be significant.
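
Automated Reasoning checks are configured as part of an HAQM Bedrock Guardrail; once the guardrail exists, you attach it to inference requests the same way as any other guardrail. A brief sketch, assuming a guardrail that was already created with an Automated Reasoning policy (the guardrail ID, version, and model are placeholders):

```python
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Attach the guardrail (assumed to already include an Automated Reasoning policy)
# to an inference request; the guardrail ID, version, and model are placeholders.
response = runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user", "content": [{"text": "How many vacation days does a new hire receive?"}]}],
    guardrailConfig={
        "guardrailIdentifier": "gr-exampleid123",
        "guardrailVersion": "1",
    },
)
print(response["output"]["message"]["content"][0]["text"])
```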

Conclusion

By implementing the above optimization strategies, you can significantly reduce costs while maintaining or improving performance. The key is to continuously evaluate and adjust your approach as new capabilities become available. The flexibility and comprehensive feature set of HAQM Bedrock make it an ideal platform for optimizing your generative AI implementation.

We’ve covered various approaches to optimize your HAQM Bedrock generative AI application development costs. In our next post, we’ll explore cost optimization strategies for HAQM Q, including pricing tier selection, user management, and content indexing. Join us to learn how to make the most of AWS’s AI-powered assistant while maintaining cost efficiency.

Adam Richter

Adam Richter is a Senior Optimization Solutions Architect, AWS OPTICS. In this role, Adam specializes in AI cost optimization and FinOps practices. He has contributed to customer-facing features like HAQM Q Business and HAQM Q Developer, while also serving as a thought leader through speaking engagements at AWS re:Invent and other industry platforms.

Bowen Wang

Bowen is a Principal Product Marketing Manager for AWS Billing and Cost Management services. She focuses on enabling finance and business leaders to better understand the value of the cloud and ways to optimize their cloud financial management. In her previous career, she helped a tech startup enter the Chinese market.