
Optimizing cost for building AI models with HAQM EC2 and SageMaker AI

This post was made better through reviews from Shane Robbins, Shruti Koparkar, and Natalia Cummings

Welcome to the second blog of the series on optimizing costs for generative AI workloads on AWS. In our first blog, we provided an overview of different implementation approaches and Cloud Financial Management principles for generative AI adoption. Today, we’ll dive deep into cost optimization strategies for building and deploying custom AI models using HAQM Elastic Compute Cloud (HAQM EC2) and HAQM SageMaker AI. Whether you’re training large language models, fine-tuning existing models, or deploying inference endpoints, we’ll explore key cost levers such as instance type selection, capacity management, and commitment planning for HAQM EC2, as well as model optimization, training efficiency, and deployment strategies for HAQM SageMaker AI. These practices will help you balance performance requirements with cost efficiency while maintaining the flexibility and control that comes with managing your own AI infrastructure.

HAQM EC2 and SageMaker AI are two of the foundational AWS services for generative AI. HAQM EC2 provides the scalable computing power needed for training and inference, while SageMaker AI offers built-in tools for model development, deployment, and optimization. Cost optimization is crucial because generative AI workloads require high-performance accelerators (GPU, Trainium, or Inferentia) and extensive processing, which can become expensive without efficient resource management. By applying the cost optimization strategies below, you can reduce costs while maintaining performance and scalability.


Image 1: HAQM EC2 and SageMaker AI cost optimization strategy (cost savings vs. effort). This graph is for illustrative purposes only; actual effort required and cost savings achieved may vary based on implementation scale, infrastructure, and team expertise.

HAQM EC2

1. Selecting the optimal instance type

HAQM EC2 instances are the primary component in a self-managed deployment, so it is important to choose the right instance type. With CPU-based instance types, like those powered by AWS Graviton, you can use AWS Compute Optimizer to easily analyze and rightsize your instances. Generative AI solutions typically need accelerated instances powered by NVIDIA GPUs or AWS AI chips like Trainium and Inferentia. AWS Trainium and Inferentia based instances offer up to 30-50% better price performance for training and inference and can be a cost-effective option for your workloads. To rightsize GPU-based instances, you can enable NVIDIA GPU metric collection with the CloudWatch agent. This allows AWS Compute Optimizer to collect NVIDIA GPU utilization data and provide rightsizing recommendations.
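
To illustrate, here is a minimal sketch of a CloudWatch agent configuration that enables NVIDIA GPU metric collection, written as a small Python script. The measurement list, collection interval, and file path are assumptions based on the agent's default Linux installation; adjust them for your environment.

```python
import json

# Minimal CloudWatch agent configuration that collects NVIDIA GPU metrics.
# The measurements and 60-second interval shown here are illustrative.
agent_config = {
    "metrics": {
        "namespace": "CWAgent",
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "nvidia_gpu": {
                "measurement": [
                    "utilization_gpu",
                    "utilization_memory",
                    "memory_used",
                    "temperature_gpu",
                    "power_draw",
                ],
                "metrics_collection_interval": 60,
            }
        },
    }
}

# Default agent configuration path on HAQM Linux; adjust for your AMI.
with open("/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json", "w") as f:
    json.dump(agent_config, f, indent=2)
```

After writing the configuration, restart the CloudWatch agent so the GPU metrics begin flowing into the CWAgent namespace, where Compute Optimizer can use them for rightsizing recommendations.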

For a more comprehensive analysis, test instance performance in context against your own data sets, workloads, and models. Tools like FM Bench can help streamline this testing by analyzing performance across different instance types and serving stacks. FM Bench helps you identify the most cost-effective configuration through markdown reports that show inference latency, transactions per minute, error rates, and cost per transaction. The reports contain explanatory text, tables, and figures that give you the information you need to rightsize your instances and ensure that you are only paying for what you need.

2. Smart capacity management

Once you understand the type of instance to use, the next step is to understand your capacity management strategy. Some common questions to think about are:

  • How many instances are needed?
  • How often do they need to run?
  • How long will they be needed?

On-Demand Capacity Reservations (ODCRs)

ODCRs allow you to reserve compute capacity for HAQM EC2 instances in a specific Availability Zone (AZ). For machine learning (ML) model training and fine-tuning, ODCRs allow you to get uninterrupted access to the accelerated (GPU, Trainium, or Inferentia) instances that you reserve. You should consider using ODCRs if you have strict capacity requirements or your solution requires capacity assurance.

ODCRs require no long-term commitment, and you can modify or cancel them at any time. Capacity Reservations are charged at the equivalent On-Demand rate whether you run instances in the reserved capacity or not. Billing starts as soon as the ODCR is provisioned in your account and continues while the Capacity Reservation remains provisioned.
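
As a concrete illustration, the following sketch uses boto3 to create an ODCR. The instance type, Availability Zone, and count are placeholders for your own requirements.

```python
import boto3

ec2 = boto3.client("ec2")

# Reserve capacity for eight p4d.24xlarge instances in a single AZ.
# All values here are placeholders; billing begins once the reservation
# is provisioned, whether or not instances are launched into it.
reservation = ec2.create_capacity_reservation(
    InstanceType="p4d.24xlarge",
    InstancePlatform="Linux/UNIX",
    AvailabilityZone="us-east-1a",
    InstanceCount=8,
    EndDateType="unlimited",  # remains active until you cancel it
)
print(reservation["CapacityReservation"]["CapacityReservationId"])
```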

To ensure that you are using ODCRs efficiently, it is important to monitor their utilization. The first option is HAQM CloudWatch, where you can set up alarms on metrics like AvailableInstanceCount, which tells you how many instances in the ODCR are going unused. Another option is AWS Cost Explorer or the Cost and Usage Reports (CUR): filter for usage types containing 'UnusedBox' or 'UnusedDed' to see the amount of reserved capacity that is not being used. Lastly, the AWS Health Dashboard will send an email when capacity utilization for ODCRs in your account drops below 20 percent.
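
For example, a minimal boto3 sketch like the following can alarm on unused reserved capacity. The reservation ID, SNS topic, and thresholds are placeholders; the AWS/EC2CapacityReservations namespace and AvailableInstanceCount metric are published by EC2 for each reservation.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when two or more reserved instances sit unused for 15 minutes.
# The reservation ID and SNS topic ARN below are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="odcr-unused-capacity",
    Namespace="AWS/EC2CapacityReservations",
    MetricName="AvailableInstanceCount",
    Dimensions=[{"Name": "CapacityReservationId", "Value": "cr-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=2,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:capacity-alerts"],
)
```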

Instance scheduling

If the workloads in your environment do not need to run 24/7, consider using AWS Instance Scheduler. AWS Instance Scheduler is a solution that automates the starting and stopping of HAQM EC2 and HAQM Relational Database Service (HAQM RDS) instances. This automation helps reduce operational costs by ensuring that resources are only running when needed. Instance Scheduler can be configured to work across multiple AWS accounts, enhancing its utility for larger organizations. The scheduler is installed with AWS CloudFormation templates and can be customized with parameters such as schedule, service (HAQM EC2 or HAQM RDS), and timezone settings. This flexibility allows you to tailor the scheduler to your specific needs, ensuring efficient resource management and cost optimization.
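
Once the solution is deployed, instances opt in to a schedule through tagging. The sketch below assumes the solution's default Schedule tag key and a hypothetical schedule named office-hours defined at deployment time.

```python
import boto3

ec2 = boto3.client("ec2")

# Attach a hypothetical "office-hours" schedule to an instance.
# "Schedule" is the solution's default tag key; both the key and the
# schedule name are configured when you deploy Instance Scheduler.
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],
    Tags=[{"Key": "Schedule", "Value": "office-hours"}],
)
```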

If you cannot shut down instances or release reserved capacity, consider using Instance Scheduler with ODCRs so the spare capacity can be temporarily shifted to other accounts, teams, or workloads in your environment. While this method may not result in direct cost savings, it allows you to get more value out of the capacity you are paying for.

3. Strategic commitment planning

When developing your AWS commitment strategy, the following factors will help maximize your savings: workload longevity (1- or 3-year terms), instance family requirements, and regional flexibility. The Savings Plans Purchase Analyzer tool can examine your historical usage patterns and recommend optimal commitment levels based on these factors. If you require specific instance families in particular Regions, EC2 Instance Savings Plans offer the deepest discounts, up to 72% off On-Demand pricing. Alternatively, Compute Savings Plans (CSPs) provide greater flexibility across instance generations and Regions at a somewhat lower discount rate, up to 66%. CSPs allow you to move workloads between Regions or upgrade to newer instance families without losing your committed discount benefits. This combination of savings and flexibility makes AWS Savings Plans an impactful cost optimization tool for your AWS infrastructure.
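
Beyond the console-based analyzer, the Cost Explorer API exposes Savings Plans purchase recommendations programmatically. Here is a minimal sketch, assuming a one-year, no-upfront Compute Savings Plan and a 30-day lookback; the response fields printed are accessed defensively since their presence depends on your usage.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Ask for a Compute Savings Plan recommendation based on the last
# 30 days of usage; term and payment option are illustrative choices.
resp = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",
)

for detail in resp["SavingsPlansPurchaseRecommendation"].get(
    "SavingsPlansPurchaseRecommendationDetails", []
):
    print(
        "Hourly commitment:", detail.get("HourlyCommitmentToPurchase"),
        "Estimated savings:", detail.get("EstimatedSavingsAmount"),
    )
```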

4. Maximizing resource efficiency

Tracking accelerator (GPU, Trainium, or Inferentia) utilization gives you a better understanding of how efficiently resources are being used. Utilization metrics help validate instance requirements, maximize efficiency, and identify opportunities for resource sharing across teams and projects. While CPU utilization monitoring is relatively straightforward, GPU monitoring presents unique challenges and, as noted in the instance selection discussion above, requires more detailed metrics. Two metrics that can be used to estimate GPU utilization are temperature and power draw. Both are available from the CloudWatch agent, and together they let you estimate GPU saturation levels, providing valuable insight into resource utilization patterns. With this estimation, you can extract greater utility from existing infrastructure, which translates into a lower total cost of ownership.
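
As an illustration, the sketch below pulls 24 hours of GPU power-draw statistics from the CWAgent namespace. The nvidia_smi_ metric-name prefix and the dimensions are assumptions based on the agent's default behavior; verify the exact names your agent version publishes.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Metric name assumes the CloudWatch agent's default "nvidia_smi_" prefix;
# the instance ID is a placeholder.
resp = cloudwatch.get_metric_statistics(
    Namespace="CWAgent",
    MetricName="nvidia_smi_power_draw",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=24),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Average", "Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    # Comparing average draw against the GPU's rated maximum (for example,
    # roughly 400 W for an A100) gives a rough saturation estimate.
    print(point["Timestamp"], point["Average"], point["Maximum"])
```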


HAQM SageMaker AI

As you accelerate AI/ML initiatives, you're confronted with a strategic decision: should you invest resources in building and maintaining ML infrastructure, or channel your effort toward driving business outcomes? HAQM SageMaker AI addresses this dilemma by offering a fully managed service that removes undifferentiated heavy lifting while maintaining the flexibility you need. SageMaker JumpStart helps you get started quickly by providing pre-built solutions, ready-to-deploy models, and example notebooks. Whether you're just starting your SageMaker AI journey or looking to optimize an existing implementation, these strategies will help you build a more efficient and cost-effective AI and ML solution.

1. Rightsizing for success

Optimizing your SageMaker AI instance type and size can impact both performance and cost-efficiency. Through careful analysis of your usage patterns, you can reduce your ML infrastructure costs through strategic rightsizing of your SageMaker AI instances. To select the right instance type and size for your workload, it is critical to test. FM Bench, mentioned above, is a valuable tool for making this process easier.

2. Balancing model capability and cost

Selecting the right model for your machine learning workloads is one of the most critical decisions in building effective SageMaker AI projects. Thoughtful model selection can improve both performance and cost-efficiency by up to 45%. We recommend a systematic approach that evaluates three key dimensions: 1) your specific use case requirements, 2) available computational resources, and 3) your budget. For example, while large language models (LLMs) offer impressive capabilities, they may not always be the most cost-effective solution for straightforward tasks where simpler models could suffice. You can achieve optimal results by starting with models from SageMaker JumpStart, a hub with foundation models, built-in algorithms, and prebuilt ML solutions, and then evaluating whether the incremental benefits of more sophisticated (and often more expensive) models justify the additional computational and financial costs. This iterative approach to model selection often leads to solutions that are both technically superior and more sustainable.
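
To make this concrete, a minimal sketch using the SageMaker Python SDK to deploy a JumpStart model follows. The model ID and instance type are illustrative; substitute the model and endpoint configuration that fit your use case.

```python
from sagemaker.jumpstart.model import JumpStartModel

# Illustrative JumpStart model ID; browse the JumpStart catalog for the
# model that best matches your task and budget.
model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")

# Deploy to a single GPU instance; the instance type is a placeholder choice.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)

print(predictor.endpoint_name)
```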

3. Leverage SageMaker AI Savings Plans

Optimizing costs while maintaining operational flexibility is crucial for running machine learning solutions at scale. Machine Learning Savings Plans (MLSPs) offer a usage-based pricing model that can save you up to 64% on SageMaker AI. These plans require a commitment to a consistent amount of usage (measured in dollars per hour) over a one- or three-year term. What makes MLSPs particularly powerful is their flexibility: the savings automatically apply to eligible SageMaker ML instance usage across SageMaker Studio Notebooks, SageMaker On-Demand Notebooks, SageMaker Processing, SageMaker Data Wrangler, SageMaker Training, SageMaker Real-Time Inference, and SageMaker Batch Transform. This means you can freely innovate and adjust your machine learning infrastructure without worrying about losing your committed savings. MLSPs provide a low-effort path to significant cost optimization, especially for sustained machine learning operations, and they are an essential consideration when you are scaling your machine learning initiatives while maintaining cost efficiency.

4. Optimize training costs with Spot Instances

Managed Spot Training in HAQM SageMaker AI offers a powerful cost optimization strategy for machine learning workloads. By leveraging Spot Instance pricing, you can reduce training costs by up to 90% compared to On-Demand instances, making it a valuable option for projects with restricted budgets. This approach is particularly well-suited for non-time-sensitive training jobs that can tolerate occasional interruptions, and when combined with AWS Graviton processors it can deliver even greater price-performance benefits. To simplify adoption, SageMaker AI automatically manages the Spot Instance lifecycle, checkpointing model artifacts so that training progress is preserved even if instances are reclaimed. This makes Managed Spot Training an ideal solution for development, testing, and other environments where training time flexibility exists.
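
A sketch of a Managed Spot Training job using the SageMaker Python SDK follows; the script name, IAM role, S3 paths, and framework versions are placeholders.

```python
from sagemaker.pytorch import PyTorch

# Placeholder role, bucket, and training script; framework versions are
# illustrative and should match what your training script requires.
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.g5.2xlarge",
    framework_version="2.1",
    py_version="py310",
    use_spot_instances=True,   # request Spot capacity for the job
    max_run=3600,              # cap on actual training seconds
    max_wait=7200,             # cap on training plus Spot-wait time (>= max_run)
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # resume point after reclaims
)

estimator.fit("s3://my-bucket/training-data/")
```

The checkpoint_s3_uri is what lets SageMaker AI resume from the last saved state after a Spot interruption, provided your training script writes and reloads checkpoints.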

5. Choose the right inference strategy

HAQM SageMaker AI provides several options for inference deployment, designed for different use cases and cost structures; see Inference cost optimization best practices for more information. Real-time inference provides low latency but incurs continuous costs because the instances run constantly, making it ideal for applications requiring immediate responses. Serverless Inference, introduced to reduce costs for intermittent workloads, automatically scales to zero when not in use and charges only for the compute time used during invocations. For batch processing, SageMaker AI Batch Transform offers the most cost-effective solution by processing large datasets in bulk without maintaining persistent endpoints. Lastly, Asynchronous Inference processes requests asynchronously, making it ideal for large payloads, long processing times, and near real-time latency needs; it reduces costs by automatically scaling instances to zero when idle, so you only pay while requests are being processed.
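
As one example of these options, here is a minimal sketch of a Serverless Inference deployment with the SageMaker Python SDK; the container image, model artifact, role, and capacity settings are placeholders.

```python
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

# Placeholder container image, model artifact, and execution role.
model = Model(
    image_uri="<inference-container-image-uri>",
    model_data="s3://my-bucket/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
)

# Scale-to-zero endpoint: you pay only for compute used during invocations.
model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=4096,  # 1 GB to 6 GB, in 1 GB increments
        max_concurrency=5,       # concurrent invocations before throttling
    )
)
```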

By implementing these optimization strategies, you can significantly reduce your infrastructure costs while maintaining high performance and scalability. The key to success is aligning these choices with your specific use case and business requirements, ensuring that you’re not just cutting costs, but optimizing for long-term success in your machine learning operations.

Conclusion

In this post, we’ve explored cost optimization strategies for custom AI model development using HAQM EC2 and SageMaker AI. In our next blog, we’ll dive into cost optimization techniques for HAQM Bedrock, including smart model selection, knowledge base optimization, and caching strategies. Stay tuned to learn how to maximize the value of foundation models while keeping costs under control.

Adam Richter

Adam Richter is a Senior Optimization Solutions Architect, AWS OPTICS. In this role, Adam specializes in AI cost optimization and FinOps practices. He has contributed to customer-facing features like HAQM Q Business and HAQM Q Developer, while also serving as a thought leader through speaking engagements at AWS re:Invent and other industry platforms.

Bowen Wang

Bowen is a Principal Product Marketing Manager for AWS Billing and Cost Management services. She focuses on enabling finance and business leaders to better understand the value of the cloud and ways to optimize their cloud financial management. In her previous career, she helped a tech startup enter the Chinese market.