AWS Machine Learning Blog
Category: HAQM EC2
Host concurrent LLMs with LoRAX
In this post, we explore how Low-Rank Adaptation (LoRA) can address the challenges of serving many fine-tuned model variants cost-effectively. Specifically, we discuss serving LoRA adapters with LoRA eXchange (LoRAX) on HAQM Elastic Compute Cloud (HAQM EC2) GPU instances, allowing organizations to efficiently manage and serve their growing portfolio of fine-tuned models, optimize costs, and provide seamless performance for their customers.
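LoRAX serves a single base model and swaps LoRA adapters in per request, so many fine-tunes can share one GPU deployment. As a minimal sketch, assuming a LoRAX server is already running on an EC2 GPU instance and exposing its TGI-style /generate endpoint on port 8080 (the endpoint URL and adapter IDs below are placeholders):

```python
import requests

# Hypothetical endpoint of a LoRAX server on an EC2 GPU instance
LORAX_ENDPOINT = "http://localhost:8080/generate"

def generate(prompt, adapter_id=None):
    """Query the shared base model, optionally routing through a LoRA adapter."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 64}}
    if adapter_id:
        # LoRAX loads the requested adapter on demand and batches it
        # alongside requests for other adapters on the same base model.
        payload["parameters"]["adapter_id"] = adapter_id
    response = requests.post(LORAX_ENDPOINT, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["generated_text"]

# Two tenants served by one GPU deployment, each with its own fine-tune
# (adapter IDs are illustrative placeholders):
print(generate("Summarize this ticket: ...", adapter_id="org/support-summarizer-lora"))
print(generate("Translate to French: ...", adapter_id="org/translation-lora"))
```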
Optimizing Mixtral 8x7B on HAQM SageMaker with AWS Inferentia2
This post demonstrates how to deploy and serve the Mixtral 8x7B language model on AWS Inferentia2 instances for cost-effective, high-performance inference. We walk through model compilation with Hugging Face Optimum Neuron, a set of tools that makes model loading, training, and inference straightforward, and deployment with the Hugging Face Text Generation Inference (TGI) container, a toolkit for deploying and serving LLMs.
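As a rough sketch of the compilation step, Optimum Neuron can export and compile the model for Inferentia2 ahead of time; the batch size, sequence length, and core count below are illustrative values, not the post's exact configuration:

```python
from optimum.neuron import NeuronModelForCausalLM

# Illustrative compilation settings; tune these for your instance size
# (e.g., inf2.48xlarge exposes 24 NeuronCores) and workload.
compiler_args = {"num_cores": 24, "auto_cast_type": "bf16"}
input_shapes = {"batch_size": 1, "sequence_length": 4096}

model = NeuronModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    export=True,  # trace and compile the model for Inferentia2
    **compiler_args,
    **input_shapes,
)
model.save_pretrained("./mixtral-8x7b-neuron")  # reusable compiled artifacts
```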
Unleash AI innovation with HAQM SageMaker HyperPod
In this post, we show how SageMaker HyperPod and the new features introduced at AWS re:Invent 2024 are designed to meet the demands of modern AI workloads, offering a persistent, optimized cluster tailored for distributed training and accelerated inference at cloud scale with attractive price-performance.
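For orientation, a HyperPod cluster is created through the SageMaker CreateCluster API. A minimal sketch with boto3 follows; the cluster name, instance group, IAM role ARN, and lifecycle script location are all placeholders:

```python
import boto3

sagemaker = boto3.client("sagemaker")

response = sagemaker.create_cluster(
    ClusterName="llm-training-cluster",  # placeholder name
    InstanceGroups=[
        {
            "InstanceGroupName": "worker-group",
            "InstanceType": "ml.trn1.32xlarge",
            "InstanceCount": 4,
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodRole",  # placeholder
            "LifeCycleConfig": {
                # Lifecycle scripts bootstrap each node (e.g., install Slurm).
                "SourceS3Uri": "s3://my-bucket/hyperpod-lifecycle/",  # placeholder
                "OnCreate": "on_create.sh",
            },
        }
    ],
)
print(response["ClusterArn"])
```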
Reduce conversational AI response time through inference at the edge with AWS Local Zones
This guide demonstrates how to deploy an open source foundation model from Hugging Face on HAQM EC2 instances across three locations: a commercial AWS Region and two AWS Local Zones. Through comparative benchmarking tests, we illustrate how deploying foundation models in Local Zones closer to end users can significantly reduce latency—a critical factor for real-time applications such as conversational AI assistants.
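A comparative benchmark like the one in the post can be as simple as timing identical requests against the model servers in each location. A minimal sketch, assuming hypothetical endpoints for EC2-hosted model servers in a Region and a Local Zone:

```python
import statistics
import time
import requests

# Placeholder URLs for model servers deployed in each location
ENDPOINTS = {
    "us-east-1 (Region)": "http://region-host.example.com:8080/generate",
    "us-east-1-atl-1a (Local Zone)": "http://localzone-host.example.com:8080/generate",
}
payload = {"inputs": "Hello!", "parameters": {"max_new_tokens": 32}}

for name, url in ENDPOINTS.items():
    samples = []
    for _ in range(20):
        start = time.perf_counter()
        requests.post(url, json=payload, timeout=30)
        samples.append((time.perf_counter() - start) * 1000)  # ms
    print(f"{name}: p50={statistics.median(samples):.0f} ms")
```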
How Rocket Companies modernized their data science solution on AWS
In this post, we share how we modernized Rocket Companies’ data science solution on AWS to reduce delivery time from eight weeks to under one hour, improve operational stability and support by cutting incident tickets by over 99% in 18 months, power 10 million automated data science and AI decisions daily, and provide a seamless data science development experience.
HAQM EC2 P5e instances are generally available
In this post, we discuss the core capabilities of HAQM Elastic Compute Cloud (HAQM EC2) P5e instances and the use cases they’re well-suited for. We walk you through an example of how to get started with these instances and carry out inference deployment of Meta Llama 3.1 70B and 405B models on them.
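P5e instances pair eight NVIDIA H200 GPUs with high-bandwidth interconnect, which suits tensor-parallel serving of large models. As one illustrative approach (not necessarily the exact stack used in the post), vLLM can shard Llama 3.1 70B across all eight GPUs:

```python
from vllm import LLM, SamplingParams

# Shard the model across the 8 H200 GPUs of a p5e.48xlarge;
# model ID and sampling settings are illustrative.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=8,
)
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain EC2 P5e instances in one sentence."], params)
print(outputs[0].outputs[0].text)
```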
Accelerated PyTorch inference with torch.compile on AWS Graviton processors
PyTorch originally used eager mode, in which each operation that forms the model runs independently as soon as it’s reached. PyTorch 2.0 introduced torch.compile to speed up PyTorch code over the default eager mode. In contrast to eager mode, torch.compile pre-compiles the entire model into a single graph in a manner that’s optimal for […]
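The basic usage pattern is a one-line wrap: the first call compiles the graph, and later calls reuse it. A minimal, self-contained example:

```python
import torch
import torch.nn as nn

# A toy model; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
compiled = torch.compile(model)  # uses the default inductor backend

x = torch.randn(32, 128)
with torch.no_grad():
    _ = compiled(x)    # first call: traces and compiles the graph
    out = compiled(x)  # later calls: run the optimized graph
```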
Get started quickly with AWS Trainium and AWS Inferentia using AWS Neuron DLAMI and AWS Neuron DLC
Starting with the AWS Neuron 2.18 release, you can now launch Neuron DLAMIs (AWS Deep Learning AMIs) and Neuron DLCs (AWS Deep Learning Containers) with the latest Neuron packages on the same day the Neuron SDK is released. When a Neuron SDK is released, you’ll now be notified of support for Neuron DLAMIs […]
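Neuron DLAMIs also publish SSM parameters, so the latest AMI ID can be resolved programmatically. A hedged sketch with boto3; the parameter path follows the documented pattern but should be verified for your OS and framework combination, and the key pair name is a placeholder:

```python
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

# Resolve the latest multi-framework Neuron DLAMI for Ubuntu 22.04
# (verify this parameter path against the Neuron documentation).
param = ssm.get_parameter(
    Name="/aws/service/neuron/dlami/multi-framework/ubuntu-22.04/latest/image_id"
)
ami_id = param["Parameter"]["Value"]

ec2.run_instances(
    ImageId=ami_id,
    InstanceType="trn1.2xlarge",  # or an inf2 instance type
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",  # placeholder key pair
)
```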
Sprinklr improves performance by 20% and reduces cost by 25% for machine learning inference on AWS Graviton3
This is a guest post co-written with Ratnesh Jamidar and Vinayak Trivedi from Sprinklr. Sprinklr’s mission is to unify silos, technology, and teams across large, complex companies. To achieve this, we provide four product suites (Sprinklr Service, Sprinklr Insights, Sprinklr Marketing, and Sprinklr Social) as well as several self-serve offerings. Each of these products is […]
End-to-end LLM training on instance clusters with over 100 nodes using AWS Trainium
In this post, we show you how to accelerate the full pre-training of LLMs by scaling up to 128 trn1.32xlarge nodes, using a Llama 2 7B model as an example. We share best practices for training LLMs on AWS Trainium, scaling training on a cluster with over 100 nodes, improving the efficiency of recovery from system and hardware failures, improving training stability, and achieving convergence.
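At the level of a single worker, Trainium training runs through torch-neuronx, which builds on PyTorch/XLA; cluster-scale runs like those in the post launch this kind of script across the nodes with a distributed launcher. A minimal, hypothetical single-step sketch (the model and data are stand-ins):

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm  # provided via torch-neuronx on Trainium

device = xm.xla_device()  # an XLA device backed by a NeuronCore
model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024).to(device)
loss = model(x).pow(2).mean()  # placeholder loss
loss.backward()
xm.optimizer_step(optimizer)  # all-reduces gradients across workers, then steps
xm.mark_step()                # cut and execute the accumulated XLA graph
```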