AWS Machine Learning Blog
Category: Amazon SageMaker HyperPod
Scalable training platform with Amazon SageMaker HyperPod for innovation: a video generation case study
In this post, we share an ML infrastructure architecture that uses SageMaker HyperPod to support research team innovation in video generation. We discuss the advantages of SageMaker HyperPod and the pain points it addresses, provide a step-by-step setup guide, and demonstrate how to run a video generation algorithm on the cluster.
Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod
In this post, we present an in-depth guide to starting a continual pre-training job using PyTorch Fully Sharded Data Parallel (FSDP) for Mistral AI’s Mathstral model with SageMaker HyperPod.
Enabling production-grade generative AI: New capabilities lower costs, streamline production, and boost security
As generative AI moves from proofs of concept (POCs) to production, we’re seeing a massive shift in how businesses and consumers interact with data, information—and each other. In what we consider “Act 1” of the generative AI story, we saw previously unimaginable amounts of data and compute create models that showcase the power of generative […]
Scaling Thomson Reuters’ language model research with Amazon SageMaker HyperPod
In this post, we explore the journey that Thomson Reuters took to enable cutting-edge research in training domain-adapted large language models (LLMs) using Amazon SageMaker HyperPod, an Amazon Web Services (AWS) feature focused on providing purpose-built infrastructure for distributed training at scale.
Introducing Amazon EKS support in Amazon SageMaker HyperPod
This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.
Integrate HyperPod clusters with Active Directory for seamless multi-user login
Amazon SageMaker HyperPod is purpose-built to accelerate foundation model (FM) training, removing the undifferentiated heavy lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks and months without disruption. Typically, HyperPod clusters are used by multiple users: machine learning (ML) researchers, software engineers, data scientists, […]