Amazon SageMaker HyperPod | AWS Machine Learning Blog

Implementing login node load balancing in SageMaker HyperPod for enhanced multi-user experience

In this post, we explore a solution for implementing load balancing across login nodes in Slurm-based HyperPod clusters. By distributing user activity evenly across all available nodes, this approach provides more consistent performance, better resource utilization, and a smoother experience for all users. We guide you through the setup process, providing practical steps to achieve effective load balancing in your HyperPod clusters.

Speed up your cluster procurement time with Amazon SageMaker HyperPod training plans

In this post, we explore how Amazon SageMaker HyperPod training plans accelerate compute resource procurement for machine learning workloads. We walk through, step by step, how to use the AWS CLI or the AWS Management Console to find, review, and create training plans that fit your compute and timeline needs. We then show how to use a training plan to submit SageMaker training jobs or create SageMaker HyperPod clusters.

Generative AI foundation model training on Amazon SageMaker

In this post, we explore how organizations can cost-effectively customize and adapt foundation models (FMs) using AWS managed services such as Amazon SageMaker training jobs and Amazon SageMaker HyperPod. We discuss how these tools help organizations optimize compute resources and reduce the complexity of model training and fine-tuning. We also explain how to decide which Amazon SageMaker service best fits your business needs and requirements.

Scalable training platform with Amazon SageMaker HyperPod for innovation: a video generation case study

In this post, we share an ML infrastructure architecture that uses SageMaker HyperPod to support research team innovation in video generation. We discuss the advantages of SageMaker HyperPod and the pain points it addresses, provide a step-by-step setup guide, and demonstrate how to run a video generation algorithm on the cluster.

Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

In this post, we present an in-depth guide to starting a continual pre-training job for Mistral AI’s Mathstral model on SageMaker HyperPod using PyTorch Fully Sharded Data Parallel (FSDP).

Enabling production-grade generative AI: New capabilities lower costs, streamline production, and boost security

As generative AI moves from proofs of concept (POCs) to production, we’re seeing a massive shift in how businesses and consumers interact with data, information—and each other. In what we consider “Act 1” of the generative AI story, we saw previously unimaginable amounts of data and compute create models that showcase the power of generative […]

Scaling Thomson Reuters’ language model research with Amazon SageMaker HyperPod

In this post, we explore the journey that Thomson Reuters took to enable cutting-edge research in training domain-adapted large language models (LLMs) using Amazon SageMaker HyperPod, an Amazon Web Services (AWS) feature focused on providing purpose-built infrastructure for distributed training at scale.

Introducing Amazon EKS support in Amazon SageMaker HyperPod

This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.

Integrate HyperPod clusters with Active Directory for seamless multi-user login

Amazon SageMaker HyperPod is purpose-built to accelerate foundation model (FM) training, removing the undifferentiated heavy lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks and months without disruption. Typically, HyperPod clusters are used by multiple users: machine learning (ML) researchers, software engineers, data scientists, […]
