Amazon Elastic Kubernetes Service | AWS Machine Learning Blog

Scale AI training and inference for drug discovery through Amazon EKS and Karpenter

This is a guest post co-written with the leadership team of Iambic Therapeutics. Iambic Therapeutics is a drug discovery startup with a mission to create innovative AI-driven technologies to bring better medicines to cancer patients, faster. Our advanced generative and predictive artificial intelligence (AI) tools enable us to search the vast space of possible drug […]

Open source observability for AWS Inferentia nodes within Amazon EKS clusters

This post walks you through the Open Source Observability pattern for AWS Inferentia, which shows you how to monitor the performance of ML chips used in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster whose data plane nodes are Amazon Elastic Compute Cloud (Amazon EC2) instances of type Inf1 and Inf2.

Scale LLMs with PyTorch 2.0 FSDP on Amazon EKS – Part 2

This is a guest post co-written with Meta’s PyTorch team and is a continuation of Part 1 of this series, where we demonstrate the performance and ease of running PyTorch 2.0 on AWS. Machine learning (ML) research has proven that large language models (LLMs) trained with significantly large datasets result in better model quality. In […]

Federated learning on AWS using FedML, Amazon EKS, and Amazon SageMaker

This post is co-written with Chaoyang He, Al Nevarez and Salman Avestimehr from FedML. Many organizations are implementing machine learning (ML) to enhance their business decision-making through automation and the use of large distributed datasets. With increased access to data, ML has the potential to provide unparalleled business insights and opportunities. However, the sharing of […]

Enable pod-based GPU metrics in Amazon CloudWatch

This post details how to set up container-based GPU metrics and provides an example of collecting these metrics from EKS pods.

How Sportradar used the Deep Java Library to build production-scale ML platforms for increased performance and efficiency

This is a guest post co-written with Fred Wu from Sportradar. Sportradar is the world’s leading sports technology company, at the intersection of sports, media, and betting. More than 1,700 sports federations, media outlets, betting operators, and consumer platforms across 120 countries rely on Sportradar know-how and technology to boost their business. Sportradar uses data […]

Accelerate hyperparameter grid search for sentiment analysis with BERT models using Weights & Biases, Amazon EKS, and TorchElastic

Financial market participants are faced with an overload of information that influences their decisions, and sentiment analysis stands out as a useful tool to help separate out the relevant and meaningful facts and figures. However, the same piece of news can have a positive or negative impact on stock prices, which presents a challenge for […]

Scaling distributed training with AWS Trainium and Amazon EKS

Recent developments in deep learning have led to increasingly large models such as GPT-3, BLOOM, and OPT, some of which are already in excess of 100 billion parameters. Although larger models tend to be more powerful, training such models requires significant computational resources. Even with the use of advanced distributed training libraries like FSDP and […]

Run inference at scale for OpenFold, a PyTorch-based protein folding ML model, using Amazon EKS

This post was co-written with Sachin Kadyan, a leading developer of OpenFold. In drug discovery, understanding the 3D structure of proteins is key to assessing the ability of a drug to bind to it, directly impacting its efficacy. Predicting the 3D protein form, however, is very complex, challenging, expensive, and time-consuming, and can take […]

Build flexible and scalable distributed training architectures using Kubeflow on AWS and Amazon SageMaker

In this post, we demonstrate how Kubeflow on AWS (an AWS-specific distribution of Kubeflow), used with AWS Deep Learning Containers and Amazon Elastic File System (Amazon EFS), simplifies collaboration and provides flexibility in training deep learning models at scale on both Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon SageMaker using a hybrid architecture approach. […]
