AWS HPC Blog
Category: Artificial Intelligence
Scale Reinforcement Learning with AWS Batch Multi-Node Parallel Jobs
Autonomous robots are increasingly used across industries, from warehouses to space exploration. Developing these robots requires complex simulation and reinforcement learning (RL), but setting up training environments can be challenging and time-consuming. AWS Batch multi-node parallel (MNP) infrastructure, combined with NVIDIA Isaac Lab, offers a solution by providing scalable, cost-effective robot training capabilities for sophisticated behaviors and complex tasks.
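To give a flavor of what an MNP setup involves, here is a minimal sketch of the request payload for a container-based AWS Batch multi-node parallel job definition, built in Python. The job definition name, container image URI, and resource sizes are illustrative placeholders, not values from the post; the actual `register_job_definition` call (via boto3) is shown commented out.

```python
def build_mnp_job_definition(num_nodes: int = 2) -> dict:
    """Build a request payload for a container-based AWS Batch
    multi-node parallel (MNP) job definition.

    All concrete values below (name, image, sizes) are placeholders.
    """
    return {
        "jobDefinitionName": "rl-training-mnp",  # placeholder name
        "type": "multinode",                     # marks this as an MNP job
        "nodeProperties": {
            "numNodes": num_nodes,
            # The main node's exit code determines the job's exit status.
            "mainNode": 0,
            "nodeRangeProperties": [
                {
                    # "0:" means this container spec applies to all nodes.
                    "targetNodes": "0:",
                    "container": {
                        # Placeholder image URI for an RL training container.
                        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/isaac-lab:latest",
                        "resourceRequirements": [
                            {"type": "VCPU", "value": "8"},
                            {"type": "MEMORY", "value": "32768"},
                            {"type": "GPU", "value": "1"},
                        ],
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    payload = build_mnp_job_definition(num_nodes=4)
    # To register it for real (requires AWS credentials):
    # import boto3
    # boto3.client("batch").register_job_definition(**payload)
    print(payload["type"], payload["nodeProperties"]["numNodes"])
```

AWS Batch launches all `numNodes` nodes together before the job starts, which is what makes MNP jobs suitable for tightly coupled distributed training.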
Enhancing Equity Strategy Backtesting with Synthetic Data: An Agent-Based Model Approach
Developing robust investment strategies requires thorough testing, but relying solely on historical data can introduce biases and limit your insights. Learn how synthetic data from agent-based models can provide an unbiased testbed to systematically evaluate your strategies and prepare for future market scenarios. Part 1 of 2 covers the theoretical foundations of the approach.
Scaling your LLM inference workloads: multi-node deployment with TensorRT-LLM and Triton on Amazon EKS
LLMs keep growing. Learn how NVIDIA Triton Inference Server, TensorRT-LLM, and Amazon EKS enable multi-node deployment of models like the 405B-parameter Llama 3.1. Let’s go large.
Deploying Generative AI Applications with NVIDIA NIM Microservices on Amazon Elastic Kubernetes Service (Amazon EKS) – Part 2
Learn how to deploy AI models at scale on AWS using NVIDIA NIM and Amazon EKS! This step-by-step guide shows you how to create a GPU cluster for inference in the second post of a two-part series.
Whisper audio transcription powered by AWS Batch and AWS Inferentia
Transcribe audio files at scale and at very low cost using Whisper with AWS Batch and AWS Inferentia. Check out this post to deploy a cost-effective solution in minutes!
Deploying generative AI applications with NVIDIA NIMs on Amazon EKS
Learn how to deploy AI models at scale on AWS using NVIDIA NIM and Amazon EKS! This step-by-step guide shows you how to create a GPU cluster for inference. Don’t miss part 1 of this two-part blog series!
Gang scheduling pods on Amazon EKS using AWS Batch multi-node processing jobs
AWS Batch multi-node parallel jobs can now run on Amazon EKS, providing gang scheduling of pods across nodes for large-scale distributed computing like ML model training.
Large scale training with NVIDIA NeMo Megatron on AWS ParallelCluster using P5 instances
Launching distributed GPT training? See how AWS ParallelCluster sets up a fast shared filesystem, SSH keys, host files, and more between nodes. Our guide has the details for creating a Slurm-managed cluster to train NeMo Megatron at scale.
Enhancing ML workflows with AWS ParallelCluster and Amazon EC2 Capacity Blocks for ML
No more guessing whether GPU capacity will be available when you launch ML jobs! EC2 Capacity Blocks for ML let you reserve GPUs in advance so you can start tasks on time. Learn how to integrate Capacity Blocks into AWS ParallelCluster to optimize your workflow in our latest technical blog post.
How computer vision is enabling a circular economy
In this post, we show how Reezocar uses computer vision to change the way they detect damage and price used vehicles for resale in secondary markets. This reduces landfill and helps achieve the goals of the circular economy.