AWS Cloud Operations Blog

Category: Amazon EC2 Container Service

Gain operational insights for NVIDIA GPU workloads using Amazon CloudWatch Container Insights

As machine learning models grow more advanced, they require extensive computing power to train efficiently. Many organizations are turning to GPU-accelerated Kubernetes clusters for both model training and online inference. However, properly monitoring GPU usage is critical for machine learning engineers and cluster administrators to understand model performance and to optimize infrastructure utilization. Without visibility […]

Distributed Tracing using AWS Distro for OpenTelemetry

More and more applications are being developed using serverless architectures with multiple microservices. Customers use managed AWS services including AWS Lambda, Amazon ECS and Amazon EKS running on Amazon Elastic Compute Cloud (EC2) and AWS Fargate for running their code along with services like Amazon API Gateway, Amazon SNS, Amazon SQS, Amazon DynamoDB, Amazon S3, and others. Developers use multiple […]