AWS Cloud Operations Blog
Category: Amazon EC2 Container Service
Gain operational insights for NVIDIA GPU workloads using Amazon CloudWatch Container Insights
As machine learning models grow more advanced, they require extensive computing power to train efficiently. Many organizations are turning to GPU-accelerated Kubernetes clusters for both model training and online inference. However, properly monitoring GPU usage is critical for machine learning engineers and cluster administrators to understand model performance and to optimize infrastructure utilization. Without visibility […]
Distributed Tracing using AWS Distro for OpenTelemetry
More and more applications are being developed using serverless architectures with multiple microservices. Customers use managed AWS services including AWS Lambda, Amazon ECS and Amazon EKS running on Amazon Elastic Compute Cloud (EC2) and AWS Fargate for running their code along with services like Amazon API Gateway, Amazon SNS, Amazon SQS, Amazon DynamoDB, Amazon S3, and others. Developers use multiple […]