AWS Cloud Operations Blog

Category: Amazon CloudWatch

How Amazon CloudWatch Logs Data Protection can help detect and protect sensitive log data

Customer applications running on Amazon Web Services (AWS) often require handling sensitive data such as personally identifiable information (PII) or protected health information (PHI). As a result, sensitive log data can be intentionally or unintentionally logged as part of an application’s observability data. While comprehensive logging is important for application troubleshooting, monitoring and forensics, any […]

Using Generative AI to Gain Insights into CloudWatch Logs

Have you ever been investigating a problem and opened up a log file and thought “I have no idea what I am looking at. If only I could get a summary of the data.” Observability and log data play an important role in maintaining operational excellence and ensuring the reliability of your applications and services. […]

AWS named as a Challenger in the 2024 Gartner Magic Quadrant for Observability Platforms

AWS has been named as a Challenger in the 2024 Gartner Magic Quadrant for Observability Platforms, previously known as Gartner Application Performance Monitoring (APM) and Observability Magic Quadrant. This report assesses vendors based on their Ability to Execute and Completeness of Vision. Compared to the previous year, AWS has moved up higher on the Ability […]

Improve Amazon Bedrock Observability with Amazon CloudWatch AppSignals

With the pace of innovation with Generative AI applications, there is increasing demand for more granular observability into applications using Large Language Models (LLMs). Specifically, customers want visibility into: Prompt metrics like token usage, costs, and model IDs for individual transactions and operations, apart from service-level aggregations. Output quality factors including potential toxicity, harm, truncation […]

Reduce code duplication in load testing and synthetic monitoring using Amazon CloudWatch Synthetics

Load testing is an integral step in the quality assurance phase of a software development lifecycle, one that gives you confidence in the performance of your workload before it is deployed to production. Once that workload moves to production, you monitor its health using synthetic monitoring. Load testing and synthetic monitoring typically test the same application […]

Use Amazon CloudWatch Contributor Insights for general analysis of Apache logs

Customers build, deploy, and maintain millions of web applications on AWS, and many customers deploy these applications using the Apache web application server. Web application performance is a key metric in modern enterprise applications. On AWS, customers leverage Amazon CloudWatch to monitor response times, uptime, and provide SLAs. Engineering teams that run large scale applications […]

Gain operational insights for NVIDIA GPU workloads using Amazon CloudWatch Container Insights

As machine learning models grow more advanced, they require extensive computing power to train efficiently. Many organizations are turning to GPU-accelerated Kubernetes clusters for both model training and online inference. However, properly monitoring GPU usage is critical for machine learning engineers and cluster administrators to understand model performance and to optimize infrastructure utilization. Without visibility […]

Automate CloudWatch Dashboard creation for your AWS Elemental MediaPackage and AWS Elemental MediaLive

Monitoring the health and performance of your media services is critical to ensuring a seamless viewing experience for your customers. Amazon CloudWatch provides powerful monitoring capabilities for Amazon Web Services (AWS) resources. Setting up comprehensive dashboards can be a time-consuming process, especially for organizations managing a large number of resources across multiple regions. The Automatic CloudWatch […]

Ten Ways to Improve Your AWS Operations

When I take my car in for service for a simple oil change, the technician often reads off a litany of other services my car needs that I had put off since the previous service (and maybe the service before that, too). I tend to wait for the “check engine” light to come on […]

How SLAs, SLOs, and SLIs interact

Improve application reliability with effective SLOs

At AWS, we consider reliability as a capability of services to withstand major disruptions within acceptable degradation parameters and to recover within an acceptable timeframe. Service reliability goes beyond traditional disciplines, such as availability and performance, to achieve its goal. Components of a system or application will eventually fail over time. Like our CTO Werner Vogels […]