Monitoring Windows pods with Prometheus and Grafana
This post was co-authored by Cezar Guimarães, Sr. Software Engineer, VTEX
Introduction
Customers across the globe are increasingly adopting Amazon Elastic Kubernetes Service (Amazon EKS) to run their Windows workloads. Many have found that refactoring existing Windows-based applications for an open-source environment, while ideal, is a very complex task. It requires investments that usually don’t translate into immediate cost savings, which makes refactoring hard to justify within a yearly IT budget. However, re-platforming existing, critical Windows-based applications into Windows containers makes sense from both a cost-saving and a modernization perspective.
Tools such as App2Container (A2C) have made application re-platforming easy. However, for successful day two operations, customers should also consider infrastructure transformations such as logging, monitoring, and tracing. As part of achieving full Windows container observability on AWS, in 2022 we published a Containers post on how customers can use an AWS-managed Windows fluent-bit container image to centralize Windows pod logs in different destinations.
Prometheus and Grafana form one of the most popular monitoring stacks for Kubernetes-based workloads. Therefore, this post focuses on how customers can centralize Windows pod metrics using Amazon Managed Service for Prometheus and Amazon Managed Grafana.
Solution overview
This post walks you through how to set up Windows Exporter (a Prometheus exporter for Windows) as a Kubernetes DaemonSet, and how to use PromQL (Prometheus Query Language) queries to enrich windows-exporter container metrics by merging them with kube-state-metrics (KSM). This lets you extend existing Linux-based Kubernetes monitoring to support Windows-based workloads.
Image 1. Solution workflow
- Amazon Managed Service for Prometheus scrapes Windows node and container metrics, such as CPU, memory, disk, and network usage, from the Windows Exporter HostProcess DaemonSet.
- Amazon Managed Service for Prometheus scrapes KSM to map pod and container names to their container ID.
- Amazon Managed Grafana provides the ability to create monitoring dashboards from the collected metrics using Amazon Managed Service for Prometheus as the data source.
Prerequisites
The following prerequisites are required to continue with this post:
- An Amazon EKS cluster with Windows nodes up and running. See this step-by-step guide.
- Amazon Managed Service for Prometheus with Amazon EKS ingestion properly set up. See this step-by-step guide.
- Amazon Managed Grafana fully integrated with Amazon Managed Service for Prometheus. See this step-by-step guide.
This post’s prerequisites use AWS-managed services such as Amazon Managed Service for Prometheus with a managed collector and Amazon Managed Grafana. However, this post also applies to self-managed Prometheus, Grafana, and ADOT/Prom-server agents.
Walkthrough
The following steps walk you through the solution described previously.
1. Install KSM
We now install KSM, a simple service that listens to the Kubernetes API server and generates metrics about the state of its objects. We need the KSM metrics to map pod and container names to their container ID.
1.1 Enter the following command to install KSM:
2. Create a Windows Exporter daemonset
The DaemonSet configuration has three parts. First, we set the securityContext to hostProcess: true. This means the container process has access to the host network namespace, storage, and devices, which allows us to fetch metrics for all the containers running on the host by reading built-in Windows metrics.
The second part is the initContainer, where we set up the host firewall to allow incoming traffic on TCP/9182 so that Amazon Managed Service for Prometheus can scrape the host. In the third part, we create a ConfigMap to inject the windows-exporter configuration and mount it into the windows-exporter pod.
2.1 Create a file containing the following code and save it as windows-exporter.yaml :
If your Windows nodes have taints, make sure you add the matching tolerations to the DaemonSet configuration.
This solution uses a public, open-source Prometheus container image. It is your responsibility to perform security due diligence.
2.2 Create the Kubernetes Namespace, Daemonset and ConfigMap. Enter the following command:
2.3 Check if the Daemonset pods are running. Enter the following command:
2.4 Once the pods are in the running status, you can check if they are accepting connections on port 9182. Enter the following command:
2.5 You should see the windows-exporter pod listening on port 9182, which is the port scraped by Amazon Managed Service for Prometheus.
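If you prefer to check reachability from a script, the following is a minimal Python sketch of a TCP connection test against port 9182. It is an illustration, not part of the original walkthrough: the function name `is_port_open` is hypothetical, and the script must run from a machine that has network access to the Windows node (for example, inside the same VPC).

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP connection;
        # the context manager closes the socket again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example usage (the node IP is a placeholder, replace it with your own):
# print(is_port_open("10.0.0.10", 9182))
```

A successful connection only shows that something is listening on the port. It does not validate the metrics the exporter serves.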
3. Visualizing Windows pod metrics in Amazon Managed Grafana
Assuming you already have Grafana knowledge, you can create panels that are relevant for your day two operations. The following table lists PromQL queries that merge the Windows container metrics scraped by Prometheus with KSM data and map each container to its pod. The rate-based queries compute per-second rates over a two-minute window ([2m]).
Make sure you select the right data source when creating panels. In this post, we are using Amazon Managed Service for Prometheus as the data source.
Metric | Query | Unit |
CPU | kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_cpu_usage_seconds_total{}[2m])) * 1000 | custom: milliCPU |
Memory | kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (windows_container_memory_usage_private_working_set_bytes{}) | bytes |
Network (sent) | kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_network_transmit_bytes_total{}[2m])) | bytes/sec |
Network (received) | kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_network_receive_bytes_total{}[2m])) | bytes/sec |
Disk (written) | kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_storage_write_size_bytes_total{}[2m])) | bytes/sec |
Disk (read) | kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_storage_read_size_bytes_total{}[2m])) | bytes/sec |
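All of the queries in the preceding table share the same shape: an `avg by (container_id)` aggregation over a Windows container metric, joined onto `kube_pod_container_info` with `on(container_id) group_left`. The following Python sketch emulates that join on made-up samples to show what the result looks like. The helper names (`avg_by`, `join_group_left`) and all label values are hypothetical, and the sketch simplifies PromQL's matching rules.

```python
from collections import defaultdict
from statistics import mean


def avg_by(samples, label):
    """Emulate `avg by (label) (...)`: average the sample values per label value."""
    groups = defaultdict(list)
    for labels, value in samples:
        groups[labels[label]].append(value)
    return {key: mean(values) for key, values in groups.items()}


def join_group_left(info_samples, right_by_key, label="container_id"):
    """Emulate `info * on(label) group_left right`.

    Keeps the labels of the left-hand (info) series, multiplies the values,
    and drops left-hand series that have no match on the right.
    """
    joined = []
    for labels, value in info_samples:
        key = labels.get(label)
        if key in right_by_key:
            joined.append((labels, value * right_by_key[key]))
    return joined


# Made-up samples for illustration only: (labels, value) pairs.
kube_pod_container_info = [
    ({"namespace": "default", "pod": "iis-7d9f", "container": "iis",
      "container_id": "containerd://abc"}, 1.0),
    ({"namespace": "default", "pod": "linux-app", "container": "app",
      "container_id": "containerd://zzz"}, 1.0),
]
windows_container_memory = [
    ({"container_id": "containerd://abc", "instance": "10.0.0.10:9182"}, 100.0),
    ({"container_id": "containerd://abc", "instance": "10.0.0.10:9182"}, 300.0),
]

joined = join_group_left(
    kube_pod_container_info,
    avg_by(windows_container_memory, "container_id"),
)
for labels, value in joined:
    print(labels["pod"], labels["container"], value)
```

Because `kube_pod_container_info` always has the value 1, the multiplication leaves the Windows metric value unchanged and only attaches the pod, container, and namespace labels. Containers that the Windows exporter does not report, such as Linux containers, drop out of the result.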
Check the Windows Exporter GitHub repository for a complete list of Windows containers metrics exported.
For example, the following query shows total CPU usage, in milliCPU, at the pod level. To do so, you need to create a custom legend with the value pod. Furthermore, it is essential to set the Units in the panel to the ones in the preceding table.
Image 3. Grafana query panel
The milliCPU query generates the following panel:
Image 4. Windows Pods – milliCPU
The CPU query takes the per-second rate of CPU seconds consumed and multiplies it by 1000 to match Kubernetes milliCPU units. This lets you quickly identify whether a pod needs its CPU requests or limits right-sized. A CPU second is one second of time on one CPU: the time the CPU spends actively running a process, as opposed to elapsed wall-clock time.
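To make the arithmetic concrete, here is a small Python sketch of the conversion from a CPU-seconds counter to milliCPU. It assumes only two samples of the counter and ignores the counter resets and extrapolation that PromQL's `rate()` handles. The function name `millicpu` and the sample values are made up.

```python
def millicpu(first_value, last_value, first_ts, last_ts):
    """Convert two samples of a CPU-seconds counter into average milliCPU.

    first_value/last_value are counter readings in CPU seconds;
    first_ts/last_ts are the sample timestamps in seconds.
    """
    elapsed = last_ts - first_ts
    if elapsed <= 0:
        raise ValueError("last_ts must be later than first_ts")
    cores_used = (last_value - first_value) / elapsed  # CPU seconds per second
    return cores_used * 1000


# A container that consumed 30 CPU seconds over a 120-second window
# used 0.25 of a core on average, which is 250 milliCPU.
usage = millicpu(first_value=120.0, last_value=150.0, first_ts=0.0, last_ts=120.0)
print(usage)
```

Comparing this value with the container's CPU request (for example, 500m) gives a quick indication of whether the request is oversized or undersized.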
4. Visualizing Windows node metrics in Amazon Managed Grafana
Visualizing Windows node metrics is as crucial as visualizing Windows pod metrics. The following table lists PromQL queries that return the data scraped by Prometheus for each Windows node. The rate-based queries compute per-second rates over a two-minute window ([2m]).
Metric | Query | Unit |
CPU | sum by (instance) (rate(windows_cpu_time_total{mode!="idle"}[2m])) / count by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) | Percent (0.0-1.0) |
Memory | (1 - windows_os_physical_memory_free_bytes{} / windows_cs_physical_memory_bytes{}) | Percent (0.0-1.0) |
Network (sent) | rate(windows_net_bytes_sent_total{}[2m]) | bytes/sec |
Network (received) | rate(windows_net_bytes_received_total{}[2m]) | bytes/sec |
Disk (written) | sum by (instance) (rate(windows_physical_disk_write_bytes_total{}[2m])) | bytes/sec |
Disk (read) | sum by (instance) (rate(windows_physical_disk_read_bytes_total{}[2m])) | bytes/sec |
Check the Windows Exporter GitHub repository for a complete list of Windows nodes metrics exported.
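The CPU and Memory queries in the preceding table both produce a fraction between 0.0 and 1.0. The following Python sketch reproduces the two calculations on made-up numbers. It assumes that `windows_cpu_time_total` exposes one series per core and mode, so that counting the idle series gives the number of logical cores. The function names are hypothetical.

```python
def node_cpu_fraction(core_mode_rates):
    """Emulate: sum(rate(non-idle)) / count(rate(idle)) for one instance.

    core_mode_rates maps (core, mode) -> rate in CPU seconds per second.
    """
    busy = sum(rate for (_, mode), rate in core_mode_rates.items() if mode != "idle")
    # One idle series exists per logical core, so this counts the cores.
    cores = sum(1 for (_, mode) in core_mode_rates if mode == "idle")
    if cores == 0:
        raise ValueError("no idle series found, cannot count cores")
    return busy / cores


def memory_used_fraction(free_bytes, total_bytes):
    """Emulate: 1 - free physical memory / total physical memory."""
    return 1 - free_bytes / total_bytes


# Made-up rates for a two-core node: core 0 is half busy, core 1 is idle.
rates = {
    ("0", "idle"): 0.5, ("0", "user"): 0.25, ("0", "privileged"): 0.25,
    ("1", "idle"): 1.0, ("1", "user"): 0.0, ("1", "privileged"): 0.0,
}
cpu = node_cpu_fraction(rates)
mem = memory_used_fraction(free_bytes=4 * 1024**3, total_bytes=16 * 1024**3)
print(cpu, mem)
```

With these inputs the node is 25% busy on CPU and uses 75% of its physical memory, which is why both panels should use a percent unit on a 0.0-1.0 scale.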
For example, in the following query, we are filtering the total CPU usage percentage at the node level. To do so, you must create a custom legend with the value node. Furthermore, it is essential to set the Units in the panel to the ones in the preceding table.
Image 5. Grafana query panel
The Memory query generates the following panel:
Image 6. Windows nodes memory percent usage panel
Conclusion
This post covered how to deploy Windows Exporter as a DaemonSet using HostProcess container mode. Then, we covered which Windows Exporter and KSM metrics to use to build a proper Grafana monitoring dashboard. You can also use these metrics to add panels to an existing Grafana dashboard, such as when an Amazon EKS cluster with a mixed data plane is deployed.
In addition, see the best practices for running Windows containers on Amazon EKS in the Amazon EKS Best Practices Guide.