Monitoring Windows pods with Prometheus and Grafana

This post was co-authored by Cezar Guimarães, Sr. Software Engineer, VTEX

Introduction

Customers across the globe are increasingly adopting Amazon Elastic Kubernetes Service (Amazon EKS) to run their Windows workloads. Many have found that refactoring existing Windows-based applications for an open-source environment, while ideal, is a very complex task. It requires investments that usually do not translate into immediate cost savings, which makes refactoring hard to justify within a yearly IT budget. However, re-platforming existing, critical Windows-based applications into Windows containers makes sense from both a cost-saving and a modernization perspective.

Tools such as App2Container (A2C) have made application re-platforming easy. However, for successful day two operations, customers should also consider infrastructure changes such as logging, monitoring, and tracing. As part of achieving full Windows container observability on AWS, in 2022 we published a Containers post on how customers can use an AWS-managed Windows fluent-bit container image to centralize Windows pod logs in different destinations.

Prometheus and Grafana together form one of the most popular monitoring stacks for Kubernetes-based workloads. Therefore, this post focuses on how customers can centralize Windows pod metrics using Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Solution overview

This post walks you through how to set up Windows Exporter (a Prometheus exporter for Windows) as a Kubernetes DaemonSet, and how to use PromQL (Prometheus Query Language) queries to enrich windows-exporter container metrics by merging them with kube-state-metrics (KSM). This lets you extend existing Linux-based Kubernetes monitoring to support Windows-based workloads.

Image 1. Solution workflow

Amazon Managed Service for Prometheus scrapes Windows node/container metrics, such as CPU, memory, disk, and network usage, from the Windows Exporter HostProcess DaemonSet.
Amazon Managed Service for Prometheus scrapes KSM to map pod and container names to their container ID.
Amazon Managed Grafana provides the ability to create monitoring dashboards from the collected metrics using Amazon Managed Service for Prometheus as the data source.

Prerequisites

The following prerequisites are required to continue with this post:

An Amazon EKS cluster with Windows nodes up and running. See this step-by-step.
Amazon Managed Service for Prometheus with Amazon EKS ingestion properly set up. See this step-by-step.
Amazon Managed Grafana fully integrated with Amazon Managed Service for Prometheus. See this step-by-step.

The prerequisites in this post use AWS-managed services such as Amazon Managed Service for Prometheus with a managed collector and Amazon Managed Grafana. However, this post also applies to self-managed Prometheus, Grafana, and ADOT/Prom-server agents.

Walkthrough

The following steps walk you through the solution described previously.

1. Install KSM

We now install KSM, a simple service that listens to the Kubernetes API server and generates metrics about the state of its objects. We need the KSM metrics to map pod and container names to their container ID.

1.1 Enter the following command to install KSM:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  
helm install kube-state-metrics prometheus-community/kube-state-metrics -n kube-system

2. Create a Windows Exporter daemonset

First, going deep into the DaemonSet configuration, we set hostProcess: true in the securityContext. This means the container process has access to the host network namespace, storage, and devices, which allows us to fetch metrics for all the containers running on the host by listening to built-in Windows metrics.

The second part is the initContainer, where we configure the host firewall to allow incoming traffic on TCP/9182 so that Amazon Managed Service for Prometheus can scrape the host. In the third part, we create a ConfigMap to inject the windows-exporter configuration and mount it to the windows-exporter pod.

2.1 Create a file containing the following code and save it as windows-exporter.yaml:

If you have any taints on the Windows nodes, then make sure you add the matching tolerations to the DaemonSet configuration.

kind: Namespace
apiVersion: v1
metadata:
  name: windows-monitoring
  labels:
    name: windows-monitoring
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: windows-exporter
  namespace: windows-monitoring
  labels:
    app: windows-exporter
spec:
  selector:
    matchLabels:
      app: windows-exporter
  template:
    metadata:
      labels:
        app: windows-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/scheme: http
        prometheus.io/path: "/metrics"
        prometheus.io/port: "9182"
    spec:
      securityContext:
        windowsOptions:
          hostProcess: true
          runAsUserName: "NT AUTHORITY\\system"
      hostNetwork: true
      initContainers:
        - name: configure-firewall
          image: mcr.microsoft.com/powershell:lts-nanoserver-1809
          command: ["powershell"]
          args: ["New-NetFirewallRule", "-DisplayName", "'windows-exporter'", "-Direction", "inbound", "-Profile", "Any", "-Action", "Allow", "-LocalPort", "9182", "-Protocol", "TCP"]
      containers:
      - args: 
        - --config.file=%CONTAINER_SANDBOX_MOUNT_POINT%/config.yml
        name: windows-exporter
        image: ghcr.io/prometheus-community/windows-exporter:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 9182
          hostPort: 9182
          name: http
        volumeMounts:
        - name:  windows-exporter-config
          mountPath: /config.yml
          subPath: config.yml
      nodeSelector:
        kubernetes.io/os: windows
      volumes:
      - name: windows-exporter-config
        configMap:
          name: windows-exporter-config
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: windows-exporter-config
  namespace: windows-monitoring
  labels:
    app: windows-exporter
data:
  config.yml: |
    collectors:
      enabled: '[defaults],container'
    collector:
      service:
        services-where: "Name='containerd' or Name='kubelet'"

This solution uses a public, open-source Prometheus container image. It is your responsibility to perform security due diligence.

2.2 Create the Kubernetes Namespace, DaemonSet, and ConfigMap. Enter the following command:

kubectl create -f windows-exporter.yaml

2.3 Check if the DaemonSet pods are running. Enter the following command:

kubectl get pods -n windows-monitoring

2.4 Once the pods are in the Running status, you can check if they are accepting connections on port 9182. Enter the following command, replacing windows-exporter-pod-name with the name of one of your pods:

kubectl logs windows-exporter-pod-name -n windows-monitoring

2.5 You should see the windows-exporter pod listening on port 9182, which is the port scraped by Amazon Managed Service for Prometheus.

ts=2024-01-30T00:03:22.226Z caller=tls_config.go:313 level=info msg="Listening on" address=[::]:9182
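
If you want to automate this check, the log line can be parsed programmatically. The following is a minimal Python sketch, assuming the logfmt-style line format shown above. The function names parse_logfmt and listening_port are hypothetical helpers introduced here for illustration and are not part of windows-exporter.

```python
import shlex


def parse_logfmt(line):
    """Split a logfmt-style line (key=value pairs) into a dict."""
    fields = {}
    # shlex handles quoted values such as msg="Listening on".
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def listening_port(line):
    """Return the port from a "Listening on" log line, or None otherwise."""
    fields = parse_logfmt(line)
    if fields.get("msg") != "Listening on" or "address" not in fields:
        return None
    # The address looks like [::]:9182, so the port follows the last colon.
    return int(fields["address"].rsplit(":", 1)[1])


sample = ('ts=2024-01-30T00:03:22.226Z caller=tls_config.go:313 '
          'level=info msg="Listening on" address=[::]:9182')
print(listening_port(sample))
```

You could feed this function each line of the kubectl logs output and confirm that one of them reports port 9182.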

3. Visualizing Windows pods metrics in Amazon Managed Grafana

Assuming you are already familiar with Grafana, you can create panels that are relevant for your day two operations. The following table lists PromQL queries that retrieve the data scraped by Prometheus, merging Windows container metrics and mapping them to their pods. The rate-based queries are computed over a two-minute window ([2m]).

Make sure you are selecting the right data source when creating panels. In this post, we are using Amazon Managed Service for Prometheus as a data source.

Metric	Query	Unit
CPU	kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_cpu_usage_seconds_total{}[2m])) * 1000	custom: milliCPU
Memory	kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (windows_container_memory_usage_private_working_set_bytes{})	bytes
Network (sent)	kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_network_transmit_bytes_total{}[2m]))	bytes/sec
Network (received)	kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_network_receive_bytes_total{}[2m]))	bytes/sec
Disk (written)	kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_storage_write_size_bytes_total{}[2m]))	bytes/sec
Disk (read)	kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_storage_read_size_bytes_total{}[2m]))	bytes/sec

Check the Windows Exporter GitHub repository for a complete list of Windows containers metrics exported.
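
To make the join in the preceding queries easier to follow, here is a small Python sketch of what `kube_pod_container_info * on(container_id) group_left avg by (container_id) (...)` does conceptually. It is a simplification rather than how Prometheus is implemented, and the pod names, container IDs, and values are made up for the example.

```python
from collections import defaultdict
from statistics import mean


def join_on_container_id(info_series, metric_samples):
    """Attach KSM pod/container labels to per-container metric values."""
    # avg by (container_id): average all samples that share a container ID.
    grouped = defaultdict(list)
    for sample in metric_samples:
        grouped[sample["container_id"]].append(sample["value"])
    averaged = {cid: mean(values) for cid, values in grouped.items()}

    # on(container_id) group_left: keep the labels of the left-hand (KSM)
    # series and multiply its value (1 for an info metric) by the match.
    joined = []
    for info in info_series:
        cid = info["container_id"]
        if cid in averaged:  # series without a match are dropped
            joined.append({**info, "value": 1 * averaged[cid]})
    return joined


# Hypothetical data: one KSM info series per container, two exporter samples.
ksm_info = [
    {"pod": "iis-abc", "container": "iis", "container_id": "containerd://111"},
    {"pod": "linux-xyz", "container": "app", "container_id": "containerd://999"},
]
exporter_samples = [
    {"container_id": "containerd://111", "value": 0.25},
    {"container_id": "containerd://111", "value": 0.75},
]

for row in join_on_container_id(ksm_info, exporter_samples):
    print(row["pod"], row["container"], row["value"])
```

Only the container that appears on both sides survives the join, which is why Linux pods without windows-exporter metrics do not show up in these panels.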

For example, in the following query, we are filtering total CPU usage at the pod level. To do so, you need to create a custom legend with the value pod. Furthermore, it is essential to set the Units in the panel to the ones in the preceding table.

Image 3. Grafana query panel

The milliCPU query generates the following panel:

Image 4. Windows Pods – milliCPU

The CPU query measures CPU usage in CPU seconds per second, multiplied by 1000 to match Kubernetes milliCPU units. This lets you quickly identify whether a pod needs its CPU limits/requests right-sized. A CPU second refers to one second on a CPU: the amount of time your CPU spends actively running a process, as opposed to the elapsed time.
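
The arithmetic behind the milliCPU value can be sketched in a few lines of Python. This is a simplified illustration, assuming two readings of a CPU-seconds counter taken at the start and end of the window. The real PromQL rate() function also handles counter resets and extrapolation, which this sketch ignores. The function name millicpu_usage and the sample readings are hypothetical.

```python
def millicpu_usage(cpu_seconds_start, cpu_seconds_end, window_seconds):
    """Average milliCPU over a window, from two CPU-seconds counter readings."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # CPU seconds consumed per second of elapsed time (1.0 equals one full core).
    cpu_seconds_per_second = (cpu_seconds_end - cpu_seconds_start) / window_seconds
    # Kubernetes expresses CPU in thousandths of a core (milliCPU).
    return cpu_seconds_per_second * 1000


# Hypothetical readings: the counter grew by 30 CPU seconds over a 2-minute window.
usage = millicpu_usage(100.0, 130.0, 120)
print(f"{usage:.0f}m")  # 250m, a quarter of one core
```

A container averaging 250m against a request of 1000m, for example, would be a candidate for a smaller CPU request.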

4. Visualizing Windows nodes metrics in Amazon Managed Grafana

Visualizing Windows node metrics is as crucial as visualizing Windows pod metrics. The following table lists PromQL queries that retrieve the data scraped by Prometheus for each Windows node. The rate-based queries are computed over a two-minute window ([2m]).

Metric	Query	Unit
CPU	sum by (instance) (rate(windows_cpu_time_total{mode!="idle"}[2m])) / count by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m]))	Percent (0.0-1.0)
Memory	(1 - windows_os_physical_memory_free_bytes{} / windows_cs_physical_memory_bytes{})	Percent (0.0-1.0)
Network (sent)	rate(windows_net_bytes_sent_total{}[2m])	bytes/sec
Network (received)	rate(windows_net_bytes_received_total{}[2m])	bytes/sec
Disk (written)	sum by (instance) (rate(windows_physical_disk_write_bytes_total{}[2m]))	bytes/sec
Disk (read)	sum by (instance) (rate(windows_physical_disk_read_bytes_total{}[2m]))	bytes/sec

Check the Windows Exporter GitHub repository for a complete list of Windows nodes metrics exported.
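
The CPU and Memory rows in the preceding table both produce a fraction between 0.0 and 1.0. The following Python sketch works through the same arithmetic with made-up numbers, assuming one idle series per logical core as the CPU query implies. The function names and sample values are hypothetical.

```python
def node_cpu_used_fraction(non_idle_rates, core_count):
    """Sum of non-idle CPU-seconds-per-second across cores, divided by core count."""
    if core_count <= 0:
        raise ValueError("core_count must be positive")
    return sum(non_idle_rates) / core_count


def node_memory_used_fraction(free_bytes, total_bytes):
    """Fraction of physical memory in use: 1 - free / total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 1 - free_bytes / total_bytes


GIB = 1024 ** 3

# Hypothetical 2-core node: per-core, per-mode rates for the non-idle modes.
cpu = node_cpu_used_fraction([0.25, 0.25, 0.5, 0.0], core_count=2)
# Hypothetical node with 4 GiB free out of 16 GiB.
mem = node_memory_used_fraction(4 * GIB, 16 * GIB)

print(f"CPU used: {cpu:.0%}, memory used: {mem:.0%}")  # CPU used: 50%, memory used: 75%
```

Because both results are ratios, the Grafana unit Percent (0.0-1.0) renders them directly as percentages.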

For example, in the following query, we are filtering the total CPU usage percentage at the node level. To do so, you must create a custom legend with the value node. Furthermore, it is essential to set the Units in the panel to the ones in the preceding table.

Image 5. Grafana query panel

The Memory query generates the following panel:

Image 6. Windows nodes memory percent usage panel

Conclusion

This post covered how to deploy Windows Exporter as a DaemonSet using a HostProcess container. Then, we covered which Windows Exporter and KSM metrics to use to build a proper Grafana monitoring dashboard. You can also use these metrics to add panels to an existing Grafana dashboard, such as when an Amazon EKS cluster with a mixed data plane is deployed.

In addition, see the best practices for running Windows containers on Amazon EKS in the Amazon EKS Best Practices Guide.

Cezar Guimarães, VTEX

Cezar Guimarães is a Senior Software Engineer at VTEX. He is a key figure in the Developer Experience team and excels in harnessing cloud-native technologies to elevate development processes. He has expertise in Kubernetes and large-scale software development.
