Containers

HAQM EKS introduces node monitoring and auto repair capabilities

This post was jointly authored by Alex Kestner (Sr. Product Manager, HAQM EKS), Ratnopam Chakrabarti (Sr. SA, Containers & OSS), Shivam Dubey (Specialist SA, Containers), and Suket Sharma (Sr. SDE, HAQM EKS). 

Introduction

HAQM Elastic Kubernetes Service (HAQM EKS) now offers node monitoring and auto repair capabilities. This new feature enables automatic detection and remediation of node-level issues in EKS clusters, helping customers to improve the availability and reliability of their Kubernetes applications.

Node failures can lead to application downtime and often require manual intervention from operations teams. EKS node monitoring and auto repair allows customers to automate the identification and replacement of unhealthy nodes, improving the overall availability of workloads in the cluster with minimal manual intervention.

Overview

Running workloads reliably in Kubernetes clusters can be challenging. Cluster administrators often have to deal with workload downtime and resort to manual methods of monitoring and repairing degraded nodes in their clusters. Node degradation can stem from various issues, such as failing memory, faulty storage, problems with networking or accelerator drivers, and operating system errors. For example, cluster admins running machine learning (ML) workloads need to monitor GPU health on worker nodes to keep their applications stable. These GPU-related failures may not be visible to the kubelet and require additional detection mechanisms.

EKS node monitoring and auto repair provide a native capability to improve workload reliability in EKS clusters. This feature streamlines cluster maintenance by reducing the operational overhead associated with monitoring and repairing nodes manually. It continuously monitors the health of the nodes within an EKS cluster and automatically detects and replaces nodes when issues arise. Overall, it helps reduce workload downtime and allows cluster admins to focus on higher-level tasks, such as improving the performance of Kubernetes workloads and delivering a better experience for their end customers.

Node monitoring and auto repair offer several key benefits:

  1. Improved workload availability: Automatically detects and replaces unhealthy nodes, which can help in reducing application downtime and improving overall cluster health.
  2. Reduced operational overhead: Minimizes the need for manual monitoring and remediation of node issues, freeing up operations teams to focus on higher-value tasks.
  3. Repair capability available for most HAQM EKS compute options: This feature is available for EKS Managed Node Groups (MNGs), Karpenter, and EKS Auto Mode nodes.
  4. Support for GPU workloads: Improves reliability for GPU-bound ML workloads by detecting and addressing GPU-related failures.
  5. Integration with existing controls: Works seamlessly with Kubernetes Pod Disruption Budgets and other Kubernetes native disruption controls. It also works with Karpenter Node Disruption Budgets.

Two primary components are responsible for detecting and repairing node failures:

  1. Node Monitoring Agent (NMA): An agent that runs on worker nodes and detects a wide range of issues.
  2. Node Repair System: For EKS MNGs, this is a backend component that collects health information and repairs worker nodes. For Karpenter, this functionality is part of the open source project.

EKS Node Monitoring Agent

The NMA is bundled into a container image that runs as a DaemonSet on all worker nodes of the EKS cluster. The agent communicates any issues it finds by updating the status of the Kubernetes Node object in the cluster and/or by emitting Kubernetes events.
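
Because the agent surfaces issues through standard Kubernetes objects, you can inspect what it reports with ordinary kubectl commands. The following is a minimal sketch; substitute a node name from your own cluster:

kubectl describe node <node-name>    # the Conditions section reflects statuses updated by the agent
kubectl get events -A --field-selector involvedObject.kind=Node    # events emitted for Node objects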

In the following sections, we explore the agent’s capabilities in more detail.

GPU failure detection

The NMA is equipped with advanced monitoring capabilities for GPU-accelerated instances, addressing a critical need for customers running ML workloads. It detects various types of GPU failures, such as:

  • Hardware errors: Identification of Error-Correcting Code (ECC) memory errors, PCIe bus errors, and thermal throttling events.
  • Driver issues: Detection of NVIDIA driver crashes, hangs, or initialization failures.
  • Memory problems: Monitoring of GPU memory leaks and out-of-memory conditions.
  • Performance degradation: Identification of unexpected performance drops or usage issues.

The agent uses the NVIDIA Data Center GPU Manager (DCGM) and NVIDIA Management Library (NVML) to collect GPU metrics and health information.
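
If you want to manually spot-check the kind of GPU health data the agent relies on, you can query the same counters on a GPU node with standard NVIDIA tooling. This is an illustrative sketch and assumes the NVIDIA driver utilities are installed on the node:

nvidia-smi -q -d ECC            # report ECC error counters for each GPU
nvidia-smi -q -d TEMPERATURE    # report thermal state and throttling thresholds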

Types of failures detected

Beyond GPU-specific issues, the NMA monitors a wide range of node-level components:

  • Kubelet health: Checks for kubelet responsiveness and proper functioning.
  • Container runtime: Monitors for containerd issues.
  • Networking: Detects Container Network Interface (CNI) problems, missing route table entries, and packet drop issues.
  • Storage: Identifies disk space exhaustion and I/O errors.
  • System resources: Monitors CPU throttling, memory pressure, and overall system load.
  • Kernel: Detects kernel panics and critical system errors.

Installing the NMA

The NMA is available as an HAQM EKS add-on and installs the following components in the EKS cluster:

  • NMA DaemonSet: Runs on every node in the cluster, collecting health metrics and performing local diagnostics.
  • DCGM Server DaemonSet: Runs the NVIDIA DCGM server as a DaemonSet to collect GPU health data.
  • Custom Resource Definitions (CRDs): Defines NodeDiagnostic for automating node log collection.

To install the NMA, customers can use the HAQM EKS console or AWS Command Line Interface (AWS CLI) to create the HAQM EKS add-on.

aws eks create-addon --cluster-name <name of the EKS cluster> --addon-name eks-node-monitoring-agent
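
After creating the add-on, you can optionally confirm that it reached the ACTIVE state, for example:

aws eks describe-addon --cluster-name <name of the EKS cluster> --addon-name eks-node-monitoring-agent --query 'addon.status'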

Streamlined log collection

One key capability of the NMA is collecting logs from faulty nodes. The NMA provides an automated method of log collection from the worker nodes through a Kubernetes CRD called NodeDiagnostic. This saves operations engineers from having to connect to the worker nodes to collect logs, and it streamlines the overall log collection process.

To collect logs from nodes, follow these steps:

1. Create a custom resource in the cluster to trigger the collection. The name of the NodeDiagnostic object needs to match the name of the node for which you are collecting logs.

apiVersion: eks.amazonaws.com/v1alpha1
kind: NodeDiagnostic
metadata:
  name: <Name of the Node>
spec:
  logCapture:
    destination: $someS3PresignedPutURL
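
Save the manifest to a file and apply it to the cluster to start the collection (the file name here is only an example):

kubectl apply -f nodediagnostic.yaml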

2. Check the status of the log collection by running kubectl describe nodediagnostics, and check HAQM S3 to confirm that the logs have been uploaded.

The following is a snippet from the kubectl describe command indicating that the log collection from the Node is successful.

kubectl describe nodediagnostic

Name:         ip-192-168-10-102.us-west-2.compute.internal
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  eks.amazonaws.com/v1alpha1
Kind:         NodeDiagnostic

Status:
  Capture Statuses:
    State:
      Completed:
        Finished At:  2024-11-01T19:05:45Z
        Message:      successfully uploaded logs with no errors
        Reason:       Success
        Started At:   2024-11-01T19:05:39Z
    Type:             Log
Events:               <none>

For further details on automating node log collections using the NodeDiagnostic CRD, refer to the HAQM EKS user guide.

Node auto repair

For nodes in EKS MNGs, node auto repair is a backend component that collects health information from worker nodes and can repair those nodes. For Karpenter, it is an alpha feature available in v1.1.0+ of the open source project that enables Karpenter to replace unhealthy nodes automatically. The capability spans HAQM EKS managed compute offerings and open source: EKS MNGs, EKS Auto Mode, and the open source Karpenter project. In EKS MNGs it is an opt-in feature that customers can enable on a per-MNG basis, either through the MNG API or from the HAQM EKS console. EKS MNGs use the existing APIs to cycle worker nodes while respecting Pod Disruption Budgets whenever the node is responsive. The feature is enabled by default when using EKS Auto Mode.

Some of the key features of the node auto repair system are as follows:

  • The system replaces or reboots nodes within, at most, 30 minutes in response to a failing condition status on the node.
  • The system replaces or reboots nodes within, at most, 10 minutes in response to GPU failures.
  • Repair actions are logged and can be audited by the customers.
  • The repair system respects user-specified disruption controls, such as Pod Disruption Budgets. If Zonal Shift is activated in your EKS cluster, then node auto repair operations are halted.
  • Node auto repair is available for nodes in EKS MNGs, Auto Mode, or launched by the open source Karpenter project.

Node auto repair works alongside disruption controls such as Pod Disruption Budgets in EKS clusters. For MNGs, HAQM EKS cycles worker nodes using the Kubernetes Eviction API while respecting Pod Disruption Budgets. MNGs attempt to drain worker nodes on a best-effort basis, respecting Pod Disruption Budgets for a maximum of 15 minutes, after which the node is forcefully repaired.
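
As a reminder of what such a disruption control looks like, the following is a minimal Pod Disruption Budget sketch; the name and label selector are placeholders for your own workload:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb              # example name
spec:
  minAvailable: 2               # keep at least two replicas running while a node is drained
  selector:
    matchLabels:
      app: my-app               # example label matching the workload's Pods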

Walkthrough

To use node monitoring and auto repair with MNGs, follow these steps:

  • Install the NMA as an HAQM EKS add-on in your cluster.
  • Verify that the NMA Pods are in the Running state. You can either use the DescribeAddon HAQM EKS API or check the Kubernetes Pods directly.
$ kubectl get pods -n kube-system
NAME                                      READY   STATUS    RESTARTS   AGE
dcgm-server-p27k8                         1/1     Running   0          52s
dcgm-server-r8mj9                         1/1     Running   0          52s
eks-nma-eks-node-monitoring-agent-h5sjl   1/1     Running   0          52s
eks-nma-eks-node-monitoring-agent-q2fjp   1/1     Running   0          52s

When the agent has been deployed, you can view its logs directly from the agent Pods.
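The following is an illustrative command that reuses one of the Pod names from the output above; substitute a Pod name from your own cluster:

kubectl logs -n kube-system eks-nma-eks-node-monitoring-agent-h5sjl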

  • After the agent is deployed, you can enable EKS MNGs to repair your nodes when the agent detects an issue. To do this, create a new MNG with node repair enabled.
aws eks create-nodegroup \
    --cluster-name my-cluster \
    --nodegroup-name my-nodegroup \
    --node-repair-config enabled=true
  • You can also update an existing node group to enable node repair.
aws eks update-nodegroup-config \
    --cluster-name my-cluster \
    --nodegroup-name my-nodegroup \
    --node-repair-config enabled=true
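
To confirm that the setting took effect, you can describe the node group and inspect its repair configuration. The --query path below assumes the setting is returned as nodeRepairConfig, mirroring the request parameter:

aws eks describe-nodegroup \
    --cluster-name my-cluster \
    --nodegroup-name my-nodegroup \
    --query 'nodegroup.nodeRepairConfig'    # field name assumed from the --node-repair-config parameter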
  • When node repair is enabled, you can test the feature by injecting errors. For example, to inject GPU errors, exec into the DCGM Server Pods and run the following commands:
kubectl exec -n kube-system -it dcgm-server-gkchs -- /bin/sh  
dcgmi test --inject --gpuid 0 -f 319 -v 4
Successfully injected field info.

If you check the conditions of the node where the dcgm-server-gkchs Pod is running, then you should see that the AcceleratedHardwareReady status condition for that node has turned to False. This confirms that the NMA is running and can detect GPU failures.

kubectl get nodes -o json | jq '.items[1].status.conditions'

{
    "lastHeartbeatTime": "2024-10-30T19:25:51Z",
    "lastTransitionTime": "2024-10-29T18:36:29Z",
    "message": "detected 2 Nvidia Double Bit error(s) on location Device",
    "reason": "NvidiaDoubleBitError",
    "status": "False",
    "type": "AcceleratedHardwareReady"
}
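
If node repair is enabled for the node group, you can watch the affected node being drained and replaced from another terminal, for example:

kubectl get nodes -w    # watch for the unhealthy node leaving and a replacement joining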

Only certain GPU hardware failures, such as NVLink errors or GPU falling off the bus, cause the underlying Node to be rebooted instead of a Node replacement.

Using node monitoring and repair with EKS Auto Mode

The node monitoring and auto repair feature is also available for clusters using EKS Auto Mode. For EKS Auto Mode-enabled clusters, the feature is enabled by default and is always on. All repair actions are conducted by EKS Auto Mode automatically when a node is in an unhealthy state for a prolonged period.

Customers don’t need to install the NMA separately when using EKS Auto Mode, because it is included in the HAQM Machine Images (AMIs) that Auto Mode uses for its HAQM Elastic Compute Cloud (HAQM EC2) nodes. Customers using EKS Auto Mode can also use the NodeDiagnostic CRD to gain observability into the issues their nodes may be experiencing.

Conclusion

HAQM EKS node monitoring and auto repair offers a major improvement in Kubernetes operations by automatically detecting and addressing node failures that can impact workload availability. This automation enhances application reliability by proactively identifying and replacing degraded nodes, allowing HAQM EKS customers to focus on innovation rather than infrastructure management. The feature’s GPU monitoring capabilities are particularly valuable for ML workloads, where GPU failures can be difficult to detect through standard Kubernetes mechanisms. Overall, node monitoring and auto repair helps customers improve workload uptime, increase operational efficiency, and maintain reliable Kubernetes infrastructure.

To get started with node monitoring and auto repair, refer to the HAQM EKS user guide for more details.