Maximizing GPU Utilization using NVIDIA Run:ai in HAQM EKS
This post was co-authored with Chad Chapman of NVIDIA.
Introduction
In the fast-paced world of artificial intelligence and machine learning (AI/ML), GPU resources are both critical and in high demand. In this post, we cover key challenges related to GPU utilization in AI/ML applications and how NVIDIA Run:ai fractional GPU technology can help mitigate some of these challenges in HAQM Elastic Kubernetes Service (HAQM EKS).
We then cover the steps to enable the NVIDIA Run:ai platform on HAQM EKS, along with best practices. To illustrate the benefits, we conclude with a simple real-world use case and offer additional resources for further exploration.
Let’s first discuss the challenges of optimizing GPU utilization in a typical, unoptimized AI/ML workflow running on Kubernetes (K8s):
- Static Allocation: Traditionally in K8s, GPU resources are allocated statically, which often leads to under-utilization of this scarce resource. Workloads are assigned a fixed portion of GPU memory and compute power, which can leave GPU resources idle during periods of low demand or while executing tasks that don’t heavily rely on GPUs, such as data loading, CPU-bound operations, or idle time between steps.
- Resource Competition: When multiple workloads share a single GPU, they compete for memory and compute resources. This can result in diminished performance, particularly when resource allocation is not managed effectively.
- Inefficiency in Shared GPU Clusters: In shared GPU environments, workloads often do not receive the necessary resources, leading to unpredictable throughput and latency. This is because GPU compute resources are uniformly distributed among concurrent workloads without effective prioritization.
The following are some of the benefits the NVIDIA Run:ai solution provides:
- Broad GPU Support: NVIDIA Run:ai fractional GPU technology supports all NVIDIA CUDA-enabled GPUs, including the NVIDIA Pascal and Volta architectures, which precede the NVIDIA Ampere, Hopper, and Blackwell architectures on which NVIDIA Multi-Instance GPU (MIG) is supported.
- Dynamic GPU Fractions: This feature allows workloads to dynamically request and use GPU resources, based on their actual needs, ensuring optimal resource utilization. Users can specify a fraction of GPU resources required and set an upper limit, allowing flexibility in resource usage.
- Node Level Scheduler: This advanced feature optimizes GPU utilization by making local decisions on GPU allocation based on the internal state of the node. It can dynamically adjust resource allocation to maximize performance and efficiency, even spreading workloads across multiple GPUs if needed.
- Priority-Based Sharing: Workloads can be configured with specific priorities to ensure that higher-priority workloads are given preferential access to GPU resources over lower-priority workloads. Lower-priority workloads can still access GPU resources, but only after the higher-priority workload requirements have been satisfied.
- Configurable Time Slices: Users can define the ratios of time slices each workload receives, ensuring appropriate compute resource allocation based on workload importance.
- Guaranteed Quotas and Over-Quota System: Ensures fair distribution of GPU resources and allows users to access idle resources when available, improving overall GPU utilization from as low as 25% to over 75%.
- Integration with existing tools: The Run:ai solution integrates seamlessly with existing data science tools and Kubernetes, providing a user-friendly interface for managing GPU resources without disrupting existing workflows.
Run:ai consists of two main components:
- Run:ai Cluster – made up of the GPU instances attached to the cluster.
- Run:ai Control plane – monitors clusters, sets priorities, and business policies.
There are primarily two installation options for running Run:ai on HAQM EKS:
- Classic (SaaS)
- Self-hosted
The main difference between the Classic (SaaS) and Self-hosted installation options is where the Run:ai control plane is hosted. In the Self-hosted option, both the Run:ai cluster and the control plane run on the customer’s infrastructure (cloud or on-premises), whereas in the SaaS model the Run:ai cluster runs on customer infrastructure (cloud or on-premises) and connects to a cloud-based control plane managed by Run:ai (http://<tenant-name>.run.ai). The Self-hosted option is geared towards organizations that cannot use a SaaS solution due to regulatory requirements, and in this post we walk through the steps involved in the Self-hosted deployment option.

Figure 1: Run:ai Deployment options
Prerequisites
Before getting started, ensure you have:
- An AWS Account
- Basic knowledge of Kubernetes and GPU computing
- AWS Command Line Interface (AWS CLI) configured on your device or AWS CloudShell
- kubectl, a CLI tool to interact with the Kubernetes API server
- Fully Qualified Domain Name (FQDN) to install the Run:ai control plane (for example, runai.mycorp.local)
Walkthrough
Setting up Run:ai on HAQM EKS
The following are the high-level steps involved in setting up the self-hosted Run:ai deployment model, with the control plane and the cluster running on HAQM EKS; a condensed command sketch is included after the list. Visit the NVIDIA Run:ai product page to get the necessary license details and installation support.
- Create an HAQM EKS cluster using the instructions provided in the EKS documentation.
- Install the required addons on the cluster. Refer to the prerequisites page for the latest set of addons and set-up instructions.
- Install the Run:ai Control Plane.
- Install the Run:ai Cluster.
- Access the Run:ai User Interface (UI) using your FQDN and verify that the cluster status shows ‘Connected’.
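The following is a minimal command sketch of these steps, assuming eksctl and Helm are already installed and configured. The GPU Operator commands are standard, but the Run:ai Helm repository URLs, chart names, and values shown here are illustrative placeholders rather than confirmed defaults, so substitute the values provided with your license and the official installation guide.

```bash
# 1. Create an EKS cluster with a GPU-backed managed node group
#    (cluster name, region, and instance type are illustrative).
eksctl create cluster \
  --name runai-demo \
  --region us-west-2 \
  --node-type g4dn.xlarge \
  --nodes 2

# 2. Install the NVIDIA GPU Operator, one of the required add-ons
#    (see the prerequisites page for the full, current list).
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# 3. Install the Run:ai control plane (repository URL, chart name, and values
#    are placeholders; use the ones supplied with your Run:ai license).
helm repo add runai-backend <control-plane-helm-repo-from-runai>
helm install runai-backend runai-backend/control-plane \
  --namespace runai-backend --create-namespace \
  --set global.domain=runai.mycorp.local

# 4. Install the Run:ai cluster component and point it at the control plane
#    (value keys are illustrative; the Run:ai UI generates the exact install command).
helm repo add runai <cluster-helm-repo-from-runai>
helm install runai-cluster runai/runai-cluster \
  --namespace runai --create-namespace \
  --set controlPlane.url=https://runai.mycorp.local
```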
Understanding Run:ai Components
Let’s explore some of the AI/ML use cases using Run:ai on HAQM EKS.
Run:ai Workspaces
A Run:ai workspace is a logical grouping of users, resources, and jobs that helps manage team-based AI/ML workloads in a shared Kubernetes cluster. It abstracts away the complexities of containerized infrastructure, allowing researchers to focus on their core AI development tasks. You can access workspaces by navigating to the ‘Workload manager’ section of the Run:ai UI. Each workspace is tied to a specific Run:ai project and can be quickly provisioned with the necessary container images, datasets, and resource configurations. Researchers can start, stop, and resume their workspaces without losing their environment setup, accessing them through the Run:ai UI or APIs. This streamlined approach helps researchers be more productive while giving infrastructure owners control and efficiency in supporting diverse research needs.
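For teams that prefer the command line, workloads can also be launched against a project with the Run:ai CLI. The sketch below assumes the CLI is installed and logged in to your control plane; the project name and image are illustrative, and flag names vary between CLI versions, so confirm them with `runai submit --help` and the CLI reference.

```bash
# Select the project the workload should run under (project name is illustrative).
runai config project team-a

# Launch an interactive notebook-style workload on half a GPU.
# Flags follow the classic Run:ai CLI; verify them against your CLI version.
runai submit notebook-demo \
  --image jupyter/base-notebook \
  --gpu 0.5 \
  --interactive
```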
Fractional GPU Configuration
Kubernetes allows workloads to request specific resources by default. While CPU resources support fractional requests, GPU resources on older-generation GPUs do not, and GPUs must be requested in whole units, which can lead to underutilization of the GPU. Workloads are assigned a fixed portion of GPU memory and compute power, which can leave precious GPU resources idle during periods of low demand or while executing tasks that don’t heavily rely on GPUs, such as data loading, CPU-bound operations, or idle time between steps.
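For reference, this is what the default Kubernetes GPU request looks like: the nvidia.com/gpu resource only accepts whole units, so even a lightweight job reserves an entire GPU. The image tag below is illustrative.

```bash
# Default Kubernetes GPU allocation: whole GPUs only.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: whole-gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # must be an integer; a value such as 0.5 is rejected
EOF
```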
The Run:ai documentation provides instructions on how to enable fractional GPU configuration.
GPU resources can be set in the ‘Compute Resources’ tab of the ‘Workload Manager’, as shown in Figure 2.

Figure 2: Setting up Fractional GPU limits in Run:ai Workload Manager
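The same fractional request can also be expressed directly on a pod scheduled by Run:ai. The sketch below is a minimal example assuming the Run:ai scheduler is running in the cluster; the gpu-fraction annotation, the queue label, and the scheduler name are based on the Run:ai documentation but may differ between versions, so verify them for your release.

```bash
# A pod that asks the Run:ai scheduler for 50% of a single GPU.
# Annotation and label keys may differ by Run:ai version; check the docs for your release.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: fractional-gpu-pod
  annotations:
    gpu-fraction: "0.5"            # request half of one GPU
  labels:
    runai/queue: team-a            # illustrative project/queue name
spec:
  schedulerName: runai-scheduler   # hand scheduling decisions to Run:ai
  restartPolicy: Never
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image tag
    command: ["nvidia-smi"]
EOF
```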
Fractional GPU with dynamic GPU memory allocation
Run:ai’s dynamic GPU memory capability allows users to specify different request and limit values for GPU memory, similar to how Kubernetes allows different request and limit values for CPU and memory. This means that users can set a lower request value based on the average case and allocate more GPU memory up to the limit value when there is a larger input or the workload requires burstable resources.
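A minimal sketch of how such a request/limit pair might look on a Run:ai-scheduled pod follows. The annotation keys gpu-memory and gpu-memory-limit are written here as assumptions for illustration only, not confirmed API names; check the dynamic GPU memory section of the Run:ai documentation for the exact keys supported by your version.

```bash
# Illustrative only: the annotation keys below are assumptions and may differ in your Run:ai release.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: dynamic-gpu-memory-pod
  annotations:
    gpu-memory: "4G"         # assumed key: baseline GPU memory request (average case)
    gpu-memory-limit: "8G"   # assumed key: burst ceiling for GPU memory
spec:
  schedulerName: runai-scheduler
  restartPolicy: Never
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image tag
    command: ["nvidia-smi"]
EOF
```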
You can use the Run:ai console to create a Jupyter notebook for experimentation, using either a full GPU or a fractional GPU. Follow launch workspace with a Jupyter Notebook for detailed instructions. Here we show multiple notebooks running on fractions of the same GPU. No code changes are required, and different CUDA versions and container images are all supported on the same GPU. In this configuration, we have set up a fractional GPU of 0.5 for our notebook.

Figure 3: Run:ai User Interface
Lastly, as shown in the following figure, when we log in to the Jupyter notebook it sees only half of the NVIDIA T4 GPU, so the workload is fractionalized (the T4 GPU has 15.6 GB of memory, whereas the notebook sees only 7.68 GB):

Figure 4: Memory allocated to Jupyter Notebook
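You can reproduce this check from a terminal inside the notebook, or with kubectl exec into the pod (the pod name and namespace below are placeholders); nvidia-smi reports only the memory the fractional workload is allowed to see.

```bash
# From a terminal inside the notebook container:
nvidia-smi --query-gpu=name,memory.total --format=csv

# Or from outside the cluster (replace the placeholder pod name and namespace):
kubectl exec -it <notebook-pod-name> -n <namespace> -- \
  nvidia-smi --query-gpu=name,memory.total --format=csv
```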
Run:ai Dashboard
You can access system metrics through the Run:ai control-plane APIs described here, or through the Run:ai dashboard by connecting to your FQDN (for example, runai.mycorp.local). With a few intuitive clicks, users can navigate through different metrics and GPU configurations. The following figure shows the Run:ai dashboard once it is installed in a customer EKS environment.

Figure 5: Run:ai Dashboard
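The same metrics can also be pulled programmatically instead of through the dashboard. The sketch below only illustrates the general pattern of exchanging application credentials for a token and then calling an authenticated endpoint; the endpoint paths, payload fields, and token handling are placeholders, so refer to the Run:ai API reference for the actual routes.

```bash
# Endpoint paths and payload fields are placeholders; see the Run:ai API reference for the real routes.
RUNAI_URL="https://runai.mycorp.local"

# 1. Exchange application credentials (created in the Run:ai UI) for an API token,
#    then copy the token value out of the JSON response.
curl -s -X POST "${RUNAI_URL}/<token-endpoint>" \
  -H "Content-Type: application/json" \
  -d '{"clientId":"<client-id>","clientSecret":"<client-secret>"}'
export TOKEN="<token-from-response>"

# 2. Call a metrics endpoint with the bearer token.
curl -s -H "Authorization: Bearer ${TOKEN}" "${RUNAI_URL}/<metrics-endpoint>"
```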
The Run:ai user interface also provides additional information, such as:
- Quota management – provides a means to monitor and manage resource utilization in the cluster.
- Analytics – provides historical cluster utilization, GPU usage by project, a breakdown of running jobs, and node status.
- Multi-Cluster overview – provides holistic, aggregated view across GPU clusters, including cluster and node utilization details.
This results in optimal GPU resource utilization, maximizing efficiency and cost savings while improving team productivity. The Run:ai platform also provides visibility into GPU usage and workload management, enabling effective resource management aligned with business objectives.
Clean-up
To avoid ongoing charges, make sure to delete the EKS cluster resources created in your AWS account by following the instructions provided in the EKS documentation.
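If you created the environment with the command sketch shown earlier, the following removes the Run:ai Helm releases and then deletes the cluster; the release names, namespaces, cluster name, and region are the illustrative values used above.

```bash
# Uninstall the Run:ai Helm releases (names match the earlier illustrative sketch).
helm uninstall runai-cluster -n runai
helm uninstall runai-backend -n runai-backend

# Delete the EKS cluster and its node groups, which also terminates the GPU instances.
eksctl delete cluster --name runai-demo --region us-west-2
```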
Conclusion
In this blog post, we discussed the steps involved in setting up a self-hosted Run:ai deployment option with HAQM EKS to help customers unlock maximum value from their GPU infrastructure. Run:ai’s fractional GPU technology solves challenges like static allocation, resource competition, and inefficiency in shared GPU clusters. By supporting dynamic GPU fractions, node-level scheduling, and priority-based sharing, the Run:ai solution enables optimized GPU resource utilization in AWS.
Explore the NVIDIA Run:ai product documentation to learn more and get started with Run:ai. Check out AI on HAQM EKS (AIoEKS) for deploying training and inference workloads on HAQM EKS. You can also find blueprints for using open source tools such as Ray and vLLM on AIoEKS.
About the authors
Chad Chapman is a Solutions Architect at NVIDIA.
Sukirti Gupta is a Sr. GenAI Specialist at AWS.
Ashok Srirama is a Principal Specialist Solutions Architect at AWS.
Hemanth AVS is a Sr. Specialist Solutions Architect at AWS.