NVIDIA KAI Scheduler Enables Gang Scheduling and Workload Prioritization in Ray with KubeRay

Original by Ekin Karabulut; rewritten by AI News Staff

The NVIDIA KAI Scheduler is now natively integrated with KubeRay, extending the scheduling engine from NVIDIA Run:ai to Ray clusters. This integration brings gang scheduling, workload autoscaling, workload prioritization, and hierarchical queues to Ray, coordinating job starts, sharing GPUs efficiently, and keeping high-priority workloads responsive. Key features enabled by this integration include:
  • Gang scheduling: Ensures that distributed Ray workloads launch all workers and actors together, preventing partial allocations that can stall training or inference pipelines (see the manifest sketch after this list).
  • Workload and cluster autoscaling: Allows Ray clusters to scale up as resources become available and scale down as demand decreases, aligning compute resources with workload needs without manual intervention. This is particularly useful for offline batch inference workloads.
  • Workload priorities: Enables high-priority inference jobs to automatically preempt lower-priority batch training jobs when resources are limited, maintaining application responsiveness.
  • Hierarchical queuing with priorities: Facilitates the creation of queues for different project teams with clear priorities, allowing higher-priority queues to borrow idle resources from other teams when capacity is available.
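To make gang scheduling concrete, here is a minimal sketch of a RayCluster submitted to a KAI queue; once the KubeRay operator is configured to use KAI Scheduler, the head and worker pods of a labeled cluster are placed as one group, all together or not at all. The queue name `team-a`, image tag, and resource sizes are illustrative assumptions, not values from the article.

```yaml
# Sketch: a RayCluster whose pods KAI Scheduler gang schedules.
# Assumes a KAI queue named "team-a" already exists; the image tag and
# resource sizes are placeholders.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-gang
  labels:
    kai.scheduler/queue: team-a   # submit this workload to the team-a queue
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.41.0
            resources:
              limits:
                cpu: "2"
                memory: 4Gi
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 2
      minReplicas: 2
      maxReplicas: 2
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.41.0
              resources:
                limits:
                  nvidia.com/gpu: "1"
```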
The integration of KAI Scheduler with KubeRay requires:
  • A Kubernetes cluster with at least one NVIDIA A10G GPU.
  • NVIDIA GPU Operator installed.
  • NVIDIA KAI Scheduler deployed.
  • The KubeRay operator (nightly image or Helm chart) configured to use KAI Scheduler via `--set batchScheduler.name=kai-scheduler`.
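For reference, a typical operator install with the scheduler flag from the list above might look like the following; the Helm repository URL and chart name are the standard KubeRay ones, but verify them against the KubeRay release you use.

```bash
# Add the KubeRay Helm repo and install the operator with KAI Scheduler
# enabled as its batch scheduler (flag taken from the prerequisites above).
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator \
  --set batchScheduler.name=kai-scheduler
```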
The KAI Scheduler also supports hierarchical queuing, which allows teams and departments to be organized into multi-level structures with fine-grained control over resource distribution. Key parameters for queue configuration include the following (a sample queue manifest follows the list):
  • Quota: The deserved share of resources to which a queue is entitled.
  • Limit: The upper bound on how many resources a queue can consume.
  • Over Quota Weight: Determines how surplus resources are distributed among queues with the same priority; higher weights receive a larger portion of the extra capacity.
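As a sketch of how these parameters fit together, the following two-level hierarchy uses the `Queue` custom resource from the open source KAI Scheduler project (API group `scheduling.run.ai/v2`); the queue names, the numbers, and the `parentQueue` field are assumptions to check against the KAI Scheduler version you deploy.

```yaml
# Parent queue for a department; -1 conventionally means "no bound".
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: department-a
spec:
  resources:
    gpu:
      quota: 8              # deserved share: 8 GPUs
      limit: -1             # no hard cap at the department level
      overQuotaWeight: 1
---
# Child queue for one team within the department.
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  parentQueue: department-a # assumption: field name per the KAI Scheduler CRD
  resources:
    gpu:
      quota: 4              # entitled to 4 GPUs
      limit: 8              # may never consume more than 8 GPUs
      overQuotaWeight: 2    # gets a larger slice of surplus than weight-1 peers
```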
The article walks through a practical example in which Qwen2.5-7B-Instruct, deployed with vLLM and Ray Serve via a RayService, preempts a lower-priority training job when GPUs run short, using the `kai.scheduler/queue` label and `priorityClassName` settings.
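As a rough sketch of that setup (not the article's full manifests), the two workloads might be labeled as below; only the `kai.scheduler/queue` label key and the use of `priorityClassName` come from the article, while the queue name, the priority class names, and the elided specs are assumptions.

```yaml
# Low-priority batch training job: preemptible when resources are scarce.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: batch-training
  labels:
    kai.scheduler/queue: team-a   # assumed queue created beforehand
spec:
  # ... entrypoint and rayClusterSpec elided; the Ray pod templates would set
  # priorityClassName to a low-priority class (e.g. "train") ...
---
# High-priority inference service: may preempt the training job above.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: qwen-inference
  labels:
    kai.scheduler/queue: team-a   # same queue, competing for the same quota
spec:
  # ... Ray Serve config for Qwen2.5-7B-Instruct with vLLM elided; pod
  # templates would set priorityClassName to a high-priority class
  # (e.g. "inference") ...
```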