NVIDIA KAI Scheduler Enables Gang Scheduling and Workload Prioritization in Ray with KubeRay

The NVIDIA KAI Scheduler is now natively integrated with KubeRay, extending the scheduling engine from NVIDIA Run:ai to Ray clusters. The integration introduces gang scheduling, workload autoscaling, workload prioritization, and hierarchical queues, capabilities that optimize AI infrastructure by coordinating job starts, sharing GPUs efficiently, and keeping high-priority workloads responsive.
Key Features of the Integration
Gang Scheduling
Gang scheduling ensures that a distributed Ray workload launches all of its worker pods together, preventing partial allocations that can stall training or inference pipelines. This is essential for keeping distributed AI workloads efficient, particularly in environments where resource fragmentation would otherwise leave jobs waiting on missing workers.
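To make this concrete, here is a minimal sketch of a gang-scheduled RayCluster (the queue name, image tag, and resource sizes are illustrative assumptions, not values from the integration docs): labeling the cluster with kai.scheduler/queue places it in a KAI queue, and the integration then admits the head and all workers as a single group.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: distributed-training
  labels:
    kai.scheduler/queue: team-a      # assumed queue name; assigns the whole cluster to a KAI queue
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.41.0   # illustrative image tag
            resources:
              limits:
                cpu: "4"
                memory: 16Gi
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 4        # the gang: either all four workers start, or none do
      minReplicas: 4
      maxReplicas: 4
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.41.0
              resources:
                limits:
                  nvidia.com/gpu: "1"
```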
Workload and Cluster Autoscaling
The integration allows Ray clusters to scale up as resources become available and scale down as demand decreases, aligning compute resources with workload needs without manual intervention. This is particularly useful for offline batch inference workloads, where resource demands can fluctuate significantly.
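Pairing this with Ray's in-tree autoscaler looks roughly like the sketch below (field values are illustrative): the worker group floats between zero and eight replicas, while KAI admits new pods only when queue capacity allows.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: batch-inference
  labels:
    kai.scheduler/queue: team-a        # assumed queue name
spec:
  enableInTreeAutoscaling: true        # Ray autoscaler resizes worker groups
  autoscalerOptions:
    idleTimeoutSeconds: 60             # remove workers idle for a minute
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.41.0
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 1
      minReplicas: 0                   # scale to zero when demand disappears
      maxReplicas: 8                   # ceiling; KAI still enforces queue quota and limit
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.41.0
              resources:
                limits:
                  nvidia.com/gpu: "1"
```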
Workload Priorities
The KAI Scheduler enables high-priority inference jobs to automatically preempt lower-priority batch training jobs when resources are limited. This ensures that critical applications remain responsive, even under heavy workload conditions.
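In practice, priority is expressed through the pod template's priorityClassName. The fragments below are a hedged sketch: train and inference are the stock priority classes shipped with the KAI Scheduler (preemptible and non-preemptible, respectively), but verify the names in your installation.

```yaml
# Fragment of a batch-training pod template (e.g., inside a RayJob):
template:
  spec:
    schedulerName: kai-scheduler
    priorityClassName: train            # low priority; may be evicted
    containers:
      - name: ray-worker
        image: rayproject/ray:2.41.0
        resources:
          limits:
            nvidia.com/gpu: "1"
---
# Fragment of an inference pod template (e.g., inside a RayService);
# when GPUs run short, KAI evicts "train" pods to place these:
template:
  spec:
    schedulerName: kai-scheduler
    priorityClassName: inference        # high priority; preempts train
    containers:
      - name: ray-serve-worker
        image: rayproject/ray:2.41.0
        resources:
          limits:
            nvidia.com/gpu: "1"
```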
Hierarchical Queuing with Priorities
Hierarchical queuing allows the creation of queues for different project teams with clear priorities. Higher-priority queues can borrow idle resources from other teams when capacity is available, optimizing resource utilization across the organization.
Requirements for Integration
Running the integration requires a Kubernetes cluster with at least one NVIDIA GPU (the example below uses an A10G), the NVIDIA GPU Operator installed, and the KAI Scheduler deployed. In addition, the KubeRay operator, whether installed from the nightly image or the Helm chart, must be configured to use the KAI Scheduler.
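As a sketch, the operator configuration through Helm values might look like the following; the batchScheduler.name key mirrors how KubeRay enables other batch schedulers such as Volcano and YuniKorn, so confirm the exact key and supported value for your chart version:

```yaml
# values.yaml for the kuberay-operator Helm chart (assumed keys)
image:
  repository: quay.io/kuberay/operator
  tag: nightly                 # KAI support may require the nightly image
batchScheduler:
  name: kai-scheduler          # route Ray pods through the KAI Scheduler
```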
Hierarchical Queuing and Resource Management
The KAI Scheduler supports hierarchical queuing, allowing teams and departments to be organized into multi-level structures with fine-grained control over resource distribution. Each queue is configured with three key parameters: quota (the queue's deserved share of resources), limit (a hard cap the queue can never exceed), and over-quota weight (how idle capacity is divided among queues that have gone beyond their quota).
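The sketch below shows a two-level hierarchy using KAI's Queue custom resource (the scheduling.run.ai/v2 API group and the quota/limit/overQuotaWeight fields follow the open-source KAI Scheduler CRDs; the names and numbers are illustrative). A value of -1 means unrestricted.

```yaml
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: department-ml            # top-level queue for a department
spec:
  resources:
    gpu:
      quota: 8                   # deserved GPUs for the whole department
      limit: 16                  # hard cap, even when borrowing idle capacity
      overQuotaWeight: 1
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a                   # child queue; workloads reference this name
spec:
  parentQueue: department-ml
  resources:
    gpu:
      quota: 4
      limit: 8
      overQuotaWeight: 2         # receives twice the surplus of a weight-1 sibling
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
```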
Practical Example: Inference Workload Preemption
A practical example deploys Qwen2.5-7B-Instruct with vLLM and Ray Serve through a RayService. It demonstrates how an inference workload can preempt a lower-priority training job using the kai.scheduler/queue label and priorityClassName settings, showing how the scheduler's capabilities play out on a real AI serving workload.
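A condensed sketch of such a RayService follows; the queue name, serve application module, and image are illustrative stand-ins rather than values from the walkthrough:

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: qwen-inference
  labels:
    kai.scheduler/queue: team-a        # assumed queue name
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: serve_app:app     # hypothetical vLLM Ray Serve app
        route_prefix: /
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          schedulerName: kai-scheduler
          priorityClassName: inference # outranks "train" jobs under GPU pressure
          containers:
            - name: ray-head
              image: rayproject/ray:2.41.0   # illustrative image
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 1
        minReplicas: 1
        maxReplicas: 2
        rayStartParams: {}
        template:
          spec:
            schedulerName: kai-scheduler
            priorityClassName: inference
            containers:
              - name: ray-worker
                image: rayproject/ray:2.41.0
                resources:
                  limits:
                    nvidia.com/gpu: "1"
```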
Conclusion
The integration of the NVIDIA KAI Scheduler with KubeRay represents a significant advancement in AI infrastructure management. By enabling gang scheduling, autoscaling, and workload prioritization, this solution optimizes resource utilization and ensures the efficient execution of AI workloads in enterprise environments.