NVIDIA NCCL 2.28 Introduces Device API and Copy Engine Collectives for Enhanced Communication

Published on November 10, 2025 at 12:00 AM
NVIDIA has released NCCL 2.28, featuring advancements aimed at fusing communication and computation for enhanced performance in multi-GPU and multi-node systems. Announced on November 10, 2025, the latest NVIDIA Collective Communications Library (NCCL) focuses on GPU-initiated networking, device APIs for communication-compute fusion, copy-engine-based collectives, and new APIs designed to build efficient, scalable distributed applications. The release aims to improve developer experience with expanded APIs, better tooling, and streamlined integration. Key improvements in performance, monitoring, reliability, and quality of service include:
  • Device API: Enables the development of custom device kernels for communication/compute fusion, including GPU-initiated networking.
  • Copy Engine (CE)-based collectives: Allows developers to use CEs to drive NVIDIA NVLink transfers, reducing contention for streaming multiprocessor (SM) compute resources.
  • NCCL Inspector: Provides a low-overhead profiling plugin for continuous observability and analysis of NCCL communication patterns.
NCCL 2.28 introduces a device-side communication API that lets NVIDIA CUDA kernels communicate directly. Kernels can initiate data movement themselves, integrating communication with compute operations for higher throughput and reduced overhead. The API supports three operation modes:
  • Load/Store Accessible (LSA): For communication between devices accessible via memory load/store operations, using CUDA P2P.
  • Multimem: For communication between devices using the hardware multicast feature provided by NVLink SHARP.
  • GPU Initiated Networking (GIN): For communication between devices initiated by the GPU using the network; GPUs manage their network operations without CPU intervention, allowing kernels to directly enqueue data transfers and synchronization steps.
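To make the distinction between the three modes concrete, here is a minimal sketch in plain Python. It is not NCCL code and calls no NCCL API: `Mode`, `pick_mode`, `domain_of`, and `multicast_ok` are hypothetical names introduced purely for illustration, and the rule it encodes (same NVLink domain means load/store reachable, otherwise use the network) is a simplifying assumption based on the descriptions above.

```python
from enum import Enum


class Mode(Enum):
    LSA = "load/store accessible"
    MULTIMEM = "multimem"
    GIN = "GPU-initiated networking"


def pick_mode(src, dsts, domain_of, multicast_ok=False):
    """Pick a mode for one sender rank and a list of receiver ranks.

    domain_of maps a rank to a hypothetical NVLink-domain id. Ranks in the
    same domain are ASSUMED to be reachable via CUDA P2P load/store.
    """
    if any(domain_of[d] != domain_of[src] for d in dsts):
        return Mode.GIN  # at least one peer is only reachable over the network
    if multicast_ok and len(dsts) > 1:
        return Mode.MULTIMEM  # one write fanned out by hardware multicast
    return Mode.LSA  # plain peer-to-peer loads and stores


# Hypothetical layout: ranks 0-3 share domain 0, ranks 4-7 share domain 1.
domain_of = {rank: rank // 4 for rank in range(8)}

print(pick_mode(0, [1], domain_of))                           # same domain, one peer
print(pick_mode(0, [1, 2, 3], domain_of, multicast_ok=True))  # same domain, fan-out
print(pick_mode(0, [5], domain_of))                           # peer in another domain
```

In a real application the choice is made by the developer when writing the device kernel, not by a helper like this; the sketch only shows which situations each mode is described as serving.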
Copy Engine (CE) collectives offload communication within the NVLink domain from SMs to dedicated hardware CEs, freeing SM resources for computational workloads and improving the overlap of communication and computation. CE-based collectives use batched APIs and NVLink multicast optimization to achieve performance comparable to SM-based collectives without requiring SM resources.
NCCL Inspector is an observability, profiling, and analysis plugin that logs detailed per-communicator and per-collective performance data and metadata. It is designed to help users analyze and debug NCCL collective operations by generating structured JSON output for each operation, giving insight into communication patterns and performance characteristics during distributed workload runs.
NCCL 2.28 also includes new host APIs for AllToAll, Gather, and Scatter operations, support for grouped symmetric kernels, a flexible configuration management system using the NCCL environment plugin, a redesigned plugin system supporting shared contexts, and a CMake-based build system.
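For readers unfamiliar with the AllToAll, Gather, and Scatter patterns named above, the following is a minimal reference model of their data movement in plain Python. It assumes the conventional MPI-style semantics for these collectives and does not call NCCL; the function names and list-based "ranks" are illustrative assumptions, not the NCCL host API.

```python
def all_to_all(send):
    """send[i][j] is the chunk rank i sends to rank j.

    Returns recv, where recv[j][i] is the chunk rank j received from rank i.
    """
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]


def gather(send, root):
    """Each rank contributes one chunk; only root ends up with all of them,
    ordered by sender rank. Non-root ranks receive nothing (None)."""
    return [list(send) if rank == root else None for rank in range(len(send))]


def scatter(chunks, root, nranks):
    """Root holds one chunk per rank; rank r receives chunks[r]."""
    assert len(chunks) == nranks and 0 <= root < nranks
    return [chunks[r] for r in range(nranks)]


# Two ranks, each holding one chunk destined for every rank.
send = [["a0", "a1"], ["b0", "b1"]]
print(all_to_all(send))            # rank 0 gets a0, b0; rank 1 gets a1, b1
print(gather(["x", "y", "z"], 1))  # only rank 1 holds the gathered list
print(scatter(["p", "q"], 0, 2))   # rank 0 gets p, rank 1 gets q
```

AllToAll is effectively a transpose of the per-rank send buffers, while Gather and Scatter are mirror images of each other around a single root rank.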