NVIDIA NCCL 2.28 Introduces Device API and Copy Engine Collectives for Enhanced Communication
Published on November 10, 2025 at 12:00 AM

NVIDIA has released NCCL 2.28, which adds features for fusing communication with computation to improve performance in multi-GPU and multi-node systems. Announced on November 10, 2025, this release of the NVIDIA Collective Communications Library (NCCL) focuses on GPU-initiated networking, device APIs for communication-compute fusion, copy-engine-based collectives, and new APIs for building efficient, scalable distributed applications. It also aims to improve the developer experience with expanded APIs, better tooling, and streamlined integration.
Key improvements in performance, monitoring, reliability, and quality of service include:
- Device API: Lets developers write custom device kernels that fuse communication with computation, including GPU-initiated networking.
- Copy Engine (CE)-based collectives: Lets developers use CEs to drive NVIDIA NVLink transfers, reducing contention for streaming multiprocessor (SM) resources that compute kernels need.
- NCCL Inspector: Provides a low-overhead profiling plugin for continuous observability and analysis of NCCL communication patterns.
The Device API currently supports three operation modes:
- Load/Store Accessible (LSA): For communication between devices whose memory is reachable through load/store operations, using CUDA P2P.
- Multimem: For communication between devices using the hardware multicast feature provided by NVLink SHARP.
- GPU-Initiated Networking (GIN): For communication between devices over the network, initiated by the GPU. GPUs manage their own network operations without CPU intervention, so kernels can enqueue data transfers and synchronization steps directly.