NVIDIA NCCL 2.28 Introduces Device API and Copy Engine Collectives for Enhanced Communication

NVIDIA NCCL 2.28 Enhances GPU Communication and Performance
NVIDIA has released NCCL 2.28, which adds features for fusing communication with computation in multi-GPU and multi-node systems. This release of the NVIDIA Collective Communications Library (NCCL) introduces GPU-initiated networking, device APIs for communication-compute fusion, and copy-engine-based collectives for distributed workloads.
Key Features of NCCL 2.28
The NCCL 2.28 release focuses on improving developer experience with expanded APIs, better tooling, and streamlined integration. Key improvements include the Device API for custom device kernels, Copy Engine collectives for efficient NVLink transfers, and the NCCL Inspector for profiling and analysis.
Device API
The Device API lets developers write custom device kernels that fuse communication with computation, including GPU-initiated networking. A kernel can initiate data movement directly from the GPU instead of relying only on host-launched collectives, so communication can be interleaved with compute inside the same kernel.
Copy Engine Collectives
Copy Engine collectives offload communication within the NVLink domain from the streaming multiprocessors (SMs) to the GPU's dedicated hardware copy engines (CEs). This reduces contention for compute resources: the SMs stay available for computational kernels, which improves the overlap of communication and computation.
NCCL Inspector
The NCCL Inspector is a profiling plugin that provides continuous observability and analysis of NCCL communication patterns. It generates structured JSON output for each operation, helping users analyze and debug collective operations during distributed workload runs.
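As a rough illustration of how per-operation JSON records like these might be post-processed, the Python sketch below aggregates records by collective type. The one-record-per-line layout and the field names (coll, msg_size_bytes, exec_time_us) are assumptions made for this example, not the Inspector's documented schema; adapt them to the actual output.

```python
import json
from collections import defaultdict


def summarize_records(lines):
    """Aggregate JSON-lines records per collective.

    ASSUMPTION: each line is one JSON object with the hypothetical fields
    'coll', 'msg_size_bytes', and 'exec_time_us'.
    """
    totals = defaultdict(lambda: {"count": 0, "bytes": 0, "time_us": 0.0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        rec = json.loads(line)
        entry = totals[rec["coll"]]
        entry["count"] += 1
        entry["bytes"] += rec["msg_size_bytes"]
        entry["time_us"] += rec["exec_time_us"]

    summary = {}
    for name, e in totals.items():
        # 1 byte/us equals 1e6 bytes/s, so divide by 1e3 to get GB/s.
        gbps = e["bytes"] / e["time_us"] / 1e3 if e["time_us"] > 0 else 0.0
        summary[name] = {**e, "avg_gb_per_s": gbps}
    return summary


# Hypothetical sample records, for illustration only.
sample = [
    '{"coll": "AllReduce", "msg_size_bytes": 1048576, "exec_time_us": 100.0}',
    '{"coll": "AllReduce", "msg_size_bytes": 1048576, "exec_time_us": 100.0}',
    '{"coll": "AllGather", "msg_size_bytes": 524288, "exec_time_us": 50.0}',
]
summary = summarize_records(sample)
for coll, stats in sorted(summary.items()):
    print(coll, stats["count"], round(stats["avg_gb_per_s"], 2))
```

The throughput computed here is simply message bytes divided by execution time. It is not a bus-bandwidth figure, which would need a per-collective correction factor.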
Performance and Reliability Enhancements
NCCL 2.28 also adds new host APIs for AllToAll, Gather, and Scatter operations, support for grouped symmetric kernels, and flexible configuration management through the NCCL environment plugin. The redesigned plugin system supports shared contexts, and a CMake-based build system is included to simplify integration.
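To make the data movement behind AllToAll, Gather, and Scatter concrete, here is a small pure-Python reference model. It assumes the conventional MPI-style semantics for these collectives (chunk j of a rank's send buffer goes to rank j; received chunks are ordered by source rank). The function names are hypothetical and this is not the NCCL API; it only models buffer layout on plain lists.

```python
def all_to_all(send_bufs, count):
    """Each rank sends `count` elements to every rank.

    send_bufs[r] is rank r's send buffer of length nranks * count.
    Returns recv_bufs, where recv_bufs[r] is rank r's receive buffer.
    """
    n = len(send_bufs)
    recv_bufs = [[None] * (n * count) for _ in range(n)]
    for src in range(n):
        for dst in range(n):
            # Chunk `dst` of the sender lands in slot `src` of the receiver.
            chunk = send_bufs[src][dst * count:(dst + 1) * count]
            recv_bufs[dst][src * count:(src + 1) * count] = chunk
    return recv_bufs


def gather(send_bufs, count):
    """Return the root's receive buffer: every rank's `count` elements,
    concatenated in rank order. Non-root ranks receive nothing."""
    out = []
    for buf in send_bufs:
        out.extend(buf[:count])
    return out


def scatter(root_buf, count, nranks):
    """Split the root's buffer so that rank r receives chunk r."""
    return [root_buf[r * count:(r + 1) * count] for r in range(nranks)]


# Two ranks, one element per chunk: all-to-all behaves like a transpose.
a2a = all_to_all([[0, 1], [10, 11]], count=1)
gathered = gather([[1], [2], [3]], count=1)
scattered = scatter([1, 2, 3, 4], count=2, nranks=2)
```

In this model, scattering a buffer and then gathering the pieces returns the original buffer, which is a convenient sanity check when validating a real implementation against it.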
Conclusion
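NCCL 2.28 extends the library in three main directions: the Device API for fusing communication into custom kernels, Copy Engine collectives for offloading NVLink transfers from the SMs, and the NCCL Inspector for per-operation observability. Together with the new host APIs and the redesigned plugin and build systems, these features give developers more control when building efficient, scalable distributed applications.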