NVIDIA Blackwell Architecture Dominates MLPerf Training v5.1 Benchmarks

NVIDIA's Blackwell architecture has set new performance standards in the MLPerf Training v5.1 benchmarks, achieving the fastest training times across all seven models. The architecture's advancements in chip design, system integration, and software optimizations have enabled unprecedented acceleration in AI model training.
The MLPerf Training v5.1 benchmarks measure the time required to train models to a specified accuracy target. NVIDIA's Blackwell and Blackwell Ultra GPUs delivered the fastest results in every benchmark, both at maximum scale and across smaller submission scales. Notably, NVIDIA was the only platform to submit results on all benchmarks, underscoring the versatility and robustness of the Blackwell architecture.
Key Benchmark Results
- Llama 3.1 405B Pretraining: Completed in 10 minutes using 5,120 Blackwell GPUs.
- Llama 3.1 8B Pretraining: Finished in 5.2 minutes with 512 Blackwell Ultra GPUs.
- Llama 2 70B LoRA Fine-tuning: Achieved in 0.40 minutes with 512 Blackwell Ultra GPUs.
- FLUX.1: Trained in 12.5 minutes using 1,152 Blackwell GPUs.
Hardware Innovations
The Blackwell architecture incorporates hardware acceleration for FP4 data formats, including NVIDIA's NVFP4 format. Blackwell GPUs deliver twice the peak FP4 throughput per clock of FP8, and Blackwell Ultra GPUs raise that to three times the FP8 rate. This advancement enables faster training times and improved energy efficiency.
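To make the block-scaled 4-bit idea concrete, the sketch below is a minimal, illustrative "fake quantization" of a tensor in an NVFP4-like style: values are grouped into 16-element blocks, each block shares a scale factor, and every element is rounded to the nearest representable FP4 (E2M1) magnitude. This is a simplification for clarity only; the actual NVFP4 format additionally encodes per-block scales in a compact low-precision form and is handled in hardware, and none of this code reflects NVIDIA's implementation.

```python
import torch

# Representable magnitudes of the 4-bit E2M1 format (each also exists with a negative sign).
E2M1_MAGNITUDES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK_SIZE = 16  # NVFP4-style quantization shares one scale factor per 16-element block


def quantize_nvfp4_like(x: torch.Tensor) -> torch.Tensor:
    """Block-wise FP4 fake quantization: quantize then dequantize a 1-D tensor."""
    blocks = x.reshape(-1, BLOCK_SIZE)
    # Map each block's largest magnitude onto the largest FP4 value (6.0).
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = amax / 6.0
    scaled = blocks / scale
    # Round each magnitude to the nearest representable E2M1 value, keeping the sign.
    codes = (scaled.abs().unsqueeze(-1) - E2M1_MAGNITUDES).abs().argmin(dim=-1)
    deq = scaled.sign() * E2M1_MAGNITUDES[codes] * scale
    return deq.reshape_as(x)


x = torch.randn(64)
xq = quantize_nvfp4_like(x)
print("mean abs quantization error:", (x - xq).abs().mean().item())
```

The per-block scale is what lets a 4-bit format track the local dynamic range of weights and activations closely enough to be usable for training.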
The Blackwell Ultra GPUs feature several key enhancements:
- 1.5x Peak NVFP4 Throughput: Updated Tensor Cores increase FP4 throughput per clock by 1.5x compared to standard Blackwell GPUs.
- 2x Softmax for Attention: An upgraded special function unit (SFU) doubles the throughput of the softmax step in attention (see the sketch after this list).
- 1.5x Larger HBM3e Capacity: Higher-capacity HBM3e stacks eliminate the need for CPU offloading and reduce model-parallel communication overheads.
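As a reminder of where that accelerated softmax sits, here is a generic scaled dot-product attention reference in PyTorch; the softmax over the score matrix is the step whose throughput the upgraded SFU doubles. This is a textbook implementation for orientation, not NVIDIA's attention kernel.

```python
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim]
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    probs = F.softmax(scores, dim=-1)  # the softmax step accelerated by the Blackwell Ultra SFU
    return probs @ v


q = k = v = torch.randn(1, 8, 128, 64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```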
Networking and Software Optimizations
NVIDIA utilized the Quantum-X800 networking platform, which includes ConnectX-8 SuperNICs, Quantum-X800 InfiniBand switches, and LinkX cables. This setup enabled the industry's first 800 Gb/s networking submission to MLPerf Training, further accelerating training times.
Software optimizations were critical to the benchmark results. For example, FP8 precision was used for the inputs to the attention batched matrix multiplies (BMMs) in the Llama 3.1 8B pretraining benchmark, yielding up to 1.3x better performance than BF16 precision. Additional optimizations included a fused RoPE kernel in Transformer Engine and the elimination of device-to-device memory copies.
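The kernel-level changes above (FP8 attention inputs, the fused RoPE kernel) live inside NVIDIA's libraries, but the general mechanism for running transformer layers in FP8 with Transformer Engine is its autocast context. The sketch below is a minimal, hypothetical example using the public transformer_engine.pytorch API with arbitrary layer dimensions; it is not the benchmark configuration, and it requires Transformer Engine installed and an FP8-capable GPU.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# FP8 scaling recipe: HYBRID uses E4M3 for forward tensors and E5M2 for gradients.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16, amax_compute_algo="max")

# A single transformer layer with illustrative (not benchmark) dimensions.
layer = te.TransformerLayer(
    hidden_size=4096,
    ffn_hidden_size=16384,
    num_attention_heads=32,
    params_dtype=torch.bfloat16,
).cuda()

# Input shaped [sequence, batch, hidden], kept in BF16 outside the FP8 regions.
x = torch.randn(2048, 2, 4096, device="cuda", dtype=torch.bfloat16)

# Inside the autocast context, supported GEMMs execute in FP8.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

print(y.shape)
```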
Conclusion
NVIDIA's Blackwell architecture has established itself as the leading platform for AI training, as demonstrated by its dominance in the MLPerf Training v5.1 benchmarks. With innovations in hardware, networking, and software, NVIDIA continues to push the boundaries of AI performance and efficiency.