NVIDIA Blackwell Architecture Dominates MLPerf Training v5.1 Benchmarks

Published on November 12, 2025
NVIDIA's Blackwell architecture has demonstrated its prowess by achieving the fastest training times across every MLPerf Training v5.1 benchmark. The results highlight the architecture's ability to accelerate AI model training, driven by innovations at the chip, system, and software levels.

MLPerf Training v5.1, the latest round of the industry benchmark suite for AI training performance, measures the time required to train seven different models to a specified target accuracy. Systems powered by NVIDIA Blackwell and Blackwell Ultra GPUs excelled in every benchmark, both at maximum scale and at the various other submitted scales, and NVIDIA was the only platform to submit results across all benchmarks.

Key highlights from the MLPerf Training v5.1 results include:
  • Llama 3.1 405B Pretraining: Achieved in just 10 minutes using 5,120 Blackwell GPUs.
  • Llama 3.1 8B Pretraining: Completed in 5.2 minutes using 512 Blackwell Ultra GPUs.
  • Llama 2 70B LoRA Fine-tuning: Finished in 0.40 minutes using 512 Blackwell Ultra GPUs.
  • FLUX.1: Trained in 12.5 minutes using 1,152 Blackwell GPUs.
NVIDIA's Blackwell architecture incorporates hardware acceleration for FP4 data formats, including the NVIDIA-designed NVFP4 format. Blackwell GPUs deliver twice the peak FP4 throughput per clock of FP8, and Blackwell Ultra GPUs raise that to three times FP8. According to the paper "Pretraining Large Language Models with NVFP4", NVFP4 provides better accuracy with the same number of tokens, or equivalent accuracy with significantly fewer tokens, compared to MXFP4; a sketch of block-scaled FP4 quantization follows the list below.

Blackwell Ultra GPUs feature several enhancements:
  • 1.5x Peak NVFP4 Throughput: Updated Tensor Cores increase FP4 throughput per clock by 1.5x compared to Blackwell GPUs.
  • 2x Softmax for Attention: An upgraded special function unit (SFU) delivers 2x throughput for the softmax operations at the heart of attention.
  • 1.5x Larger HBM3e Capacity: Higher-capacity HBM3e stacks (12-Hi versus 8-Hi) allow the entire Llama 2 70B LoRA model to fit on a single GPU, eliminating CPU offloading and model-parallel communication overheads.
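To make the FP4 discussion more concrete, here is a minimal numpy sketch of block-scaled 4-bit quantization in the spirit of NVFP4: values are snapped to the FP4 (E2M1) grid and each small block carries its own scale. The 16-element block size and the plain float scale are simplifying assumptions for illustration only; the actual NVFP4 encoding and its Tensor Core execution path are not reproduced here.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 element format (sign handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # assumed micro-block size for this sketch

def quantize_dequantize_fp4(x: np.ndarray) -> np.ndarray:
    """Round-trip a 1-D tensor through block-scaled FP4 (illustrative only)."""
    out = np.empty_like(x, dtype=np.float32)
    for start in range(0, len(x), BLOCK):
        block = x[start:start + BLOCK].astype(np.float32)
        # Per-block scale maps the largest magnitude onto the largest FP4 value (6.0).
        amax = np.max(np.abs(block))
        scale = amax / 6.0 if amax > 0 else 1.0
        scaled = block / scale
        # Round each element to the nearest representable E2M1 magnitude, keeping the sign.
        idx = np.argmin(np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]), axis=1)
        out[start:start + BLOCK] = np.sign(scaled) * E2M1_GRID[idx] * scale
    return out

x = np.random.randn(64).astype(np.float32)
x_fp4 = quantize_dequantize_fp4(x)
print("max abs error:", np.max(np.abs(x - x_fp4)))
```

The per-block scale is what lets a 4-bit grid cover tensors with widely varying magnitudes; narrower blocks track local statistics more closely, which is the general motivation behind micro-scaled formats such as NVFP4 and MXFP4.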
To connect the GB300 NVL72 racks that form the Theia cluster, NVIDIA used the NVIDIA Quantum-X800 networking platform, featuring NVIDIA ConnectX-8 SuperNICs, NVIDIA Quantum-X800 InfiniBand switches, and NVIDIA LinkX cables. This marked the industry's first 800 Gb/s networking submission to MLPerf Training.

Software optimizations also played a crucial role. For the Llama 3.1 8B pretraining benchmark, NVIDIA used FP8 precision for the attention batched matrix multiply (BMM) inputs, yielding up to 1.3x better performance in the attention kernel compared to BF16 precision. Other optimizations included a fused RoPE kernel in Transformer Engine and the elimination of device-to-device memory copies.
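As a rough, non-authoritative sketch of how lower-precision execution is typically enabled at the framework level, the snippet below runs a Transformer Engine layer under FP8 autocast in PyTorch. The layer dimensions and recipe settings are illustrative assumptions, and this generic recipe is not the specific attention-BMM optimization used in the MLPerf submission.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Illustrative FP8 scaling recipe (E4M3 forward / E5M2 backward); settings are assumptions.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID,
                            amax_history_len=16,
                            amax_compute_algo="max")

# A single Transformer layer with hypothetical dimensions chosen for this sketch.
layer = te.TransformerLayer(
    hidden_size=4096,
    ffn_hidden_size=16384,
    num_attention_heads=32,
    params_dtype=torch.bfloat16,
).cuda()

# Input is (sequence, batch, hidden) in BF16; FP8 casting happens inside the autocast region.
x = torch.randn(128, 2, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

print(y.shape)  # torch.Size([128, 2, 4096])
```

The general pattern is that weights and activations stay in a higher precision outside the autocast region, while the GEMM-heavy paths inside it run on the GPU's low-precision Tensor Core formats with per-tensor scaling managed by the recipe.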