NVIDIA Blackwell Architecture Dominates MLPerf Training v5.1 Benchmarks

Published on November 12, 2025

NVIDIA's Blackwell architecture has set a new standard in AI training performance, achieving the fastest training times across all benchmarks in MLPerf Training v5.1. This sweeping victory underscores the critical role of advanced hardware and software innovations in meeting the demands of increasingly complex AI models.

The MLPerf Training v5.1 suite measures AI training performance as the wall-clock time needed to train each of seven models to a specified quality target. The Blackwell architecture, which underpins both Blackwell and Blackwell Ultra GPUs, posted the fastest time on every benchmark, highlighting its versatility and scalability.
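To make the metric concrete, here is a toy Python sketch of how a time-to-train score is produced: training proceeds until a fixed quality target is met, and the score is the elapsed wall-clock time. The `train_one_epoch` and `evaluate` stubs and the 0.90 target are illustrative placeholders, not MLPerf's actual harness or targets.

```python
import time
import random

TARGET_ACCURACY = 0.90  # hypothetical quality target; MLPerf fixes one per benchmark


def train_one_epoch(quality: float) -> float:
    """Stand-in for a real training epoch: nudges model quality upward."""
    return quality + random.uniform(0.05, 0.15)


def evaluate(quality: float) -> float:
    """Stand-in for a validation pass; real runs measure accuracy on held-out data."""
    return min(quality, 1.0)


def time_to_train() -> float:
    """Train until the quality target is reached; return elapsed minutes."""
    start = time.perf_counter()
    quality = 0.0
    while evaluate(quality) < TARGET_ACCURACY:
        quality = train_one_epoch(quality)
    return (time.perf_counter() - start) / 60.0


if __name__ == "__main__":
    print(f"time to train: {time_to_train():.6f} minutes")
```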

Key Benchmark Results

  • Llama 3.1 405B pretraining: completed in 10 minutes using 5,120 Blackwell GPUs.
  • Llama 3.1 8B pretraining: completed in 5.2 minutes using 512 Blackwell Ultra GPUs.
  • Llama 2 70B LoRA fine-tuning: completed in 0.40 minutes using 512 Blackwell Ultra GPUs.
  • FLUX.1 pretraining: completed in 12.5 minutes using 1,152 Blackwell GPUs.

NVIDIA was the only platform to submit results for all benchmarks, showcasing the comprehensive capabilities of its AI training stack.

Innovations in AI Data Formats

A key factor in the Blackwell architecture's success is its use of low-precision AI data formats, particularly the NVIDIA-designed NVFP4 format. The format is hardware-accelerated on Blackwell and delivers better accuracy and faster training than alternative 4-bit formats such as MXFP4. Blackwell GPUs provide twice the peak FP4 throughput per clock of FP8, while Blackwell Ultra GPUs offer three times the throughput.
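To show roughly how a block-scaled 4-bit format works, the NumPy sketch below quantizes a tensor in 16-element blocks to the FP4 (E2M1) value grid, with one scale factor per block. It is a simplified simulation only: real NVFP4 stores per-block scales in FP8 (E4M3) alongside a tensor-level FP32 scale, and Blackwell performs the conversion in Tensor Core hardware.

```python
import numpy as np

# The 8 non-negative magnitudes representable by an E2M1 (FP4) element.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # NVFP4 applies one scale factor per 16-element block


def quantize_fp4_block_scaled(x: np.ndarray) -> np.ndarray:
    """Toy quantize/dequantize round trip for block-scaled FP4.

    For simplicity the per-block scale stays in FP32 here, whereas
    NVFP4 encodes it in FP8 (E4M3).
    """
    x = x.reshape(-1, BLOCK)
    # Scale each block so its largest magnitude maps onto FP4's max (6.0).
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)
    scaled = x / scale
    # Round each scaled value to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scale).ravel()


if __name__ == "__main__":
    x = np.random.randn(64).astype(np.float32)
    xq = quantize_fp4_block_scaled(x)
    print("max abs quantization error:", np.abs(x - xq).max())
```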

Software and Hardware Optimizations

NVIDIA's submissions also incorporated a range of optimizations, including NVFP4 training recipes, FP8 precision for attention BMM inputs, and software enhancements such as a fused RoPE kernel and optimized memory copies; a reference for the RoPE computation follows below. Blackwell Ultra GPUs raise performance further with 1.5x the peak NVFP4 throughput of Blackwell GPUs, 2x Softmax throughput for attention, and 1.5x larger HBM3e capacity.
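For context on what a RoPE kernel computes, below is an unfused NumPy reference of rotary position embedding using the common half-split channel convention. The kernel fusion in NVIDIA's submission combines this rotation with adjacent memory-bound operations into a single GPU launch; the sketch shows only the underlying math.

```python
import numpy as np


def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to a (seq_len, head_dim) array."""
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per pair of channels: base**(-2i/head_dim).
    freqs = base ** (-np.arange(half) / half)               # (half,)
    angles = np.arange(seq_len)[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)


if __name__ == "__main__":
    q = np.random.randn(8, 64).astype(np.float32)
    print(rope(q).shape)  # (8, 64)
```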

Networking Advancements

The NVIDIA Quantum-X800 networking platform, featuring ConnectX-8 SuperNICs and LinkX cables, connected the Theia cluster's GB300 NVL72 racks, marking the industry's first 800 Gb/s networking submission to MLPerf Training. This networking infrastructure played a crucial role in achieving the record-breaking training times.

Implications for AI Training

The Blackwell architecture's performance highlights the importance of innovations in hardware, software, and networking for advancing AI training. These advancements pave the way for faster training times, reduced costs, and new AI breakthroughs, enabling developers to push the boundaries of AI model complexity.

By setting new standards in AI training performance, the Blackwell architecture cements NVIDIA's position as a leader in the rapidly evolving AI hardware and software landscape.