NVIDIA Blackwell Architecture Dominates MLPerf Training v5.1 Benchmarks
Published on November 12, 2025

NVIDIA's Blackwell architecture has demonstrated its prowess by delivering the fastest training times on every MLPerf Training v5.1 benchmark. The results highlight the architecture's ability to accelerate AI model training, driven by innovations at the chip, system, and software levels.
MLPerf Training v5.1, the latest iteration of the industry-standard benchmarks for AI training performance, measures the time required to train seven different models to a specified target accuracy. NVIDIA Blackwell and Blackwell Ultra GPUs excelled in every benchmark, both at the maximum scale submitted and across a range of smaller scales. NVIDIA was also the only platform to submit results across all benchmarks.
Key highlights from the MLPerf Training v5.1 results include:
- Llama 3.1 405B Pretraining: Achieved in just 10 minutes using 5,120 Blackwell GPUs.
- Llama 3.1 8B Pretraining: Completed in 5.2 minutes using 512 Blackwell Ultra GPUs.
- Llama 2 70B LoRA Fine-tuning: Finished in 0.40 minutes using 512 Blackwell Ultra GPUs.
- FLUX.1: Trained in 12.5 minutes using 1,152 Blackwell GPUs.
These results were supported by architectural advances in the Blackwell Ultra GPU:
- 1.5x Peak NVFP4 Throughput: Updated Tensor Cores deliver 1.5x higher FP4 throughput per clock than Blackwell GPUs (see the quantization sketch after this list).
- 2x Softmax for Attention: An upgraded special function unit (SFU) provides 2x higher throughput for the softmax operations at the core of attention (see the attention sketch below).
- 1.5x Larger HBM3e Capacity: Higher-capacity HBM3e stacks (12-Hi versus 8-Hi) allow the entire Llama 2 70B LoRA workload to fit on a single GPU, eliminating CPU offloading and model-parallel communication overhead (a rough memory estimate follows below).
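To give a concrete sense of the first item: NVFP4 is a 4-bit floating-point format (E2M1) combined with fine-grained block scaling, which is what lets Tensor Cores process more values per clock than 8- or 16-bit formats. The NumPy sketch below is a simplified illustration of block-scaled FP4 quantization, not NVIDIA's implementation; the 16-element block size matches published NVFP4 descriptions, but keeping the per-block scale in full precision is a simplification (the hardware format stores FP8 E4M3 block scales plus a per-tensor scale).

```python
import numpy as np

# Magnitudes representable by an E2M1 (FP4) value: 2 exponent bits, 1 mantissa bit.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blockwise(x: np.ndarray, block: int = 16):
    """Quantize a 1-D array to FP4 with one scale per `block` elements."""
    x = x.reshape(-1, block)
    # Scale each block so its largest magnitude lands on the FP4 maximum (6.0).
    scales = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero on all-zero blocks
    scaled = x / scales
    # Snap each magnitude to the nearest representable FP4 value, keep the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).ravel()

rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)
q, s = quantize_fp4_blockwise(w)
print("max abs quantization error:", np.abs(w - dequantize(q, s)).max())
```

The fine-grained scales are what make 4-bit training viable: each 16-element block gets its own dynamic range, so an outlier in one block does not crush the precision of the rest of the tensor.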
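For the second item, a minimal single-head scaled dot-product attention shows where softmax sits: it is applied to every row of a seq_len x seq_len score matrix, once per head per layer, so the exponentials and reductions inside it add up quickly at training scale. This is a generic sketch of the standard attention computation, not an NVIDIA kernel.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    # Subtracting the row max keeps exp() in range; the exponential is the
    # kind of transcendental work handled by a GPU's special function unit.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    # Scaled dot-product attention: softmax normalizes each row of the
    # (seq_len x seq_len) score matrix into attention weights.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(128, 64)) for _ in range(3))
print(attention(Q, K, V).shape)  # (128, 64)
```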
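For the third item, back-of-envelope arithmetic illustrates why stack height matters. Every figure in this sketch is an assumption for illustration: the 16-bit base-weight precision, the activation/workspace allowance, and the roughly 192 GB (8-Hi) versus 288 GB (12-Hi) HBM3e capacities commonly cited for Blackwell-class GPUs. The point is only that the frozen 70B base weights alone occupy about 140 GB at 2 bytes per parameter, so the extra capacity can decide whether the whole job stays on one GPU.

```python
# Back-of-envelope memory estimate for Llama 2 70B LoRA fine-tuning.
# All figures are illustrative assumptions, not measured values.
PARAMS = 70e9                # Llama 2 70B parameter count
BYTES_PER_PARAM = 2          # frozen base weights held in 16-bit precision
LORA_OVERHEAD_GB = 2         # adapters + optimizer state (tiny next to the base model)
ACTIVATION_GB = 60           # activations/workspace; highly workload-dependent guess

base_gb = PARAMS * BYTES_PER_PARAM / 1e9  # ~140 GB of frozen weights
total_gb = base_gb + LORA_OVERHEAD_GB + ACTIVATION_GB

HBM_8HI_GB = 192   # assumed capacity with 8-Hi HBM3e stacks
HBM_12HI_GB = 288  # assumed capacity with 12-Hi stacks (Blackwell Ultra)

print(f"estimated footprint: ~{total_gb:.0f} GB")
print(f"fits in {HBM_8HI_GB} GB (8-Hi):  {total_gb <= HBM_8HI_GB}")
print(f"fits in {HBM_12HI_GB} GB (12-Hi): {total_gb <= HBM_12HI_GB}")
```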