NVIDIA's NeMo Automodel Accelerates Large-Scale MoE Training in PyTorch
Published on November 6, 2025

On November 6, 2025, NVIDIA announced the release of NeMo Automodel, an open-source library within the NVIDIA NeMo framework that lets developers train large-scale Mixture-of-Experts (MoE) models directly in PyTorch. The library aims to make MoE training more accessible and efficient by addressing the traditional challenges of scaling these models across large numbers of GPUs. NeMo Automodel combines PyTorch distributed parallelism with NVIDIA acceleration technologies to reach high performance, with reported throughput of 190 to 280 TFLOPS per GPU and token processing rates of up to 13,000 tokens/sec.
The key features of NeMo Automodel include:
- Native PyTorch Integration: Allows developers to use familiar PyTorch APIs for training.
- High Performance: Achieves over 200 TFLOPS per GPU on H100 GPUs with BF16 precision, with DeepSeek V3 reaching 250 TFLOPS per GPU on 256 GPUs.
- Scalability: Supports scaling from eight to over 1,000 GPUs.
- Optimizations: Incorporates NVIDIA Transformer Engine kernels, Megatron-Core DeepEP, and GroupedGEMM to minimize communication overhead and increase GPU occupancy.
- Fully Sharded Data Parallelism (FSDP): Shards model parameters, gradients, and optimizer states across data-parallel ranks to minimize memory use (see the FSDP sketch after this list).
- Expert Parallelism (EP): Distributes MoE experts efficiently across GPUs (see the token-dispatch sketch after this list).
- Pipeline Parallelism (PP): Splits model layers into stages for memory-efficient multi-node training (a minimal sketch follows the list).
- Context Parallelism (CP): Partitions long sequences for extended-context training.
- DeepEP Token Dispatcher: Scales token routing with efficient all-to-all communication (also illustrated by the token-dispatch sketch below).
- GroupedGEMM: Aggregates multiple local expert computations into a single batched GEMM operation (see the batched-GEMM sketch below).
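
To make the FSDP item concrete, here is a minimal sketch of sharded data parallelism using PyTorch's own `torch.distributed.fsdp` API. It is not NeMo Automodel code; the toy model, the dimensions, and the launch assumptions (a `torchrun` launch with NCCL available) are illustrative only.

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via torchrun so RANK/WORLD_SIZE env vars are set.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Illustrative stand-in for a transformer block; sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(4096, 11008), nn.GELU(), nn.Linear(11008, 4096)
).cuda()

# Wrapping with FSDP shards parameters, gradients, and optimizer state
# across data-parallel ranks; each rank materializes full weights only
# transiently during forward/backward.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(2, 4096, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
```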
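
The Expert Parallelism and DeepEP Token Dispatcher items both revolve around all-to-all token routing. The sketch below shows the general pattern with plain `torch.distributed.all_to_all_single`; it is not DeepEP or NeMo Automodel code, and for simplicity it assumes every rank sends the same number of tokens to every peer, whereas real dispatchers handle uneven splits and fuse communication with routing.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

# Assumes launch via torchrun with one GPU per rank.
dist.init_process_group("nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

hidden, tokens_per_peer = 1024, 8  # illustrative sizes

# Each rank hosts one expert MLP (expert parallelism).
local_expert = nn.Sequential(
    nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
).cuda()

# Tokens already grouped by destination rank; a real dispatcher first
# sorts tokens according to the router's top-k expert assignments.
send_buf = torch.randn(world * tokens_per_peer, hidden, device="cuda")
recv_buf = torch.empty_like(send_buf)

# Dispatch: every rank receives the tokens assigned to its local expert.
dist.all_to_all_single(recv_buf, send_buf)
expert_out = local_expert(recv_buf)

# Combine: return processed tokens to the ranks they came from.
combined = torch.empty_like(expert_out)
dist.all_to_all_single(combined, expert_out)
```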
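
For the Pipeline Parallelism item, the forward-only sketch below splits a layer stack into one stage per rank and moves activations between stages with point-to-point sends and receives. It does not reflect NeMo Automodel's actual pipeline implementation or schedule: the layer counts and shapes are made up, and a production schedule would interleave multiple microbatches and run the backward pass as well.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

# Assumes launch via torchrun with one GPU per rank.
dist.init_process_group("nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

hidden, micro_batch = 1024, 4  # illustrative sizes

# Each rank holds only its slice of the layer stack (its pipeline stage),
# so no single GPU needs memory for the full model.
stage = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                      nn.Linear(hidden, hidden)).cuda()

if rank == 0:
    x = torch.randn(micro_batch, hidden, device="cuda")
else:
    x = torch.empty(micro_batch, hidden, device="cuda")
    dist.recv(x, src=rank - 1)   # activations from the previous stage

y = stage(x)

if rank < world - 1:
    dist.send(y, dst=rank + 1)   # hand activations to the next stage
else:
    print(f"final stage output shape: {tuple(y.shape)}")
```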
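
Finally, the GroupedGEMM idea can be illustrated with a single batched matmul standing in for many per-expert GEMMs. The sketch below uses `torch.bmm` and assumes, for simplicity, that every expert receives the same number of tokens; the actual GroupedGEMM kernels handle uneven token counts, and the shapes here are arbitrary.

```python
import torch

num_experts, tokens_per_expert, hidden, ffn = 8, 16, 1024, 4096  # illustrative
device = "cuda" if torch.cuda.is_available() else "cpu"

# Per-expert weights stacked into one tensor: (experts, hidden, ffn).
w = torch.randn(num_experts, hidden, ffn, device=device) * 0.02
# Tokens already grouped by expert: (experts, tokens_per_expert, hidden).
tokens = torch.randn(num_experts, tokens_per_expert, hidden, device=device)

# Naive approach: one small GEMM per expert, i.e. many kernel launches.
looped = torch.stack([tokens[e] @ w[e] for e in range(num_experts)])

# Grouped approach: all local expert computations in one batched GEMM.
grouped = torch.bmm(tokens, w)

assert torch.allclose(looped, grouped, atol=1e-4)
```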