NVIDIA's NeMo Automodel Accelerates Large-Scale MoE Training in PyTorch

Published on November 6, 2025

NVIDIA Releases NeMo Automodel for Scalable MoE Training

NVIDIA has announced the release of NeMo Automodel, an open-source library designed to streamline large-scale Mixture-of-Experts (MoE) training in PyTorch. The library tackles the central difficulty of MoE training, distributing experts, activations, and optimizer state efficiently across many GPUs, so that developers can reach high throughput without leaving the native PyTorch ecosystem.

NeMo Automodel combines PyTorch's distributed parallelism with NVIDIA acceleration technologies to deliver up to 280 TFLOPs/sec per GPU. It supports models such as DeepSeek V3 and Qwen3 MoE and scales from eight to more than 1,000 GPUs.
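
To ground the terminology, the sketch below is a minimal top-k MoE layer written in plain PyTorch. It is illustrative only: the class, shapes, and parameter names are assumptions rather than NeMo Automodel code, and production MoE layers at this scale rely on fused kernels and distributed expert placement instead of a Python loop over experts.

```python
# Minimal top-k MoE layer in plain PyTorch -- an illustration of the pattern
# NeMo Automodel trains at scale, not code from the library itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); the router picks top-k experts per token.
        scores = F.softmax(self.router(x), dim=-1)
        weights, indices = scores.topk(self.top_k, dim=-1)   # (tokens, k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            # Weight each expert's output by its routing probability.
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

moe = TinyMoE(d_model=64, d_ff=256, num_experts=8)
y = moe(torch.randn(32, 64))   # 32 tokens in, (32, 64) out
```

Because only the top-k experts run per token, total parameter count can grow far faster than per-token compute, which is exactly why efficient routing and expert placement across GPUs become the bottleneck.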

Key Features of NeMo Automodel

  • Native PyTorch integration for familiar API usage (see the FSDP sketch after this list)
  • High performance: up to 250 TFLOPs/sec per GPU at 256-GPU scale
  • Near-linear scalability from eight to over 1,000 GPUs
  • Kernel-level optimizations such as NVIDIA Transformer Engine and GroupedGEMM
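
Because the library stays within native PyTorch, its building blocks are the ones PyTorch itself exposes. The sketch below shows generic PyTorch FSDP usage launched with torchrun; it illustrates the kind of parameter sharding involved and is not NeMo Automodel's own API or configuration.

```python
# Generic PyTorch FSDP sketch (run with `torchrun --nproc_per_node=N train.py`).
# Illustrates the native PyTorch sharding NeMo Automodel builds on; it is not
# the library's own training entry point.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun

    # Stand-in dense model; a real MoE model would slot in here.
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=4,
    ).cuda()

    # Shard parameters, gradients, and optimizer state across data-parallel ranks.
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 128, 512, device="cuda")
    loss = model(x).square().mean()   # dummy objective for illustration
    loss.backward()
    optim.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

In practice, NeMo Automodel layers expert, pipeline, and context parallelism on top of this data-parallel sharding, as described next.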

Optimizations for Performance and Scalability

NeMo Automodel incorporates several optimizations to enhance performance and scalability, including:

  • Fully Sharded Data Parallelism (FSDP)
  • Expert Parallelism (EP)
  • Pipeline Parallelism (PP)
  • Context Parallelism (CP)
  • DeepEP token dispatcher for expert-parallel token routing
  • GroupedGEMM, which batches per-expert matrix multiplications into a single efficient kernel (illustrated below)
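
As a concrete illustration of the GroupedGEMM idea, the sketch below contrasts a per-expert loop of small matmuls with a single batched multiplication over tokens grouped by expert. It uses torch.bmm purely for clarity; the names, shapes, and top-1 routing are assumptions, and the library's actual implementation relies on fused GroupedGEMM and Transformer Engine kernels rather than this padding scheme.

```python
# Illustration of the GroupedGEMM idea: instead of one small GEMM per expert,
# tokens are grouped by expert and multiplied in a single batched call.
import torch

tokens, d_model, d_ff, num_experts = 1024, 512, 2048, 8
x = torch.randn(tokens, d_model)
expert_ids = torch.randint(num_experts, (tokens,))            # router output (top-1 here)
w = torch.randn(num_experts, d_model, d_ff) / d_model**0.5    # one weight matrix per expert

# Naive version: a separate small matmul per expert (many kernel launches).
naive = torch.empty(tokens, d_ff)
for e in range(num_experts):
    mask = expert_ids == e
    naive[mask] = x[mask] @ w[e]

# Grouped version: sort tokens by expert, pad each group to equal capacity,
# then run one batched matmul across all experts at once.
counts = torch.bincount(expert_ids, minlength=num_experts)
capacity = int(counts.max())
order = torch.argsort(expert_ids)                             # tokens grouped by expert
slots = torch.cat([torch.arange(int(c)) for c in counts])     # position within each group

grouped = torch.zeros(num_experts, capacity, d_model)
grouped[expert_ids[order], slots] = x[order]
out = torch.bmm(grouped, w)                                   # (experts, capacity, d_ff)

# Scatter results back to original token order and check against the loop.
result = torch.empty(tokens, d_ff)
result[order] = out[expert_ids[order], slots]
print(torch.allclose(result, naive, atol=1e-4))               # True
```

Much of the benefit comes from replacing many small, launch-bound GEMMs with one large, well-shaped one that keeps the GPU's compute units saturated.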

These optimizations enable near-linear scaling across various MoE architectures and GPU counts, making NeMo Automodel a robust solution for high-performance AI development.