NVIDIA Leverages Pruning and Distillation for Efficient LLMs with TensorRT Model Optimizer

NVIDIA Optimizes LLMs With TensorRT Model Optimizer
NVIDIA researchers and engineers have developed a method to compress large language models (LLMs) into smaller, more efficient variants using the TensorRT Model Optimizer. This approach leverages structured weight pruning and knowledge distillation to reduce resource demands while maintaining performance.
Key Techniques
The optimization process involves two primary techniques:
- Model Pruning: Unimportant parameters are removed to create a more compact model with faster inference speeds and lower computational costs.
- Knowledge Distillation: The larger 'teacher' model's knowledge is transferred to a smaller 'student' model, enabling the student to retain most of the teacher's performance (the two steps are combined in the sketch after this list).
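To make the two-step flow concrete, the sketch below first prunes a toy stack of blocks (here, by simply dropping the last two) and then distills the smaller student against the original, unpruned teacher. This is plain PyTorch under simplified assumptions, not the TensorRT Model Optimizer API; a real pipeline operates on transformer layers and typically uses a soft-label loss over token logits.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(depth: int) -> nn.Sequential:
    # Toy stand-in for a transformer: a stack of identical blocks.
    return nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.GELU())
                           for _ in range(depth)])

teacher = make_model(depth=8)

# Step 1: prune. Here, depth pruning: keep only the first 6 blocks,
# reusing their weights as the student's initialization.
student = copy.deepcopy(nn.Sequential(*list(teacher.children())[:6]))

# Step 2: distill. Train the smaller student to match the teacher's outputs.
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
for _ in range(200):                        # toy training loop
    x = torch.randn(32, 256)                # stand-in for real batches
    with torch.no_grad():
        target = teacher(x)                 # teacher outputs as soft targets
    loss = F.mse_loss(student(x), target)   # simple response matching
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```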
Model Pruning Techniques
NVIDIA employs various pruning methods, including:
- Depth Pruning: Entire layers are removed to reduce the neural network’s depth and complexity.
- Width Pruning: Internal structures like neurons or attention heads are eliminated to slim down the model’s width.
- Magnitude Pruning: Weights with small absolute values are set to zero (a minimal sketch follows this list).
- Activation-Based Pruning: The importance of model components is estimated from their activations on a calibration dataset, and the least important are removed.
- Structural Pruning: Entire structures, such as layers or attention heads, are removed rather than individual weights; depth and width pruning are both structural.
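As a concrete illustration of magnitude pruning, the sketch below zeroes the lowest-magnitude weights of a toy feed-forward block using PyTorch's built-in pruning utilities. It is a generic example, not the TensorRT Model Optimizer API; activation-based and structural pruning instead score and remove whole components such as neurons, heads, or layers.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy feed-forward block standing in for one MLP layer of an LLM.
block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# Magnitude pruning: zero the 30% of weights with the smallest absolute
# value in each Linear layer.
for module in block.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the mask into the weight tensor

sparsity = (block[0].weight == 0).float().mean().item()
print(f"Sparsity of the first Linear layer: {sparsity:.2%}")
```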
Knowledge Distillation Styles
NVIDIA uses two distillation styles:
- Response-Based: The student model is trained to match the teacher’s soft output probabilities, conveying inter-class similarities.
- Feature-Based: The teacher’s intermediate representations guide the student toward learning similar internal structures (both losses are sketched below).
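The following sketch shows minimal versions of both losses, assuming a teacher and student that expose logits and an intermediate hidden state; it is generic PyTorch, not the TensorRT Model Optimizer API, and the projection layer used to align hidden sizes is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

def response_based_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened output distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so the gradient magnitude matches the hard-label loss.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def feature_based_loss(student_hidden, teacher_hidden, proj):
    """MSE between intermediate representations; `proj` is a hypothetical
    learned linear map that aligns dimensions when the student is narrower."""
    return F.mse_loss(proj(student_hidden), teacher_hidden)

# Example usage with random tensors standing in for real model outputs.
student_logits = torch.randn(8, 32000)        # (batch, vocab)
teacher_logits = torch.randn(8, 32000)
student_hidden = torch.randn(8, 128, 2048)    # (batch, seq, student dim)
teacher_hidden = torch.randn(8, 128, 4096)    # (batch, seq, teacher dim)
proj = torch.nn.Linear(2048, 4096)

loss = (response_based_loss(student_logits, teacher_logits)
        + 0.5 * feature_based_loss(student_hidden, teacher_hidden, proj))
print(loss.item())
```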
Experimental Results
Experiments with the TensorRT Model Optimizer showed that a depth-pruned Qwen3 6B model runs 30% faster than the Qwen3 4B model while scoring higher on the MMLU benchmark. Depth pruning reduced the network from 36 to 24 layers to produce the 6B model, and the measurements were taken on a single NVIDIA H100 80 GB HBM3 GPU.
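The sketch below shows what a naive version of that depth-pruning step might look like for a Hugging Face decoder-only checkpoint, assuming a Qwen3-style model whose 36 decoder blocks live in model.model.layers; the checkpoint name, layer path, and the choice to drop the last 12 layers are illustrative assumptions, not the article's exact procedure, which selects layers by importance and follows up with distillation.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Load the teacher checkpoint (assumed to have 36 decoder layers).
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")

# Naive depth pruning: keep the first 24 blocks. In practice the dropped
# layers are chosen by an importance metric computed on a calibration set,
# and the pruned model is then distilled to recover accuracy.
keep = 24
model.model.layers = nn.ModuleList(list(model.model.layers)[:keep])
model.config.num_hidden_layers = keep

print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.1f}B parameters")
model.save_pretrained("qwen3-depth-pruned-6b")
```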
Conclusion
NVIDIA’s approach to optimizing LLMs with the TensorRT Model Optimizer shows promise in creating resource-efficient models without compromising performance. This method could be crucial for deploying LLMs in resource-constrained environments.