NVIDIA Leverages Pruning and Distillation for Efficient LLMs with TensorRT Model Optimizer
Published on October 7, 2025

NVIDIA researchers and engineers are combining structured weight pruning with knowledge distillation to compress large language models (LLMs) into smaller, more efficient variants. The goal is to reduce the resource cost of deploying LLMs while maintaining strong performance. The techniques are implemented with the NVIDIA TensorRT Model Optimizer.
Key aspects of this approach include:
- Model Pruning: Removing unimportant parameters (weights, neurons, layers) to create a more compact model with faster inference speeds and lower computational costs.
- Knowledge Distillation: Transferring knowledge from a larger 'teacher' model to a smaller 'student' model, allowing the creation of compact models that retain the high performance of the larger model.
Pruning itself can be applied along two structural axes (a minimal sketch of both follows this list):
- Depth pruning: removes entire layers from the network, reducing its depth and complexity.
- Width pruning: eliminates internal structures such as individual neurons, attention heads, or embedding channels, slimming down the model's width.
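The sketch below shows what these two axes look like in plain PyTorch, using a generic transformer layer stack and a small MLP block as stand-ins for a real LLM; the layer counts, dimensions, and the choice of which units to keep are all illustrative, and this is not the TensorRT Model Optimizer API.

```python
import torch
import torch.nn as nn

# A small stand-in for a transformer layer stack (sizes are illustrative).
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True) for _ in range(12)]
)

# Depth pruning: drop entire layers, e.g. keep every other layer to halve the depth.
kept_layers = nn.ModuleList([layers[i] for i in range(0, 12, 2)])
print(f"depth: {len(layers)} -> {len(kept_layers)} layers")

# Width pruning: shrink internal structures instead, here the hidden width of an MLP block.
mlp = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
keep = torch.arange(512)                          # neurons to keep (an illustrative choice)
slim = nn.Sequential(nn.Linear(256, 512), nn.GELU(), nn.Linear(512, 256))
with torch.no_grad():
    slim[0].weight.copy_(mlp[0].weight[keep])     # surviving rows of the up-projection
    slim[0].bias.copy_(mlp[0].bias[keep])
    slim[2].weight.copy_(mlp[2].weight[:, keep])  # matching columns of the down-projection
    slim[2].bias.copy_(mlp[2].bias)
print(f"width: {mlp[0].out_features} -> {slim[0].out_features} hidden units")
```

In practice the kept neurons or layers are chosen by an importance score rather than by position, which is what the scoring schemes below provide.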
The importance of individual weights or structures can be estimated in several ways (see the sketch after this list):
- Magnitude pruning: treats weights with small absolute values as unimportant and sets them to zero.
- Activation-based pruning: uses a calibration dataset to estimate the importance of different parts of the model from their activations.
- Structural pruning: removes entire structures, such as layers or attention heads, rather than individual weights.
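The sketch below illustrates two of these scoring schemes on a single linear layer: magnitude scores come directly from the weights, while activation-based scores require running a small calibration batch through the layer. The layer sizes, the random calibration data, and the 50% pruning ratio are placeholders, not values used by NVIDIA.

```python
import torch
import torch.nn as nn

layer = nn.Linear(512, 2048)          # stand-in for one projection inside an LLM
calib = torch.randn(256, 512)         # small calibration batch (placeholder data)

# Magnitude-based importance: weights with small absolute values contribute little,
# so the smallest ~50% are masked out (unstructured magnitude pruning).
scores = layer.weight.abs()
threshold = scores.flatten().kthvalue(int(0.5 * scores.numel())).values
mask = (scores > threshold).float()
pruned_weight = layer.weight.detach() * mask

# Activation-based importance: score each output neuron by its mean activation
# magnitude over the calibration set; low-activity neurons are pruning candidates.
with torch.no_grad():
    acts = layer(calib)                            # shape (256, 2048)
neuron_scores = acts.abs().mean(dim=0)             # one score per output neuron
least_important = neuron_scores.topk(8, largest=False).indices
print("lowest-activity neurons:", least_important.tolist())

# Structural pruning would then remove whole neurons, heads, or layers
# (as in the depth/width sketch above) rather than individual weights.
```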
On the distillation side, two loss formulations are common (sketched below):
- Response-based distillation: trains the student model to match the teacher's soft output probabilities, conveying inter-class similarities.
- Feature-based distillation: transfers a teacher's intermediate representations (hidden activations or feature maps) to guide the student toward learning similar internal structure.
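As a rough sketch of these two loss terms, the function below combines a response-based KL-divergence term on softened logits with a feature-based MSE term on a hidden state. The `teacher`/`student` tensors, the temperature, and the weighting `alpha` are illustrative assumptions, not NVIDIA's training recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                      temperature=2.0, alpha=0.5):
    """Combine response-based and feature-based distillation terms."""
    # Response-based: match the teacher's softened output distribution (KL divergence).
    response_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Feature-based: match an intermediate representation of the teacher (MSE).
    # If the hidden sizes differ, a learned projection on the student side is typically added.
    feature_loss = F.mse_loss(student_hidden, teacher_hidden)

    return alpha * response_loss + (1.0 - alpha) * feature_loss

# Example with random tensors standing in for real model outputs.
s_logits, t_logits = torch.randn(4, 32000), torch.randn(4, 32000)
s_hidden, t_hidden = torch.randn(4, 128, 1024), torch.randn(4, 128, 1024)
print(distillation_loss(s_logits, t_logits, s_hidden, t_hidden))
```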