Gradient Smoothing is an optimization technique for deep neural networks built from repeated architectural blocks, such as transformers. It was introduced by Haoming Meng and colleagues in a paper presented at the 43rd International Conference on Machine Learning (ICML 2026). It aims to improve model performance by refining the layer-wise updates applied during training.
Understanding Gradient Smoothing and Depth-wise Gradient Augmentation
Gradient Smoothing is part of a broader optimization framework called Depth-wise Gradient Augmentation. This paradigm exploits the structured relationships that develop among layers during training. The core idea is to take the updates that a block-wise optimizer computes for each layer and transform them jointly across the depth of the network, rather than treating each layer in isolation.
The authors propose a simple local Window Smoothing operator as a practical implementation of Gradient Smoothing. It works with existing optimizers such as SGD, Adam, and Muon, and adds minimal computational overhead.
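The text above does not give the exact form of the Window Smoothing operator, so the following is only an illustrative sketch of what a local smoothing step across depth could look like. It assumes a uniform moving average over neighboring blocks, with the window truncated at the first and last block. The function name `window_smooth` and the parameter `radius` are hypothetical and are not taken from the paper.

```python
import numpy as np


def window_smooth(updates, radius=1):
    """Average each block's update with those of nearby blocks.

    Assumption (not from the paper): a uniform average over the blocks
    within `radius` positions in depth, truncated at the network edges.

    updates: list of same-shaped arrays, one per repeated block,
             ordered from the first block to the last.
    """
    stacked = np.stack(updates).astype(float)  # shape: (depth, ...)
    depth = stacked.shape[0]
    smoothed = np.empty_like(stacked)
    for i in range(depth):
        lo = max(0, i - radius)          # first block in the window
        hi = min(depth, i + radius + 1)  # one past the last block
        smoothed[i] = stacked[lo:hi].mean(axis=0)
    return [smoothed[i] for i in range(depth)]


# Toy example: three blocks whose updates are constant matrices.
raw_updates = [np.full((2, 2), v) for v in (0.0, 3.0, 6.0)]
smoothed_updates = window_smooth(raw_updates, radius=1)
```

In this sketch the smoothing is applied to whatever per-block updates the base optimizer has already produced, so it sits after the optimizer's own computation and before the parameters are changed. With `radius=0` the updates pass through unchanged.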
Evaluation Across Diverse Architectures
Gradient Smoothing has been evaluated across several architectures and training regimes: language model pretraining, reinforcement learning post-training for large language models, diffusion modeling, and image classification with Vision Transformers.



