✨ TL;DR
This paper reformulates dropout as a deterministic regularization term added directly to the training loss instead of using stochastic masking, providing fine-grained control over regularization in Transformers. Experiments show the approach matches or outperforms standard dropout across image, video, and audio tasks.
Dropout is a fundamental regularization technique in deep learning, but its stochastic nature makes it difficult to understand and control precisely. Traditional dropout applies random masking during training, which obscures the actual regularization mechanism and limits fine-grained control over different components of neural networks. For complex architectures like Transformers with multiple distinct components (attention heads, feed-forward networks), a more explicit and interpretable regularization approach would be beneficial.
The authors derive a deterministic formulation of dropout by expressing it as an explicit additive regularization term in the training loss function. For Transformer architectures, they develop specific regularization terms for the query, key, and value projections in attention and for the feed-forward network layers. Each component has its own coefficient, so its regularization strength can be controlled independently, allowing practitioners to tune regularization intensity without relying on stochastic perturbations.
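As a rough illustration of how stochastic masking can turn into an additive loss term, consider the classical single-linear-layer case, not the paper's Transformer-specific terms, which this summary does not spell out. For a linear predictor with squared error and inverted dropout at rate p on the inputs, averaging over masks gives the noise-free loss plus a penalty of p/(1-p) · Σ w_i² x_i². The Python sketch below (NumPy assumed; all function names are hypothetical) compares that closed form with a Monte Carlo average over random masks.

```python
import numpy as np

def explicit_dropout_penalty(w, x, p):
    """Variance of w @ dropout(x) under inverted dropout with drop rate p."""
    return p / (1.0 - p) * np.sum((w * x) ** 2)

def explicit_loss(w, x, y, p):
    """Squared error plus the deterministic penalty (no random mask)."""
    return (y - w @ x) ** 2 + explicit_dropout_penalty(w, x, p)

def stochastic_loss(w, x, y, p, n_samples, rng):
    """Monte Carlo estimate of the expected squared error under random masks."""
    keep = rng.random((n_samples, x.size)) >= p   # True with probability 1 - p
    x_dropped = keep * x / (1.0 - p)              # inverted-dropout rescaling
    return np.mean((y - x_dropped @ w) ** 2)

# Toy data, chosen only for illustration.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
x = rng.normal(size=8)
y = 0.5
p = 0.1

exact = explicit_loss(w, x, y, p)
estimate = stochastic_loss(w, x, y, p, n_samples=200_000, rng=rng)
print(f"explicit: {exact:.4f}  monte carlo: {estimate:.4f}")
```

Here the factor p/(1-p) acts as the regularization strength. Giving each component its own coefficient, as the paper describes for the query, key, value, and feed-forward layers, would amount to summing several such penalties with separate weights; that mapping is an assumption of this sketch, not the paper's derivation.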
What the paper shows.
Experiments across image classification, temporal action detection, and audio classification demonstrate that explicit dropout matches or outperforms conventional implicit dropout methods. Ablation studies confirm stable performance and show that regularization strength can be effectively controlled through regularization coefficients and dropout rates. Consistent gains are observed when applying the method to attention and feed-forward network layers.
The paper does not compare the computational overhead of explicit and implicit dropout. It provides limited analysis of how the approach scales to very large models or datasets. The work focuses primarily on Transformer architectures, leaving generalization to other model families and settings (e.g., diffusion models, large language models) unexplored. No theoretical analysis is provided regarding convergence properties or generalization bounds of the explicit formulation.
✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.