Explicit Dropout: Deterministic Regularization for Transformer Architectures

Vidhi Agrawal; Illia Oleksiienko; Alexandros Iosifidis

✨ TL;DR

This paper reformulates dropout as a deterministic regularization term added directly to the training loss instead of using stochastic masking, providing fine-grained control over regularization in Transformers. Experiments show the approach matches or outperforms standard dropout across image, video, and audio tasks.

01 · Problem

Dropout is a fundamental regularization technique in deep learning, but its stochastic nature makes it difficult to understand and control precisely. Traditional dropout applies random masking during training, which obscures the actual regularization mechanism and limits fine-grained control over different components of a network. Transformers combine several distinct components (attention heads, feed-forward networks), so an explicit, interpretable regularizer that can be tuned per component would be beneficial.

02 · Approach

The authors derive a deterministic formulation of dropout by expressing it as an explicit additive regularization term in the training loss function. For Transformer architectures, they develop specific regularization terms for the query, key, and value projections in the attention mechanism and for the feed-forward network layers. Each component has its own coefficient, so practitioners can tune regularization strength per component without relying on stochastic perturbations.
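The idea of replacing random masks with a loss term can be illustrated on the simplest possible case. The sketch below is not the paper's Transformer formulation, which this summary does not reproduce. It assumes a single linear unit with inverted dropout on its inputs and a squared-error loss, for which averaging over all masks is known to give the ordinary loss plus a penalty of p / (1 - p) times the sum of squared input-weight products. The function names and the sample numbers are made up for illustration.

```python
from itertools import product


def explicit_dropout_loss(x, w, y, p):
    """Squared error plus the closed-form penalty obtained by averaging
    inverted input dropout over all masks (linear unit only)."""
    pred = sum(xi * wi for xi, wi in zip(x, w))
    penalty = p / (1.0 - p) * sum((xi * wi) ** 2 for xi, wi in zip(x, w))
    return (y - pred) ** 2 + penalty


def expected_stochastic_loss(x, w, y, p):
    """Exact expectation of the squared error under stochastic dropout,
    computed by enumerating every binary mask (small inputs only)."""
    keep = 1.0 - p
    total = 0.0
    for mask in product((0, 1), repeat=len(x)):
        # Probability of this particular mask under independent Bernoulli keeps.
        prob = 1.0
        for m in mask:
            prob *= keep if m else p
        # Inverted dropout: kept inputs are rescaled by 1 / keep.
        pred = sum(m * xi * wi / keep for m, xi, wi in zip(mask, x, w))
        total += prob * (y - pred) ** 2
    return total


# Illustrative values, not taken from the paper.
x = [0.5, -1.0, 2.0]
w = [1.5, 0.25, -0.75]
y, p = 0.3, 0.2

print(explicit_dropout_loss(x, w, y, p))
print(expected_stochastic_loss(x, w, y, p))
```

The two printed values agree up to floating-point rounding, which is the sense in which the deterministic term stands in for the stochastic masking in this toy setting. For nonlinear layers such as attention, an exact match is not generally available, and the paper derives its own component-specific terms.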

03 · Key insights

What the paper shows.

01 Dropout can be reformulated as a deterministic regularizer in the loss function, removing dependence on stochastic masking while maintaining regularization effects

02 Fine-grained control is possible by applying different regularization strengths to different Transformer components (attention vs feed-forward layers)

03 Explicit dropout provides interpretability by making the regularization mechanism transparent in the optimization objective

04 The deterministic approach maintains or improves performance compared to conventional stochastic dropout across diverse domains
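The per-component control described in the insights above can be sketched as a weighted sum of penalties added to the task loss. This is a minimal illustration under stated assumptions: the component names, penalty values, and coefficients are hypothetical placeholders, and the penalties are passed in as precomputed numbers because the paper's actual terms are not given in this summary.

```python
def total_loss(task_loss, penalties, coefficients):
    """Task loss plus each component's penalty scaled by its own coefficient."""
    return task_loss + sum(
        coefficients[name] * value for name, value in penalties.items()
    )


# Hypothetical per-component penalty values (placeholders, not paper results).
penalties = {"query": 0.5, "key": 0.25, "value": 1.0, "ffn": 2.0}

# Hypothetical coefficients: a value of 0.0 switches a component's term off.
coefficients = {"query": 0.1, "key": 0.1, "value": 0.0, "ffn": 0.5}

print(total_loss(1.0, penalties, coefficients))
```

Because each coefficient enters the objective directly, changing one of them alters only that component's contribution, which is harder to arrange when a single stochastic mask rate governs a whole block.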

04 · Results

Experiments across image classification, temporal action detection, and audio classification demonstrate that explicit dropout matches or outperforms conventional implicit dropout methods. Ablation studies confirm stable performance and show that regularization strength can be effectively controlled through regularization coefficients and dropout rates. Consistent gains are observed when applying the method to attention and feed-forward network layers.

05 · Limitations

The paper does not discuss computational overhead comparisons between explicit and implicit dropout methods. Limited analysis is provided on how the approach scales to very large models or datasets. The work focuses primarily on Transformer architectures, leaving generalization to other modern architectures (e.g., diffusion models, large language models) unexplored. No theoretical analysis is provided regarding convergence properties or generalization bounds of the explicit formulation.

✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.
