Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling

Fei Wang; Li Shen; Liang Ding; Chao Xue; Ye Liu; Changxing Ding

✨ TL;DR

AdaLeZO accelerates zeroth-order optimization for fine-tuning large language models by sampling layers according to their sensitivity rather than uniformly perturbing all parameters. This adaptive approach achieves a 1.7x to 3.0x wall-clock speedup over existing methods while maintaining memory efficiency and acting as a universal plug-in for existing ZO optimizers.

01 · Problem

Zeroth-order (ZO) optimization offers a memory-efficient alternative to backpropagation for fine-tuning large language models by using only forward passes. However, its practical deployment faces severe challenges: slow wall-clock convergence time and high estimation variance make it impractical for real-world applications. The authors identify that over 40% of training latency comes from generating perturbations and updating parameters. The root cause is the standard uniform exploration strategy, which treats all layers equally despite deep networks exhibiting heterogeneous sensitivity across layers. This uniform approach leads to computationally wasteful exploration where the limited perturbation budget is spent on less sensitive parameters that contribute minimally to optimization progress.
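For readers unfamiliar with the baseline, the uniform strategy described above can be sketched with the standard two-point (SPSA-style) estimator, which needs only two forward passes per step. This is a minimal illustration on a toy quadratic, not the authors' code; the function names, step size, and loss are assumptions chosen for the example.

```python
import numpy as np

def zo_two_point_grad(loss_fn, theta, eps=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate (SPSA-style).

    Every coordinate is perturbed, which is the uniform strategy
    the summary identifies as wasteful.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(theta.shape)      # random direction over ALL parameters
    loss_plus = loss_fn(theta + eps * z)      # forward pass 1
    loss_minus = loss_fn(theta - eps * z)     # forward pass 2
    projected = (loss_plus - loss_minus) / (2.0 * eps)
    return projected * z                      # gradient estimate along z

def quadratic_loss(theta):
    # Toy stand-in for a model loss (hypothetical).
    return 0.5 * float(np.sum(theta ** 2))

rng = np.random.default_rng(0)
theta = np.ones(8)
initial_loss = quadratic_loss(theta)
for _ in range(2000):
    theta = theta - 0.01 * zo_two_point_grad(quadratic_loss, theta, rng=rng)
final_loss = quadratic_loss(theta)
```

Note that generating `z` and applying the update both touch every parameter, which is where the perturbation and update latency mentioned above comes from.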

02 · Approach

AdaLeZO introduces an adaptive layer-wise sampling framework that formulates layer selection as a non-stationary Multi-Armed Bandit (MAB) problem. Instead of uniformly perturbing all parameters, the method dynamically allocates the perturbation budget to the most sensitive layers based on their contribution to the optimization objective. The framework employs sampling with replacement to select which layers to perturb at each iteration. To ensure unbiased gradient estimation despite non-uniform sampling, AdaLeZO incorporates an Inverse Probability Weighting (IPW) mechanism that reweights the gradient estimates according to their sampling probabilities. This IPW mechanism also functions as a temporal denoiser, reducing estimation variance. The approach is designed as a universal plug-and-play module that can enhance any existing ZO optimizer without requiring additional memory.
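The sampling and reweighting step can be illustrated as follows. This is a sketch under stated assumptions, not the paper's implementation: it draws a fixed budget of layers with replacement and scales each layer's estimate by its draw count divided by (budget times sampling probability), which is the textbook inverse-probability weight for sampling with replacement. The helper names, probabilities, and budget are hypothetical.

```python
import numpy as np

def sample_layer_weights(probs, budget, rng):
    """Draw `budget` layers with replacement and return IPW weights.

    weight[l] = count[l] / (budget * probs[l]), so E[weight[l]] = 1
    for every layer. Layers that are never drawn get weight 0 and
    need no perturbation in that step.
    """
    probs = np.asarray(probs, dtype=float)
    draws = rng.choice(len(probs), size=budget, replace=True, p=probs)
    counts = np.bincount(draws, minlength=len(probs))
    return counts / (budget * probs)

def ipw_layerwise_estimate(layer_grads, probs, budget, rng):
    """Reweight per-layer gradient estimates so the result stays unbiased."""
    weights = sample_layer_weights(probs, budget, rng)
    return [w * g for w, g in zip(weights, layer_grads)]

# Empirical check: averaged over many steps, each weight approaches 1.
rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.2])   # assumed non-uniform sampling probabilities
budget = 4                          # assumed number of draws per step
trials = np.stack([sample_layer_weights(probs, budget, rng) for _ in range(10000)])
mean_weights = trials.mean(axis=0)
```

Because each expected weight is 1, the reweighted estimate matches the full estimate in expectation even though only the sampled layers are perturbed in any single step.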

03 · Key insights

What the paper shows.

01 System profiling reveals that perturbation generation and parameter updates constitute over 40% of ZO training latency, representing a critical bottleneck that uniform sampling strategies fail to address.

02 Deep neural networks exhibit heterogeneous layer sensitivity, making uniform exploration fundamentally inefficient as it wastes computational budget on less informative parameters.

03 Formulating layer selection as a non-stationary Multi-Armed Bandit problem enables principled dynamic allocation of perturbation budget to maximize optimization efficiency.

04 Inverse Probability Weighting based on sampling with replacement simultaneously guarantees unbiased gradient estimation and acts as a temporal variance reduction mechanism.
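This summary does not state which bandit rule the paper uses, so the following is only one plausible way to turn the non-stationary bandit idea into sampling probabilities. It keeps an exponential moving average of a per-layer reward (so old observations fade) and mixes a softmax over those scores with a uniform floor (so no probability reaches zero and IPW weights stay bounded). The class name, the reward signal, and all hyperparameters are assumptions.

```python
import numpy as np

class EmaLayerBandit:
    """Hypothetical non-stationary bandit over layers (illustrative only)."""

    def __init__(self, num_layers, decay=0.9, temperature=1.0, floor=0.1):
        self.scores = np.zeros(num_layers)  # running sensitivity score per layer
        self.decay = decay                  # forgetting factor for non-stationarity
        self.temperature = temperature      # softmax sharpness
        self.floor = floor                  # uniform mixing mass for exploration

    def probabilities(self):
        logits = self.scores / self.temperature
        logits = logits - logits.max()      # numerical stability
        soft = np.exp(logits)
        soft = soft / soft.sum()
        n = len(self.scores)
        # Mix with uniform so every layer keeps probability >= floor / n.
        return (1.0 - self.floor) * soft + self.floor / n

    def update(self, layer, reward):
        # Exponential moving average: recent rewards dominate older ones.
        self.scores[layer] = (
            self.decay * self.scores[layer] + (1.0 - self.decay) * reward
        )

# Usage: layer 2 repeatedly yields a high reward, so its probability grows.
bandit = EmaLayerBandit(num_layers=4)
for _ in range(50):
    bandit.update(layer=2, reward=1.0)
layer_probs = bandit.probabilities()
```

The uniform floor is a design choice for this sketch: it caps the inverse-probability weight of any layer at `num_layers / floor` times its share of the draws, which keeps the reweighted estimate from blowing up when a layer is rarely sampled.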

04 · Results

Extensive experiments on LLaMA and OPT models ranging from 6.7B to 30B parameters demonstrate that AdaLeZO achieves 1.7x to 3.0x wall-clock speedup compared to state-of-the-art ZO methods. The framework maintains the memory efficiency advantages of zeroth-order optimization while significantly improving convergence speed. Importantly, AdaLeZO functions as a universal enhancement that can be seamlessly integrated with existing ZO optimizers without incurring additional memory overhead, making it broadly applicable across different ZO optimization variants.

05 · Limitations

The paper does not explicitly discuss limitations in detail. Potential implicit limitations include: the method relies on accurately estimating layer sensitivity through the MAB formulation, which may require careful tuning of the exploration-exploitation trade-off; it assumes layer-wise sensitivity is learnable and changes slowly enough to track, yet the non-stationary formulation itself implies that sensitivity patterns shift during training, which could degrade layer selection in some training phases; the computational overhead of the MAB mechanism is not thoroughly analyzed; and the experiments focus primarily on language models, leaving generalization to other domains or architectures unclear.

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
