GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

Alireza Dadgarnia; Soroush Tabesh; Mahdi Nikdan; Michael Helcig; Eldar Kurtic; Dan Alistarh

✨ TL;DR

GSQ is a new scalar quantization method for large language models that uses Gumbel-Softmax relaxation to jointly optimize grid assignments and scales, achieving accuracy comparable to complex vector quantization methods while remaining compatible with existing inference kernels. It successfully quantizes models to 2-3 bits per parameter and scales to trillion-parameter mixture-of-experts models.

01 · Problem

Current weight quantization methods for LLMs face a fundamental trade-off. Simple scalar quantization techniques like GPTQ and AWQ are widely deployed and easy to implement but hit an accuracy ceiling at 3-4 bits per parameter. Meanwhile, advanced vector- and trellis-quantized methods like QTIP, GPTVQ, and AQLM achieve better accuracy at low bit-widths (2-3 bits) but are difficult to implement, hard to scale, and have limited adoption in practice. This creates a gap between what is theoretically possible and what is practically deployable, especially for local inference scenarios where extreme compression is needed.

02 · Approach

GSQ introduces a post-training scalar quantization method that jointly optimizes per-coordinate grid assignments and per-group scales using a Gumbel-Softmax relaxation of the discrete quantization grid. The key innovation is matching the cardinality of the continuous relaxation to the small number of quantization levels available at the target bit-width (e.g., from 3 levels for ternary to 8 levels at 3 bits per parameter). This makes the relaxation tight and the optimization tractable. The method uses symmetric scalar grids with group-wise quantization, ensuring full compatibility with existing scalar inference kernels while achieving the accuracy benefits typically associated with more complex vector quantization approaches.
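The relaxation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the evenly spaced zero-centered grid, the absmax scale, and the distance-based logit initialization are all hypothetical choices made for the example. Each weight holds one logit per grid level, so the relaxation has exactly as many categories as the target bit-width has levels.

```python
import math
import random

def symmetric_grid(num_levels):
    """Evenly spaced levels centered on zero, e.g. 3 -> [-1, 0, 1] (assumed layout)."""
    half = (num_levels - 1) / 2.0
    return [i - half for i in range(num_levels)]

def gumbel_softmax(logits, tau, rng):
    """One Gumbel-Softmax sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so both log() calls stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    top = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def soft_dequantize(logits_per_weight, scale, grid, tau, rng):
    """Relaxed weights: group scale times the probability-weighted grid level."""
    out = []
    for logits in logits_per_weight:
        probs = gumbel_softmax(logits, tau, rng)
        out.append(scale * sum(p * g for p, g in zip(probs, grid)))
    return out

def hard_dequantize(logits_per_weight, scale, grid):
    """Final weights: each coordinate snaps to its highest-logit grid level."""
    out = []
    for logits in logits_per_weight:
        k = max(range(len(grid)), key=lambda i: logits[i])
        out.append(scale * grid[k])
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    grid = symmetric_grid(3)             # ternary: three levels
    weights = [0.42, -0.07, -0.51, 0.03]  # one toy quantization group
    scale = max(abs(w) for w in weights)  # assumed absmax starting scale
    # Hypothetical initialization: favor levels close to w / scale.
    logits = [[-10.0 * (w / scale - g) ** 2 for g in grid] for w in weights]
    print(soft_dequantize(logits, scale, grid, tau=1.0, rng=rng))
    print(hard_dequantize(logits, scale, grid))
```

In a real training loop the logits and the scale would be updated by gradient descent through the soft weights, typically with the temperature `tau` lowered over time so the soft weights approach the hard ones; that loop is omitted here because the summary does not describe it.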

03 · Key insights

What the paper shows.

01 The accuracy gap between scalar and vector quantization methods is not fundamental but can be largely closed through careful optimization of scalar quantizers.

02 Matching the Gumbel-Softmax relaxation cardinality to the target number of quantization levels makes the discrete optimization problem tractable and the relaxation tight.

03 Joint optimization of grid assignments and scales is crucial for achieving high accuracy at extreme low bit-widths (2-3 bits per parameter).

04 Maintaining compatibility with scalar quantization infrastructure enables practical deployment while achieving near-frontier accuracy.
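To see why assignments and scales interact (insight 03), the toy sketch below uses a deliberately simpler stand-in technique: alternating between round-to-nearest assignment and a closed-form least-squares scale on a single group. This is not GSQ's Gumbel-Softmax optimization, and all function names and the example weights are assumptions made for illustration. It only shows that changing the scale can change the best assignments and vice versa, which is the coupling a joint optimizer has to handle.

```python
def absmax_scale(weights, grid):
    """Baseline scale: map the largest |weight| onto the largest grid level."""
    return max(abs(w) for w in weights) / max(abs(g) for g in grid)

def round_to_nearest(weights, scale, grid):
    """Index of the closest scaled grid level for each weight."""
    return [min(range(len(grid)), key=lambda k: abs(w - scale * grid[k]))
            for w in weights]

def least_squares_scale(weights, codes, grid):
    """Scale minimizing sum (w - s * grid[code])^2 for fixed codes."""
    num = sum(w * grid[k] for w, k in zip(weights, codes))
    den = sum(grid[k] ** 2 for k in codes)
    return num / den if den > 0 else 0.0

def squared_error(weights, codes, scale, grid):
    """Total squared reconstruction error of the group."""
    return sum((w - scale * grid[k]) ** 2 for w, k in zip(weights, codes))

def alternate(weights, grid, steps=10):
    """Alternate assignment and scale updates, starting from the absmax scale."""
    scale = absmax_scale(weights, grid)
    for _ in range(steps):
        codes = round_to_nearest(weights, scale, grid)
        new_scale = least_squares_scale(weights, codes, grid)
        if new_scale <= 0:
            break  # degenerate group (all codes at zero); keep the old scale
        scale = new_scale
    return round_to_nearest(weights, scale, grid), scale

if __name__ == "__main__":
    grid = [-1.0, 0.0, 1.0]  # ternary grid
    weights = [0.9, -0.1, 0.05, 0.4, -0.5, 0.02, -0.8, 0.3]
    s0 = absmax_scale(weights, grid)
    c0 = round_to_nearest(weights, s0, grid)
    codes, s = alternate(weights, grid)
    print("absmax error:     ", squared_error(weights, c0, s0, grid))
    print("alternating error:", squared_error(weights, codes, s, grid))
```

Each alternating step can only lower or preserve the squared error, so the result is never worse than the absmax starting point on this objective.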

04 · Results

On Llama-3.1-8B and Llama-3.1-70B-Instruct models, GSQ closes most of the accuracy gap between traditional scalar quantization methods and the QTIP frontier at both 2 and 3 bits per parameter. The method successfully scales to trillion-parameter mixture-of-experts models like Kimi-K2.5, demonstrating practical applicability where vector-quantized methods struggle. GSQ maintains full compatibility with existing scalar inference kernels while achieving these accuracy improvements, making it immediately deployable in production systems.

05 · Limitations

The paper does not provide specific quantitative accuracy numbers or perplexity scores comparing GSQ to baseline methods. While the method is described as scaling to trillion-parameter models, detailed computational costs and training time comparisons are not explicitly stated. The approach still requires post-training optimization, which may be computationally expensive for very large models. The paper does not discuss potential limitations in terms of inference speed compared to simpler quantization methods, or whether there are specific model architectures or tasks where GSQ may underperform.

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
