✨ TL;DR
GSQ is a new scalar quantization method for large language models that uses Gumbel-Softmax relaxation to jointly optimize grid assignments and scales, achieving accuracy comparable to complex vector quantization methods while remaining compatible with existing inference kernels. It successfully quantizes models to 2-3 bits per parameter and scales to trillion-parameter mixture-of-experts models.
Current weight quantization methods for LLMs face a fundamental trade-off. Simple scalar quantization techniques like GPTQ and AWQ are widely deployed and easy to implement but hit an accuracy ceiling at 3-4 bits per parameter. Meanwhile, advanced vector- and trellis-quantized methods like QTIP, GPTVQ, and AQLM achieve better accuracy at low bit-widths (2-3 bits) but are difficult to implement, hard to scale, and have limited adoption in practice. This creates a gap between what is theoretically possible and what is practically deployable, especially for local inference scenarios where extreme compression is needed.
GSQ introduces a post-training scalar quantization method that jointly optimizes per-coordinate grid assignments and per-group scales using a Gumbel-Softmax relaxation of the discrete quantization grid. The key innovation is matching the cardinality of the continuous relaxation to the small number of quantization levels available at the target bit-width (e.g., 3 levels for ternary weights, or 8 levels at 3 bits per parameter). Because each coordinate chooses among only a handful of levels, the relaxation is tight and the optimization tractable. The method uses symmetric scalar grids with group-wise quantization, ensuring full compatibility with existing scalar inference kernels while achieving the accuracy benefits typically associated with more complex vector quantization approaches.
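To make the idea concrete, here is a minimal forward-pass sketch of a Gumbel-Softmax relaxation over a small symmetric grid for one weight group. It is an illustration under stated assumptions, not the paper's implementation: the function names, the half-integer grid layout, the logit initialization from squared distance, and the temperature value are all hypothetical choices made here. The sketch shows only how relaxed and hard weights are formed; the joint optimization of logits and scales would require gradients from an autodiff framework and is not shown.

```python
import math
import random

def symmetric_grid(num_levels):
    # Assumed layout: levels centered on zero, e.g. 3 -> [-1, 0, 1].
    return [k - (num_levels - 1) / 2 for k in range(num_levels)]

def gumbel_softmax(logits, tau, rng):
    # Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau).
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def soft_dequantize(group_logits, scale, grid, tau, rng):
    # Relaxed weight per coordinate: scale * sum_k p_k * grid[k].
    out = []
    for logits in group_logits:
        probs = gumbel_softmax(logits, tau, rng)
        out.append(scale * sum(p * g for p, g in zip(probs, grid)))
    return out

def hard_dequantize(group_logits, scale, grid):
    # Discrete weight per coordinate: the grid level with the largest logit.
    out = []
    for logits in group_logits:
        best = max(range(len(logits)), key=lambda k: logits[k])
        out.append(scale * grid[best])
    return out

# One group of four weights, quantized to a ternary (3-level) grid.
rng = random.Random(0)
grid = symmetric_grid(3)
weights = [0.9, -0.1, -1.2, 0.4]
scale = max(abs(w) for w in weights) / max(grid)  # per-group scale

# Hypothetical initialization: logits favor the nearest grid level.
logits = [[-10.0 * (w / scale - g) ** 2 for g in grid] for w in weights]

soft = soft_dequantize(logits, scale, grid, tau=0.5, rng=rng)
hard = hard_dequantize(logits, scale, grid)
```

Each coordinate carries only as many logits as there are grid levels (3 here, 8 at 3 bits), which is what keeps the relaxed problem small. The hard output is an ordinary group-wise scalar format, one level index per weight plus one scale per group, which is consistent with the claim that existing scalar kernels can consume it.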
What the paper shows.
On Llama-3.1-8B and Llama-3.1-70B-Instruct models, GSQ closes most of the accuracy gap between traditional scalar quantization methods and the QTIP frontier at both 2 and 3 bits per parameter. The method successfully scales to trillion-parameter mixture-of-experts models like Kimi-K2.5, demonstrating practical applicability where vector-quantized methods struggle. GSQ maintains full compatibility with existing scalar inference kernels while achieving these accuracy improvements, making it immediately deployable in production systems.
The paper does not provide specific quantitative accuracy numbers or perplexity scores comparing GSQ to baseline methods. While the method is described as scaling to trillion-parameter models, detailed computational costs and training time comparisons are not explicitly stated. The approach still requires post-training optimization, which may be computationally expensive for very large models. The paper does not discuss potential limitations in terms of inference speed compared to simpler quantization methods, or whether there are specific model architectures or tasks where GSQ may underperform.
✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.