✨ TL;DR
Stream-CQSA enables exact self-attention computation on long sequences by decomposing attention into independent subsequence computations that fit within arbitrary memory budgets, allowing billion-token sequences to run on a single GPU without approximation. This removes the assumption that full query, key, and value tensors must fit in device memory, addressing the quadratic memory bottleneck of standard attention.
Large language models with long contexts face a fundamental scalability limitation due to the quadratic memory cost of exact self-attention. While existing methods reduce memory complexity toward linear, they still assume that the complete query, key, and value tensors can fit entirely in device memory. This assumption breaks for very long sequences, causing out-of-memory failures on modern hardware and preventing the deployment of long-context models.
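To make the scale concrete, the arithmetic can be sketched as follows. The head dimension of 128, the 2-byte (fp16/bf16) element size, and the single attention head are illustrative assumptions rather than figures from the paper, and `attention_memory_bytes` is a hypothetical helper.

```python
def attention_memory_bytes(n_tokens, head_dim=128, bytes_per_elem=2):
    """Rough memory footprint of one attention head (assumed sizes).

    Returns (bytes for the full n x n score matrix,
             bytes for the Q, K, and V tensors together).
    """
    scores = n_tokens * n_tokens * bytes_per_elem   # quadratic term
    qkv = 3 * n_tokens * head_dim * bytes_per_elem  # linear term
    return scores, qkv


scores, qkv = attention_memory_bytes(10**9)
print(f"score matrix: {scores / 1e18:.1f} EB, Q/K/V: {qkv / 1e9:.0f} GB")
# prints "score matrix: 2.0 EB, Q/K/V: 768 GB"
```

Under these assumptions, even a method that never materializes the score matrix would still need about 768 GB for Q, K, and V alone at one billion tokens, which is the assumption the paper sets out to remove.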
The paper introduces CQS Divide, an operation grounded in the theory of cyclic quorum sets that decomposes attention into a set of independent subsequence computations. Recomposing these subcomputations yields exactly the same result as full-sequence attention. Building on this decomposition, Stream-CQSA is a memory-adaptive scheduling framework that partitions the attention computation into subproblems sized to fit an arbitrary memory budget. Attention thus becomes a collection of schedulable tasks rather than a monolithic operation, and the tasks can be executed flexibly across devices without inter-device communication.
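The general idea of exact decomposition and recomposition can be illustrated with a small NumPy sketch. This is not the paper's CQS Divide: it is a minimal illustration that assumes a cyclic quorum set built from the base set {0, 1, 3} modulo 7 (so every pair of blocks shares at least one quorum) and swaps in the standard log-sum-exp rescaling used by online-softmax methods to merge partial results. All function names here are hypothetical.

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v over the whole sequence."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def cyclic_quorums(n, base):
    """Quorum number s is the base set shifted by s, modulo n."""
    return [sorted((b + s) % n for b in base) for s in range(n)]

def partial_attention(q_blk, k_blk, v_blk):
    """Unnormalised attention of one query block against one key block.
    Returns the row max m, row sum l, and accumulator o = exp(s - m) @ v."""
    s = q_blk @ k_blk.T / np.sqrt(q_blk.shape[-1])
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    return m, p.sum(axis=-1, keepdims=True), p @ v_blk

def merge(a, b):
    """Combine two partial results for the same query rows (log-sum-exp)."""
    (ma, la, oa), (mb, lb, ob) = a, b
    m = np.maximum(ma, mb)
    wa, wb = np.exp(ma - m), np.exp(mb - m)
    return m, wa * la + wb * lb, wa * oa + wb * ob

def quorum_attention(q, k, v, n_blocks=7, base=(0, 1, 3)):
    """Exact attention assembled from per-quorum subcomputations (sketch)."""
    qs, ks, vs = (np.array_split(x, n_blocks) for x in (q, k, v))
    state, seen = {}, set()  # state: query block index -> (m, l, o)
    for quorum in cyclic_quorums(n_blocks, base):
        # This task needs only the blocks listed in `quorum`.
        for i in quorum:
            for j in quorum:
                if (i, j) in seen:  # each block pair is computed once
                    continue
                seen.add((i, j))
                part = partial_attention(qs[i], ks[j], vs[j])
                state[i] = merge(state[i], part) if i in state else part
    # The quorums must cover every (query block, key block) pair.
    assert len(seen) == n_blocks ** 2
    return np.concatenate([state[i][2] / state[i][1] for i in range(n_blocks)])

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 70, 16))
print(np.allclose(quorum_attention(q, k, v), full_attention(q, k, v)))
```

In this sketch each quorum touches only 3 of the 7 blocks at a time, so the working set is bounded by the quorum size rather than the sequence length. The final check compares the recomposed output against the full-sequence reference.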
What the paper shows.
Experiments show that Stream-CQSA achieves predictable memory scaling and computes exact attention over billion-token sequences on a single GPU via streaming. The approach remains mathematically equivalent to standard attention while avoiding out-of-memory failures at extreme sequence lengths under constrained memory budgets.
The paper does not explicitly discuss the computational overhead or latency of the streaming decomposition relative to standard attention when memory is not a constraint. Practical applicability across hardware architectures and the performance trade-offs of the scheduling and recomposition phases are not analyzed in depth. The interaction with other optimization techniques (e.g., quantization, pruning) and scalability to multi-GPU scenarios are also not addressed.
✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.