Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

Yiming Bian; Joshua M. Akey

✨ TL;DR

Stream-CQSA enables exact self-attention computation on long sequences by decomposing attention into independent subsequence computations that fit within arbitrary memory budgets, allowing billion-token sequences to run on a single GPU without approximation. This removes the assumption that full query, key, and value tensors must fit in device memory, addressing the quadratic memory bottleneck of standard attention.

01 · Problem

Large language models with long contexts face a fundamental scalability limitation due to the quadratic memory cost of exact self-attention. While existing methods reduce memory complexity toward linear, they still assume that the complete query, key, and value tensors can fit entirely in device memory. This assumption breaks for very long sequences, causing out-of-memory failures on modern hardware and preventing the deployment of long-context models.
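
To make the scale concrete, the following back-of-the-envelope sketch counts the bytes needed for the full score matrix and for the Q, K, and V tensors of a single head. The 2-byte element size (fp16/bf16) and the head dimension of 128 are assumptions chosen for illustration; they are not figures from the paper.

```python
def score_matrix_bytes(n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to materialize the full n x n attention score matrix."""
    return n_tokens * n_tokens * bytes_per_elem


def qkv_bytes(n_tokens: int, head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold Q, K and V together for one attention head."""
    return 3 * n_tokens * head_dim * bytes_per_elem


GIB = 1024 ** 3

# Assumed settings: 2-byte elements, head_dim = 128 (illustrative only).
for n in (10**5, 10**7, 10**9):
    print(f"n={n:>13,}  scores={score_matrix_bytes(n) / GIB:,.1f} GiB  "
          f"QKV={qkv_bytes(n) / GIB:,.1f} GiB")
```

Under these assumptions, the linear term alone (Q, K, and V for one head) reaches 768 GB at a billion tokens, which is why a method that avoids the score matrix but still keeps all three tensors resident can run out of memory.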

02 · Approach

The paper introduces CQS Divide, an operation grounded in cyclic quorum sets theory that mathematically decomposes attention into a set of independent subsequence computations. These subcomputations can be recomposed to yield exactly the same result as full-sequence attention. Building on this decomposition, Stream-CQSA is a memory-adaptive scheduling framework that partitions attention computations into subproblems sized to fit within arbitrary memory budgets. This transforms attention from a monolithic operation into a collection of schedulable tasks that can be executed flexibly across devices without requiring inter-device communication.
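
This summary does not spell out the details of CQS Divide, so the sketch below only illustrates the general pair-coverage property of cyclic quorum sets that such a decomposition can build on. The sequence is split into blocks, each quorum is a cyclic shift of a small base set of block indices, and every (query block, key block) pair falls inside at least one quorum. The block count of 7, the base set {0, 1, 3}, and all function names are assumptions made for illustration, not the paper's construction.

```python
def cyclic_quorums(base: set[int], n_blocks: int) -> list[frozenset[int]]:
    """Build one quorum per shift by rotating the base set modulo n_blocks."""
    return [frozenset((b + shift) % n_blocks for b in base)
            for shift in range(n_blocks)]


def covers_all_pairs(quorums, n_blocks: int) -> bool:
    """True if every pair of blocks (a block with itself included) shares a quorum."""
    for a in range(n_blocks):
        for b in range(a, n_blocks):
            if not any(a in q and b in q for q in quorums):
                return False
    return True


def assign_pairs(quorums, n_blocks: int) -> dict[tuple[int, int], int]:
    """Assign each ordered (query block, key block) pair to the first quorum holding both.

    Each pair gets exactly one owner, so no block interaction is computed twice.
    """
    owner = {}
    for qb in range(n_blocks):
        for kb in range(n_blocks):
            owner[(qb, kb)] = next(
                i for i, q in enumerate(quorums) if qb in q and kb in q
            )
    return owner


N_BLOCKS = 7
BASE = {0, 1, 3}  # hypothetical base set: its differences hit every nonzero residue mod 7

quorums = cyclic_quorums(BASE, N_BLOCKS)
owner = assign_pairs(quorums, N_BLOCKS)

for i, q in enumerate(quorums):
    pairs = sorted(p for p, o in owner.items() if o == i)
    print(f"task {i}: blocks {sorted(q)} handles {len(pairs)} block pairs")
```

In this toy setting each task needs only 3 of the 7 blocks in memory, and because every block pair has an owner, the tasks together cover the whole attention matrix without exchanging data while they run.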

03 · Key insights

What the paper shows.

01 Cyclic quorum sets theory provides a principled mathematical foundation for decomposing attention into independent, recomposable subsequence computations with no approximation error.

02 Attention can be recast as a collection of schedulable tasks rather than a single monolithic operation, enabling memory-adaptive execution strategies.

03 The decomposition requires no inter-device communication, allowing flexible execution across heterogeneous memory constraints.

04 Exact attention over billion-token sequences becomes feasible on single-GPU hardware through streaming, without modifying the mathematical definition of attention.
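
The claim that independent pieces recompose into exact attention can be illustrated with a small sketch. The summary does not say how the paper performs recomposition, so this example substitutes the standard log-sum-exp (online softmax) merge: each piece returns a running maximum, a normalizer, and an unnormalized weighted sum, and two pieces computed over disjoint key subsets combine into the same result as one pass over all keys, up to floating-point rounding. The function names and toy inputs are hypothetical, and the usual 1/sqrt(d) scaling is omitted for brevity.

```python
import math


def partial_attention(q, keys, values):
    """Attention of one query over a subset of keys, as mergeable statistics.

    Returns (m, s, acc): the max score, the sum of exp(score - m),
    and the exp-weighted (unnormalized) sum of the value vectors.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(sc - m) for sc in scores]
    s = sum(weights)
    acc = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return m, s, acc


def merge(a, b):
    """Combine two partial results computed over disjoint key subsets."""
    (ma, sa, acca), (mb, sb, accb) = a, b
    m = max(ma, mb)
    ca, cb = math.exp(ma - m), math.exp(mb - m)  # rescale both to the shared max
    return m, sa * ca + sb * cb, [x * ca + y * cb for x, y in zip(acca, accb)]


def finalize(state):
    """Normalize the accumulated statistics into the attention output."""
    _, s, acc = state
    return [x / s for x in acc]


# Toy inputs (hypothetical): one query, four keys, four 2-d values.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full = finalize(partial_attention(q, keys, values))

# Two independent pieces, each seeing only half of the keys and values.
left = partial_attention(q, keys[:2], values[:2])
right = partial_attention(q, keys[2:], values[2:])
streamed = finalize(merge(left, right))

print(full, streamed)
```

Because the merge is associative, pieces can be produced in any order and folded in one at a time, which is what makes a streaming schedule possible with only a small running state per query.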

04 · Results

Experiments demonstrate that Stream-CQSA achieves predictable memory scaling and enables exact attention computation over billion-token sequences on a single GPU via streaming. The approach maintains mathematical equivalence to standard attention while eliminating out-of-memory failures, showing that the framework successfully handles extreme sequence lengths within constrained memory budgets.

05 · Limitations

The paper does not explicitly discuss computational overhead or latency implications of the streaming decomposition strategy compared to standard attention when memory is not a constraint. The practical applicability across different hardware architectures and the performance trade-offs during the scheduling and recomposition phases are not thoroughly analyzed. Additionally, the interaction with other optimization techniques (e.g., quantization, pruning) and the scalability to multi-GPU scenarios are not addressed.

✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.
