✨ TL;DR
Sessa is a new sequence model that places attention inside a recurrent feedback path, enabling power-law memory decay instead of exponential or 1/length dilution. This architecture achieves superior long-context performance while remaining competitive on short sequences.
Current sequence models face fundamental trade-offs in how they handle long-range dependencies. Transformers use self-attention to retrieve from context, but when attention is diffuse (not sharply focused), each token's influence dilutes as O(1/length), making old tokens increasingly weak. State-space models like Mamba process sequences recurrently with input-dependent feedback, but their long-range sensitivity decays exponentially with lag when they cannot sustain "freeze time" over long intervals. Existing architectures are thus limited to either single-read retrieval from the past (Transformers) or single-path information propagation (SSMs). Neither approach provides both flexible selective retrieval and efficient long-range memory that decays slower than 1/length for diffuse attention patterns.
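The two dilution regimes described above can be illustrated with a toy sketch. This is not the paper's code: the helper names `uniform_attention_weight` and `recurrent_influence`, and the fixed scalar feedback gain `a`, are assumptions chosen for illustration. It shows that perfectly diffuse attention (equal logits) gives each token a weight of 1/length, while a contracting scalar recurrence h_t = a·h_{t-1} + x_t passes an input forward with weight a^lag.

```python
import math


def uniform_attention_weight(context_len: int) -> float:
    """Softmax weight on any single token when all logits are equal."""
    logits = [0.0] * context_len          # fully diffuse attention
    exps = [math.exp(z) for z in logits]
    return exps[0] / sum(exps)            # equals 1 / context_len


def recurrent_influence(a: float, lag: int) -> float:
    """Contribution of a unit impulse to the state `lag` steps later,
    for the toy recurrence h_t = a * h_{t-1} + x_t (hypothetical gain a)."""
    h = 0.0
    for step in range(lag + 1):
        x = 1.0 if step == 0 else 0.0     # impulse at step 0 only
        h = a * h + x
    return h                              # equals a ** lag


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, uniform_attention_weight(n), recurrent_influence(0.9, n))
```

Real Mamba-style models use input-dependent gains rather than a constant `a`; the constant here only isolates the exponential-decay case in which the gain stays below 1 over the whole interval.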
Sessa introduces a novel architecture that embeds attention mechanisms inside a recurrent feedback path, creating a "many-path aggregation" system within each layer. This design combines the selective retrieval capabilities of attention with the recurrent processing of state-space models. The key innovation is that instead of choosing between attention-based retrieval or recurrent propagation, Sessa uses attention to modulate and enrich the recurrent state updates themselves. The architecture is designed to achieve power-law memory decay of order O(ℓ^(-β)) for 0<β<1, which is asymptotically slower than the O(1/ℓ) dilution in Transformers. The authors prove this rate is tight in explicit diffuse uniform-routing settings where influence scales as Θ(ℓ^(-β)). Under stated assumptions, Sessa is the only model class among those compared that can realize both flexible selective retrieval and non-decaying memory profiles.
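To make the asymptotic comparison concrete, the sketch below evaluates the three idealized decay laws mentioned in this summary at a few lags. It is a numerical illustration only, not a model of Sessa itself: the function name `decay_profiles` and the default values β = 0.5 and a = 0.99 are assumptions, and constant factors are ignored.

```python
def decay_profiles(lag: int, beta: float = 0.5, a: float = 0.99) -> dict:
    """Influence at a given lag under three idealized decay laws.

    beta and a are hypothetical values chosen for illustration.
    """
    return {
        "power_law": lag ** (-beta),   # l^(-beta), with 0 < beta < 1
        "inverse": 1.0 / lag,          # 1/l, diffuse attention dilution
        "exponential": a ** lag,       # a^l, contracting recurrence
    }


if __name__ == "__main__":
    for lag in (10, 100, 1000, 10000):
        p = decay_profiles(lag)
        # The ratio power_law / inverse equals lag ** (1 - beta),
        # so the gap widens as the lag grows.
        print(lag, p, p["power_law"] / p["inverse"])
```

At short lags the exponential term can still be the largest of the three; the ordering power law > 1/ℓ > exponential only emerges once the lag is large relative to 1/(1 − a).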
What the paper shows.
Under matched architectures and training budgets, Sessa achieves the strongest performance on long-context benchmarks compared to Transformer and Mamba-style baselines. Importantly, this long-context advantage does not come at the cost of short-context performance: Sessa remains competitive with both Transformer and Mamba baselines on short-context language modeling tasks. The empirical results validate the theoretical predictions about power-law memory decay and demonstrate that the many-path aggregation mechanism provides practical benefits across different sequence length regimes.
The paper states that its theoretical results hold "under stated assumptions" but does not fully detail what conditions are required in practice for the power-law memory guarantees. The tight rate Θ(ℓ^(-β)) is proven only for the specific case of diffuse uniform-routing, leaving open whether matching bounds hold for other attention patterns. The paper does not provide a detailed computational complexity analysis comparing Sessa to baselines, which is important for understanding practical scalability trade-offs. Additionally, while Sessa is competitive on short contexts, the results suggest it may not strictly dominate existing architectures in all regimes, indicating potential overhead from the more complex feedback-attention mechanism.
✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.