✨ TL;DR
This paper integrates working memory constraints into Transformer models through cognitively-inspired attention mechanisms and shows that these constraints improve grammatical accuracy and human alignment when training data is limited. The results suggest that mimicking human cognitive limitations can serve as a useful inductive bias for learning robust language representations.
Standard Transformer models require large amounts of training data to learn robust linguistic representations. However, human language learning occurs under significant data scarcity and cognitive constraints, particularly working memory limitations that restrict attention span. The paper investigates whether incorporating human-like working memory constraints into neural architectures could improve learning efficiency and linguistic competence in data-limited settings.
The authors implement cognitively-inspired attention variants in GPT-2, including fixed-width attention windows and temporal decay-based mechanisms that mimic working memory constraints. The modified models are trained from scratch on developmentally plausible datasets of 10M and 100M words. Performance is evaluated on grammaticality judgment tasks (the BLiMP benchmark) and on alignment with human reading time data, used as a proxy for human language processing.
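To make the two constraint families concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `window` and `rate` parameters, and the choice of a linear distance penalty for the decay variant are all hypothetical, since the summary does not give the paper's exact formulation.

```python
import math

def fixed_window_mask(seq_len, window):
    """Causal mask: position i may attend only to itself and the
    previous (window - 1) positions. True means attention is allowed."""
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

def decay_bias(seq_len, rate):
    """Additive bias on attention scores that penalizes distant tokens.
    Assumption: a linear penalty of rate * distance (exponential decay
    after softmax); future positions get -inf to stay causal."""
    return [[-rate * (i - j) if j <= i else -math.inf
             for j in range(seq_len)]
            for i in range(seq_len)]

def attention_weights(scores, mask=None, bias=None):
    """Row-wise softmax over raw scores after applying bias and mask."""
    weights = []
    for i, row in enumerate(scores):
        adjusted = []
        for j, s in enumerate(row):
            if bias is not None:
                s = s + bias[i][j]
            if mask is not None and not mask[i][j]:
                s = -math.inf  # blocked positions get zero weight
            adjusted.append(s)
        top = max(adjusted)  # finite: the diagonal is always allowed
        exps = [math.exp(s - top) for s in adjusted]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform raw scores make the effect of each constraint easy to see.
uniform = [[0.0] * 5 for _ in range(5)]
windowed = attention_weights(uniform, mask=fixed_window_mask(5, 2))
decayed = attention_weights(uniform, bias=decay_bias(5, 0.5))
```

With uniform scores, the windowed variant spreads attention evenly over the last two tokens and ignores everything earlier, while the decay variant keeps every past token visible but gives nearer tokens more weight.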
What the paper shows.
Models with fixed-width attention constraints achieved significantly higher grammatical accuracy on BLiMP than unconstrained baselines, with the largest gains in the data-scarce setting (10M words). The constrained models also aligned better with human reading time data, suggesting that they learn representations closer to those underlying human language processing.
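For readers unfamiliar with how BLiMP accuracy is scored: the benchmark consists of minimal pairs, and a model is counted correct on a pair when it assigns higher probability to the grammatical sentence than to the ungrammatical one. The sketch below shows that scoring rule only. The function names are hypothetical, and the log-probabilities in the usage example are made-up numbers for illustration, not results from the paper.

```python
def sentence_logprob(token_logprobs):
    """Sentence log-probability as the sum of per-token log-probabilities."""
    return sum(token_logprobs)

def minimal_pair_accuracy(pairs):
    """Fraction of pairs where the grammatical sentence scores higher.
    `pairs` is a list of (logprob_grammatical, logprob_ungrammatical)."""
    correct = sum(1 for good, bad in pairs if good > bad)
    return correct / len(pairs)

# Illustrative values only: three of the four pairs favor the grammatical sentence.
example_pairs = [(-10.0, -12.0), (-8.0, -7.5), (-5.0, -9.0), (-6.0, -6.5)]
accuracy = minimal_pair_accuracy(example_pairs)
```

Because the comparison is within a pair, the metric does not depend on absolute probabilities, which makes it usable for small models trained on 10M to 100M words.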
The evaluation is limited to grammaticality judgment tasks and reading time alignment; broader linguistic competencies are not assessed. The study uses relatively small training datasets (10M-100M words) compared to modern language models, which may limit how far the findings generalize. The paper does not provide detailed ablation studies comparing different constraint types, nor an analysis of which linguistic phenomena benefit most from these constraints. Additionally, the relationship between working memory constraints and other potential inductive biases is not thoroughly explored.
✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.