Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity

Pranava Madhyastha; Dagmar Adamcova

✨ TL;DR

This paper integrates working memory constraints into Transformer models through cognitively-inspired attention mechanisms and shows that these constraints improve grammatical accuracy and human alignment when training data is limited. The results suggest that mimicking human cognitive limitations can serve as a useful inductive bias for learning robust language representations.

01 · Problem

Standard Transformer models require large amounts of training data to learn robust linguistic representations. However, human language learning occurs under significant data scarcity and cognitive constraints, particularly working memory limitations that restrict attention span. The paper investigates whether incorporating human-like working memory constraints into neural architectures could improve learning efficiency and linguistic competence in data-limited settings.

02 · Approach

The authors implement cognitively-inspired attention variants in GPT-2, including fixed-width attention windows and temporal decay-based mechanisms that mimic working memory constraints. Modified models are trained from scratch on developmentally plausible datasets of 10M and 100M words. Performance is evaluated using grammatical judgment tasks (BLiMP benchmark) and alignment with human reading time data as a proxy for human language processing.
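The two families of constraint named above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the linear form of the decay, and the `window` and `rate` parameters are assumptions made for illustration, and the summary does not say which window sizes or decay functions the paper uses.

```python
import math

def fixed_window_mask(seq_len, window):
    """Causal mask: position i may attend to j only if 0 <= i - j < window."""
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

def decay_bias(seq_len, rate):
    """Additive logit bias that penalises distant past tokens.

    Assumption: a linear penalty of `rate` per token of distance.
    Future positions get -inf so the attention stays causal.
    """
    return [[-rate * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def constrained_attention_weights(scores, mask=None, bias=None):
    """Apply an optional bias and mask to raw scores, then normalise each row."""
    weights = []
    for i, row in enumerate(scores):
        adjusted = []
        for j, s in enumerate(row):
            if bias is not None:
                s = s + bias[i][j]
            if mask is not None and not mask[i][j]:
                s = float("-inf")  # masked positions receive zero weight
            adjusted.append(s)
        weights.append(softmax(adjusted))
    return weights

# Toy usage: uniform raw scores over a 4-token sequence.
scores = [[0.0] * 4 for _ in range(4)]
windowed = constrained_attention_weights(scores, mask=fixed_window_mask(4, 2))
decayed = constrained_attention_weights(scores, bias=decay_bias(4, 1.0))
```

With a window of 2, the last token spreads its attention over itself and its immediate predecessor only. With the decay bias, every earlier token stays visible but receives less weight the further back it lies.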

03 · Key insights

What the paper shows.

01 Fixed-width attention constraints significantly improve grammatical accuracy, particularly in low-data regimes

02 Cognitively-inspired constraints serve as beneficial inductive biases that guide models toward more robust linguistic representations

03 Constrained models show stronger alignment with human processing metrics, suggesting they capture human-like language processing patterns

04 Working memory limitations, rather than being purely restrictive, can facilitate learning when training data is scarce

04 · Results

Models with fixed-width attention constraints achieved significantly better grammatical accuracy on BLiMP tasks compared to unconstrained baselines, especially when trained on smaller datasets (10M words). The constrained models also demonstrated improved alignment with human reading time data, suggesting they learn representations more similar to human language processing. These improvements were most pronounced in the data-scarce setting.
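For readers unfamiliar with how BLiMP accuracy is computed: the benchmark consists of minimal pairs, and a model is counted correct on a pair when it assigns a higher probability to the grammatical sentence than to the ungrammatical one. The sketch below shows that scoring rule only. The `toy_log_probs` table and its values are made up for illustration and stand in for a real language model's sentence log-probabilities; they are not results from the paper.

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of (grammatical, ungrammatical) pairs where the
    grammatical sentence receives the strictly higher score."""
    pairs = list(pairs)
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)

# Hypothetical sentence log-probabilities (illustrative values only).
toy_log_probs = {
    "The cats sleep.": -12.0,
    "The cats sleeps.": -15.5,
    "The dog near the trees barks.": -21.0,
    "The dog near the trees bark.": -20.2,
}

pairs = [
    ("The cats sleep.", "The cats sleeps."),
    ("The dog near the trees barks.", "The dog near the trees bark."),
]

accuracy = minimal_pair_accuracy(pairs, toy_log_probs.get)
```

In this toy table the second pair is scored incorrectly: the intervening plural noun "trees" pulls the made-up scores toward the wrong verb form, which is the kind of agreement error such pairs are designed to expose.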

05 · Limitations

The evaluation is limited to grammatical judgment tasks and reading time alignment; broader linguistic competencies are not assessed. The training datasets (10M and 100M words) are small compared to those used for modern language models, which may limit how far the findings generalize. The paper does not provide detailed ablation studies comparing different constraint types, nor an analysis of which linguistic phenomena benefit most from these constraints. Additionally, the relationship between working memory constraints and other potential inductive biases is not thoroughly explored.

✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.
