✨ TL;DR
DialToM is a benchmark that tests whether LLMs can use Theory of Mind (ToM) rather than merely label it: it evaluates both mental state prediction and the ability to forecast dialogue trajectories from those states. The study finds that most LLMs can identify mental states but fail to use that understanding to predict realistic social interactions.
Large Language Models (LLMs) have shown apparent Theory of Mind capabilities, but it is unclear whether these reflect genuine reasoning about mental states or merely exploit spurious correlations in training data. Existing evaluations focus primarily on mental state prediction and do not test whether models can functionally apply that understanding to predict realistic social outcomes. This motivates a benchmark that measures both literal mental state identification and the practical utility of these inferences in forecasting dialogue trajectories.
The authors introduce DialToM, a human-verified benchmark constructed from natural human dialogue using a multiple-choice framework. The benchmark evaluates two dimensions: Literal ToM (mental state prediction) and Functional ToM (whether identified mental states can predict state-consistent dialogue trajectories). The key innovation is Prospective Diagnostic Forecasting, which probes whether models can identify dialogue trajectories that align with inferred mental-state profiles, testing the functional integration of mental state understanding into social reasoning.
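The two-dimension scoring described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Item` fields, the `accuracy_by_dimension` helper, and the toy items are hypothetical and do not come from the paper or its released data. The sketch shows how a model can score well on Literal ToM while scoring lower on Functional ToM when each dimension is scored separately.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Item:
    """One hypothetical multiple-choice item; the field names are assumptions."""
    dimension: str      # "literal" or "functional"
    options: List[str]  # candidate mental states or candidate dialogue continuations
    gold: int           # index of the human-verified answer


def accuracy_by_dimension(items: List[Item], predictions: List[int]) -> Dict[str, float]:
    """Return the fraction of correctly answered items for each dimension."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for item, pred in zip(items, predictions):
        total[item.dimension] = total.get(item.dimension, 0) + 1
        correct[item.dimension] = correct.get(item.dimension, 0) + int(pred == item.gold)
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy items invented for illustration only; they are not benchmark data.
items = [
    Item("literal", ["wants reassurance", "wants to end the chat"], 0),
    Item("literal", ["feels confident", "feels anxious"], 1),
    Item("functional", ["asks a follow-up question", "changes the subject"], 0),
    Item("functional", ["apologizes", "repeats the request"], 1),
]
predictions = [0, 1, 0, 0]  # hypothetical model answers (option indices)

scores = accuracy_by_dimension(items, predictions)
gap = scores["literal"] - scores["functional"]  # positive gap: literal outpaces functional
print(scores, gap)
```

Reporting the two accuracies side by side, rather than a single pooled score, is what makes an asymmetry between identifying mental states and using them visible.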
What the paper shows.
Evaluation across multiple LLMs shows that most models perform well on Literal ToM tasks but significantly underperform on Functional ToM tasks, which require trajectory forecasting. Gemini 3 Pro is the exception, showing stronger integrated reasoning. The analysis also finds weak semantic overlap between human and LLM inferences, which suggests that model performance may rely on pattern matching rather than genuine mental state reasoning.
The paper relies on a multiple-choice framework, which may not fully capture the complexity of open-ended reasoning about mental states and social dynamics. The benchmark is constructed from natural dialogue, which may contain biases or patterns that models can exploit. The evaluation covers a specific set of LLMs, so the findings may not generalize to future models or to other dialogue domains. The paper does not investigate in depth the mechanisms behind the asymmetry between Literal and Functional ToM performance, nor does it provide a detailed error analysis of failure cases.
✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.