DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

Neemesh Yadav; Palakorn Achananuparp; Jing Jiang; Ee-Peng Lim

✨ TL;DR

DialToM is a benchmark that tests whether LLMs truly understand Theory of Mind by evaluating both mental state prediction and the ability to forecast dialogue trajectories from those states. The study reveals that most LLMs can identify mental states but fail to use this understanding to predict realistic social interactions.

01 · Problem

While Large Language Models have demonstrated Theory of Mind capabilities, it is unclear whether these reflect genuine reasoning about mental states or merely the exploitation of spurious correlations in training data. Existing evaluations focus primarily on mental state prediction without testing whether models can functionally apply this understanding to predict realistic social outcomes. There is a need for a benchmark that measures both literal mental state identification and the practical utility of these inferences in forecasting dialogue trajectories.

02 · Approach

The authors introduce DialToM, a human-verified benchmark constructed from natural human dialogue using a multiple-choice framework. The benchmark evaluates two dimensions: Literal ToM (mental state prediction) and Functional ToM (whether identified mental states can predict state-consistent dialogue trajectories). The key innovation is Prospective Diagnostic Forecasting, which probes whether models can identify dialogue trajectories that align with inferred mental-state profiles, testing the functional integration of mental state understanding into social reasoning.
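The two-dimension, multiple-choice setup described above can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the `Item` fields, the `"literal"`/`"functional"` labels, the helper names, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One hypothetical multiple-choice item; field names are illustrative."""
    dimension: str   # "literal" (mental state) or "functional" (trajectory)
    options: tuple   # candidate mental states or dialogue continuations
    gold: int        # index of the human-verified correct option


def accuracy_by_dimension(items, predictions):
    """Return {dimension: fraction of items answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.dimension] += 1
        correct[item.dimension] += int(pred == item.gold)
    return {dim: correct[dim] / total[dim] for dim in total}


def asymmetry_gap(scores):
    """Literal accuracy minus functional accuracy.

    A positive value means the model identifies mental states better
    than it forecasts state-consistent dialogue trajectories.
    """
    return scores["literal"] - scores["functional"]


# Toy data invented purely for illustration (not taken from the benchmark).
items = [
    Item("literal", ("frustrated", "relieved", "curious"), 0),
    Item("literal", ("wants to leave", "wants to help", "wants advice"), 2),
    Item("functional", ("trajectory A", "trajectory B", "trajectory C"), 1),
    Item("functional", ("trajectory A", "trajectory B", "trajectory C"), 0),
]
predictions = [0, 2, 1, 2]  # option indices chosen by a hypothetical model

scores = accuracy_by_dimension(items, predictions)
gap = asymmetry_gap(scores)
```

Reporting the two accuracies separately, rather than a single pooled score, is what makes a gap between state identification and trajectory forecasting visible.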

03 · Key insights

What the paper shows.

01 LLMs exhibit a significant reasoning asymmetry: strong performance on mental state identification does not translate to accurate dialogue trajectory forecasting.

02 Most LLMs fail to leverage mental state understanding for social prediction, with only Gemini 3 Pro showing stronger performance on functional ToM tasks.

03 Weak semantic similarities exist between human-generated and LLM-generated mental state inferences, suggesting models may identify states through different reasoning mechanisms.

04 Mental state prediction alone is insufficient to demonstrate robust Theory of Mind; functional application to predict social outcomes is necessary.

04 · Results

Evaluation across multiple LLMs reveals that while most models perform well on Literal ToM tasks (mental state prediction), they significantly underperform on Functional ToM tasks requiring trajectory forecasting. Gemini 3 Pro is the exception, showing stronger integrated reasoning. The analysis demonstrates weak semantic overlap between human and LLM inferences, indicating that model performance may rely on pattern matching rather than genuine mental state reasoning.
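To make the notion of semantic overlap concrete, the sketch below scores a human-written inference against a model-written one using cosine similarity over word counts. This summary does not state which similarity measure the paper uses, so the bag-of-words approach here is a simple stand-in chosen for illustration (an embedding-based measure would be a more likely choice in practice), and the function names and example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two strings.

    Returns 0.0 when either string has no tokens.
    """
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented example sentences, not drawn from the benchmark.
human_inference = "She feels ignored and wants an apology"
model_inference = "The speaker is annoyed and seeks acknowledgment"
overlap = cosine_similarity(human_inference, model_inference)
```

A lexical measure like this penalizes paraphrases that share meaning but not words, so a low score is only suggestive; it illustrates the kind of comparison involved rather than reproducing the paper's analysis.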

05 · Limitations

The benchmark relies on a multiple-choice framework, which may not fully capture the complexity of open-ended reasoning about mental states and social dynamics. It is constructed from natural dialogue, which may contain biases or patterns that models can exploit. The evaluation is limited to a specific set of LLMs, and findings may not generalize to future models or different dialogue domains. The paper does not deeply investigate the mechanisms underlying the reasoning asymmetry or provide detailed error analysis of failure cases.

✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.
