✨ TL;DR
This paper introduces DESPITE, a benchmark of over 12,000 tasks to evaluate safety in LLM-based robotic planning, revealing that even models with near-perfect planning ability produce dangerous plans 28% of the time. The study shows planning ability scales with model size while safety awareness remains relatively flat, creating a critical gap for deploying LLMs in robotics.
Large language models are increasingly being deployed as planners for robotic systems, but the safety of the plans they generate for physical robots remains poorly understood and has not been systematically evaluated. These models may excel at generating valid plans that accomplish tasks, yet there is no comprehensive understanding of whether they avoid plans that could cause physical harm or violate normative constraints. Without a systematic evaluation framework, it is difficult to assess the safety risks that LLMs introduce when used for embodied planning in real-world robotic applications.
The researchers created DESPITE, a benchmark containing 12,279 tasks that span both physical dangers (like causing harm or damage) and normative dangers (like violating social norms or rules). The benchmark features fully deterministic validation to ensure consistent evaluation. They evaluated 23 language models, among them 18 open-source models ranging from 3 billion to 671 billion parameters and several proprietary models; the set includes reasoning-capable models. The evaluation measured two capacities separately: planning ability (whether models can generate valid plans) and safety awareness (whether models avoid dangerous plans), which allowed the researchers to analyze the relationship between the two.
What the paper shows.
The best-planning model in the benchmark failed to produce valid plans on only 0.4% of tasks but generated dangerous plans on 28.3% of tasks, demonstrating a severe gap between planning competence and safety. Across the 18 open-source models tested, which span 3B to 671B parameters, planning ability ranged from 0.4% to 99.3% and improved with model size, while safety awareness remained in a narrow band of 38-57%. Three proprietary reasoning models achieved substantially higher safety awareness scores of 71-81%, while non-reasoning proprietary models and open-source reasoning models all remained below 57%. The paper identifies a multiplicative relationship between planning ability and safety awareness, so as planning ability approaches saturation in frontier models, safety awareness becomes the primary bottleneck for safe deployment.
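The bottleneck argument can be illustrated with a little arithmetic. The sketch below is not the paper's code or its exact metric definitions: it assumes that the share of tasks solved with a plan that is both valid and safe is roughly planning ability times safety awareness, and the function name `safe_success_rate` is hypothetical. The inputs are endpoints of the ranges quoted above, combined for illustration only; they are not measured results for any single model.

```python
def safe_success_rate(planning_ability: float, safety_awareness: float) -> float:
    """Assumed multiplicative model: fraction of tasks with a valid AND safe plan."""
    for rate in (planning_ability, safety_awareness):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must lie in [0, 1]")
    return planning_ability * safety_awareness


# Illustrative pairings of reported range endpoints (not per-model measurements).
baseline = safe_success_rate(0.900, 0.57)        # decent planner, top of open-source safety band
better_planner = safe_success_rate(0.993, 0.57)  # near-saturated planning, same safety
safer_model = safe_success_rate(0.993, 0.81)     # same planning, top proprietary-reasoning safety

gain_from_planning = better_planner - baseline
gain_from_safety = safer_model - better_planner

print(f"gain from better planning: {gain_from_planning:.3f}")
print(f"gain from better safety:   {gain_from_safety:.3f}")
```

Under this assumed model, once planning is near saturation there is little left to gain from it, and the product is capped by the safety term. That is the sense in which safety awareness becomes the bottleneck.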
The paper does not explicitly discuss limitations in detail, but several are implicit in the work. The benchmark uses deterministic validation, which may not capture all nuances of real-world safety scenarios where context and uncertainty play larger roles. The evaluation is conducted on text-based planning tasks rather than actual robotic execution, so it may not fully reflect safety challenges in physical deployment. The study focuses on the safety of plan generation and does not address execution-time safety monitoring or recovery mechanisms. Additionally, although the benchmark covers 12,279 tasks, its generalizability to all possible dangerous scenarios in real-world robotics remains uncertain, and the specific types of physical and normative dangers included may not be exhaustive.
✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.