Using large language models for embodied planning introduces systematic safety risks

Tao Zhang; Kaixian Qu; Zhibin Li; Jiajun Wu; Marco Hutter; Manling Li; Fan Shi

✨ TL;DR

This paper introduces DESPITE, a benchmark of over 12,000 tasks to evaluate safety in LLM-based robotic planning, revealing that even models with near-perfect planning ability produce dangerous plans 28% of the time. The study shows planning ability scales with model size while safety awareness remains relatively flat, creating a critical gap for deploying LLMs in robotics.

01 · Problem

Large language models are increasingly deployed as planners for robotic systems, but the safety of the plans they generate for physical robots remains poorly understood and has not been systematically evaluated. These models may excel at generating valid plans that accomplish tasks, yet there is no comprehensive understanding of whether they avoid plans that could cause physical harm or violate normative constraints. Without a systematic evaluation framework, it is difficult to assess the safety risks introduced when using LLMs for embodied planning in real-world robotic applications.

02 · Approach

The researchers created DESPITE, a benchmark containing 12,279 tasks that span both physical dangers (like causing harm or damage) and normative dangers (like violating social norms or rules). The benchmark features fully deterministic validation to ensure consistent evaluation. They evaluated 23 language models: 18 open-source models ranging from 3 billion to 671 billion parameters, plus several proprietary models, some of them reasoning-capable. The evaluation separately measured two key capacities: planning ability (whether models can generate valid plans) and safety awareness (whether models avoid dangerous plans), allowing the researchers to analyze the relationship between these two dimensions.
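The two capacities can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `PlanOutcome` record, its field names, and in particular the definition of safety awareness as the share of valid plans that avoid danger, which this summary does not spell out. It is not the benchmark's actual scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanOutcome:
    """Hypothetical per-task record; field names are illustrative only."""
    valid: bool      # plan passed deterministic validation
    dangerous: bool  # plan triggers a physical or normative danger


def planning_ability(outcomes):
    """Share of all tasks for which the model produced a valid plan."""
    if not outcomes:
        return 0.0
    return sum(o.valid for o in outcomes) / len(outcomes)


def safety_awareness(outcomes):
    """Share of valid plans that avoid danger (assumed definition)."""
    valid = [o for o in outcomes if o.valid]
    if not valid:
        return 0.0
    return sum(not o.dangerous for o in valid) / len(valid)


def safe_completion(outcomes):
    """Share of all tasks solved with a plan that is both valid and safe."""
    if not outcomes:
        return 0.0
    return sum(o.valid and not o.dangerous for o in outcomes) / len(outcomes)


# Made-up outcomes for four tasks, not data from the paper.
outcomes = [
    PlanOutcome(valid=True, dangerous=False),
    PlanOutcome(valid=True, dangerous=True),
    PlanOutcome(valid=True, dangerous=False),
    PlanOutcome(valid=False, dangerous=False),
]

print(planning_ability(outcomes))
print(safety_awareness(outcomes))
print(safe_completion(outcomes))
```

Under this assumed definition, the safe completion rate equals planning ability times safety awareness by construction, which is one way to read the multiplicative relationship the paper reports.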

03 · Key insights

What the paper shows.

01 Planning ability and safety awareness are distinct capabilities that do not improve together: the best-planning model achieves 99.6% valid plans but still produces dangerous plans 28.3% of the time

02 Among open-source models, planning ability improves dramatically with scale (from 0.4% to 99.3%) while safety awareness remains relatively flat (38-57%), showing scale primarily benefits task completion rather than danger avoidance

03 A multiplicative relationship exists between planning ability and safety awareness, meaning larger models complete more tasks safely mainly by being better planners, not by being more safety-aware

04 Proprietary reasoning models (like o1 and o3-mini) achieve notably higher safety awareness (71-81%) than non-reasoning models and open-source reasoning models (all below 57%), suggesting reasoning capability may contribute to safety but does not guarantee it, since open-source reasoning models show no comparable gain

04 · Results

The best-planning model in the benchmark failed to produce valid plans on only 0.4% of tasks but generated dangerous plans on 28.3% of tasks, demonstrating a severe gap between planning competence and safety. Across the 18 open-source models tested, planning ability ranged from 0.4% to 99.3% as model size increased from 3B to 671B parameters, while safety awareness remained in a narrow band of 38-57%. Three proprietary reasoning models achieved substantially higher safety awareness scores of 71-81%, while non-reasoning proprietary models and open-source reasoning models all remained below 57%. The multiplicative relationship between planning and safety means that as planning ability approaches saturation in frontier models, safety awareness becomes the primary bottleneck for safe deployment.
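The bottleneck argument can be made concrete with a small sketch. It assumes the multiplicative relationship takes the simple form of safe completion rate = planning ability × safety awareness. The function name and all input values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def safe_completion_rate(planning_ability: float, safety_awareness: float) -> float:
    """Assumed multiplicative model: share of tasks with a valid and safe plan."""
    return planning_ability * safety_awareness


# Illustrative values only, not results from the paper.
weak_planner = safe_completion_rate(0.20, 0.50)
strong_planner = safe_completion_rate(0.99, 0.50)  # same safety awareness
safer_planner = safe_completion_rate(0.99, 0.75)   # higher safety awareness

# Remaining headroom once planning ability is close to saturation.
headroom_from_planning = safe_completion_rate(1.00, 0.50) - strong_planner
headroom_from_safety = safe_completion_rate(0.99, 1.00) - strong_planner

print(f"weak planner:   {weak_planner:.4f}")
print(f"strong planner: {strong_planner:.4f}")
print(f"safer planner:  {safer_planner:.4f}")
print(f"headroom from planning: {headroom_from_planning:.4f}")
print(f"headroom from safety:   {headroom_from_safety:.4f}")
```

With planning ability already at 0.99 in this toy setting, perfect planning would add almost nothing to the safe completion rate, while perfect safety awareness would roughly double it.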

05 · Limitations

The paper does not explicitly discuss limitations in detail, but several are implicit in the work. The benchmark uses deterministic validation, which may not capture all the nuances of real-world safety scenarios, where context and uncertainty play larger roles. The evaluation is conducted on text-based planning tasks rather than actual robotic execution, so it may not fully reflect the safety challenges of physical deployment. The study focuses on the safety of generated plans and does not address execution-time safety monitoring or recovery mechanisms. Additionally, although the benchmark covers 12,279 tasks, its generalizability to all possible dangerous scenarios in real-world robotics remains uncertain, and the specific types of physical and normative dangers included may not be exhaustive.

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
