🤖 24 AI
🟡 🛡️ Security Wednesday, April 22, 2026 · 3 min read

DESPITE benchmark: LLMs plan well for robots, but not safely

Editorial illustration: Robot planning a path through a maze with a fragile digital safety shield

Why it matters

The new DESPITE benchmark evaluated 23 language models on 12,279 robot planning tasks. Result: the best planner fails to produce a valid plan in only 0.4% of tasks, yet produces dangerous plans in 28.3% of cases. Planning and safety are orthogonal capabilities: scaling models does not fix safety shortcomings.

DESPITE benchmark: planning ability does not guarantee safety

A research team introduced DESPITE, the largest systematic evaluation to date of language-model safety in robot task planning. The findings reveal a disturbing pattern: models become brilliant planners but remain careless about danger.

What does the DESPITE benchmark measure and how?

DESPITE evaluates 23 models on 12,279 tasks covering both physical hazards (e.g., handling sharp objects, heat, electricity) and normative hazards (e.g., procedures that violate rules, ethics, or usage context). The key methodological innovation is fully deterministic validation: instead of relying on another LLM as a judge, predefined rules unambiguously classify each plan as safe or dangerous. This removes the noise of subjective assessment and puts all models on the same measurement scale. The researchers then compared two dimensions: the ability to generate a valid (technically feasible) plan, and the ability to avoid dangerous steps within it.
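Deterministic validation of this kind can be pictured as a fixed set of predicates over plan steps: a plan is dangerous if and only if some rule fires. The sketch below is illustrative only; the rule names, step format, and rules themselves are assumptions, not taken from the paper.

```python
# Hypothetical sketch of deterministic plan validation: each rule is a
# predicate over a single plan step; a plan is classified as dangerous
# iff any rule fires. No LLM judge is involved, so the verdict is
# reproducible for the same plan every time.

DANGEROUS_RULES = {
    "blade_toward_human": lambda step: step["action"] == "hand_over"
                                       and step["object"] == "knife"
                                       and step.get("blade_first", False),
    "place_on_hot_surface": lambda step: step["action"] == "place"
                                         and step.get("surface") == "hot_stove",
}

def classify_plan(plan):
    """Return ('dangerous', rule_name) if any rule fires, else ('safe', None)."""
    for step in plan:
        for name, rule in DANGEROUS_RULES.items():
            if rule(step):
                return ("dangerous", name)
    return ("safe", None)

plan = [
    {"action": "pick_up", "object": "knife"},
    {"action": "hand_over", "object": "knife", "blade_first": True},
]
print(classify_plan(plan))  # ('dangerous', 'blade_toward_human')
```

Because the rules are pure functions of the plan, two models proposing the same plan always receive the same label, which is what makes cross-model comparison on one scale possible.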

Why are planning and safety orthogonal capabilities?

The most important finding of the work: “The best planner fails to produce a valid plan in only 0.4% of tasks, but produces dangerous plans in 28.3% of cases.” In other words, a model that almost never makes a technical mistake still proposes, in more than one in four scenarios, something that could injure people or destroy property. Among 18 open-source models (from 3 to 671 billion parameters), planning ability grows dramatically with size, from a 0.4% success rate at the smallest to 99.3% at the largest. Safety awareness, however, remains relatively flat, between 38% and 57% regardless of scale. This is strong evidence that these are separate (orthogonal) capabilities: scaling parameters improves planning but not safety judgment. The authors conclude that the relationship is multiplicative: larger models “succeed” primarily because they plan better, not because they avoid dangers better.
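The multiplicative relationship can be made concrete with the article's own headline numbers. Treating plan validity and plan safety as independent factors (a simplifying assumption, not a claim from the paper), the fraction of tasks solved both correctly and safely is their product:

```python
# Illustrative arithmetic for the multiplicative relationship:
# overall reliability = P(valid plan) * P(plan is safe).
# Percentages are the article's figures for the best planner.

p_valid = 1 - 0.004   # fails to produce a valid plan in 0.4% of tasks
p_safe  = 1 - 0.283   # produces a dangerous plan in 28.3% of cases

p_reliable = p_valid * p_safe
print(f"{p_reliable:.1%}")  # prints "71.4%"
```

Under this simple model, near-perfect planning still yields only about 71% of tasks completed safely, which is why the unsafe factor, not the planning factor, dominates the remaining failures.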

Which models lead and what does this mean for deployment?

Proprietary reasoning models (those that expose intermediate reasoning steps, such as Claude, the OpenAI o-series, and similar closed systems) significantly outperform the alternatives, with 71–81% safety awareness. Non-reasoning proprietary models and open-source reasoning models remain below 57%. The practical implication is serious: as frontier models saturate planning performance, safety awareness becomes the most critical link in the reliability chain, and scaling alone is no longer the solution. The authors argue that safety requires dedicated architectural approaches and dedicated training methods, not just more parameters. For the robotics industry, this means LLM-based systems should not operate without additional safety layers (plan verification, external rule engines, and human oversight), no matter how impressively they plan. DESPITE offers a useful baseline for regulators and integrators who want to measure objectively how ready a model is for the real world.
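The layered architecture the article recommends (planner, then rule engine, then human fallback) can be sketched as a simple gate around the planner call. Everything here is a hypothetical stand-in: `generate_plan` represents the LLM planner and `violates_rules` represents an external deterministic rule engine; neither is a real API.

```python
# Minimal sketch of a safety gate around an LLM planner, along the lines
# the article suggests: every candidate plan passes an external rule
# engine before execution, and persistent failures escalate to a human.

def generate_plan(task):
    # Stand-in for an LLM planner call; here it always returns an
    # unsafe plan so the escalation path is exercised.
    return [{"action": "pick_up", "object": "knife"},
            {"action": "hand_over", "object": "knife", "blade_first": True}]

def violates_rules(plan):
    # Stand-in for a deterministic external rule engine.
    return any(step.get("blade_first") for step in plan)

def safe_execute(task, max_retries=2):
    """Execute only rule-clean plans; otherwise hand off to a human."""
    for _ in range(max_retries + 1):
        plan = generate_plan(task)
        if not violates_rules(plan):
            return ("execute", plan)
    return ("escalate_to_human", None)  # human oversight as the last layer

print(safe_execute("hand me the knife")[0])  # prints "escalate_to_human"
```

The design point is that the gate is independent of the planner: even a model with weak safety awareness cannot push a rule-violating plan through to the robot, which is exactly the decoupling the orthogonality finding motivates.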

🤖

This article was generated using artificial intelligence from primary sources.