ArXiv ARMOR 2025: first military LLM safety benchmark with 519 prompts across 21 commercial models
Virginia Tech researchers have released ARMOR 2025, the first safety benchmark evaluating LLMs against the Law of War, Rules of Engagement, and Joint Ethics Regulation. Testing 519 doctrinal prompts across 21 commercial models reveals critical gaps — existing safety evaluations do not test whether models align with legal and ethical rules governing military operations.
This article was generated using artificial intelligence from primary sources.
Sydney Johns, Heng Jin, Chaoyu Zhang, Y. Thomas Hou, and Wenjing Lou of Virginia Tech published ARMOR 2025 on April 30, 2026. It is the first safety benchmark that evaluates LLMs against military rather than civilian standards. The work fills a rarely addressed gap: tests like HarmBench measure generally harmful behavior (bomb-making instructions, disinformation) but do not test whether a model understands the context of military operations.
The foundational thesis is that existing frameworks do not distinguish legal from illegal actions under the Law of War, Rules of Engagement, and Joint Ethics Regulation — the core doctrinal frameworks of professional militaries. A model that blindly refuses every military-context query is just as useless in practical application as one that complies unconditionally.
What does the benchmark consist of?
ARMOR 2025 comprises 519 doctrinally grounded prompts organized through a 12-category taxonomy and structured according to the OODA framework: Observe, Orient, Decide, Act. Each prompt carries an explicit doctrinal reference — which regulation or international rule applies, and what the expected model behavior is.
Prompts are not simple “how to do X” queries — they include complex scenarios involving legality, proportionality, and distinguishing combatants from civilians. The model must recognize that part of the scenario is a doctrinal question, not a technical execution task.
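To make the structure concrete, here is a minimal sketch of how one benchmark record could be represented. It is not the authors' schema: the class names, field names, and the two expected-behavior labels are assumptions for illustration, and the placeholder strings stand in for content the article does not quote.

```python
from dataclasses import dataclass
from enum import Enum


class OODAPhase(Enum):
    OBSERVE = "observe"
    ORIENT = "orient"
    DECIDE = "decide"
    ACT = "act"


class ExpectedBehavior(Enum):
    # Hypothetical labels; the paper's actual label set may differ.
    REFUSE = "refuse"
    COMPLY = "comply"


@dataclass(frozen=True)
class BenchmarkPrompt:
    # All field names are assumptions, not taken from the paper.
    prompt_id: str
    text: str
    category: str              # one of the 12 taxonomy categories
    phase: OODAPhase           # OODA stage the scenario targets
    doctrinal_reference: str   # regulation or international rule that applies
    expected: ExpectedBehavior


def validate(prompts):
    """Check that IDs are unique and text fields are non-empty; return the count."""
    seen = set()
    for p in prompts:
        if p.prompt_id in seen:
            raise ValueError(f"duplicate prompt id: {p.prompt_id}")
        if not p.text.strip() or not p.doctrinal_reference.strip():
            raise ValueError(f"incomplete record: {p.prompt_id}")
        seen.add(p.prompt_id)
    return len(seen)


# Placeholder record; the strings are stand-ins, not real benchmark content.
example = BenchmarkPrompt(
    prompt_id="p1",
    text="<scenario text>",
    category="<taxonomy category>",
    phase=OODAPhase.DECIDE,
    doctrinal_reference="<doctrinal reference>",
    expected=ExpectedBehavior.REFUSE,
)
```

Keeping the doctrinal reference and the expected behavior on every record is what would let an evaluator score a response against a stated rule rather than against a general notion of harm.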
How did the 21 commercial models perform?
The paper systematically tests 21 commercial LLMs across the full taxonomy, measuring both answer accuracy and refusal consistency. Detailed per-model results appear in the paper’s appendices, but the general finding is clear: critical gaps exist in safety alignment for military applications.
The most typical failures include inconsistent refusals (a model refuses once, then complies with the same type of query), misinterpretation of context (treating a hypothetical scenario as an operational order), and lack of proportionality reasoning.
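The two measurements mentioned above, answer accuracy and refusal consistency, can be sketched as follows. The article does not give the paper's exact metric definitions, so this is one plausible reading: accuracy is the share of responses matching the expected behavior, and consistency is the share of query groups in which the model behaved the same way every time. The record keys (`group`, `phase`, `expected`, `observed`) and the toy data are invented for illustration.

```python
from collections import defaultdict


def accuracy(results):
    """Fraction of responses whose observed behavior matches the expected one."""
    if not results:
        return 0.0
    return sum(r["observed"] == r["expected"] for r in results) / len(results)


def refusal_consistency(results):
    """Fraction of query groups where every response showed the same behavior."""
    behaviors = defaultdict(set)
    for r in results:
        behaviors[r["group"]].add(r["observed"])
    if not behaviors:
        return 0.0
    return sum(len(b) == 1 for b in behaviors.values()) / len(behaviors)


def failures_by_phase(results):
    """Count mismatches per OODA phase to show where a model tends to fail."""
    counts = defaultdict(int)
    for r in results:
        if r["observed"] != r["expected"]:
            counts[r["phase"]] += 1
    return dict(counts)


# Made-up toy data: group g1 is answered inconsistently, g2 consistently.
toy_results = [
    {"group": "g1", "phase": "observe", "expected": "refuse", "observed": "refuse"},
    {"group": "g1", "phase": "observe", "expected": "refuse", "observed": "comply"},
    {"group": "g2", "phase": "decide", "expected": "comply", "observed": "comply"},
    {"group": "g2", "phase": "decide", "expected": "comply", "observed": "comply"},
]

print(accuracy(toy_results))             # 0.75
print(refusal_consistency(toy_results))  # 0.5
print(failures_by_phase(toy_results))    # {'observe': 1}
```

Reporting the two numbers separately matters because a model can be right most of the time and still refuse once and comply the next time on the same type of query.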
Why this benchmark now?
The topic arrives as governments and defense contractors actively integrate commercial LLMs into operational support tools — chat assistants for intelligence analysis, report drafting tools, decision support systems. Without a doctrinal test, deployments rest on civilian safety criteria that miss military specifics.
For AI vendors (Anthropic, OpenAI, Google, Mistral, Cohere), ARMOR 2025 could become an informal “must-pass” if they want to be considered for defense contracts. For the research community, the benchmark opens the field of doctrinal alignment: aligning models with formal legal frameworks rather than subjective norms.
What the benchmark does not cover
The authors clearly acknowledge limitations: ARMOR 2025 focuses on Anglo-American doctrine (US Joint Ethics, Law of War as interpreted by the Pentagon) and does not include European regulations (e.g., Bundeswehr guidelines or French ROE), nor does it analyze model behavior under NATO as a combined framework. This opens space for next-generation benchmarks covering a broader doctrinal spectrum.
Frequently Asked Questions
- What is the ARMOR 2025 benchmark?
  - ARMOR 2025 is a safety benchmark assessing whether LLMs correctly refuse or handle queries related to military operations. It contains 519 prompts organized through the OODA framework (Observe-Orient-Decide-Act) across 12 doctrinal alignment categories.
- Why are existing safety benchmarks insufficient for military contexts?
  - Existing benchmarks like HarmBench focus on general societal risks (suicide, violence, chemical weapons) without context. Military contexts require nuanced understanding of which actions are legal under the Law of War and which violate Rules of Engagement. Models that blindly refuse all military-related queries are just as problematic as those that comply unconditionally.
- What is the OODA framework used by the benchmark?
  - OODA (Observe, Orient, Decide, Act) is a military decision-making model developed in the 1970s. ARMOR organizes test prompts through these four decision phases, enabling differentiation of exactly where in the process a model fails: at situation recognition, assessment, choice, or execution.