arXiv:2605.04785: AgentTrust Runtime Safety for AI Agents

AgentTrust is an open-source runtime system that intercepts AI agent tool calls — file operations, SQL queries and shell commands — and returns one of four verdicts before execution. Across 930 test scenarios it achieves 95–97% accuracy, and approximately 93% on shell-obfuscated attacks.

A new paper on arXiv introduces AgentTrust, a runtime safety layer that sits between an AI agent and its tools, evaluating every call in real time before it executes. The system targets vulnerabilities that emerge when agents are given broad access to the operating system and external services.

How does AgentTrust decide what to allow?

For each incoming tool call, AgentTrust returns one of four verdicts: allow, warn, block or send for human review. The architecture combines a shell deobfuscation normalizer, a SafeFix component that suggests safer alternatives, a RiskChain detector for multi-step attack chains, and a cache-aware LLM-as-Judge layer for ambiguous inputs. Covered tools include file operations, SQL queries and shell commands, the three most common attack surfaces in production agentic systems.
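The verdict flow can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names Verdict, normalize_shell and evaluate, and the regex rules, are hypothetical and are not AgentTrust's actual API or ruleset. The sketch covers only a toy normalizer and rule matching; SafeFix, RiskChain and the LLM-as-Judge layer are left out.

```python
import re
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"  # send for human review


def normalize_shell(command: str) -> str:
    """Toy deobfuscation: strip empty quote pairs and backslash escapes."""
    command = command.replace("''", "").replace('""', "")
    command = re.sub(r"\\(.)", r"\1", command)
    # Collapse runs of whitespace so rules see a canonical form.
    return re.sub(r"\s+", " ", command).strip()


# Hypothetical rules for illustration only; checked in order, first match wins.
RULES = [
    (Verdict.BLOCK, re.compile(r"^rm\s+-\w*r\w*\s+/$")),
    (Verdict.BLOCK, re.compile(r"\bdrop\s+table\b", re.IGNORECASE)),
    (Verdict.REVIEW, re.compile(r"\bcurl\b.*\|\s*(sh|bash)\b")),
    (Verdict.WARN, re.compile(r"\bdelete\s+from\b", re.IGNORECASE)),
]


def evaluate(tool: str, payload: str) -> Verdict:
    """Return a verdict for one tool call; shell payloads are normalized first."""
    text = normalize_shell(payload) if tool == "shell" else payload
    for verdict, pattern in RULES:
        if pattern.search(text):
            return verdict
    return Verdict.ALLOW


# An obfuscated `rm -rf /` is normalized before the rules run.
print(evaluate("shell", 'r""m  -r\\f /'))
```

Normalizing before matching is the point of the sketch: the obfuscated command above would slip past the first rule if it were matched against the raw string.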

How accurate is the system?

The evaluation covers 930 scenarios in total: 300 internal scenarios spanning six risk categories and 630 independently constructed adversarial, real-world scenarios. On the internal benchmark, the production ruleset achieves 95.0% verdict accuracy and 73.7% risk-level accuracy, with latency in the millisecond range. On the 630-scenario set, AgentTrust reaches 96.7% accuracy, including approximately 93% on shell-obfuscated payloads, which typically bypass simpler filters.
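As a quick check on how the two reported figures relate to the 95–97% headline range, they can be pooled by scenario count. This is back-of-the-envelope arithmetic, not a result from the paper, and it assumes the 95.0% and 96.7% figures measure the same kind of verdict accuracy, which the article does not confirm.

```python
# Reported figures: (scenario count, accuracy) for each benchmark.
internal_n, internal_acc = 300, 0.950
adversarial_n, adversarial_acc = 630, 0.967

total_n = internal_n + adversarial_n

# Weight each accuracy by its scenario count (assumes comparable metrics).
pooled = (internal_n * internal_acc + adversarial_n * adversarial_acc) / total_n

print(f"{total_n} scenarios, pooled accuracy about {pooled:.1%}")
# prints: 930 scenarios, pooled accuracy about 96.2%
```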

What does open-source delivery mean?

Author Chenglin Yang has released AgentTrust under the AGPL-3.0 license, which requires derived works to share the same open-source terms. The system is delivered as an MCP server (Model Context Protocol — the open standard for calling external tools from LLMs), so it can be attached to any agent that supports MCP without modifying agent code. This lowers the barrier to introducing runtime control into existing agentic workflows.

Frequently Asked Questions

What does AgentTrust intercept?

The system intercepts AI agent tool calls before they execute, specifically file operations, SQL queries and shell commands, and returns a verdict on each one in real time.

What verdicts does AgentTrust return?

AgentTrust returns one of four verdicts: allow, warn, block or send for human review. A separate SafeFix component suggests safer alternatives.

Under what license is it available?

The system is released under the AGPL-3.0 open-source license and delivered as an MCP server, making it compatible with any agent that supports the Model Context Protocol.
