ToolPrivacyBench: Privacy in Tool-Using LLM Agents

ToolPrivacyBench is a new benchmark that tests 'purpose-bound' privacy: whether sensitive information is sent only to authorized tools. Its 2,150 test cases (1,150 synthetic and 1,000 adapted) show that the nine tested agents routinely complete tasks while unnecessarily exposing private data.

LLM Agents Are Getting Better at Tasks — But Who Monitors What They Do With Private Data?

Researchers Shijing Hu, Liang Liu, Zhu Meng, and Zhicheng Zhao have published a preprint on arXiv introducing ToolPrivacyBench, a benchmark designed to measure purpose-bound privacy in LLM agents that use external tools. The paper, titled “ToolPrivacyBench: Benchmarking Purpose-Bound Privacy in Tool-Using LLM Agents”, addresses a gap in existing evaluations, which have largely overlooked privacy.

What Is “Purpose-Bound” Privacy and Why It Matters

Purpose-bound privacy, also called purpose-restricted privacy, means that sensitive information may travel only to the tools that actually need it for the specific task. In a multi-tool trajectory, a single agent may call dozens of tools in sequence: a database, a calendar, a payment API, a notification service. Information such as a social security number or a medical record should reach only the authorized tool, not leak “incidentally” through every intermediate call.

Existing benchmarks measure whether a task is completed. ToolPrivacyBench also measures how it was completed.

2,150 Test Cases, 9 Agents, One Uncomfortable Conclusion

The benchmark contains 2,150 test cases: 1,150 synthetic business scenarios with privacy-sensitive data flows and 1,000 cases adapted from existing multi-tool benchmarks. Each case includes a knowledge base with data disclosure policies. After the agent executes the trajectory, ToolPrivacyBench audits the arguments of every tool call, along with background logs, and compares them against that policy.
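The audit step described above can be pictured with a minimal sketch. This is not the benchmark's actual code: the policy format, the `ToolCall` type, the tool names, and the naive substring matching are all assumptions made for illustration, and a real audit would also need to inspect background logs and handle reformatted or paraphrased values.

```python
from dataclasses import dataclass

# Hypothetical disclosure policy: which tools may receive each sensitive field.
POLICY = {
    "ssn": {"payment_api"},
    "diagnosis": {"medical_records_db"},
}

# Hypothetical sensitive values known for this test case.
SENSITIVE_VALUES = {
    "ssn": "123-45-6789",
    "diagnosis": "type 2 diabetes",
}


@dataclass
class ToolCall:
    tool: str
    arguments: dict


def audit(trajectory, policy, sensitive_values):
    """Return (step, tool, field) for each sensitive value sent to an unauthorized tool."""
    violations = []
    for step, call in enumerate(trajectory):
        # Flatten the argument values into one string for a naive substring match.
        blob = " ".join(str(v) for v in call.arguments.values())
        for field, value in sensitive_values.items():
            if value in blob and call.tool not in policy.get(field, set()):
                violations.append((step, call.tool, field))
    return violations


# Illustrative trajectory: the payment call is authorized, the notification is not.
trajectory = [
    ToolCall("payment_api", {"ssn": "123-45-6789", "amount": 40}),
    ToolCall("notification_service", {"message": "Paid. SSN 123-45-6789 on file."}),
    ToolCall("calendar", {"title": "Follow-up"}),
]

violations = audit(trajectory, POLICY, SENSITIVE_VALUES)
print(violations)  # [(1, 'notification_service', 'ssn')]
```

In this toy trajectory the task could still succeed end to end, yet the audit flags the second call because the notification service is not on the allowed list for the `ssn` field.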

Nine widely used agents were tested. The finding is consistent: agents often complete their tasks while also sending private data, through intermediate calls, to tools that do not need it. Successful tool execution does not imply appropriate privacy protection.
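One way to see why the two properties must be reported separately is to score them on separate axes. The sketch below uses made-up per-case outcomes (not results from the paper), and the metric names are hypothetical rather than the benchmark's own.

```python
# Toy per-case outcomes, invented for illustration (not results from the paper).
results = [
    {"task_success": True, "leaked": False},
    {"task_success": True, "leaked": True},
    {"task_success": True, "leaked": True},
    {"task_success": False, "leaked": False},
]


def rate(rows, predicate):
    """Fraction of rows satisfying predicate; 0.0 for an empty list."""
    return sum(1 for r in rows if predicate(r)) / len(rows) if rows else 0.0


# Axis 1: did the agent finish the task?
success_rate = rate(results, lambda r: r["task_success"])

# Axis 2: among successful cases, how many exposed private data?
successes = [r for r in results if r["task_success"]]
leak_among_success = rate(successes, lambda r: r["leaked"])

# Combined view: tasks completed without any unauthorized disclosure.
clean_success_rate = rate(results, lambda r: r["task_success"] and not r["leaked"])

print(success_rate, round(leak_among_success, 2), clean_success_rate)
```

A benchmark that reported only the first number would rate this toy agent at 75 percent, hiding the fact that only a quarter of the cases were both completed and free of leaks.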

Implications for GDPR and Enterprise AI

The difference between “works” and “works in a privacy-compliant way” becomes critical in business environments. The GDPR principle of data minimization directly corresponds to the purpose-bound concept from ToolPrivacyBench. Enterprise systems using LLM agents to process customer or personnel data must be aware that standard task benchmarks do not verify this layer of behavior.

The paper is a preprint — not yet peer-reviewed — but the benchmark methodology and the scope of the test set make it a useful reference framework for teams building or evaluating agentic systems.

Frequently Asked Questions

What is 'purpose-bound' privacy in the context of LLM agents?

It is the 'need-to-know' principle applied to tool calls: sensitive information may be forwarded only to the tool that needs it to complete the task. ToolPrivacyBench measures exactly that: whether private data goes only to authorized tool calls or spills through intermediate calls that do not actually need it.

Why is successful task completion not sufficient for evaluating agent privacy?

Because an agent can correctly complete a business task while simultaneously forwarding sensitive data (personal, financial, medical) to tools that don't need it. ToolPrivacyBench records the arguments of every tool call and compares them against the disclosure policy — something standard task benchmarks don't do.

arXiv:2606.28061: ToolPrivacyBench — Measuring 'Need-to-Know' Privacy in LLM Agents With Tools
