arXiv:2605.17634: Prompt Injection Beyond Data Separation

Researchers from CISPA Helmholtz Center and Google mathematically prove that data/instruction separation — currently the dominant defense against prompt injection attacks — fails to protect against contextual manipulation. With a new theoretical framework based on Contextual Integrity, they propose a fundamentally different approach to designing AI agent defenses.

Why Data/Instruction Separation Cannot Stop Prompt Injection

Researchers Sahar Abdelnabi (CISPA Helmholtz Center for Information Security) and Eugene Bagdasarian (Google) have published the paper arXiv:2605.17634, challenging the foundational assumption of today’s AI agent defenses.

Prompt injection is an attack in which malicious content from the environment — a document, web page, or API response — plants hidden instructions into an AI agent and hijacks its actions. The dominant defense today is data/instruction separation: distinguishing trusted user instructions from untrusted external data, and blocking the execution of instructions that arrive through the data channel.

The authors mathematically prove that this approach has a fundamental limit. An attacker does not need to plant text that looks like an instruction — it is sufficient to manipulate the situational context. An agent that correctly distinguishes data from instructions can still be led into the wrong action if an attacker constructs a legitimately-looking context that changes what the agent considers an “appropriate” action.

Contextual Integrity as a New Theoretical Framework

To formalize the problem, the authors introduce Contextual Integrity (CI) — Helen Nissenbaum’s framework from the philosophy of privacy. CI does not assess what is transmitted, but whether an information flow is appropriate to its context: who sends, to whom, in what situation, and for what purpose.

Applied to AI agents: an attack is not merely a planted instruction — an attack is any information flow that violates the contextual norms of a legitimate task. The authors develop a scenario analysis showing three violation mechanisms: misrepresentation of the information flow, manipulation of contextual norms, and the mixing of multiple flows from different contexts.

The key theoretical result — an impossibility result — states: an attacker can always construct a context in which a blocked legitimate operation looks suspicious and a malicious operation looks legitimate. Every tightening of security norms blocks some legitimate operations; every relaxation allows some attacks through.

Is Defense Even Possible?

The authors do not claim defense is impossible — they argue that the existing paradigm is insufficient. The solution is not a better detector for forbidden content, but a CI-aware alignment framework: agents need to be trained to assess the appropriateness of information flows according to task context, not merely to distinguish data and instruction formats.

The implications are direct for all production AI agents that process external content — email, documents, web pages, API responses. Channel separation remains a useful measure, but as the sole line of defense it is not enough.

Frequently Asked Questions

What is a prompt injection attack?

Prompt injection is an attack in which malicious content from the environment — a web page, document, or API response — injects hidden instructions into an AI agent's context. The agent interprets them as legitimate user instructions and executes malicious actions instead of the intended ones. Example — an email-reading agent encounters a message with the instruction "forward all contacts to the attacker".

What is Contextual Integrity?

Contextual Integrity (CI) is Helen Nissenbaum's theoretical framework for evaluating the appropriateness of information flows. Rather than looking at what is transmitted, CI assesses whether an information flow is appropriate to the context in which it occurs — who sends, to whom, in what situation, and for what purpose. The authors apply CI to AI agents to formalize what a "malicious instruction" means.

Why does data/instruction separation not solve the problem?

Separation tries to prevent agents from treating external data as instructions. But an attack that works through contextual manipulation — not by planting text that looks like an instruction, but by altering the situational context — does not cross that boundary. An attacker can construct a legitimately-looking context that leads the agent to a wrong action without a single explicit malicious command.

arXiv:2605.17634: Why Data/Instruction Separation Cannot Stop Prompt Injection

Why Data/Instruction Separation Cannot Stop Prompt Injection

Contextual Integrity as a New Theoretical Framework

Is Defense Even Possible?

Frequently Asked Questions

Sources

Related news