arXiv: 99.8% memory poisoning attack GPT-5.5

Q: What does sleeper memory poisoning specifically mean?

Classic prompt injection attacks last only as long as adversarial content is in the context — sleeper memory poisoning corrupts the agent's persistent memory through fabricated facts stored in long-term memory; the attack remains dormant across multiple sessions and activates when the agent later accesses that memory item for another task, which is dramatically different from prompt injection, which has no persistence.

Q: What are the specific success rate figures?

GPT-5.5: 99.8% successful poisoning rate, Kimi-K2.6: 95% success rate; among successfully retrieved poisoned memories, attacker-intended actions were triggered in 60–89% of cases; the attack pipeline was fully evaluated — from fabrication writing into storage, through later retrieval, to manipulation of subsequent conversations.

Hidden in Memory is a new arXiv paper published on May 14, 2026 by Sidharth Pulipaka, Stanislau Hlebik, Leonidas Raghav, Sahar Abdelnabi, Vyas Raina, Ivaxi Sheth, and Mario Fritz that presents a delayed-execution attack on stateful LLM agents. Adversarial content in external context (documents, webpages) corrupts the agent's persistent memory — 99.8% success on GPT-5.5 and 95% on Kimi-K2.6, with 60–89% success converting poisoned memory into attacker-intended actions.

Sidharth Pulipaka, Stanislau Hlebik, Leonidas Raghav, Sahar Abdelnabi, Vyas Raina, Ivaxi Sheth, and Mario Fritz published on arXiv on May 14, 2026 a paper presenting Sleeper Memory Poisoning — a new attack vector that exploits persistent memory of LLM agents for delayed-execution attacks with dramatic success rates: 99.8% on GPT-5.5 and 95% on Kimi-K2.6.

What does sleeper memory poisoning specifically mean?

Classic LLM security threats — prompt injection, jailbreaking, context manipulation — share one fundamental limitation: the attack lasts only as long as adversarial content is in the context. Once the user leaves the session or clears the context, the attack disappears.

Sleeper memory poisoning changes that profile. Current stateful LLM assistants (ChatGPT with Memory, Claude Projects, Gemini Personalization) persist user-specific information across multiple sessions. The paper demonstrates that this persistent memory can be corrupted through fabricated facts that:

Are written to storage automatically through normal user interaction
Remain dormant until a retrieval trigger arrives
Activate in later sessions when the agent uses the memory item for another task
Manipulate subsequent conversations in the attacker-intended direction

The difference between sleeper memory poisoning and classic prompt injection is dramatic: persistence. The attack can remain dormant for days or weeks before triggering.

What does the attack pipeline specifically look like?

The paper fully evaluates the complete attack pipeline:

Fabrication writing — adversarial content in an external document, webpage, or repository that the agent processes
Memory write — the agent processes the content and writes fabricated “facts” to persistent memory as user preferences, facts, or context
Dormancy period — everything between writing and retrieval
Memory retrieval — the agent in a later session uses the memory item for another task
Action triggering — poisoned memory influences agent reasoning and triggers the attacker-intended action

The approach exploits the trust boundary between the user and external sources. The agent treats anything the user feeds it as trustworthy, even if an external document the user uploads contains malicious instructions.

What are the specific success rate figures?

The paper cites precise metrics on two frontier models:

Model	Memory Poisoning Success	Attacker-Intended Action
GPT-5.5	99.8%	60–89% of successful retrievals
Kimi-K2.6	95%	60–89% of successful retrievals

The GPT-5.5 figure is particularly dramatic — 99.8% means virtually guaranteed memory corruption if the attacker knows the agent’s structure. Frontier models with state-of-the-art alignment training are almost completely defenseless against this attack vector.

The second metric — 60–89% action triggering rate — shows that successful memory corruption converts into actionable attack in most cases. This is not a theoretical threat — it is a production-grade attack vector with real-world impact.

Why is memory poisoning difficult to detect?

The difficulty of defense stems from several factors:

Memory writes are normal operation — the agent writes memory items continuously through user interactions
No anomaly signal — an adversarial memory item looks like any other user fact
Cross-session evaluation required — single-session monitoring doesn’t detect the attack because the trigger comes later
Difficult attribution — when the attack triggers, tracing it back to the original adversarial source is a nontrivial retrospective forensics task

The approach requires end-to-end memory pipeline auditing, not a single-point security control.

What does this mean for production LLM deployments?

The findings have critical implications for organizations deploying LLM agents with memory features:

ChatGPT Enterprise with Memory — potential exposure if employees upload documents from unverified sources
Claude Projects — compromised projects can corrupt cross-project memory
Custom agent deployments with vector stores as long-term memory — massive attack surface
Multi-user systems with shared memory — one compromised user can affect everyone

Defensive priorities implied by the paper:

Memory source provenance — track every memory item back to the originating source
Adversarial content scanning before memory writes
Retrieval anomaly detection — flagging unusual memory access patterns
Memory expiration policies — automatic cleanup of old memory items

Position in the 2026 agentic security landscape

The paper fits into the explosive wave of agentic safety/security research through May 2026:

arXiv FATE (May 12) — 33.5% attack reduction through formal techniques
arXiv History Anchors (May 13) — 91–98% unsafe shift through history manipulation
arXiv Sycophantic Consensus (May 15) — alignment failure modes
Microsoft AI Delegation (May 15) — 19–34% reliability degradation
arXiv Compositional Jailbreaking (May 15) — mutator chain synergies

The trend is crystal clear: 2026 is the year agentic systems transition from “experimental capability” to “production attack surface.” The safety provided by mainstream RLHF + safety training for chatbot use cases is insufficient for stateful agents with persistent memory.

Sleeper Memory Poisoning is likely the most significant security paper of May 2026 due to two numbers: 99.8% and persistence across multiple sessions. The industry must seriously revisit the architecture of LLM memory systems before attackers reproduce these results in real-world deployments.

arXiv:2605.15338 Sleeper Memory Poisoning: 99.8% attack success rate on GPT-5.5 via persistent memory of LLM agents

What does sleeper memory poisoning specifically mean?

What does the attack pipeline specifically look like?

What are the specific success rate figures?

Why is memory poisoning difficult to detect?

What does this mean for production LLM deployments?

Position in the 2026 agentic security landscape

Frequently Asked Questions

Sources

Related news