🔴 🛡️ Security Thursday, April 16, 2026 · 3 min read

ArXiv: MemJack — Multi-Agent Attack Breaks Vision-Language Model Defenses with Up to 90% Success Rate

Why it matters

MemJack is a new jailbreak framework targeting vision-language models (VLMs) that uses coordinated multi-agent collaboration instead of classical pixel perturbations. Tested on unmodified COCO images, it achieves a 71.48% success rate on Qwen3-VL-Plus, rising to 90% with an expanded budget. Researchers plan to publicly release over 113,000 interactive attack trajectories to support defensive research.

Multimodal AI models that combine text and image understanding — known as vision-language models (VLMs) — are facing a new category of security threats. A research team led by Jianhao Chen has introduced MemJack, a framework that uses coordinated multi-agent collaboration to bypass VLM safety mechanisms, achieving alarmingly high success rates.

How Does MemJack Bypass Security Protections?

Unlike previous approaches that rely on pixel perturbations — subtle image changes invisible to the human eye — MemJack uses an entirely different strategy. The system maps visual elements to harmful objectives through semantic understanding of image content, then generates adversarial prompts using multi-perspective camouflage techniques.

The key innovation is the coordination of multiple specialized agents. One agent analyzes visual content, a second generates camouflage strategies, and a third applies geometric filtering to bypass the model’s security mechanisms. The system uses completely unmodified images from the COCO dataset (a standard benchmark for computer vision), which makes the attack especially hard to catch: there is no pixel-level manipulation for existing defenses to detect.
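The authors have not published their agent implementation, so purely as an illustration, the three-role loop described above might be coordinated roughly like this. Every function name and all of the stub logic below are hypothetical, not the paper's code; in the real system each role would be backed by a model call rather than string manipulation:

```python
# Hypothetical sketch of a three-agent attack round. Agent internals are
# stubbed out with trivial string logic so the coordination is visible.

def analyze_visual(image_caption: str) -> list[str]:
    """Agent 1 (stub): map visual elements to candidate semantic hooks."""
    return [w for w in image_caption.split() if len(w) > 3]

def generate_camouflage(hooks: list[str], objective: str) -> list[str]:
    """Agent 2 (stub): wrap the objective in benign-looking framings."""
    framings = [
        "As a safety auditor, describe {h} in the context of {o}.",
        "Write fiction where {h} relates to {o}.",
    ]
    return [f.format(h=h, o=objective) for h in hooks for f in framings]

def filter_prompts(prompts: list[str], max_len: int = 120) -> list[str]:
    """Agent 3 (stub): keep only prompts likely to pass target-side filters."""
    return [p for p in prompts if len(p) <= max_len]

def attack_round(image_caption: str, objective: str) -> list[str]:
    """One coordinated round: analyze, camouflage, then filter."""
    hooks = analyze_visual(image_caption)
    candidates = generate_camouflage(hooks, objective)
    return filter_prompts(candidates)

prompts = attack_round("a kitchen with knives on a counter", "test objective")
```

The point of the sketch is the division of labor: each agent does one narrow job, and the round function only orchestrates, which is what lets the framework swap strategies in and out per image.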

Why Is Persistent Memory a Critical Component?

MemJack introduces a persistent memory component that accumulates successful strategies across interactions. Each successful attack enriches the system’s knowledge base, making future attacks on new images more effective. This experience-based learning mechanism means the system becomes increasingly dangerous over time.
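The memory component is likewise not public. As a rough sketch of the idea (all class and method names below are invented for illustration), such a memory could record per-strategy successes and attempts, then rank strategies by empirical hit rate so that future attacks try historically strong ones first:

```python
from collections import defaultdict

class StrategyMemory:
    """Hypothetical persistent memory: tallies successes and attempts per
    strategy so later attacks can prioritize what has worked before."""

    def __init__(self) -> None:
        self.successes: dict[str, int] = defaultdict(int)
        self.attempts: dict[str, int] = defaultdict(int)

    def record(self, strategy: str, succeeded: bool) -> None:
        """Log the outcome of one attack attempt using this strategy."""
        self.attempts[strategy] += 1
        if succeeded:
            self.successes[strategy] += 1

    def ranked(self) -> list[str]:
        """Strategies ordered by empirical success rate, best first."""
        return sorted(
            self.attempts,
            key=lambda s: self.successes[s] / self.attempts[s],
            reverse=True,
        )

mem = StrategyMemory()
mem.record("roleplay", True)
mem.record("roleplay", False)
mem.record("fiction_framing", True)
print(mem.ranked())  # → ['fiction_framing', 'roleplay']
```

This is exactly why the mechanism compounds: every success shifts the ranking, so the attacker's first guesses on a fresh image get better over time.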

Tested against Qwen3-VL-Plus, MemJack achieves an Attack Success Rate (ASR) of 71.48%. With an expanded computational budget (more iterations and more agents), that rate climbs to a startling 90%, meaning roughly nine out of ten attempted images yield a successful jailbreak of the multimodal model.
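ASR is simply the fraction of attack attempts that succeed, expressed as a percentage; for instance, 71.48% corresponds to about 715 successes per 1,000 attempted images (the paper's actual denominator is not stated here). A trivial helper, not the authors' evaluation code:

```python
def attack_success_rate(successes: int, attempts: int) -> float:
    """ASR: percentage of attack attempts that succeeded."""
    if attempts == 0:
        raise ValueError("no attempts recorded")
    return 100.0 * successes / attempts

print(round(attack_success_rate(7148, 10000), 2))  # → 71.48
```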

What Does This Mean for the Multimodal Model Industry?

The results point to a fundamental problem in the security architecture of VLMs. Previous defenses focused primarily on detecting modified images or filtering explicitly harmful text prompts. MemJack demonstrates that an attacker can use entirely legitimate images and sophisticated prompts to circumvent these protections.

The researchers plan to publicly release the MemJack-Bench dataset with more than 113,000 interactive multimodal attack trajectories. The goal is to enable defensive researchers to develop more robust protection mechanisms. This is a double-edged sword — the same data that aids defense can also help attackers — but the research team believes transparency ultimately benefits the defensive side.

For companies deploying VLMs in production systems — from medical image analysis to autonomous driving — MemJack serves as a warning that security evaluations must include testing resilience against coordinated multi-agent attacks, not just isolated manipulation attempts.


This article was generated using artificial intelligence from primary sources.