🔴 🛡️ Security Thursday, April 16, 2026 · 3 min read

ArXiv: MemJack — Multi-Agent Attack Breaks Vision-Language Model Defenses with Up to 90% Success Rate

Why it matters

MemJack is a new jailbreak framework targeting vision-language models (VLMs) that uses coordinated multi-agent collaboration instead of classical pixel perturbations. Tested on unmodified COCO images, it achieves a 71.48% success rate on Qwen3-VL-Plus, rising to 90% with an expanded budget. Researchers plan to publicly release over 113,000 interactive attack trajectories to support defensive research.

Multimodal AI models that combine text and image understanding — known as vision-language models (VLMs) — are facing a new category of security threats. A research team led by Jianhao Chen has introduced MemJack, a framework that uses coordinated multi-agent collaboration to bypass VLM safety mechanisms, achieving alarmingly high success rates.

How Does MemJack Bypass Security Protections?

Unlike previous approaches that rely on pixel perturbations — subtle image changes invisible to the human eye — MemJack uses an entirely different strategy. The system maps visual elements to harmful objectives through semantic understanding of image content, then generates adversarial prompts using multi-perspective camouflage techniques.

The key innovation is the coordination of multiple specialized agents. One agent analyzes visual content, a second generates camouflage strategies, and a third applies geometric filtering to bypass the model’s security mechanisms. The system uses completely unmodified images from the COCO dataset (a standard benchmark for computer vision), which makes the attack especially hard to catch: there is no pixel-level manipulation for existing defenses to detect.
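The authors have not published their agent implementation, so purely as an illustration, the three-role loop described above might be coordinated roughly like this. Every function name and all of the stub logic below are hypothetical, not the paper's code; in the real system each role would be backed by a model call rather than string manipulation:

```python
# Hypothetical sketch of a three-agent attack round. Agent internals are
# stubbed out with trivial string logic so the coordination is visible.

def analyze_visual(image_caption: str) -> list[str]:
    """Agent 1 (stub): map visual elements to candidate semantic hooks."""
    return [w for w in image_caption.split() if len(w) > 3]

def generate_camouflage(hooks: list[str], objective: str) -> list[str]:
    """Agent 2 (stub): wrap the objective in benign-looking framings."""
    framings = [
        "As a safety auditor, describe {h} in the context of {o}.",
        "Write fiction where {h} relates to {o}.",
    ]
    return [f.format(h=h, o=objective) for h in hooks for f in framings]

def filter_prompts(prompts: list[str], max_len: int = 120) -> list[str]:
    """Agent 3 (stub): keep only prompts likely to pass target-side filters."""
    return [p for p in prompts if len(p) <= max_len]

def attack_round(image_caption: str, objective: str) -> list[str]:
    """One coordinated round: analyze, camouflage, then filter."""
    hooks = analyze_visual(image_caption)
    candidates = generate_camouflage(hooks, objective)
    return filter_prompts(candidates)

prompts = attack_round("a kitchen with knives on a counter", "test objective")
```

The point of the sketch is the division of labor: each agent does one narrow job, and the round function only orchestrates, which is what lets the framework swap strategies in and out per image.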

Why Is Persistent Memory a Critical Component?

MemJack introduces a persistent memory component that accumulates successful strategies across interactions. Each successful attack enriches the system’s knowledge base, making future attacks on new images more effective. This experience-based learning mechanism means the system becomes increasingly dangerous over time.
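The memory component is likewise not public. As a rough sketch of the idea (all class and method names below are invented for illustration), such a memory could record per-strategy successes and attempts, then rank strategies by empirical hit rate so that future attacks try historically strong ones first:

```python
from collections import defaultdict

class StrategyMemory:
    """Hypothetical persistent memory: tallies successes and attempts per
    strategy so later attacks can prioritize what has worked before."""

    def __init__(self) -> None:
        self.successes: dict[str, int] = defaultdict(int)
        self.attempts: dict[str, int] = defaultdict(int)

    def record(self, strategy: str, succeeded: bool) -> None:
        """Log the outcome of one attack attempt using this strategy."""
        self.attempts[strategy] += 1
        if succeeded:
            self.successes[strategy] += 1

    def ranked(self) -> list[str]:
        """Strategies ordered by empirical success rate, best first."""
        return sorted(
            self.attempts,
            key=lambda s: self.successes[s] / self.attempts[s],
            reverse=True,
        )

mem = StrategyMemory()
mem.record("roleplay", True)
mem.record("roleplay", False)
mem.record("fiction_framing", True)
print(mem.ranked())  # → ['fiction_framing', 'roleplay']
```

This is exactly why the mechanism compounds: every success shifts the ranking, so the attacker's first guesses on a fresh image get better over time.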

Tested against Qwen3-VL-Plus, MemJack achieves an Attack Success Rate (ASR) of 71.48%. With an expanded computational budget (more iterations and more agents), that rate climbs to a startling 90%, meaning roughly nine out of ten attempted images yield a successful jailbreak of the multimodal model.
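ASR is simply the fraction of attack attempts that succeed, expressed as a percentage; for instance, 71.48% corresponds to about 715 successes per 1,000 attempted images (the paper's actual denominator is not stated here). A trivial helper, not the authors' evaluation code:

```python
def attack_success_rate(successes: int, attempts: int) -> float:
    """ASR: percentage of attack attempts that succeeded."""
    if attempts == 0:
        raise ValueError("no attempts recorded")
    return 100.0 * successes / attempts

print(round(attack_success_rate(7148, 10000), 2))  # → 71.48
```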

What Does This Mean for the Multimodal Model Industry?

The results point to a fundamental problem in the security architecture of VLMs. Previous defenses focused primarily on detecting modified images or filtering explicitly harmful text prompts. MemJack demonstrates that an attacker can use entirely legitimate images and sophisticated prompts to circumvent these protections.

The researchers plan to publicly release the MemJack-Bench dataset with more than 113,000 interactive multimodal attack trajectories. The goal is to enable defensive researchers to develop more robust protection mechanisms. This is a double-edged sword — the same data that aids defense can also help attackers — but the research team believes transparency ultimately benefits the defensive side.

For companies deploying VLMs in production systems — from medical image analysis to autonomous driving — MemJack serves as a warning that security evaluations must include testing resilience against coordinated multi-agent attacks, not just isolated manipulation attempts.


This article was generated using artificial intelligence from primary sources.