CrossMPI: attacking vision-language models with images alone

arXiv:2605.16090 introduces CrossMPI, an attack on vision-language models that injects malicious instructions solely through imperceptible pixel changes in an image, without any attacker-supplied text. The researchers found that the layers critical to multimodal integration sit in the middle of the model, not at the end as previously assumed. The attack achieves an average attack success rate (ASR) of 66.36%, which is 40.91 percentage points above the average of the baseline methods.

What is CrossMPI and why is it dangerous?

Researchers (Hao Yang, Zhuo Ma, Yang Liu and collaborators) published the paper arXiv:2605.16090, which introduces CrossMPI, a prompt injection attack on large vision-language models (LVLMs) that operates exclusively through image perturbation, without any attacker-provided text.

Prompt injection is an attack in which hidden instructions are smuggled into an AI model to alter its behavior. CrossMPI transfers this principle to the multimodal space: the malicious instruction is encoded in invisible pixel changes — adversarial perturbation — that the human eye cannot detect.
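
To make "invisible pixel changes" concrete, here is a minimal Python sketch that caps how far each pixel may move. The per-pixel (L∞-style) bound, the value epsilon = 8/255 and the function name apply_perturbation are illustrative assumptions; the article does not say which constraint or budget CrossMPI uses.

```python
def apply_perturbation(image, delta, epsilon=8 / 255):
    """Add a perturbation to an image while keeping it visually close.

    image, delta: nested lists of floats (rows of pixel intensities in [0, 1]).
    epsilon: hypothetical per-pixel budget; the article does not give a value.
    """
    adversarial = []
    for img_row, delta_row in zip(image, delta):
        row = []
        for pixel, change in zip(img_row, delta_row):
            # Clip the change to the budget, then keep the pixel in valid range.
            bounded = max(-epsilon, min(epsilon, change))
            row.append(max(0.0, min(1.0, pixel + bounded)))
        adversarial.append(row)
    return adversarial


image = [[0.50, 0.99], [0.00, 0.25]]      # tiny 2x2 grayscale "image"
delta = [[0.20, 0.20], [-0.20, 0.01]]     # requested changes, some too large
adversarial = apply_perturbation(image, delta)

# Largest change that actually reached any pixel.
max_change = max(
    abs(a - p)
    for adv_row, img_row in zip(adversarial, image)
    for a, p in zip(adv_row, img_row)
)
```

However large the requested change, no pixel moves by more than epsilon, which is what keeps the altered image indistinguishable from the original to a human viewer.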

A vision-language model receives an image and text, merges them internally into a shared representation space, and generates a response. It is precisely this step — multimodal integration — that proved to be the most vulnerable point.

A discovery that changes assumptions: critical layers are in the middle

It was previously assumed that the output layers of the transformer architecture are the most susceptible to manipulation. The CrossMPI results contradict this.

The optimal layers for perturbation lie in the middle of the VLM, not near the end, so defense mechanisms focused on output layers miss attacks embedded deeper in the model. The optimization space in those layers is about 10⁷ parameters, versus about 10⁵ in the visual embedding. That is roughly a hundred times larger, which accounts for the attack's much greater reach.

The method combines two components: a layer selection strategy, which automatically locates the critical layers, and a decaying perturbation budget assignment, under which pixels closer to semantically important regions receive larger perturbations.
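
The budget rule described above could be sketched roughly as follows. This is only an illustration: the exponential decay, the Chebyshev distance, the parameter values and the hand-picked list of important pixels are all assumptions, not the paper's formula.

```python
def decaying_budget(height, width, important, eps_max=8 / 255, decay=0.5):
    """Per-pixel budget that shrinks with distance from important pixels.

    important: list of (row, col) coordinates, assumed to be given (hypothetical).
    The exponential decay rule is an assumption, not the paper's formula.
    """
    budget = []
    for r in range(height):
        row = []
        for c in range(width):
            # Chebyshev distance to the nearest important pixel.
            dist = min(max(abs(r - ir), abs(c - ic)) for ir, ic in important)
            row.append(eps_max * decay ** dist)
        budget.append(row)
    return budget


# One important pixel at the left edge of a 3x5 image.
budget = decaying_budget(3, 5, important=[(1, 0)])
```

In this sketch the important pixel itself gets the full budget eps_max, and each step away from it halves the allowed change.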

Experimental results: far ahead of baseline methods

CrossMPI was tested on six VLMs: MiniGPT4-Llama2, MiniGPT4-Vicuna, InstructBLIP, BLIP-2, BLIVA and Qwen2.5-VL, on three datasets (MSCOCO, ImageNet, TextVQA).

The average attack success rate (ASR) is 66.36%, which is 40.91 pp higher than the average of four baseline methods (including ARE-W at 8.24%, CI at 54.57% and ATPI at 4.41%). On BLIP-2 with MSCOCO, ASR reaches 96.08% with minimal visual distortion (LPIPS of about 18–20 versus 70–85 for the baselines; lower means less visible change).
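
As a quick check of these figures, the following Python snippet works out the baseline average implied by the reported gap. It assumes the gap is measured against a simple mean of the baselines' ASR values, which the article does not spell out; the variable names are illustrative.

```python
crossmpi_asr = 66.36   # average ASR reported in the article, in percent
gap_pp = 40.91         # reported lead over the baseline average, in percentage points

# Baseline average implied by the two reported numbers.
implied_baseline_mean = round(crossmpi_asr - gap_pp, 2)

# The three baseline values that the article lists explicitly.
listed = {"ARE-W": 8.24, "CI": 54.57, "ATPI": 4.41}
listed_mean = round(sum(listed.values()) / len(listed), 2)

print(implied_baseline_mean)
print(listed_mean)
```

The implied baseline average is 25.45%, while the three listed baselines alone average 22.41%. Under the simple-mean assumption, the fourth baseline, whose value is not listed here, would have to score above the average of the other three.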

Why are the security implications serious?

An attacker who controls an input image, such as a document, photograph or web content, can alter the behavior of a VLM without supplying any text for filters to detect. Production VLM deployments (document analysis, medical diagnostics, vision-enabled chatbots) are all potentially exposed.

The authors conclude that defense strategies must abandon their focus on output layers and turn to the middle of the model — the actual point of multimodal integration.

Frequently Asked Questions

What is a vision-language model (VLM)?

A vision-language model (VLM) is a multimodal AI system that simultaneously understands images and text — examples include BLIP-2, InstructBLIP and Qwen2.5-VL. The model receives visual and textual input, integrates them internally into a shared representation space, and generates a textual response.

How does the CrossMPI attack work?

CrossMPI optimizes subtle pixel changes that are imperceptible to humans (adversarial perturbation) directly in the model's hidden state space. Instead of attacking the visual embedding (about 10⁵ parameters), it targets the middle layers where multimodal integration occurs (about 10⁷ parameters), which makes it far more successful at injecting malicious instructions.
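
The general idea of optimizing pixels against a hidden state can be illustrated with a toy Python sketch. Everything here is a stand-in: a single tanh layer plays the role of a "middle layer", the target hidden state is produced from a made-up reference input, and the squared-distance loss, the signed-gradient (PGD-style) update and all parameter values are assumptions rather than the paper's actual procedure.

```python
import math


def hidden_state(weights, pixels):
    """Toy stand-in for a middle layer: h = tanh(W x)."""
    return [math.tanh(sum(w * x for w, x in zip(row, pixels))) for row in weights]


def loss_and_grad(weights, pixels, target):
    """Squared distance to the target hidden state and its gradient w.r.t. pixels."""
    h = hidden_state(weights, pixels)
    loss = sum((hi - ti) ** 2 for hi, ti in zip(h, target))
    # Chain rule through tanh: d tanh(z)/dz = 1 - tanh(z)^2.
    upstream = [2 * (hi - ti) * (1 - hi ** 2) for hi, ti in zip(h, target)]
    grad = [sum(u * row[j] for u, row in zip(upstream, weights))
            for j in range(len(pixels))]
    return loss, grad


def attack(weights, image, target, epsilon=0.05, step=0.01, iterations=50):
    """Projected signed-gradient descent on the pixels (a PGD-style loop)."""
    adversarial = list(image)
    for _ in range(iterations):
        _, grad = loss_and_grad(weights, adversarial, target)
        for j, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            moved = adversarial[j] - step * sign
            # Project back into the epsilon-ball around the clean pixel and into [0, 1].
            moved = max(image[j] - epsilon, min(image[j] + epsilon, moved))
            adversarial[j] = max(0.0, min(1.0, moved))
    return adversarial


weights = [[0.9, -0.4, 0.2], [-0.3, 0.8, 0.5]]    # hypothetical layer weights
image = [0.5, 0.5, 0.5]                           # clean "image" of three pixels
target = hidden_state(weights, [0.9, 0.1, 0.6])   # hidden state to imitate

loss_before, _ = loss_and_grad(weights, image, target)
adversarial = attack(weights, image, target)
loss_after, _ = loss_and_grad(weights, adversarial, target)
```

After the loop, the perturbed pixels stay within epsilon of the clean image while the hidden state has moved closer to the target, so loss_after is smaller than loss_before. A real attack would do the same against the chosen middle layers of an actual VLM, using automatic differentiation instead of a hand-written gradient.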

Why is the discovery about the 'model middle' so important?

Earlier adversarial attack research assumed that the final (output) layers of the transformer architecture are the most susceptible to manipulation. CrossMPI provides empirical evidence for the opposite: the layers critical to multimodal integration are in the middle of the VLM, which means defense mechanisms focused on output layers need to be re-evaluated.
