ICML 2026: Visual ciphers bypass VLM safety filters in 40.9% of attempts

Researchers Aharon Azulay, Jan Dubiński, and Zhuoyun Li describe, in a paper accepted at ICML 2026, four attack classes that exploit the visual modality to bypass safety alignment in vision-language models. Visual ciphers achieve a 40.9% success rate against Claude Haiku 4.5, while equivalent text-based attacks break through in only 10.7% of cases, which indicates that images open an attack surface that does not exist in purely language-based models.

Azulay, Dubiński, and Li published the paper “Jailbreaking Vision-Language Models Through the Visual Modality” on arXiv on May 1, 2026; it has been accepted at the International Conference on Machine Learning (ICML) 2026. The paper systematically documents that the visual input of vision-language models (VLMs) constitutes an attack surface that lies beyond the reach of safety alignment trained on text.

What new attack vector does the image open?

The visual modality represents an underexplored attack surface for bypassing safety alignment, the authors state. Filters trained on text do not cover semantic transformations that images naturally enable — encoding instructions as pictorial symbols, replacing objects, or combining visual analogies. As a result, attacks that would be rejected in purely language-based models pass through the visual channel.

The authors identified four attack classes:

Visual ciphers — encoding harmful instructions as visual symbol sequences with a decoding legend
Object substitution — replacing a harmful term (bomb) with a benign one (banana) while requesting harmful actions using the substituted term
Text substitution in images — replacing harmful text with benign language while the visual context preserves the original interpretation
Analogical puzzles — visual puzzles whose solution requires reasoning about a forbidden concept
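
For teams that want to tag test cases by these categories, the four classes could be recorded as a small enumeration. This is only an illustrative sketch: the names `AttackClass`, `RedTeamCase`, and `classes_missing` are hypothetical and do not come from the paper, and the sample corpus is invented.

```python
from dataclasses import dataclass
from enum import Enum


class AttackClass(Enum):
    """Hypothetical labels for the four attack classes listed above."""
    VISUAL_CIPHER = "visual cipher"
    OBJECT_SUBSTITUTION = "object substitution"
    TEXT_SUBSTITUTION = "text substitution in images"
    ANALOGICAL_PUZZLE = "analogical puzzle"


@dataclass(frozen=True)
class RedTeamCase:
    """One logged test case; the fields are assumptions for illustration."""
    case_id: str
    attack_class: AttackClass
    uses_image: bool


def classes_missing(cases):
    """Return the attack classes that no case in the corpus covers."""
    covered = {case.attack_class for case in cases}
    return [cls for cls in AttackClass if cls not in covered]


# Invented example corpus that covers only two of the four classes.
corpus = [
    RedTeamCase("case-001", AttackClass.VISUAL_CIPHER, uses_image=True),
    RedTeamCase("case-002", AttackClass.OBJECT_SUBSTITUTION, uses_image=True),
]
print([cls.value for cls in classes_missing(corpus)])
# prints ['text substitution in images', 'analogical puzzle']
```

A check like this only reports which categories are absent from a test corpus; it says nothing about how well the covered categories are exercised.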

What are the concrete attack success metrics?

On Claude Haiku 4.5 the visual cipher achieved a 40.9% success rate. The text equivalent of the same request breaks through filters in only 10.7% of cases. The difference is nearly fourfold, and this attack route did not exist before vision input became a standard part of frontier models.
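
The size of that gap can be checked with a few lines of arithmetic. The two rates are the ones reported above; the variable names are only illustrative.

```python
visual_rate = 40.9  # visual cipher success rate on Claude Haiku 4.5, in percent
text_rate = 10.7    # success rate of the text equivalent, in percent

ratio = visual_rate / text_rate       # relative increase
gap_points = visual_rate - text_rate  # absolute gap in percentage points

print(f"ratio: {ratio:.2f}x, gap: {gap_points:.1f} percentage points")
# prints: ratio: 3.82x, gap: 30.2 percentage points
```

The ratio of roughly 3.8 is what the article rounds to "nearly fourfold"; in absolute terms the visual attack succeeds about 30 percentage points more often.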

The evaluation covered six frontier VLMs. Numbers for the other five models are given in the main body of the paper; the authors chose Claude Haiku 4.5 as a representative case because it is a very recent model with ostensibly strong safety alignment.

What does this mean for enterprise and security teams?

The paper suggests that existing red-team methodology — which relies almost exclusively on textual prompt attacks — systematically underestimates VLM risk. Security teams deploying multimodal agents must extend their red-team corpus to visual inputs, in particular: encoded symbol sequences, visual substitution attacks, and analogical puzzles that activate reasoning about blocked concepts.
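
One way a security team might compare text and visual results in its own red-team runs is to tally the attack success rate per modality. The sketch below assumes each run is logged as a `(modality, succeeded)` pair; the function name `success_rate_by_modality` and the sample log are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def success_rate_by_modality(results):
    """Compute the attack success rate (in percent) for each modality.

    `results` is an iterable of (modality, succeeded) pairs, where
    `succeeded` is True when the model complied with the blocked request.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for modality, succeeded in results:
        totals[modality] += 1
        if succeeded:
            successes[modality] += 1
    return {m: 100.0 * successes[m] / totals[m] for m in totals}


# Made-up outcomes for illustration only; these are not the paper's data.
log = [
    ("text", False), ("text", False), ("text", False), ("text", True),
    ("image", True), ("image", False), ("image", True), ("image", False),
]
rates = success_rate_by_modality(log)
print(rates)
# prints {'text': 25.0, 'image': 50.0}
```

A corpus with no image entries would produce no `image` key at all, which is itself a signal that the visual channel has not been tested.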

The broader implication: safety alignment performed on text, for example through Reinforcement Learning from Human Feedback (RLHF), does not generalize to the visual modality. Cross-modal alignment becomes a research priority, not an implementation detail.

The paper is available on arXiv under ID 2605.00583 and will be presented at ICML 2026.

Frequently Asked Questions

How much more effective are visual attacks than text attacks in VLM jailbreaking?

On Claude Haiku 4.5 a visual cipher achieves 40.9% success, while the equivalent text attack breaks filters in only 10.7% of cases — nearly a fourfold difference.

What four attack classes does the paper define?

Encoded visual symbol sequences with a decoding legend, replacement of harmful objects with benign ones (bomb → banana), substitution of harmful text in an image with benign text while the visual context preserves the original interpretation, and analogical puzzles requiring reasoning about a forbidden concept.

How many models were tested in the study?

Six frontier vision-language models. The paper was accepted at ICML 2026 and describes attacks that are structurally impossible in purely text-based LLMs.
