
MARS: Textual Refusal Directions Protect Multimodal AI Models Without Additional Training


Researchers from the University of Trento propose MARS, a multimodal safety approach that transfers refusal directions from a textual LLM and applies them to image and video inputs without any additional training. In tests on five current multimodal models, it consistently improved safety while preserving model utility.


This article was generated using artificial intelligence from primary sources.

Multimodal large language models — which simultaneously process text, images, and video — introduce a new challenge for safety researchers: safety mechanisms trained on text data do not automatically transfer to visual modalities. An attacker who cannot elicit a harmful response through a text query can sometimes achieve it with a carefully crafted image or video sequence.

A research team from the Department of Computer Science at the University of Trento — D’Incà, Mancini, and Sebe — proposes a new approach that bridges this gap without a single additional training step.

What Is MARS?

MARS (Modality-Agnostic Refusal Steering) starts from a simple but powerful premise: the mechanism by which an LLM refuses a harmful text request is not located exclusively in the input layer — it is embedded deeper in the model’s activation space. These refusal directions are geometric structures that can be identified and, as MARS demonstrates, generalized across modalities.

Concretely: refusal directions extracted from the purely textual part of the model are applicable to activations that arose from processing an image or video. A multimodal model contains knowledge of what refusal means — MARS activates that structure even in modalities where it is not otherwise present as an active safety mechanism.
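To make the idea of a refusal direction concrete, here is a minimal sketch using a difference of mean activations, a common way to estimate such directions in the steering literature. The article does not say how MARS extracts its directions, so the difference-of-means recipe, the function names, and the synthetic data below are all assumptions made for illustration.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean harmless activation to the mean harmful one."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(activation: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Push an activation along the refusal direction by a fixed amount."""
    return activation + strength * direction

# Synthetic stand-ins for hidden states collected from text-only prompts.
rng = np.random.default_rng(0)
hidden = 16
true_dir = np.zeros(hidden)
true_dir[0] = 1.0  # hypothetical axis along which harmful prompts differ
harmless = rng.normal(size=(200, hidden))
harmful = rng.normal(size=(200, hidden)) + 4.0 * true_dir

r = refusal_direction(harmful, harmless)

# The same text-derived direction is then applied to an activation that
# (in a real model) would come from an image or video input.
visual_act = rng.normal(size=hidden)
steered = steer(visual_act, r, strength=3.0)
```

In a real model the arrays would hold hidden states read from a chosen transformer layer; the point of the sketch is only that a direction estimated from text can be added to an activation of any origin.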

Three Mechanisms That Make MARS Robust

The approach relies on three components that operate together during the generation of the first response token — the phase in which the refusal decision is made:

Activation re-centering is a shift of the activation space toward the region in which the model naturally refuses harmful requests. Activations arising from visual input are directed toward the same geometric zone in which the textual model recognizes harmful content.
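One simple way to picture re-centering is as a translation that moves the centroid of visual activations onto a reference centroid computed from text. The sketch below does exactly that; the article does not give the formula MARS uses, so the mean-translation rule, the choice of the text centroid as target, and all names are assumptions.

```python
import numpy as np

def recenter(activations: np.ndarray, source_mean: np.ndarray, target_mean: np.ndarray) -> np.ndarray:
    """Translate activations so that source_mean lands on target_mean."""
    return activations - source_mean + target_mean

# Synthetic stand-ins: visual activations sit in a shifted region of the space.
rng = np.random.default_rng(0)
hidden = 8
text_acts = rng.normal(loc=0.0, size=(100, hidden))
visual_acts = rng.normal(loc=2.5, size=(100, hidden))

text_mean = text_acts.mean(axis=0)
visual_mean = visual_acts.mean(axis=0)

# After the shift, the visual batch is centered where the text batch is,
# while its internal spread is left unchanged.
recentered = recenter(visual_acts, visual_mean, text_mean)
```

A pure translation keeps the relative geometry of the visual activations intact, which is why it is a natural first guess for aligning two modalities before a shared direction is applied.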

Adaptive intervention scaling dynamically adjusts the strength of the correction depending on how far the input is from safe examples. This reduces the collateral effect on benign queries — the model’s utility is not degraded by blanket strengthening of all refusals.
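The scaling idea can be sketched as a strength that grows with the input's offset from the safe examples and saturates at a cap. The article does not specify the distance measure or the scaling function, so the projection onto the refusal direction, the linear ramp, and the parameters `max_strength` and `saturation` are hypothetical choices for illustration.

```python
import numpy as np

def adaptive_strength(activation, safe_mean, direction, max_strength=4.0, saturation=2.0):
    """Steering strength that ramps linearly from 0 to max_strength.

    The offset is measured as the projection of (activation - safe_mean)
    onto the unit refusal direction; negative offsets give zero strength.
    """
    offset = float((activation - safe_mean) @ direction)
    return max_strength * float(np.clip(offset / saturation, 0.0, 1.0))

def steer_adaptively(activation, safe_mean, direction, **kwargs):
    """Apply the refusal direction with an input-dependent strength."""
    strength = adaptive_strength(activation, safe_mean, direction, **kwargs)
    return activation + strength * direction

direction = np.array([1.0, 0.0, 0.0])   # assumed unit refusal direction
safe_mean = np.zeros(3)                  # assumed centroid of benign examples

benign = np.array([0.1, 0.5, -0.3])      # close to the safe centroid
suspicious = np.array([3.0, 0.5, -0.3])  # far along the refusal direction

benign_strength = adaptive_strength(benign, safe_mean, direction)
suspicious_strength = adaptive_strength(suspicious, safe_mean, direction)
steered = steer_adaptively(suspicious, safe_mean, direction)
```

With these numbers the benign input receives only a small correction while the suspicious one receives the full cap, which is the behavior the paragraph above describes: little collateral effect on queries that already look safe.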

Optimal layer selection identifies which transformer layer at first-token generation has the greatest influence on the refusal decision and applies the intervention precisely there. This is more efficient than applying it across all layers and reduces unwanted interactions with the rest of the network.
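Layer selection can be illustrated by scoring each candidate layer and keeping the best one. The article does not state the criterion MARS uses, so the sketch below substitutes a simple proxy, namely how cleanly harmful and harmless activations separate along that layer's own difference-of-means direction. The scoring function and all names are assumptions.

```python
import numpy as np

def separation_score(harmful: np.ndarray, harmless: np.ndarray) -> float:
    """Gap between class means along the layer's direction, in pooled std units."""
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    direction = diff / np.linalg.norm(diff)
    p_harmful = harmful @ direction
    p_harmless = harmless @ direction
    pooled_std = np.sqrt(0.5 * (p_harmful.var() + p_harmless.var()))
    return float((p_harmful.mean() - p_harmless.mean()) / pooled_std)

def select_layer(harmful_by_layer, harmless_by_layer):
    """Return the index of the highest-scoring layer and the list of all scores."""
    scores = [separation_score(h, s) for h, s in zip(harmful_by_layer, harmless_by_layer)]
    return int(np.argmax(scores)), scores

# Synthetic first-token activations for four layers; the class gap is
# deliberately largest at layer index 2.
rng = np.random.default_rng(1)
hidden, n = 8, 300
gaps = [0.0, 1.0, 5.0, 2.0]
harmless_by_layer = [rng.normal(size=(n, hidden)) for _ in gaps]
harmful_by_layer = [rng.normal(size=(n, hidden)) + gap * np.eye(hidden)[0] for gap in gaps]

best_layer, scores = select_layer(harmful_by_layer, harmless_by_layer)
```

Separability is only a stand-in for "influence on the refusal decision"; a causal measure, such as how much steering at a layer changes the refusal rate, would be closer to what the text describes but needs a real model to evaluate.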

The Key Advantage: No Multimodal Safety Data

Classical approaches to multimodal safety require datasets that pair harmful visual input with an appropriate response. Such datasets are expensive and difficult to collect, and the fine-tuning they feed can degrade model utility on standard tasks.

MARS needs no such data. It uses only the textual refusal structure already present in the model. This makes it applicable to any multimodal model built on an LLM backbone, with no retraining, no GPU cluster, and no specialized safety datasets.

Testing on Five Current Multimodal Models

The researchers evaluated MARS on five current state-of-the-art multimodal models that process images and video. The results show consistent safety gains: with MARS active, the models generate harmful content less often in response to visual attacks that would otherwise bypass textual defenses.

Utility on benign tasks is preserved, which is the critical condition for production use: a safety intervention that degrades response quality would not be accepted in practice.

The authors emphasize that MARS is not a replacement for robust safety training — it is a lightweight layer that can improve an already-deployed model quickly and without significant cost. Combining it with the original safety training should theoretically yield even better results.

Broader Context: Why Modality-Specific Safety Is Urgent

Visual attacks on multimodal models represent a growing threat category: adversarial images, text embedded in photographs, video sequences designed to confuse safety filters. As multimodal models are deployed in production systems — from chatbots with image upload capability to automated visual content review systems — vulnerabilities specific to visual modalities are becoming increasingly relevant.

The training-free MARS approach is particularly valuable in scenarios where an organization lacks fine-tuning resources or cannot retrain the model it serves. Because MARS intervenes on activations at inference time, it does require access to the model's internal states, but it needs no gradient updates and works on a finished model. This lightness distinguishes it from most previous approaches, which assume a full fine-tuning pipeline.

The work also opens a broader research question: how modular is safety knowledge within an LLM? If refusal directions can be successfully transferred across modalities, it is possible that the same principle holds across tasks, domains, or related model architectures.

Frequently Asked Questions

What are refusal directions and why do they matter for multimodal safety?
Refusal directions are vectors in the LLM's activation space that capture the mechanism by which the model refuses harmful requests. MARS extracts them from the textual part of the model and applies them to visual modalities without separate safety data.

Why does it matter that MARS requires no additional training?
A training-free approach can be applied to an already-deployed model immediately, without expensive datasets or fine-tuning compute, which makes it practical for production use wherever the model's internal activations can be accessed.

On how many models was MARS tested?
MARS was tested on five current multimodal models, with consistent safety improvements and no significant drop in utility on benign tasks.