arXiv:2605.04572: SQSD measures LLM safety degradation

A paper accepted at ICML 2026 introduces SQSD — a method for quantifying the contribution of individual samples to safety degradation during model fine-tuning. Researchers show that even seemingly benign fine-tuning samples cumulatively shift parameters toward 'danger-aligned' directions.

The paper was published on arXiv on May 6, 2026 as arXiv:2605.04572 by Xiao Wang, Yifei Zhang, YongKang Liu, Xiaocui Yang, Zihan Wang, Shi Feng, and Daling Wang, and was accepted at ICML 2026. SQSD stands for Sample-level Quantification of Safety Degradation: it quantifies how much each individual sample contributes to safety degradation during the fine-tuning of large language models.

Main finding: even benign samples degrade safety

According to the abstract, “benign fine-tuning causes cumulative parameter shifts toward ‘danger-aligned’ directions, which gradually undermines model safety.” In other words, even when a development team uses seemingly neutral data for fine-tuning, the result can be the erosion of safety behaviors the model acquired through preference training (RLHF, DPO, and similar methods).
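One way to read the word "cumulative" in that quote is sketched below. The notation is assumed for illustration and is not taken from the paper: \theta_0 denotes the aligned weights before fine-tuning, \Delta\theta_i the parameter update contributed by the i-th training step, and d a unit vector standing in for a "danger-aligned" direction. Because projection is linear, the total drift along d is the sum of the per-step projections:

```latex
\theta_T - \theta_0 = \sum_{i=1}^{T} \Delta\theta_i,
\qquad
\langle \theta_T - \theta_0,\; d \rangle = \sum_{i=1}^{T} \langle \Delta\theta_i,\; d \rangle
```

Under this reading, each individual term can be tiny, yet the sum grows steadily whenever most terms share the same sign, which is consistent with the gradual erosion the abstract describes.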

How does SQSD work?

SQSD computes a per-sample risk score by measuring how the parameter updates resulting from that sample project onto safe versus dangerous directions in parameter space. Samples whose updates pull parameters toward dangerous directions receive a high risk score, even if the textual content itself is benign. This identifies the samples that contribute most to safety erosion.
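The article does not give the scoring formula, so the following is only a minimal sketch of the idea described above, not the authors' implementation. It assumes that a per-sample update and two direction vectors are already available as flat arrays, and it uses a hypothetical score: the cosine similarity with the dangerous direction minus the cosine similarity with the safe direction. The function name `risk_score` and the toy vectors are invented for illustration.

```python
import numpy as np

def risk_score(update, danger_dir, safe_dir):
    """Hypothetical per-sample score (an assumption, not the paper's formula):
    cosine similarity with the danger direction minus cosine similarity
    with the safe direction. Positive means the update leans toward danger."""
    u = update / (np.linalg.norm(update) + 1e-12)  # guard against zero updates
    d = danger_dir / np.linalg.norm(danger_dir)
    s = safe_dir / np.linalg.norm(safe_dir)
    return float(u @ d - u @ s)

# Toy 3-dimensional "parameter space" with made-up directions.
danger = np.array([1.0, 0.0, 0.0])
safe = np.array([0.0, 1.0, 0.0])

# Made-up per-sample updates (for example, the negative gradient of one sample).
updates = {
    "sample_a": np.array([0.9, 0.1, 0.0]),  # mostly along the danger direction
    "sample_b": np.array([0.1, 0.9, 0.0]),  # mostly along the safe direction
    "sample_c": np.array([0.0, 0.0, 1.0]),  # orthogonal to both
}

scores = {name: risk_score(u, danger, safe) for name, u in updates.items()}
for name, score in scores.items():
    print(f"{name}: {score:+.3f}")
```

In this toy setup the score depends only on the direction of the update, not on the text of the sample, which mirrors the point that a sample with benign content can still receive a high risk score.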

Transferability across architectures

The experiments reportedly demonstrate “strong transferability across different model sizes, architectures, and parameter-efficient training approaches” (for example LoRA or prefix-tuning). This suggests that the method does not have to be recalibrated for every combination of model and training technique.

Why does this matter?

Existing fine-tuning practice relies on the assumption that benign datasets do not compromise safety. SQSD indicates that this assumption does not hold at the parameter level. It also opens the possibility of scoring risk before launching a fine-tuning job, so that the samples contributing most to drift toward dangerous directions can be discarded or reweighted. That makes it a potentially practical tool for organizations fine-tuning open models for internal use.
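The discard-or-reweight step could look like the sketch below. It assumes that per-sample risk scores have already been computed, with higher meaning riskier. The function `filter_and_reweight`, the `drop_fraction` and `temperature` parameters, and the exponential weighting are illustrative choices, not something the article attributes to the paper.

```python
import numpy as np

def filter_and_reweight(scores, drop_fraction=0.25, temperature=1.0):
    """Drop the riskiest fraction of samples and down-weight the rest.

    Returns (keep, weights): indices of the kept samples in original order,
    and one loss weight per kept sample, normalized to a mean of 1.
    """
    scores = np.asarray(scores, dtype=float)
    n_drop = int(len(scores) * drop_fraction)
    order = np.argsort(scores)                    # ascending: safest first
    keep = np.sort(order[: len(scores) - n_drop])  # discard the riskiest tail
    weights = np.exp(-scores[keep] / temperature)  # riskier -> smaller weight
    weights = weights / weights.sum() * len(keep)  # normalize to mean 1
    return keep, weights

# Made-up risk scores for four samples.
scores = [0.88, -0.88, 0.0, 0.35]
keep, weights = filter_and_reweight(scores, drop_fraction=0.25)
print("kept indices:", keep.tolist())
print("weights:", np.round(weights, 3).tolist())
```

The returned weights could then be used as per-sample loss multipliers in a training loop. Whether dropping or reweighting works better is an empirical question that this sketch does not address.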

Frequently Asked Questions

What is SQSD?

Sample-level Quantification of Safety Degradation — a method that computes a risk score for each individual fine-tuning sample based on how its parameter updates project onto safe versus dangerous directions in parameter space.

What is ICML?

International Conference on Machine Learning, one of the leading academic conferences in machine learning, typically named alongside NeurIPS and ICLR.

What is the paper's main finding?

Even benign fine-tuning samples cause cumulative parameter shifts toward 'danger-aligned' directions, gradually undermining the model's safety alignment.
