Activation Directions: 99.6% Detection of LLM Misalignment

Abdul Rafay Syed identified a common direction in the activation space of four LLM families (Qwen2.5, Gemma-2, Llama-3.2, and Ministral-3) that separates aligned from misaligned models with 99.6% accuracy. Steering along that direction also reduces unsafe code leakage by 21 to 51 percentage points.

A Common Misalignment Signature Across Four Model Families

Researcher Abdul Rafay Syed published a paper on June 19, 2026, describing a common geometric direction in activation space that distinguishes aligned from misaligned large language models. The analysis covers four distinct families: Qwen2.5, Gemma-2, Llama-3.2, and Ministral-3. All of the models were fine-tuned on unsafe code to induce misalignment.

The key result: the method separates the activations of aligned and misaligned models with 99.6% accuracy. This is a very high figure, and it is obtained from the model's internal geometry, whereas earlier approaches relied on black-box behavioral evaluations (benchmark testing).
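To make the idea of a separating direction concrete, here is a minimal sketch on synthetic data. It assumes a difference-of-means direction and a midpoint threshold, which is one common way to build a linear probe; the article does not say how the paper computes its direction, and the function names, dimensions, and random "activations" below are all hypothetical.

```python
import numpy as np

def mean_difference_direction(aligned, misaligned):
    """Unit vector pointing from the aligned mean to the misaligned mean."""
    diff = misaligned.mean(axis=0) - aligned.mean(axis=0)
    return diff / np.linalg.norm(diff)

def separation_accuracy(aligned, misaligned, direction):
    """Fraction of samples on the correct side of a midpoint threshold."""
    proj_aligned = aligned @ direction
    proj_misaligned = misaligned @ direction
    threshold = 0.5 * (proj_aligned.mean() + proj_misaligned.mean())
    correct = (proj_aligned < threshold).sum() + (proj_misaligned >= threshold).sum()
    return correct / (len(proj_aligned) + len(proj_misaligned))

# Synthetic stand-ins for activations (assumption: not real model data).
rng = np.random.default_rng(0)
dim = 64
offset = np.zeros(dim)
offset[0] = 4.0  # misaligned samples are shifted along one axis
aligned = rng.normal(size=(500, dim))
misaligned = rng.normal(size=(500, dim)) + offset

direction = mean_difference_direction(aligned, misaligned)
accuracy = separation_accuracy(aligned, misaligned, direction)
print(f"separation accuracy on synthetic data: {accuracy:.3f}")
```

The printed accuracy describes only this toy data and says nothing about the paper's 99.6% result. With a real model, the two arrays would hold hidden states collected from the aligned and misaligned checkpoints on the same prompts.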

Directional Steering Reduces Code Leakage by 21–51 Points

The identified directions are useful for more than detection: they can also be used to steer the model. Directional steering (directed activation control) reduces so-called code spillover, the leakage of unsafe code patterns, by 21 to 51 percentage points depending on the model and configuration.

For comparison: standard RLHF alignment methods require expensive retraining, while this approach intervenes directly in activation space without modifying model weights.
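The sketch below illustrates what intervening in activation space can look like, using two standard operations: shifting hidden states against a direction, and removing their component along it. Both are assumptions chosen for illustration, since the article does not specify which steering rule the paper uses, and the toy hidden states and the names `steer` and `ablate` are hypothetical.

```python
import numpy as np

def steer(hidden, direction, strength):
    """Shift every hidden state against the unit direction by a fixed amount."""
    unit = direction / np.linalg.norm(direction)
    return hidden - strength * unit

def ablate(hidden, direction):
    """Remove the component of every hidden state along the direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ unit, unit)

# Toy data (assumption): 8 token positions with 16-dimensional hidden states.
rng = np.random.default_rng(1)
hidden = rng.normal(size=(8, 16))
direction = rng.normal(size=16)  # stand-in for a learned misalignment direction
unit = direction / np.linalg.norm(direction)

steered = steer(hidden, direction, strength=2.0)
ablated = ablate(hidden, direction)

print("mean projection before:", float((hidden @ unit).mean()))
print("mean projection after steering:", float((steered @ unit).mean()))
print("largest projection after ablation:", float(np.abs(ablated @ unit).max()))
```

Neither function touches any weights: only the hidden states passing through a layer are changed, which is why this kind of intervention needs no retraining. In a real model the same arithmetic would typically run inside a forward hook on the chosen layer.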

Gemma and Qwen as Geometric Donors, Llama as Recipient

A particularly interesting finding is cross-model transfer: directions learned on Gemma-2 and Qwen2.5 can be transferred to Llama-3.2, where they suppress misalignment by up to 46 points. The author describes Gemma and Qwen as “geometric donors”, meaning models whose internal alignment geometry is robust enough to inform other architectures.

For auditing purposes, however, the author recommends within-model probing, that is, extracting the direction from the model under examination itself, because cross-model transfer introduces some uncertainty in interpretation.

Implications for LLM Security Auditing

The paper offers a practical tool for organizations that need to audit fine-tuned versions of models trained on potentially unsafe data. Instead of running exhaustive behavioral tests, an auditor measures the activation direction and compares it against a reference aligned model of the same family. The method is described as fast, interpretable, and, crucially, consistent across multiple architectures without architecture-specific tuning.
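One way to picture such a comparison is sketched below: activations from a reference model and from the audited model are projected onto a known direction, and the shift of the audited mean is reported in reference standard deviations. This is an illustrative assumption, not the paper's procedure; the function `audit_shift`, the `FLAG_THRESHOLD` cut-off, and the synthetic data are hypothetical.

```python
import numpy as np

def audit_shift(reference_acts, audited_acts, direction):
    """Shift of the audited mean projection, in reference standard deviations."""
    unit = direction / np.linalg.norm(direction)
    ref_proj = reference_acts @ unit
    aud_proj = audited_acts @ unit
    return float((aud_proj.mean() - ref_proj.mean()) / ref_proj.std())

# Synthetic stand-ins for activations from models of one family (assumption).
rng = np.random.default_rng(2)
dim = 32
direction = rng.normal(size=dim)
unit = direction / np.linalg.norm(direction)

reference = rng.normal(size=(400, dim))                      # reference aligned model
clean_finetune = rng.normal(size=(400, dim))                 # benign fine-tune
suspect_finetune = rng.normal(size=(400, dim)) + 3.0 * unit  # shifted along the direction

FLAG_THRESHOLD = 1.5  # hypothetical cut-off, would need calibration on real models

clean_score = audit_shift(reference, clean_finetune, direction)
suspect_score = audit_shift(reference, suspect_finetune, direction)

for name, score in [("clean", clean_score), ("suspect", suspect_score)]:
    verdict = "flag" if score > FLAG_THRESHOLD else "pass"
    print(f"{name}: shift = {score:.2f} reference std -> {verdict}")
```

Expressing the shift in reference standard deviations keeps the score comparable across layers and models with different activation scales, but any real threshold would have to be calibrated on models whose alignment status is already known.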

Frequently Asked Questions

What are activation directions and why are they useful for LLM security?

Activation directions are vectors in the internal representation space of a neural network that separate different model behaviors. Once identified, they make it possible to measure and control the degree of misalignment mathematically, without costly retraining.

Can findings from one model be applied to another?

Yes, cross-model transfer works: directions extracted from Gemma-2 and Qwen2.5 (the so-called geometric donors) suppress misalignment in Llama-3.2, the recipient model, by up to 46 points.

How is this method used in practice for model auditing?

The author recommends within-model probing, that is, extracting the direction from the audited model itself, because in an audit scenario it provides more reliable detection than the cross-model approach.

arXiv:2606.20225: Activation Directions Detect LLM Misalignment with 99.6% Accuracy
