SIREN: A New Approach to LLM Safety That Reads Internal Model States Instead of Filtering Outputs
SIREN is a new safety mechanism for large language models that detects harmful content using the model's internal neural states rather than output filtering, with 250 times fewer parameters than existing guard models.
This article was generated using artificial intelligence from primary sources.
A new research paper titled “LLM Safety From Within: Detecting Harmful Content with Internal Representations”, which presents the SIREN system, was published on arXiv on April 20, 2026 (ID 2604.18519). Authors Difan Jiao, Yilun Liu, Ye Yuan, Zhenwei Tang, Linfeng Du, Haolun Wu and Ashton Anderson propose a shift from classical output filtering to detection based on the model’s internal states, which could significantly change how open-source LLMs implement safety.
What Is Output Filtering and Why Does SIREN Abandon It?
Most current safety mechanisms, including those used with models such as Claude, GPT and Llama, rely on output filtering. The model produces text, and a separate “guard model” reviews that text and decides whether to pass it, censor it, or return a refusal message. Such guard models are typically large and computationally expensive, and they react only after the model has already spent compute generating the output.
SIREN changes the perspective. Instead of analyzing only the final token or the finished output, it uses linear probing to find “safety neurons” distributed across multiple layers of the model. This means analyzing hidden states and activations, the internal numerical vectors the model produces while processing its input. This information exists before a single word of output has been generated, allowing SIREN to react earlier and more precisely.
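The difference between the two control flows can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the paper's code: every function name here (`output_filter_pipeline`, `internal_state_pipeline`, `toy_encode`, `toy_probe` and so on) is hypothetical, and the "hidden states" are stand-in numbers rather than real activations.

```python
from typing import Callable, List

REFUSAL = "I can't help with that."

def output_filter_pipeline(
    prompt: str,
    generate: Callable[[str], str],
    guard_flags_text: Callable[[str], bool],
) -> str:
    """Classical flow: generate first, then let a guard review the text."""
    text = generate(prompt)  # the generation cost is always paid
    return REFUSAL if guard_flags_text(text) else text

def internal_state_pipeline(
    prompt: str,
    encode: Callable[[str], List[List[float]]],
    probe_flags_states: Callable[[List[List[float]]], bool],
    generate: Callable[[str], str],
) -> str:
    """Internal-state flow: inspect per-layer hidden states before generating."""
    hidden_states = encode(prompt)  # one vector per layer
    if probe_flags_states(hidden_states):
        return REFUSAL  # nothing was generated
    return generate(prompt)

# --- Toy stand-ins (assumptions for illustration only) ---
calls = {"generate": 0}

def toy_generate(prompt: str) -> str:
    calls["generate"] += 1
    return f"response to: {prompt}"

def toy_guard(text: str) -> bool:
    return "harmful" in text

def toy_encode(prompt: str) -> List[List[float]]:
    # Pretend each of 3 layers yields a 1-dimensional hidden state.
    score = 1.0 if "harmful" in prompt else 0.0
    return [[score], [score], [score]]

def toy_probe(hidden_states: List[List[float]]) -> bool:
    mean = sum(h[0] for h in hidden_states) / len(hidden_states)
    return mean > 0.5
```

In this toy setup, a flagged prompt costs one call to `toy_generate` in the output-filter flow and none in the internal-state flow, which is the efficiency argument in miniature.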
How Exactly Does the Adaptive Layer-Weighted Strategy Work?
SIREN applies an adaptive layer-weighted strategy: rather than treating all layers equally, it learns which layers to weight more heavily in the final harmfulness decision. Linear probing is a technique where a small linear classifier is trained on each layer to assess whether the representation at that layer is “safe” or “dangerous.” The authors show that safety-relevant features are “distributed across internal layers” — meaning they do not appear only at the end but are spread throughout the entire processing path.
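To make the idea concrete, here is a minimal sketch of per-layer linear probes combined through learned layer weights. It rests on several assumptions that do not come from the paper: the hidden states are synthetic Gaussian vectors, each probe is a plain logistic regression, and the layer weights are a softmax over one trainable score per layer. All names (`train_probe`, `train_layer_weights`, `is_harmful`, `make_example`) are hypothetical, and the authors' actual formulation may differ.

```python
import math
import random

def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(xs, ys, epochs=200, lr=0.5):
    """Fit one logistic-regression probe (weights, bias) by batch gradient descent."""
    d, n = len(xs[0]), len(xs)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(d):
                gw[i] += err * x[i]
            gb += err
        w = [wi - lr * gi / n for wi, gi in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def probe_logit(probe, x):
    w, b = probe
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def softmax(a):
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def train_layer_weights(logits, ys, epochs=300, lr=0.5):
    """Learn softmax-normalised layer weights over the per-layer probe logits."""
    n_layers, n = len(logits[0]), len(logits)
    a = [0.0] * n_layers  # one trainable score per layer
    for _ in range(epochs):
        w = softmax(a)
        grad = [0.0] * n_layers
        for zs, y in zip(logits, ys):
            z = sum(wk * zk for wk, zk in zip(w, zs))
            err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            for k in range(n_layers):
                grad[k] += err * w[k] * (zs[k] - z)  # chain rule through softmax
        a = [ak - lr * gk / n for ak, gk in zip(a, grad)]
    return softmax(a)

def make_example(rng, n_layers=4, dim=3, informative=(1, 2)):
    """Synthetic stand-in for hidden states: only some layers carry the label."""
    y = rng.randint(0, 1)
    layers = []
    for layer in range(n_layers):
        shift = (2.0 if y else -2.0) if layer in informative else 0.0
        layers.append([rng.gauss(shift, 1.0) for _ in range(dim)])
    return layers, y

rng = random.Random(0)
data = [make_example(rng) for _ in range(200)]
ys = [y for _, y in data]
probes = [train_probe([ex[l] for ex, _ in data], ys) for l in range(4)]
logits = [[probe_logit(probes[l], ex[l]) for l in range(4)] for ex, _ in data]
layer_weights = train_layer_weights(logits, ys)

def is_harmful(layers) -> bool:
    """Weighted combination of per-layer probe logits, thresholded at 0.5."""
    z = sum(wl * probe_logit(p, h) for wl, p, h in zip(layer_weights, probes, layers))
    return sigmoid(z) > 0.5
```

In this toy data only layers 1 and 2 carry signal, so training pushes `layer_weights` toward those layers and away from the noise layers, which is the behavior the adaptive strategy is meant to capture. The whole classifier consists of one weight vector and bias per layer plus one score per layer, which gives a sense of why such a detector can be far smaller than a generative guard model.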
The authors report that SIREN “significantly outperforms state-of-the-art open-source guard models” while using 250 times fewer trained parameters. It also generalizes better to benchmarks not seen during training, which has traditionally been a weak point for safety classifiers. Because it is so small, inference is much faster than with generative guard models, which must “write out” an explanation of why something is harmful.
Who Is This Important For and What Are the Limitations?
The main user-facing advantage is fewer false refusals, that is, cases where the model declines a legitimate request because the guard classifies it too conservatively. Because SIREN reads internal states, it better distinguishes genuine intent from surface triggers (e.g., the word “attack” in the context of cybersecurity education will not automatically trigger a block).
The main limitation is clear: SIREN requires white-box access to the model — that is, the ability to read internal activations. This excludes it from closed commercial APIs like OpenAI’s or Anthropic’s, where internal states are inaccessible. On the other hand, this makes SIREN exceptionally attractive for the open-source ecosystem (Llama, Qwen, Mistral, DeepSeek), where hidden states are fully accessible and where developers often need cheap, local safety infrastructure without sending content to external guard services.
Frequently Asked Questions
- What is SIREN and how does it differ from classical guard models?
- SIREN is a lightweight guard model that detects harmful content by reading the LLM's internal activations across multiple layers instead of analyzing only the final output, which is the approach used by traditional filters.
- What are the main advantages and limitations?
- Advantages include 250× fewer parameters, better generalization to new benchmarks and greater inference efficiency. The main limitation is the need for white-box access to the model's internal states, which favors open-source models.