SIREN: A New Approach to LLM Safety That Reads Internal Model States Instead of Filtering Outputs
Why it matters
SIREN is a new safety mechanism for large language models that detects harmful content using the model's internal neural states rather than output filtering, with 250 times fewer parameters than existing guard models.
A new research paper titled “LLM Safety From Within: Detecting Harmful Content with Internal Representations”, presenting the system SIREN, was published on arXiv on April 20, 2026 (ID 2604.18519). The authors, Difan Jiao, Yilun Liu, Ye Yuan, Zhenwei Tang, Linfeng Du, Haolun Wu, and Ashton Anderson, propose shifting from classical output filtering to detection based on the model’s internal states, a change that could significantly alter how open-source LLMs implement safety.
What Is Output Filtering and Why Does SIREN Abandon It?
Most current safety mechanisms, including those deployed around Claude, GPT, and Llama, rely on output filtering. The model produces text, and a separate “guard model” reviews that text and decides whether to pass it, censor it, or return a refusal message. Such guard models are typically large and computationally expensive, and they act only after the model has already expended effort generating the output.
SIREN changes the perspective. Instead of analyzing the final output, it applies linear probing to the model’s hidden states and activations, the internal numerical vectors the model produces while processing the input, in order to locate “safety neurons” distributed across multiple layers. This information exists before a single token of output has been generated, allowing SIREN to react earlier and more precisely.
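To make the idea concrete, here is a minimal sketch of a linear probe, the building block the paper describes: a logistic-regression classifier trained on one layer's hidden-state vectors. Everything here is illustrative (synthetic data stands in for real transformer activations, and the dimensions and training setup are assumptions, not details from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64   # hidden-state dimensionality (illustrative)
n = 400  # number of prompts

# Synthetic stand-in for one layer's hidden states: "harmful" and
# "benign" prompts form clusters shifted along one "safety direction".
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n)               # 1 = harmful, 0 = benign
X += np.outer(3.0 * (2 * y - 1), direction)  # shift classes apart

# Train the linear probe: logistic regression via gradient descent.
w, b = np.zeros(d), 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(harmful)
    w -= lr * (X.T @ (p - y)) / n
    b -= lr * np.mean(p - y)

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
accuracy = np.mean(preds == y)
print(f"probe training accuracy: {accuracy:.2f}")
```

In a real setting, `X` would come from reading a chosen layer's hidden states for a batch of labeled prompts; the probe itself stays this small, which is where the parameter savings over a generative guard model come from.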
How Exactly Does the Adaptive Layer-Weighted Strategy Work?
SIREN applies an adaptive layer-weighted strategy: rather than treating all layers equally, it learns which layers to weight more heavily in the final harmfulness decision. Linear probing is a technique where a small linear classifier is trained on each layer to assess whether the representation at that layer is “safe” or “dangerous.” The authors show that safety-relevant features are “distributed across internal layers” — meaning they do not appear only at the end but are spread throughout the entire processing path.
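The layer-weighting idea can be sketched as learning a softmax-normalized weight per layer over the per-layer probe scores. This is a toy reconstruction under stated assumptions, not the paper's implementation: the number of layers, the Gaussian "layer quality" profile, and the optimization details are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, n = 12, 300

# Synthetic per-layer probe logits: middle layers carry signal,
# early/late layers are mostly noise (an illustrative assumption).
y = rng.integers(0, 2, size=n)                                   # 1 = harmful
quality = np.exp(-0.5 * ((np.arange(n_layers) - 6) / 2.0) ** 2)  # per-layer signal strength
logits = quality[:, None] * (2 * y - 1) * 3.0 + rng.normal(size=(n_layers, n))

# Learn layer weights alpha = softmax(a) so the weighted combination
# of per-layer scores predicts harmfulness.
a = np.zeros(n_layers)
lr = 0.5
for _ in range(800):
    alpha = np.exp(a) / np.exp(a).sum()  # current layer weights
    s = alpha @ logits                   # combined score per prompt
    p = 1.0 / (1.0 + np.exp(-s))         # predicted P(harmful)
    g = logits @ (p - y) / n             # dLoss/d(alpha)
    # Chain rule through softmax: dLoss/da_k = alpha_k * (g_k - alpha . g)
    a -= lr * alpha * (g - alpha @ g)

alpha = np.exp(a) / np.exp(a).sum()
print(f"weight share on middle layers 4-8: {alpha[4:9].sum():.2f}")
```

After training, the weights concentrate on the informative middle layers, mirroring the paper's observation that safety-relevant features sit throughout the processing path rather than only at the output.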
The results are impressive. SIREN “significantly outperforms state-of-the-art open-source guard models” with 250 times fewer trained parameters. It also shows better generalization to benchmarks not seen during training, which has traditionally been a weak point for safety classifiers. Due to its small size, inference is much faster than with generative guard models that must “write out” an explanation of why something is harmful.
Who Is This Important For and What Are the Limitations?
The main user-facing advantage is fewer false refusals — when the model declines a legitimate request because the guard classifies it too conservatively. Because SIREN reads internal states, it better distinguishes genuine intent from surface triggers (e.g., the word “attack” in the context of cybersecurity education will not automatically trigger a block).
The main limitation is clear: SIREN requires white-box access to the model — that is, the ability to read internal activations. This excludes it from closed commercial APIs like OpenAI’s or Anthropic’s, where internal states are inaccessible. On the other hand, this makes SIREN exceptionally attractive for the open-source ecosystem (Llama, Qwen, Mistral, DeepSeek), where hidden states are fully accessible and where developers often need cheap, local safety infrastructure without sending content to external guard services.
This article was generated using artificial intelligence from primary sources.
Related news
OpenAI offers $25,000 for finding universal jailbreaks in GPT-5.5 biosecurity
GPT-5.5 System Card: OpenAI publishes safety evaluations and risk assessment for the new model
OpenAI releases Privacy Filter: open-weight model for detecting and redacting personal data