
Interpretability

The research field that seeks to understand the internal mechanisms of AI models — features and circuits — to explain why a model produces a given output.

Interpretability is the research field that seeks to understand why an AI model produces a given output rather than treating it as a black box. Mechanistic interpretability goes further and tries to reverse-engineer the internal computations of a neural network, much as one would reverse-engineer a compiled program from its machine code.

Its core building blocks are features (directions in activation space that correspond to human-understandable concepts) and circuits (causal chains of features that carry out a computation). Individual neurons often encode several unrelated concepts at once (polysemanticity), so researchers use dictionary learning, typically with sparse autoencoders, to decompose activations into thousands or even millions of distinct, more interpretable features.
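The decomposition described above can be sketched as a sparse autoencoder forward pass. This is a minimal illustration under stated assumptions, not any lab's actual implementation: the function names (`sae_forward`, `sae_loss`), the dimensions, the L1 coefficient, and the random untrained weights are all hypothetical, and the "activations" are random vectors standing in for real model activations.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative feature activations, then reconstruct.

    Each row of W_dec is one feature's direction in activation space.
    """
    # ReLU keeps feature activations non-negative; many end up exactly zero.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # The reconstruction is a weighted sum of feature directions plus a bias.
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse features."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

# Hypothetical sizes: the dictionary has more features than activation dimensions.
rng = np.random.default_rng(0)
d_model, n_features, batch = 16, 64, 8

x = rng.normal(size=(batch, d_model))  # stand-in for model activations
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat)
```

In practice the weights would be trained by minimizing this loss over a large set of recorded activations. The overcomplete dictionary (`n_features` larger than `d_model`) combined with the sparsity penalty is what pushes each learned feature toward a single concept.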

The field remains in the spotlight in 2025–2026: Anthropic has extracted millions of features from Claude (first reported for Claude 3 Sonnet in 2024), including ones tied to deception and bias, opening a path to monitoring and steering model behavior. Interpretability is an increasingly important tool for AI safety and alignment, because without insight into the mechanisms it is hard to establish that a model will not fail in unexpected ways.
