ArXiv: Neurons Responsible for Harmful Responses in Large Language Models Identified

Why do large language models sometimes generate harmful responses, despite extensive safety training? A new study uses causal mediation analysis to identify the exact mechanisms within models that are responsible.

Key discovery: late layers and MLP blocks

The researchers established that harmful content generation occurs in the later layers of the model, primarily through errors in MLP (multi-layer perceptron) blocks rather than in attention blocks. The early layers of the model understand the harmful context in the prompt and propagate those signals through MLPs toward the output.

Neurons as a control mechanism

A particularly interesting finding is that a small, sparse set of neurons in the final layer of the model acts as a kind of control mechanism — a “gate” that decides whether harmful content will be generated or blocked.

This means that the model’s safety behavior is not diffusely distributed throughout the entire network but is concentrated in specific, identifiable components.

What does this mean for AI safety?

This discovery opens the door to targeted safety interventions — instead of expensive RLHF training of the entire model, it may be possible to surgically modify only the critical neurons that control harmful outputs. This would be faster, cheaper, and more precise.

Current methods such as RLHF (Reinforcement Learning from Human Feedback) treat the model as a “black box” and attempt to change behavior from the outside. This work suggests that a more precise, mechanistic approach to safety is possible — akin to the difference between surgery and taking medication for symptoms.

ArXiv: Neurons Responsible for Harmful Responses in Large Language Models Identified

Key discovery: late layers and MLP blocks

Neurons as a control mechanism

What does this mean for AI safety?

Sources

Related news