HARC: A New Fine-Tuning Method That Prevents Jailbreaks by Coupling Harmfulness and Refusal
Researchers traced why jailbreaks succeed to the level of internal model representations and developed HARC, a fine-tuning method that explicitly couples the model's harmfulness and refusal directions. In their evaluation, HARC achieved a stronger robustness-capability-usability trade-off than the six baseline methods it was compared against.
This article was generated using artificial intelligence from primary sources.
Researchers Shei Pern Chua and Fangzhao Wu published a paper on July 1, 2026, that reveals the precise mechanism by which jailbreak attacks bypass the safety alignment of large language models — and proposes a concrete solution in the form of a new fine-tuning method called HARC.
Why do jailbreaks actually succeed?
Prior understanding was largely phenomenological: we knew that certain prompt formulations “trick” a model into generating harmful content, but the mechanism inside the network was not clear. The HARC research examines that mechanism with interpretability methods, analyzing the model's internal representations rather than only its outputs.
In aligned LLMs, there exist (at least) two separate “directions” in the internal representation space: the harmfulness direction (encoding how dangerous the content is) and the refusal direction (encoding whether the model will decline the request). The key finding: jailbreaks succeed by suppressing one or the other direction — not necessarily both simultaneously. An attack that suppresses only the refusal direction is sufficient for the model to generate harmful content even when the harmfulness direction remains active.
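To make the idea of a "direction" concrete, the sketch below uses difference-of-means, a common interpretability technique, on made-up three-dimensional activations. The source does not say how the paper extracts its directions, so the extraction method, the helper names (`direction`, `ablate`), and all numbers here are assumptions for illustration only.

```python
import math

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def direction(pos, neg):
    """Unit difference-of-means direction between two sets of activations."""
    return unit([p - n for p, n in zip(mean(pos), mean(neg))])

def ablate(h, d):
    """Remove the component of activation h along unit direction d."""
    c = dot(h, d)
    return [x - c * y for x, y in zip(h, d)]

# Toy activations (assumed): axis 0 tracks harmfulness, axis 1 tracks refusal.
harmful_acts = [[2.0, 1.8, 0.1], [2.2, 2.0, -0.1]]
harmless_acts = [[0.0, 1.8, 0.1], [0.2, 2.0, -0.1]]
refused_acts = [[1.0, 2.0, 0.0], [1.2, 2.2, 0.2]]
complied_acts = [[1.0, 0.0, 0.0], [1.2, 0.2, 0.2]]

harm_dir = direction(harmful_acts, harmless_acts)
refusal_dir = direction(refused_acts, complied_acts)

# Simulate an attack that suppresses only the refusal direction.
h = [2.1, 1.9, 0.0]
h_attacked = ablate(h, refusal_dir)

harm_before = dot(h, harm_dir)
harm_after = dot(h_attacked, harm_dir)
refusal_after = dot(h_attacked, refusal_dir)
print(harm_before, harm_after, refusal_after)
```

In this toy setup the two directions are orthogonal, so removing the refusal component leaves the harmfulness projection unchanged. That mirrors the finding described above: the harmfulness signal can stay active while the refusal signal is gone.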
The analysis was further extended to token positions in the response, not just in the prompt. The researchers found that a model can recognize the harmfulness of content as it generates it, even when its processing of the prompt did not register the problem. This suggests that safety mechanisms operating exclusively on the input prompt can miss harm that only becomes apparent during generation.
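One way to picture the response-position analysis is to project each token's activation onto a harmfulness direction and see where the signal first rises. The sketch below does this on invented data. The function `first_flagged_position`, the threshold, and the activations are hypothetical and are not taken from the paper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def first_flagged_position(token_acts, harm_dir, threshold):
    """Return the index of the first token whose harmfulness projection
    exceeds the threshold, or None if no token does."""
    for pos, h in enumerate(token_acts):
        if dot(h, harm_dir) > threshold:
            return pos
    return None

harm_dir = [1.0, 0.0]  # assumed unit harmfulness direction

# Toy per-token activations (assumed): the prompt looks benign,
# but the harmfulness signal rises while the response is generated.
prompt_acts = [[0.1, 0.0], [0.2, 0.1]]
response_acts = [[0.3, 0.0], [0.9, 0.2], [1.8, 0.1]]

prompt_flag = first_flagged_position(prompt_acts, harm_dir, threshold=1.0)
response_flag = first_flagged_position(response_acts, harm_dir, threshold=1.0)
print(prompt_flag, response_flag)
```

Here a check restricted to the prompt positions finds nothing, while the same check over the response positions fires at the third generated token.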
Different classes of jailbreak attacks occupy separable regions in the harmfulness-refusal plane — meaning there is a geometric structure to these attacks in the model’s internal space, rather than chaotic diversity.
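The harmfulness-refusal plane can be illustrated by projecting each activation onto the two directions and comparing where attack classes land. In the sketch below, the class labels ("refusal-suppressing", "harm-masking"), the directions, and the activations are all made up for illustration. The source does not name the attack classes or give coordinates.

```python
from collections import defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def to_plane(h, harm_dir, refusal_dir):
    """Map an activation to (harmfulness, refusal) coordinates."""
    return (dot(h, harm_dir), dot(h, refusal_dir))

def class_centroids(samples, harm_dir, refusal_dir):
    """Average plane coordinates for each attack-class label."""
    grouped = defaultdict(list)
    for label, h in samples:
        grouped[label].append(to_plane(h, harm_dir, refusal_dir))
    return {
        label: tuple(sum(coord) / len(points) for coord in zip(*points))
        for label, points in grouped.items()
    }

harm_dir = [1.0, 0.0, 0.0]     # assumed unit directions
refusal_dir = [0.0, 1.0, 0.0]

# Hypothetical attack classes with toy activations.
samples = [
    ("refusal-suppressing", [2.0, 0.1, 0.3]),
    ("refusal-suppressing", [2.2, -0.1, 0.1]),
    ("harm-masking", [0.2, 0.1, 0.5]),
    ("harm-masking", [0.0, 0.3, 0.2]),
]

centroids = class_centroids(samples, harm_dir, refusal_dir)
for label, (harm, refusal) in centroids.items():
    print(f"{label}: harmfulness={harm:.2f}, refusal={refusal:.2f}")
```

In this toy data the two classes separate along the harmfulness axis: one keeps the harmfulness signal high while refusal is near zero, and the other lowers the harmfulness signal itself.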
HARC: fine-tuning that couples both directions
The paper does not stop at analysis: HARC is a concrete fine-tuning recipe. The method explicitly couples harmfulness and refusal representations across both prompt and response positions, forcing the model to express “I detected a danger” and “I refuse to generate” as a joint signal rather than as independent dimensions that can be suppressed separately.
The result: the model becomes resistant to attacks that target only one of the two directions, because they are now tightly bound together in the representation space.
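The source does not give HARC's training objective, so the following is only an illustrative stand-in for what "coupling" could mean numerically: a penalty that grows when the harmfulness and refusal projections disagree at any position. The function `coupling_penalty` and the data are hypothetical and should not be read as the paper's loss.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def coupling_penalty(activations, harm_dir, refusal_dir):
    """Mean squared gap between harmfulness and refusal projections,
    averaged over token positions (prompt and response alike)."""
    gaps = [
        (dot(h, harm_dir) - dot(h, refusal_dir)) ** 2
        for h in activations
    ]
    return sum(gaps) / len(gaps)

harm_dir = [1.0, 0.0]     # assumed unit directions
refusal_dir = [0.0, 1.0]

# Coupled: refusal rises and falls together with harmfulness.
coupled = [[2.0, 2.0], [1.5, 1.5], [0.0, 0.0]]
# Decoupled: harmfulness is detected but refusal has been suppressed.
decoupled = [[2.0, 0.0], [1.5, 0.0], [0.0, 0.0]]

penalty_coupled = coupling_penalty(coupled, harm_dir, refusal_dir)
penalty_decoupled = coupling_penalty(decoupled, harm_dir, refusal_dir)
print(penalty_coupled, penalty_decoupled)
```

A term like this, added to an ordinary fine-tuning loss, would discourage representations in which one signal is present without the other. Whether HARC uses a penalty of this form is not stated in the source.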
Compared with six baseline methods, which cover the main training-time and inference-time approaches to safety, HARC achieves the strongest trade-off among robustness, capability, and usability.
Transferability and practical application
Notably, HARC requires no architecture-specific adaptation: the method was tested on five model families in two sizes and transferred without additional modifications. This makes HARC a practically applicable recipe for existing fine-tuning pipelines, not merely a laboratory finding.
The mechanistic angle of the research also offers broader value: it directly maps how safety-aware representations are organized in aligned LLMs, which is a valuable contribution to model interpretability independent of the security application.
The paper comes at a time when the industry is actively seeking methods that do not compromise model capability for the sake of safety. The HARC results suggest that both goals can be pursued together by targeting the right level of internal representation.
Frequently Asked Questions
- What is HARC and what does it do?
- HARC is a fine-tuning method that explicitly couples the internal representations of harmfulness and refusal in LLMs, making the model resistant to jailbreak attacks that attempt to suppress only one of those two directions in the network.
- How do jailbreaks bypass safety alignment?
- The research shows that jailbreaks work by suppressing either the 'refusal direction' or the 'harmfulness direction' in the model's residual stream — not necessarily both simultaneously — causing the model to produce harmful content.
- How many models was HARC tested on?
- HARC was evaluated on five different model families in two sizes. The method requires no architecture-specific adaptation and transfers across models.