arXiv LegalHalluLens: Legal AI Hallucinations

The LegalHalluLens paper introduces typed hallucination profiling for legal AI across four types (numerical, temporal, obligation-related, and factual) on CUAD contract data. The key finding: an aggregate hallucination rate of 52% conceals a 38–40 percentage-point gap between the best and worst category within the same model. A calibrated debate pipeline with agentic sceptics reduces false detections by 45% and reaches commercial-grade performance with significantly smaller models.

The new preprint LegalHalluLens shows that average hallucination rates in legal AI are misleading because they conceal large differences across error types.

Hallucination profiling by type

LegalHalluLens introduces “typed hallucination profiling”: classifying hallucinations into four categories, namely numerical, temporal, obligation/rights-related, and factual. The analysis was conducted on CUAD, a standard dataset of legal contracts. A hallucination is defined here as a claim that the model presents as fact but that no source supports.

The average that hides the differences

The key finding is that an aggregate hallucination rate of 52% conceals a 38–40 percentage-point gap between the best and worst category within the same model. In other words, a model can be reliable for one type of claim and highly unreliable for another, and summary metrics do not show this. The paper also introduces the Risk Direction Index (RDI), a scalar that distinguishes omission bias from fabrication bias, enabling “direction-aware” procurement: choosing a model according to which kind of error poses the greater risk.
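The way an aggregate rate can mask a per-type gap can be sketched in a few lines of Python. This is only an illustration, not the paper's evaluation code: the `claims` data and the `typed_profile` helper are hypothetical, and the counts are invented so that the arithmetic roughly resembles the shape of the reported finding. They are not results from LegalHalluLens.

```python
from collections import Counter

# Hypothetical labelled claims: (hallucination_type, is_hallucinated).
# The counts are invented for illustration and are NOT taken from the paper.
claims = (
    [("numerical", True)] * 7 + [("numerical", False)] * 3
    + [("temporal", True)] * 6 + [("temporal", False)] * 4
    + [("obligation", True)] * 5 + [("obligation", False)] * 5
    + [("factual", True)] * 3 + [("factual", False)] * 7
)


def typed_profile(claims):
    """Return (aggregate_rate, per_type_rates, best_to_worst_gap)."""
    totals, hallucinated = Counter(), Counter()
    for kind, is_hallucinated in claims:
        totals[kind] += 1
        hallucinated[kind] += int(is_hallucinated)
    # Hallucination rate within each type.
    per_type = {kind: hallucinated[kind] / totals[kind] for kind in totals}
    # Single pooled rate over all claims, as an aggregate metric would report.
    aggregate = sum(hallucinated.values()) / sum(totals.values())
    # Spread between the worst and best type, as a fraction (0.4 = 40 points).
    gap = max(per_type.values()) - min(per_type.values())
    return aggregate, per_type, gap


aggregate, per_type, gap = typed_profile(claims)
print(f"aggregate rate: {aggregate:.1%}")
for kind, rate in per_type.items():
    print(f"  {kind}: {rate:.1%}")
print(f"best-to-worst gap: {gap * 100:.0f} percentage points")
```

With these made-up counts the pooled rate sits near the middle of the range, while the individual types differ widely. That spread is the information a single summary figure discards.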

How to reduce false detections?

The proposed calibrated debate pipeline uses agentic sceptics that challenge claims. This reduces false detections by 45% and achieves commercial-grade performance with significantly smaller models. The paper, presented at the AIWILD workshop at ICML 2026, is of practical use to legal teams because it demonstrates that averaged reliability metrics are insufficient for risk assessment.

Frequently Asked Questions

What does LegalHalluLens measure?

LegalHalluLens measures hallucinations in legal AI across four types (numerical, temporal, obligation-related, and factual) on CUAD contract data.

Why is the aggregate metric misleading?

The 52% average conceals a 38–40 percentage-point gap between the best and worst category within the same model.

arXiv:2606.18021: LegalHalluLens reveals that a 52% hallucination average in legal AI hides a 38-point gap
