Clinical LLMs: safety does not scale with model size

A new paper shows that safety in clinical LLMs does not follow the same scaling laws as accuracy — cleaner evidence in RAG raises accuracy from 73.5% to 94.1% and reduces high-risk errors from 12% to 2.6%, more than any model scaling effect.

A new paper delivers an uncomfortable message to the medical AI industry: the safety of clinical language models does not follow the same scaling laws as their overall accuracy. The authors use their own RadSaFE-200 benchmark — composed of 200 clinically risky radiology questions validated by radiologists — to examine how models behave in edge cases.

What do “different scaling laws” mean?

A scaling law is an empirical regularity describing how model performance changes with size — whether through parameters, data, or compute. A high-risk error in the clinical context means an error that can directly harm a patient, such as a missed tumor finding or a misinterpreted radiological image.

The paper shows that simply increasing model size or context window does not reduce high-risk errors proportionally to the reduction in overall error. In other words, a larger model is not automatically a safer model.
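One way to make "different scaling laws" concrete is to fit a separate log-log slope for each metric and compare them. The sketch below assumes a simple power-law form (error proportional to size raised to an exponent), which is a common convention and not necessarily the form the paper uses. The function name `scaling_exponent` and all numbers are illustrative placeholders, not results from the paper.

```python
import math


def scaling_exponent(sizes, error_rates):
    """Least-squares slope of log(error) against log(size).

    A slope of -0.2 means error shrinks roughly as size ** -0.2.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in error_rates]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic placeholder data, NOT measurements from the paper:
# overall error is constructed to fall faster with size than high-risk error.
sizes = [1e9, 1e10, 1e11]
overall_error = [0.30 * (s / 1e9) ** -0.20 for s in sizes]
high_risk_error = [0.12 * (s / 1e9) ** -0.05 for s in sizes]

print("overall error exponent:  ", round(scaling_exponent(sizes, overall_error), 3))
print("high-risk error exponent:", round(scaling_exponent(sizes, high_risk_error), 3))
```

If the two fitted exponents differ, as they do in this constructed example, improvements in overall error say little about how fast the high-risk errors shrink.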

What actually reduces risk?

Cleaner evidence in RAG dramatically improves both metrics simultaneously: accuracy rises from 73.5% to 94.1%, and the high-risk error rate drops from 12% to 2.6%. This difference is larger than any model scaling effect the authors measure.
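The size of this effect is easier to judge in relative terms. The short calculation below uses only the four figures quoted above and assumes that overall error is simply 100% minus accuracy and that both rates are measured over the same set of questions. The helper name `relative_reduction` is ours, not the paper's.

```python
def relative_reduction(before, after):
    """Fraction of the original error rate that was eliminated."""
    return (before - after) / before


# Figures quoted in the article, in percent.
accuracy_before, accuracy_after = 73.5, 94.1
high_risk_before, high_risk_after = 12.0, 2.6

# Assumption: overall error = 100 - accuracy.
overall_error_before = 100.0 - accuracy_before  # about 26.5
overall_error_after = 100.0 - accuracy_after    # about 5.9

overall_drop = relative_reduction(overall_error_before, overall_error_after)
high_risk_drop = relative_reduction(high_risk_before, high_risk_after)

print(f"overall error:   {overall_drop:.1%} relative reduction")    # 77.7%
print(f"high-risk error: {high_risk_drop:.1%} relative reduction")  # 78.3%
```

Under these assumptions, cleaner evidence removes roughly three quarters of both error types, which is what "improves both metrics simultaneously" amounts to in practice.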

The conclusion is directly relevant to everyone developing medical AI assistants: deployment decisions (knowledge base quality, retrieval design, context construction) are the primary determinants of safety, more so than model size.

Implications for regulators and development teams

The paper introduces the SaFE-Scale framework as a formal approach to separating the scaling laws of safety and accuracy. This separation has concrete consequences for regulatory bodies considering certification of clinical AI systems: measuring only overall accuracy can miss safety gaps.

For European development teams working under the EU AI Act and preparing the classification of high-risk medical systems, the results suggest that audits must explicitly separate safety metrics from accuracy metrics. Validation protocols that rely on aggregate benchmark numbers risk missing precisely those errors that can harm a patient.
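A minimal sketch of what reporting the two metrics separately could look like in an evaluation script. The `GradedAnswer` schema, the `audit_metrics` function, and the two example systems are hypothetical. They are not taken from the paper or from any regulatory template, and the toy data exists only to show that identical accuracy can hide different safety profiles.

```python
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    correct: bool          # answer matched the reference
    high_risk_error: bool  # wrong AND judged capable of harming a patient


def audit_metrics(answers):
    """Report accuracy and high-risk error rate as separate numbers."""
    if not answers:
        raise ValueError("no graded answers supplied")
    n = len(answers)
    return {
        "accuracy": sum(a.correct for a in answers) / n,
        "high_risk_error_rate": sum(a.high_risk_error for a in answers) / n,
    }


# Hypothetical toy data: both systems answer 8 of 10 questions correctly,
# but only system A's mistakes are graded as high-risk.
system_a = [GradedAnswer(True, False)] * 8 + [GradedAnswer(False, True)] * 2
system_b = [GradedAnswer(True, False)] * 8 + [GradedAnswer(False, False)] * 2

print("A:", audit_metrics(system_a))
print("B:", audit_metrics(system_b))
```

An aggregate benchmark score would rank these two systems as equal. Only the second number distinguishes them.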

Frequently Asked Questions

Why does safety not scale with model size the way accuracy does?

Results on the RadSaFE-200 benchmark show that increasing parameters or context window does not reduce high-risk errors at the same rate as it reduces overall error; the quality of the evidence retrieved in RAG matters more than model size.

What is RadSaFE-200?

A benchmark of 200 clinically risky radiology questions validated by radiologists, focused on errors that can directly harm a patient.

What is the SaFE-Scale framework?

A formal approach to separating the scaling laws of safety and accuracy, with direct consequences for regulators assessing clinical AI systems.

arXiv:2605.04039: Safety and accuracy in clinical LLMs follow different scaling laws
