arXiv:2606.07929: Stress test of medical LLMs reveals a hidden safety pathology
A new paper introduces AI-MASLD, a stress-audit framework borrowed from hepatology for evaluating clinical LLMs. Testing 7 models on 240 cases shows that under narrative stress the models diverge sharply, while medical fine-tuning systematically degrades stability and fairness.
This article was generated using artificial intelligence from primary sources.
A paper that appeared on arXiv on 6 June 2026 (arXiv:2606.07929, version v1) introduces AI-MASLD, a stress-audit framework for evaluating clinical (medical) large language models. The framework is borrowed from metabolic stress-testing in hepatology (the branch of medicine dealing with the liver) and reveals safety weaknesses that standard tests miss.
What is AI-MASLD and where does it come from?
AI-MASLD is a methodological framework that does not only measure medical models under ideal conditions but also exposes them to controlled “stress”. The idea is taken from hepatology, where metabolic stress-testing is used to uncover hidden disorders that are not visible at rest.
Transferred to large language models (LLMs), the logic is the same: a model that looks reliable on a standard test can collapse when the conditions change. The framework is therefore conceived as an audit, not as an ordinary measure of accuracy.
How was the experiment conducted?
The authors tested 7 models on 240 clinical cases. Each case was run through 6 narrative perturbation probes: variations in how the clinical story is told that leave the underlying medical facts unchanged.
Results were measured through three metrics: metabolic index, perturbation flip rate, and counterfactual fairness index. Together these metrics describe how stable and fair a model is when the input is subtly reshaped.
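To make the idea of a perturbation flip rate concrete, here is a minimal sketch. The summary does not give the paper's exact formula, so the definition used here (the share of perturbed answers that differ from the baseline answer for the same case) is an assumption, as are the function name, the toy answers, and the premise that every model sees every case under every probe.

```python
from typing import Dict, List

# Study dimensions reported in the article.
N_MODELS = 7
N_CASES = 240
N_PROBES = 6

# Assumption: each model answers every case under every probe.
perturbed_runs = N_MODELS * N_CASES * N_PROBES


def perturbation_flip_rate(
    baseline: Dict[str, str], perturbed: Dict[str, List[str]]
) -> float:
    """Share of perturbed answers that differ from the baseline answer.

    Hypothetical definition for illustration; the paper may define
    the metric differently.
    """
    flips = 0
    total = 0
    for case_id, base_answer in baseline.items():
        for answer in perturbed.get(case_id, []):
            total += 1
            if answer != base_answer:
                flips += 1
    return flips / total if total else 0.0


# Toy data (invented): two cases, six perturbed retellings each.
baseline = {"case_1": "admit", "case_2": "discharge"}
perturbed = {
    "case_1": ["admit", "admit", "discharge", "admit", "admit", "admit"],
    "case_2": ["discharge"] * 6,
}

rate = perturbation_flip_rate(baseline, perturbed)
print(f"perturbed runs in the study design: {perturbed_runs}")
print(f"toy flip rate: {rate:.3f}")
```

A rate of 0 in this sketch would mean the model's answer never changes when the story is retold; higher values mean the answer depends on the phrasing rather than on the medical facts.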
What happens under narrative stress?
The key pattern is clear: all models perform equally well under baseline conditions, but they diverge sharply under narrative stress. In other words, the differences between models are not visible until they are placed under load.
A particularly concerning finding is that quantized models (those with reduced precision to save resources) show a hidden functional drop. This drop is not revealed by standard measurement, so it can go unnoticed until the model encounters a non-standard formulation of a case.
Does medical fine-tuning hurt?
One of the most important conclusions of the paper is that medical fine-tuning systematically degrades stability and fairness. Adapting a model to the medical domain is expected to increase reliability, but according to these measurements it actually reduces resistance to narrative stress.
This is a counterintuitive but safety-significant finding. It warns that specializing a model for clinical use is not in itself a guarantee of safety and that an additional audit such as AI-MASLD is needed.
Which model performed best?
The paper also highlights an encouraging result: one open-weight model (a model whose weights are publicly available) matches or surpasses proprietary alternatives across all safety dimensions. This shows that open models can be competitive even in demanding, safety-sensitive clinical scenarios.
Specific per-model figures are in the full 34-page paper, not in the summary. The authors stress that the framework’s goal is to uncover weaknesses before models reach actual clinical use.
Frequently Asked Questions
- What is the AI-MASLD framework?
- AI-MASLD is a stress-audit framework for evaluating clinical (medical) large language models. It is borrowed from metabolic stress-testing in hepatology (liver medicine) and tests models under narrative perturbations rather than only under standard conditions.
- What is the key finding of the paper?
- All tested models perform equally well under baseline conditions, but they diverge sharply under narrative stress. Quantized models show a hidden functional drop, and medical fine-tuning systematically degrades stability and fairness.
- How were the models measured?
- Seven models were tested on 240 clinical cases with 6 narrative perturbation probes. Measurement was carried out through three metrics: metabolic index, perturbation flip rate, and counterfactual fairness index. Specific per-model figures are in the full 34-page paper.