ReClaim: Medical Foundation Model Achieves Mean AUC of 75.6% on 1,000+ Diagnostic Tasks

A new arXiv preprint presents ReClaim, a foundation model with 1.7 billion parameters trained on 43.8 billion medical events from 200 million patient records. Across more than 1,000 diagnostic tasks it achieves a mean AUC of 75.6%, ahead of LightGBM (66.3%) and the specialized Delphi model (69.4%). The work points toward a new class of foundation models trained on administrative health data.

The paper, posted to arXiv on May 5, 2026 (arXiv:2605.02740), describes ReClaim as trained exclusively on administrative medical claims data. The results suggest that the foundation model approach can yield generalizable medical AI systems without the imaging archives or clinical notes that other medical FMs depend on.

What is in the dataset and how was the model trained?

The training dataset covers 43.8 billion medical events from 200 million patient records. These are structured administrative data (ICD diagnoses, CPT procedures, medications identified by NDC codes, costs and dates) that health insurers and hospital systems routinely generate as part of the everyday billing workflow. Unlike imaging data (which requires curated radiology archives) or clinical notes (which are unstructured and privacy-sensitive), claims data exists in standardized formats in virtually every healthcare institution in the US.

The authors frame modeling as a sequence learning task: a patient is a sequence of timestamped medical events, and the model learns to predict the next event. A single model can then be evaluated across many downstream tasks, more than 1,000 in this paper, without task-specific fine-tuning.
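The sequence framing can be illustrated with a minimal sketch. The article does not describe the paper's actual tokenization, so everything here is an assumption for illustration: the `ClaimEvent` fields, the `SYSTEM:code` token format, and the helper names are hypothetical, and the NDC value is a placeholder rather than a real product code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ClaimEvent:
    """One administrative event; field names are illustrative, not from the paper."""
    when: date
    system: str  # coding system, e.g. "ICD10", "CPT", "NDC"
    code: str    # code within that system


def to_tokens(events):
    """Order events by date and render each one as a 'SYSTEM:code' token."""
    ordered = sorted(events, key=lambda e: e.when)  # stable: same-day order is kept
    return [f"{e.system}:{e.code}" for e in ordered]


def next_event_pairs(tokens):
    """Build (history, target) pairs: each prefix is paired with the event after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# A toy patient timeline, deliberately given out of order.
patient = [
    ClaimEvent(date(2024, 3, 2), "CPT", "99213"),
    ClaimEvent(date(2024, 1, 15), "ICD10", "E11.9"),
    ClaimEvent(date(2024, 3, 2), "NDC", "0000-0000-00"),  # placeholder code
]

tokens = to_tokens(patient)
pairs = next_event_pairs(tokens)
for history, target in pairs:
    print(history, "->", target)
```

Each pair is one training example for next-event prediction. A real pipeline would presumably also encode time gaps and costs, which this sketch leaves out.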

By how much does ReClaim outperform existing baselines?

Across 1,000+ diagnostic tasks, ReClaim achieves a mean AUC of 75.6%. Comparison baselines:

LightGBM (classical ML benchmark): 66.3%
Delphi (specialized medical model): 69.4%

The gap of 6.2 and 9.3 percentage points over Delphi and LightGBM, respectively, is notable because it is a mean over more than 1,000 diagnostic tasks rather than a result on a single benchmark. Classical single-task ML models such as LightGBM cannot share representations across diagnoses, while Delphi, though medically specialized, was not trained at a comparable data scale and does not follow the foundation model paradigm.
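To make the comparison concrete, here is a small sketch of how a mean AUC over many tasks can be computed, followed by the gaps implied by the reported figures. It assumes an unweighted average of per-task AUCs; the article does not say how the paper aggregates. The task names and scores are toy values, and only the three percentages in `reported` come from the article.

```python
def auc(labels, scores):
    """AUC as the chance a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(tasks):
    """Unweighted mean of per-task AUCs (an assumed aggregation scheme)."""
    values = [auc(labels, scores) for labels, scores in tasks.values()]
    return sum(values) / len(values)


# Toy tasks: (binary labels, model scores). Not data from the paper.
tasks = {
    "task_a": ([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.7]),
    "task_b": ([0, 0, 1, 1], [0.1, 0.4, 0.8, 0.9]),
}
toy_mean = mean_auc(tasks)

# Gaps implied by the reported mean AUCs, in percentage points.
reported = {"ReClaim": 75.6, "LightGBM": 66.3, "Delphi": 69.4}
gaps = {
    name: round(reported["ReClaim"] - value, 1)
    for name, value in reported.items()
    if name != "ReClaim"
}
print(toy_mean, gaps)
```

The pairwise definition used here is quadratic in the number of examples. It is fine for illustration, but a rank-based implementation would be the usual choice at scale.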

Why does this matter for healthcare AI practice?

If the results replicate in clinical deployments, models of the ReClaim class could invert the standard pattern of medical AI development: instead of every hospital or insurer training specialized single-disease models, they could start from a shared foundation model and add lightweight task-specific fine-tuning. The practical targets include clinical decision support, risk stratification, insurance fraud detection, and utilization management, all domains where claims data is plentiful but building a model per task would be prohibitively expensive.

Open questions for follow-up work include privacy (HIPAA compliance scenarios), cross-institutional generalization (whether a model trained on US Medicare claims data works on EU systems that use SNOMED-CT/ICD-10 data), and calibration of predictions across race and ethnicity strata, a critical question given the biases in administrative data that the literature has long documented.
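The calibration question can be sketched as a simple per-group check. This is a coarse, assumed diagnostic (mean predicted risk minus observed event rate within each stratum), not a method taken from the paper, and the group labels and records are synthetic.

```python
from collections import defaultdict


def calibration_gap_by_group(records):
    """Per group, return mean predicted risk minus observed event rate.

    records: iterable of (group, predicted_probability, outcome), outcome in {0, 1}.
    A positive gap means risk is overestimated for that group on average.
    """
    sums = defaultdict(lambda: [0.0, 0, 0])  # [sum of predictions, events, count]
    for group, prob, outcome in records:
        acc = sums[group]
        acc[0] += prob
        acc[1] += outcome
        acc[2] += 1
    return {g: (p / n) - (k / n) for g, (p, k, n) in sums.items()}


# Synthetic records for illustration only.
records = [
    ("group_1", 0.25, 0), ("group_1", 0.75, 1),
    ("group_2", 0.5, 0), ("group_2", 0.5, 0),
]
calib = calibration_gap_by_group(records)
print(calib)
```

A gap near zero in one stratum and a large gap in another would flag uneven calibration. A fuller analysis would bin predictions within each group rather than compare only the means.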

Frequently Asked Questions

How much data does ReClaim use for training?

ReClaim was trained on 43.8 billion medical events from 200 million patient records. These are structured administrative data — diagnoses, procedures, medications, costs — that insurance companies and hospitals routinely generate as part of their daily workflows.

Why is an AUC of 75.6% a significant result?

ReClaim achieves a mean AUC of 75.6% across 1,000+ diagnostic tasks, while LightGBM, the classical ML baseline, reaches 66.3% and the specialized Delphi medical model reaches 69.4%. A gap of roughly 6 to 9 percentage points, averaged over more than 1,000 diagnostic tasks, is consistent with the robust generalization expected of the foundation model approach.

arXiv:2605.02740: ReClaim — Foundation Model Trained on 200 Million Patient Records Achieves Mean AUC 75.6% on 1,000+ Medical Tasks
