TxBench-PP: AI Agents and Drug Discovery

TxBench-PP is a benchmark that tests AI agents on preclinical small-molecule pharmacology across 4,800 trajectories and 11 models. Claude Opus 4.8 leads with a 59.3% success rate, ahead of GPT-5.5 at 55.3%, but no model reaches the reliability needed for medical application.

New Standard for Testing AI in Drug Development

Preclinical pharmacology is the phase of drug research that precedes human trials: it investigates the safety, toxicity, and mechanism of action of potential therapeutic molecules. This is precisely the phase targeted by TxBench-PP, a benchmark introduced in arXiv:2606.19245, which systematically measures how far AI agents can go in this demanding field. With 100 evaluation tasks and 4,800 trajectories (sequences of steps taken by an agent), it is one of the most comprehensive tests of its kind.

Claude Opus 4.8 Leads, but the Gap to Reliability Remains Wide

Results for the 11 tested models reveal a clear ranking but also a shared problem. Claude Opus 4.8 achieves 59.3% success (178 of 300 attempts; 95% CI 51.1–67.6%), making it the best-performing model. GPT-5.5 follows at 55.3%. A gap of four percentage points may look modest, but in pharmaceutical research it can mean fewer costly experimental misdirections. The researchers’ key conclusion is the same for both models: neither is reliable enough for independent use in research protocols.
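
The headline figure for Claude Opus 4.8 can be checked with a few lines of arithmetic. What follows is a minimal sketch, not the authors' method: it recomputes the success rate from the 178 of 300 attempts quoted above and, as an assumption, applies a Wilson score interval that treats every attempt as independent. The function names are illustrative only.

```python
import math


def success_rate(successes: int, attempts: int) -> float:
    """Fraction of attempts that succeeded."""
    return successes / attempts


def wilson_interval(successes: int, attempts: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: attempts are independent. The article does not say how
    the reported interval was computed, so this is only a comparison point.
    """
    p = successes / attempts
    denom = 1 + z**2 / attempts
    center = (p + z**2 / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts + z**2 / (4 * attempts**2)) / denom
    return center - half, center + half


# Figures quoted in the article for Claude Opus 4.8.
rate = success_rate(178, 300)
low, high = wilson_interval(178, 300)

# Reported 95% CI for Claude Opus 4.8 and reported score for GPT-5.5.
reported_low, reported_high = 0.511, 0.676
gpt_rate = 0.553
gpt_inside_reported_ci = reported_low <= gpt_rate <= reported_high

print(f"success rate: {rate:.1%}")
print(f"Wilson 95% interval (independence assumed): {low:.1%} to {high:.1%}")
print(f"GPT-5.5 score inside reported interval: {gpt_inside_reported_ci}")
```

Under the independence assumption the interval comes out narrower than the reported 51.1–67.6%. The article does not explain how the authors derived theirs, so the difference is left open here. The last line also shows that GPT-5.5's 55.3% lies inside the reported interval for Claude Opus 4.8, so the two leading scores are closer than the ranking alone suggests.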

Barely More Than Half Right Cannot Be the Standard

Why is 59% insufficient? In a laboratory setting where each wrong research direction can cost weeks of work and hundreds of thousands of euros, a model that fails roughly four of every ten tasks is not a substitute for an expert researcher; it is only an assistive tool whose output requires rigorous verification. The authors emphasize that TxBench-PP is not designed to make models look bad, but to identify concrete weaknesses: agents perform particularly poorly on tasks that require integrating pharmacokinetic data with toxicological profiles.

Benchmark as a Map for Future Improvement

TxBench-PP opens a path to structured improvement of AI tools for drug discovery. Pharmaceutical companies such as Exscientia, Recursion Pharmaceuticals, and Insilico Medicine already integrate AI into early research phases, but so far without a standardized measure. This benchmark can become a reference point for evaluating new models, and a motivation for specialized fine-tuning that could bridge the gap between the current 59% and the level necessary for safe clinical application.

Frequently Asked Questions

Why is no AI model reliable for preclinical pharmacology?

Even the leading Claude Opus 4.8 achieves only 59.3% success on TxBench-PP, meaning roughly four in ten attempts fail, and in drug development such an error rate is not acceptable for independent application.

What does TxBench-PP measure and how does it differ from previous medical AI tests?

TxBench-PP evaluates AI agents on 100 preclinical small-molecule pharmacology tasks across 4,800 trajectories, emphasizing multi-step reasoning specific to the drug research phase before human trials.
