Models · Thursday, April 30, 2026 · 2 min read

BioMysteryBench: Claude Mythos Preview Solves Bioinformatics Problems Even Experts Cannot, Opus 4.6 Achieves 77.4% on Human-Solvable Tasks

Editorial illustration: an AI agent analyzing RNA-seq data alongside scientific equipment

Anthropic released BioMysteryBench on April 29, 2026: an evaluation framework of 99 expert-level bioinformatics tasks with objective ground truth derived from experimental data. Claude Opus 4.6 achieves approximately 77.4% accuracy on the 76 human-solvable problems and 23.5% on the 23 superhuman tasks, while Mythos Preview solves some problems that a panel of human experts could not. Anthropic describes this as a watershed moment for AI in bioscience.

Anthropic released BioMysteryBench on April 29, 2026, a new evaluation framework for assessing the bioinformatics capabilities of AI models. The benchmark contains 99 tasks assembled by domain experts, uses messy data from real experiments, and scores answers against objective ground truth derived from experimental findings. The design targets three requirements that scientific evaluations rarely meet at once: allowing diverse methodological approaches, having an objective answer, and including problems that humans cannot yet solve on their own.

How Does BioMysteryBench Address the Problem of Subjectivity?

Most scientific benchmarks measure only whether a model agrees with existing scientific consensus, which leaves no room for genuine discovery. BioMysteryBench instead takes a method-agnostic approach: the model selects its own analytical tools, and the system checks only the final numerical or categorical answer against real experimental data. Example tasks include identifying a human organ from a single-cell RNA-seq dataset, detecting which genes were knocked out in experimental samples relative to controls, and determining family relationships from whole-genome sequencing data.
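To make the grading idea concrete, the sketch below shows how a method-agnostic checker might score a final answer. It is an illustration only: the class, the tolerance value, and the answer formats are assumptions, not Anthropic's actual harness.

```python
from dataclasses import dataclass

@dataclass
class GroundTruth:
    """Expected answer for one task, derived from the experiment's own records."""
    kind: str                    # "categorical" (e.g. an organ name) or "numerical"
    value: object                # e.g. "liver" or 0.42
    rel_tolerance: float = 0.05  # assumed tolerance for numerical answers

def grade_answer(model_answer: str, truth: GroundTruth) -> bool:
    """Score only the final answer; whichever analysis pipeline produced it is irrelevant."""
    if truth.kind == "categorical":
        # Case-insensitive exact match, e.g. "Liver " matches "liver"
        return model_answer.strip().lower() == str(truth.value).strip().lower()
    try:
        predicted = float(model_answer)
    except ValueError:
        return False
    # Numerical answers pass if they fall inside a relative tolerance band
    return abs(predicted - truth.value) <= truth.rel_tolerance * abs(truth.value)

# Hypothetical task: which organ does this single-cell RNA-seq dataset come from?
print(grade_answer("Liver", GroundTruth(kind="categorical", value="liver")))  # True
```

Because only the final answer is checked, the choice of tools and methods is left entirely to the model.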

What Scores Do Claude Models Achieve on the Benchmark?

On the 76 human-solvable tasks, Claude Opus 4.6 achieves approximately 77.4% accuracy and Sonnet 4.6 around 70%. On the 23 superhuman tasks, questions that even a panel of experienced bioinformaticians could not solve, Opus 4.6 reaches 23.5%, while Mythos Preview solves 30%. Mythos Preview also leads in the human-solvable category. Anthropic notes that reliability drops sharply on the harder problems: Opus 4.6 gives the correct answer in at least 4 of 5 attempts on 86% of human-solvable tasks, but only on 44% of superhuman tasks, suggesting that nearly half of the successes on the hardest problems occur sporadically rather than reproducibly.
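The reliability figures read like a strict "correct in at least 4 of 5 attempts" criterion applied per task. Below is a minimal sketch of that computation, assuming each task is attempted five times and each attempt is graded pass/fail; the numbers are made up for illustration.

```python
def reliability(attempts_per_task: list[list[bool]], min_passes: int = 4) -> float:
    """Fraction of tasks answered correctly in at least `min_passes` of the attempts."""
    reliable = sum(1 for attempts in attempts_per_task if sum(attempts) >= min_passes)
    return reliable / len(attempts_per_task)

# Hypothetical pass/fail results for three tasks, five attempts each
results = [
    [True, True, True, True, False],    # reliable (4/5 correct)
    [True, False, False, True, False],  # solved occasionally, not reliably
    [True, True, True, True, True],     # reliable (5/5 correct)
]
print(f"{reliability(results):.0%}")  # 67%
```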

Why Does Anthropic Call This a Watershed Moment?

Mythos Preview produces scientific conclusions that humans previously failed to extract from identical data. Although the benchmark is still in preview ("Anthropic/BioMysteryBench-preview" on Hugging Face), the result suggests that AI is no longer merely an assistant that speeds up existing workflows, but can autonomously solve research questions that a human team cannot. Anthropic invites researchers to use the model for their own analyses via the claude.com/lifesciences portal. The post is credited to Brianna of Anthropic's discovery team.
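For readers who want to look at the tasks themselves, the preview dataset should be loadable with the Hugging Face datasets library. The snippet below is a sketch: the repository name comes from the post, but the splits and fields are not documented there, so inspect the object before relying on any particular schema.

```python
from datasets import load_dataset

# Repository name taken from the post; split and column names are unknown, so inspect first.
bench = load_dataset("Anthropic/BioMysteryBench-preview")

print(bench)                              # lists the available splits and columns
first_split = next(iter(bench.values()))  # grab whichever split comes first
print(first_split[0])                     # examine one task record
```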

Frequently Asked Questions

What is BioMysteryBench?
An evaluation framework of 99 bioinformatics questions assembled by domain experts. It uses messy data from real experiments, and answers are scored against objective ground truth from validated metadata — not against subjective scientific conclusions.
How accurate is Claude Opus 4.6 on BioMysteryBench?
Approximately 77.4% on 76 human-solvable tasks and 23.5% on 23 superhuman tasks. Sonnet 4.6 achieves around 70% on the human-solvable category.
What makes Mythos Preview different?
Mythos Preview solves 30% of superhuman tasks and outperforms a panel of human experts on some problems. Anthropic describes this as a watershed moment because the model produces scientific conclusions that humans failed to extract from the same data.

This article was generated using artificial intelligence from primary sources.