Co-Failure Ceiling: when LLM ensembles do not help

A study of 67 frontier models from 21 providers introduces the co-failure ceiling: the upper bound on an LLM ensemble's accuracy, set by the rate at which all of its models fail on the same query. The results show that, without query-level routing, combining models rarely beats the single strongest model.

What is the co-failure ceiling?

An LLM ensemble, a system that combines multiple language models through voting, routing, or a Mixture-of-Agents architecture, has a mathematical accuracy ceiling. Researcher Josef Chen defines the co-failure ceiling beta as the fraction of queries on which all models in the group fail simultaneously. The combined system’s accuracy cannot exceed 1 − beta, regardless of how many models are added.
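The definition can be made concrete with a minimal Python sketch. The function names, the `results` table, and the three models behind it are hypothetical and invented for illustration; they are not taken from the paper.

```python
def co_failure_rate(correct):
    """Fraction of queries on which every model is wrong.

    `correct[q][m]` is True when model m answered query q correctly.
    """
    all_failed = sum(1 for row in correct if not any(row))
    return all_failed / len(correct)


def ensemble_ceiling(correct):
    """Upper bound 1 - beta on the accuracy of any rule that picks among the models' answers."""
    return 1.0 - co_failure_rate(correct)


# Made-up results for 3 hypothetical models on 5 queries.
results = [
    [True, True, False],
    [False, True, True],
    [False, False, False],  # every model fails: this query counts toward beta
    [True, False, False],
    [True, True, True],
]

beta = co_failure_rate(results)
print(f"beta = {beta:.2f}, ceiling = {ensemble_ceiling(results):.2f}")
```

In this toy table one query out of five defeats every model, so no voting or routing scheme that selects among the models' answers can score above 80% on it.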

How high is the ceiling in practice?

Analysis of 67 frontier models from 21 providers shows that beta is consistently higher than standard statistical models predict. On open math tasks the measured beta is 0.052, while the theoretical model predicts only 0.023, an underestimate by a factor of nearly 2.5 (90% confidence interval: 1.7x to 3.4x). On coding tasks beta rises to 0.079, and when GPQA-Diamond questions are reformatted from multiple choice to free-form answers it reaches 0.127.
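The article does not say which statistical model produces the predicted value. One common baseline, assumed here purely for illustration, treats model errors as independent, so the predicted co-failure rate is the product of the per-model error rates. The sketch below compares that baseline with the measured rate on made-up data; all names and numbers in it are hypothetical.

```python
from math import prod


def measured_beta(correct):
    """Observed fraction of queries on which all models fail."""
    return sum(1 for row in correct if not any(row)) / len(correct)


def independent_beta(correct):
    """Co-failure rate predicted if model errors were statistically independent.

    Assumption: this independence baseline is illustrative, not the paper's model.
    """
    n_queries = len(correct)
    n_models = len(correct[0])
    error_rates = [
        sum(1 for row in correct if not row[m]) / n_queries
        for m in range(n_models)
    ]
    return prod(error_rates)


# Made-up data: 10 queries, 3 hypothetical models, each wrong on 3 queries.
# Two "hard" queries defeat all three models at once.
results = [
    [False, False, False],
    [False, False, False],
    [False, True, True],
    [True, False, True],
    [True, True, False],
] + [[True, True, True] for _ in range(5)]

observed = measured_beta(results)
predicted = independent_beta(results)
print(f"measured = {observed:.3f}, independence baseline = {predicted:.3f}")
```

Because the failures cluster on the same hard queries, the measured rate in this toy example is several times the independence baseline, which is the direction of the gap the study reports.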

When does combining work, and when does it not?

Heterogeneous ensembles with low error correlation outperform homogeneous Self-MoA configurations at the same quality level. But without query-level routing, which directs each query to the model best suited for it, combined systems rarely beat simply picking the strongest single model. Pairwise error correlation, a common ensemble diagnostic, does not reveal the co-failure rate and therefore underestimates beta.
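Why a pairwise metric can miss joint failure is easiest to see on a small constructed case. The sketch below is an illustration with invented data, not the paper's analysis: it builds two error tables for three hypothetical models in which every pair of models is uncorrelated and each model errs on half the queries, yet the co-failure rates differ.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_error_correlations(errors):
    """Correlation of the error indicators for every pair of models."""
    columns = list(zip(*errors))  # one tuple of 0/1 errors per model
    return [pearson(a, b) for a, b in combinations(columns, 2)]


def co_failure_rate(errors):
    """Fraction of queries on which every model errs (all entries are 1)."""
    return sum(1 for row in errors if all(row)) / len(errors)


# errors[q][m] == 1 means hypothetical model m got query q wrong.
# Both tables have identical per-model error rates and pairwise correlations.
never_together = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
often_together = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]

for name, errors in [("never_together", never_together),
                     ("often_together", often_together)]:
    print(name, pairwise_error_correlations(errors), co_failure_rate(errors))
```

The two tables are indistinguishable to any pairwise diagnostic, but in the first the three models never fail on the same query and in the second they do so on a quarter of the queries. Co-failure depends on the joint behavior of all models, which pairwise statistics do not determine.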

The paper was submitted on June 25, 2026.

Frequently Asked Questions

What is the co-failure ceiling and why does it matter?

The co-failure ceiling (beta) is the fraction of queries on which all models in an ensemble fail. System accuracy cannot exceed 1 − beta, regardless of how many models are added.

When does combining LLMs actually provide benefit?

Benefit exists when models fail on different queries, not when they share common weaknesses. Query-level routing that identifies which model is best for which query remains the only reliable path to results better than the single strongest model.

arXiv:2606.27288: When combining LLMs really helps — co-failure ceiling across 67 frontier models
