What is the LongCoT benchmark?

LongCoT is a benchmark of 2,500 expert-designed problems in chemistry, mathematics, computer science, chess and logic. It tests chain-of-thought reasoning that runs to tens or hundreds of thousands of tokens.

Why do frontier models perform so poorly on LongCoT?

Frontier models can solve the individual steps, but keeping their reasoning coherent across an entire sequence of thousands of steps remains a critical weakness. GPT 5.2 achieves only 9.8 percent.

ArXiv: LongCoT benchmark reveals GPT 5.2 achieves only 9.8% on long chain-of-thought reasoning

An international team of researchers from Oxford, Lawrence Livermore National Laboratory and the AI Safety Institute has published LongCoT, a new benchmark that tests how well AI models handle long chain-of-thought (CoT) reasoning. The results reveal a concerning weakness even in the most advanced models.

What does LongCoT measure?

The benchmark contains 2,500 expert-designed problems across five domains: chemistry, mathematics, computer science, chess and logic. The key difference from existing benchmarks is that the problems require a chain of thought extending across tens to hundreds of thousands of tokens, far beyond typical short reasoning tasks.

The problems are designed so that frontier models can solve each individual step, but the full sequence demands extended reasoning: the ability to maintain coherent thinking through a long sequence of steps without losing context or accumulating errors.
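A rough way to see why errors accumulate over long chains is the sketch below. It is a simplified illustration, not the benchmark's scoring method or the authors' analysis: it assumes every step succeeds independently with the same probability, and the function names and the example figures (99.9% per-step accuracy, 1,000 steps) are hypothetical.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Chance that all steps in a chain are correct.

    Assumption (ours, not the paper's): steps fail independently and
    share the same accuracy, so the probabilities simply multiply.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


def required_step_accuracy(target_success: float, num_steps: int) -> float:
    """Per-step accuracy needed to finish the whole chain with the
    given probability, under the same independence assumption."""
    if not 0.0 < target_success <= 1.0:
        raise ValueError("target_success must be in (0, 1]")
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return target_success ** (1.0 / num_steps)


if __name__ == "__main__":
    # Hypothetical figures chosen only to illustrate the compounding effect.
    print(f"{chain_success_probability(0.999, 1000):.3f}")  # roughly 0.368
    print(f"{required_step_accuracy(0.5, 1000):.6f}")
```

Under these assumptions, a model that gets 999 of every 1,000 steps right still completes a 1,000-step chain only about 37% of the time. Real reasoning errors are not independent, so this is only an intuition for the gap between step-level and sequence-level performance.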

How concerning are the results?

Current frontier models fail dramatically: GPT 5.2 achieves only 9.8%, while Gemini 3 Pro scores even lower at 6.1%. In other words, even the most capable AI models solve fewer than one in ten problems that require long, coherent reasoning.

This finding is particularly significant given the growing use of AI agents for autonomous tasks. Agents that plan and execute multi-step operations, from debugging sessions to research processes, depend on exactly this ability to reason coherently over long sequences.

Why is this important for AI safety?

The authors explicitly identify the weakness exposed by LongCoT as critical for the autonomous deployment of AI agents. If a model cannot reliably reason through a long sequence of steps, autonomous agents may make incorrect decisions in the later phases of complex tasks, precisely where the consequences are most severe.

The benchmark also suggests that scaling models up does not, on its own, solve the problem of long reasoning, and that fundamentally new architectures or training methods are needed for models to bridge the gap between short and long chain-of-thought reasoning.

