ArXiv: LongCoT benchmark reveals GPT 5.2 achieves only 9.8% on long chain-of-thought reasoning
Why it matters
LongCoT is a new benchmark of 2,500 expert-designed problems across five domains that tests long chain-of-thought reasoning, where a single solution can require tens to hundreds of thousands of tokens. Current frontier models fail dramatically: GPT 5.2 scores 9.8% and Gemini 3 Pro just 6.1%, a result the authors identify as a critical weakness for the autonomous deployment of AI agents.
An international team of researchers from Oxford, Lawrence Livermore National Laboratory and the AI Safety Institute has published LongCoT, a new benchmark that tests how well AI models sustain long chain-of-thought (CoT) reasoning. The results reveal a concerning weakness even in the most advanced models.
What does LongCoT measure?
The benchmark contains 2,500 expert-designed problems across five domains: chemistry, mathematics, computer science, chess and logic. The key difference from existing benchmarks is that the problems require a chain of thought that extends across tens to hundreds of thousands of tokens, far beyond typical short reasoning tasks.
The problems are designed so that individual steps are solvable by frontier models, but the entire sequence requires extended reasoning: the ability to maintain coherent thinking through a long sequence of steps without losing context or accumulating errors.
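To see why accumulating errors matter even when each step is individually solvable, consider a deliberately simplified model. This is an illustrative assumption, not something taken from the LongCoT paper: suppose every step succeeds independently with the same probability, so the chance of completing the whole chain is that probability raised to the number of steps. The function name and the 99% figure below are hypothetical.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Chance that every step in a chain is correct.

    Simplifying assumption (not from the benchmark): steps succeed
    independently, each with the same probability.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


# Hypothetical per-step accuracy of 99% over chains of increasing length.
for steps in (10, 100, 1000):
    probability = chain_success_probability(0.99, steps)
    print(f"{steps:>5} steps: {probability:.4%} chance of a fully correct chain")
```

Under these assumptions, a model that gets 99% of individual steps right still completes a 100-step chain only about 37% of the time, and a 1,000-step chain almost never. Real reasoning chains are not independent coin flips, so this is only a rough intuition for how small per-step error rates can compound.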
How concerning are the results?
Current frontier models fail dramatically: GPT 5.2 achieves only 9.8%, while Gemini 3 Pro scores even lower at 6.1%. In other words, even the most capable AI models solve fewer than one in ten problems that require long, coherent reasoning.
This finding is particularly significant given the growing use of AI agents for autonomous tasks. Agents that need to plan and execute multi-step operations, from debugging sessions to research processes, depend precisely on this capacity for long, coherent reasoning.
Why is this important for AI safety?
The authors explicitly identify the weakness LongCoT exposes as critical for the autonomous deployment of AI agents. If a model cannot reliably reason through a long sequence of steps, autonomous agents may make incorrect decisions in the later phases of complex tasks, precisely where the consequences are most severe.
The benchmark also suggests that the current approach of scaling models does not automatically solve the problem of long reasoning. Fundamentally new architectures or training methods may be needed for models to bridge the gap between short and long chain-of-thought reasoning.
This article was generated using artificial intelligence from primary sources.