arXiv:2604.21764: reasoning skills for fewer tokens at ACL 2026

Guangxiang Zhao and co-authors published the paper 'Thinking with Reasoning Skills: Fewer Tokens, More Accuracy' on April 23, 2026; it was accepted at the ACL 2026 Industry Track. The approach distills 'reusable reasoning skills' from long chain-of-thought reasoning and retrieves them as guidance for new problems, significantly reducing token count while improving accuracy on coding and math tasks.

Guangxiang Zhao, Qilong Shi, Xusen Xiao, Xiangzheng Zhang, Tong Yang and Lin Sun posted the paper “Thinking with Reasoning Skills: Fewer Tokens, More Accuracy” (arXiv:2604.21764) to arXiv on April 23, 2026. It was accepted to the Industry Track of ACL 2026, the 64th Annual Meeting of the Association for Computational Linguistics.

What problem does the paper solve?

Modern reasoning LLMs (models like OpenAI o1, DeepSeek R1, Claude Opus with thinking mode) achieve high accuracy on complex tasks by generating long chain-of-thought (CoT) traces: internal step-by-step reasoning that typically consumes hundreds or thousands of tokens before the final answer. The problem is that the model “spends substantial tokens on long intermediate reasoning traces when solving new problems”, which raises both cost per query and latency. For production deployment this is a serious economic barrier: depending on the model and task, a single reasoning query can cost on the order of 10× more than a standard completion.

What is the solution?

The authors take a different approach: instead of reasoning from scratch on every query, they “propose to summarize and store reusable reasoning skills distilled from extensive deliberation and trial-and-error exploration”. The idea is that once the model has solved a problem with a long CoT, a compact ‘skill’ is extracted that summarizes the key reasoning steps. These skills are stored in a repository. When a new query arrives, the system first retrieves the relevant skills and passes them to the model as guidance, “helping the model avoid redundant detours and focus on effective solution paths”.
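The store-then-retrieve loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Skill` record, the `SkillStore` class, the `build_prompt` helper, and the bag-of-words cosine similarity are all hypothetical stand-ins, since the abstract does not say how skills are represented or retrieved. A real system would likely use an embedding model instead of word counts.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Skill:
    """Hypothetical record: a solved problem plus a compact summary of how it was solved."""
    problem: str
    summary: str


def tokenize(text: str) -> Counter:
    # Lowercased word counts; an assumed stand-in for a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SkillStore:
    """Hypothetical in-memory skill repository."""

    def __init__(self) -> None:
        self.skills: list[Skill] = []

    def add(self, skill: Skill) -> None:
        self.skills.append(skill)

    def retrieve(self, query: str, k: int = 1) -> list[Skill]:
        # Rank stored skills by similarity between the query and the original problem.
        q = tokenize(query)
        ranked = sorted(self.skills, key=lambda s: cosine(q, tokenize(s.problem)), reverse=True)
        return ranked[:k]


def build_prompt(query: str, skills: list[Skill]) -> str:
    # Prepend the retrieved skill summaries so the model can start from them.
    hints = "\n".join(f"- {s.summary}" for s in skills)
    return f"Relevant skills from earlier problems:\n{hints}\n\nProblem: {query}"


store = SkillStore()
store.add(Skill("Find the longest increasing subsequence of an array",
                "Use patience sorting with binary search for O(n log n)."))
store.add(Skill("Compute the sum of an arithmetic series",
                "Apply n * (first + last) / 2 instead of summing term by term."))

query = "What is the sum of the arithmetic series 3, 7, 11, ..., 99?"
prompt = build_prompt(query, store.retrieve(query, k=1))
print(prompt)
```

The prompt sent to the model then carries the arithmetic-series skill ahead of the problem statement, which is the sense in which a retrieved skill can replace part of a long exploratory trace.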

Structured vs. free reasoning

The difference from classic CoT is that free reasoning starts from scratch every time and may explore many candidate approaches, including those that lead nowhere. Structured reasoning guided by distilled skills acts as an “experiential shortcut”: the model receives a summary of what worked before and can apply it immediately. This is conceptually close to case-based reasoning from the classical AI literature, applied here to retrieval-augmented LLM inference.

What are the concrete results?

The authors evaluate the approach on coding and math reasoning tasks. The abstract states that it “significantly reduces reasoning tokens while improving overall performance”; specific token reduction percentages and accuracy improvements are given in the main paper text rather than the public abstract. The economic implication is clear: “The resulting lower per-request cost indicates strong practical and economic potential for real-world deployment”.

Why is the paper important for industry?

Acceptance at the ACL Industry Track suggests that peer reviewers saw practical relevance in the work, not only academic interest. For companies serving reasoning models via API (OpenAI, Anthropic, Google, DeepSeek), this approach could noticeably affect profit margins: fewer tokens per query means cheaper operations or a better price-to-quality ratio. When a reasoning model can consume around 10× more tokens than a regular model, even a hypothetical 30–40% reduction (the abstract gives no specific figure) could represent millions in savings for hyperscalers processing billions of queries per month.
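The savings estimate above is simple arithmetic, and a short calculation makes its scale concrete. Every input below is an illustrative assumption rather than a figure from the paper: one billion queries per month, 2,000 reasoning tokens per query, and a cost of $5 per million tokens. The `monthly_savings` function is a hypothetical helper written for this example.

```python
def monthly_savings(queries_per_month: int, reasoning_tokens_per_query: int,
                    price_per_million_tokens: float, reduction: float) -> float:
    """Dollars saved per month if reasoning tokens shrink by `reduction` (a fraction in 0..1)."""
    tokens_saved = queries_per_month * reasoning_tokens_per_query * reduction
    return tokens_saved / 1_000_000 * price_per_million_tokens


# All inputs are illustrative assumptions, not figures from the paper.
for reduction in (0.30, 0.40):
    saved = monthly_savings(
        queries_per_month=1_000_000_000,
        reasoning_tokens_per_query=2_000,
        price_per_million_tokens=5.0,
        reduction=reduction,
    )
    print(f"{reduction:.0%} reduction: ${saved:,.0f} saved per month")
```

Under these assumptions the two scenarios come to roughly $3 million and $4 million per month. The result scales linearly with each input, so a different price or query volume changes the total proportionally.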

Frequently Asked Questions

What is the 'overthinking' problem in reasoning LLMs?

Reasoning models like OpenAI o1 or DeepSeek R1 generate very long chain-of-thought traces when solving new problems (often thousands of tokens) because they repeatedly explore the same approaches and dead ends. This dramatically increases inference cost and latency.

How does the approach in the paper solve the problem?

The authors propose extracting and storing 'reusable reasoning skills' distilled from previous long reasoning sessions. During inference, the model retrieves relevant skills for the query and uses them as guidance instead of reasoning from scratch, thereby avoiding redundant detours.

What does 'ACL Industry Track' mean?

ACL (Association for Computational Linguistics) organizes one of the top NLP conferences. The Industry Track is a dedicated track for papers with a focus on practical application, so acceptance there points to applied relevance and not only academic interest.
