arXiv:2605.03195: Terminus-4B, a 4-billion-parameter model for terminal execution, matches Claude Opus and GPT-5.3-Codex on SWE-Bench Pro with ~30% fewer main agent tokens
Terminus-4B is a 4-billion-parameter Qwen3 fine-tune specialized for terminal execution in agentic systems. On the SWE-Bench Pro benchmark it matches, and sometimes outperforms, Claude Sonnet/Opus and GPT-5.3-Codex baselines, while cutting main agent token consumption by approximately 30% because verbose build and test logs stay isolated in a subagent context.
This article was generated using artificial intelligence from primary sources.
Spandan Garg, Vikram Nitin and Yufan Huang published an arXiv preprint on May 4, 2026 that tests whether a specialized small model can replace a frontier LLM for one narrow agentic subtask: terminal execution. Their model, Terminus-4B, is a Qwen3-4B fine-tune that matches, and in some cases outperforms, Claude Sonnet, Claude Opus and GPT-5.3-Codex baselines on the SWE-Bench Pro benchmark.
What is SWE-Bench Pro and why is it relevant?
SWE-Bench Pro is a benchmark that measures the ability of AI agents to independently resolve real software engineering tasks from GitHub issues. The agent must clone a repository, locate relevant files, compile the project, run tests and submit a patch that passes the entire test suite. Compared with the original SWE-Bench, SWE-Bench Pro introduces an independent set of tests and stricter “passed” criteria, making it a more rigorous benchmark.
The authors additionally test the model on an internal SWE-Bench C# benchmark, showing that the specialization transfers to languages that are underrepresented in the training set.
How is the ~30% reduction in main agent tokens achieved?
Terminus-4B takes on the role of a subagent to which the main agent delegates all build, test and shell commands. Verbose outputs (build logs, test traces, exception stacks) remain isolated within the subagent context, while the main agent sees only a summary of results in its window. This reduces main agent token consumption by approximately 30% while maintaining quality parity.
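The delegation pattern described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_in_subagent` and `TerminalSummary` are hypothetical names, and truncating the log to its last lines is only a placeholder for the summary that the small model would write.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class TerminalSummary:
    """Compact result handed back to the main agent (hypothetical shape)."""
    command: str
    exit_code: int
    tail: str        # last few log lines, a stand-in for a model-written summary
    raw_chars: int   # size of the full log that stayed in the subagent context


def run_in_subagent(argv: list[str], tail_lines: int = 3) -> TerminalSummary:
    """Run a command and keep the verbose log local to this function."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    # stdout and stderr are concatenated, not interleaved in original order.
    full_log = proc.stdout + proc.stderr
    # Assumption: a real subagent would ask the small model to summarize
    # full_log; keeping only the tail is a placeholder for that step.
    tail = "\n".join(full_log.splitlines()[-tail_lines:])
    return TerminalSummary(" ".join(argv), proc.returncode, tail, len(full_log))


if __name__ == "__main__":
    # Simulate a noisy build step that prints 200 lines.
    noisy = [sys.executable, "-c", "for i in range(200): print('compiling unit', i)"]
    summary = run_in_subagent(noisy)
    # Only the summary would enter the main agent's context window.
    print(summary.exit_code, len(summary.tail), summary.raw_chars)
```

The main agent receives only the small `TerminalSummary` object, while the full log never leaves the function. The gap between `raw_chars` and the length of `tail` is a rough proxy for the context the main agent avoids consuming.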
Training is two-stage: first Supervised Finetuning (SFT) on traces of successful terminal execution, then Reinforcement Learning with a rubric-based LLM-as-judge reward that evaluates the accuracy and safety of executed commands against predefined criteria.
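To make the reward stage more concrete, here is a hedged sketch of how a rubric-based reward might be aggregated into a single scalar for reinforcement learning. The criteria, weights, and the `keyword_judge` stand-in are all assumptions for illustration; the paper's actual rubric and judge model are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float


# Hypothetical rubric covering accuracy and safety; not the paper's criteria.
RUBRIC = [
    Criterion("command_correct", 0.5),
    Criterion("summary_accurate", 0.3),
    Criterion("no_destructive_side_effects", 0.2),
]

# A judge maps (trace, criterion) to a score, nominally in [0, 1].
Judge = Callable[[str, Criterion], float]


def rubric_reward(trace: str, judge: Judge, rubric=RUBRIC) -> float:
    """Weighted average of per-criterion judge scores, each clipped to [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    score = sum(c.weight * min(1.0, max(0.0, judge(trace, c))) for c in rubric)
    return score / total_weight


def keyword_judge(trace: str, criterion: Criterion) -> float:
    """Toy stand-in for an LLM judge: penalize an obviously destructive command."""
    if criterion.name == "no_destructive_side_effects":
        return 0.0 if "rm -rf /" in trace else 1.0
    return 1.0


if __name__ == "__main__":
    print(rubric_reward("pytest -q", keyword_judge))
    print(rubric_reward("rm -rf / && make", keyword_judge))
```

In a real pipeline the judge would be a call to an LLM prompted with the rubric. Keeping it behind a plain callable makes the aggregation easy to test in isolation.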
What does this mean for agentic system architecture?
The paper moves in the direction of specialized modularity. Instead of a single frontier model doing everything, from planning to executing shell commands, the system splits into a “large mind” that drives reasoning and “small workers” that handle repetitive tasks. A similar idea appears in Anthropic’s Claude Cowork and Microsoft’s AutoGen, but the Terminus-4B results suggest that even a 4B model can substitute for frontier models at parity in the terminal subtask.
It remains to be seen how far this approach will extend to other subtasks (browser automation, code review, regression triage), but the results on a public benchmark suggest that specializing small models is a serious alternative to more expensive frontier inference.
Frequently Asked Questions
- What is SWE-Bench Pro?
- SWE-Bench Pro is an expanded version of the SWE-Bench benchmark that measures the ability of AI agents to independently resolve real software engineering tasks from GitHub issues — from cloning a repository to compiling, testing, and submitting a patch that passes the test suite.
- How was Terminus-4B trained?
- Through two post-training steps on the Qwen3-4B base model: first Supervised Finetuning (SFT) on traces of successful terminal execution, then Reinforcement Learning with a rubric-based LLM-as-judge reward that scores the accuracy and safety of executed commands against predefined criteria.
- Why does reducing main agent tokens by ~30% matter?
- A main agent (e.g. Claude Opus) that feeds every build log and test trace through its own context pays a high price in both tokens and attention quality. Delegating terminal work to a specialized 4B model clears the main context and reduces inference cost.