ArXiv AgentFloor: small open-weight models (0.27B–32B) are sufficient for short-horizon agent tasks; GPT-5 retains advantage only in long-horizon planning
Ranit Karmakar and Jayita Chatterjee presented AgentFloor, a deterministic benchmark of 30 tasks organized across six capability levels, on which they evaluated 16 open-weight models ranging from 0.27 to 32 billion parameters, plus GPT-5. Their conclusion: smaller models are sufficient for short-horizon, structured agent tasks, while frontier models retain a clear advantage only in long-horizon planning under persistent constraints.
This article was generated using artificial intelligence from primary sources.
On May 1, 2026, Ranit Karmakar and Jayita Chatterjee published the paper “AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?” on ArXiv. Its goal is to determine empirically where the limits of small open-weight models lie in real-world agent tasks, and when a more expensive frontier model is worth the cost.
What is the structure of the AgentFloor benchmark?
AgentFloor is a deterministic benchmark of 30 tasks organized across six capability levels. The levels cover:
- Instruction following (basic reading and execution)
- Tool use (single tool call, clear input)
- Multi-step coordination (sequence of tool calls)
- Long-horizon planning under persistent constraints (tasks that evolve during execution)
- Plus two intermediate levels that grade complexity
The determinism of the benchmark is important: results are reproducible and are not an artifact of randomness in the benchmark itself. This makes AgentFloor a clean measurement instrument for comparing models without the noise that standard agent benchmarks often carry.
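To illustrate what determinism means for a tool-use task, here is a minimal sketch. It is not the AgentFloor harness, whose internals are not described here: the fixture, the `check_stock` tool, the `grade` function, and the scripted agent are all hypothetical names chosen for illustration.

```python
from typing import Callable

# Hypothetical fixture: a fixed table stands in for a live service,
# so the tool returns the same answer on every run.
INVENTORY = {"widget": 4, "gadget": 0}

Tool = Callable[[str], int]


def check_stock(item: str) -> int:
    """Deterministic tool: a pure lookup with no network, clock, or randomness."""
    return INVENTORY.get(item, 0)


def grade(agent: Callable[[Tool], str], expected: str) -> bool:
    """Exact-match grading against a fixed expected answer."""
    return agent(check_stock).strip() == expected


def scripted_agent(tool: Tool) -> str:
    """Stand-in for a model: calls the tool once and reports the result."""
    return "in stock" if tool("widget") > 0 else "out of stock"


# Repeated runs give the same verdict because nothing in the task varies.
results = [grade(scripted_agent, "in stock") for _ in range(3)]
```

With a fixed fixture and exact-match grading, any variation in score between runs would have to come from the model itself, not from the task.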
16 open-weight models ranging from 0.27 to 32 billion parameters were evaluated, plus GPT-5 as a frontier reference. The specific list of 16 models is not given in the public abstract, but the range covers everything from small on-device models to mid-sized open-weight LLMs that can run on a consumer GPU.
What did the authors find?
The main finding can be stated briefly: “smaller open-weight models are already sufficient” for routine tool use. Strong open-weight models (likely in the 14B–32B parameter range) match GPT-5 in performance on short-horizon, structured tasks.
The difference only becomes clear in long-horizon planning under persistent constraints — tasks where the agent must maintain context across dozens of steps, track meta-state (e.g., remaining budget), and adapt strategy as constraints change. That is where GPT-5 still leads.
This pattern supports a hybrid architecture as a rational design for enterprise agents:
- Small model (0.27B–7B) for routine work: checks, one-off lookups, formatting
- Mid-sized model (14B–32B) for standard tool calls and short-horizon coordination
- Frontier model (GPT-5 class) only for tasks requiring long-horizon planning under constraints
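The three tiers above can be sketched as a simple routing rule. This is an illustrative assumption, not a method from the paper: the `AgentTask` fields, the `route` function, and the `LONG_HORIZON_STEPS` cutoff are hypothetical, and a real system would need its own way to estimate task length and constraints.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SMALL = "small open-weight (0.27B-7B)"
    MID = "mid-sized open-weight (14B-32B)"
    FRONTIER = "frontier (GPT-5 class)"


@dataclass(frozen=True)
class AgentTask:
    expected_steps: int           # rough estimate of steps or tool calls
    persistent_constraints: bool  # e.g. a budget that must hold across steps


# Hypothetical cutoff: the article only speaks of "dozens of steps".
LONG_HORIZON_STEPS = 12


def route(task: AgentTask) -> Tier:
    """Pick the cheapest tier that the article suggests is adequate."""
    if task.persistent_constraints and task.expected_steps >= LONG_HORIZON_STEPS:
        return Tier.FRONTIER  # long-horizon planning under constraints
    if task.expected_steps <= 1:
        return Tier.SMALL     # checks, one-off lookups, formatting
    return Tier.MID           # standard tool calls, short-horizon coordination
```

For example, `route(AgentTask(expected_steps=30, persistent_constraints=True))` selects the frontier tier, while a single formatting step goes to the small model. The rule checks the expensive case first so that a long, constrained task is never downgraded by the cheaper branches.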
What does this mean for the cost structure of agent systems?
The implication is significant for enterprise budgets. Assuming a typical agent workflow spends 80–90% of its calls on routine work (fetching data, formatting responses, branching on conditions), redirecting that share to a 7B–32B open-weight model running locally could cut infrastructure cost severalfold compared to an all-frontier deployment. The saving approaches an order of magnitude only at the top of that range and when local inference is close to free, since the remaining 10–20% of calls still go to the frontier model.
The frontier model remains reserved for the 10–20% of calls where it truly makes a difference. Some tech companies already use this design in practice, and AgentFloor adds a quantitative basis for arguing where the boundary lies and which models to choose.
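The arithmetic behind this estimate can be checked with a short sketch. It assumes a uniform cost per call within each tier, which the article does not establish; the function names and the cost values are illustrative, not figures from the paper.

```python
def blended_cost(routine_share: float, local_cost: float, frontier_cost: float) -> float:
    """Average cost per call when routine calls go to a local model."""
    if not 0.0 <= routine_share <= 1.0:
        raise ValueError("routine_share must be between 0 and 1")
    return routine_share * local_cost + (1.0 - routine_share) * frontier_cost


def savings_factor(routine_share: float, local_cost: float, frontier_cost: float) -> float:
    """How many times cheaper the hybrid setup is than all-frontier."""
    return frontier_cost / blended_cost(routine_share, local_cost, frontier_cost)


# Costs are in arbitrary units with the frontier call normalized to 1.0.
best_case = savings_factor(0.90, 0.0, 1.0)    # free local inference, 90% routine
low_case = savings_factor(0.80, 0.0, 1.0)     # free local inference, 80% routine
with_local = savings_factor(0.85, 0.05, 1.0)  # local call at 1/20 of frontier price
```

Under these assumptions, a 90% routine share with negligible local cost gives a saving of about 10x, and an 80% share gives about 5x. Any nonzero local cost lowers the factor: 85% routine at one twentieth of the frontier price yields roughly 5.2x.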
The paper is available on ArXiv under ID 2605.00334.
Frequently Asked Questions
- What capability levels does AgentFloor measure?
- Six levels: instruction following, tool use, multi-step coordination, long-horizon planning under persistent constraints, plus two intermediate levels. The benchmark contains 30 deterministic tasks distributed across these six levels.
- What is the range of evaluated models?
- 16 open-weight models ranging from 0.27 to 32 billion parameters, plus GPT-5 as a frontier reference. The specific list of 16 models is not given in the abstract, but the range covers the spectrum from small on-device models to mid-sized open-weight LLMs.
- When do frontier models still have an advantage?
- In long-horizon planning under persistent constraints — tasks requiring context maintenance across dozens of steps and strategy adaptation as constraints change. On short-horizon, structured tasks the gap narrows significantly.