arXiv: AgentFloor finds small open-weight models (0.27B–32B) sufficient for short-horizon agent tasks; GPT-5 retains an advantage only in long-horizon planning
Ranit Karmakar and Jayita Chatterjee presented AgentFloor, a deterministic benchmark of 30 tasks organized across six capability levels, on which they evaluated 16 open-weight models ranging from 0.27 to 32 billion parameters, plus GPT-5. Conclusion: smaller models are sufficient for short-horizon, structured agent tasks, while frontier models retain a clear advantage only in long-horizon planning under persistent constraints.