WindowsWorld benchmark: leading computer-use agents fall below 21% success rate on tasks spanning multiple desktop applications
WindowsWorld is a new benchmark for autonomous GUI agents comprising 181 tasks, with an average of 5.0 sub-goals each, across 17 desktop applications; the tasks are modeled on 16 occupations. Leading computer-use agents achieved less than 21% success on tasks that span more than one application, revealing a large gap between single-application benchmarks like OSWorld and real professional work that requires conditional reasoning across three or more programs.
A research team from the Harbin Institute of Technology (Jinchao Li, Yunxin Li, Chenrui Zhao, Zhenran Xu, Baotian Hu, and Min Zhang) published a new benchmark for autonomous GUI agents on arXiv on April 30, 2026. WindowsWorld shifts the focus from isolated tasks within a single application toward realistic professional work that crosses the boundaries of multiple programs, and reveals that leading computer-use agents achieve less than 21% success on such tasks.
What does WindowsWorld measure differently from OSWorld?
OSWorld and related benchmarks predominantly measure agents within a single application: open a browser, double-click, fill out a form. WindowsWorld explicitly covers multi-app workflows. A single task might, for example, require extracting data from a table in Excel, preparing an email proposal in a mail client, and creating a presentation with the results in a presentation tool. Of the 181 tasks, 78% are inherently multi-application, and the average task has 5.0 sub-goals; the benchmark as a whole covers 17 different desktop applications. Tasks were generated through a multi-agent framework guided by 16 professional roles (occupations), then refined through human review and executed in a simulated environment.
Why do agents fail when a task spans three applications?
The authors’ main experimental finding is that agent performance is inconsistent between single-app and multi-app tasks: leading agents perform well on single-app tasks but fall below 21% success on multi-app workflows. The specific weakness is conditional reasoning across three or more applications, where agents stall at early sub-goals or repeat the same steps. Another issue is low execution efficiency: agents exceed the human ceiling on step count yet still fail to complete the task. In other words, the problem is not step count alone but the ability to maintain state across transitions between programs.
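To make the difference between partial progress and task success concrete, here is a minimal Python sketch. It is not the benchmark's actual harness or scoring code: the `SubGoal` class, the helper functions, the application names, and the all-or-nothing success rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SubGoal:
    app: str            # hypothetical application label
    description: str
    done: bool = False  # whether the agent completed this sub-goal


def apps_involved(subgoals):
    """Distinct applications a task touches, in first-seen order."""
    return list(dict.fromkeys(goal.app for goal in subgoals))


def progress(subgoals):
    """Number of leading sub-goals completed before the first unfinished one."""
    count = 0
    for goal in subgoals:
        if not goal.done:
            break
        count += 1
    return count


def task_succeeded(subgoals):
    """Assumed all-or-nothing rule: every sub-goal must be completed."""
    return all(goal.done for goal in subgoals)


def success_rate(tasks):
    """Fraction of tasks in which every sub-goal was completed."""
    return sum(task_succeeded(task) for task in tasks) / len(tasks)


# Hypothetical run: the agent finishes the spreadsheet step, then stalls.
stalled = [
    SubGoal("spreadsheet", "extract figures from a table", done=True),
    SubGoal("mail", "draft an email proposal"),
    SubGoal("slides", "build a deck with the results"),
]

# Hypothetical run: a shorter task that the agent completes in full.
completed = [
    SubGoal("spreadsheet", "extract figures from a table", done=True),
    SubGoal("mail", "draft an email proposal", done=True),
]
```

Under these assumptions, the stalled run shows progress on one of three sub-goals but still counts as a failed task, which is why an agent that stalls early contributes nothing to the success rate regardless of how many steps it spends.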
Implications for deploying agents in office work
Computer-use agents are among the fastest-growing AI products: Anthropic, OpenAI, and Google DeepMind are actively promoting agents as replacements for routine office work. WindowsWorld signals that the current generation of agents is far from reliably executing the multi-application tasks characteristic of everyday professional work. The benchmark should become a realistic target for the next generation of agents, much as SWE-bench set the direction for coding agents.
Frequently Asked Questions
- How many tasks and applications does WindowsWorld cover?
- The benchmark contains 181 tasks with an average of 5.0 sub-goals per task, distributed across 17 common desktop applications. 78% of tasks are inherently multi-application.
- What is the success rate of the best GUI agents?
- All leading computer-use agents tested achieved less than 21% success on multi-app tasks — drastically lower than on isolated single-app tests.
- How does WindowsWorld differ from OSWorld?
- OSWorld and similar benchmarks measure isolated single-app tasks, while WindowsWorld is explicitly oriented toward cross-application workflows with conditional branching typical of professional work.
This article was generated using artificial intelligence from primary sources.