OSWorld study: AI computer-use agents often fail when repeating the same task
Why it matters
New research shows that computer-use AI agents that complete a task once may fail on an identical repeated attempt. The study points to three key causes: execution stochasticity, task specification ambiguity, and agent behavior variability.
A new study by Gonzalez-Pumariega et al. and Xin Eric Wang reveals a systemic problem in evaluating AI agents that control computers: successfully completing a task once is no guarantee the agent will succeed a second time.
What are the three main causes of instability?
The research identifies three factors that jointly create agent unreliability. The first is execution stochasticity — each attempt involves random elements such as timing variations in UI loading, different sampling results from the model, and minor changes in the state of the operating system.
The second factor is task specification ambiguity. The same task can be worded vaguely enough to be completed in multiple ways, some of which are “successful” by one metric and “unsuccessful” by another.
The third is agent behavior variability — even given identical input, the agent does not always make identical decisions, particularly in longer action chains where small differences accumulate and compound.
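The compounding effect in long action chains can be illustrated with a toy calculation (not from the paper): if each step of a chain succeeds independently with probability p, the whole chain succeeds with probability p^n, which decays quickly as the chain grows.

```python
# Toy illustration of error compounding in long action chains.
# Assumes each step succeeds independently with probability p.
def chain_success(p: float, n_steps: int) -> float:
    """Probability that every one of n_steps independent steps succeeds."""
    return p ** n_steps

# Even a 98%-reliable step leaves a 40-step chain under 50% reliable.
for n in (5, 10, 20, 40):
    print(n, round(chain_success(0.98, n), 3))
```

Real agents are not this simple (steps are not independent, and agents can recover), but the calculation shows why small per-step differences matter far more in longer tasks.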
How did the OSWorld benchmark help reveal the problem?
The authors conducted experiments on the OSWorld benchmark, a platform for evaluating AI agents in real operating systems and applications. The key methodological intervention was multiple repetitions of the same tasks, rather than the standard single-run measurement.
Results show that an agent that solves a task in one run may get stuck in the next, take a different path that doesn’t lead to success, or get caught in a loop. Such instability remains invisible in benchmarks that test agents only once per task.
The conclusion is that success rates published under standard single-run evaluation may be overestimated: a lucky one-off success counts the same as an agent that succeeds on ten attempts out of ten.
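The repeated-run measurement the authors used can be sketched in a few lines. This is a minimal illustration with invented task names and outcome data, not the paper's actual results: each task is run several times, and the per-task mean and spread are reported instead of a single pass/fail.

```python
from statistics import mean, pstdev

# Hypothetical outcomes over 10 repeated runs per task (1 = success, 0 = failure).
runs = {
    "rename_files": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    "export_chart": [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
}

for task, outcomes in runs.items():
    rate = mean(outcomes)        # average success over repeats
    spread = pstdev(outcomes)    # variability that single-run scoring hides
    print(f"{task}: success rate {rate:.1f}, std dev {spread:.2f}")
```

A single-run benchmark that happened to sample the first outcome of each list would score both tasks as passing, while the repeated view shows one succeeds 80% of the time and the other only 40%.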
What does this mean for agent development?
The practical implications are important for anyone building production systems on computer-use agents. If an agent succeeds in seven out of ten attempts, in production that means three out of ten users get a failure — which is unacceptable for many use cases.
The authors recommend repeated evaluation as a standard, along with measuring variance rather than just average success. They also suggest better task specifications to reduce ambiguity and more robust deterministic interfaces where possible.
For the research community this means a need to revisit how results are reported, and for product builders a need for additional mechanisms such as retry logic, outcome verification, and human-in-the-loop controls.
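The retry-plus-verification pattern mentioned above can be sketched as follows. This is an illustrative skeleton, not the paper's implementation: `run_agent` and `verify` are hypothetical stand-ins for an actual agent call and an independent check of the resulting system state.

```python
import random

def run_agent(task: str) -> str:
    """Stand-in for a flaky computer-use agent (illustrative only)."""
    return "done" if random.random() < 0.7 else "failed"

def verify(task: str, result: str) -> bool:
    """Stand-in for an independent outcome check, e.g. inspecting final state."""
    return result == "done"

def run_with_retries(task: str, max_attempts: int = 3) -> bool:
    """Retry until verification passes; give up after max_attempts."""
    for _ in range(max_attempts):
        if verify(task, run_agent(task)):
            return True
    return False  # hand off to human-in-the-loop review
```

The key design point is that verification is separate from the agent's own self-report, so a run that "looks" finished but left the system in the wrong state still triggers a retry or escalation.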