OSWorld study: AI computer-use agents often fail when repeating the same task
Why it matters
New research shows that computer-use AI agents that complete a task once may fail on an identical repeated attempt. The study points to three key causes: execution stochasticity, task specification ambiguity, and agent behavior variability.
A new study by Gonzalez-Pumariega et al. and Xin Eric Wang reveals a systemic problem in evaluating AI agents that control computers: successfully completing a task once is no guarantee the agent will succeed a second time.
What are the three main causes of instability?
The research identifies three factors that jointly create agent unreliability. The first is execution stochasticity — each attempt involves random elements such as timing variations in UI loading, different sampling results from the model, and minor changes in the state of the operating system.
The second factor is task specification ambiguity. The same task can be worded vaguely enough to be completed in multiple ways, some of which are “successful” by one metric and “unsuccessful” by another.
The third is agent behavior variability — even given identical input, the agent does not always make identical decisions, particularly in longer action chains where small differences accumulate and compound.
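The compounding effect in long action chains can be illustrated with a toy calculation (not from the paper): if each step of a chain succeeds independently with probability p, the whole chain succeeds with probability p^n, which decays quickly as the chain grows.

```python
# Toy illustration of error compounding in long action chains.
# Assumes each step succeeds independently with probability p.
def chain_success(p: float, n_steps: int) -> float:
    """Probability that every one of n_steps independent steps succeeds."""
    return p ** n_steps

# Even a 98%-reliable step leaves a 40-step chain under 50% reliable.
for n in (5, 10, 20, 40):
    print(n, round(chain_success(0.98, n), 3))
```

Real agents are not this simple (steps are not independent, and agents can recover), but the calculation shows why small per-step differences matter far more in longer tasks.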
How did the OSWorld benchmark help reveal the problem?
The authors conducted experiments on the OSWorld benchmark, a platform for evaluating AI agents in real operating systems and applications. The key methodological intervention was multiple repetitions of the same tasks, rather than the standard single-run measurement.
Results show that an agent that solves a task in one run may get stuck in the next, take a different path that doesn’t lead to success, or get caught in a loop. Such instability remains invisible in benchmarks that test agents only once per task.
The conclusion is that success rates published under standard single-run evaluation may be overestimated: a lucky one-off success counts the same as an agent that succeeds on ten attempts out of ten.
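The repeated-run measurement the authors used can be sketched in a few lines. This is a minimal illustration with invented task names and outcome data, not the paper's actual results: each task is run several times, and the per-task mean and spread are reported instead of a single pass/fail.

```python
from statistics import mean, pstdev

# Hypothetical outcomes over 10 repeated runs per task (1 = success, 0 = failure).
runs = {
    "rename_files": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    "export_chart": [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
}

for task, outcomes in runs.items():
    rate = mean(outcomes)        # average success over repeats
    spread = pstdev(outcomes)    # variability that single-run scoring hides
    print(f"{task}: success rate {rate:.1f}, std dev {spread:.2f}")
```

A single-run benchmark that happened to sample the first outcome of each list would score both tasks as passing, while the repeated view shows one succeeds 80% of the time and the other only 40%.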
What does this mean for agent development?
The practical implications are important for anyone building production systems on computer-use agents. If an agent succeeds in seven out of ten attempts, in production that means three out of ten users get a failure — which is unacceptable for many use cases.
The authors recommend repeated evaluation as a standard, along with measuring variance rather than just average success. They also suggest better task specifications to reduce ambiguity and more robust deterministic interfaces where possible.
For the research community this means a need to revisit how results are reported, and for product builders a need for additional mechanisms such as retry logic, outcome verification, and human-in-the-loop controls.
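The retry-plus-verification pattern mentioned above can be sketched as follows. This is an illustrative skeleton, not the paper's implementation: `run_agent` and `verify` are hypothetical stand-ins for an actual agent call and an independent check of the resulting system state.

```python
import random

def run_agent(task: str) -> str:
    """Stand-in for a flaky computer-use agent (illustrative only)."""
    return "done" if random.random() < 0.7 else "failed"

def verify(task: str, result: str) -> bool:
    """Stand-in for an independent outcome check, e.g. inspecting final state."""
    return result == "done"

def run_with_retries(task: str, max_attempts: int = 3) -> bool:
    """Retry until verification passes; give up after max_attempts."""
    for _ in range(max_attempts):
        if verify(task, run_agent(task)):
            return True
    return False  # hand off to human-in-the-loop review
```

The key design point is that verification is separate from the agent's own self-report, so a run that "looks" finished but left the system in the wrong state still triggers a retry or escalation.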