arXiv Odysseys: CMU's realistic web agent benchmark shows the strongest frontier models reach only 44.5% success and 1.15% Trajectory Efficiency on long-horizon tasks
CMU researchers Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov published the arXiv preprint Odysseys on April 27, 2026. It is a benchmark of 200 long-horizon web tasks derived from authentic browsing sessions and run on the live internet. Rubric-based evaluation (an average of 6.1 rubrics per task) shows that the strongest frontier models tested achieve only a 44.5% success rate and 1.15% Trajectory Efficiency, revealing large gaps in current web agents.
The preprint's full title is Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks. The Carnegie Mellon University team's new benchmark shows how far current web agents are from real-world deployment.
The problem with existing benchmarks
Quote from the abstract:
“Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on.”
In other words, benchmarks such as WebArena and Mind2Web are becoming saturated: frontier models score highly on them, which creates the impression that the problem is solved. Real-world web use looks different:
- Multiple pages and multiple websites simultaneously
- Sessions lasting 10+ minutes
- Pages changing in real time (cookies, pop-ups, A/B tests)
- Goals that are ambiguous
Odysseys — what is new?
The benchmark consists of 200 long-horizon web tasks derived from authentic browsing sessions and evaluated on the live internet. Each task is graded against an average of 6.1 rubrics rather than a single binary pass/fail check.
The rubric-based approach provides two benefits:
- Granular insight: it shows which parts of a task the agent solves and which it does not
- Better alignment with human judgment: the authors note that rubric-based evaluation shows “improved alignment with human judgment compared to trajectory-level LLM evaluation”
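The difference between rubric-based and binary grading can be sketched in a few lines of Python. This is an illustration only, not the authors' scoring code: the article does not describe how Odysseys aggregates rubrics, so equal weighting, the all-rubrics-must-pass rule for success, the names `RubricResult`, `rubric_score`, and `binary_success`, and the example rubric wording are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    description: str  # what the rubric checks (hypothetical wording)
    satisfied: bool   # judged outcome for this rubric


def rubric_score(results):
    """Fraction of rubrics satisfied, in [0, 1] (assumes equal weights)."""
    if not results:
        raise ValueError("a task needs at least one rubric")
    return sum(r.satisfied for r in results) / len(results)


def binary_success(results):
    """Strict pass/fail: every rubric must be satisfied (assumed rule)."""
    return all(r.satisfied for r in results)


# A made-up task with four rubrics, two of which the agent satisfied.
task = [
    RubricResult("found the correct product page", True),
    RubricResult("applied the requested filter", True),
    RubricResult("compared prices across two sites", False),
    RubricResult("reported the cheapest option", False),
]

print(rubric_score(task))    # 0.5
print(binary_success(task))  # False
```

Under binary grading this run is simply a failure. The rubric view records that the agent got halfway and shows which steps it missed.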
Results: a painful gap
Two key metrics are reported for the strongest frontier models tested:
- Success rate: 44.5%, so fewer than half of the tasks were completed successfully
- Trajectory Efficiency: 1.15%, a measure of rubric score per step
The second number is especially concerning. Low Trajectory Efficiency means the agent takes many actions that do not contribute to the solution, such as wandering between pages and clicking the wrong links. It may eventually succeed, but through brute force rather than systematic planning.
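To make "rubric score per step" concrete, here is a minimal sketch. It assumes the metric is a task's rubric score (between 0 and 1) divided by the number of actions taken, expressed as a percentage. The article gives only the short description, so the preprint's exact formula may differ, and the function name `trajectory_efficiency` and the step counts are hypothetical.

```python
def trajectory_efficiency(rubric_score: float, num_steps: int) -> float:
    """Rubric score per step, as a percentage (assumed definition)."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= rubric_score <= 1.0:
        raise ValueError("rubric_score must be in [0, 1]")
    return 100.0 * rubric_score / num_steps


# Two made-up runs that both satisfy every rubric (score 1.0).
direct = trajectory_efficiency(rubric_score=1.0, num_steps=20)
wandering = trajectory_efficiency(rubric_score=1.0, num_steps=80)

print(direct, wandering)  # 5.0 1.25
```

Both hypothetical runs count as successes, yet the second has a quarter of the first's efficiency because it takes four times as many steps. A success rate alone would not show this difference.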
Models tested
The abstract mentions “several leading frontier models”, but the portion retrieved for this article does not name them. GPT-5, Claude Opus 4.6/4.7, and Gemini 3 are plausible candidates as the main state-of-the-art contenders for web agents, but this is a guess, not something the source confirms.
Why does this matter?
Odysseys offers an empirical counterweight to the hype. The industry is aggressively pushing “AI agents that perform tasks on your behalf” (OpenAI Managed Agents, Mistral Vibe, Anthropic Claude Code), but results on realistic web tasks show that:
- Models are far from human-level for multi-step web tasks
- Existing benchmarks overestimate real capability
- Inefficient planning is an even larger weakness than the raw success rate suggests
For enterprises: before deploying a web agent in production, measure Trajectory Efficiency alongside success rate and give it equal weight. Otherwise, you risk paying token costs for “eventual successes” that take longer than doing the work manually.
Frequently Asked Questions
- What distinguishes Odysseys from existing web agent benchmarks?
- Existing benchmarks have converged on short, single-site tasks on which frontier models are approaching saturation. Odysseys brings 200 long-horizon tasks derived from authentic browsing sessions (multiple pages, multiple steps) on the live internet, not synthetic traces. Evaluation is rubric-based (an average of 6.1 rubrics per task) rather than binary pass/fail.
- What is Trajectory Efficiency?
- A metric described as rubric score per step, that is, how much rubric progress the agent makes per action. The strongest frontier models tested achieve only 1.15% Trajectory Efficiency, meaning the agent takes many actions that do not contribute to the solution, even when it eventually succeeds.
- What does this benchmark reveal?
- The strongest frontier models tested achieve a 44.5% success rate on realistic long-horizon tasks. Combined with low Trajectory Efficiency, this suggests that current-generation agents often 'eventually succeed' through brute force rather than systematic planning. It reveals a genuine gap between closed lab benchmarks and real-world web use.
This article was generated using artificial intelligence from primary sources.