arXiv TerminalWorld: real-env LLM agent benchmark

TerminalWorld is a new benchmark that evaluates LLM agents on real bash, git, and file operations in genuine Linux processes — no simulation. The eight-author paper led by Zhaoyang Chu and Jiarui Hu sets a new bar for 'computer use' agents and is directly relevant to tools like Claude Code, GitHub Copilot Workspace, and Cursor's agentic mode.

The arXiv preprint TerminalWorld, published on 22 May 2026, introduces a benchmark for evaluating LLM agents on real Linux terminal tasks. The eight authors, led by Zhaoyang Chu and Jiarui Hu, designed it to run in genuine Linux processes, without the simulated or mock environments used by most previous agentic benchmarks.

Why is a real environment critical for a benchmark?

Most existing benchmarks for “computer use” agents, including OSWorld, AgentBench, and WorkArena, rely on simulated or mock environments. The technical reason is that real environments are hard to control: real Linux processes run asynchronously, can hang on network timeouts, generate unpredictable race conditions in the file system, and require waiting for long-running external processes (apt install, git clone, npm run build).

Simulation hides all of that. An agent that scores 85 percent on a simulated benchmark can drop to 50 percent in production because reality contains edge cases that simulation does not cover. TerminalWorld therefore uses real processes — the agent gets access to a genuine Ubuntu container with shell, file system, network, and tools such as git, docker, apt, and curl.
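To make the point about hanging processes concrete, here is a minimal sketch of how a harness might bound a real command with a timeout. The function name `run_with_timeout` and the three status labels are assumptions chosen for illustration; the article does not describe how TerminalWorld itself handles long-running or stuck commands.

```python
import subprocess


def run_with_timeout(argv, timeout_s):
    """Run a real process and classify the outcome.

    Returns (status, exit_code, stdout), where status is one of
    'ok', 'failed', or 'timeout'. Hypothetical helper, not from the paper.
    """
    try:
        proc = subprocess.run(
            argv,
            capture_output=True,  # collect stdout/stderr instead of inheriting them
            text=True,            # decode output as text
            timeout=timeout_s,    # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        # A simulated environment never reaches this branch; a real one can.
        return ("timeout", None, "")
    status = "ok" if proc.returncode == 0 else "failed"
    return (status, proc.returncode, proc.stdout)
```

A call such as `run_with_timeout(["git", "clone", url, dest], 300)` would then distinguish a clone that fails from one that never finishes, which is exactly the kind of outcome a mock environment cannot produce.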

What does the benchmark specifically measure?

TerminalWorld covers three task categories, totaling 240 individual scenarios:

Bash one-liner composition (80 tasks): the agent receives a textual description such as “find all files larger than 100 MB modified in the last 7 days and move them to a backup directory, preserving the path structure.” It must generate one or more bash calls that accomplish this.
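The example task above can be sketched in Python to show what a correct solution has to do, independent of which shell pipeline the agent picks. This is an illustrative sketch, not a reference solution from the benchmark: the function name and parameters are hypothetical, "100 MB" is assumed to mean 100 × 1024 × 1024 bytes, and the backup directory is assumed to lie outside the source tree.

```python
import os
import shutil
import time
from pathlib import Path


def move_large_recent_files(src_root, backup_root,
                            min_bytes=100 * 1024 * 1024,
                            max_age_days=7, now=None):
    """Move files larger than min_bytes and modified within max_age_days
    from src_root to backup_root, keeping each file's relative path.

    Assumes backup_root is not located inside src_root.
    """
    src_root = Path(src_root)
    backup_root = Path(backup_root)
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400

    # Collect candidates first so that moving files does not disturb the walk.
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                continue  # skip symlinks; only regular files are moved
            info = path.stat()
            if info.st_size > min_bytes and info.st_mtime >= cutoff:
                candidates.append(path)

    moved = []
    for path in sorted(candidates):
        target = backup_root / path.relative_to(src_root)
        target.parent.mkdir(parents=True, exist_ok=True)  # recreate the path structure
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

The three conditions an agent has to get right are all visible here: the size threshold, the modification-time window, and the relative path that must be rebuilt under the backup directory.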

Git workflow (80 tasks): rebase scenarios with conflicts, cherry-pick across branches, bisect for regression bugs, force-push recovery, submodule sync. Each task has a git repository in a known state and a defined expected end state.

File operations (80 tasks): recursive permission fixes, log rotation with archiving, backup-restore cycles, large directory tree manipulation, symlink handling across filesystem boundaries.

All tasks have a deterministic success criterion — an automated validator checks the final system state without human intervention.
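One plausible way to implement such a check is to snapshot the directory tree and compare it with an expected snapshot. The sketch below is an assumption about how a state-based validator could work, not the paper's actual validator; the names `snapshot` and `validate` are hypothetical, and the tree is assumed to contain only directories, regular files, and symlinks.

```python
import hashlib
import os
import stat
from pathlib import Path


def snapshot(root):
    """Map each relative path under root to a tuple describing its state."""
    root = Path(root)
    state = {}
    # os.walk does not follow symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            rel = path.relative_to(root).as_posix()
            info = path.lstat()  # lstat so symlinks are not followed
            if stat.S_ISLNK(info.st_mode):
                state[rel] = ("symlink", os.readlink(path))
            elif stat.S_ISDIR(info.st_mode):
                state[rel] = ("dir", stat.S_IMODE(info.st_mode))
            else:
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                state[rel] = ("file", stat.S_IMODE(info.st_mode), digest)
    return state


def validate(root, expected):
    """Return a sorted list of differences; an empty list means success."""
    actual = snapshot(root)
    problems = []
    for rel in sorted(set(expected) | set(actual)):
        if rel not in actual:
            problems.append(f"missing: {rel}")
        elif rel not in expected:
            problems.append(f"unexpected: {rel}")
        elif actual[rel] != expected[rel]:
            problems.append(f"differs: {rel}")
    return problems
```

Because the comparison covers content hashes, permission bits, and symlink targets, the verdict depends only on the final system state and not on which commands the agent ran to get there.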

Which models were tested and what were the results?

The paper tests four frontier models and three open-source agentic frameworks. The table below shows success rates for four of these configurations:

Model	Bash	Git	File ops	Total
GPT-5	71%	64%	68%	68%
Claude Opus 4.7	68%	71%	65%	68%
Gemini 3 Pro	65%	58%	62%	62%
Llama 4 405B + Aider	54%	49%	51%	51%
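The Total column is consistent with a task-weighted average of the three category scores. Since each category has 80 tasks, this reduces to a simple mean. The short check below reproduces the totals under that assumption; the article does not state how the paper computes the Total column, and the names `TASKS`, `SCORES`, and `weighted_total` are illustrative.

```python
# Tasks per category, as given in the article.
TASKS = {"bash": 80, "git": 80, "file_ops": 80}

# Per-category success rates in percent, copied from the table above.
SCORES = {
    "GPT-5": {"bash": 71, "git": 64, "file_ops": 68},
    "Claude Opus 4.7": {"bash": 68, "git": 71, "file_ops": 65},
    "Gemini 3 Pro": {"bash": 65, "git": 58, "file_ops": 62},
    "Llama 4 405B + Aider": {"bash": 54, "git": 49, "file_ops": 51},
}


def weighted_total(scores, tasks=TASKS):
    """Average the category scores, weighting each by its task count."""
    total_tasks = sum(tasks.values())
    return sum(scores[cat] * tasks[cat] for cat in tasks) / total_tasks


totals = {model: round(weighted_total(s)) for model, s in SCORES.items()}
```

Rounded to whole percentages, the computed values match the table: 68, 68, 62, and 51.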

No model exceeds 70 percent on the full benchmark set. The authors interpret this as evidence of significant room for improvement along two dimensions: better tool-use strategies (knowing when to use git status vs git log vs git reflog) and better error recovery (when a bash command fails, agents often generate an identical retry instead of diagnosing the problem).
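The identical-retry failure mode is simple to detect mechanically. The sketch below shows one hypothetical guard an agent loop could use to refuse re-running a command that has already failed unchanged; `RetryGuard` is an illustrative name and is not something the paper or any of the named tools is known to implement.

```python
class RetryGuard:
    """Remember failed commands so an agent loop can spot identical retries."""

    def __init__(self):
        self._failed = set()

    @staticmethod
    def _normalize(command):
        # Collapse runs of whitespace so trivially reformatted commands match.
        return " ".join(command.split())

    def record(self, command, exit_code):
        """Store the command if it failed (non-zero exit code)."""
        if exit_code != 0:
            self._failed.add(self._normalize(command))

    def is_identical_retry(self, command):
        """True if this exact command has already failed once."""
        return self._normalize(command) in self._failed
```

When `is_identical_retry` returns true, the loop could prompt the model to inspect the error output or run a diagnostic command such as git status before trying again.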

What does this mean for tools like Claude Code and Cursor?

TerminalWorld is directly relevant to tools marketed as “AI coding agents”: Claude Code (CLI with shell access), GitHub Copilot Workspace (chat-driven editing), Cursor agentic mode, and Aider (terminal-based). The 68 percent scores for GPT-5 and Claude Opus 4.7 come from bare models without an orchestration layer; production tools add intermediate logic that can boost success by 10–15 percent.

The authors propose that the benchmark become a standard for evaluating future agentic releases, similar to the role MMLU plays in evaluating the general knowledge of LLMs. The benchmark repository is public and accessible to all researchers who wish to reproduce results or add new tasks.

Frequently Asked Questions

Why does a 'real environment' matter in a benchmark?

Simulated benchmarks often hide real edge cases — race conditions in the file system, dependency conflicts in apt repositories, network timeouts. TerminalWorld uses genuine Linux processes, so the agent must solve real problems, not idealized ones.

What exactly does the benchmark measure?

Three task categories: bash one-liner composition (find/grep/awk/sed pipelines), git workflow (rebase, cherry-pick, conflict resolution), and file operations (recursive permission fixes, backup-restore, log rotation). All tasks have a deterministic success criterion.

Which models were tested?

The paper compares GPT-5, Claude Opus 4.7, Gemini 3 Pro, and several open-source models. None reaches a reliable score above 70 percent on the full benchmark set, indicating significant room for improvement in agentic infrastructure.

arXiv:2605.22535: TerminalWorld benchmark measures LLM agents on real Linux terminal tasks without simulation
