GitHub Copilot harness: vendor level, fewer tokens

The GitHub Copilot agentic harness is the layer that gives models tools and an execution loop for autonomous coding. GitHub tested it with Claude Sonnet 4.6, Claude Opus 4.7, GPT-5.4, and GPT-5.5 across five benchmarks and reports task completion rates on par with vendor harnesses while consuming fewer tokens in most configurations. The harness supports more than 20 frontier models.

How does the Copilot harness compare to vendor harnesses?

GitHub published an evaluation of its own Copilot agentic harness, the layer that gives models tools, context, and an execution loop for autonomously solving coding tasks. The evaluation ran Claude Sonnet 4.6, Claude Opus 4.7, GPT-5.4, and GPT-5.5 on five benchmarks. The key finding: the Copilot harness achieves task completion rates on par with model vendor harnesses while consuming fewer tokens in most configurations.
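To make "tools, context, and an execution loop" concrete, here is a minimal sketch of what such a loop could look like. It is an illustration only and not GitHub's implementation: the `Harness` class, the `read_file` tool, and the `scripted_model` stand-in are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration; this is not the Copilot harness API.
Tool = Callable[[str], str]
Model = Callable[[list[str]], tuple[str, str]]


@dataclass
class Harness:
    tools: dict[str, Tool]                             # the tools the model may call
    max_steps: int = 10                                # guard against endless loops
    context: list[str] = field(default_factory=list)   # running transcript

    def run(self, task: str, model: Model) -> str:
        """Ask the model for an action, run the tool, feed the result back."""
        self.context.append(f"task: {task}")
        for _ in range(self.max_steps):
            action, argument = model(self.context)
            if action == "finish":
                return argument
            result = self.tools[action](argument)
            self.context.append(f"{action}({argument}) -> {result}")
        return "step limit reached"


def scripted_model(context: list[str]) -> tuple[str, str]:
    """Stand-in for a real model: read one file, then finish."""
    if len(context) == 1:
        return "read_file", "README.md"
    return "finish", "done after reading README.md"


harness = Harness(tools={"read_file": lambda path: f"<contents of {path}>"})
answer = harness.run("summarize the README", scripted_model)
print(answer)
```

In a real harness the model call would go to an LLM and each step would add tokens to the context, which is why the design of this loop affects token consumption.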

Benchmark setup

Five benchmarks cover different types of work: SWE-bench Verified (500 tasks), SWE-bench Pro (complex multi-step tasks), SkillsBench, TerminalBench, and Win-Hill. SWE-bench Verified measures how well a model resolves real GitHub issues in software repositories. The Copilot harness now supports more than 20 frontier models, including GPT, Claude, Gemini, Microsoft’s MAI models, and open-source options.

Results by model

GPT models showed the strongest cost-effectiveness (score-to-price ratio), while Claude Opus 4.7 achieved the highest solve rate at a premium price. GitHub cautions about variability: differences between models on TerminalBench often fall within one standard deviation (±1σ) of the run-to-run variation. In other words, a single attempt is not sufficient for ranking; repeated measurements are required.
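The point about repeated measurements can be sketched in a few lines. The example below assumes one possible reading of the ±1σ caveat: two models are treated as indistinguishable when the gap between their mean scores is no larger than the bigger of their two standard deviations. The scores are made-up placeholders, not GitHub's results, and `within_one_sigma` is a hypothetical helper.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return the mean and sample standard deviation of repeated runs."""
    return mean(scores), stdev(scores)


def within_one_sigma(runs_a, runs_b):
    """True when the gap between means is no larger than the larger sigma."""
    mean_a, sd_a = summarize(runs_a)
    mean_b, sd_b = summarize(runs_b)
    return abs(mean_a - mean_b) <= max(sd_a, sd_b)


# Made-up solve rates from five repeated runs per model (placeholders only).
model_a = [0.50, 0.54, 0.52, 0.56, 0.48]
model_b = [0.53, 0.51, 0.55, 0.49, 0.57]

if within_one_sigma(model_a, model_b):
    print("difference is within run-to-run noise; do not rank on this alone")
else:
    print("difference exceeds one standard deviation")
```

A single run per model would give no estimate of the spread at all, which is why ranking from one attempt is unreliable.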

What this means for developers

The message for teams is to match model selection to task type and budget rather than chasing a single best model. Lower token consumption at the same completion rate means the Copilot harness can reduce the cost of agentic coding. The results also emphasize that benchmark numbers should be read together with their run-to-run variation, not as absolute rankings.
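The cost argument is simple arithmetic and can be checked with a short sketch. It assumes cost scales linearly with tokens at a flat price per million tokens, and it uses the article's definition of cost-effectiveness as a score-to-price ratio. The `RunStats` class and every number below are hypothetical placeholders, not figures from GitHub's evaluation.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    name: str
    solve_rate: float              # fraction of tasks solved, 0.0 to 1.0
    tokens_per_task: float         # average tokens consumed per task
    usd_per_million_tokens: float  # assumed flat token price

    @property
    def cost_per_task(self) -> float:
        """Average cost of one task in USD."""
        return self.tokens_per_task / 1_000_000 * self.usd_per_million_tokens

    @property
    def score_per_dollar(self) -> float:
        """Score-to-price ratio: solve rate divided by cost per task."""
        return self.solve_rate / self.cost_per_task


# Placeholder values: same solve rate and price, different token usage.
lean = RunStats("lean harness", 0.60, 400_000, 5.0)
heavy = RunStats("heavy harness", 0.60, 500_000, 5.0)

for stats in (lean, heavy):
    print(f"{stats.name}: ${stats.cost_per_task:.2f} per task, "
          f"{stats.score_per_dollar:.2f} score per dollar")
```

With the solve rate held equal, the harness that uses fewer tokens per task comes out ahead on score per dollar, which is the mechanism behind the cost claim above.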

Frequently Asked Questions

What is an agentic harness?

An agentic harness is a layer that gives a model tools, context, and an execution loop so it can autonomously solve coding tasks; GitHub Copilot uses its own harness across 20+ models.

Which models were tested?

Claude Sonnet 4.6, Claude Opus 4.7, GPT-5.4, and GPT-5.5, across five benchmarks including SWE-bench Verified with 500 tasks.

Which model was the most cost-effective?

GPT models showed the best cost-effectiveness, while Claude Opus 4.7 achieved the highest solve rate at a premium price.
