🟡 🤝 Agents · Thursday, April 30, 2026 · 2 min read

LangChain Harness Profiles for Deep Agents: GPT-5.3 Codex Jumps from 33% to 53% on tau2-bench, Opus 4.7 from 43% to 53%

Editorial illustration: agent framework with interchangeable profiles for different language models

LangChain introduced a harness profile system for Deep Agents on April 29, 2026, enabling the same code to work with Anthropic, OpenAI, and Google models without modification. A profile automatically applies a model-specific system prompt, tools, and middleware. On tau2-bench, GPT-5.3 Codex jumped from 33% accuracy to 53%, and Claude Opus 4.7 from 43% to 53%. LangChain's conclusion: a single harness cannot be optimal for every model.

LangChain introduced the harness profile system for its Deep Agents library on April 29, 2026. The system addresses a problem that emerged as production agents began switching between LLMs: a single configuration of system prompt, tools, and middleware that is optimal for one model typically delivers worse results with another. LangChain's conclusion is that the harness should not be shared across models; each model needs its own.

What Do Harness Profiles Change in a Deep Agent?

A profile is a configuration that encapsulates three things: a model-specific system prompt (structure, tone, examples), a set of tools in the format the LLM understands best, and middleware logic (e.g., how a tool result is returned in the next turn). The developer simply swaps the profile; the calling code remains the same. The current built-in profiles cover Anthropic, OpenAI, and Google models, and the community can contribute profiles for other providers.
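The idea can be sketched as a plain data structure. Note that `HarnessProfile`, the field names, and the registry below are illustrative assumptions for this article, not the actual Deep Agents API:

```python
from dataclasses import dataclass, field

# Illustrative sketch only: HarnessProfile and PROFILES are hypothetical
# names, not the real Deep Agents API surface.
@dataclass
class HarnessProfile:
    provider: str
    system_prompt: str       # model-specific structure, tone, examples
    tool_format: str         # the tool format this LLM understands best
    middleware: list = field(default_factory=list)  # e.g. tool-result handling

# One profile per provider; only the lookup key changes per model.
PROFILES = {
    "anthropic": HarnessProfile("anthropic", "<instructions>...</instructions>", "xml_tags"),
    "openai":    HarnessProfile("openai", "You are a coding agent...", "function_calling"),
    "google":    HarnessProfile("google", "You are a helpful agent...", "function_declarations"),
}

def build_agent(provider: str) -> HarnessProfile:
    """Select the profile for a provider; everything downstream stays the same."""
    return PROFILES[provider]

print(build_agent("openai").tool_format)  # function_calling
```

The point of the pattern is that the calling code never branches on the provider; all model-specific decisions live inside the selected profile.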

How Much Exactly Do Results Improve on tau2-bench?

LangChain ran before-and-after evaluations on tau2-bench, a standard benchmark for long-horizon agents. GPT-5.3 Codex rose from 33% accuracy to 53% (+20 percentage points), and Claude Opus 4.7 from 43% to 53% (+10 percentage points). Both models finish at the same accuracy, but from different starting positions. The shift is significant in both cases: it shows the default LangChain harness was optimal for neither model.

What Does This Mean for Multi-Model Pipelines?

LangChain’s comment sums it up: “A single harness can’t be optimal for every model.” Development teams that run multiple models in parallel in production (e.g., Claude for reasoning, GPT for coding, Gemini for multimodal tasks) can now use the same Deep Agents architecture and gain tens of percentage points without rewriting code. The approach fits the broader industry trend toward infrastructure layers for agents: AWS Bedrock AgentCore, Anthropic Claude Code, and Mistral Vibe all moved in the same direction this week, standardizing the agent stack while preserving provider flexibility.
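A multi-model pipeline like the one described above can be sketched as a simple routing table. The task-to-provider mapping below follows the article's example; the function and table names are hypothetical, not a LangChain feature:

```python
# Hypothetical routing sketch: maps a task type to a provider, so the
# same agent architecture runs with a different harness profile per task.
TASK_ROUTES = {
    "reasoning":  "anthropic",  # e.g. Claude for reasoning
    "coding":     "openai",     # e.g. GPT for coding
    "multimodal": "google",     # e.g. Gemini for multimodal tasks
}

def route_task(task_type: str) -> str:
    """Pick a provider for a task; only the profile selected downstream
    differs, the agent loop itself is unchanged."""
    return TASK_ROUTES.get(task_type, "anthropic")  # fall back to a default

print(route_task("coding"))      # openai
print(route_task("multimodal"))  # google
```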

Frequently Asked Questions

What is a harness profile in LangChain Deep Agents?
A configuration that contains a model-specific system prompt, tool set, and middleware options. The developer selects the profile depending on which LLM they are using, and the same calling code works with Anthropic, OpenAI, and Google models without any modifications.
By how much does a harness profile improve performance?
On tau2-bench, GPT-5.3 Codex rose from 33% to 53% accuracy (+20 percentage points), and Claude Opus 4.7 from 43% to 53% (+10 percentage points). Both models end up at the same accuracy level but started from different baselines.
Why doesn't a single harness work?
Different models respond differently to system prompts, tool formats, and middleware logic. Anthropic models prefer structured XML instructions, OpenAI works better with function calling schemas, and Google models have their own format. The profile shapes all of this per model.
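To make the format differences from the last answer concrete, here is an illustrative sketch of one tool definition rendered two ways. Both renderings are simplified assumptions for illustration, not the providers' exact wire formats:

```python
import json

# One provider-neutral tool definition (illustrative example).
tool = {
    "name": "get_weather",
    "description": "Look up current weather",
    "parameters": {"city": "string"},
}

def to_openai_schema(t: dict) -> dict:
    """Render as a simplified OpenAI-style function-calling schema."""
    return {
        "type": "function",
        "function": {
            "name": t["name"],
            "description": t["description"],
            "parameters": {
                "type": "object",
                "properties": {k: {"type": v} for k, v in t["parameters"].items()},
            },
        },
    }

def to_xml_block(t: dict) -> str:
    """Render as a simplified XML-tagged instruction block."""
    params = "".join(f'<param name="{k}" type="{v}"/>' for k, v in t["parameters"].items())
    return (f"<tool><name>{t['name']}</name>"
            f"<description>{t['description']}</description>{params}</tool>")

print(json.dumps(to_openai_schema(tool)))
print(to_xml_block(tool))
```

A harness profile would own these rendering choices, so the agent code never needs to know which format the active model expects.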
🤖

This article was generated using artificial intelligence from primary sources.