ArXiv study: in-context prompting outperforms LangGraph, CrewAI, Google ADK, and OpenAI Agents SDK on procedural tasks
In-context prompting is an architectural approach in which an entire procedural workflow is embedded directly in the system prompt instead of being orchestrated through a framework. An ArXiv study of 200 conversations per condition shows that this approach outperforms LangGraph, CrewAI, Google ADK, and OpenAI Agents SDK across three domains: travel booking, Zoom technical support, and insurance claims processing.
A team of Simon Dennis, Michael Diamond, Rivaan Patil, Kevin Shabahang, and Hao Guo published a study on ArXiv on April 30, 2026, with a provocative title: “In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks.” The claim they defend is that the advanced capabilities of today’s frontier models make external orchestration frameworks redundant for a significant class of procedural, multi-turn tasks.
What makes in-context prompting better than orchestration?
The in-context approach embeds the entire procedural workflow (the list of steps, branching conditions, output format, and escalation rules) directly in the system prompt of a single model. Orchestration frameworks (LangGraph, CrewAI, Google ADK, OpenAI Agents SDK) split the same workflow into a graph of nodes and route each model call through separate coordination logic. The authors argue that frontier models now have sufficient “self-orchestration” capability to follow complex procedures within a single call, while switching context between graph nodes loses information and raises failure rates.
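To make the contrast concrete, here is a minimal sketch of what embedding a workflow in a system prompt could look like. It is not the prompt format used in the study: the `Step` class, the `build_system_prompt` helper, and the sample travel-booking steps are all hypothetical and exist only for illustration. No model is called; the code only assembles the prompt text.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of a procedure (hypothetical structure, not from the paper)."""
    name: str
    instruction: str
    # Maps a condition in plain language to the name of the step to jump to.
    branches: dict[str, str] = field(default_factory=dict)


def build_system_prompt(task: str, steps: list[Step],
                        output_format: str, escalation: str) -> str:
    """Render steps, branches, output format, and escalation as one prompt."""
    lines = [f"You are handling: {task}.",
             "Follow this procedure in order.",
             ""]
    for number, step in enumerate(steps, start=1):
        lines.append(f"{number}. [{step.name}] {step.instruction}")
        for condition, target in step.branches.items():
            lines.append(f"   - If {condition}, go to [{target}].")
    lines.append("")
    lines.append(f"Output format: {output_format}")
    lines.append(f"Escalation: {escalation}")
    return "\n".join(lines)


# Illustrative travel-booking procedure (invented for this sketch).
steps = [
    Step("collect_details", "Ask for origin, destination, and dates.",
         {"any detail is missing": "collect_details"}),
    Step("offer_options", "Present up to three matching itineraries.",
         {"the user rejects all options": "collect_details"}),
    Step("confirm", "Repeat the chosen itinerary and ask for confirmation."),
]

prompt = build_system_prompt(
    task="a travel booking request",
    steps=steps,
    output_format="plain text, one question per turn",
    escalation="hand off to a human agent if the user asks for a refund",
)
print(prompt)
```

In an orchestration framework, each of these steps would typically be a separate node with its own model call and explicit edges; here the whole procedure travels with every turn as a single string.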
Three domains and concrete results
The experiment covered three domains: travel booking, Zoom technical support, and insurance claims processing. Each domain had 200 conversations per condition, scored by an LLM-as-judge on five quality criteria. The in-context baseline scored 4.53–5.00 on a 1–5 scale, while LangGraph, the closest orchestrator, trailed at 4.17–4.84. Failure rates diverged more sharply: 11.5%, 0.5%, and 5% per domain for in-context versus 24%, 9%, and 17% for orchestration.
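The failure rates above can be turned into approximate conversation counts and gaps with a few lines of arithmetic. The sketch below makes two assumptions the article does not state explicitly: that the three percentages are listed in the same order as the domains (travel booking, Zoom technical support, insurance claims processing), and that each rate is computed over exactly 200 conversations.

```python
CONVERSATIONS_PER_CONDITION = 200  # from the article

# (in-context failure %, orchestration failure %) per domain.
# Assumption: percentages follow the order in which the domains are listed.
failure_rates = {
    "travel booking": (11.5, 24.0),
    "Zoom technical support": (0.5, 9.0),
    "insurance claims processing": (5.0, 17.0),
}


def summarize(in_context_pct: float, orchestration_pct: float,
              n: int = CONVERSATIONS_PER_CONDITION) -> dict:
    """Convert two failure percentages into counts, a gap, and a ratio."""
    return {
        "in_context_failures": round(in_context_pct / 100 * n),
        "orchestration_failures": round(orchestration_pct / 100 * n),
        "gap_points": orchestration_pct - in_context_pct,  # percentage points
        "ratio": orchestration_pct / in_context_pct,
    }


summary = {domain: summarize(*rates) for domain, rates in failure_rates.items()}

for domain, s in summary.items():
    print(f"{domain}: {s['in_context_failures']} vs "
          f"{s['orchestration_failures']} failed conversations, "
          f"gap {s['gap_points']:.1f} points, ratio {s['ratio']:.1f}x")
```

Under those assumptions, the gap is 12.5, 8.5, and 12 percentage points respectively, which corresponds to roughly 23 versus 48, 1 versus 18, and 10 versus 34 failed conversations out of 200.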
What this means for building agents
The study’s conclusion does not say that orchestration frameworks are universally redundant — they still have a role in tasks requiring parallel flows, external memory, or multiple independent agents. But for structured procedural tasks with clear steps, this paper suggests that architectural simplicity — one well-written system prompt — is more reliable than a graph of nodes. The implication for the 2026 agent stack is that the first step in agent design should be attempting to solve the problem through an in-context prompt before reaching for a framework.
Frequently Asked Questions
- Which frameworks were compared?
- LangGraph, CrewAI, Google ADK, and OpenAI Agents SDK were compared against an in-context baseline that embeds the workflow directly in the system prompt.
- What is the range of results?
- The in-context approach achieves 4.53–5.00 on a 1–5 scale, while LangGraph, the closest orchestration framework, scores 4.17–4.84. Failure rate differences are even larger: 11.5%, 0.5%, and 5% for in-context versus 24%, 9%, and 17% for orchestration, per domain.
This article was generated using artificial intelligence from primary sources.