Agents · Saturday, April 25, 2026 · 4 min read

arXiv:2604.21816: 'Tool Attention Is All You Need' Eliminates MCP Tax — 95% Token Reduction per Turn in Agentic Workflows


Editorial illustration: Tool Attention MCP Tax — agentic workflow optimization

Why it matters

Researchers Anuj Sadani and Deepak Kumar published a paper on arXiv on April 23, 2026, addressing the so-called MCP Tax: eager schema injection that consumes 10,000 to 60,000 tokens per turn. In a simulated benchmark, their Tool Attention approach reduces per-turn token consumption by 95% and raises context utilization from 24% to 91%.

Researchers Anuj Sadani and Deepak Kumar published a paper on arXiv on April 23, 2026, titled “Tool Attention Is All You Need” (arXiv:2604.21816). In it they identify and address a structural problem in the Model Context Protocol (MCP), the so-called MCP Tax: a hidden cost of 10,000 to 60,000 tokens per turn that typical multi-server MCP deployments spend on inserting tool schemas into every model call.

The paper arrives at a moment of rapid MCP adoption in enterprise environments, where a single agentic system often has dozens of tools available across multiple servers at once. Until now, such configurations have carried a hidden cost that degrades both speed and reasoning quality.

What Exactly Is MCP Tax?

The authors identify the problem as eager schema injection — the standard MCP pattern in which the full JSON schema description of every registered tool is inserted into context on every model call, even when the model will not use 95% of them. The token overhead ranges from 10,000 to 60,000 tokens per turn, depending on the number of servers and schema complexity.
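To make the cost of eager injection concrete, here is a minimal sketch that serializes every tool definition on every call and estimates the resulting token count. The tool definitions are hypothetical, and the four-characters-per-token ratio is a rough assumption rather than a real tokenizer, so the absolute numbers are illustrative only. The point is how the overhead scales with the number of tools and repeats on each turn.

```python
import json


def make_tool_schema(index: int) -> dict:
    """Build a hypothetical MCP-style tool definition (illustrative only)."""
    return {
        "name": f"tool_{index}",
        "description": f"Hypothetical tool number {index} used for illustration.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text input."},
                "limit": {"type": "integer", "description": "Maximum number of results."},
            },
            "required": ["query"],
        },
    }


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; a real tokenizer would give different numbers."""
    return int(len(text) / chars_per_token)


def eager_overhead_per_turn(tools: list) -> int:
    """Tokens spent when every full schema is serialized into the prompt."""
    return sum(estimate_tokens(json.dumps(tool)) for tool in tools)


tools = [make_tool_schema(i) for i in range(120)]
per_turn = eager_overhead_per_turn(tools)
total_over_20_turns = per_turn * 20  # the same schemas are re-sent on every call
print(f"per turn: ~{per_turn} tokens, over 20 turns: ~{total_over_20_turns} tokens")
```

Real schemas are usually far longer than these toy definitions, which is one reason the per-turn figures quoted in the article are much higher than this sketch produces.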

The consequences are twofold. First, the KV cache becomes bloated, making inference slower and more expensive. Second, once the context window is more than about 70% full, reasoning quality drops significantly, an effect documented in the literature on the “context rot” phenomenon.

How Does Tool Attention Solve the Problem?

The proposed approach is a middleware layer that sits between the agent and MCP servers and combines three complementary components:

  1. Intent Schema Overlap (ISO) Score — uses sentence embedding models to measure semantic similarity between the user query and each tool’s description, then ranks tools by relevance.
  2. State-Aware Gating Function — checks preconditions and access scope before inserting a tool into context, so tools that require authentication or specific state simply do not appear until those conditions are met.
  3. Two-Phase Lazy Schema Loader — keeps only a compact summary pool of all available tools in context, promoting full JSON schema descriptions only for the top-k tools with the highest ISO score.

This approach mirrors the behavior of an experienced developer who keeps only a mental list of “what I can do” and reads the API details only when they know they will call the tool.
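The three components described above can be sketched roughly as follows. This is not the authors' implementation: the `Tool` class, the `requires` precondition flags, and the function names are hypothetical, and a bag-of-words cosine similarity stands in for the sentence embedding model that the ISO score actually uses.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    summary: str       # compact one-line description kept in context
    full_schema: dict  # full JSON schema, included only when promoted
    requires: set = field(default_factory=set)  # hypothetical precondition flags


def embed(text: str) -> Counter:
    """Stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance(query: str, tool: Tool) -> float:
    """Query-to-description similarity, the role the ISO score plays."""
    return cosine(embed(query), embed(tool.summary))


def gate(tool: Tool, state: set) -> bool:
    """State-aware gate: eligible only if every precondition is in the state."""
    return tool.requires <= state


def build_context(query: str, tools: list, state: set, k: int = 2) -> dict:
    """Phase 1: summaries of eligible tools. Phase 2: full schemas for the top-k."""
    eligible = [t for t in tools if gate(t, state)]
    ranked = sorted(eligible, key=lambda t: relevance(query, t), reverse=True)
    return {
        "summaries": {t.name: t.summary for t in eligible},
        "schemas": {t.name: t.full_schema for t in ranked[:k]},
    }


tools = [
    Tool("search_issues", "search issues in the bug tracker", {"type": "object"}),
    Tool("send_email", "send an email message to a recipient", {"type": "object"}),
    Tool("delete_repo", "delete a source code repository", {"type": "object"}, {"admin"}),
]
context = build_context("search the bug tracker for open issues", tools, state=set(), k=1)
print(list(context["schemas"]))  # ['search_issues']
```

In this sketch `delete_repo` never appears in the context until `"admin"` is part of the state, and only the single best-matching tool contributes its full schema. Swapping `embed` for a real sentence embedding model would change the ranking quality but not the overall structure.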

How Much Is Saved in Practice?

The authors evaluated the approach in a simulated environment with 120 tools distributed across six MCP servers, calibrated to reflect real-world production deployments. Per-turn token consumption dropped from 47,300 to 2,400, a reduction of about 95%. Context utilization rose from 24% to 91%, which is expected to let an agent handle more complex conversation histories without losing reasoning quality.
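As a quick check on the headline figure, the reduction can be recomputed from the two per-turn token counts quoted above. The variable names are only for illustration.

```python
baseline_tokens = 47_300  # per-turn tokens reported before Tool Attention
reduced_tokens = 2_400    # per-turn tokens reported with Tool Attention

reduction = 1 - reduced_tokens / baseline_tokens
saved_per_turn = baseline_tokens - reduced_tokens

print(f"reduction: {reduction:.1%}, saved per turn: {saved_per_turn} tokens")
# reduction: 94.9%, saved per turn: 44900 tokens
```

The computed value of 94.9% is consistent with the rounded 95% reported in the article.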

The authors explicitly note that the projected metrics are derived from measured token counts combined with published telemetry, not from live LLM agent testing. This is an important limitation to keep in mind — actual reduction in production depends on the quality of the embedding model for ISO scoring and the realism of the calibration.

What Does This Mean for Multi-Agent Systems?

The paper’s key conclusion is that “protocol-level efficiency, not raw context length, is the binding constraint” on scalable agentic systems. In other words, models with a million-token context will not solve the problem if 60,000 tokens are wasted per turn.

For teams building multi-agent systems on Claude, GPT, or open-source models, this paper suggests concrete architectural changes: introduce a middleware layer that does lazy schema loading, implement KV cache sharing between successive calls of the same agent, and measure actual context utilization as the primary metric rather than focusing on context window capacity. Code is available on GitHub in the repository referenced in the paper.


This article was generated using artificial intelligence from primary sources.