LangChain: Fine-Tuned Qwen-3.5-35B as a Trace Judge 10–100× Cheaper Than Frontier Models
LangChain has demonstrated how a fine-tuned Qwen-3.5-35B can serve as a trace judge: a model that scores production agent traces and detects errors that users notice. With LoRA fine-tuning on Fireworks infrastructure and around 1,400 examples, the model achieves 96.1% accuracy on the chat-langchain set versus 91.6% for Claude Opus, at 10 to 100 times lower cost. Fine-tuned on one domain, it also outperformed frontier models on a second domain, which points to cross-domain transfer.
This article was generated using artificial intelligence from primary sources.
LangChain has shown how a fine-tuned Qwen-3.5-35B can replace expensive frontier models in the role of a “trace judge” at dramatically lower cost.
What is a trace judge and what problem does it solve?
A trace judge is an AI model that evaluates production agent traces to detect errors that users notice, as signaled by corrections, rejections, and repeated requests. Instead of human teams manually reviewing thousands of interactions, the trace judge automatically flags problematic sessions. The challenge is that running frontier models for this job becomes expensive as trace volume grows.
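To make the idea concrete, here is a minimal sketch of how a judge could be wrapped around a batch of traces. It is not LangChain's implementation: the `Turn` and `Trace` types, the prompt wording, the `error`/`ok` label set, and the `judge` callable are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    role: str       # "user" or "assistant"
    content: str


@dataclass
class Trace:
    trace_id: str
    turns: List[Turn]


# Hypothetical label set; the article only says the judge detects
# errors that the user notices.
LABELS = ("error", "ok")


def build_judge_prompt(trace: Trace) -> str:
    """Render a trace as plain text and ask for a one-word verdict."""
    transcript = "\n".join(f"{t.role}: {t.content}" for t in trace.turns)
    return (
        "Did the user notice an error in this session, for example by "
        "correcting, rejecting, or repeating a request?\n"
        "Answer with exactly one word: error or ok.\n\n" + transcript
    )


def parse_verdict(raw: str) -> str:
    """Map raw judge output to a label, or 'unparsed' if unrecognized."""
    stripped = raw.strip().lower()
    word = stripped.split()[0].strip(".,!:;") if stripped else ""
    return word if word in LABELS else "unparsed"


def flag_traces(traces: List[Trace], judge: Callable[[str], str]) -> List[str]:
    """Return the ids of traces that the judge labels as 'error'."""
    flagged = []
    for trace in traces:
        verdict = parse_verdict(judge(build_judge_prompt(trace)))
        if verdict == "error":
            flagged.append(trace.trace_id)
    return flagged
```

Here `judge` stands in for whatever model endpoint is used; passing it as a plain callable keeps the sketch independent of any particular provider, and unparsed outputs are kept distinct from `ok` so they can be routed to a human reviewer.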
How did LangChain train the model?
LangChain took Qwen-3.5-35B as the base and performed LoRA fine-tuning through Fireworks’ managed SFT infrastructure. LoRA (Low-Rank Adaptation) is a fine-tuning method that trains a small set of added low-rank parameters while the base model's weights stay frozen, which makes training cheaper. SFT (Supervised Fine-Tuning) is supervised learning on labeled examples. The training set was small: approximately 707 examples from the chat-langchain domain and 727 from the Fleet platform.
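The parameter saving behind LoRA can be illustrated with a small, dependency-free sketch. For one weight matrix W of shape d_out × d_in, LoRA trains two small matrices A (r × d_in) and B (d_out × r) and computes y = W x + (alpha / r) · B (A x). The layer sizes, rank, and `alpha` below are illustrative assumptions, not the values used for Qwen-3.5-35B.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def lora_param_counts(d_in: int, d_out: int, rank: int) -> Tuple[int, int]:
    """Return (full fine-tuning params, LoRA params) for one weight matrix."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute y = W x + (alpha / r) * B (A x), with W frozen."""
    r = len(A)                             # LoRA rank
    base = matvec(W, x)                    # frozen base projection
    update = matvec(B, matvec(A, x))       # trainable low-rank update
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical layer size: a 4096 x 4096 projection with rank 16.
full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora, full // lora)
```

With these assumed sizes the adapter has 131,072 trainable parameters against 16,777,216 in the full matrix, a 128-fold reduction for that layer. Initializing B to zeros, as is standard for LoRA, means the adapted model starts out identical to the base model.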
How good and how cheap is the model?
The fine-tuned model achieved 96.1% accuracy on the chat-langchain set, versus 91.6% for Claude Opus and 98.9% for GPT-4.5. The key result is cross-domain transfer: the model tuned on chat-langchain data outperformed all frontier models on Fleet data (90.8% versus 90.2% for Claude Opus). It is 10 to 100 times cheaper than the frontier models, with savings growing as trace volume increases.
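The two quantities in this comparison, accuracy against human labels and cost at a given volume, reduce to simple arithmetic. The sketch below assumes a linear per-trace cost model; the per-trace price and volumes are hypothetical placeholders, and only the 10× to 100× ratio comes from the article.

```python
from typing import List


def accuracy(predictions: List[str], labels: List[str]) -> float:
    """Fraction of judge verdicts that match the reference labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def savings(n_traces: int, frontier_cost_per_trace: float, cost_ratio: float) -> float:
    """Absolute saving when the small judge is cost_ratio times cheaper.

    Assumes cost scales linearly with the number of traces.
    """
    frontier_total = n_traces * frontier_cost_per_trace
    small_total = frontier_total / cost_ratio
    return frontier_total - small_total


# Hypothetical price of 0.01 per trace for a frontier judge.
for volume in (10_000, 1_000_000):
    for ratio in (10, 100):
        print(volume, ratio, round(savings(volume, 0.01, ratio), 2))
```

Under this linear model the fraction saved is fixed at 1 − 1/ratio (90% at 10×, 99% at 100×), while the absolute saving grows in proportion to trace volume.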
When does it become available?
LangChain has announced a rollout to selected users in the coming weeks, with broader availability expected in one to two months. The approach illustrates a pattern in which small, specialized, fine-tuned models take over narrow, repetitive evaluation tasks from general-purpose frontier models.
Frequently Asked Questions
- What is a trace judge?
- An AI model that scores production agent traces and detects user-noticed errors such as corrections, rejections, and repeated requests.
- How accurate is the fine-tuned model?
- It achieves 96.1% on the chat-langchain set versus 91.6% for Claude Opus; on another domain it outperformed all frontier models.
- How much cheaper is it?
- 10 to 100 times cheaper than frontier models, with savings growing as trace volume increases.