arXiv:2605.15040 Orchard: open-source agentic framework achieves 67.5% on SWE-bench Verified with three specialized recipes
Orchard is a new open-source agentic modeling framework published on arXiv on May 14, 2026 (Baolin Peng, Wenlin Yao, and 12 co-authors). The framework combines a lightweight environment layer with three specialized training recipes: SWE (software engineering), GUI (vision-language), and Claw (personal assistants). The Orchard-SWE variant achieves 67.5% on SWE-bench Verified after RL training, which places it at the state of the art among open-source coding agents at the time of publication.
This article was generated using artificial intelligence from primary sources.
Baolin Peng, Wenlin Yao, and 12 co-authors published Orchard on arXiv on May 14, 2026 — an open-source framework for scalable agentic modeling. The paper targets a gap in open-source infrastructure: while closed-source agents dominate benchmarks, the open community needs a quality stack that enables training, not just orchestration.
What does the Orchard architecture offer?
The framework consists of three components:
- Orchard Env: a lightweight environment layer that manages the sandbox lifecycle across different task types. It relies on “reusable primitives” instead of heavy orchestration.
- Three specialized recipes: SWE (software engineering tasks), GUI (vision-language interfaces), and Claw (personal assistant scenarios). Each recipe is optimized for its task type.
- Training innovations: Credit-assignment SFT (learning from incomplete trajectories) and Balanced Adaptive Rollout (a new RL algorithm for agent training).
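To make the idea of a sandbox lifecycle primitive concrete, the following is a minimal Python sketch, not Orchard's actual API. The names `Sandbox`, `sandbox`, `run_episode`, and `toy_policy` are hypothetical, and a temporary directory stands in for whatever isolation mechanism the framework really uses. It only illustrates the pattern of one small create/use/teardown primitive being reused across task types.

```python
import shutil
import tempfile
from contextlib import contextmanager
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Iterator


@dataclass
class Sandbox:
    """Hypothetical handle to an isolated working directory for one task."""
    task_type: str  # "swe", "gui", or "claw" (labels taken from the article)
    root: Path
    log: list[str] = field(default_factory=list)

    def write_file(self, name: str, text: str) -> Path:
        # Record every action so a trajectory could later be replayed or scored.
        path = self.root / name
        path.write_text(text)
        self.log.append(f"write {name}")
        return path


@contextmanager
def sandbox(task_type: str) -> Iterator[Sandbox]:
    """Create a sandbox, hand it to the caller, and always tear it down."""
    root = Path(tempfile.mkdtemp(prefix=f"sandbox-{task_type}-"))
    try:
        yield Sandbox(task_type=task_type, root=root)
    finally:
        shutil.rmtree(root, ignore_errors=True)


def run_episode(task_type: str, policy: Callable[[Sandbox], float]) -> float:
    """Run one rollout inside a fresh sandbox and return its reward."""
    with sandbox(task_type) as box:
        return policy(box)


def toy_policy(box: Sandbox) -> float:
    # Stand-in for an agent: writes one file and is rewarded if it exists.
    path = box.write_file("patch.diff", "--- a\n+++ b\n")
    return 1.0 if path.exists() else 0.0


# The same primitive serves all three task types named in the article.
rewards = {t: run_episode(t, toy_policy) for t in ("swe", "gui", "claw")}
```

The design point is that lifecycle management lives in one small context manager, so a training loop can start and discard many sandboxes without a separate orchestration layer.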
The approach is architecturally distinct from the LangChain/CrewAI tradition: instead of focusing on workflow management (how an agent calls tools and manages state), Orchard treats scalable agent training as its primary function.
What does the SWE-bench Verified 67.5% result actually mean?
The Orchard-SWE variant achieves 67.5% on SWE-bench Verified after RL training. The figure is significant because SWE-bench Verified is a human-validated subset of SWE-bench that filters out tasks with underspecified descriptions or unreliable tests, making it a rigorous benchmark for real-world coding tasks. Agent systems that exceed 60% on SWE-bench Verified have typically relied on a closed-source frontier model as the backend; Orchard-SWE reaches this level with an open-source training stack and an open-weight model.
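As a rough sanity check on what the percentage means in task counts, the short calculation below assumes the score was computed over the full 500-task SWE-bench Verified set. The article does not say how the figure was aggregated, so the variable names and the interpretation are illustrative only.

```python
import math

TOTAL_TASKS = 500        # size of SWE-bench Verified (assumes the full set was evaluated)
REPORTED_PERMILLE = 675  # 67.5% as reported in the article, kept as an integer to avoid float drift

# Number of resolved tasks implied by the reported rate.
implied_resolved = REPORTED_PERMILLE * TOTAL_TASKS / 1000


def resolved_rate(resolved: int, total: int = TOTAL_TASKS) -> float:
    """Percentage of tasks whose generated patch passes the task's tests."""
    return 100.0 * resolved / total


lower = math.floor(implied_resolved)
upper = math.ceil(implied_resolved)

print(f"implied resolved tasks: {implied_resolved}")
print(f"{lower} resolved -> {resolved_rate(lower):.1f}%")
print(f"{upper} resolved -> {resolved_rate(upper):.1f}%")
```

Under that assumption the implied count falls between two whole numbers of tasks, which would be consistent with a score averaged over several runs or rounded. The paper itself is the authority on the exact protocol.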
What does each of the three recipes cover?
The SWE recipe specializes agents for software engineering: reading codebases, writing PRs, using shell tools, debugging. The GUI recipe trains vision-language agents that operate in browser/desktop interfaces — clicking, scrolling, reading screenshots, navigating applications. The Claw recipe targets personal assistant tasks: file management, scheduling, multi-step user intents.
The multi-domain approach positions Orchard as an alternative to vendor-specific stacks (Anthropic Computer Use, OpenAI Codex CLI) — one framework, three domains, open-source.
Position in the open-source agent ecosystem
The announcement fits into a week of dramatic agentic releases: LangChain Labs (May 14, applied research program), GitHub Copilot App Technical Preview (May 14), IBM Forward Deployed Units (May 14). Orchard is the academic research counterweight — providing the community with an open-source foundation that is not vendor-controlled. The training recipes and Orchard-SWE weights will likely be made public — which could open the path for the open-source community to close in on closed-source agentic benchmarks within the next few months.
Frequently Asked Questions
- What distinguishes Orchard from LangChain or CrewAI?
  - Classic orchestration frameworks (LangChain, CrewAI) focus on workflow management, meaning how an agent calls tools and manages state. Orchard places its emphasis on scalable agent training with actual model optimization rather than just workflow orchestration.
- What is the Orchard framework architecture?
  - Three components: Orchard Env (sandbox lifecycle management across different task types), three specialized recipes (SWE, GUI, Claw), and two training innovations, Credit-assignment SFT for learning from incomplete trajectories and Balanced Adaptive Rollout for RL.