Orchard on arXiv: open-source agent framework reaches 67.5% on SWE-bench Verified

Orchard is a new open-source agentic modeling framework published on arXiv on May 14, 2026 (Baolin Peng, Wenlin Yao, and 12 co-authors). The framework combines a lightweight environment layer with three specialized training recipes: SWE (software engineering), GUI (vision-language), and Claw (personal assistants). The Orchard-SWE variant achieves 67.5% on SWE-bench Verified after RL training, a result presented as state of the art among open-source coding agents.

The paper presents Orchard as an open-source framework for scalable agentic modeling. It targets a gap in open-source infrastructure: closed-source agents dominate benchmarks, while the open community still lacks a high-quality stack that supports training, not just orchestration.

What does the Orchard architecture offer?

The framework consists of three components:

Orchard Env: a lightweight environment layer that manages the sandbox lifecycle across different task types. It relies on “reusable primitives” instead of heavy orchestration.
Three specialized recipes: SWE (software engineering tasks), GUI (vision-language interfaces), and Claw (personal assistant scenarios). Each recipe is optimized for its task type.
Training innovations: Credit-assignment SFT (learning from incomplete trajectories) and Balanced Adaptive Rollout (a new RL algorithm for agent training).
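To make the idea of “reusable primitives” for sandbox lifecycle management more concrete, here is a minimal Python sketch. It is not Orchard's API: the names sandbox, run_step, and run_episode are hypothetical, and a temporary directory stands in for a real isolated sandbox. It only illustrates how a small create, run, and clean-up primitive could be reused across task types without a heavy orchestration layer.

```python
import contextlib
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


@contextlib.contextmanager
def sandbox(files):
    """Hypothetical primitive: create a scratch directory, yield it, always clean up.

    A temporary directory is only a stand-in for real isolation (containers, VMs).
    """
    root = Path(tempfile.mkdtemp(prefix="sandbox_"))
    try:
        for name, content in files.items():
            (root / name).write_text(content)
        yield root
    finally:
        # Runs on success, failure, or timeout, so no episode leaks state.
        shutil.rmtree(root, ignore_errors=True)


def run_step(root, argv, timeout=10.0):
    """Hypothetical primitive: run one command in the sandbox, return (exit code, stdout)."""
    proc = subprocess.run(
        argv, cwd=root, capture_output=True, text=True, timeout=timeout
    )
    return proc.returncode, proc.stdout


def run_episode(files, commands):
    """Compose the two primitives: one sandbox per episode, one step per command."""
    with sandbox(files) as root:
        return [run_step(root, argv) for argv in commands]


if __name__ == "__main__":
    results = run_episode(
        {"hello.py": "print('hi')"},
        [[sys.executable, "hello.py"]],
    )
    print(results)
```

Because cleanup lives in the context manager's finally block, an episode leaves no files behind even when a command fails or times out. A different task type would swap the commands, not the lifecycle code.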

The approach differs architecturally from the LangChain/CrewAI tradition: instead of focusing on workflow management (how an agent calls tools and manages state), Orchard treats scalable agent training as its primary function.

What does the SWE-bench Verified 67.5% result actually mean?

The Orchard-SWE variant achieves 67.5% on SWE-bench Verified after RL training. The figure matters because SWE-bench Verified is a human-validated subset of SWE-bench that filters out problematic task instances, such as underspecified issues and unreliable tests, which makes it a more rigorous benchmark for real-world coding tasks. Open-source systems have rarely exceeded 60% on SWE-bench Verified without a closed-source frontier model on the backend; Orchard-SWE reaches this level with an open-source training stack and an open-weight model.
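For readers unfamiliar with how such a score is formed: SWE-bench style benchmarks report the percentage of task instances an agent resolves. The sketch below shows that arithmetic in Python. The function name resolved_rate and the toy outcomes are illustrative assumptions, not data from the Orchard paper.

```python
def resolved_rate(outcomes):
    """Return the percentage of task instances marked resolved (True)."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


# Toy data, not Orchard results: 27 resolved out of 40 instances.
toy_outcomes = [True] * 27 + [False] * 13
print(resolved_rate(toy_outcomes))  # 67.5
```

Each instance counts once regardless of its difficulty, so the headline number says how many issues were fixed, not how hard they were.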

How do the three recipes work in parallel?

The SWE recipe specializes agents for software engineering: reading codebases, writing PRs, using shell tools, debugging. The GUI recipe trains vision-language agents that operate in browser/desktop interfaces — clicking, scrolling, reading screenshots, navigating applications. The Claw recipe targets personal assistant tasks: file management, scheduling, multi-step user intents.

The multi-domain approach positions Orchard as an alternative to vendor-specific stacks (Anthropic Computer Use, OpenAI Codex CLI) — one framework, three domains, open-source.

Position in the open-source agent ecosystem

The announcement arrives in a week crowded with agentic releases: LangChain Labs (May 14, applied research program), GitHub Copilot App Technical Preview (May 14), and IBM Forward Deployed Units (May 14). Orchard is the academic research counterweight, giving the community an open-source foundation that is not vendor-controlled. The training recipes and Orchard-SWE weights will likely be made public, which could help the open-source community narrow the gap to closed-source agents on agentic benchmarks within the next few months.

Frequently Asked Questions

What distinguishes Orchard from LangChain or CrewAI?

Classic orchestration frameworks (LangChain, CrewAI) focus on workflow management: how an agent calls tools and manages state. Orchard instead emphasizes scalable agent training, meaning the model itself is optimized rather than only the workflow around it.

What is the Orchard framework architecture?

Three components: Orchard Env (sandbox lifecycle management across different task types), three specialized recipes (SWE, GUI, Claw), and training innovations — Credit-assignment SFT for learning from incomplete trajectories and Balanced Adaptive Rollout for RL.

arXiv:2605.15040 Orchard: open-source agentic framework achieves 67.5% on SWE-bench Verified with three specialized recipes
