arXiv:2606.24855: OpenThoughts-Agent — open data recipes for training agentic models
OpenThoughts-Agent is an open pipeline for curating training data for agentic language models. Through more than 100 ablation experiments, the team assembled 100,000 examples and fine-tuned Qwen3-32B, reaching a 44.8% average across seven agentic benchmarks, 3.9 percentage points above the previous best open-source model.
This article was generated using artificial intelligence from primary sources.
What are data recipes for agentic models?
Agentic models — language models that autonomously plan and execute multi-step tasks — require a different type of data than classic chat or instruction-tuning sets. Researchers from UC Berkeley, NYU, and partner institutions have released OpenThoughts-Agent, an open pipeline that systematizes exactly this data curation process.
More than 100 experiments, one clearer recipe
The team ran more than 100 controlled ablation experiments — systematic comparisons where one parameter is changed while others remain constant — to identify which decisions in example selection and filtering most impact a model’s agentic capabilities. The result is a set of 100,000 curated examples used to fine-tune Qwen3-32B.
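The one-parameter-at-a-time design described above can be sketched in a few lines of Python. This is only an illustration of the general ablation idea, not the team's actual tooling: the function name `one_factor_ablations` and the settings `task_source`, `filter`, and `max_turns` are hypothetical and do not come from the paper.

```python
from typing import Any


def one_factor_ablations(
    baseline: dict[str, Any],
    alternatives: dict[str, list[Any]],
) -> list[dict[str, Any]]:
    """Return configs that each differ from the baseline in exactly one setting."""
    configs = []
    for name, values in alternatives.items():
        for value in values:
            if value == baseline[name]:
                continue  # identical to the baseline, so not a new experiment
            variant = dict(baseline)  # copy so the baseline stays unchanged
            variant[name] = value     # change a single setting
            configs.append(variant)
    return configs


# Hypothetical settings for illustration only (not taken from the paper).
baseline = {"task_source": "A", "filter": "none", "max_turns": 32}
alternatives = {
    "task_source": ["A", "B"],
    "filter": ["none", "strict"],
    "max_turns": [16, 32, 64],
}

variants = one_factor_ablations(baseline, alternatives)
for variant in variants:
    print(variant)
```

Each variant would then be trained and evaluated under the same conditions as the baseline, so that any change in score can be attributed to the single setting that differs.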
Results: 3.9 percentage points above the open-source competition
The fine-tuned model achieves 44.8% average accuracy across seven agentic benchmarks. That is 3.9 percentage points above the previous open-source leader, Nemotron-Terminal-32B (40.9%), a measurable improvement in a domain where differences are rarely dramatic.
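To make the size of that gap concrete, the short Python snippet below separates the absolute gap in percentage points from the relative improvement over the baseline. It uses only the two averages quoted in this article. The variable names are illustrative, and the relative figure is a derived value, not one reported by the authors.

```python
# Average accuracy over seven agentic benchmarks, in percent, as quoted above.
openthoughts_agent = 44.8
nemotron_terminal = 40.9

# Absolute gap in percentage points: a plain difference of the two averages.
gap_points = round(openthoughts_agent - nemotron_terminal, 1)

# Relative improvement: the same gap expressed as a share of the baseline score.
relative_gain_percent = round(100 * gap_points / nemotron_terminal, 1)

print(f"gap: {gap_points} points, relative gain: {relative_gain_percent}%")
```

The gap of 3.9 percentage points corresponds to a relative improvement of roughly 9.5% over the previous leader's average.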
Everything open
The pipeline, datasets, and models are publicly available at openthoughts.ai, allowing researchers without access to proprietary data processes to reproduce and build on this work. The paper was submitted on June 23, 2026.
Frequently Asked Questions
- What is OpenThoughts-Agent and what is it for?
- OpenThoughts-Agent is an open set of tools and data for training LLMs that autonomously execute multi-step tasks. The pipeline includes methods for selecting and filtering examples aimed specifically at agentic capabilities.
- How much better is it than previous open-source models?
- The fine-tuned Qwen3-32B achieves 44.8% average accuracy across seven agentic benchmarks, 3.9 percentage points above the previous best open model, Nemotron-Terminal-32B, at 40.9%.