arXiv:2606.24855: OpenThoughts-Agent — open data recipes for training agentic models

OpenThoughts-Agent is an open pipeline for curating training data for agentic language models. Drawing on more than 100 ablation experiments, the team assembled 100,000 curated examples and fine-tuned Qwen3-32B on them, reaching 44.8% average accuracy across seven agentic benchmarks, 3.9 percentage points above the previous open-source leader.

This article was generated using artificial intelligence from primary sources.

What are data recipes for agentic models?

Agentic models — language models that autonomously plan and execute multi-step tasks — require a different type of data than classic chat or instruction-tuning sets. Researchers from UC Berkeley, NYU, and partner institutions have released OpenThoughts-Agent, an open pipeline that systematizes exactly this data curation process.

More than 100 experiments, one clearer recipe

The team ran more than 100 controlled ablation experiments — systematic comparisons where one parameter is changed while others remain constant — to identify which decisions in example selection and filtering most impact a model’s agentic capabilities. The result is a set of 100,000 curated examples used to fine-tune Qwen3-32B.
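To make the idea of a controlled ablation concrete, here is a minimal sketch in Python. It generates experiment configurations that each differ from a baseline in exactly one setting. The knob names and values (`task_source`, `filter`, `examples_per_task`) are hypothetical placeholders chosen for illustration; they are not the settings reported for OpenThoughts-Agent.

```python
from typing import Any

# Hypothetical baseline recipe. These knob names and values are illustrative
# assumptions, not settings taken from the OpenThoughts-Agent paper.
BASELINE: dict[str, Any] = {
    "task_source": "source_a",
    "filter": "none",
    "examples_per_task": 1,
}

# Hypothetical alternative values to try for each knob.
ALTERNATIVES: dict[str, list[Any]] = {
    "task_source": ["source_b", "source_c"],
    "filter": ["length", "success_only"],
    "examples_per_task": [4, 16],
}


def one_factor_ablations(
    baseline: dict[str, Any], alternatives: dict[str, list[Any]]
) -> list[dict[str, Any]]:
    """Return configs that each differ from the baseline in exactly one knob."""
    configs = []
    for knob, values in alternatives.items():
        for value in values:
            if value == baseline[knob]:
                continue  # identical to the baseline, so not an ablation
            variant = dict(baseline)  # copy, so the baseline stays untouched
            variant[knob] = value
            configs.append(variant)
    return configs


def changed_knobs(baseline: dict[str, Any], variant: dict[str, Any]) -> list[str]:
    """List the knobs whose value differs between baseline and variant."""
    return [k for k in baseline if baseline[k] != variant[k]]


ABLATIONS = one_factor_ablations(BASELINE, ALTERNATIVES)

if __name__ == "__main__":
    for cfg in ABLATIONS:
        print(changed_knobs(BASELINE, cfg), cfg)
```

Because every variant changes a single knob, any difference in benchmark score between a variant and the baseline can be attributed to that knob, which is the point of holding everything else constant.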

Results: 3.9 percentage points above the previous open-source leader

The fine-tuned model achieves 44.8% average accuracy across seven agentic benchmarks, 3.9 percentage points above the previous open-source leader, Nemotron-Terminal-32B (40.9%). That is a measurable improvement in a domain where differences are rarely dramatic.
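A gap in percentage points is not the same as a relative gain in percent, so a short sketch of the arithmetic may help. The two scores are the averages quoted above; the function names are illustrative, and the relative gain is derived here from those two numbers rather than reported by the source.

```python
# Average accuracy over seven agentic benchmarks, in percent, as quoted above.
OPENTHOUGHTS_AGENT = 44.8
NEMOTRON_TERMINAL = 40.9


def point_gap(new: float, old: float) -> float:
    """Absolute difference between two scores, in percentage points."""
    return round(new - old, 1)


def relative_gain(new: float, old: float) -> float:
    """Improvement of `new` over `old`, as a percentage of `old`."""
    return round(100 * (new - old) / old, 1)


if __name__ == "__main__":
    print(point_gap(OPENTHOUGHTS_AGENT, NEMOTRON_TERMINAL))      # 3.9
    print(relative_gain(OPENTHOUGHTS_AGENT, NEMOTRON_TERMINAL))  # 9.5
```

Under these numbers, the 3.9-point gap corresponds to a relative improvement of roughly 9.5% over the previous leader's score.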

Everything open

The pipeline, datasets, and models are publicly available at openthoughts.ai, allowing researchers without access to proprietary data processes to reproduce and build on this work. The paper was submitted on June 23, 2026.

Frequently Asked Questions

What is OpenThoughts-Agent and what is it for?
OpenThoughts-Agent is an open set of tools and data for training LLMs that autonomously execute multi-step tasks. The pipeline includes methods for selecting and filtering training examples that target agentic capabilities.

How much better is it than previous open-source models?
The fine-tuned Qwen3-32B achieves 44.8% average accuracy across seven agentic benchmarks, which is 3.9 percentage points above the previous best open model, Nemotron-Terminal-32B, at 40.9%.