ArXiv SUPERNOVA: reinforcement learning on natural instructions improves reasoning by up to 52.8%
Why it matters
A new paper, SUPERNOVA, shows that systematic curation of existing instruction-tuning datasets can significantly improve reasoning in LLMs. Models trained on SUPERNOVA achieve up to a 52.8% relative improvement on the BBEH benchmark.
Leveraging existing data for better reasoning
Researchers have published SUPERNOVA, a framework showing that existing instruction-tuning datasets contain “rich reasoning patterns” that can be systematically adapted for reinforcement learning. The result: a relative improvement of up to 52.8% on the BBEH benchmark over strong baselines such as Qwen3.5.
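A relative improvement is measured against the baseline score, so it can look much larger than the gain in percentage points. The sketch below shows the arithmetic; the two accuracy values are hypothetical numbers chosen for illustration, not results reported in the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical benchmark accuracies in percent (NOT taken from the paper).
baseline_acc = 12.5
improved_acc = 19.1

absolute_gain = improved_acc - baseline_acc  # gain in percentage points
relative_gain = relative_improvement(baseline_acc, improved_acc)

print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {relative_gain:.1f}%")
```

With these made-up inputs, a gain of 6.6 percentage points corresponds to a relative improvement of 52.8%, which is why the baseline score matters when reading such a headline figure.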
Why is this important?
Two approaches currently dominate efforts to improve reasoning in LLMs:
- Synthetic data generation — generate new examples and train on them (expensive)
- Human-curated data — experts write new examples (expensive and slow)
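SUPERNOVA demonstrates a third way: take the data you already have (instruction-tuning sets) and systematically prepare it for RL with verifiable rewards, meaning rewards computed by automatically checking a model's answer against a known reference. This avoids the cost of generating or hand-writing new examples.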
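As a rough illustration of what a verifiable reward can look like, here is a minimal exact-match sketch. It is not the paper's reward function: the `Answer:` output convention and the helper names `normalize`, `extract_final_answer`, and `verifiable_reward` are assumptions made for this example.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", answer.strip().lower())


def extract_final_answer(completion: str) -> str:
    """Return the text after the last 'answer:' marker (an assumed convention).

    If the marker is absent, the whole completion is treated as the answer.
    """
    text = completion.lower()
    marker = "answer:"
    idx = text.rfind(marker)
    return text[idx + len(marker):] if idx != -1 else text


def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the extracted answer matches the reference exactly, else 0.0."""
    predicted = normalize(extract_final_answer(completion))
    return 1.0 if predicted == normalize(reference) else 0.0


if __name__ == "__main__":
    sample = "The capital of France is well known. Answer:  Paris "
    print(verifiable_reward(sample, "paris"))
```

Because the reward is computed by a deterministic check rather than by a learned judge, any instruction-tuning example with a checkable reference answer can, in principle, be reused as an RL training item.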
Methodology
The authors conducted more than 100 controlled experiments analyzing three key factors:
- Source task selection — which tasks best transfer knowledge to the target domain
- Task mixing strategies — optimal combinations of training data
- Synthetic interventions — targeted modifications to improve data quality
The key finding: selecting source tasks by their performance on each individual target outperforms strategies that select by average performance. In other words, do not go for a “balanced” approach; choose the tasks that concretely help your goal.
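The difference between the two selection strategies can be sketched in a few lines. The transfer scores below are invented for illustration, and the task names and the functions `top_k_by_target` and `top_k_by_average` are hypothetical; the paper's actual selection procedure may differ.

```python
from statistics import mean

# Hypothetical transfer scores: scores[source_task][target] is the accuracy on
# `target` after training on `source_task`. These values are NOT from the paper.
scores = {
    "task_a": {"bbeh": 0.30, "zebralogic": 0.10, "mmlu_pro": 0.20},
    "task_b": {"bbeh": 0.18, "zebralogic": 0.28, "mmlu_pro": 0.26},
    "task_c": {"bbeh": 0.22, "zebralogic": 0.22, "mmlu_pro": 0.23},
}


def top_k_by_target(scores: dict, target: str, k: int) -> list:
    """Rank source tasks by their score on one specific target benchmark."""
    return sorted(scores, key=lambda task: scores[task][target], reverse=True)[:k]


def top_k_by_average(scores: dict, k: int) -> list:
    """Rank source tasks by their mean score across all target benchmarks."""
    return sorted(scores, key=lambda task: mean(scores[task].values()), reverse=True)[:k]


if __name__ == "__main__":
    print("best for BBEH:   ", top_k_by_target(scores, "bbeh", 1))
    print("best on average: ", top_k_by_average(scores, 1))
```

In this toy table, `task_b` has the best average but `task_a` is the strongest source for BBEH, so an average-based ranking would pass over the task that helps the chosen target most.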
Performance
Testing was conducted on several challenging benchmarks:
- BBEH — complex multi-step reasoning
- Zebralogic — logical inference
- MMLU-Pro — extended knowledge across domains
Code and data are publicly available on GitHub, which means other research groups can reproduce and build on the results.
Broader implications
The trend of reusing existing data rather than creating new data matters for the democratization of AI research. You don’t need the billion-dollar budget of OpenAI or Anthropic: reasoning can be improved substantially using datasets that already exist on HuggingFace and other platforms.
For small AI labs and open-source projects, the SUPERNOVA approach could be what brings them closer to the performance of frontier models.