ArXiv SPPO: Sequence-level PPO solves the credit assignment problem in long reasoning chains
Why it matters
Sequence-Level PPO (SPPO) reformulates LLM reasoning as a contextual bandit problem, matching the performance of expensive group-based methods like GRPO with far fewer resources and without multi-sampling.
The problem with token-level PPO
Standard PPO (Proximal Policy Optimization) is a central algorithm for aligning LLMs on reasoning tasks with verifiable rewards. However, token-level PPO struggles with two problems:
- Credit assignment instability: across long Chain-of-Thought (CoT) chains, assigning credit to individual tokens becomes unstable
- Prohibitive memory costs: the separate value model (critic) requires significant memory
Critic-free alternatives like GRPO mitigate these issues, but they must sample multiple responses per prompt to estimate the baseline, which sharply limits training throughput.
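To illustrate why group methods need several samples per prompt, here is a minimal sketch of a group-relative baseline in the style of GRPO. The function name, the example rewards, and the choice of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled response against the mean and spread of its own group.

    `rewards` holds one scalar reward per response sampled for the SAME prompt.
    The group mean acts as the baseline, so no learned value model is needed,
    but the policy has to generate every response in the group first.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps  # eps avoids division by zero when all rewards match
    return [(r - baseline) / scale for r in rewards]


# Hypothetical verifiable rewards for 4 sampled answers to one prompt (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct answers get a positive advantage, incorrect ones a negative one
```

The baseline only exists once the whole group has been generated, so the generation cost per prompt grows with the group size. That is the throughput limit described above.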
SPPO: the best of both worlds
The team (Wang, Li, Li, Chen, Huang et al.) introduces Sequence-Level PPO (SPPO), which reformulates the reasoning process as a Sequence-Level Contextual Bandit problem.
Key innovation: a separate scalar value function that provides low-variance signals without the need for multi-sampling.
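The idea can be sketched as follows, under stated assumptions. In a contextual bandit framing, the prompt is the context, the whole response is a single action, and a scalar value estimate for the prompt serves as the baseline. The function names, the example numbers, and the use of a PPO-style clipped ratio over the full-sequence log-probability are hypothetical choices for illustration; the summary above does not give the paper's exact objective.

```python
import math


def sequence_advantage(reward, value_estimate):
    """One scalar advantage for the whole response: reward minus the predicted value."""
    return reward - value_estimate


def clipped_sequence_objective(logp_new, logp_old, advantage, clip=0.2):
    """PPO-style clipped surrogate, applied once per response instead of once per token.

    `logp_new` and `logp_old` are summed log-probabilities of the full response
    under the current and the sampling policy (an assumption of this sketch).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(min(ratio, 1.0 + clip), 1.0 - clip)
    return min(ratio * advantage, clipped_ratio * advantage)


def value_loss(reward, value_estimate):
    """Squared error that pulls the scalar value estimate toward the observed reward."""
    return (value_estimate - reward) ** 2


# Hypothetical numbers: one sampled response, judged correct by a verifier.
reward = 1.0          # verifiable outcome reward
value_estimate = 0.6  # scalar value predicted for this prompt
advantage = sequence_advantage(reward, value_estimate)

# The ratio exp(0.5) exceeds 1 + clip, so the clipped branch (1.2 * advantage) is used.
objective = clipped_sequence_objective(logp_new=-12.0, logp_old=-12.5, advantage=advantage)
print(advantage, objective, value_loss(reward, value_estimate))
```

In this sketch a single sampled response per prompt is enough to form an advantage, because the baseline comes from the value estimate rather than from a group of sibling samples.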
Results
On mathematical benchmarks, SPPO:
- Significantly outperforms standard token-level PPO
- Matches the performance of computationally expensive group methods (GRPO)
- Is far more resource-efficient, with no multi-sampling overhead
For researchers training reasoning models, SPPO offers a practical alternative: GRPO performance at costs closer to standard PPO.