Latent-GRPO: Stable RL Optimization for Latent Reasoning, With +7.86 Points on GSM8K-Aug, +4.27 Points on AIME, and 3-4× Shorter Reasoning Chains
Researchers introduce Latent-GRPO, a stabilized RL method for latent reasoning, in which reasoning steps are compressed into continuous representations. They identify three fundamental problems with applying GRPO directly in latent space: invalid latent states, misalignment between the reward signal and token updates, and invalid averaged states. They address these with a combination of invalid-sample advantage masking, one-sided noise sampling, and optimal correct-path first-token selection. Reported results: +7.86 Pass@1 on GSM8K-Aug and +4.27 points on AIME, with 3-4× shorter reasoning chains.
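The summary does not spell out how invalid-sample advantage masking is computed, so the following is only a rough sketch of what the idea could look like on top of standard GRPO group-normalized advantages. The function name `masked_group_advantages`, the boolean `valid` flags, and the choice to compute the group mean and standard deviation over valid samples only are all assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def masked_group_advantages(rewards, valid, eps=1e-8):
    """Group-relative advantages with invalid samples zeroed out.

    rewards: one scalar reward per sampled rollout in the group.
    valid:   hypothetical flags, False where a rollout reached an
             invalid latent state (how validity is decided is not
             specified in the summary).
    """
    if len(rewards) != len(valid):
        raise ValueError("rewards and valid must have the same length")

    # Assumption: group statistics use only the valid rollouts.
    kept = [r for r, ok in zip(rewards, valid) if ok]
    if not kept:
        # No valid rollout in the group: nothing contributes a gradient.
        return [0.0] * len(rewards)

    mu = mean(kept)
    sigma = pstdev(kept)

    # Standard GRPO normalization for valid rollouts, zero for masked ones.
    return [
        (r - mu) / (sigma + eps) if ok else 0.0
        for r, ok in zip(rewards, valid)
    ]


rewards = [1.0, 0.0, 1.0, 0.0]
valid = [True, True, False, True]
advantages = masked_group_advantages(rewards, valid)
```

In this sketch the third rollout is masked, so it receives zero advantage and does not influence the policy update, while the remaining rollouts are normalized against each other as in ordinary GRPO.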