Latent-GRPO: Stable RL for Latent Reasoning

Researchers introduce Latent-GRPO, a stabilized RL approach for latent reasoning, in which reasoning steps are compressed into continuous representations. They identify three fundamental problems with applying GRPO (Group Relative Policy Optimization) directly in latent space (invalid latent states, misalignment between the reward signal and token updates, and invalid averaged states) and address them with invalid-sample advantage masking, one-sided noise sampling, and optimal correct-path first-token selection. Reported results: +7.86 Pass@1 points on GSM8K-Aug and +4.27 points on AIME, with 3-4× shorter reasoning chains.

On April 30, 2026, a team of researchers (Jingcheng Deng, Zihao Wei, Liang Pang, Junhong Wu, Shicheng Xu, Zenghao Duan, Huawei Shen) published a paper addressing a key problem in modern language-model reasoning: the stability of reinforcement learning in latent space.

What Is the Problem They Are Solving?

Most of today’s reasoning models use explicit chain-of-thought — they generate long text describing the steps to a solution. The approach is effective but expensive: long chains mean many tokens, which directly increases cost and latency.

Latent reasoning is an alternative: reasoning steps are compressed into continuous vector representations inside the model, without verbalization. This dramatically shortens the chain. The problem is that standard RL algorithms such as GRPO become unstable when applied directly in that space.
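To make the contrast concrete, here is a toy sketch of one common way to picture the difference between a verbalized step and a latent step. It is an assumption made for illustration, not the paper's architecture: the two-word vocabulary, the 2-D vectors, and the helper names explicit_step and latent_step are all hypothetical.

```python
# Toy contrast between an explicit (verbalized) step and a latent step.
# Every name and vector below is a hypothetical illustration.

VOCAB_EMBEDDINGS = {
    "add": [1.0, 0.0],
    "multiply": [0.0, 1.0],
}

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def explicit_step(hidden_state):
    """Verbalize the step: snap the hidden state to the best-matching token,
    then pass that token's embedding forward."""
    token = max(VOCAB_EMBEDDINGS,
                key=lambda t: dot(VOCAB_EMBEDDINGS[t], hidden_state))
    return token, VOCAB_EMBEDDINGS[token]

def latent_step(hidden_state):
    """Skip verbalization: pass the continuous state forward unchanged."""
    return list(hidden_state)

hidden = [0.6, 0.4]
token, explicit_input = explicit_step(hidden)  # discretized to one token
latent_input = latent_step(hidden)             # keeps the full vector
print(token, explicit_input, latent_input)
```

In this picture the explicit step discards everything except the winning token, while the latent step carries the whole vector forward. That freedom is also what lets RL exploration drift into states the model never saw during training.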

Three Fundamental Problems With Direct GRPO

The authors identify three structural problems that make direct application of GRPO to latent reasoning unstable:

Invalid latent states — unconstrained RL exploration leaves the manifold on which the model normally operates; representations become “garbage.”
Reward-token misalignment — the reward signal applies to the entire sequence, but updates are applied to individual tokens; without intervention, the gradient moves in the wrong direction.
Invalid averaged states — when multiple valid paths lead to the correct answer, averaging them produces a representation that does not belong to any of them.
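The third problem can be illustrated with a minimal geometric sketch. The 2-D vectors are hypothetical stand-ins for latent states and are not taken from the paper. The mean of two distinct valid states is a new point that coincides with neither of them.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical latent states for two different but equally valid solution paths.
path_a = [1.0, 0.0]
path_b = [0.0, 1.0]

averaged = mean_vector([path_a, path_b])
sim_a = cosine(averaged, path_a)
sim_b = cosine(averaged, path_b)
norm_avg = math.sqrt(sum(x * x for x in averaged))

print(averaged)                          # the midpoint between the two paths
print(round(sim_a, 3), round(sim_b, 3))  # equally distant from both
print(round(norm_avg, 3))                # shorter than either unit vector
```

The averaged vector sits halfway between the two paths and has a smaller norm than either, so an update that pulls the model toward it moves the model toward a state that neither valid path ever visits.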

How Does Latent-GRPO Solve Each Problem?

Each of the three problems is addressed by a targeted intervention:

Invalid-sample advantage masking — samples that fall off the manifold receive zero advantage, which zeroes out the gradient on them
One-sided noise sampling — exploration is restricted to one side of the distribution, preventing divergence
Optimal correct-path first-token selection — among all correct paths, the one whose first token best matches the model’s representation is chosen, avoiding averaging into an invalid state
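Two of these interventions can be sketched in a few lines under stated assumptions. The group-relative advantage (reward minus group mean, divided by group standard deviation) is the standard GRPO form. Computing the statistics over valid samples only, scoring "best matches" with a dot product, and the names masked_group_advantages and select_correct_path are assumptions made for illustration. One-sided noise sampling is left out because the description above does not specify the noise distribution.

```python
import statistics

def masked_group_advantages(rewards, is_valid, eps=1e-8):
    """GRPO-style group-relative advantages with invalid samples masked.

    Assumption: mean and std are computed over valid samples only.
    Invalid samples get advantage 0.0, so they contribute no gradient.
    """
    valid_rewards = [r for r, ok in zip(rewards, is_valid) if ok]
    if len(valid_rewards) < 2:
        return [0.0] * len(rewards)
    mean = statistics.fmean(valid_rewards)
    std = statistics.pstdev(valid_rewards)
    return [(r - mean) / (std + eps) if ok else 0.0
            for r, ok in zip(rewards, is_valid)]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def select_correct_path(paths, model_state):
    """Pick ONE correct path instead of averaging several.

    Assumption: "best matches" means the highest dot product between the
    path's first-token vector and the model's current state.
    """
    correct = [p for p in paths if p["correct"]]
    if not correct:
        return None
    return max(correct, key=lambda p: dot(p["first_token_vec"], model_state))

# Hypothetical group of four rollouts; the third left the valid region.
rewards = [1.0, 0.0, 1.0, 0.0]
is_valid = [True, True, False, True]
advantages = masked_group_advantages(rewards, is_valid)

# Hypothetical candidate paths; C scores highest but is incorrect.
paths = [
    {"name": "A", "correct": True, "first_token_vec": [1.0, 0.0]},
    {"name": "B", "correct": True, "first_token_vec": [0.0, 1.0]},
    {"name": "C", "correct": False, "first_token_vec": [0.9, 0.1]},
]
chosen = select_correct_path(paths, model_state=[0.9, 0.1])
print(advantages, chosen["name"])
```

In this sketch the masked rollout has a reward of 1.0 yet still receives zero advantage, so it cannot reinforce an off-manifold state. Selecting a single correct path, rather than blending A and B, keeps the training target on a trajectory the model can actually follow.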

The Numbers

Three headline results:

Metric	Comparison	Result
GSM8K-Aug (low-difficulty math)	Latent-GRPO vs baseline	+7.86 Pass@1 points
AIME (high-difficulty math)	Latent-GRPO vs explicit GRPO	+4.27 points
Reasoning chain length	Latent-GRPO vs explicit	3-4× shorter

Notably, the gains appear on both easy and hard tasks, which suggests that Latent-GRPO does not trade general capability for narrow improvements.

Why This Matters

The current “reasoning model” trend (OpenAI’s o-series, DeepSeek’s R-series, Anthropic’s extended thinking) shows that competitiveness is being built on the capacity for long reasoning. But every push of that boundary means more tokens, which directly raises inference cost and limits applications that must run in real time or at scale.

If Latent-GRPO proves reproducible, the same level of reasoning can be achieved with 3-4× shorter reasoning chains, a major signal for organizations optimizing cost-per-task. The second, deeper insight is methodological: the paper shows that naively extending existing RL algorithms to new representational spaces does not work, and it identifies concretely what needs to be fixed. In doing so, it opens the door to a next generation of efficient reasoning models that do not trade quality for brevity.
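As a rough back-of-the-envelope sketch of the cost-per-task point: the token counts below are hypothetical, and the calculation assumes a latent step costs about the same as a generated token, which the article does not state. Because only the reasoning portion of a request shrinks, the end-to-end saving is smaller than the 3-4× figure.

```python
def total_steps(prompt_tokens, reasoning_steps, answer_tokens,
                shortening_factor=1.0):
    """Total decoding work for one task when only the reasoning chain shrinks."""
    return prompt_tokens + reasoning_steps / shortening_factor + answer_tokens

# Hypothetical request profile (not taken from the paper).
PROMPT, REASONING, ANSWER = 200, 1200, 50

baseline = total_steps(PROMPT, REASONING, ANSWER)
results = {factor: total_steps(PROMPT, REASONING, ANSWER, factor)
           for factor in (3, 4)}

for factor, steps in results.items():
    print(f"{factor}x shorter chain: {steps:.0f} steps, "
          f"{baseline / steps:.2f}x less work overall")
```

With this profile the overall reduction comes out a little above 2×, so the real saving depends on how much of a workload's budget is spent on reasoning in the first place.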

Frequently Asked Questions

What is latent reasoning?

An approach in which reasoning steps are not written out as explicit text (chain-of-thought) but compressed into continuous vector representations. The goal is to significantly shorten the length of the reasoning chain while preserving the ability to solve complex problems.

Why does direct GRPO not work in latent space?

Three reasons: (1) invalid latent states, because unconstrained exploration leaves the manifold on which the model normally operates; (2) the reward signal does not align with individual token updates; (3) averaging multiple valid paths produces an invalid averaged state. Latent-GRPO addresses each problem with a targeted intervention.

How significant are the results?

On the GSM8K-Aug benchmark, +7.86 Pass@1 points above baseline; on AIME (high-difficulty math), +4.27 points above explicit GRPO. Crucially, these gains are achieved with 3-4× shorter reasoning chains, which has direct cost implications for inference.
