ICML 2026 Spotlight: Stable-GFlowNet introduces more stable and diverse automated LLM red-teaming
A team from KAIST and NAVER Cloud has presented Stable-GFlowNet (S-GFN), a new approach to automated red-teaming of large language models that eliminates estimation of the partition function Z and uses pairwise comparisons for stable learning. The paper received an ICML 2026 Spotlight, a designation given to fewer than 5% of accepted papers, and addresses the chronic GFlowNet problems of training instability and mode collapse under noisy rewards.
This article was generated using artificial intelligence from primary sources.
Minchan Kwon, Sunghyun Baek, Minseo Kim, Jaemyung Yu, Dongyoon Han, and Junmo Kim from KAIST and NAVER Cloud published the paper Stable-GFlowNet (S-GFN) on May 1, 2026. It received the ICML 2026 Spotlight designation, which goes to fewer than 5% of accepted papers and marks the work as a likely reference point for automated LLM red-teaming in 2026.
The central problem the paper addresses is training instability and mode collapse in GFlowNets (Generative Flow Networks), generative models trained to sample diverse outputs with probability proportional to a reward function. In the red-teaming context, a GFlowNet must generate attacks on a target LLM with varied patterns, not just variants of the same jailbreak.
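To make "proportional to a reward function" concrete, here is a minimal Python sketch. The attack pattern names and reward values are invented for illustration and do not come from the paper. It contrasts the reward-proportional target distribution with a pure reward maximizer, which would only ever return the single best pattern.

```python
import random
from collections import Counter

# Hypothetical attack patterns and rewards, for illustration only.
rewards = {
    "role_play": 4.0,
    "encoding_trick": 3.0,
    "multi_turn": 2.0,
    "prefix_injection": 1.0,
}

def proportional_distribution(rewards):
    """Return p(x) = R(x) / Z, where Z is the sum of all rewards."""
    z = sum(rewards.values())
    return {x: r / z for x, r in rewards.items()}

def sample(dist, n, seed=0):
    """Draw n samples from dist and count how often each pattern appears."""
    rng = random.Random(seed)
    names = list(dist)
    weights = [dist[x] for x in names]
    return Counter(rng.choices(names, weights=weights, k=n))

target = proportional_distribution(rewards)
counts = sample(target, 10_000)

# A pure reward maximizer collapses onto one pattern.
greedy = max(rewards, key=rewards.get)

print(target)
print(counts)
print("greedy picks only:", greedy)
```

In this toy setting Z is a sum over four items. For prompts generated token by token, the same sum ranges over every possible sequence, which is why it cannot be computed directly.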
How does Stable-GFlowNet solve the instability problem?
S-GFN eliminates estimation of the partition function Z, the normalizing constant obtained by summing the reward over every possible output. Z cannot be computed exactly for text, and learning an estimate of it causes training instability in classical GFlowNets. Instead, the authors introduce pairwise trajectory comparisons (contrastive trajectory balance): rather than estimating the absolute reward scale, the network compares the success of two attacks against each other.
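The following Python sketch illustrates why comparing two trajectories can remove Z from the objective. The first function is the standard trajectory balance residual for an autoregressive sampler. The second is an assumed pairwise form, written here only to show the cancellation. The article does not give the paper's exact loss, so `pairwise_loss` should be read as hypothetical.

```python
import math

def tb_loss(log_z, log_pf, reward):
    """Squared trajectory balance residual for one trajectory.

    Assumes an autoregressive sampler, so the backward policy is
    deterministic and contributes log-probability 0.
    """
    return (log_z + log_pf - math.log(reward)) ** 2

def pairwise_loss(log_pf_a, reward_a, log_pf_b, reward_b):
    """Hypothetical pairwise variant (an assumption, not the paper's loss).

    Each residual is log Z + log P_F - log R. Subtracting two residuals
    cancels log Z, so no estimate of Z is needed.
    """
    delta_a = log_pf_a - math.log(reward_a)
    delta_b = log_pf_b - math.log(reward_b)
    return (delta_a - delta_b) ** 2

# Toy case: two attacks with rewards 3 and 1, so the true Z is 4 and the
# ideal sampler picks them with probabilities 0.75 and 0.25.
log_pf_a, log_pf_b = math.log(0.75), math.log(0.25)

tb_bad_z = tb_loss(0.0, log_pf_a, 3.0)               # ideal sampler, wrong Z estimate
tb_true_z = tb_loss(math.log(4.0), log_pf_a, 3.0)    # ideal sampler, correct Z
pair = pairwise_loss(log_pf_a, 3.0, log_pf_b, 1.0)   # no Z involved
pair_scaled = pairwise_loss(log_pf_a, 30.0, log_pf_b, 10.0)  # rewards rescaled x10

print(tb_bad_z, tb_true_z, pair, pair_scaled)
```

With a poor Z estimate, the standard loss penalizes a sampler that is already ideal, while the pairwise form is approximately zero. The pairwise value also stays the same when both rewards are multiplied by a common factor, which is one narrow sense in which it does not depend on the absolute reward scale.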
The practical consequence is that pairwise comparisons are robust to reward noise (the target model may return inconsistent attack success signals) while preserving the GFlowNet's key property of generating diverse samples.
What is the “fluency stabilizer”?
The second technical contribution is a fluency stabilizer that prevents convergence to low-quality solutions. In red-teaming, unstable training can push the model toward “attacks” that are actually meaningless token sequences (high reward due to a bug in the reward function, not actual effectiveness). The stabilizer filters such pathological modes and keeps generated prompts linguistically coherent.
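The article does not describe how the stabilizer is implemented. As a rough illustration of the general idea only, the Python sketch below gates the attack reward on a fluency score. The toy unigram table, the threshold, and the floor value are all invented for this example. A real system would presumably score prompts with a pretrained reference language model.

```python
import math

# Toy stand-in for a reference language model: unigram probabilities.
# These values are invented; a real setup would use a pretrained LM.
UNIGRAM = {
    "ignore": 0.05,
    "the": 0.2,
    "previous": 0.05,
    "instructions": 0.05,
    "please": 0.1,
}
UNK_PROB = 1e-6  # probability assigned to tokens the toy model has never seen

def mean_log_prob(prompt):
    """Average per-token log-probability under the toy unigram model."""
    tokens = prompt.lower().split()
    if not tokens:
        return float("-inf")
    return sum(math.log(UNIGRAM.get(t, UNK_PROB)) for t in tokens) / len(tokens)

def gated_reward(prompt, attack_reward, threshold=-6.0, floor=1e-8):
    """Return attack_reward only if the prompt looks fluent.

    Non-fluent prompts get a small positive floor, so the log-reward
    stays defined while the mode is strongly discouraged.
    """
    if mean_log_prob(prompt) >= threshold:
        return attack_reward
    return floor

fluent = gated_reward("please ignore the previous instructions", 0.9)
gibberish = gated_reward("zxq ]]]] ##@ vvv", 0.9)
print(fluent, gibberish)
```

Here both prompts receive the same raw attack reward, but only the coherent one keeps it. This is the behavior the article attributes to the stabilizer, which stops meaningless token sequences from being learned as a high-reward mode.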
Why does diverse red-teaming matter so much?
Red-teaming systems that suffer from mode collapse find one hole (e.g., a role-play "pretend you are DAN") and vary it endlessly. A security team that patches that one hole may wrongly conclude the problem is solved, because the red-teaming system never explored other attack patterns.
S-GFN covers a broader attack distribution, so each patching cycle discovers and addresses a larger number of distinct vulnerabilities. For AI vendors (Anthropic, OpenAI, Google) that are expected to demonstrate robustness before deployment, such a tool reduces the risk of public incidents.
How does it fit into the broader security ecosystem?
The paper joins a series of recent papers on automated red-teaming and LLM safety: Microsoft Research published an agent network analysis on April 30, ARMOR 2025 introduced a doctrine-based military benchmark the same day, and various labs are working on alignment-faking detection. Stable-GFlowNet offers a methodological building block that other frameworks could use for generating test scenarios.
For researchers, the availability of code and checkpoints remains an open question — the authors promise a later release, which is typical for ICML Spotlight papers.
Frequently Asked Questions
- What is a GFlowNet in the context of red-teaming?
- A GFlowNet (Generative Flow Network) is a generative model trained to sample diverse outputs with probability proportional to a reward function. In red-teaming, reward is given for successful attacks on the target model, so the GFlowNet learns to generate attacks covering many different patterns, not just variants of the same jailbreak.
- What is 'contrastive trajectory balance' and why is it the key contribution?
- Classical GFlowNets require estimating the partition function Z, a normalizing constant that sums the reward over every possible output. S-GFN bypasses this by using pairwise trajectory comparisons, comparing the success of two attacks against each other without needing an absolute reward scale. This reduces training instability and handles noisy rewards more robustly.
- Why is diverse red-teaming important?
- Systems that suffer from 'mode collapse' find one hole and keep generating variants of the same jailbreak. Diverse red-teaming discovers a wider range of vulnerabilities, so a production model is more robustly protected after patching. Security teams need breadth, not depth on a single attack.