GUI-SD: self-distillation for GUI agents beats GRPO RL

Yan Zhang, Daiqing Wu, and Huawen Shen presented GUI-SD — the first on-policy self-distillation (OPSD) framework specifically for GUI grounding, the ability of AI agents to map natural language instructions to visual coordinates of interface elements. The system uses privileged visual context (bounding box and Gaussian soft mask) and entropy-guided distillation. Across six representative GUI grounding benchmarks, GUI-SD consistently outperforms GRPO-based RL methods.

The paper, “Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding”, was posted to arXiv on May 1, 2026 by Yan Zhang, Daiqing Wu, and Huawen Shen. It introduces GUI-SD as an OPSD (on-policy self-distillation) framework designed specifically for the GUI grounding task.

What is GUI grounding and why is it foundational for agents?

GUI grounding is the ability to map a natural language instruction (e.g., “click the Save button”) to precise visual coordinates of the target element on screen. Without this capability, an autonomous GUI agent cannot actually navigate a computer application — it can only suggest to the user what to click.

GUI agents are a growing category in 2026 (Anthropic Claude Computer Use, OpenAI Operator, Google Gemini Computer Use). All these agents are limited by grounding accuracy: if an agent says “click Save” and gets the coordinates wrong by 20 pixels, it clicks the wrong place and the workflow fails.
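
To make the failure mode concrete, here is a minimal sketch of a hit test: a predicted click counts as correct only if it lands inside the target element's bounding box. The box format, the button size, and the function name are illustrative assumptions, not details taken from the paper.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in screen pixels.
Box = Tuple[float, float, float, float]


def click_hits_target(click: Tuple[float, float], box: Box) -> bool:
    """Return True if the predicted click point lies inside the target box."""
    x, y = click
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


# Hypothetical 30x20 px "Save" button.
save_button: Box = (100.0, 40.0, 130.0, 60.0)

center_click = (115.0, 50.0)   # centre of the button
shifted_click = (135.0, 50.0)  # same click, 20 px further right

print(click_hits_target(center_click, save_button))
print(click_hits_target(shifted_click, save_button))
```

On a small control like this one, a 20-pixel error is enough to move the click off the element entirely.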

Why self-distillation rather than reinforcement learning?

Recent RL methods (such as GRPO — Group Relative Policy Optimization) achieve strong results but have two serious weaknesses the authors identify:

Expensive multiple rollouts — each training step requires running the model several times to generate a distribution of responses
Sparse signal on hard examples — when the model consistently fails, the RL gradient effectively vanishes
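
The second weakness can be illustrated with a small sketch of group-relative advantages as commonly described for GRPO: each rollout's reward is standardized against the other rollouts for the same prompt. The binary reward and the use of the population standard deviation are assumptions for illustration; implementations differ in these details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each rollout's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some variants differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one instruction, reward 1.0 if the click hit the target.
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed_group)  # the successful rollout gets a positive advantage
print(all_failed)   # every advantage is zero, so no learning signal
```

When every rollout in the group fails, all rewards are equal, every advantage is zero, and the example contributes nothing to the update even though several rollouts were paid for.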

On-policy self-distillation (OPSD) solves both problems. It provides a dense token-level supervisory signal from a single rollout — every token in the output has a clear training target, regardless of whether the entire trajectory was successful. This makes training more efficient and stable.
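
A dense token-level signal can be sketched as a per-token divergence between the teacher's and the student's next-token distributions over one sampled output. The forward KL used here and the toy three-entry vocabulary are assumptions; the article does not say which divergence the paper uses.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_logits: List[float], student_logits: List[float]) -> float:
    """KL(teacher || student) at a single token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(
    teacher_seq: List[List[float]], student_seq: List[List[float]]
) -> Tuple[float, List[float]]:
    """Mean per-token KL over one rollout, plus the per-token values."""
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token), per_token


# Toy rollout: 3 output tokens, vocabulary of 3 entries (hypothetical logits).
teacher = [[4.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 2.0]]
student = [[1.0, 1.0, 0.0], [0.0, 3.0, 0.0], [1.0, 0.0, 0.5]]

loss, per_token = distillation_loss(teacher, student)
print(loss, per_token)
```

Every position yields its own loss term, so a rollout that ends on the wrong click still provides a training target at each token, which is what distinguishes this from a single scalar reward per trajectory.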

GUI-SD is the first OPSD framework adapted for GUI grounding. Prior OPSD work mainly covered NLP tasks or classification.

What makes GUI-SD specific to grounding?

The system uses two key mechanisms:

Privileged visual context: the teacher model receives an image enriched with a target bounding box and a Gaussian soft mask. The soft mask gives the teacher a strong indication of where the target is but does not reveal the exact coordinates, so the teacher still has to “reason” about the precise pixel location. This addresses a classic self-distillation problem: if the teacher is far better informed than the student, it becomes a “cheater” rather than a teacher.
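
One way to picture a Gaussian soft mask is as a 2D Gaussian centred on the target box, with values near 1 at the centre that fall off smoothly with distance. The sketch below is an assumption about the general shape only: the `sigma_scale` parameter, the per-axis widths, and how the mask is blended into the teacher's image are hypothetical and not taken from the paper.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def gaussian_soft_mask(
    width: int, height: int, box: Box, sigma_scale: float = 0.5
) -> List[List[float]]:
    """Build a height x width mask peaking at the centre of `box`."""
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    # Spread is tied to the box size; floor of 1 px avoids division by zero.
    sx = max((x_max - x_min) * sigma_scale, 1.0)
    sy = max((y_max - y_min) * sigma_scale, 1.0)
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            d2 = ((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2
            row.append(math.exp(-0.5 * d2))
        mask.append(row)
    return mask


# Tiny 8x6 "screen" with a target box centred at pixel (4, 3).
mask = gaussian_soft_mask(8, 6, (2.0, 1.0, 6.0, 5.0))
for row in mask:
    print(" ".join(f"{v:.2f}" for v in row))
```

Because the mask is soft, it highlights a region rather than a single pixel, which is consistent with the description of a hint that stops short of revealing the exact coordinates.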

Entropy-guided distillation: each token's weight depends on two things: (a) the significance of the digit in the coordinate output (the most significant digit of a coordinate matters more than the least significant, because an error there moves the click much further) and (b) the teacher’s confidence at that position. Tokens that are both significant and reliable receive higher weight, focusing optimization where it is most valuable.
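
The weighting idea can be sketched by multiplying a digit-significance term by a confidence term derived from the teacher's entropy. The geometric `decay` for significance, the normalized-entropy confidence, and the product form are all assumptions made for illustration; the paper's actual formula may differ.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy (nats) of one token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def token_weights(teacher_probs: List[List[float]], decay: float = 0.5) -> List[float]:
    """Weight each digit position by significance x teacher confidence."""
    raw = []
    for i, probs in enumerate(teacher_probs):
        significance = decay ** i  # assumption: leading digit counts most
        max_entropy = math.log(len(probs))
        confidence = max(0.0, 1.0 - entropy(probs) / max_entropy)
        raw.append(significance * confidence)
    total = sum(raw)
    n = len(raw)
    return [w / total for w in raw] if total > 0 else [1.0 / n] * n


def peaked(p_top: float, vocab: int = 10) -> List[float]:
    """Hypothetical teacher distribution over the digits 0-9."""
    rest = (1.0 - p_top) / (vocab - 1)
    return [p_top] + [rest] * (vocab - 1)


# Three digits of one coordinate: confident, less confident, uniform guess.
weights = token_weights([peaked(0.91), peaked(0.60), peaked(0.10)])
print(weights)
```

In this toy case the leading digit, which is both most significant and most confidently predicted, dominates the weights, while the last digit, where the teacher is guessing uniformly, contributes essentially nothing.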

How large are the improvements?

Experiments on six representative GUI grounding benchmarks show that GUI-SD consistently outperforms GRPO-based methods and naive OPSD in two dimensions:

Grounding accuracy (the value that ultimately determines agent success)
Training efficiency (less compute needed for the same result)

Per-benchmark numbers are given in the paper itself. The reported takeaway is that combining a single rollout, entropy-guided weighting, and privileged teacher context outperforms both the GRPO-based and naive OPSD baselines for grounding training.

The paper is available on arXiv under ID 2605.00642.

Frequently Asked Questions

What is GUI grounding and why is it important for agents?

GUI grounding is the ability to map natural language instructions (e.g., 'click Save') to precise visual coordinates of the target element. It is the foundational capability for autonomous GUI agents that operate through the screen rather than through an API.

Why is on-policy self-distillation better than reinforcement learning for GUI grounding?

RL methods like GRPO rely on expensive multiple rollouts and suffer from sparse signals on hard examples. OPSD provides a dense token-level supervisory signal from a single rollout, making training more efficient and stable.

How does entropy-guided distillation work?

The system adaptively weights tokens based on digit significance and teacher confidence, concentrating optimization on the most impactful and reliable positions. A token that is both significant and predicted confidently by the teacher receives higher weight than one that is insignificant or uncertain.
