🟡 🤝 Agents · Thursday, May 7, 2026 · 2 min read

GitHub: validation of agentic behavior via dominator analysis from compiler theory achieves 100% accuracy vs 82% agent self-assessment

Editorial illustration: graph structure diagram with highlighted dominator nodes representing essential steps in agent execution

GitHub publishes a validation framework for non-deterministic AI agents that borrows dominator analysis from compiler theory: from 2 to 10 successful executions of the Copilot Coding Agent, the system learns which steps are essential and which are optional, reaching 100% validation accuracy where the agent's own self-assessment manages only 82.2%.

🤖

This article was generated using artificial intelligence from primary sources.

GitHub’s research team has published a validation framework that borrows dominator analysis from compiler theory to address the problem of non-deterministic AI agent behavior in CI/CD pipelines. Traditional testing assumes deterministic execution paths, but agents like the Copilot Coding Agent often traverse different valid routes where environmental variations (loading screens, timing shifts, differences in UI rendering) generate false negatives.

What is dominator analysis and how is it applied to agents?

Dominator analysis is a technique from compiler theory: in an execution graph, state A “dominates” state B if every successful path to B must pass through A. GitHub’s framework takes 2 to 10 successful agent traces, merges them into a Prefix Tree Acceptor (PTA), a directed graph whose nodes are observed states and whose edges are agent actions, and computes the dominator set to separate essential control points from optional noise.
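To make the idea concrete, here is a minimal sketch of the dominator computation. The trace graph and state names are invented for illustration; the real framework operates on a PTA built from agent screenshots and actions.

```python
# Minimal sketch, not GitHub's implementation. The trace graph below is
# hypothetical: two successful runs diverge (one sees a loading screen,
# one does not) but share the essential states.

def dominators(graph, root):
    """Iterative dataflow computation: dom(n) = {n} ∪ ⋂ dom(p) over predecessors p."""
    nodes = set(graph) | {c for out in graph.values() for c in out}
    preds = {n: set() for n in nodes}
    for n, out in graph.items():
        for c in out:
            preds[c].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in nodes - {root}:
            new = {n}
            if preds[n]:
                new |= set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

# Hypothetical merged trace graph of two successful executions.
trace_graph = {
    "start":          ["open_editor"],
    "open_editor":    ["loading_screen", "apply_patch"],
    "loading_screen": ["apply_patch"],
    "apply_patch":    ["success"],
    "success":        [],
}

essential = dominators(trace_graph, "start")["success"]
print(sorted(essential))  # ['apply_patch', 'open_editor', 'start', 'success']
# "loading_screen" does not dominate "success", so it is treated as optional noise.
```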

The system uses a three-layer state-equivalence check: visual metrics (perceptual hashing, SSIM), semantic analysis via a multimodal LLM that ignores timestamps but flags missing UI controls, and conservative state merging that happens only when equivalence is certain.
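A minimal sketch of how such a layered check could be wired together; the helper callables and the thresholds are assumptions for illustration, not a published API.

```python
# Hypothetical glue code for a three-layer equivalence check. The concrete
# metrics are injected as callables; thresholds are illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    same: bool
    confidence: float

def states_equivalent(
    shot_a, shot_b,
    phash_distance: Callable,           # e.g. Hamming distance of perceptual hashes
    ssim: Callable,                     # structural similarity of the two screenshots
    llm_judge: Callable[..., Verdict],  # multimodal LLM: ignore timestamps, flag missing controls
) -> bool:
    # Layer 1: cheap visual metrics decide the easy cases.
    if phash_distance(shot_a, shot_b) <= 4 or ssim(shot_a, shot_b) >= 0.98:
        return True
    # Layer 2: multimodal LLM judgement for the ambiguous cases.
    verdict = llm_judge(shot_a, shot_b)
    # Layer 3: conservative merging, only on a high-confidence "same" answer.
    return verdict.same and verdict.confidence >= 0.9
```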

What are the measured benefits over agent self-assessment?

On an internal test set for the VS Code extension, the PTA approach with dominator analysis achieves 100% accuracy, 100% precision, and 100% recall, while the agent’s own self-assessment stays at 82.2% accuracy, 83.3% precision, and 60% recall. The gap is widest in recall, a full 40 percentage points, meaning the agent frequently fails to notice its own errors.

Even more importantly, agent self-assessment scores 0% F1 at detecting false alarms, while the structural framework reaches 52.2% F1 in distinguishing agent execution failures from genuine product regressions. In practice, this means the CI system stops sending developers hunting for phantom bugs when only the timing has changed and actual product behavior has not.
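As an illustration of how a learned model like this could be applied at validation time (a hypothetical check built on the dominator sketch above, not necessarily the exact classifier the paper describes): a run that skips only optional states passes, while a run that never reaches an essential state is escalated for review.

```python
# Hypothetical validation-time check reusing the essential-state set from the
# dominator sketch above. Extra optional states (e.g. a loading screen) are
# ignored; a missing essential state is escalated instead of dismissed as flaky.

def classify_run(new_trace, essential_states):
    missing = set(essential_states) - set(new_trace)
    if not missing:
        return "pass (only optional variation)"
    return f"escalate: essential states never reached: {sorted(missing)}"

print(classify_run(
    ["start", "open_editor", "loading_screen", "apply_patch", "success"],
    essential_states={"start", "open_editor", "apply_patch", "success"},
))  # pass (only optional variation)
```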

What are the current limitations of the framework?

The system requires successful traces to learn from (a cold-start problem), relies on access to a multimodal LLM for semantic equivalence, and cannot yet flag temporal violations such as excessively long loading screens. The planned roadmap includes detection of temporal and negative constraints, hierarchical abstraction of screenshots into concepts, and online learning with real-time model updates.

Authors Gaurav Mittal (Microsoft Code AI) and Reshabh Kumar Sharma (UW PhD) emphasize the main thesis: “We don’t need black-box models to judge other black-box models — we need structural guarantees that developers can inspect.”

Frequently Asked Questions

What is dominator analysis?
Dominator analysis is a technique from compiler theory — in an execution graph, state A 'dominates' state B if every successful path to B must pass through A. Here it is applied to agent traces to separate essential steps from incidental variations.
How many traces does the framework need to learn?
Between 2 and 10 successful agent executions. From these, the system builds a Prefix Tree Acceptor (PTA), merges semantically equivalent states, and extracts a minimal 'ground truth' model.
What is the main benefit in CI/CD pipelines?
Reduction of false negatives caused by environmental variations (loading screens, timing, UI rendering). Agent self-assessment scores 0% F1 at detecting false alarms, while the structural framework reaches 52.2% F1.