What are transitivity violations in the context of LLM judges?

Transitivity means that if a judge rates A above B and B above C, it must also rate A above C. A violation occurs when the judge says A > B, B > C, and C > A, a cycle that no consistent ranking can produce. In the study, 33 to 67 percent of documents contain at least one such cycle.

Which evaluation criteria are least reliable?

Fluency and Consistency are the least reliable: their average prediction set size is ~4.9 on a 5-point scale, which is too wide for per-instance decisions. Coherence is moderate (~3.9). Relevance has the smallest average set size (~3.0) and is the most reliable.

Why is this finding important for industrial evaluation?

Practitioners who use LLM judges for automatic NLG evaluation assume that a low aggregate violation rate means high reliability. The study shows that this assumption is wrong: per-instance reliability can be dramatically lower.

ArXiv: conformal prediction exposes hidden unreliability in LLM judges

Manan Gupta and Dhruv Kumar published the paper “Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations” on April 16, 2026, taking the analysis of LLM-as-a-judge reliability to a deeper level. While most research has focused on aggregate reliability metrics, this study is the first to systematically examine per-instance reliability, that is, reliability measured for each individual document separately.

What is transitivity and why does it matter?

Transitivity is a fundamental logical property of ordering: if an LLM judge says that response A is better than B, and B is better than C, then A must be better than C. When this does not hold, the comparisons contain a directed 3-cycle: A > B, B > C, but C > A, which is impossible for a consistent evaluator.
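To make the idea of a directed 3-cycle concrete, here is a minimal Python sketch that counts cyclic triads among a judge's pairwise verdicts for one document. The function name `count_three_cycles`, the data layout (a set of winner/loser pairs), and the example verdicts are illustrative assumptions, not the authors' code.

```python
from itertools import combinations

def count_three_cycles(items, prefers):
    """Count directed 3-cycles among items.

    prefers is a set of (winner, loser) pairs, one per compared pair.
    Returns (number of cyclic triads, number of triads checked).
    """
    cycles = 0
    triads = 0
    for a, b, c in combinations(items, 3):
        triads += 1
        # A triad is cyclic if preferences run a > b > c > a, or the reverse.
        forward = (a, b) in prefers and (b, c) in prefers and (c, a) in prefers
        backward = (b, a) in prefers and (c, b) in prefers and (a, c) in prefers
        if forward or backward:
            cycles += 1
    return cycles, triads

# Hypothetical verdicts for four candidate responses to one document.
verdicts = {("A", "B"), ("B", "C"), ("C", "A"),
            ("A", "D"), ("B", "D"), ("C", "D")}
cycles, triads = count_three_cycles(["A", "B", "C", "D"], verdicts)
print(cycles, triads)  # 1 4
```

Here only the triad (A, B, C) is cyclic; the three triads that include D are consistent, because D loses to every other response.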

The authors measured how often these violations occur in real LLM judges. At first glance the results look reassuring: aggregate violation rates are low, between 0.8 and 4.1 percent. By that measure, LLM judges seem reliable.

But when the researchers examined how many documents have at least one transitivity violation, the picture changed dramatically: between 33 and 67 percent of documents have at least one 3-cycle in their comparisons. In other words, roughly a third to two thirds of all documents contain a logically impossible set of verdicts somewhere.
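The gap between the two numbers is a matter of denominators: the aggregate rate divides by all compared triads, while the per-document rate asks only whether a document has any cycle at all. The sketch below illustrates this with invented numbers. The eight candidates per document and the cycle counts are assumptions chosen for illustration, not figures from the paper.

```python
from math import comb

def violation_rates(cycles_per_doc, triads_per_doc):
    """Return (aggregate rate, per-document rate) of transitivity violations."""
    total_cycles = sum(cycles_per_doc)
    total_triads = triads_per_doc * len(cycles_per_doc)
    # Aggregate: cyclic triads out of all triads across all documents.
    aggregate = total_cycles / total_triads
    # Per-document: share of documents with at least one cyclic triad.
    per_document = sum(1 for c in cycles_per_doc if c > 0) / len(cycles_per_doc)
    return aggregate, per_document

# Hypothetical setup: 8 candidates per document gives C(8, 3) = 56 triads each.
triads = comb(8, 3)
cycles_per_doc = [1, 0, 1, 0, 2, 0, 0, 1, 0, 0]  # invented counts for 10 documents
aggregate, per_document = violation_rates(cycles_per_doc, triads)
print(f"{aggregate:.1%} aggregate, {per_document:.0%} of documents")
```

With these made-up counts, fewer than 1 percent of triads are cyclic, yet 4 of the 10 documents contain at least one contradiction.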

Conformal prediction as a diagnostic tool

The study introduces a new methodology based on split conformal prediction, which builds a prediction set over the Likert scores 1 to 5 for each document. The advantage is that these sets come with a theoretical coverage guarantee: for a chosen miscoverage level α, the set contains the true score with probability at least 1-α.
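For readers unfamiliar with split conformal prediction, the following minimal Python sketch shows the general recipe. It assumes the judge exposes a probability for each Likert score and uses one common nonconformity score, one minus the probability of the true score. The paper's exact score function is not described here, so treat this choice, the names `calibrate` and `prediction_set`, and the calibration data as hypothetical. The coverage guarantee is marginal and relies on calibration and test documents being exchangeable.

```python
import math

SCORES = [1, 2, 3, 4, 5]

def calibrate(cal_probs, cal_labels, alpha):
    """Return the conformal threshold from a held-out calibration split.

    cal_probs[i][s - 1] is the judge's probability for Likert score s;
    cal_labels[i] is the reference (e.g. human) score for that document.
    """
    # Nonconformity: one minus the probability assigned to the true score.
    scores = sorted(1.0 - p[y - 1] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: keep every score
    return scores[k - 1]

def prediction_set(probs, threshold):
    """Keep every score whose nonconformity does not exceed the threshold."""
    return [s for s in SCORES if 1.0 - probs[s - 1] <= threshold]

# Invented calibration split: judge probabilities and reference scores.
cal_probs = [
    [0.05, 0.10, 0.60, 0.20, 0.05],
    [0.10, 0.20, 0.40, 0.20, 0.10],
    [0.02, 0.08, 0.20, 0.50, 0.20],
    [0.70, 0.20, 0.05, 0.03, 0.02],
    [0.05, 0.15, 0.30, 0.30, 0.20],
    [0.10, 0.50, 0.25, 0.10, 0.05],
    [0.05, 0.05, 0.10, 0.30, 0.50],
]
cal_labels = [3, 4, 4, 1, 5, 2, 4]

threshold = calibrate(cal_probs, cal_labels, alpha=0.25)
confident = prediction_set([0.05, 0.05, 0.10, 0.70, 0.10], threshold)
uncertain = prediction_set([0.25, 0.25, 0.25, 0.15, 0.10], threshold)
print(confident, uncertain)  # [4] [1, 2, 3]
```

A peaked distribution yields a set of size 1, while a flat one yields a wider set. That width is the per-document uncertainty signal the study analyzes.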

Key finding: the width of the prediction set correlates with actual per-instance reliability, with a Spearman coefficient of r_s = +0.576 on a sample of 1,918 documents and a p-value below 10^-100. In other words, a wide set means the judge is uncertain about that specific document, and this uncertainty can be formally measured.
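A Spearman coefficient is the Pearson correlation of the ranks of two variables, with tied values sharing their mean rank. The sketch below computes it in plain Python. The second variable, an absolute gap between the judge's score and a reference score, is a hypothetical stand-in for whatever per-instance reliability measure the authors used, and the data are invented.

```python
def average_ranks(values):
    """Return 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Invented data: prediction set size and judge-vs-reference score gap per document.
set_sizes = [1, 2, 2, 3, 5, 4]
score_gaps = [0, 0, 1, 1, 2, 1]
print(round(spearman(set_sizes, score_gaps), 3))
```

In practice `scipy.stats.spearmanr` does the same job and also reports a p-value; the hand-written version is shown only to keep the example dependency-free.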

Evaluation criteria are not equal

The study measured reliability across different criteria and revealed a clear hierarchy:

Relevance — average set size ~3.0 (most reliable)
Coherence — average set size ~3.9 (moderate)
Fluency and Consistency — average set size ~4.9 (unreliable)

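This means that when an LLM judge evaluates the fluency or consistency of a response, its verdicts are considerably less reliable than when it evaluates relevance. On a five-point scale, an average set size of ~4.9 means the prediction set contains almost every possible score, so it rules out almost nothing.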

Practical implications

The prediction set width shows consistent correlation across different judges (r̄ = 0.32–0.38), indicating that it reflects the difficulty of the document itself, not noise specific to one judge. The authors conclude that it matters more which type of criterion you are assessing than which specific LLM you choose as a judge.

Together with the parallel study Context Over Content by the same author (Manan Gupta), this paper signals that the LLM-as-a-judge paradigm must be reconsidered, both in terms of bias and in terms of the reliability of individual verdicts. Both studies are currently under peer review.
