arXiv:2605.06635: LLM citations are not verified

New research tested 14 LLMs on deep research tasks and uncovered a major gap: links are valid in 94%+ of cases, but the factual accuracy of citations is only 39–77%. The key finding: citation accuracy drops by an average of 42% as the number of tool calls rises from 2 to 150, which challenges the assumption that more retrieval means better quality.

The study “Cited but Not Verified” (Onweller et al., arXiv:2605.06635), published 7 May 2026, reveals a serious gap between the surface quality of citations and the actual factual reliability of LLM deep research agents. The team developed an AST parser that extracts inline citations from Markdown reports and evaluates them across three dimensions: URL availability, content relevance, and factual accuracy.
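As a rough illustration of the extraction step, the sketch below pulls inline Markdown links out of a report and pairs each one with the sentence that contains it. It is not the paper's parser: it swaps the AST approach for a regular expression and a naive sentence split, and the names `Citation` and `extract_citations` are hypothetical.

```python
import re
from dataclasses import dataclass

# Matches inline Markdown links such as [label](https://example.com/page).
# The negative lookbehind skips image syntax: ![alt](src).
INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^\s)]+)\)")

# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


@dataclass
class Citation:
    claim: str  # the sentence in which the link appears
    label: str  # the link text
    url: str    # the cited URL


def extract_citations(markdown: str) -> list[Citation]:
    """Return one Citation per inline link, in document order."""
    citations = []
    for sentence in SENTENCE_SPLIT.split(markdown):
        for match in INLINE_LINK.finditer(sentence):
            citations.append(
                Citation(
                    claim=sentence.strip(),
                    label=match.group(1),
                    url=match.group(2),
                )
            )
    return citations


report = (
    "Sales rose 10% [Acme report](https://example.com/a). "
    "Nothing is cited here. "
    "See ![chart](https://example.com/i.png) and [B](https://example.org/b)."
)
for citation in extract_citations(report):
    print(citation.url, "<-", citation.claim)
```

Each extracted pair of claim and URL is what the three checks (URL availability, content relevance, factual accuracy) would then be run against. A real Markdown AST handles reference-style links, nested brackets, and code spans that this regular expression does not.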

What do the numbers show?

Frontier models maintain link validity above 94% and relevance above 80% — but factual accuracy against the source material ranges from just 39% to 77%. In other words, the cited link exists and is thematically relevant, yet it does not always support the claim the agent makes alongside it.

Why does more searching mean less accuracy?

The study’s most significant finding is the inverse relationship between research depth and reliability. As tool calls increase from 2 to 150, fact-check accuracy drops by an average of 42%. This challenges the intuitive assumption that more thorough deep research yields better results. A plausible explanation is that deeper searches accumulate errors and dilute the model’s attention to individual sources.
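The summary above does not say whether the 42% figure is a relative decline or a drop in percentage points, and the two readings lead to quite different end values. The sketch below works through both for a purely hypothetical starting accuracy of 0.70; that baseline and the function names are assumptions for illustration, not numbers from the paper.

```python
def after_relative_drop(baseline: float, drop: float) -> float:
    """Accuracy left after losing `drop` as a fraction of the baseline."""
    return baseline * (1.0 - drop)


def after_point_drop(baseline: float, drop: float) -> float:
    """Accuracy left after losing `drop` as absolute percentage points."""
    return baseline - drop


HYPOTHETICAL_BASELINE = 0.70  # assumed accuracy at 2 tool calls, not from the paper
DROP = 0.42                   # the average drop quoted in the article

relative = after_relative_drop(HYPOTHETICAL_BASELINE, DROP)
points = after_point_drop(HYPOTHETICAL_BASELINE, DROP)

print(round(relative, 3))  # 0.406
print(round(points, 3))    # 0.28
```

Under either reading the decline is large, but anyone comparing agents at different depths would want to check the paper itself for which definition applies.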

What does this mean for users?

For journalists, researchers, and businesses relying on deep research agents, the finding is a warning: a link in a report is no guarantee that the source supports the claim. Fewer than half of open-source models were even able to generate cited reports in single-shot mode. The study suggests that manual verification of key citations remains necessary, especially for high-stakes tasks such as legal or medical research.

Frequently Asked Questions

What does 'fact check accuracy' mean?

Fact check accuracy is a metric that measures how well the content of a citation matches the claim referencing it in the text — that is, whether the cited source actually supports what the LLM asserts. It differs from simply checking whether a link opens.
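To make the distinction between the three measures concrete, here is a minimal sketch that turns per-citation judgments into three rates. It assumes each citation has already been judged (by a person or another model) and that every rate uses all citations as its denominator; the paper may define the denominators differently, and `JudgedCitation` and `score_report` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class JudgedCitation:
    url_ok: bool          # the link resolves
    relevant: bool        # the page is on the topic of the claim
    supports_claim: bool  # the page actually backs the stated claim


def score_report(citations: list[JudgedCitation]) -> dict[str, float]:
    """Return the fraction of citations passing each check."""
    total = len(citations)
    if total == 0:
        # Assumption: a report with no citations scores zero everywhere.
        return {"link_validity": 0.0, "relevance": 0.0, "fact_check_accuracy": 0.0}
    return {
        "link_validity": sum(c.url_ok for c in citations) / total,
        "relevance": sum(c.relevant for c in citations) / total,
        "fact_check_accuracy": sum(c.supports_claim for c in citations) / total,
    }


# Made-up judgments: every link opens, three are on topic, two support the claim.
judged = [
    JudgedCitation(url_ok=True, relevant=True, supports_claim=True),
    JudgedCitation(url_ok=True, relevant=True, supports_claim=True),
    JudgedCitation(url_ok=True, relevant=True, supports_claim=False),
    JudgedCitation(url_ok=True, relevant=False, supports_claim=False),
]
print(score_report(judged))
```

In this invented example link validity is 1.0 while fact-check accuracy is 0.5, which mirrors the kind of gap the study reports between links that open and links that support the claim.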

Why does accuracy drop with research depth?

The researchers showed that as the number of tool calls increases from 2 to 150, factual accuracy drops by an average of 42%. Possible causes include cumulative error, reduced attention to individual sources, and the tendency of models to generate plausible citations without genuine verification.

Which models were tested?

The team benchmarked 14 models — a combination of closed-source frontier systems and open-source models. Fewer than half of the open-source models successfully generated cited reports in single-shot mode.
