arXiv: FALAT traces the cause of AI agent failure

FALAT is a new diagnostic framework for attributing the causes of failures in multi-agent LLM systems, formulated as a dependency-guided search. It achieves 46.0% step-level accuracy on algorithmically generated trajectories and 29.1% on hand-crafted ones, showing that accounting for dependencies between steps is key to identifying the cause of an error.

A paper published on arXiv titled “FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search” presents a diagnostic framework for failure attribution, one of the hardest problems in multi-agent systems. The authors are Md Nakhla Rafi, Md Ahasanuzzaman, Dong Jae Kim, Zhijie Wang and Tse-Hsun Chen.

What problem does FALAT solve?

In systems where several LLM agents collaborate on a task, a failure is not easy to attribute. An error can propagate through a trajectory: later actions look wrong, but only because they depend on an earlier corrupted state. A trajectory here is the whole sequence of steps and decisions the agents take. FALAT tries to determine which agent actually caused the failure and at which step the decisive error occurred.

How does FALAT work?

The framework is formulated as a “dependency-guided search”. FALAT first builds expectations about the correct execution of the task, then identifies suspicious regions of the trajectory, tracks the dependencies between decisions and outputs, and assesses whether correcting a candidate step would restore the expected outcome. Instead of stopping at the last visibly wrong step, it follows the dependencies back to the actual source of the failure.
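The four stages described above can be illustrated with a small toy sketch. This is not the authors' implementation: the `Step` class, the `execute`, `upstream` and `attribute_failure` functions, the agent names and the numeric outputs are all hypothetical, and the sketch assumes that a reference (expected) output is available for every step and that a step can be "corrected" simply by overriding its output and replaying the rest of the trajectory.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    agent: str                        # which agent produced this step
    depends_on: list[int]             # indices of earlier steps this one reads
    run: Callable[[list[int]], int]   # output as a function of dependency outputs


def execute(steps: list[Step], override: Optional[dict[int, int]] = None) -> list[int]:
    """Run the steps in order; `override` forces the output of chosen steps."""
    override = override or {}
    outputs: list[int] = []
    for i, step in enumerate(steps):
        if i in override:
            outputs.append(override[i])
        else:
            outputs.append(step.run([outputs[d] for d in step.depends_on]))
    return outputs


def upstream(steps: list[Step], start: int) -> list[int]:
    """Indices of `start` and everything it transitively depends on, earliest first."""
    seen: set[int] = set()
    stack = [start]
    while stack:
        i = stack.pop()
        if i not in seen:
            seen.add(i)
            stack.extend(steps[i].depends_on)
    return sorted(seen)


def attribute_failure(
    steps: list[Step], reference: list[int], suspicious: int
) -> Optional[tuple[str, int]]:
    """Return (agent, step) for the earliest upstream step whose correction
    restores the expected final outcome, or None if no single fix does."""
    for i in upstream(steps, suspicious):
        repaired = execute(steps, {i: reference[i]})
        if repaired[-1] == reference[-1]:
            return steps[i].agent, i
    return None


# Hypothetical four-step trajectory; the "parser" step contains the bug.
steps = [
    Step("searcher", [], lambda _: 10),
    Step("parser", [0], lambda d: d[0] - 3),      # should have produced 5
    Step("calculator", [1], lambda d: d[0] * 2),
    Step("reporter", [2], lambda d: d[0] + 1),
]
reference = [10, 5, 10, 11]   # assumed expected output of each step

observed = execute(steps)
culprit = attribute_failure(steps, reference, suspicious=3)
print(observed, culprit)
```

In this toy run the last step looks wrong, but correcting the "parser" step alone is enough to restore the expected final result, so the sketch attributes the failure to that earlier step rather than to the step where the error became visible.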

What are the results?

On the Who&When benchmark, FALAT achieves 46.0% step-level accuracy for algorithmically generated trajectories and 29.1% for hand-crafted ones. These figures outperform specialised baseline attribution methods as well as direct prompting of standalone LLMs. The gap between the two sets also shows how much harder the task is on hand-assembled, more diverse trajectories.
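For readers unfamiliar with the metric, step-level accuracy can be understood as the share of failed trajectories in which the predicted error step exactly matches the annotated one. The sketch below assumes that definition; the function name `step_level_accuracy` and the sample numbers are illustrative only and are not taken from the paper or the benchmark.

```python
def step_level_accuracy(predicted: list[int], annotated: list[int]) -> float:
    """Fraction of trajectories whose predicted error step equals the annotated one."""
    if len(predicted) != len(annotated):
        raise ValueError("predicted and annotated must have the same length")
    if not annotated:
        return 0.0
    hits = sum(p == a for p, a in zip(predicted, annotated))
    return hits / len(annotated)


# Made-up example: five failed trajectories, one step index per trajectory.
predicted_steps = [3, 1, 4, 2, 7]
annotated_steps = [3, 2, 4, 5, 6]
accuracy = step_level_accuracy(predicted_steps, annotated_steps)
print(f"{accuracy:.1%}")
```

Under this definition a prediction that is off by a single step counts as a miss, which is one reason step-level scores tend to be low in absolute terms.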

Why does this matter?

The results emphasise that dependency-aware reasoning is key to diagnosing failures in LLM agent systems. As agent systems spread into production, the ability to precisely attribute the cause of an error becomes a prerequisite for reliability, debugging and accountability. FALAT offers a structured approach to that challenge rather than guessing which step failed.

Frequently Asked Questions

What does FALAT try to determine?

FALAT tries to determine which agent caused the failure in a multi-agent LLM system and at which step the decisive error occurred.

What accuracy does FALAT achieve?

It achieves 46.0% step-level accuracy on algorithmically generated trajectories and 29.1% on hand-crafted ones, outperforming specialised baseline methods and direct prompting.

Sources

arXiv:2606.00765: FALAT traces the causes of failures in AI agent trajectories