🟢 🤖 Models Thursday, April 16, 2026 · 2 min read

ArXiv: Numerical Instability in LLMs — How Floating-Point Errors Create Chaos in Transformers

Why it matters

New research rigorously analyzes how rounding errors in floating-point arithmetic propagate, and can grow chaotically, through the layers of a transformer. The paper identifies three behavioral regimes — stable, chaotic, and signal-dominated — and demonstrates that numerical instability is not a bug but a fundamental property of LLMs, one that threatens reproducibility in production systems.

Why Does the Same Prompt Give Different Answers?

Every user of large language models has noticed the phenomenon: the same query sent to the same model sometimes produces different answers. Part of this behavior is explained by intentional randomness (the temperature parameter), but researchers Chashi Mahiul Islam, Alan Villarreal, and Mao Nishino show that there is a deeper explanation — numerical instability inherent to the architecture itself.

Floating-point arithmetic — the system by which computers represent real numbers with finite precision — inevitably introduces rounding errors. Their research tracks how these errors “propagate, amplify, or dissipate” as they pass through the layers of transformer architecture.
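The root cause is easy to demonstrate: floating-point addition is not associative, so merely regrouping the same values changes the rounded result. A minimal illustration with ordinary IEEE 754 double-precision numbers in Python:

```python
# Floating-point addition is not associative: grouping changes rounding.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.1 + 0.2 first rounds to 0.30000000000000004
right = a + (b + c)  # 0.2 + 0.3 first rounds to exactly 0.5

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

Each individual operation is correctly rounded; the discrepancy comes purely from the order of evaluation. A transformer performs billions of such operations per forward pass, so tiny discrepancies like this are always present at the input to the next layer.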

Three Behavioral Regimes

The paper identifies an “avalanche effect” in the early layers of transformers, where small perturbations lead to a binary outcome — they are either rapidly amplified or completely suppressed. This creates three distinct regimes:

The stable regime occurs when perturbations are below an input-dependent threshold — errors disappear and the model produces consistent outputs. The chaotic regime arises when rounding errors dominate and drive output divergence. The signal-dominated regime is one where actual variations in the input outweigh the numerical noise.
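The threshold behavior can be caricatured with a toy model (my illustration, not the paper's method): treat each layer as applying an effective gain to a perturbation. After many layers the perturbation scales like gain**n_layers, so it either dies out (stable regime) or avalanches (chaotic regime), while a genuinely different input produces a change large enough to dominate the noise either way (signal-dominated regime):

```python
def perturbation_after_layers(eps, gain, n_layers):
    """Toy model: each layer scales a perturbation by an effective gain."""
    for _ in range(n_layers):
        eps *= gain
    return eps

eps0 = 1e-7  # a rounding-error-sized perturbation

# Stable regime: effective gain below 1, the error dissipates to nothing.
print(perturbation_after_layers(eps0, 0.8, 48))  # shrinks by orders of magnitude

# Chaotic regime: effective gain above 1, the error avalanches.
print(perturbation_after_layers(eps0, 1.5, 48))  # grows past O(1)
```

Real transformers are nonlinear and the effective gain is input-dependent, which is exactly why the paper finds an input-dependent threshold separating the regimes rather than a single universal constant.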

Practical Implications for the AI Industry

These “universal, scale-dependent chaotic patterns” appear across multiple datasets and architectures, meaning the problem is not specific to any one model or vendor.

For production systems — especially those integrated into agentic workflows where LLMs make decisions in chains — this has concrete consequences. The same code on different hardware (GPU vs. TPU vs. CPU) can produce different outputs not by design, but due to differing implementations of floating-point operations. This threatens the reproducibility, testing, and certification of AI systems in regulated industries such as medicine or finance.
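One concrete mechanism behind the hardware dependence (a standard numerical-computing example, not taken from the paper): parallel accelerators accumulate sums in different orders than sequential code, and reordering a floating-point sum changes its rounding. The same three numbers can sum to two different answers:

```python
# Different hardware and kernels accumulate reductions in different orders.
values = [1e16, 1.0, -1e16]

# Sequential order: 1.0 is absorbed into 1e16 (below half an ulp), then cancelled.
sequential = (values[0] + values[1]) + values[2]

# Reordered (as a parallel reduction might pair terms): cancellation happens first.
reordered = (values[0] + values[2]) + values[1]

print(sequential)  # 0.0
print(reordered)   # 1.0
```

Whether such a discrepancy is then suppressed or amplified into a different token is exactly what the paper's stable and chaotic regimes describe.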

🤖

This article was generated using artificial intelligence from primary sources.