Why is 5 ms clock skew a problem?

At that skew, causality violations appear in observability traces — e.g., a trace shows that B 'finished before' A even though A triggered B. The system still functions correctly, but debugging and performance analysis become unreliable.

What does this mean for teams scaling LLM serving?

Time must be treated as a first-class concern in the design of distributed AI systems. This means precise NTP/PTP synchronization, observability tooling that detects clock drift, and tests that include timing as a variable.

OCP study: 5 ms clock skew breaks AI inference observability

What is distributed AI inference?

Serving LLM queries across multiple physical nodes — e.g., a single query passes through a tokenizer node, embedding node, transformer node and postprocessing node. Each node has its own clock, and coordination depends on timestamp-based information in logs.

On April 23, 2026, a team consisting of Ankur Sharma, Deepa Shah, David Lariviere and Hesham ElBakoury published the paper “Time, Causality, and Observability Failures in Distributed AI Inference Systems” (arXiv:2604.21361). The work originates within the Open Compute Project (OCP) Unified Intelligent Infrastructure workstream, which gives weight to the findings, since OCP sets standards used by practically all hyperscalers (Meta, Microsoft, Google, AWS).

What is distributed AI inference?

Modern LLM serving rarely takes place on a single server. Distributed inference splits the work across multiple nodes: tokenizer, KV cache, transformer attention layers (often tensor-parallel across multiple GPUs), embedding storage, postprocessing and orchestrator. Each node has its own local clock, and coordination between them depends entirely on timestamp-based observability infrastructure, that is, distributed tracing tools such as OpenTelemetry, Jaeger or Zipkin.

What did the paper show?

The authors conduct controlled experiments on a multi-node AI inference pipeline and deliberately introduce clock skew (clock misalignment) at one stage. Key findings:

Up to 3 ms skew: no observability violation
5 ms skew: “clear causality violations emerge”
Functional output: remains “largely unaffected” — the system produces correct results
Throughput: also unaffected

In other words, the system functions correctly, but observability becomes causally incorrect — traces show impossible sequences (e.g., a response “before” the request), making debugging and performance analysis unreliable.
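How a few milliseconds of skew can produce an impossible trace is easy to illustrate with a toy model. The sketch below is not the paper's experimental setup: the helper `recorded_hop_latency_ms` and the 4 ms hop latency are assumptions chosen for illustration. It models each node stamping events with its own clock and shows that the recorded latency of a hop turns negative once the receiving node's clock lags by more than the true hop latency.

```python
def recorded_hop_latency_ms(true_latency_ms, sender_offset_ms, receiver_offset_ms):
    """Latency of one hop as it would appear in a trace.

    Each node stamps events with its own clock, modelled here as
    true time plus a fixed per-node offset (the skew).
    """
    true_send = 0.0
    true_recv = true_send + true_latency_ms
    send_stamp = true_send + sender_offset_ms
    recv_stamp = true_recv + receiver_offset_ms
    return recv_stamp - send_stamp


# Hypothetical hop that truly takes 4 ms (assumed, not from the paper).
TRUE_LATENCY_MS = 4.0

for skew_ms in (0.0, 3.0, 5.0):
    # The receiver's clock runs `skew_ms` behind the sender's.
    observed = recorded_hop_latency_ms(TRUE_LATENCY_MS, 0.0, -skew_ms)
    status = "causality violation" if observed < 0 else "ok"
    print(f"skew={skew_ms} ms  recorded latency={observed} ms  {status}")
```

With these assumed numbers the hop looks plausible at 0 and 3 ms of skew, and at 5 ms it appears to finish 1 ms before it started. The 4 ms figure is picked only so the toy numbers line up with the thresholds quoted above; the source does not state the pipeline's actual hop latencies.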

Three categories of failures

The findings yield a taxonomy of boundary failures:

Temporal ordering violations — events appear in traces in the wrong chronological order
Causality violations — cause-and-effect relationships become impossible to reconstruct from logs
Observability degradation independent of system performance — the most dangerous category because there is no warning that something is wrong (output is correct, throughput is fine — only the logs lie)
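The first two categories can be checked for mechanically. The following is a minimal sketch, assuming spans are available as plain records with start and end timestamps and an optional parent reference. The `Span` type and the `find_violations` helper are hypothetical and do not correspond to the OpenTelemetry, Jaeger or Zipkin APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    start_ms: float
    end_ms: float


def find_violations(spans):
    """Return (span_id, reason) pairs for spans whose timestamps are impossible."""
    by_id = {s.span_id: s for s in spans}
    violations = []
    for s in spans:
        # A span cannot end before it starts.
        if s.end_ms < s.start_ms:
            violations.append((s.span_id, "negative duration"))
        parent = by_id.get(s.parent_id) if s.parent_id is not None else None
        # A child is triggered by its parent, so it cannot start earlier.
        if parent is not None and s.start_ms < parent.start_ms:
            violations.append((s.span_id, "starts before parent"))
    return violations


trace = [
    Span("tokenizer", None, 100.0, 130.0),
    # Stamped by a node whose clock lags (hypothetical values).
    Span("transformer", "tokenizer", 98.0, 120.0),
]
print(find_violations(trace))
```

The third category cannot be caught by watching output or throughput, which is why a check of this kind has to run on the traces themselves.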

The authors additionally note that the behavior is non-static: in longer runs, negative span rates can stabilize or decrease due to clock drift between nodes. Experiments were conducted on Kafka and ZeroMQ transports with consistent results; Aeron is under investigation but not included in confirmed validation.

What must teams do?

The paper’s main recommendation: “timing must be treated as a first-class concern in distributed AI systems”. Practical implications:

PTP (Precision Time Protocol) instead of classic NTP — sub-millisecond precision over the network
Observability tooling that actively detects clock drift and warns before traces become corrupt
Testing with simulated skew as part of CI/CD for inference servers
Single-node fallback strategies for low-latency paths where timing is critical
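Drift detection and skew testing can be combined in a small check. The sketch below uses the standard NTP-style four-timestamp offset estimate, which assumes the network delay is the same in both directions. The function names, the simulated delays and the 3 ms alert threshold are assumptions for illustration; the threshold is borrowed from the "no violation up to 3 ms" finding and is not a value the paper recommends.

```python
def estimate_offset_ms(t0, t1, t2, t3):
    """NTP-style estimate of how far the remote clock is ahead of the local one.

    t0: local send time, t1: remote receive time,
    t2: remote send time, t3: local receive time (all in ms).
    Assumes the network delay is the same in both directions.
    """
    return ((t1 - t0) + (t2 - t3)) / 2.0


def skew_alert(offset_ms, threshold_ms=3.0):
    """True when the estimated offset exceeds the (assumed) safe threshold."""
    return abs(offset_ms) > threshold_ms


def simulate_exchange(remote_offset_ms, one_way_delay_ms=0.5, processing_ms=0.25):
    """Build the four timestamps for a remote clock with a known injected skew."""
    t0 = 1000.0
    t1 = t0 + one_way_delay_ms + remote_offset_ms  # read on the remote clock
    t2 = t1 + processing_ms                        # read on the remote clock
    t3 = t0 + 2 * one_way_delay_ms + processing_ms # read on the local clock
    return t0, t1, t2, t3


# CI-style check: inject known skews and confirm the detector reacts.
for injected in (0.0, 3.0, 5.0):
    offset = estimate_offset_ms(*simulate_exchange(injected))
    print(injected, offset, skew_alert(offset))
```

In a real pipeline the four timestamps would come from an actual probe exchange between nodes rather than from `simulate_exchange`, and asymmetric network paths would add error to the estimate.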

For teams scaling LLM serving to tens or hundreds of nodes — whether hyperscalers or mid-size shops — this paper is required reading before the next architectural step.
