Infrastructure
Inference (model serving)
The phase where a trained model produces outputs for new inputs; consumes GPU/TPU resources and drives cost, latency, and throughput of AI services.
Inference is the phase of a model's lifecycle in which a trained network receives new inputs and produces outputs: predictions, classifications, or generated text. It is the counterpart to training, which happens once (or over a few rounds) and consumes far heavier hardware for weeks or months.
For a large language model, inference means generating one token at a time: each new token requires a forward pass through every layer (see the sketch after this list). Key metrics:
- Time to first token (TTFT) — how quickly the user sees the start of an answer
- Throughput (tokens/sec) — how many users and queries a server can handle in parallel
- Cost per million tokens — the main driver of API providers’ business model
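A minimal sketch of how the first two metrics fall out of the autoregressive loop. `DummyModel`, its token ids, and the 10 ms per-pass latency are placeholders standing in for a real serving stack, not an actual implementation:

```python
import random
import time

class DummyModel:
    """Stand-in for a real LLM: one 'forward pass' per generated token."""
    eos_token = 0

    def forward(self, tokens):
        time.sleep(0.01)                       # pretend per-token compute latency
        return random.randint(1, 50_000)       # pretend next-token id

def generate(model, prompt_tokens, max_new_tokens=64):
    tokens = list(prompt_tokens)
    start = time.perf_counter()
    ttft = None
    for _ in range(max_new_tokens):
        next_token = model.forward(tokens)     # full pass through every layer
        if ttft is None:
            ttft = time.perf_counter() - start # time to first token (TTFT)
        tokens.append(next_token)
        if next_token == model.eos_token:
            break
    elapsed = time.perf_counter() - start
    generated = len(tokens) - len(prompt_tokens)
    return ttft, generated / elapsed           # TTFT, tokens/sec

ttft, tps = generate(DummyModel(), prompt_tokens=[101, 2023, 2003])
print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {tps:.1f} tokens/sec")
```

In a production server the same timers wrap the prefill and decode phases; the structure of the measurement is identical.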
Inference dominates a model's total cost of ownership over its lifetime: training may cost millions of dollars once, but serving a successful product can cost that much every month. That is why inference optimization is an active field, spanning quantization (FP8, INT4), speculative decoding, prefix caching, batching, and KV-cache reuse.
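As one example from that list, KV-cache reuse avoids recomputing attention keys and values for tokens already processed. A sketch using the Hugging Face transformers generation pattern, with gpt2 purely as a stand-in for any causal LM:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The cheapest way to serve a model is", return_tensors="pt").input_ids
past = None
with torch.no_grad():
    for _ in range(20):
        # After the first step, feed only the newest token plus the cache,
        # instead of re-running the whole sequence through every layer.
        out = model(ids if past is None else ids[:, -1:],
                    past_key_values=past, use_cache=True)
        past = out.past_key_values                      # reuse attention state
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```

Without `past_key_values`, each step would re-run the entire growing sequence through every layer instead of just the newest token.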
The hardware stack is diverse: NVIDIA H100 and B200 GPUs dominate the cloud, while dedicated accelerators such as Google TPUs, AWS Trainium/Inferentia, and Groq LPUs gain ground on price-performance. Local inference (Apple Silicon, laptop NPUs, Ollama) reshapes the economics further.
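For the local end of that spectrum, a minimal sketch against a locally running Ollama server; the model tag "llama3" and the prompt are just examples:

```python
import requests

# Query a local Ollama server (default port 11434) for a single,
# non-streamed completion.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```

The same loop-level metrics (TTFT, tokens/sec) apply locally; what changes is the cost structure, since the hardware is already paid for.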