Infrastructure
Inference (model serving)
The phase where a trained model produces outputs for new inputs; consumes GPU/TPU resources and drives cost, latency, and throughput of AI services.
Inference is the phase of a model's lifecycle in which a trained network receives new inputs and produces outputs: predictions, classifications, or generated text. It is the counterpart to training, which happens once (or over a few rounds) and consumes far heavier hardware for weeks or months.
For a large language model, inference means generating one token at a time: each new token requires a forward pass through every layer (see the sketch after this list). Key metrics:
- Time to first token (TTFT) — how quickly the user sees the start of an answer
- Throughput (tokens/sec) — how many users and queries a server can handle in parallel
- Cost per million tokens — the main driver of API providers’ business model
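A minimal sketch of how the first two metrics fall out of the autoregressive loop. `DummyModel`, its token ids, and the 10 ms per-pass latency are placeholders standing in for a real serving stack, not an actual implementation:

```python
import random
import time

class DummyModel:
    """Stand-in for a real LLM: one 'forward pass' per generated token."""
    eos_token = 0

    def forward(self, tokens):
        time.sleep(0.01)                       # pretend per-token compute latency
        return random.randint(1, 50_000)       # pretend next-token id

def generate(model, prompt_tokens, max_new_tokens=64):
    tokens = list(prompt_tokens)
    start = time.perf_counter()
    ttft = None
    for _ in range(max_new_tokens):
        next_token = model.forward(tokens)     # full pass through every layer
        if ttft is None:
            ttft = time.perf_counter() - start # time to first token (TTFT)
        tokens.append(next_token)
        if next_token == model.eos_token:
            break
    elapsed = time.perf_counter() - start
    generated = len(tokens) - len(prompt_tokens)
    return ttft, generated / elapsed           # TTFT, tokens/sec

ttft, tps = generate(DummyModel(), prompt_tokens=[101, 2023, 2003])
print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {tps:.1f} tokens/sec")
```

In a production server the same timers wrap the prefill and decode phases; the structure of the measurement is identical.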
Inference dominates a model's total cost of ownership over its lifetime: training may cost millions of dollars once, but serving a successful product can cost that much every month. That is why inference optimization is an active field, spanning quantization (FP8, INT4), speculative decoding, prefix caching, batching, and KV-cache reuse.
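As one example from that list, KV-cache reuse avoids recomputing attention keys and values for tokens already processed. A sketch using the Hugging Face transformers generation pattern, with gpt2 purely as a stand-in for any causal LM:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The cheapest way to serve a model is", return_tensors="pt").input_ids
past = None
with torch.no_grad():
    for _ in range(20):
        # After the first step, feed only the newest token plus the cache,
        # instead of re-running the whole sequence through every layer.
        out = model(ids if past is None else ids[:, -1:],
                    past_key_values=past, use_cache=True)
        past = out.past_key_values                      # reuse attention state
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```

Without `past_key_values`, each step would re-run the entire growing sequence through every layer instead of just the newest token.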
The hardware stack is diverse: NVIDIA H100 and B200 GPUs dominate the cloud, while dedicated accelerators such as Google TPUs, AWS Trainium/Inferentia, and Groq LPUs gain ground on price-performance. Local inference (Apple Silicon, laptop NPUs, Ollama) reshapes the economics further.
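For the local end of that spectrum, a minimal sketch against a locally running Ollama server; the model tag "llama3" and the prompt are just examples:

```python
import requests

# Query a local Ollama server (default port 11434) for a single,
# non-streamed completion.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```

The same loop-level metrics (TTFT, tokens/sec) apply locally; what changes is the cost structure, since the hardware is already paid for.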