Infrastructure
Test-Time Compute
Spending extra compute during inference — having the model reason longer before answering — to improve accuracy; the basis of modern reasoning models.
Test-time compute is the practice of spending more computation during inference — having the model generate a longer, more deliberate chain of intermediate steps before its final answer — to improve solution quality. It is also called inference-time scaling or test-time scaling.
Classic scaling grew the model and the training data. Test-time compute opens a second axis: with the same trained model, you allow more “thinking” at each query. This is done through a longer chain-of-thought, by sampling many candidate answers and selecting the best (self-consistency, verification), or by searching over a tree of solutions. Empirically, more compute spent often raises accuracy on hard problems such as mathematics, code, and logic.
The paradigm reached the mainstream with OpenAI o1 (2024) and underpins today’s reasoning models. The trade-off is cost and latency: an answer can consume many times more tokens and take seconds to minutes. Through 2025–2026 the gains saturate past a point, so labs work on allocating compute adaptively to task difficulty.