Infrastructure
Context window
The maximum number of tokens an LLM can process at once, including the prompt, supplied documents, and the generated answer; typical sizes today range from 8K to 2 million tokens.
A context window is the maximum number of tokens a large language model can consider in a single interaction. It covers the system prompt, every document you supply, the conversation history, and the output the model generates. Once the window is full, something has to be dropped, summarized, or moved to external storage.
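A minimal sketch of that trade-off, assuming a chat application that trims the oldest turns once the budget is spent (the constants and the character-based token estimate below are illustrative, not any provider's API):

```python
# Minimal sketch: keep a conversation inside a fixed context budget by
# dropping the oldest turns first. Token counting here is a crude
# character-based estimate, not a real tokenizer.

MAX_CONTEXT_TOKENS = 8_000      # assumed model limit
RESERVED_FOR_ANSWER = 1_000     # leave room for the model's output

def count_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters of English text per token."""
    return max(1, len(text) // 4)

def fit_to_window(system_prompt: str, turns: list[str]) -> list[str]:
    """Keep only the most recent turns that fit alongside the system prompt."""
    budget = MAX_CONTEXT_TOKENS - RESERVED_FOR_ANSWER - count_tokens(system_prompt)
    kept: list[str] = []
    for turn in reversed(turns):       # walk newest to oldest
        cost = count_tokens(turn)
        if cost > budget:
            break                      # older turns no longer fit
        kept.append(turn)
        budget -= cost
    return list(reversed(kept))
```

In a real system, the turns that no longer fit would more often be summarized or written to external storage for later retrieval than discarded outright.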
Size is measured in tokens, not characters — one token is roughly 4 characters of English text or about 0.75 words.
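As a rough check of that rule of thumb, a tokenizer library such as tiktoken (assumed installed; cl100k_base is the encoding used by GPT-4-era OpenAI models) can count tokens exactly and compare them against the 4-characters heuristic:

```python
# Compare the ~4-characters-per-token heuristic against an exact count.
# Requires the tiktoken package (pip install tiktoken).
import tiktoken

text = "A context window is the maximum number of tokens a model can consider."
enc = tiktoken.get_encoding("cl100k_base")

exact = len(enc.encode(text))
estimate = len(text) // 4
print(f"exact: {exact} tokens, heuristic: {estimate} tokens")
```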
Evolution over a few short years:
- 2020: GPT-3 had 2K tokens
- 2023: GPT-4 32K, Claude 2 100K
- 2024–2025: Claude 3.5/3.7 200K, GPT-4o 128K, Gemini 1.5/2.0 1M–2M
- 2026: Claude 1M context (preview); production systems routinely run at 100K+
Large windows enable “context stuffing”: dropping entire codebases, long PDFs, or multi-hour transcripts straight in. They are not a panacea: “lost in the middle” studies show models retrieve material buried in the middle of long inputs less reliably than material near the beginning or end, and self-attention compute (and with it latency) scales quadratically with sequence length in classical transformer architectures (though modern optimizations like FlashAttention and sparse attention soften this).
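A back-of-the-envelope illustration of the quadratic term: in vanilla self-attention every token scores against every other token, so the score matrix grows with the square of the sequence length (this ignores FlashAttention-style optimizations and is illustrative only):

```python
# Doubling the context roughly quadruples the number of pairwise attention
# scores in naive self-attention.
for n in (8_000, 32_000, 128_000, 1_000_000):
    pairs = n * n
    print(f"{n:>9,} tokens -> {pairs:.2e} attention pairs")
```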
In practice, RAG and careful prompt structuring often beat naively filling the window.
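A minimal sketch of that retrieve-then-prompt pattern, with a toy keyword-overlap scorer standing in for a real embedding retriever and the same 4-characters-per-token estimate as above (all names here are illustrative):

```python
# Retrieve-then-prompt sketch: instead of stuffing every chunk into the window,
# score chunks against the question and keep only the best ones that fit a
# token budget.

def score(chunk: str, question: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(chunk.lower().split()) & set(question.lower().split()))

def build_prompt(question: str, chunks: list[str], budget_tokens: int = 4_000) -> str:
    """Select the highest-scoring chunks that fit the budget, then assemble a prompt."""
    ranked = sorted(chunks, key=lambda c: score(c, question), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        cost = len(chunk) // 4          # ~4 characters per token heuristic
        if used + cost <= budget_tokens:
            selected.append(chunk)
            used += cost
    context = "\n\n".join(selected)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"
```

The same budget-aware selection applies whether the scorer is keyword overlap, BM25, or embedding similarity; only the score function changes.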