Foundations

Tokenization

The process of splitting text into smaller units called tokens — words, subwords, or characters — that a language model can process numerically.

Tokenization is the first step in feeding text to a large language model — the process of breaking raw text into smaller units called tokens. A token can be a whole word, a subword (prefix or suffix), a single character, or even a few bytes. Each token is then assigned a unique integer ID from the model’s fixed vocabulary (typically 30,000 to 200,000 entries).
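
A minimal sketch with OpenAI's open-source tiktoken library shows the text-to-IDs round trip (assumes tiktoken is installed; the exact IDs depend on the encoding chosen):

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-3.5/GPT-4

    ids = enc.encode("Tokenization splits text into tokens.")
    print(ids)               # a list of integer token IDs
    print(len(ids))          # token count
    print(enc.decode(ids))   # decoding round-trips back to the original text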

Almost all modern LLMs use some variant of subword tokenization:

  • Byte-Pair Encoding (BPE) — used by GPT models; starts from individual bytes and merges the most frequent adjacent pairs (see the toy merge loop after this list)
  • WordPiece — BERT family; similar to BPE with a different merge criterion
  • SentencePiece (BPE or Unigram) — Llama, T5, many multilingual models; a library that operates directly on raw text without whitespace pre-splitting
  • tiktoken — OpenAI's fast open-source BPE implementation (a library rather than a distinct algorithm), used for GPT-3.5/4/5
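
To make the BPE idea concrete, here is a toy merge loop. It works at the character level for readability (production BPE starts from bytes and trains on a large corpus), and the merges it learns are illustrative only:

    from collections import Counter

    def pair_counts(tokens):
        # Count adjacent token pairs across the sequence.
        return Counter(zip(tokens, tokens[1:]))

    def merge(tokens, pair, new_token):
        # Replace every occurrence of `pair` with the merged token.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_token)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    tokens = list("low lower lowest")   # start from single characters
    for step in range(4):
        counts = pair_counts(tokens)
        best = max(counts, key=counts.get)               # most frequent adjacent pair
        tokens = merge(tokens, best, best[0] + best[1])  # record the merge rule
        print(step, best, tokens)

After a few merges, frequent fragments such as "low" emerge as single tokens, which is exactly how a real BPE vocabulary is built up.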

Tokenization directly affects cost and context-window usage: APIs charge per token, and context length is measured in tokens, not characters. English text averages roughly 1.3 tokens per word (equivalently, one token ≈ 0.75 words); Croatian, German, or Chinese may use 1.5–3× more tokens for the same content, so non-English prompts are both a linguistic and an economic challenge.
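
A rough cost estimate therefore reduces to counting tokens. A minimal sketch, again using tiktoken and a hypothetical per-1K-token price (check your provider's actual rates):

    import tiktoken

    PRICE_PER_1K_TOKENS = 0.0005   # hypothetical rate, for illustration only

    enc = tiktoken.get_encoding("cl100k_base")
    for text in ["The weather is nice today.",
                 "Das Wetter ist heute schön."]:
        n = len(enc.encode(text))
        cost = n / 1000 * PRICE_PER_1K_TOKENS
        print(f"{n:2d} tokens, est. ${cost:.6f}  <- {text!r}")

Running the same sentence through the tokenizer in different languages makes the cost gap directly visible.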

After tokenization, each ID is mapped to an embedding vector and passed into the transformer layers of the model.
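
That mapping is essentially a row lookup in a learned matrix. A minimal NumPy sketch with illustrative sizes (real models learn this table during training, and the IDs below are arbitrary):

    import numpy as np

    vocab_size, d_model = 50_000, 768        # illustrative dimensions
    rng = np.random.default_rng(0)

    # The embedding table: one vector per vocabulary entry.
    table = rng.standard_normal((vocab_size, d_model)).astype(np.float32)

    token_ids = np.array([464, 3303, 2746])  # arbitrary example IDs
    embeddings = table[token_ids]            # plain row lookup
    print(embeddings.shape)                  # (3, 768)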
