Foundations

Attention mechanism

A neural network technique that lets a model weigh the relevance of each input token to every other, forming the core of modern transformers.

The attention mechanism is a technique in neural networks that lets a model decide, for every position in an input sequence, how much each other position matters when producing an output. Instead of processing tokens strictly in order, attention assigns a weight to every pair of positions and aggregates information accordingly.

The dominant variant in modern AI is self-attention, where queries, keys, and values are all derived from the same sequence. Each token's query is compared with every token's key via a dot product (scaled by the square root of the key dimension), the resulting scores are passed through a softmax to form weights, and the output is a weighted sum of value vectors. Multi-head attention runs this operation in parallel across several learned subspaces and concatenates the results.
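
A minimal sketch of this computation, using NumPy with toy dimensions chosen purely for illustration (the projection matrices here are random stand-ins for learned weights, and the output projection used in real transformers is omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence.

    X: (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_head) learned projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # queries, keys, values from the same sequence
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # similarity of every position with every other
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V                          # weighted sum of value vectors

def multi_head_attention(X, heads):
    """Run attention in parallel over several learned subspaces and concatenate."""
    return np.concatenate(
        [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads], axis=-1
    )

# Toy usage: 5 tokens, model width 16, two heads of width 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
heads = [tuple(rng.normal(size=(16, 8)) for _ in range(3)) for _ in range(2)]
out = multi_head_attention(X, heads)            # shape (5, 16)
```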

Attention was introduced for machine translation in 2014 (Bahdanau et al.) and reframed as the central building block in 2017 with Attention Is All You Need, the paper that defined the transformer architecture. Removing recurrence in favour of pure attention enabled massive parallelism on GPUs and made today’s large language models practical.

Variants such as FlashAttention (which reorders the exact computation to cut memory traffic), sliding-window attention (which restricts each token to a local neighbourhood of positions), and grouped-query attention (which shares key and value heads across groups of query heads) reduce memory and compute cost, helping context windows grow from a few thousand to millions of tokens. The sketch below illustrates one of these variants.
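
As an illustration of how such a variant changes the computation, this sketch (an assumption for exposition, not any particular library's implementation) applies a sliding-window mask so each token attends only to itself and a fixed number of preceding positions, keeping per-token work constant as the sequence grows:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask: position i may attend only to positions i-window+1 .. i."""
    idx = np.arange(seq_len)
    rel = idx[:, None] - idx[None, :]            # distance from query position i to key position j
    return (rel >= 0) & (rel < window)           # causal and within the window

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention with disallowed positions masked out."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = np.where(mask, scores, -np.inf)     # masked pairs get zero weight after softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: 8 tokens, head width 4, window of 3 (each token sees itself and the 2 before it).
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
out = masked_attention(Q, K, V, sliding_window_mask(8, 3))   # shape (8, 4)
```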
