Foundations
Transformer
The neural network architecture introduced in 2017 that powers virtually every modern large language model. Built around the self-attention mechanism.
The transformer is the deep learning architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. It replaced earlier recurrent architectures (RNNs, LSTMs) for language tasks and has become the backbone of virtually every state-of-the-art large language model: GPT, Claude, Gemini, Llama, Mistral, and DeepSeek are all transformers.
The transformer’s key innovation is the self-attention mechanism, which lets each position in a sequence attend to every other position in parallel. This eliminates the sequential bottleneck of RNNs, enables training on much longer contexts, and scales efficiently on modern GPUs and TPUs.
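As a concrete sketch, a minimal single-head version of scaled dot-product self-attention can be written in a few lines of NumPy (the dimensions and random weight matrices below are purely illustrative, not taken from the paper):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # project the input sequence into queries, keys, and values
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # every position scores every other position, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # softmax over positions turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output is a weighted mix of all value vectors in the sequence
    return weights @ V

# toy usage: 4 tokens, embedding width 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                         # token embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                 # shape (4, 8)
```

Because the attention weights for every position come out of a single matrix multiplication rather than one step at a time, the whole sequence can be processed in parallel.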
A transformer layer combines multi-head self-attention (several attention heads, each capturing a different view of the relationships in the data), a position-wise feed-forward network, layer normalization, and residual connections. Large models stack anywhere from a couple dozen to well over a hundred such layers. Variants include encoder-only (BERT), decoder-only (the GPT family), and encoder-decoder (T5, the original transformer).
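A minimal sketch of one such layer in PyTorch (using the pre-norm arrangement common in modern models; the widths and head count below are illustrative defaults, not prescriptive):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One transformer layer: multi-head self-attention plus a feed-forward
    network, each wrapped in layer normalization and a residual connection."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # attention sub-layer with residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # feed-forward sub-layer with residual connection
        x = x + self.ff(self.norm2(x))
        return x

# toy usage: a batch of 2 sequences, 16 tokens each, width 512
x = torch.randn(2, 16, 512)
y = TransformerBlock()(x)   # same shape as x; full models stack many such blocks
```

Roughly speaking, the encoder-only, decoder-only, and encoder-decoder variants differ in how such layers are masked (bidirectional vs. causal attention) and connected (encoder-decoder models add cross-attention between the two stacks).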
Beyond language, transformers now power vision (ViT), audio (Whisper), protein structure prediction (AlphaFold 2), and multimodal models. The architecture has also scaled remarkably well: increasing parameters and training data keeps delivering predictable capability gains, the empirical regularity behind today’s frontier models.
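For a rough quantitative sense of that regularity, one widely cited empirical form (the Chinchilla scaling law of Hoffmann et al., 2022, a later result rather than part of the original transformer work) models training loss as a power law in parameter count N and training tokens D:

```latex
% loss falls predictably as model size N and data size D grow
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

where E, A, B, alpha, and beta are constants fitted to observed training runs.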