Foundations
Transformer
The neural network architecture introduced in 2017 that powers virtually every modern large language model. Built around the self-attention mechanism.
The transformer is the deep learning architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. It replaced earlier recurrent architectures (RNNs, LSTMs) for language tasks and has become the backbone of virtually every state-of-the-art large language model: GPT, Claude, Gemini, Llama, Mistral, and DeepSeek are all transformers.
The transformer’s key innovation is the self-attention mechanism, which lets each position in a sequence attend to every other position in parallel. This eliminates the sequential bottleneck of RNNs, enables training on much longer contexts, and scales efficiently on modern GPUs and TPUs.
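As a concrete sketch, a minimal single-head version of scaled dot-product self-attention can be written in a few lines of NumPy (the dimensions and random weight matrices below are purely illustrative, not taken from the paper):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # project the input sequence into queries, keys, and values
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # every position scores every other position, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # softmax over positions turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output is a weighted mix of all value vectors in the sequence
    return weights @ V

# toy usage: 4 tokens, embedding width 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                         # token embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                 # shape (4, 8)
```

Because the attention weights for every position come out of a single matrix multiplication rather than one step at a time, the whole sequence can be processed in parallel.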
A transformer layer combines multi-head self-attention (several attention heads, each capturing a different view of the relationships in the data), a position-wise feed-forward network, layer normalization, and residual connections. Large models stack anywhere from a couple dozen to well over a hundred such layers. Variants include encoder-only (BERT), decoder-only (the GPT family), and encoder-decoder (T5, the original transformer).
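A minimal sketch of one such layer in PyTorch (using the pre-norm arrangement common in modern models; the widths and head count below are illustrative defaults, not prescriptive):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One transformer layer: multi-head self-attention plus a feed-forward
    network, each wrapped in layer normalization and a residual connection."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # attention sub-layer with residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # feed-forward sub-layer with residual connection
        x = x + self.ff(self.norm2(x))
        return x

# toy usage: a batch of 2 sequences, 16 tokens each, width 512
x = torch.randn(2, 16, 512)
y = TransformerBlock()(x)   # same shape as x; full models stack many such blocks
```

Roughly speaking, the encoder-only, decoder-only, and encoder-decoder variants differ in how such layers are masked (bidirectional vs. causal attention) and connected (encoder-decoder models add cross-attention between the two stacks).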
Beyond language, transformers now power vision (ViT), audio (Whisper), protein structure prediction (AlphaFold 2), and multimodal models. The architecture has also scaled remarkably well: increasing parameters and training data keeps delivering predictable capability gains, the empirical regularity behind today’s frontier models.
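For a rough quantitative sense of that regularity, one widely cited empirical form (the Chinchilla scaling law of Hoffmann et al., 2022, a later result rather than part of the original transformer work) models training loss as a power law in parameter count N and training tokens D:

```latex
% loss falls predictably as model size N and data size D grow
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

where E, A, B, alpha, and beta are constants fitted to observed training runs.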