Models
Diffusion model
A class of generative models that learn to reverse a gradual noising process; the dominant approach for AI-generated images, video, and audio today.
A diffusion model is a generative neural network that learns to produce data by reversing a gradual corruption process. During training, real samples — usually images — are progressively destroyed by adding Gaussian noise across many small steps. The model is taught to estimate the noise that was added at each step. At generation time the process runs in reverse: the model starts from pure noise and iteratively denoises it into a coherent image, video frame, or audio waveform.
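The training procedure described above can be sketched numerically. The following is a minimal illustration, not a real implementation: the linear β schedule follows the DDPM paper, but the `q_sample` helper name is ours and the "network" is a zero placeholder standing in for a trained noise predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule (values as in Ho et al., 2020).
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # per-step noise variances
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # cumulative signal retained at step t

def q_sample(x0, t, eps):
    """Forward process: sample x_t from q(x_t | x_0) in closed form."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# One training example: corrupt a clean sample at a random step, then
# score how well the model recovers the noise that was added.
x0 = rng.standard_normal(8)             # stand-in for a real data sample
t = int(rng.integers(0, T))
eps = rng.standard_normal(8)            # the Gaussian noise actually added
x_t = q_sample(x0, t, eps)

eps_pred = np.zeros_like(x_t)           # placeholder for a neural network's output
loss = float(np.mean((eps - eps_pred) ** 2))   # the simple DDPM objective
```

By the final step almost no signal survives (`alpha_bars[-1]` is on the order of 1e-5), which is why generation can begin from pure noise: the model denoises step by step using its noise estimates in reverse order, from t = T-1 down to 0.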
The modern wave began with Denoising Diffusion Probabilistic Models (Ho et al., 2020) and quickly displaced earlier GAN-based approaches because diffusion models are more stable to train, scale better with compute, and produce sharper, more diverse outputs. Latent diffusion (operating in a compressed embedding space rather than raw pixels) made high-resolution generation affordable and underlies systems such as Stable Diffusion, DALL-E 3, Midjourney, Imagen, Sora, and Veo.
The same framework, with different training data and conditioning, powers text-to-image, text-to-video, audio synthesis, 3D scene generation, and even some scientific applications such as protein structure generation. Diffusion is to generative media what the transformer is to language: the workhorse architecture of the current era of deep learning.