Diffusion models are generative models that learn to reverse a stochastic corruption process (diffusion). Starting from clean data, a forward diffusion process gradually adds noise until the data is nearly indistinguishable from pure noise. The model learns to reverse this process by predicting, at each step, either the added noise or the score (the gradient of the log probability density). Although the training objective is simple, diffusion models achieve state-of-the-art generation quality (images, video, audio, molecules) and provide a theoretically principled framework that connects score-based models, variational inference, and reverse-time stochastic differential equations from stochastic calculus. In this sense they unify several prior generative modeling approaches under a common framework.
Diffusion models represent a breakthrough in generative modeling, with state-of-the-art results in image, video, and audio generation; well-known image systems include DALL-E 2, Imagen, and Stable Diffusion. The core idea is elegant: learn to reverse a stochastic diffusion process that gradually corrupts data into noise.
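Forward Diffusion Process: Start with a clean data sample x_0 and iteratively add Gaussian noise. Because a sum of independent Gaussians is again Gaussian, the noisy sample at any step t can be written in closed form: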
x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * epsilon
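where epsilon ~ N(0, I) and alpha_t is the cumulative signal coefficient, which decreases from 1 to near 0 over T steps (the DDPM literature writes this cumulative quantity with a bar over alpha). After T steps (typically 1000), x_T is nearly pure Gaussian noise. The forward process is stochastic, but it has no learned parameters: it is fully specified by the noise schedule {alpha_t}.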
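The closed-form forward step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names make_alpha_schedule and q_sample are hypothetical, and the linear per-step noise schedule (with endpoints 1e-4 and 0.02) is an assumption chosen for the example; the text above does not prescribe a particular schedule.

```python
import numpy as np

def make_alpha_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Return cumulative coefficients alpha_0..alpha_T, with alpha_0 = 1.

    Assumption: per-step noise variances grow linearly from beta_start
    to beta_end; alpha_t is the running product of (1 - beta).
    """
    betas = np.linspace(beta_start, beta_end, T)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def q_sample(x0, t, alphas, rng):
    """Draw x_t given x_0 in one shot; returns (x_t, epsilon)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas[t]) * x0 + np.sqrt(1.0 - alphas[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alphas = make_alpha_schedule()
x0 = np.ones((4, 2))                          # toy "clean" batch
x_mid, _ = q_sample(x0, 500, alphas, rng)     # partly corrupted
x_T, _ = q_sample(x0, 1000, alphas, rng)      # almost pure noise
```

Because x_t is drawn directly from x_0, there is no need to simulate the intermediate steps, which is what makes training on randomly chosen steps cheap.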
Reverse Process: The generative model learns to reverse this process by predicting x_{t-1} from x_t. Equivalently, it can predict the noise epsilon contained in x_t, or the score function (the gradient of log p(x_t) with respect to x_t); the noise and score parameterizations differ only by a known scale factor. The reverse process is stochastic: p(x_{t-1} | x_t) = N(x_{t-1} | mu_t, sigma_t^2 I), where the mean mu_t depends on x_t and the predicted noise.
Training: A neural network epsilon_theta(x_t, t) is trained to predict the noise epsilon that was used to produce x_t, given the noisy sample x_t and the step number t. The loss is simple: the squared error ||epsilon - epsilon_theta(x_t, t)||^2, averaged over data samples, steps, and noise draws. During training, a random step t is chosen (typically uniformly from 1 to T), the data is corrupted to x_t, and the network predicts the noise. This is called the denoising objective.
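One evaluation of the denoising objective on a minibatch might look like the sketch below. It is illustrative only: denoising_loss and zero_model are hypothetical names, the linear schedule is an assumption, and zero_model is a placeholder that always predicts zero noise, standing in for a real network epsilon_theta. The loss is reported as a mean over elements, which is proportional to the squared norm in the text.

```python
import numpy as np

def denoising_loss(eps_model, x0, alphas, rng):
    """Mean squared error between true and predicted noise for one minibatch."""
    batch = x0.shape[0]
    T = len(alphas) - 1
    t = rng.integers(1, T + 1, size=batch)     # one random step per sample
    eps = rng.standard_normal(x0.shape)        # the noise the model must recover
    a = alphas[t][:, None]                     # broadcast over the feature axis
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    pred = eps_model(x_t, t)
    return np.mean((eps - pred) ** 2)

def zero_model(x_t, t):
    """Placeholder predictor (assumption): always outputs zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
T = 1000
# Assumed linear schedule; alphas[t] is the cumulative coefficient, alphas[0] = 1.
alphas = np.concatenate([[1.0], np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))])
x0 = rng.standard_normal((4096, 2))            # toy 2-D data
loss = denoising_loss(zero_model, x0, alphas, rng)
```

With the zero predictor the loss is just the average of epsilon squared, so it sits close to 1; a trained network should drive it below that baseline.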
Sampling: To generate, start with x_T ~ N(0, I) and iteratively apply the reverse process for t = T, T-1, ..., 1:
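x_{t-1} = 1/sqrt(a_t) * (x_t - (1 - a_t) / sqrt(1 - alpha_t) * epsilon_theta(x_t, t)) + sigma_t * z
where z ~ N(0, I), alpha_t is the cumulative coefficient from the forward process, and a_t = alpha_t / alpha_{t-1} (with alpha_0 = 1) is the per-step factor. A common choice for the noise scale is sigma_t^2 = 1 - a_t. Sampling is thus a chain of denoising steps that progressively turns pure noise into structured data.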
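The sampling loop can be sketched end to end on a toy problem where no training is needed. The assumptions are stated loudly: the data are taken to be x_0 ~ N(0, DATA_STD^2 I), for which the optimal noise predictor has a closed form that stands in for a trained network; the schedule is an assumed linear one; sigma_t^2 is set to 1 - a_t, one common choice; and noise is skipped at the last step, a frequent convention. Here alphas[t] is the cumulative coefficient alpha_t and a_t = alpha_t / alpha_{t-1} is the per-step factor. All names are hypothetical.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                          # assumed linear schedule
alphas = np.concatenate([[1.0], np.cumprod(1.0 - betas)])   # cumulative, alphas[0] = 1
DATA_STD = 2.0                                              # toy data: x_0 ~ N(0, DATA_STD^2 I)

def eps_model(x_t, t):
    """Closed-form E[epsilon | x_t] for the toy Gaussian data.

    Stand-in for a trained network epsilon_theta (assumption).
    """
    var_t = alphas[t] * DATA_STD**2 + (1.0 - alphas[t])     # variance of x_t
    return np.sqrt(1.0 - alphas[t]) * x_t / var_t

def sample(n, dim, rng):
    x = rng.standard_normal((n, dim))                       # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        a_t = alphas[t] / alphas[t - 1]                     # per-step factor
        eps_hat = eps_model(x, t)
        mean = (x - (1.0 - a_t) / np.sqrt(1.0 - alphas[t]) * eps_hat) / np.sqrt(a_t)
        sigma_t = np.sqrt(1.0 - a_t)                        # one common choice
        z = rng.standard_normal(x.shape) if t > 1 else 0.0  # no noise on the final step
        x = mean + sigma_t * z
    return x

samples = sample(5000, 2, np.random.default_rng(0))
```

If the loop is correct, the empirical standard deviation of samples should land near DATA_STD, which gives a quick sanity check before swapping in a learned model.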
Theoretical Foundations:
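1. Stochastic Calculus: In continuous time, the forward process is a stochastic differential equation whose marginal densities evolve according to the Fokker-Planck (Kolmogorov forward) equation. Its time reversal is also a diffusion, with a drift that depends on the score function. In the discrete setting, the reverse step conditioned on x_0 follows in closed form from the forward process via Bayes' rule.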
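2. Variational Inference: A diffusion model can be viewed as a latent-variable model trained by maximizing a variational lower bound (ELBO) on the data log-likelihood. The simple noise-prediction objective corresponds to a reweighted version of the terms in this bound rather than to the bound itself.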
3. Score-Based Generative Modeling: The score function (gradient of log p(x)) characterizes the data distribution. Because the score determines the density up to its normalizing constant, learning the score amounts to learning the distribution. Score-based models have a long history (Stein discrepancy, energy-based models); diffusion makes score learning practical.
4. Connection to Probability Flow ODEs: The reverse process can be reformulated as a deterministic ODE (ordinary differential equation) with the same marginal distributions, enabling faster generation with standard ODE solvers and no injected noise.
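The link between noise prediction and the score in the list above can be checked numerically. For the forward process x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * epsilon, the score of the noisy marginal equals -E[epsilon | x_t] / sqrt(1 - alpha_t). The sketch below assumes one-dimensional Gaussian data, so that the exact score is known and the best noise predictor is linear; the variable names and the chosen values of data_std and alpha_t are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data_std, alpha_t = 2.0, 0.5          # toy data scale and one cumulative coefficient
n = 200_000

x0 = data_std * rng.standard_normal(n)
eps = rng.standard_normal(n)
x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * eps

# Least-squares fit eps ~ w * x_t: the best linear noise predictor,
# which is the optimal predictor when everything is Gaussian.
w = np.dot(x_t, eps) / np.dot(x_t, x_t)

# Exact score of the noisy marginal N(0, var_t) is -x / var_t.
var_t = alpha_t * data_std**2 + (1.0 - alpha_t)
score_slope_exact = -1.0 / var_t

# Score recovered from the fitted noise predictor.
score_slope_from_eps = -w / np.sqrt(1.0 - alpha_t)
```

The two slopes should agree up to sampling error, which illustrates why a network trained to predict noise can also be read as a score estimator.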
Diffusion models have become a dominant approach in generative modeling, with uses beyond unconditional generation (inpainting, super-resolution, editing) and emerging applications in molecular design, drug discovery, and scientific simulation.