Diffusion Models Theory

Research Depth 78 in the knowledge graph
diffusion-models generative-models score-matching denoising

Core Idea

Diffusion models are generative models that learn to reverse a stochastic corruption process (diffusion). Starting from clean data, a forward diffusion process gradually adds noise until the data is nearly pure noise. The model learns to reverse this process by predicting the noise or the score (the gradient of the log probability density) at each step. Despite their simple training objective, diffusion models achieve state-of-the-art generation quality across images, video, audio, and molecules, and they provide a theoretically principled framework that connects score-based models, variational inference, and the time reversal of diffusion processes from stochastic calculus. In this sense, diffusion models unify several earlier generative modeling approaches under a common framework.

Explainer

Diffusion models represent a breakthrough in generative modeling, achieving state-of-the-art results in image, video, and audio generation; well-known image systems include DALL-E 2, Imagen, and Stable Diffusion. The core idea is elegant: learn to reverse a stochastic diffusion process that gradually corrupts data into noise.

Forward Diffusion Process: Start with a clean data sample x_0 and add Gaussian noise over many small steps. The noisy sample at step t can be written in closed form:

x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * epsilon

where epsilon ~ N(0, I) and alpha_t is the cumulative signal level at step t, which decreases from 1 to near 0 over T steps. After T steps (typically 1000), x_T is nearly pure Gaussian noise. The forward process is itself random, but it has no learned parameters: it is fully specified by the noise schedule {alpha_t}.
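The closed-form corruption above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the linear per-step variance schedule, its endpoints, and the helper names make_alpha_schedule and q_sample are assumptions introduced here.

```python
import numpy as np

def make_alpha_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Return cumulative signal levels alpha_0..alpha_T, with alpha_0 = 1.

    Assumption: a linear schedule of per-step noise variances (betas);
    the text only requires alpha_t to fall from 1 to near 0.
    """
    betas = np.linspace(beta_start, beta_end, T)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def q_sample(x0, t, alphas, rng):
    """Draw x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * epsilon directly."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas[t]) * x0 + np.sqrt(1.0 - alphas[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alphas = make_alpha_schedule()
x0 = np.ones((4, 2))                          # toy "clean data"
x_mid, eps_mid = q_sample(x0, 500, alphas, rng)   # partly corrupted
x_T, eps_T = q_sample(x0, 1000, alphas, rng)      # almost pure noise
```

Because x_t is available in closed form, training never has to simulate the chain step by step; any step t can be sampled directly from x_0.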

Reverse Process: The generative model learns to reverse this process by predicting x_{t-1} from x_t. Equivalently, it can predict the noise epsilon that produced x_t, or the score function (the gradient of log p(x_t) with respect to x_t). The reverse process is stochastic: p_theta(x_{t-1} | x_t) = N(x_{t-1}; mu_t, sigma_t^2 I), where the mean mu_t is computed from x_t and the predicted noise.
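The equivalence between predicting noise and predicting the score follows from the Gaussian form of the forward process. A short derivation, using the same symbols as the forward-process formula (the name s_theta for the score estimate is introduced here for illustration):

```latex
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\alpha_t}\,x_0,\ (1-\alpha_t)\,I\right)

\nabla_{x_t} \log q(x_t \mid x_0)
  = -\frac{x_t - \sqrt{\alpha_t}\,x_0}{1-\alpha_t}
  = -\frac{\epsilon}{\sqrt{1-\alpha_t}}

s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\alpha_t}}
```

So a network that predicts the noise can be rescaled into a score estimate, and vice versa.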

Training: A neural network epsilon_theta(x_t, t) is trained to predict the noise epsilon that was used to produce x_t, given the noisy sample x_t and the step index t. The loss is simple: ||epsilon - epsilon_theta(x_t, t)||^2, averaged over data samples, steps, and noise draws. During training, a random step t is chosen, the data is corrupted to x_t, and the network predicts the noise. This is called the denoising objective.
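One evaluation of the denoising objective might look like the sketch below. It is framework-free and only computes the loss; a real training loop would backpropagate through a neural network. The function name denoising_loss, the linear schedule, and the two stand-in predictors (a zero predictor and an oracle for data concentrated at a single point mu) are assumptions made purely for illustration.

```python
import numpy as np

def denoising_loss(eps_model, x0_batch, alphas, rng):
    """Monte Carlo estimate of ||epsilon - eps_model(x_t, t)||^2 over a batch."""
    T = len(alphas) - 1
    n = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=n)                 # one random step per example
    eps = rng.standard_normal(x0_batch.shape)          # noise to be predicted
    a = alphas[t].reshape(n, *([1] * (x0_batch.ndim - 1)))
    x_t = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * eps
    pred = eps_model(x_t, t)
    per_example = np.sum((eps - pred) ** 2, axis=tuple(range(1, x0_batch.ndim)))
    return float(np.mean(per_example))

# Assumed linear schedule; alphas[0] = 1 is the clean-data level.
alphas = np.concatenate([[1.0], np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))])
rng = np.random.default_rng(0)

mu = np.array([2.0, -1.0])
x0_batch = np.tile(mu, (4096, 1))                      # toy data: every sample equals mu

def zero_model(x_t, t):
    """Stand-in for an untrained network: always predicts zero noise."""
    return np.zeros_like(x_t)

def oracle_model(x_t, t):
    """Exact noise predictor, valid only because all data equals mu."""
    a = alphas[t][:, None]
    return (x_t - np.sqrt(a) * mu) / np.sqrt(1.0 - a)

loss_zero = denoising_loss(zero_model, x0_batch, alphas, rng)
loss_oracle = denoising_loss(oracle_model, x0_batch, alphas, rng)
```

The zero predictor incurs a loss close to the data dimension (the expected squared norm of epsilon), while the oracle drives the loss to essentially zero, which is the behavior training aims for.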

Sampling: To generate, start with x_T ~ N(0, I) and iteratively apply the reverse process for t = T, T-1, ..., 1:

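x_{t-1} = 1/sqrt(a_t) * (x_t - (1 - a_t) / sqrt(1 - alpha_t) * epsilon_theta(x_t, t)) + sigma_t * z

where z ~ N(0, I), alpha_t is the cumulative signal level from the forward process, and a_t = alpha_t / alpha_{t-1} (with alpha_0 = 1) is the fraction of signal kept in the single step from t-1 to t. No noise is added at the final step (z = 0 when t = 1). Sampling is therefore a chain of small reversals that progressively denoise pure noise into structured data.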
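The sampling loop can be sketched as follows. The sketch assumes that alpha_t denotes the cumulative signal level from the forward-process formula, so the per-step factor is alpha_t / alpha_{t-1}. It also assumes sigma_t^2 equal to the per-step noise variance, which is one common choice and is not stated in the text. The name ddpm_sample and the oracle predictor for point-mass data are hypothetical, used only so the example runs without a trained network.

```python
import numpy as np

def ddpm_sample(eps_model, shape, alphas, rng):
    """Ancestral sampling: start from noise and apply the reverse step T times."""
    T = len(alphas) - 1
    x = rng.standard_normal(shape)                      # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        a_step = alphas[t] / alphas[t - 1]              # signal kept in step t
        eps_hat = eps_model(x, t)
        mean = (x - (1.0 - a_step) / np.sqrt(1.0 - alphas[t]) * eps_hat) / np.sqrt(a_step)
        sigma = np.sqrt(1.0 - a_step)                   # assumed choice of sigma_t
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)  # no noise at t = 1
        x = mean + sigma * z
    return x

# Assumed linear schedule; alphas[0] = 1 is the clean-data level.
alphas = np.concatenate([[1.0], np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))])
mu = np.array([2.0, -1.0])

def oracle_model(x_t, t):
    """Exact noise predictor for data concentrated at mu (illustration only)."""
    return (x_t - np.sqrt(alphas[t]) * mu) / np.sqrt(1.0 - alphas[t])

rng = np.random.default_rng(0)
samples = ddpm_sample(oracle_model, (5, 2), alphas, rng)
```

With the oracle predictor, every chain ends at mu, the only point in the toy data distribution. With a trained network in its place, the same loop would produce samples from the learned distribution.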

Theoretical Foundations:

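1. Stochastic Calculus: In continuous time, the forward process is a stochastic differential equation whose marginal densities evolve according to the Fokker-Planck (Kolmogorov forward) equation. Its time reversal is again a diffusion, with a drift that depends on the score function. In discrete time, the reverse transitions can be derived from the forward process via Bayes' rule.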

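2. Variational Inference: Training a diffusion model can be viewed as maximizing a variational lower bound on the data log-likelihood, with the noisy samples x_1, ..., x_T acting as latent variables. The simplified noise-prediction loss corresponds to a reweighted version of the terms in that bound.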

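3. Score-Based Generative Modeling: The score function (gradient of log p(x)) characterizes the data distribution: it determines the density up to its normalizing constant, so learning the score is enough to sample from the distribution. Score-based models have a long history (score matching, Stein discrepancy, energy-based models); diffusion makes score learning practical by estimating the scores of noise-perturbed data across many noise levels.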

4. Connection to Probability Flow ODEs: The reverse process can be reformulated as an ODE (ordinary differential equation), enabling fast generation via ODE solvers without stochasticity.
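As an illustration of deterministic generation, the sketch below uses a DDIM-style update, which is commonly described as a discretization of the probability flow ODE. It is a swapped-in concrete technique, not a general-purpose ODE solver and not something the text above specifies. The names deterministic_sample and oracle_model, the linear schedule, and the choice of 20 steps are assumptions.

```python
import numpy as np

def deterministic_sample(eps_model, shape, alphas, steps, rng):
    """DDIM-style sampling: noise enters only through the initial x_T."""
    T = len(alphas) - 1
    ts = np.linspace(T, 0, steps + 1).round().astype(int)   # coarse grid T -> 0
    x = rng.standard_normal(shape)
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps_hat = eps_model(x, t)
        # Estimate the clean sample implied by the current noise prediction.
        x0_hat = (x - np.sqrt(1.0 - alphas[t]) * eps_hat) / np.sqrt(alphas[t])
        # Move to the lower noise level without injecting fresh noise.
        x = np.sqrt(alphas[t_prev]) * x0_hat + np.sqrt(1.0 - alphas[t_prev]) * eps_hat
    return x

# Assumed linear schedule; alphas[0] = 1 is the clean-data level.
alphas = np.concatenate([[1.0], np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))])
mu = np.array([2.0, -1.0])

def oracle_model(x_t, t):
    """Exact noise predictor for data concentrated at mu (illustration only)."""
    return (x_t - np.sqrt(alphas[t]) * mu) / np.sqrt(1.0 - alphas[t])

out_a = deterministic_sample(oracle_model, (5, 2), alphas, 20, np.random.default_rng(0))
out_b = deterministic_sample(oracle_model, (5, 2), alphas, 20, np.random.default_rng(0))
```

Here 20 network evaluations replace the 1000 steps of the stochastic chain, and the same initial noise always maps to the same output, which is what makes this formulation useful for fast sampling and for interpolating between samples.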

Key Advantages:

Challenges and Limitations:

Recent Extensions:

Diffusion models have become dominant in generative modeling, with applications beyond generation (in-painting, super-resolution, editing) and emerging applications in molecular design, drug discovery, and scientific simulation.
