Question 1 Multiple Choice
In a diffusion model, the forward process gradually adds noise to data. What is the purpose of learning to reverse this process?
A. To compress data; the reverse process learns lossy compression
B. To generate new samples: starting from pure noise, iteratively applying the learned reverse process produces samples from the data distribution
C. To classify images; the reverse process learns to assign labels
D. To reduce noise in corrupted images; the reverse process learns to denoise
The forward diffusion process gradually destroys structure in data, converting it to noise. The reverse process reconstructs structure from noise. By learning the reverse process and applying it starting from pure Gaussian noise, you can generate samples that approximately follow the original data distribution. This is generative modeling: the model learns to transform noise into realistic samples, and because every run starts from a fresh noise draw, it can be repeated as often as needed to produce diverse outputs. Denoising (option D) is how the reverse process is trained, not what it is for.
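The generation loop described above can be sketched on a one-dimensional toy problem. This is a minimal illustration under stated assumptions, not a reference implementation: the data distribution is assumed to be a Gaussian, so the ideal noise prediction is available in closed form and stands in for a trained network; the linear noise schedule (1000 steps, variances from 1e-4 to 0.02) and all function names are hypothetical choices made for this example.

```python
import math
import random


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances (assumed linear schedule, for illustration)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) up to and including step t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def ideal_noise_prediction(x_t, alpha_bar_t, data_mean, data_std):
    """Exact E[noise | x_t] when the data is N(data_mean, data_std**2).

    Stands in for a trained network; a closed form exists only because
    the toy data distribution is Gaussian.
    """
    marginal_var = alpha_bar_t * data_std ** 2 + (1.0 - alpha_bar_t)
    centered = x_t - math.sqrt(alpha_bar_t) * data_mean
    return math.sqrt(1.0 - alpha_bar_t) * centered / marginal_var


def generate(num_samples, data_mean, data_std, rng, betas=None):
    """Draw samples by running the reverse process from pure noise."""
    betas = betas if betas is not None else linear_betas()
    alpha_bars = cumulative_alphas(betas)
    samples = []
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)  # start from pure Gaussian noise
        for t in reversed(range(len(betas))):
            eps_hat = ideal_noise_prediction(x, alpha_bars[t], data_mean, data_std)
            # Subtract this step's share of the predicted noise, then rescale.
            x = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(1.0 - betas[t])
            if t > 0:
                # Re-inject a small amount of fresh noise on all but the last step.
                x += math.sqrt(betas[t]) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples
```

Calling `generate(400, data_mean=2.0, data_std=0.5, rng=random.Random(0))` should return values whose mean and spread are close to those of the assumed data distribution, even though every sample started as standard Gaussian noise.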
Question 2 Short Answer
The diffusion model objective uses score matching: predicting the gradient of log probability (score). How does this relate to denoising?
Model answer: Score matching is equivalent, up to a known scaling factor, to predicting the noise added during the diffusion process. Specifically, the score of the noised data distribution at a given noise level (the gradient of log p(x_t) with respect to x_t) equals the negative of the expected added noise given x_t, divided by the noise standard deviation at that level. A network trained to predict the noise (denoising) therefore learns the score function up to this factor. This connection enables efficient training: the score of the data distribution is unknown and cannot serve as a regression target, whereas the noise added to each training example is known exactly, so the model is simply trained to predict it. Denoising is intuitive and stable to train, which makes score-based diffusion models practical.
The equivalence between score matching and denoising is a key insight that makes diffusion models tractable. Denoising is a well-understood task (standard in image processing), so practitioners have existing intuition and architectural innovations to draw on. Training to predict noise is also a plain regression problem with a known target, which keeps it numerically stable and efficient and avoids ever evaluating the score of the unknown data density.
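The link between score and noise can be checked numerically for a single noised example. The sketch below assumes the common variance-preserving parameterization x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps (an assumption, since the quiz does not fix a parameterization), and compares a finite-difference gradient of log q(x_t | x0) against -eps / sqrt(1 - alpha_bar). The function names and the numeric values are illustrative only. Note that this checks the conditional score for one example; a network trained on many examples learns the corresponding average, which is the score of the noised data distribution.

```python
import math


def log_q(x_t, x0, alpha_bar):
    """Log density of q(x_t | x0) = N(sqrt(alpha_bar) * x0, 1 - alpha_bar)."""
    var = 1.0 - alpha_bar
    diff = x_t - math.sqrt(alpha_bar) * x0
    return -0.5 * math.log(2.0 * math.pi * var) - diff * diff / (2.0 * var)


def score_by_finite_difference(x_t, x0, alpha_bar, h=1e-5):
    """Numerical derivative of log q(x_t | x0) with respect to x_t."""
    return (log_q(x_t + h, x0, alpha_bar) - log_q(x_t - h, x0, alpha_bar)) / (2.0 * h)


def score_from_noise(eps, alpha_bar):
    """Score implied by the added noise: -eps / sqrt(1 - alpha_bar)."""
    return -eps / math.sqrt(1.0 - alpha_bar)


# Arbitrary illustrative values for the clean point, the noise, and the level.
x0, eps, alpha_bar = 1.5, -0.7, 0.6
x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps

numeric = score_by_finite_difference(x_t, x0, alpha_bar)
analytic = score_from_noise(eps, alpha_bar)
print(numeric, analytic)  # the two values should agree closely
```

Because the two quantities differ only by a known factor, regressing on the noise and regressing on the score are the same learning problem with differently scaled targets.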
Question 3 Multiple Choice
Diffusion models gradually add noise over many steps (typically 1000 or more). Why not just add all noise in one step?
A. Multiple steps have no advantage; one-step diffusion works equally well
B. Multiple steps enable predicting small, local changes, making the learning problem tractable; jumping straight to noise loses all information about the data structure
C. Multiple steps are required for computational efficiency; one-step would be too slow
D. The number of steps is irrelevant as long as you reach pure noise
Gradual diffusion means each step adds only a small perturbation, so the network only has to learn how to undo a small amount of corruption at a time, a much easier task than reconstructing data from pure noise in one jump. Additionally, the network sees examples at many noise levels during training, so it learns a noise-robust understanding of structure. One-step diffusion would require mapping pure noise directly to data with no intermediate guidance, which is far harder and unlikely to work well.
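To make "small per-step changes" concrete, the sketch below summarizes an assumed linear schedule of 1000 steps with per-step noise variances from 1e-4 to 0.02. These numbers are illustrative assumptions rather than values given in the question, and the function names are hypothetical. It compares how much signal a single step preserves with how much survives after all steps.

```python
import math


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances (assumed linear schedule, for illustration)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def schedule_summary(betas):
    """Compare the effect of one step with the effect of the whole chain."""
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta  # fraction of signal variance kept so far
    largest_beta = max(betas)
    return {
        # Worst case for a single step: how much it scales the signal down
        # and how much noise it adds.
        "smallest_single_step_signal_scale": math.sqrt(1.0 - largest_beta),
        "largest_single_step_noise_std": math.sqrt(largest_beta),
        # After every step has been applied: what is left of the signal.
        "final_signal_scale": math.sqrt(alpha_bar),
        "final_noise_std": math.sqrt(1.0 - alpha_bar),
    }


summary = schedule_summary(linear_betas())
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions every individual step keeps the signal almost intact, yet the chain as a whole leaves essentially none of it, which is why the reverse process can be learned as many easy steps instead of one hard one.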
Question 4 True / False
Diffusion models are related to both VAEs and score-based generative models. True or false: diffusion models typically achieve better sample quality than standard VAEs.
T. True
F. False
Answer: True
Diffusion models typically achieve better sample quality than standard VAEs. This is due to their iterative refinement process: each reverse step improves the sample gradually, enabling high-fidelity generation. VAEs use a single-shot decoder that generates a sample in one pass, which is fast but often produces blurrier results because the decoder tends to average over plausible outputs. Diffusion models trade speed for quality, and have produced state-of-the-art results in image generation. Recent work on fast sampling for diffusion models (distillation, consistency models) aims to reduce this speed penalty.