5 questions to test your understanding
A standard autoencoder trained on face images fails to generate new faces when you sample random points from its latent space: the decoder produces noise. Why, and how does a VAE fix this?
The reparameterization trick in VAE training rewrites the sampling step as z = μ + σ·ε where ε ~ N(0,1). Why is this substitution necessary?
True or false: Removing the KL divergence term from the VAE loss (training with reconstruction loss only) would cause the latent space to become unstructured, degrading generative capability.
True or false: VAEs typically produce sharper, more detailed image outputs than GANs because the KL regularization enforces a well-organized latent space.
Why does the KL divergence term in the ELBO loss force the latent space to become structured and usable for generation, rather than just functioning as an arbitrary regularizer?
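
For reference while working through the questions, the two quantities they mention can be sketched in a few lines: the reparameterized sample z = μ + σ·ε and the KL term against a standard normal prior. This is a minimal sketch rather than a full VAE. It assumes a diagonal Gaussian encoder that outputs a mean and a log-variance per latent dimension, and the function names `reparameterize` and `kl_to_standard_normal` are illustrative, not taken from any library.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps elementwise, with eps ~ N(0, 1).

    Assumption: the encoder outputs log-variance, so sigma = exp(0.5 * log_var).
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical encoder outputs for a 2-dimensional latent space.
mu = [0.5, -1.0]
log_var = [0.0, -2.0]

z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)

# When the encoder output already matches the prior, the KL term is zero.
kl_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
```

In this form the only source of randomness is `eps`, while `mu` and `log_var` enter through ordinary arithmetic.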