5 questions to test your understanding
A model with batch normalization performs well during training but gives poor results at deployment. Training used batch size 64; deployment processes one image at a time. What is the most likely cause?
A researcher removes the learnable scale (γ) and shift (β) parameters from all batch normalization layers, leaving only the normalization step. What is the likely consequence?
True or false: Batch normalization cannot reduce a network's representational capacity because the learnable parameters γ and β allow the network to recover any unnormalized distribution if gradient descent finds it useful.
True or false: During training, batch normalization uses population statistics computed over the entire training dataset to normalize each layer's inputs.
Why does batch normalization behave differently at training time versus inference time, and what bug does this difference commonly cause?
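To check your answers to the questions above, it can help to run a toy version of the layer. The following is a minimal sketch for a single scalar feature, not any library's implementation. The class name BatchNorm1D, the momentum value 0.1, the exponential-moving-average update rule, and the use of the biased variance for the running estimate are all assumptions chosen for illustration; real frameworks differ in these details.

```python
import math

EPS = 1e-5  # small constant to avoid division by zero (assumed value)


def batch_stats(xs):
    """Return the mean and biased variance of a list of floats."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var


class BatchNorm1D:
    """Hypothetical toy batch norm over one scalar feature."""

    def __init__(self, momentum=0.1):
        self.momentum = momentum
        self.gamma = 1.0          # learnable scale (held fixed in this sketch)
        self.beta = 0.0           # learnable shift (held fixed in this sketch)
        self.running_mean = 0.0   # running estimate used at inference
        self.running_var = 1.0

    def forward(self, xs, training):
        if training:
            # Training mode: normalize with the current mini-batch statistics
            # and update the running estimates as a side effect.
            mean, var = batch_stats(xs)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference mode: normalize with the stored running estimates.
            mean, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + EPS)
        return [scale * (x - mean) + self.beta for x in xs]


bn = BatchNorm1D()
for _ in range(200):
    # Repeated training batches with mean 10 and biased variance 2.
    bn.forward([8.0, 9.0, 10.0, 11.0, 12.0], training=True)

# A single input at deployment, handled in inference mode.
eval_out = bn.forward([12.0], training=False)

# The same single input with the layer mistakenly left in training mode:
# a batch of one has mean equal to the input and zero variance.
train_out = bn.forward([12.0], training=True)

print(eval_out, train_out)
```

Comparing eval_out with train_out shows how the same input is treated in the two modes when the batch contains only one example. Note that the mistaken training-mode call also changes the running estimates, so the order of the last two calls matters.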