5 questions to test your understanding
Why did the transformer architecture enable scaling language models to hundreds of billions of parameters when LSTM-based architectures could not practically reach that scale?
If positional encodings were completely removed from a transformer — with all other components unchanged — what would happen to the model's behavior?
True or false: In a well-trained transformer, different attention heads within the same multi-head attention layer can specialize to capture different types of relationships simultaneously.
True or false: During transformer training, all sequence positions can be processed simultaneously because self-attention does not maintain or read from any sequential hidden state.
Explain why self-attention in a transformer requires positional encodings to be explicitly added, whereas an LSTM processes order implicitly without any such mechanism.