Questions: Neural Language Models and Transformers
5 questions to test your understanding
Question 1 Multiple Choice
A transformer model, trained only on next-token prediction with no explicit grammatical rules, correctly handles subject-verb agreement across long embedded relative clauses in sentence types that appear rarely in its training data. What would this finding most strongly suggest?
A. The model has memorized the specific sentences from training data
B. Statistical pattern-matching over sufficient data can produce some degree of structural generalization, challenging the claim that LLMs purely match surface patterns
C. The model has an innate grammatical faculty equivalent to Universal Grammar
D. Long-distance dependencies are not actually processed by the attention mechanism
Answer: B
If the model handles structural patterns it rarely saw in training, this pushes back on the 'mere pattern-matching' critique and suggests that the statistical objective induces something resembling structural generalization. It does not show that the model has an innate grammar, since whatever it knows was learned from data, nor does it prove that the model fully understands structure. This is exactly the kind of evidence that makes the debate productive: it shows LLMs do more than memorize surface patterns, without definitively resolving whether they internalize grammar the way humans do.
Question 2 Multiple Choice
What problem with earlier sequential neural architectures does the transformer's attention mechanism directly solve?
A. Sequential models could not be parallelized during training, making them impossible to scale
B. Information from early in a sequence could fade out before the end, making long-range dependencies hard to capture; attention allows direct connections between any two positions
C. Sequential models could not process sentences longer than about 20 words
D. Attention allows the model to access external knowledge bases that sequential models could not
Answer: B
In sequential (RNN/LSTM) architectures, information about the word at position 1 must be threaded through every subsequent step to reach position 50, and it can decay or be overwritten along the way. The attention mechanism bypasses this: every position computes a weighted combination of all positions in the sequence at once. This makes it possible to connect 'knew' directly to 'lawyer' in 'The lawyer who the journalist interviewed knew the senator' without the intervening clause degrading the connection. Parallelization during training is also a benefit, but the conceptual advance is the direct position-to-position connection.
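To make the "weighted combination of all positions" concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It is an illustration under stated assumptions, not how any particular model is implemented: the query, key, and value vectors are hand-picked toy numbers (a real transformer learns them through projection matrices), and masking, multiple heads, and positional encodings are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over a whole sequence.

    Returns (outputs, weights); weights[i][j] is how strongly
    position i attends to position j.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One score per position: every position is reached in one step.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out = [sum(w[j] * values[j][c] for j in range(len(values)))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

tokens = ["The", "lawyer", "who", "the", "journalist",
          "interviewed", "knew", "the", "senator"]

# Hypothetical toy vectors, chosen by hand so that the query for
# 'knew' lines up with the key for 'lawyer'.
keys = [[0.0, 1.0] for _ in tokens]
keys[1] = [4.0, 0.0]      # key for 'lawyer'
queries = [[0.0, 0.0] for _ in tokens]
queries[6] = [4.0, 0.0]   # query for 'knew'
values = keys             # reuse the keys as values for simplicity

outputs, weights = attention(queries, keys, values)
for tok, w in zip(tokens, weights[6]):
    print(f"knew -> {tok}: {w:.4f}")
```

With these toy vectors, nearly all of the weight for 'knew' lands on 'lawyer', and the four words of the relative clause in between play no role in carrying that connection.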
Question 3 True / False
Large language models are trained on next-token prediction — they learn to predict which word comes next — without being given explicit rules about grammar or meaning.
T. True
F. False
Answer: True
This is correct and is what makes LLMs remarkable. The training objective is purely statistical: given the preceding text, assign probabilities to all possible next tokens. No parse trees, no semantic rules, no explicit syntactic categories are provided. Yet from this objective alone, over sufficient data and parameters, LLMs develop representations that support grammatical sentences, stylistic register, factual knowledge, and cross-lingual translation. Whether this statistical learning captures the same kind of knowledge as human grammatical competence is the central open question.
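The objective described above can be sketched in a few lines of Python. This is a simplified illustration, not a training loop: the three-word vocabulary and the logit values are made up for the example, and a real model would compute its scores from the context using billions of parameters.

```python
import math

def next_token_probs(logits):
    """Softmax: convert raw scores into a probability for each candidate token."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def next_token_loss(logits, actual_next):
    """Cross-entropy loss: negative log-probability of the token that actually came next."""
    return -math.log(next_token_probs(logits)[actual_next])

# Hypothetical scores a model might assign after the context
# "The lawyer who the journalist interviewed ..."
logits = {"knew": 2.0, "know": 0.5, "the": -1.0}

probs = next_token_probs(logits)
print(probs)
print(next_token_loss(logits, "knew"))  # lower loss: the model favoured this token
print(next_token_loss(logits, "know"))  # higher loss: the model gave it less probability
```

Training adjusts the model so that this loss goes down on real text. Nothing in the loss mentions grammar; the only signal is which token actually followed.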
Question 4 True / False
LLMs' strong performance on language benchmarks demonstrates that human language acquisition does not require innate grammatical knowledge, definitively settling the debate over Universal Grammar.
T. True
F. False
Answer: False
The debate remains unresolved. LLMs acquire language behavior from vastly more input than any child receives (hundreds of billions of words or more, versus an estimated tens of millions of words heard in early childhood), so they cannot straightforwardly demonstrate that statistical learning is sufficient given normal human input. Critics also argue that LLMs fail on systematic structural tests in ways that suggest they lack genuine grammatical knowledge. LLMs are the best-performing systems on benchmarks, which is relevant evidence, but 'best performance' on current tests does not settle the deeper theoretical question about what kind of knowledge underlies human language acquisition.
Question 5 Short Answer
Why does the transformer's attention mechanism give it an advantage over step-by-step sequential processing for understanding language? Give an example of a sentence type where this advantage is particularly important.
Model answer: In sequential architectures, information propagates one step at a time, so connecting a verb to its subject across a long embedded clause requires the model to maintain that information through every intervening word, where it can fade or be overwritten. The attention mechanism allows any position to attend directly to any other position in a single step, regardless of distance. Example: in 'The lawyer who the journalist interviewed knew the senator,' the model must connect 'knew' to 'lawyer' as a subject-verb pair, skipping over the embedded relative clause 'who the journalist interviewed.' With attention, the model can directly weight 'lawyer' highly when processing 'knew'; with sequential processing, the relationship must survive being threaded through four intervening words.
Long-distance dependencies are a classic challenge for sequential architectures, and the difficulty of learning them is closely tied to the 'vanishing gradient' problem, in which the training signal shrinks as it is propagated back through many steps. Attention sidesteps this by making distance in the sequence irrelevant to the directness of the connection, which is one reason transformers outperform sequential models on language tasks that require integrating information across long spans.
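The fading effect can be seen in a deliberately simplified toy: a scalar linear recurrence h_t = weight * h_(t-1) + x_t standing in for a recurrent network. The recurrence, the weight of 0.8, and the 50-step sequence are assumptions chosen for illustration; real RNNs are nonlinear and LSTMs add gates specifically to reduce this decay. In this toy the influence of the first input on a later state, and likewise the gradient flowing back to it, is weight raised to the number of steps in between.

```python
def run_recurrence(inputs, weight=0.8):
    """Toy recurrence h_t = weight * h_(t-1) + x_t; returns the state after each step."""
    h = 0.0
    states = []
    for x in inputs:
        h = weight * h + x
        states.append(h)
    return states

# A single signal at the first position, followed by 49 empty steps.
impulse = [1.0] + [0.0] * 49
states = run_recurrence(impulse)

for step in (0, 4, 9, 49):
    print(f"trace of the first input after {step} more steps: {states[step]:.6f}")
```

The trace left by the first input shrinks geometrically with distance, whereas an attention layer connects position 50 back to position 1 in a single step however many words separate them.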