Questions: Language Models and Neural Language Modeling
5 questions to test your understanding
Question 1 Multiple Choice
You want to build a text generation system — a model that produces fluent, multi-sentence responses from a prompt. Which training paradigm is best suited, and why?
A. BERT-style masked language modeling: reading context from both directions makes it more powerful
B. GPT-style autoregressive modeling: it generates tokens left to right, making it naturally suited for text generation
C. Either approach works equally well: the training task doesn't affect generation capability
D. Neither: you need a separate sequence-to-sequence architecture, not a language model
Answer: B
Autoregressive models like GPT are a natural fit for generation because they are trained to predict the next token given all previous ones, which is exactly the operation that generation repeats at inference time. BERT-style models are trained to fill in masked tokens using bidirectional context, which makes them excellent at classification and understanding tasks but awkward for generation: they do not naturally produce sequences left to right. A common misconception is that bidirectionality makes BERT better at every task; in practice, the training objective largely determines what a model is good at.
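The generation loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real model: `TOY_NEXT_TOKEN`, `predict_next`, and `generate` are hypothetical names, the lookup table stands in for a trained network, and greedy selection is assumed in place of sampling.

```python
# Hypothetical toy lookup standing in for a trained model's prediction.
TOY_NEXT_TOKEN = {"the": "model", "model": "writes", "writes": "text", "text": "</s>"}


def predict_next(prefix):
    """Return the next token for a prefix.

    A real autoregressive model conditions on the whole prefix;
    this toy version only looks at the last token.
    """
    if not prefix:
        return "</s>"
    return TOY_NEXT_TOKEN.get(prefix[-1], "</s>")


def generate(prompt, max_new_tokens=10):
    """Extend the prompt one token at a time, left to right."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)  # uses only tokens generated so far
        if next_token == "</s>":           # stop at the end-of-sequence marker
            break
        tokens.append(next_token)          # the new token becomes part of the context
    return tokens


print(generate(["the"]))  # ['the', 'model', 'writes', 'text']
```

The point of the sketch is the loop structure: each prediction is appended to the context and fed back in, which is the same operation the model was trained on.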
Question 2 Multiple Choice
A research team trains a large transformer on billions of web pages using next-token prediction, then trains it for three more epochs on 10,000 labeled customer-service dialogues. What best describes this workflow?
A. Supervised learning followed by unsupervised learning
B. Self-supervised pre-training followed by fine-tuning on task-specific data
C. Self-supervised learning only: the labeled dialogues are unnecessary given the scale of pre-training
D. Zero-shot learning: the model was never explicitly trained on the target task
Answer: B
Next-token prediction on unlabeled text is self-supervised learning: the labels are generated from the text itself, with no human annotation. The subsequent training on labeled task data is fine-tuning. This pre-train-then-fine-tune paradigm is the dominant workflow in modern NLP: a single large pre-trained model can be adapted to many downstream tasks by fine-tuning on relatively small task-specific datasets, which is far more efficient than training from scratch for each task.
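To make "the labels are generated from the text itself" concrete, here is a small sketch of how next-token training pairs could be derived from raw tokens. The function name `next_token_pairs` and the whitespace tokenization are assumptions for illustration; real systems use subword tokenizers and batch the pairs differently.

```python
def next_token_pairs(tokens):
    """Build (context, target) pairs from a token list.

    Each target is simply the token that follows its context,
    so no human-written labels are needed.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Whitespace splitting is a stand-in for a real tokenizer.
tokens = "the cat sat".split()
for context, target in next_token_pairs(tokens):
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
```

Fine-tuning data differs in that the targets (for example, the desired customer-service reply) come from human-labeled examples rather than from the raw text alone.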
Question 3 True / False
Autoregressive language models like GPT process the full sentence bidirectionally when predicting each token, using future context to inform earlier predictions.
T. True
F. False
Answer: False
Autoregressive models generate text strictly left to right — each token is predicted using only the preceding tokens, never future ones. This is enforced during training via causal masking in the attention mechanism, which prevents any position from attending to later positions. Bidirectionality (using context from both directions) is the defining feature of masked language models like BERT, not autoregressive models.
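A causal mask can be pictured as a lower-triangular table of booleans. The sketch below builds one in plain Python; `causal_mask` is a hypothetical helper, and deep learning frameworks typically represent the same idea as a tensor applied to the attention scores.

```python
def causal_mask(n):
    """Return an n x n mask where mask[i][j] is True if position i may attend to position j.

    Position i may see itself and earlier positions (j <= i), never later ones.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


for row in causal_mask(3):
    print(row)
# [True, False, False]
# [True, True, False]
# [True, True, True]
```

A bidirectional model like BERT would correspond to a mask that is True everywhere, since every position may attend to every other position.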
Question 4 True / False
All of the capabilities large language models demonstrate — grammar, factual knowledge, reasoning patterns — emerge from the single training objective of predicting tokens in text.
T. True
F. False
Answer: True
This is one of the most surprising findings in modern NLP. During pre-training, LLMs receive only one signal: predict what comes next (or what was masked). Yet through exposure to vast amounts of human-generated text that encodes grammar, facts, reasoning, argumentation, and more, the models learn rich internal representations capturing all of these. There is no explicit reward for learning grammar or facts; the model picks them up because they are part of the statistical structure of text, and predicting tokens well requires modeling that structure.
Question 5 Short Answer
Why can language models trained only on next-token prediction learn to perform seemingly unrelated tasks like question answering, translation, or summarization?
Model answer: Because natural language text itself encodes a very broad range of human knowledge and reasoning. To predict the next token well across billions of examples of diverse text (news, books, conversations, code, scientific papers), a model must learn grammar, factual knowledge, reasoning patterns, and conversational conventions. Question-answering examples, translations, and summaries all appear in the training data; predicting tokens in those contexts forces the model to internalize the underlying task structure. The training objective thus acts as a proxy for general language understanding.
This is the key insight of the self-supervised learning paradigm: a sufficiently large model trained to compress the statistical structure of vast amounts of human-generated text implicitly learns the representations needed for a wide range of downstream tasks. Fine-tuning then steers those general representations toward a specific application.