The Word2Vec Skip-gram model learns word embeddings by:
A. Counting how often each pair of words co-occurs across the entire corpus, then factorizing the resulting matrix
B. Training a shallow neural network to predict surrounding context words given a center word
C. Assigning random dense vectors and iteratively adjusting them based on word frequency rankings
D. Encoding each word as a weighted sum of the vectors of its definition words
Answer: B
Skip-gram trains a shallow network on a prediction task: given a center word, predict the words in its surrounding context window. After training, the rows of the input-to-hidden weight matrix serve as the word vectors. Option A describes count-based methods such as GloVe (Global Vectors), which in effect factorizes a global co-occurrence matrix and so uses corpus-wide statistics rather than individual local context windows. Options C and D do not describe any standard embedding method.
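The prediction task described above starts from (center, context) training pairs. The sketch below shows one way such pairs could be generated; the function name `skipgram_pairs`, the example sentence, and the window size are assumptions made for illustration, not part of any particular Word2Vec implementation.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for a symmetric context window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical example sentence, chosen only for illustration.
sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:4])
```

Each pair becomes one training example: the network sees the center word as input and is asked to assign high probability to the context word.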
Question 2 Multiple Choice
A well-trained embedding model produces the result: vec('Paris') − vec('France') + vec('Germany') ≈ vec('Berlin'). This works because:
A. The model memorized that Paris and Berlin are both capital cities from explicit labels in the training data
B. Cities that frequently appear together in the same sentence end up geometrically close in the embedding space
C. The embedding space encodes the 'capital city of' relationship as a consistent geometric direction, so subtracting and adding that direction navigates the analogy
D. GloVe's co-occurrence matrix directly encodes country-capital pairs as high co-occurrence counts
Answer: C
The vector arithmetic works because training on distributional patterns tends to organize consistent semantic relationships as roughly consistent geometric offsets. The direction from 'France' to 'Paris' (the capital relationship) is approximately the same as the direction from 'Germany' to 'Berlin.' Subtracting 'France' from 'Paris' isolates this direction, and adding it to 'Germany' lands near 'Berlin.' This is not memorization or direct co-occurrence; it emerges from learning the contexts in which words appear.
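The offset arithmetic can be sketched with a handful of toy vectors. The 3-dimensional embeddings below are hand-picked assumptions for illustration, not output from a trained model, and the helper names `cosine` and `analogy` are hypothetical. Excluding the three query words from the candidates is a common convention in analogy evaluation.

```python
import math

# Hypothetical 3-d toy embeddings, hand-picked for illustration (not trained).
emb = {
    "France":  [0.9, 0.1, 0.0],
    "Paris":   [1.0, 0.1, 0.9],
    "Germany": [0.1, 0.9, 0.1],
    "Berlin":  [0.1, 1.0, 1.0],
    "Munich":  [0.1, 1.0, 0.2],
    "Lyon":    [1.0, 0.0, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    query = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(query, emb[w]))

result = analogy("Paris", "France", "Germany")
print(result)
```

With these particular toy vectors the query lands closest to 'Berlin', because the Paris − France offset and the Berlin − Germany offset point in nearly the same direction. Real embeddings show the same pattern only approximately.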
Question 3 True / False
In one-hot encoding, the vectors for 'cat' and 'kitten' are geometrically closer to each other than to 'airplane,' because cats and kittens are semantically related.
T. True
F. False
Answer: False
False. One-hot vectors are mutually orthogonal: every pair of distinct words has a dot product of exactly zero and the same Euclidean distance (√2). 'Cat' is exactly as far from 'kitten' as it is from 'airplane.' This is the fundamental failure of one-hot encoding: it encodes no semantic information whatsoever. Word embeddings were developed largely to fix this. Dense vectors learned from distributional patterns place semantically similar words close together in vector space.
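The orthogonality claim is easy to check directly. This is a minimal sketch over an assumed three-word vocabulary; the helper names `one_hot`, `dot`, and `dist` are illustrative.

```python
import math

# Assumed toy vocabulary; each word gets its own dimension.
vocab = ["cat", "kitten", "airplane"]

def one_hot(word):
    """One-hot vector: 1.0 at the word's index, 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

cat, kitten, airplane = (one_hot(w) for w in vocab)

print(dot(cat, kitten), dot(cat, airplane))    # both dot products are zero
print(dist(cat, kitten), dist(cat, airplane))  # both distances equal sqrt(2)
```

The same holds for any vocabulary size: two distinct one-hot vectors always differ in exactly two positions, so the distance never reflects meaning.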
Question 4 True / False
The distributional hypothesis — the theoretical foundation of word embeddings — holds that words appearing in similar contexts tend to have similar meanings.
T. True
F. False
Answer: True
True. This hypothesis, associated with linguists such as Harris and Firth ('You shall know a word by the company it keeps'), is the basis for learning meaningful word representations from raw text. If 'dog' and 'cat' both appear near words like 'pet,' 'feed,' and 'veterinarian,' their context vectors will be similar, and their learned embeddings will reflect this shared semantic territory. The hypothesis is not perfect (polysemous words like 'bank' appear in very different contexts), but it is powerful enough to produce embeddings that capture syntactic regularities, analogies, and semantic similarity.
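The idea of similar context vectors can be illustrated with raw co-occurrence counts. The corpus below is made up for this sketch, and treating every other content word in a sentence as context (with 'the' dropped as a stopword) is a simplifying assumption rather than how any specific embedding method defines context.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus, made up for illustration.
corpus = [
    "feed the dog", "feed the cat",
    "the dog visits the veterinarian", "the cat visits the veterinarian",
    "the dog barks", "the cat meows",
    "drive the car", "park the car",
]
STOPWORDS = {"the"}

# context[w] counts the other content words that share a sentence with w.
context = defaultdict(Counter)
for sentence in corpus:
    tokens = [t for t in sentence.split() if t not in STOPWORDS]
    for i, word in enumerate(tokens):
        for j, other in enumerate(tokens):
            if i != j:
                context[word][other] += 1

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

sim_dog_cat = cosine(context["dog"], context["cat"])
sim_dog_car = cosine(context["dog"], context["car"])
print(sim_dog_cat, sim_dog_car)
```

In this corpus 'dog' and 'cat' share three of their four context words, so their context vectors are much more similar than those of 'dog' and 'car', which share none.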
Question 5 Short Answer
Why does Word2Vec learn semantically meaningful word representations even though it is trained on the seemingly simple task of predicting context words, with no explicit semantic labels?
Think about your answer before reading the model answer below.
Model answer: Words with similar meanings naturally appear in similar linguistic contexts. To predict context words accurately, the model must learn to group together words that are interchangeable in context — which turns out to be a strong proxy for semantic similarity. The training signal forces the hidden layer to compress distributional patterns into dense vectors, and those patterns happen to encode meaning. Semantic content is latent in the statistics of how words co-occur, and the prediction task is the mechanism for extracting it.
This is the key insight: the task (context-word prediction) is not the goal, it is the scaffold. By solving the prediction task well, the model is implicitly forced to build internal representations that capture meaning, because meaning is what determines context. This is a general principle of representation learning: a model trained on a self-supervised prediction task often learns representations that encode the underlying structure of the data, even without explicit supervision on that structure.
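To make the scaffold idea concrete, here is a minimal sketch of skip-gram training with a full softmax, written in plain Python. Everything specific in it is an assumption for illustration: the toy corpus, the embedding size `D`, the learning rate `lr`, and the names `W_in` and `W_out`. Practical Word2Vec implementations typically use negative sampling or hierarchical softmax instead of the full softmax shown here.

```python
import math
import random

random.seed(0)

# Hypothetical toy corpus: "dog" and "cat" share contexts, "car" does not.
corpus = [
    "feed dog".split(), "feed cat".split(),
    "pet dog".split(), "pet cat".split(),
    "drive car".split(), "park car".split(),
]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# (center, context) index pairs; every other word in the sentence is context.
pairs = [(idx[c], idx[o]) for s in corpus
         for i, c in enumerate(s) for j, o in enumerate(s) if i != j]

V, D, lr = len(vocab), 8, 0.1  # vocabulary size, embedding size, learning rate
W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]   # word vectors
W_out = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]  # context vectors

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def epoch():
    """One SGD pass over all pairs; returns the mean cross-entropy loss."""
    loss = 0.0
    for c, o in pairs:
        h = W_in[c][:]  # the hidden layer is the center word's row
        scores = [sum(h[d] * W_out[k][d] for d in range(D)) for k in range(V)]
        probs = softmax(scores)
        loss -= math.log(probs[o])
        grad_h = [0.0] * D
        for k in range(V):
            err = probs[k] - (1.0 if k == o else 0.0)  # gradient of loss w.r.t. score k
            for d in range(D):
                grad_h[d] += err * W_out[k][d]  # uses the pre-update value
                W_out[k][d] -= lr * err * h[d]
        for d in range(D):
            W_in[c][d] -= lr * grad_h[d]
    return loss / len(pairs)

losses = [epoch() for _ in range(200)]
print(losses[0], losses[-1])
```

No label ever says that 'dog' and 'cat' are related. The only training signal is the prediction loss, yet because the two words must predict the same context words, the updates push their rows of `W_in` toward vectors that produce similar output scores. On a corpus this small the resulting geometry depends on the random seed, so the sketch reports only the loss.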