A zero-shot classifier is tested on images of a pangolin — a species never seen during training. How does the model classify it correctly without any pangolin training examples?
A. The model guesses among all known classes and picks the one with the highest training-time accuracy
B. The model projects the pangolin image into semantic space and finds it nearest to the 'pangolin' class embedding, which encodes the species' semantic properties
C. The model retrains on a few similar species and interpolates to the pangolin class
D. The model falls back to the nearest visually similar class from the training set
In zero-shot learning, both the input (the image) and every class label (including unseen ones) are embedded in a shared semantic space. During training, the model learned to project inputs into this space so that images of zebras land near the 'zebra' embedding. At test time, the pangolin image is projected into the same space, and the 'pangolin' class embedding — derived from word vectors or attribute descriptions — is already positioned there based on semantic relationships. The model finds the nearest class embedding and predicts 'pangolin.' No retraining or pangolin examples are needed.
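The nearest-embedding step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three attribute dimensions, the class vectors, and the `projected_image` values are made up here and stand in for the output of a learned projection, which the sketch does not include.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical attribute embeddings: [has_scales, has_stripes, has_hooves].
class_embeddings = {
    "zebra":    [0.0, 1.0, 1.0],  # seen during training
    "horse":    [0.0, 0.0, 1.0],  # seen during training
    "pangolin": [1.0, 0.0, 0.0],  # unseen: only its description is known
}

def predict(projected, embeddings):
    """Return the class whose embedding is most similar to the projected input."""
    return max(embeddings, key=lambda name: cosine(projected, embeddings[name]))

# Assumed output of the learned projection for a pangolin photo.
projected_image = [0.9, 0.1, 0.05]
prediction = predict(projected_image, class_embeddings)
print(prediction)
```

Nothing in `predict` depends on whether a class had training images; it only needs an embedding for each candidate class.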
Question 2 Multiple Choice
What is the fundamental difference between a conventional classifier and a zero-shot classifier in how they represent output classes?
A. Conventional classifiers use neural networks; zero-shot classifiers use rule-based systems
B. Conventional classifiers have fixed output slots, one per training class; zero-shot classifiers represent classes as points in a shared semantic space accessible at any time
C. Conventional classifiers require more training data; zero-shot classifiers use less data but are less accurate
D. Conventional classifiers can handle any class at test time; zero-shot classifiers only handle classes seen during training
A conventional classifier's output layer has a fixed number of neurons — one per training class. There is no mechanism for predicting a class not present during training. Zero-shot classifiers replace fixed output slots with a semantic space: any class that has a semantic embedding (word vector, attribute vector) can be queried at test time, regardless of whether examples of that class were in the training set. This architectural difference is what enables generalization to unseen classes.
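The contrast between the two output representations can be made concrete with a small sketch. Everything here is hypothetical: the logits, the attribute dimensions (striped body, hooved, long neck), and the vectors are invented for illustration, and the function names are not from any library.

```python
import math

def fixed_head_predict(logits, class_names):
    """Conventional head: one logit per training class; no other label is reachable."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return class_names[best]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_predict(projected, embeddings):
    """Zero-shot head: score against whichever class embeddings are supplied."""
    return max(embeddings, key=lambda name: cosine(projected, embeddings[name]))

# Conventional classifier: the label set is frozen at training time.
train_classes = ["zebra", "horse"]
logits = [0.2, 0.3]  # made-up logits for an okapi-like image
fixed_prediction = fixed_head_predict(logits, train_classes)

# Zero-shot classifier: hypothetical attributes [striped_body, hooved, long_neck].
embeddings = {"zebra": [1.0, 1.0, 0.0], "horse": [0.0, 1.0, 0.0]}
projected = [0.2, 0.9, 0.7]  # assumed projection of the same image
before = semantic_predict(projected, embeddings)

embeddings["okapi"] = [0.3, 1.0, 0.6]  # new class added at test time, no retraining
after = semantic_predict(projected, embeddings)
print(fixed_prediction, before, after)
```

The fixed head can only ever return one of its training labels, while the semantic head changes its answer as soon as an 'okapi' embedding is registered.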
Question 3 True / False
Zero-shot learning means the model receives zero training examples in total; it performs classification without any training at all.
T. True
F. False
Answer: False
This is the most common misconception about zero-shot learning. The 'zero shots' refers specifically to zero examples of the *unseen* classes — the model is heavily trained on *seen* classes and on the semantic embedding space. The model learns from seen-class examples how to project inputs into semantic space; zero-shot generalization is then possible because unseen classes already have semantic embeddings that position them meaningfully in that space. Zero-shot learning requires substantial training; what it avoids is training on the specific classes encountered at test time.
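To illustrate that training still happens, the sketch below fits a small linear projection on seen-class examples only and then classifies an unseen class. It is a toy under stated assumptions: the attribute vectors, the three-dimensional "visual features", the learning rate, and the epoch count are all invented, and plain stochastic gradient descent on squared error stands in for whatever objective a real model would use.

```python
import math

# Hypothetical attribute embeddings: [has_stripes, has_scales, has_legs].
attributes = {
    "zebra":    [1.0, 0.0, 1.0],  # seen
    "snake":    [0.0, 1.0, 0.0],  # seen
    "horse":    [0.0, 0.0, 1.0],  # seen
    "pangolin": [0.0, 1.0, 1.0],  # unseen: no training images below
}

# Toy visual features for seen-class images only.
train_set = [
    ([0.9, 0.1, 1.0], "zebra"), ([1.0, 0.0, 0.9], "zebra"),
    ([0.1, 1.0, 0.0], "snake"), ([0.0, 0.9, 0.1], "snake"),
    ([0.0, 0.1, 1.0], "horse"), ([0.1, 0.0, 0.9], "horse"),
]

def project(W, x):
    """Map a feature vector into attribute space with the linear map W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def train(examples, dim=3, lr=0.1, epochs=300):
    """Fit W so that project(W, x) approximates the attribute vector of x's class."""
    W = [[0.0] * dim for _ in range(dim)]
    for _ in range(epochs):
        for x, label in examples:
            target = attributes[label]
            pred = project(W, x)
            for i in range(dim):
                err = target[i] - pred[i]
                for j in range(dim):
                    W[i][j] += lr * err * x[j]
    return W

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def classify(W, x, candidates):
    z = project(W, x)
    return max(candidates, key=lambda c: cosine(z, attributes[c]))

W = train(train_set)
pangolin_features = [0.0, 0.9, 1.0]  # assumed features of a pangolin photo
prediction = classify(W, pangolin_features, list(attributes))
print(prediction)
```

The projection is learned entirely from zebras, snakes, and horses. The pangolin is recognized only because its attributes (scales and legs) each appeared in some seen class.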
Question 4 True / False
In generalized zero-shot learning, a model tends to be biased toward predicting seen classes, even for inputs that belong to unseen classes, because seen classes have richer learned representations.
T. True
F. False
Answer: True
This is precisely why generalized zero-shot learning is harder than standard zero-shot learning. When test examples can come from either seen or unseen classes, the model's projection function, which was optimized only on seen-class examples, produces richer, more confident representations for seen classes. Unseen class embeddings, derived purely from semantic descriptions without any training signal, may be less precisely positioned. The result is a strong bias toward predicting seen classes, even for inputs from unseen classes. Calibration techniques and transductive methods are commonly used to reduce this bias.
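One simple calibration idea, often called calibrated stacking, is to subtract a constant from every seen-class score before taking the argmax. The sketch below shows the mechanics only. The scores and the value of `gamma` are made up, and in practice `gamma` would be tuned on held-out data rather than fixed by hand.

```python
def calibrated_predict(scores, seen_classes, gamma):
    """Subtract gamma from each seen-class score, then return the top-scoring class."""
    adjusted = {
        name: (score - gamma if name in seen_classes else score)
        for name, score in scores.items()
    }
    return max(adjusted, key=adjusted.get)

seen = {"zebra", "horse"}

# Hypothetical similarity scores for an image that is actually a pangolin.
# The seen class 'horse' scores highest because of the seen-class bias.
scores = {"zebra": 0.40, "horse": 0.62, "pangolin": 0.55}

raw = calibrated_predict(scores, seen, gamma=0.0)         # no calibration
calibrated = calibrated_predict(scores, seen, gamma=0.1)  # penalize seen classes
print(raw, calibrated)
```

A larger `gamma` shifts predictions toward unseen classes, so the value trades seen-class accuracy against unseen-class accuracy.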
Question 5 Short Answer
Explain why a zero-shot classifier can correctly classify a new animal species it has never seen, even though no examples of that species were in the training data.
Model answer: Zero-shot classification works by projecting both inputs and class labels into a shared semantic space. During training, the model learns to map input features (e.g., image pixels) to positions in this space using seen-class examples: it is trained so that images of zebras project near the 'zebra' embedding. Unseen classes like 'okapi' already have semantic embeddings, derived from word vectors or attribute descriptions, that encode their properties; for example, the word 'okapi' sits near 'giraffe' and 'deer' in the embedding space. At test time, the image of the unseen species is projected into the same space, and the nearest class embedding is predicted. The model succeeds not by recognizing okapis specifically, but by leveraging the structure of semantic space: the visual features of an okapi project near the semantic region where okapi-like concepts live.
The key insight is that zero-shot learning transfers knowledge not through examples but through semantic structure. Word embeddings and attribute vectors capture meaningful relationships between concepts — those relationships were learned from language and human-defined descriptions, not from visual examples. By bridging the visual input space and the semantic class space, the model inherits structural knowledge encoded in language.