Self-supervised learning creates supervision signals from the input data itself. What distinguishes an effective SSL task from a trivial one?
Model answer: An effective SSL task is one whose solution requires learning representations that capture semantic, task-relevant structure rather than low-level artifacts. Predicting masked words in language requires understanding syntax and semantics (effective), while predicting the pixel-level mean of an image does not (trivial). The key is that the proxy task should demand invariances and abstractions that are useful for downstream tasks. In practice, the task should be hard enough that shallow shortcuts cannot solve it, yet solvable without labels, so that getting better at the task goes hand in hand with learning better representations.
SSL task design is critical. Good tasks are 'generically useful' — their solution requires understanding that transfers to many downstream applications. This is why masked prediction (language, vision) works well: it requires semantic understanding. Tasks that are too easy (e.g., recovering low frequencies) or too task-specific fail to produce general representations.
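To make the idea of "supervision from the input itself" concrete, here is a minimal sketch of how a masked-prediction training pair could be built from unlabeled text. The names `make_masked_example`, `MASK`, and `mask_prob` are hypothetical, chosen for illustration, and the masking rate is an arbitrary example value rather than a recommendation.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token used to hide inputs


def make_masked_example(tokens, mask_prob=0.15, seed=0):
    """Turn an unlabeled token sequence into a (corrupted input, targets) pair."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}  # position -> original token the model would have to predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted[i] = MASK
            targets[i] = tok
    # Ensure at least one masked position so every sequence yields a training signal.
    if not targets and tokens:
        i = rng.randrange(len(tokens))
        corrupted[i] = MASK
        targets[i] = tokens[i]
    return corrupted, targets


sentence = "the cat sat on the mat".split()
corrupted, targets = make_masked_example(sentence, mask_prob=0.3, seed=1)
print(corrupted, targets)
```

No human label appears anywhere: the targets are simply the tokens that were hidden, which is why the task only becomes informative when recovering them requires context.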
Question 2 Multiple Choice
Why does self-supervised learning enable efficient fine-tuning with few labels?
A. SSL has no advantage; fine-tuning with few labels is equally hard whether you pre-train or not
B. SSL pretraining learns general representations that capture structure in the data distribution; fine-tuning only needs to learn the task-specific classifier on top, not the underlying representations
C. SSL is better at memorizing data, making it easier to overfit with few labels
D. SSL reduces the feature space dimension, making optimization simpler
Answer: B
SSL pretraining learns representations that capture the structure of the input distribution (e.g., semantic relationships in language, visual patterns in images). When fine-tuning on a downstream task, the representation is already informative about the structure that matters. The fine-tuning stage only needs to learn a task-specific mapping on top of the learned representation, which requires far fewer labeled examples than learning from scratch. This is the data-efficiency benefit: pretraining exploits the vast amount of unlabeled data, and fine-tuning then needs only limited labeled data.
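The "task-specific mapping on top of the learned representation" can be sketched as a linear probe: the encoder stays frozen and only a small logistic-regression head is trained on a handful of labels. Everything here is an illustrative assumption, including `frozen_encoder` (a fixed toy feature map standing in for a pretrained model), the four labeled points, and the hyperparameters.

```python
import math


def frozen_encoder(x):
    """Stand-in for a pretrained SSL encoder; it has no trainable parameters here."""
    # Hypothetical fixed feature map: sum and difference of the two inputs.
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_head(features, labels, lr=0.5, epochs=200):
    """Fit only a logistic-regression head with plain SGD on the log loss."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b


def predict(w, b, f):
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


# Only four labeled examples: label is 1 when the first coordinate is larger.
labeled = [((2.0, 0.0), 1), ((3.0, 1.0), 1), ((0.0, 2.0), 0), ((1.0, 3.0), 0)]
features = [frozen_encoder(x) for x, _ in labeled]
labels = [y for _, y in labeled]
w, b = train_linear_head(features, labels)
```

Because the frozen features already expose the relevant structure (the difference coordinate), the head has very little left to learn, which is the intuition behind needing few labels.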
Question 3 Multiple Choice
Which information-theoretic principle explains why self-supervised learning produces useful representations?
A. Compression through the SSL task creates representations that discard noise, leaving only structure that is useful for other tasks
B. SSL maximizes mutual information with the input unconditionally, capturing all possible details
C. SSL has no information-theoretic justification; it is purely empirical
D. SSL minimizes entropy, leading to degenerate representations
Answer: A
Self-supervised learning, especially when viewed through the information bottleneck lens, compresses the input through the proxy task. Solving the task (e.g., predicting a masked token) requires learning a compressed representation that retains structure relevant to the task. Because the task is derived from the input's inherent structure (not arbitrary labels), the compression discards noise and augmentation-specific details, leaving generalizable structure. This compression is what enables good fine-tuning: the representation has learned what matters in the data.
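One common way to write the information bottleneck trade-off referred to above is sketched below. The symbols are assumptions introduced for illustration: X is the input, Z the learned representation, Y the proxy-task target (for example, the masked token), and β > 0 a weight that sets the trade-off.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

The first term penalizes keeping information about the input (compression), while the second rewards keeping whatever predicts the proxy target. SSL methods do not usually optimize this objective explicitly, so it is best read as an interpretive lens rather than a training recipe.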
Question 4 True / False
True or false: contrastive learning (SimCLR, MoCo) and masked prediction (BERT, MAE) are both forms of self-supervised learning, despite the key difference in their approach.
T. True
F. False
Answer: True
Contrastive methods learn by comparing pairs of examples, pulling similar pairs together and pushing dissimilar pairs apart in representation space. Masked prediction learns by reconstructing corrupted inputs. Despite this difference, both are SSL: they generate supervision from the input itself. Contrastive methods work well when there is a reliable way to define similar pairs (augmented views of the same image, text from the same sentence); masked prediction works well when the missing parts are predictable from context (language) or from smooth local structure (images). The choice depends on the domain and the properties of the data.
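The two training signals can be contrasted with a small sketch. The first function is an InfoNCE-style contrastive loss, where each anchor should match its own positive and the other positives in the batch act as negatives; the second is a mean squared reconstruction error over masked positions only. The function names, the temperature value, and the toy vectors are assumptions for illustration, not the exact losses used by any particular method.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss: anchors[i] should be closest to positives[i]."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # negative log-softmax of the true pair
    return total / len(a)


def masked_reconstruction_loss(original, predicted, masked_positions):
    """Mean squared error computed only on the positions that were hidden."""
    errors = [(original[i] - predicted[i]) ** 2 for i in masked_positions]
    return sum(errors) / len(errors)


# Two "views" of the same two examples; matching rows are the positive pairs.
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[0.9, 0.1], [0.1, 0.9]]
aligned = info_nce(views_a, views_b)
mismatched = info_nce(views_a, list(reversed(views_b)))

recon = masked_reconstruction_loss([1.0, 2.0, 3.0, 4.0], [1.0, 2.5, 3.0, 3.0], [1, 3])
```

The contrastive loss depends entirely on which examples are declared a pair, whereas the reconstruction loss depends on which parts of a single example are hidden; both derive their targets from the data alone.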